# Fetching Git data Git is the most popular source code management system. It is usually used to track versions of source code files. Transactions on a git repositories are called “commits”. Each commit is an atomic change to the files in the repository. For each commit, git maintains data for tracking what changed, and some metadata about who committed the change, when, which files were affected, etc. Perceval retrieves this information, and produces a JSON document (a dictionary when using it from Python) for each commit. ### Using Perceval as a program Let’s use Perceval to retrieve metadata from the git repository for the Perceval source code (do you appreciate the nice recursion here?). We will start by using the `perceval` command that we used to check the installation of Perceval in the previous section. ```bash (perceval) $ perceval git https://github.com/grimoirelab/perceval.git [2016-10-03 00:47:46,632] - Sir Perceval is on his quest. [2016-10-03 00:47:46,633] - Fetching commits: 'https://github.com/grimoirelab/perceval.git' git repository from 1970-01-01 00:00:00+00:00; all branches { "backend_name": "Git", "backend_version": "0.3.0", "data": { "Author": "Santiago Due\u00f1as ", "AuthorDate": "Tue Aug 18 18:08:27 2015 +0200", "Commit": "Santiago Due\u00f1as ", "CommitDate": "Tue Aug 18 18:08:27 2015 +0200", "commit": "dc78c254e464ff334892e0448a23e4cfbfc637a3", "files": [ { "action": "A", "added": "10", "file": ".gitignore", "indexes": [ "0000000...", "ceaedd5..." ], "modes": [ "000000", "100644" ], "removed": "0" }, { "action": "A", "added": "1", "file": "AUTHORS", "indexes": [ "0000000...", "a67f214..." ], "modes": [ "000000", "100644" ], "removed": "0" }, { "action": "A", "added": "674", "file": "LICENSE", "indexes": [ "0000000...", "94a9ed0..." ], "modes": [ "000000", "100644" ], "removed": "0" } ], "message": "Initial import", "parents": [], "refs": [] }, "origin": "https://github.com/grimoirelab/perceval.git", "perceval_version": "0.3.0", "timestamp": 1475448330.809561, "updated_on": 1439914107.0, "uuid": "29ddd736146e278feb5d84e9dcc1fd310ff50007" } ... [2016-10-03 00:47:47,861] - Fetch process completed: 356 commits fetched [2016-10-03 00:47:47,862] - Sir Perceval completed his quest. ``` Your output will vary depending on the exact version of Perceval you have, and when you run it. But you will get something similar to this start (with the first commit in Perceval), followed by many more commits, and the final messages. In addition, by redirecting `stdout` you can notice that JSON documents are actually written to `stdout`, while progress messages are written in `stderr`. This makes it easy to get a file with all commits (one JSON document per commit), or to pipe them to some other command. For example: ```bash (perceval) $ perceval git https://github.com/grimoirelab/perceval.git > /tmp/perceval.test [2016-10-03 00:53:59,235] - Sir Perceval is on his quest. [2016-10-03 00:53:59,236] - Fetching commits: 'https://github.com/grimoirelab/perceval.git' git repository from 1970-01-01 00:00:00+00:00; all branches [2016-10-03 00:54:00,349] - Fetch process completed: 356 commits fetched [2016-10-03 00:54:00,349] - Sir Perceval completed his quest. ``` This will produce the file `/tmp/perceval.test` with all the retrieved commits. To produce this result, Perceval cloned the git repository to analyze, and got information for all its commits by using the `git log` command under the hoods. Therefore, you need to have git installed, but if you’re are in the business of developing software, it would be weird if you didn’t have it. One interesting detail of this behavior is that Perceval is cloning the git repository once and again, to analyze it. You can tell Perceval where to store it, and reuse it the next time. You will probably notice the difference if you use the time command: ```bash (perceval) $ time perceval git https://github.com/grimoirelab/perceval.git \ --git-path /tmp/perceval.git > /tmp/perceval.test [2016-10-03 01:01:55,360] - Sir Perceval is on his quest. [2016-10-03 01:01:55,361] - Fetching commits: 'https://github.com/grimoirelab/perceval.git' git repository from 1970-01-01 00:00:00+00:00; all branches [2016-10-03 01:01:58,195] - Fetch process completed: 356 commits fetched [2016-10-03 01:01:58,195] - Sir Perceval completed his quest. real 0m2.991s user 0m0.544s sys 0m0.100s (perceval) $ time perceval git https://github.com/grimoirelab/perceval.git \ --git-path /tmp/perceval.git > /tmp/perceval.test [2016-10-03 01:02:00,319] - Sir Perceval is on his quest. [2016-10-03 01:02:00,321] - Fetching commits: 'https://github.com/grimoirelab/perceval.git' git repository from 1970-01-01 00:00:00+00:00; all branches [2016-10-03 01:02:01,323] - Fetch process completed: 356 commits fetched [2016-10-03 01:02:01,323] - Sir Perceval completed his quest. real 0m1.151s user 0m0.432s sys 0m0.032s ``` Of course, differences will be longer for larger repositories. ### Using Perceval as a Python module But we know that Perceval is a Python library. So, let’s use it as a Python library, from a Python script `perceval_git_1.py`: ```py #! /usr/bin/env python3 from perceval.backends.core.git import Git # url for the git repo to analyze repo_url = 'http://github.com/grimoirelab/perceval.git' # directory for letting Perceval clone the git repo repo_dir = '/tmp/perceval.git' # create a Git object, pointing to repo_url, using repo_dir for cloning repo = Git(uri=repo_url, gitpath=repo_dir) # fetch all commits as an iterator, and iterate it printing each hash for commit in repo.fetch(): print(commit['data']['commit']) ``` This code imports the `perceval.backends` module, and then produces an object of the `perceval.backends.git.Git` class. All classes of this kind include a method for fetching the items retrieved by the Perceval backend, as an iterator: `fetch()`. In the last two lines of the script, we iterate through that iterator, printing the hash for all commmits fetched. The output of the script looks like: ``` (perceval) $ python3 perceval_git_1.py ... 26bad088db3df0701f095c7cd45f89e2d9948a7a bfe38f2e61d2f9743ad5f648880c493085f485b8 18e639396a7fb9a01c4d374baa473fdf7f8b1e10 fdf511b0144cb7707cae1a6b8905e83004cf003b dd0aec7170367160766a1e155b37db5fa2ae61d9 cedc42d8d897d1bf64e999b91fb9ce34464440c9 d7bef8060648f96000a575b1c2af6bc63f9a0ad3 ``` ### Fetch commits from private Git repositories If you want to fetch commits from a private github repository, you must pass the credentials (`username` and `password`/`api-token`) directly in the URL. ``` $ perceval git https://:@github.com/chaoss/grimoirelab-perceval ``` ### An example: counting commits in a repository After the first example of using Perceval to get data from git repositories, we can write a slightly useful Python script: one that counts commits in a git repository. The complete script is described in the Tools and Tips chapter, and the code can be accessed as `perceval_git_counter.py`. Its skeleton is as follows: ```py import argparse from perceval.backends.core.git import Git parser = argparse.ArgumentParser(description = "Count commits in a git repo") parser.add_argument("repo", help = "Repository url") parser.add_argument("dir", help = "Directory for cloning the repository") parser.add_argument("--print", action='store_true', help = "Print hashes") args = parser.parse_args() repo = Git(uri=args.repo, gitpath=args.dir) count = 0 for commit in repo.fetch(): if args.print: print(commit['data']['commit']) count += 1 print("Number of commmits: %d." % count) ``` The script includes a simple parser that will read the repository url and the directory to clone it from the command line, and another one (optional) to print commit hashes. The last lines of the script are quite similar to the previous example: get all the commits from the generator provided by the Git class, count them, and print the total count. To run it, just provide those two command line arguments: repository url and directory to clone: ``` $ python3 perceval_git_counter.py https://github.com/grimoirelab/perceval.git /tmp/clonedir ```