Fetching Git data

Git is the most popular source code management system. It is usually used to track versions of source code files. Transactions on a git repository are called “commits”. Each commit is an atomic change to the files in the repository. For each commit, git maintains data for tracking what changed, plus some metadata about who committed the change, when, which files were affected, etc. Perceval retrieves this information, and produces a JSON document (a dictionary when using it from Python) for each commit.

Using Perceval as a program

Let’s use Perceval to retrieve metadata from the git repository for the Perceval source code (do you appreciate the nice recursion here?). We will start by using the perceval command that we used to check the installation of Perceval in the previous section.

(perceval) $ perceval git https://github.com/grimoirelab/perceval.git
[2016-10-03 00:47:46,632] - Sir Perceval is on his quest.
[2016-10-03 00:47:46,633] - Fetching commits: 'https://github.com/grimoirelab/perceval.git' git repository from 1970-01-01 00:00:00+00:00; all branches
{
    "backend_name": "Git",
    "backend_version": "0.3.0",
    "data": {
        "Author": "Santiago Due\u00f1as <sduenas@bitergia.com>",
        "AuthorDate": "Tue Aug 18 18:08:27 2015 +0200",
        "Commit": "Santiago Due\u00f1as <sduenas@bitergia.com>",
        "CommitDate": "Tue Aug 18 18:08:27 2015 +0200",
        "commit": "dc78c254e464ff334892e0448a23e4cfbfc637a3",
        "files": [
            {
                "action": "A",
                "added": "10",
                "file": ".gitignore",
                "indexes": [
                    "0000000...",
                    "ceaedd5..."
                ],
                "modes": [
                    "000000",
                    "100644"
                ],
                "removed": "0"
            },
            {
                "action": "A",
                "added": "1",
                "file": "AUTHORS",
                "indexes": [
                    "0000000...",
                    "a67f214..."
                ],
                "modes": [
                    "000000",
                    "100644"
                ],
                "removed": "0"
            },
            {
                "action": "A",
                "added": "674",
                "file": "LICENSE",
                "indexes": [
                    "0000000...",
                    "94a9ed0..."
                ],
                "modes": [
                    "000000",
                    "100644"
                ],
                "removed": "0"
            }
        ],
        "message": "Initial import",
        "parents": [],
        "refs": []
    },
    "origin": "https://github.com/grimoirelab/perceval.git",
    "perceval_version": "0.3.0",
    "timestamp": 1475448330.809561,
    "updated_on": 1439914107.0,
    "uuid": "29ddd736146e278feb5d84e9dcc1fd310ff50007"
}
...
[2016-10-03 00:47:47,861] - Fetch process completed: 356 commits fetched
[2016-10-03 00:47:47,862] - Sir Perceval completed his quest.

Your output will vary depending on the exact version of Perceval you have, and on when you run it. But you will get something similar: the first commit in Perceval, followed by many more commits, and the final messages. In addition, by redirecting stdout you can see that the JSON documents are written to stdout, while progress messages are written to stderr. This makes it easy to get a file with all commits (one JSON document per commit), or to pipe them to some other command. For example:

(perceval) $ perceval git https://github.com/grimoirelab/perceval.git > /tmp/perceval.test
[2016-10-03 00:53:59,235] - Sir Perceval is on his quest.
[2016-10-03 00:53:59,236] - Fetching commits: 'https://github.com/grimoirelab/perceval.git' git repository from 1970-01-01 00:00:00+00:00; all branches
[2016-10-03 00:54:00,349] - Fetch process completed: 356 commits fetched
[2016-10-03 00:54:00,349] - Sir Perceval completed his quest.

This will produce the file /tmp/perceval.test with all the retrieved commits.
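Since the file holds several JSON documents one after another, rather than a single JSON list, a plain json.load() on it will not work. The following is a minimal sketch of one way to read it back, assuming the file contains only the documents shown above, separated by whitespace. The function name iter_json_documents is ours, not part of Perceval.

```python
import json


def iter_json_documents(text):
    """Yield each JSON document found in a string of concatenated documents."""
    decoder = json.JSONDecoder()
    pos = 0
    length = len(text)
    while pos < length:
        # Skip whitespace (newlines, indentation) between documents
        while pos < length and text[pos].isspace():
            pos += 1
        if pos >= length:
            break
        # raw_decode parses one document and returns the index where it ended
        document, pos = decoder.raw_decode(text, pos)
        yield document


# Small inline sample standing in for the contents of the output file
sample = '{"data": {"commit": "aaa"}}\n{\n    "data": {"commit": "bbb"}\n}\n'
for item in iter_json_documents(sample):
    print(item["data"]["commit"])
```

To use it with the real file, read /tmp/perceval.test into a string and pass it to the function. For very large repositories it may be preferable to process the documents incrementally, which this sketch does not attempt.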

To produce this result, Perceval cloned the git repository to analyze, and got information for all its commits by using the git log command under the hood. Therefore, you need to have git installed, but if you’re in the business of developing software, it would be weird if you didn’t have it.

One interesting detail of this behavior is that Perceval clones the git repository again on every run in order to analyze it. With the --git-path option you can tell Perceval where to store the clone, so that it is reused the next time. You will probably notice the difference if you use the time command:

(perceval) $ time perceval git https://github.com/grimoirelab/perceval.git \
  --git-path /tmp/perceval.git > /tmp/perceval.test
[2016-10-03 01:01:55,360] - Sir Perceval is on his quest.
[2016-10-03 01:01:55,361] - Fetching commits: 'https://github.com/grimoirelab/perceval.git' git repository from 1970-01-01 00:00:00+00:00; all branches
[2016-10-03 01:01:58,195] - Fetch process completed: 356 commits fetched
[2016-10-03 01:01:58,195] - Sir Perceval completed his quest.

real    0m2.991s
user    0m0.544s
sys    0m0.100s

(perceval) $ time perceval git https://github.com/grimoirelab/perceval.git \
  --git-path /tmp/perceval.git > /tmp/perceval.test
[2016-10-03 01:02:00,319] - Sir Perceval is on his quest.
[2016-10-03 01:02:00,321] - Fetching commits: 'https://github.com/grimoirelab/perceval.git' git repository from 1970-01-01 00:00:00+00:00; all branches
[2016-10-03 01:02:01,323] - Fetch process completed: 356 commits fetched
[2016-10-03 01:02:01,323] - Sir Perceval completed his quest.

real    0m1.151s
user    0m0.432s
sys    0m0.032s

Of course, the difference will be larger for larger repositories.

Using Perceval as a Python module

But we know that Perceval is a Python library. So, let’s use it as a Python library, from a Python script perceval_git_1.py:

#! /usr/bin/env python3

from perceval.backends.core.git import Git

# url for the git repo to analyze
repo_url = 'https://github.com/grimoirelab/perceval.git'
# directory for letting Perceval clone the git repo
repo_dir = '/tmp/perceval.git'

# create a Git object, pointing to repo_url, using repo_dir for cloning
repo = Git(uri=repo_url, gitpath=repo_dir)
# fetch all commits as an iterator, and iterate it printing each hash
for commit in repo.fetch():
    print(commit['data']['commit'])

This code imports the Git class from the perceval.backends.core.git module, and then creates an object of that class. All classes of this kind include a method for fetching the items retrieved by the Perceval backend, as an iterator: fetch(). In the last two lines of the script, we go through that iterator, printing the hash of each commit fetched. The output of the script looks like:

(perceval) $ python3 perceval_git_1.py 
...
26bad088db3df0701f095c7cd45f89e2d9948a7a
bfe38f2e61d2f9743ad5f648880c493085f485b8
18e639396a7fb9a01c4d374baa473fdf7f8b1e10
fdf511b0144cb7707cae1a6b8905e83004cf003b
dd0aec7170367160766a1e155b37db5fa2ae61d9
cedc42d8d897d1bf64e999b91fb9ce34464440c9
d7bef8060648f96000a575b1c2af6bc63f9a0ad3

Fetch commits from private Git repositories

If you want to fetch commits from a private GitHub repository, you must pass the credentials (username and password or API token) directly in the URL:

$ perceval git https://<username>:<api-token>@github.com/chaoss/grimoirelab-perceval
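When doing the same from Python, the URL with credentials can be built before handing it to the Git class. The helper below is only a sketch: add_credentials is a hypothetical name of ours, and the percent-encoding step is our assumption about what is needed when a username or token contains characters such as "@" or ":" that would otherwise break the URL.

```python
from urllib.parse import quote, urlsplit, urlunsplit


def add_credentials(repo_url, username, token):
    """Return repo_url with username and token embedded before the host."""
    parts = urlsplit(repo_url)
    # Drop any credentials already present, keeping only host (and port)
    host = parts.netloc.rpartition("@")[2]
    # Percent-encode both values so reserved characters stay unambiguous
    userinfo = quote(username, safe="") + ":" + quote(token, safe="")
    return urlunsplit(
        (parts.scheme, userinfo + "@" + host, parts.path, parts.query, parts.fragment)
    )


# Placeholder values, not real credentials
url = add_credentials(
    "https://github.com/chaoss/grimoirelab-perceval", "someuser", "sometoken"
)
print(url)
```

The resulting string can then be used as the uri argument when creating the Git object.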

An example: counting commits in a repository

After the first example of using Perceval to get data from git repositories, we can write a slightly more useful Python script: one that counts commits in a git repository. The complete script is described in the Tools and Tips chapter, and the code can be accessed as perceval_git_counter.py. Its skeleton is as follows:

import argparse
from perceval.backends.core.git import Git

parser = argparse.ArgumentParser(description="Count commits in a git repo")
parser.add_argument("repo", help="Repository url")
parser.add_argument("dir", help="Directory for cloning the repository")
parser.add_argument("--print", action='store_true', help="Print hashes")
args = parser.parse_args()

repo = Git(uri=args.repo, gitpath=args.dir)
count = 0
for commit in repo.fetch():
    if args.print:
        print(commit['data']['commit'])
    count += 1
print("Number of commits: %d." % count)

The script includes a simple parser that reads two arguments from the command line, the repository URL and the directory to clone it into, plus an optional flag (--print) to print commit hashes. The last lines of the script are quite similar to the previous example: get all the commits from the generator provided by the Git class, count them, and print the total count.

To run it, just provide those two command line arguments, the repository URL and the directory to clone into:

$ python3 perceval_git_counter.py https://github.com/grimoirelab/perceval.git /tmp/clonedir
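The same loop can be adapted to count other things. As an illustrative sketch, the function below counts commits per author, using the "Author" field seen in the sample output. The name commits_per_author is ours, and the items in the example are made up so that the code runs without cloning anything.

```python
from collections import Counter


def commits_per_author(items):
    """Count items per value of the "Author" field in their data."""
    counts = Counter()
    for item in items:
        counts[item["data"]["Author"]] += 1
    return counts


# Made-up items with the same shape as those produced by fetch()
fake_items = [
    {"data": {"Author": "Alice <alice@example.com>"}},
    {"data": {"Author": "Bob <bob@example.com>"}},
    {"data": {"Author": "Alice <alice@example.com>"}},
]

for author, count in commits_per_author(fake_items).most_common():
    print("%s: %d" % (author, count))
```

With a real repository, pass repo.fetch() instead of fake_items. Because the function only iterates over its argument once, it works directly on the generator.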