Fetching data from Mail Archives

Many software development projects use mailing lists as a mean for coordination. Mailing lists can be archived in many different ways, but maybe the most classical is using the mbox format. This format is simple: messages are stored in a file with the beginning of each one indicated by a line starting with the string “From ”. Perceval has a backend for supporting mbox archives, with the imaginative name mbox. Unfortunately, there are several variations of the basic format: Perceval does its best for parsing all those variations.

Parsing mbox archives

As in other cases, we can start by asking Perceval for some help:

(perceval) $ perceval mbox --help

From the banner it produces, we learn that the most simple usage is specifying the uri for the mailing list to analyze, and a directory with its archives. The uri is used for annotation purposes, and can really be any string (although it should usually be a link to the mailing list webpage). The directory needs to be filled with files, each of them in mbox format. So, let’s start by getting one archive:

(perceval) $ mkdir archives
(perceval) $ wget -P archives http://mail-archives.apache.org/mod_mbox/httpd-announce/201607.mbox

These two lines (assuming we already have wget installed), will retrieve the archive corresponding to July 2016 of the mailing list httpd-announce, of the Apache project. The option -P archives to wget will ensure that the file is stored in the archives directory, which we created in the previous line.

Once we have the archive, we can analyze it:

(perceval) $ perceval mbox httpd-announce archives > perceval.log
[2016-11-23 02:12:02,476] - Sir Perceval is on his quest.
[2016-11-23 02:12:02,477] - Looking for messages from 'httpd-announce' on 'archives' since 1970-01-01 00:00:00+00:00
[2016-11-23 02:12:02,488] - Done. 4/4 messages fetched; 0 ignored
[2016-11-23 02:12:02,488] - Fetch process completed
[2016-11-23 02:12:02,488] - Sir Perceval completed his quest.

The above message show how the archives directory was parsed looking for mbox files, how 4 messages were found, of which none was ignored. Since the output was redirected to perceval.log, now we have the JSON documents produced by Perceval in that file:

{
    "backend_name": "MBox",
    "backend_version": "0.6.0",
    "category": "message",
    "data": {
        "Authentication-Results": "spamd4-us-west.apache.org (amavisd-new);\n\tdkim=pass (2048-bit key) header.d=comcast.net",
        "Content-Transfer-Encoding": "7bit",
        "Content-Type": "text/plain; charset=us-ascii",
        "DKIM-Signature": "v=1; a=rsa-sha256; c=relaxed/relaxed; d=comcast.net;\n\ts=q20140121; t=1467724082;\n\tbh=+4noOLzzrCDUMpdmYJUqt/JMcTXlHPAr2vhKyFryBUY=;\n\th=Received:Received:From:Content-Type:Subject:Message-Id:Date:To:\n\t Mime-Version;\n\tb=jlfQ9jFzyv9EP/ioD4B3TgJF7U3S60MygklSXCmpSftTp78gxYY502XgMsV5WAYaK\n\t t9a2O7Hssmbfi5U+rZ8R0hhtFqDyfsbE6xxUvfHvSyHAjJ7XISwxQnvEJ/EhLeN3G7\n\t Ht/mIz9uim8atrnxSaZDyO09t5JoM70aPFBmbTSE9+3bWJDi8M/Apvsj/q+Zu1jHJ1\n\t buxk9iitgmFegKUfSktydc6tFE4y8yObF41n4EAHC2uuURPbtXwWHWRH/nap4sK/aI\n\t FwIMTEbbNyEC0/wEqy0dktUYX2pnakh8DdH+TX34ozKKr9exGAFYwgoGQEvnPAhRJi\n\t FdxJf5QfRfMeg==",
        "Date": "Tue, 5 Jul 2016 09:08:01 -0400",
        "Delivered-To": "moderator for announce@httpd.apache.org",
        "From": "Jim Jagielski <jim@apache.org>",
        "List-Id": "<announce.httpd.apache.org>",
        "List-Post": "<mailto:announce@httpd.apache.org>",
        "Mailing-List": "contact announce-help@httpd.apache.org; run by ezmlm",
        ...
        "body": {
            "plain": "\n          Apache HTTP Server 2.4.23 Released\n\nThe Apache Software Foundation and the Apache HTTP Server Project\nare pleased to announce the release of version 2.4.23 of the Apache\nHTTP Server (\"Apache\"). 
            ...
...

We can see the usual structure of a Perceval JSON document, with some metainformation (such as backend_name), and all the content the corresponding message in the data field. The structure of that content is one field per header, with the same name the header has in the message. For the body of the message, the field body is used.

If we have several mbox files in the directory, all of them will be analyzed at once. For example, we can add a new archive to the archives directory above, and run Perceval again:

(perceval) $ wget -P archives http://mail-archives.apache.org/mod_mbox/httpd-announce/201608.mbox
(perceval) $ perceval mbox httpd-announce archives > perceval.log
[2016-11-23 11:12:37,795] - Sir Perceval is on his quest.
[2016-11-23 11:12:37,797] - Looking for messages from 'httpd-announce' on 'archives' since 1970-01-01 00:00:00+00:00
[2016-11-23 11:12:37,814] - Done. 5/5 messages fetched; 0 ignored
[2016-11-23 11:12:37,814] - Fetch process completed
[2016-11-23 11:12:37,814] - Sir Perceval completed his quest.

Now, 5 messages were analyzed, since the new archive (for August 2016) contains just one, and we already had 4 in the first archive we downloaded (for July 2016).

In this case, we can also see a small difference on the body of the messages. For the last one we obtain in perceval.log, we can see how the body field is a dictionary with a field named html. That’s because the content is labeled in the original message as being in HTML format. Compare this to the first example above, where the body field contains a field named plain, because the content is in plan (unformatted) format.

"body": {
            "html": "<head >\n<STYLE>\n .headerTop { background-color:#FFCC66;

Analyzing messages with Python

As usual, we can use Perceval as a Python module for analyzing messages in mbox files. Using the same two archives we downloaded above, in the archives directory, we can for example show the subject for all messages (code below is in perceval_mbox_1.py]:

#! /usr/bin/env python3

from perceval.backends.core.mbox import MBox

# uri (label) for the mailing list to analyze
mbox_uri = 'http://mail-archives.apache.org/mod_mbox/httpd-announce/'
# directory for letting Perceval where mbox archives are
# you need to have the archives to analyzed there before running the script
mbox_dir = 'archives'

# create a mbox object, using mbox_uri as label, mbox_dir as directory to scan
repo = MBox(uri=mbox_uri, dirpath=mbox_dir)
# fetch all messages as an iteratoir, and iterate it printing each subject
for message in repo.fetch():
    print(message['data']['Subject'])

To run the script, just move to the parent of the archives directory, that has our mbox archives, and run:

(perceval) $ python3 perceval_mbox_1.py 
[ANNOUNCE] Apache HTTP Server 2.4.23 Released
CVE-2016-4979: HTTPD webserver - X509 Client certificate ba
PC Prfoessional per Scuole e Enti Pubblici da 90 Euro
Web Designing Services at Lowest Prices!!
Vai in vacanza con l'iPhone e le Beats

Which shows us how some spam got into the Apache httpd-announce mailing list, by the way.