perceval package

Submodules

perceval.archive module

class perceval.archive.Archive(archive_path)[source]

Bases: object

Basic class for archiving raw items fetched by Perceval.

This class archives raw items, usually HTML pages or JSON documents, for later recovery. A backend fetches, stores and retrieves these raw items.

Each stored item will have a hash code used as unique identifier. Hash codes are generated using URIs and other parameters needed to fetch raw items.

When an instance of Archive is initialized, it expects to access an existing archive file. To create a new, empty archive, use the create class method instead. After creating a new archive, initialize its metadata by calling the init_metadata method.

Parameters: archive_path – path where this archive is stored
Raises: ArchiveError – when the archive does not exist or is invalid
ARCHIVE_CREATE_STMT = 'CREATE TABLE archive ( id INTEGER PRIMARY KEY AUTOINCREMENT, hashcode VARCHAR(256) UNIQUE NOT NULL, uri TEXT, payload BLOB, headers BLOB, data BLOB)'
ARCHIVE_TABLE = 'archive'
METADATA_CREATE_STMT = 'CREATE TABLE metadata ( origin TEXT, backend_name TEXT, backend_version TEXT, category TEXT, backend_params BLOB, created_on TEXT)'
METADATA_TABLE = 'metadata'
classmethod create(archive_path)[source]

Create a brand new archive.

Call this method to create a new and empty archive. It will initialize the storage file in the path defined by archive_path.
Parameters: archive_path – absolute path where the archive file will be created
Raises: ArchiveError – when the archive file already exists
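
As a rough illustration of what creating an archive involves, the sketch below builds an empty SQLite file from the two CREATE statements listed above. The helper names create_archive_file and list_tables, and the use of RuntimeError in place of ArchiveError, are assumptions made for this example; it is not Perceval's implementation.

```python
import os
import sqlite3
import tempfile

# Statements copied from the class attributes documented above
ARCHIVE_CREATE_STMT = (
    "CREATE TABLE archive ( "
    "id INTEGER PRIMARY KEY AUTOINCREMENT, "
    "hashcode VARCHAR(256) UNIQUE NOT NULL, "
    "uri TEXT, payload BLOB, headers BLOB, data BLOB)"
)
METADATA_CREATE_STMT = (
    "CREATE TABLE metadata ( "
    "origin TEXT, backend_name TEXT, backend_version TEXT, "
    "category TEXT, backend_params BLOB, created_on TEXT)"
)


def create_archive_file(archive_path):
    """Hypothetical helper: create an empty archive, refusing to overwrite."""
    if os.path.exists(archive_path):
        # RuntimeError is a stand-in for ArchiveError in this sketch
        raise RuntimeError("archive %s already exists" % archive_path)
    conn = sqlite3.connect(archive_path)
    try:
        conn.execute(ARCHIVE_CREATE_STMT)
        conn.execute(METADATA_CREATE_STMT)
        conn.commit()
    finally:
        conn.close()


def list_tables(archive_path):
    """Hypothetical helper: return the table names found in the file."""
    conn = sqlite3.connect(archive_path)
    try:
        rows = conn.execute(
            "SELECT name FROM sqlite_master WHERE type='table'"
        ).fetchall()
    finally:
        conn.close()
    return [row[0] for row in rows]


demo_path = os.path.join(tempfile.mkdtemp(), 'demo.sqlite3')
create_archive_file(demo_path)
tables = list_tables(demo_path)
```

Calling create_archive_file a second time with the same path fails, which mirrors the documented behaviour of raising an error when the archive file already exists.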
init_metadata(origin, backend_name, backend_version, category, backend_params)[source]

Init metadata information.

Metadata is composed of the basic information needed to identify where archived data came from and how it can be retrieved and built into Perceval items.

Parameters:
  • origin – identifier of the repository
  • backend_name – name of the backend
  • backend_version – version of the backend
  • category – category of the items fetched
  • backend_params – dict representation of the fetch parameters
Raises:

ArchiveError – when an error occurs initializing the metadata

static make_hashcode(uri, payload, headers)[source]

Generate a SHA1 based on the given arguments.

Hashcodes created by this method will be used as unique identifiers for the raw items or resources stored by this archive.

Parameters:
  • uri – URI to the resource
  • payload – payload of the request needed to fetch the resource
  • headers – headers of the request needed to fetch the resource
Returns:

a SHA1 hash code
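
One plausible way to derive such a hash code is sketched here. The function name make_hashcode_sketch, the JSON serialization with sorted keys and the ':' separator are assumptions chosen for illustration; the documentation above only states that a SHA1 is generated from the three arguments.

```python
import hashlib
import json


def make_hashcode_sketch(uri, payload, headers):
    """Hypothetical stand-in: SHA1 over the URI, payload and headers."""
    parts = [
        uri,
        # Sorting keys makes the digest independent of dict insertion order
        json.dumps(payload, sort_keys=True),
        json.dumps(headers, sort_keys=True),
    ]
    content = ':'.join(parts)
    return hashlib.sha1(content.encode('utf-8')).hexdigest()


h1 = make_hashcode_sketch('https://example.com/api',
                          {'page': 1, 'q': 'x'}, {'Accept': 'json'})
h2 = make_hashcode_sketch('https://example.com/api',
                          {'q': 'x', 'page': 1}, {'Accept': 'json'})
h3 = make_hashcode_sketch('https://example.com/other',
                          {'page': 1, 'q': 'x'}, {'Accept': 'json'})
```

In this sketch the same request always maps to the same 40-character hexadecimal identifier, while a different URI yields a different one.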

retrieve(uri, payload, headers)[source]

Retrieve a raw item from the archive.

The method will return the data content corresponding to the hashcode derived from the given parameters.

Parameters:
  • uri – request URI
  • payload – request payload
  • headers – request headers
Returns:

the archived data

Raises:

ArchiveError – when an error occurs retrieving data

store(uri, payload, headers, data)[source]

Store a raw item in this archive.

The method will store data content in this archive. The unique identifier for that item will be generated using the rest of the parameters.

Parameters:
  • uri – request URI
  • payload – request payload
  • headers – request headers
  • data – data to store in this archive
Raises:

ArchiveError – when an error occurs storing the given data
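
The store and retrieve pair can be pictured as a keyed round trip through the archive table. The sketch below is a simplified model under stated assumptions: values are serialized with pickle, the hash code is computed by a hypothetical hashcode helper, and LookupError stands in for ArchiveError. None of these details are confirmed by the documentation above.

```python
import hashlib
import json
import pickle
import sqlite3


def hashcode(uri, payload, headers):
    """Hypothetical identifier: SHA1 of the request parameters."""
    content = ':'.join([uri,
                        json.dumps(payload, sort_keys=True),
                        json.dumps(headers, sort_keys=True)])
    return hashlib.sha1(content.encode('utf-8')).hexdigest()


def store(conn, uri, payload, headers, data):
    """Insert one raw item; the UNIQUE constraint rejects duplicates."""
    conn.execute(
        "INSERT INTO archive (hashcode, uri, payload, headers, data) "
        "VALUES (?, ?, ?, ?, ?)",
        (hashcode(uri, payload, headers), uri,
         pickle.dumps(payload), pickle.dumps(headers), pickle.dumps(data)),
    )
    conn.commit()


def retrieve(conn, uri, payload, headers):
    """Look the item up by the hash code derived from the same parameters."""
    row = conn.execute(
        "SELECT data FROM archive WHERE hashcode = ?",
        (hashcode(uri, payload, headers),),
    ).fetchone()
    if row is None:
        # Stand-in for ArchiveError in this sketch
        raise LookupError("entry not found in archive")
    return pickle.loads(row[0])


conn = sqlite3.connect(':memory:')
conn.execute(
    "CREATE TABLE archive ( id INTEGER PRIMARY KEY AUTOINCREMENT, "
    "hashcode VARCHAR(256) UNIQUE NOT NULL, uri TEXT, payload BLOB, "
    "headers BLOB, data BLOB)"
)
store(conn, 'https://example.com/items', {'page': 1}, {}, '{"id": 7}')
item = retrieve(conn, 'https://example.com/items', {'page': 1}, {})
```

Because the identifier depends only on the request parameters, replaying the same request against the archive returns the data captured during the original fetch.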

class perceval.archive.ArchiveManager(dirpath)[source]

Bases: object

Manager for handling archives in Perceval.

This class manages the creation, deletion and access of Archive objects. Archives are stored under the dirpath directory, using a random SHA1 as the name of each file. The first byte of the hashcode will be the name of the subdirectory; the remaining bytes, the archive name.

Parameters: dirpath – path where the archives are stored
STORAGE_EXT = '.sqlite3'
create_archive()[source]

Create a new archive.

The method creates a brand new archive in the filesystem, with a random SHA1 as its name. The first byte of the hashcode will be the name of the subdirectory; the remaining bytes, the archive name.

Returns: a new Archive object
Raises: ArchiveManagerError – when an error occurs creating the new archive
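
The directory layout described above can be sketched as follows. This example assumes that "the first byte" means the first two hexadecimal characters of the hash code; the helper names archive_path_for and random_hashcode are hypothetical.

```python
import hashlib
import os

STORAGE_EXT = '.sqlite3'  # value documented for ArchiveManager.STORAGE_EXT


def random_hashcode():
    """Hypothetical helper: SHA1 hex digest of 20 random bytes."""
    return hashlib.sha1(os.urandom(20)).hexdigest()


def archive_path_for(dirpath, hashcode):
    """Split the hash code into a subdirectory and a file name."""
    subdir = hashcode[:2]                    # first byte, as two hex digits
    filename = hashcode[2:] + STORAGE_EXT    # remaining bytes plus extension
    return os.path.join(dirpath, subdir, filename)


example = archive_path_for('archives', 'ab' + 'c' * 38)
generated = random_hashcode()
```

Spreading archives over subdirectories this way keeps any single directory from accumulating every archive file.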
remove_archive(archive_path)[source]

Remove an archive.

This method deletes from the filesystem the archive stored in archive_path.

Parameters: archive_path – path to the archive
Raises: ArchiveManagerError – when an error occurs removing the archive
search(origin, backend_name, category, archived_after)[source]

Search archives.

Get the archives that store data matching the given parameters. These parameters define where the data came from (origin), how it was fetched (backend_name) and its type (category). Only archives created on or after archived_after will be returned.

The method returns a list with the file paths to those archives. The list is sorted by the date of creation of each archive.

Parameters:
  • origin – data origin
  • backend_name – backend used to fetch data
  • category – type of the items fetched by the backend
  • archived_after – get archives created on or after this date
Returns:

a list with the file paths of the archives that match the search criteria

perceval.backend module

class perceval.backend.Backend(origin, tag=None, archive=None)[source]

Bases: object

Abstract class for backends.

Base class to fetch data from a repository. This repository will be named as ‘origin’. During the initialization, an Archive object can be provided for archiving raw data from the repositories.

Derived classes must implement the fetch_items, has_archiving and has_resuming methods; otherwise, a NotImplementedError exception will be raised. The metadata decorator can be used together with fetch methods, but it requires the implementation of the metadata_id, metadata_updated_on and metadata_category static methods.

The fetched items can be tagged using the tag parameter, which is useful for tracing data. When it is set to None or to an empty string, the tag will be the same as the origin attribute.

To track which version of the backend was used during the fetching process, this class provides a version attribute that each backend may override.

Parameters:
  • origin – identifier of the repository
  • tag – tag items using this label
  • archive – archive to store/retrieve data
Raises:

ValueError – raised when archive is not an instance of Archive class

CATEGORIES = []
archive
categories
fetch(category, **kwargs)[source]

Fetch items from the repository.

The method retrieves items from a repository.

Parameters:
  • category – the category of the items fetched
  • kwargs – a list of other parameters (e.g., from_date, offset, etc.) specific to each backend

Returns: a generator of items
fetch_from_archive()[source]

Fetch the items from an archive.

It returns the items stored within an archive. If this method is called but no archive was provided, the method will raise an ArchiveError exception.

Returns: a generator of items
Raises: ArchiveError – raised when an error occurs accessing an archive
fetch_items(category, **kwargs)[source]
classmethod has_archiving()[source]
classmethod has_resuming()[source]
metadata(item)[source]

Add metadata to an item.

It adds metadata to a given item such as how and when it was fetched. The contents from the original item will be stored under the ‘data’ keyword.

Parameters: item – an item fetched by a backend
static metadata_category(item)[source]
static metadata_id(item)[source]
static metadata_updated_on(item)[source]
origin
version = '0.7.0'
class perceval.backend.BackendCommand(*args)[source]

Bases: object

Abstract class to run backends from the command line.

When the class is initialized, it parses the given arguments using the argument parser defined by the setup_cmd_parser method. Those arguments will be stored in the attribute parsed_args.

The arguments will be used to initialize and run the Backend object assigned to this command. The backend used to run the command is stored under the BACKEND class attribute. Any class derived from this one must set its own Backend class.

Moreover, the method setup_cmd_parser must be implemented to execute the backend.

BACKEND = None
run()[source]

Fetch and write items.

This method runs the backend to fetch the items from the given origin. Items are converted to JSON objects and written to the defined output.

If the fetch-archive parameter was given as an argument during the initialization of the instance, the items will be retrieved using the archive manager.

static setup_cmd_parser()[source]
class perceval.backend.BackendCommandArgumentParser(from_date=False, to_date=False, offset=False, basic_auth=False, token_auth=False, archive=False, aliases=None)[source]

Bases: object

Manage and parse backend command arguments.

This class defines and parses a set of arguments common to backend commands. Some parameters, like archive or the different types of authentication, can be set during the initialization of the instance.

Parameters:
  • from_date – set from_date argument
  • to_date – set to_date argument
  • offset – set offset argument
  • basic_auth – set basic authentication arguments
  • token_auth – set token/key authentication arguments
  • archive – set archiving arguments
  • aliases – define aliases for parsed arguments
Raises:

AttributeError – when both from_date and offset are set to True

parse(*args)[source]

Parse a list of arguments.

Parse argument strings needed to run a backend command. The result will be an argparse.Namespace object populated with the values obtained after the validation of the parameters.

Parameters: args – argument strings
Returns: an object with the parsed values
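
The behaviour described for this class can be approximated with the standard argparse module. In the sketch below, the class name ArgumentParserSketch, the flag names --from-date and --offset, and the positional origin argument are all hypothetical; only the from_date/offset incompatibility and the argparse.Namespace result come from the documentation above.

```python
import argparse


class ArgumentParserSketch:
    """Hypothetical, reduced model of a backend command argument parser."""

    def __init__(self, from_date=False, offset=False):
        # The two options are documented as mutually incompatible
        if from_date and offset:
            raise AttributeError("from_date and offset are incompatible")
        self.parser = argparse.ArgumentParser()
        if from_date:
            self.parser.add_argument('--from-date', dest='from_date',
                                     default=None)
        if offset:
            self.parser.add_argument('--offset', dest='offset',
                                     type=int, default=0)
        # Assumed positional argument naming the repository
        self.parser.add_argument('origin')

    def parse(self, *args):
        """Return an argparse.Namespace built from the argument strings."""
        return self.parser.parse_args(list(args))


parsed = ArgumentParserSketch(from_date=True).parse(
    '--from-date', '2020-01-01', 'https://example.com/repo')
```

Rejecting the incompatible combination in the constructor means the mistake surfaces when a command is defined rather than when a user runs it.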
perceval.backend.fetch(backend_class, backend_args, category, manager=None)[source]

Fetch items using the given backend.

Generator to get items using the given backend class. When an archive manager is given, this function will store the fetched items in an Archive. If an exception is raised, this archive will be removed to avoid corrupted archives.

The parameters needed to initialize the backend class and get the items are given using backend_args dict parameter.

Parameters:
  • backend_class – backend class to fetch items
  • backend_args – dict of arguments needed to fetch the items
  • category – category of the items to retrieve. If None, it will use the default backend category
  • manager – archive manager needed to store the items
Returns:

a generator of items
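
The clean-up rule described above (remove the archive if fetching fails) can be modelled with a generator wrapped in a try/except block. This is a minimal sketch: fetch_with_cleanup and failing_source are hypothetical names, and a temporary file stands in for a real archive.

```python
import os
import tempfile


def fetch_with_cleanup(items, archive_path):
    """Yield items; delete the archive file if iteration fails part-way."""
    try:
        for item in items:
            yield item
    except Exception:
        # Remove the partially written archive, then re-raise the error
        if os.path.exists(archive_path):
            os.remove(archive_path)
        raise


def failing_source():
    """Simulated backend that fails after producing one item."""
    yield 'item-1'
    raise ValueError("simulated fetch failure")


fd, path = tempfile.mkstemp(suffix='.sqlite3')
os.close(fd)

collected = []
try:
    for item in fetch_with_cleanup(failing_source(), path):
        collected.append(item)
except ValueError:
    pass
archive_removed = not os.path.exists(path)
```

Items yielded before the failure still reach the caller; only the on-disk archive is discarded, so a later retrieval never reads a half-written file.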

perceval.backend.fetch_from_archive(backend_class, backend_args, manager, category, archived_after)[source]

Fetch items from an archive manager.

Generator to get the items of a category (previously fetched by the given backend class) from an archive manager. Only those items archived after the given date will be returned.

The parameters needed to initialize backend and get the items are given using backend_args dict parameter.

Parameters:
  • backend_class – backend class to retrieve items
  • backend_args – dict of arguments needed to retrieve the items
  • manager – archive manager from which the items will be retrieved
  • category – category of the items to retrieve
  • archived_after – return items archived after this date
Returns:

a generator of archived items

perceval.backend.find_backends(top_package)[source]

Find available backends.

Look for the Perceval backends and commands under top_package and its sub-packages. When top_package defines a namespace, backends under that same namespace will be found too.

Parameters: top_package – package storing backends
Returns: a tuple with two dicts: one with Backend classes and one with BackendCommand classes
perceval.backend.uuid(*args)[source]

Generate a UUID based on the given parameters.

The UUID will be the SHA1 of the concatenation of the values from the list. The separator between these values is ‘:’. Each value must be a non-empty string; otherwise, the function will raise an exception.

Parameters: *args – list of arguments used to generate the UUID
Returns: a universally unique identifier
Raises: ValueError – when any of the values is not a string, is empty or is None
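
The rule above translates almost directly into code. The following is a sketch, not the library's source: the name uuid_sketch and the UTF-8 encoding of the joined string are assumptions.

```python
import hashlib


def uuid_sketch(*args):
    """SHA1 of the ':'-joined values; every value must be a non-empty str."""
    for value in args:
        if not isinstance(value, str):
            raise ValueError("%r is not a string" % (value,))
        if not value:
            raise ValueError("values cannot be empty")
    joined = ':'.join(args)
    return hashlib.sha1(joined.encode('utf-8')).hexdigest()


identifier = uuid_sketch('https://example.com/repo', '42')
```

Because the result depends only on the input strings, the same origin and item identifier always produce the same UUID, which makes duplicates easy to detect.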

perceval.client module

class perceval.client.HttpClient(base_url, max_retries=5, sleep_time=1, extra_headers=None, extra_status_forcelist=None, extra_retry_after_status=None, archive=None, from_archive=False)[source]

Bases: object

Abstract class for HTTP clients.

Base class to query data sources, taking care of retrying requests in case of connection issues. If the data source does not send back a response after retrying a request, a RetryError exception is thrown.

Subclasses can use the fetch method to obtain data from the data source.

To track which version of the client was used during the fetching process, this class provides a version attribute that each client may override.

Parameters:
  • base_url – base URL of the data source
  • max_retries – number of max retries to a data source before raising a RetryError exception
  • sleep_time – time to sleep in case of connection problems
DEFAULT_HEADERS = {'User-Agent': 'Perceval/0.11.7'}
DEFAULT_METHOD_WHITELIST = False
DEFAULT_RAISE_ON_REDIRECT = True
DEFAULT_RAISE_ON_STATUS = True
DEFAULT_RESPECT_RETRY_AFTER_HEADER = True
DEFAULT_RETRY_AFTER_STATUS_CODES = [413, 429, 503]
DEFAULT_SLEEP_TIME = 1
DEFAULT_STATUS_FORCE_LIST = [408, 423, 504]
GET = 'GET'
MAX_RETRIES = 5
MAX_RETRIES_ON_CONNECT = 5
MAX_RETRIES_ON_READ = 5
MAX_RETRIES_ON_REDIRECT = 5
MAX_RETRIES_ON_STATUS = 5
POST = 'POST'
fetch(url, payload=None, headers=None, method='GET', stream=False, verify=True)[source]

Fetch the data from a given URL.

Parameters:
  • url – link to the resource
  • payload – payload of the request
  • headers – headers of the request
  • method – type of request call (GET or POST)
  • stream – defer downloading the response body until the response content is available
  • verify – whether to verify the SSL certificate

Returns: a response object
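
The retry behaviour controlled by max_retries and sleep_time can be illustrated independently of any HTTP library. The sketch below is a generic retry loop, not the mechanism this client actually uses; call_with_retries, RetryErrorSketch and flaky_request are hypothetical names.

```python
import time


class RetryErrorSketch(Exception):
    """Stand-in for the RetryError mentioned in the class description."""


def call_with_retries(request, max_retries=5, sleep_time=1, sleep=time.sleep):
    """Call request(); on ConnectionError wait sleep_time and try again."""
    for attempt in range(max_retries + 1):
        try:
            return request()
        except ConnectionError:
            if attempt == max_retries:
                raise RetryErrorSketch(
                    "no response after %d retries" % max_retries)
            sleep(sleep_time)


attempts = []


def flaky_request():
    """Simulated request that fails twice before succeeding."""
    attempts.append(1)
    if len(attempts) < 3:
        raise ConnectionError("temporary failure")
    return 'ok'


result = call_with_retries(flaky_request, max_retries=5, sleep_time=0)
```

The sleep function is passed in as a parameter so that the loop can be exercised in tests without real delays.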

static sanitize_for_archive(url, headers, payload)[source]

Sanitize the URL, headers and payload of an HTTP request before storing/retrieving items. By default, this method does not modify url, headers and payload. The modifications take place within the specific backends that override sanitize_for_archive.

Parameters:
  • url – HTTP request URL
  • headers – HTTP request headers
  • payload – HTTP request payload
Returns:

the sanitized url, headers and payload

version = '0.1.5'
class perceval.client.RateLimitHandler[source]

Bases: object

Class to handle rate limit for HTTP clients.

Parameters:
  • sleep_for_rate – sleep until rate limit is reset
  • min_rate_to_sleep – minimum rate needed to sleep until it is reset
  • rate_limit_header – header to know the current rate limit
  • rate_limit_reset_header – header to know the next rate limit reset
MAX_RATE_LIMIT = 500
MIN_RATE_LIMIT = 10
RATE_LIMIT_HEADER = 'X-RateLimit-Remaining'
RATE_LIMIT_RESET_HEADER = 'X-RateLimit-Reset'
calculate_time_to_reset()[source]

Calculate the seconds to reset the token requests.

setup_rate_limit_handler(sleep_for_rate=False, min_rate_to_sleep=10, rate_limit_header='X-RateLimit-Remaining', rate_limit_reset_header='X-RateLimit-Reset')[source]

Setup the rate limit handler.

Parameters:
  • sleep_for_rate – sleep until rate limit is reset
  • min_rate_to_sleep – minimum rate needed to make the fetching process sleep
  • rate_limit_header – header from which to extract the rate limit data
  • rate_limit_reset_header – header from which to extract the rate limit reset data
sleep_for_rate_limit()[source]

The fetching process sleeps until the rate limit is restored or raises a RateLimitError exception if sleep_for_rate flag is disabled.

update_rate_limit(response)[source]

Update the rate limit and the time to reset from the response headers.

Parameters: response – the response object
version = '0.2'
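
The interaction between the methods above can be sketched with a small stand-alone class. This example assumes that the reset header carries a Unix timestamp in seconds and that the handler acts once the remaining rate is at or below min_rate_to_sleep; RateLimitSketch, RateLimitErrorSketch and the injectable now and sleep callables are hypothetical.

```python
import time

RATE_LIMIT_HEADER = 'X-RateLimit-Remaining'
RATE_LIMIT_RESET_HEADER = 'X-RateLimit-Reset'


class RateLimitErrorSketch(Exception):
    """Stand-in for RateLimitError in this sketch."""


class RateLimitSketch:
    def __init__(self, sleep_for_rate=False, min_rate_to_sleep=10,
                 now=time.time, sleep=time.sleep):
        self.sleep_for_rate = sleep_for_rate
        self.min_rate_to_sleep = min_rate_to_sleep
        self.rate_limit = None
        self.rate_limit_reset_ts = None
        self._now = now      # injectable clock, for testing
        self._sleep = sleep  # injectable sleep, for testing

    def update_rate_limit(self, headers):
        """Read the remaining rate and the reset time from the headers."""
        if RATE_LIMIT_HEADER in headers:
            self.rate_limit = int(headers[RATE_LIMIT_HEADER])
        if RATE_LIMIT_RESET_HEADER in headers:
            self.rate_limit_reset_ts = int(headers[RATE_LIMIT_RESET_HEADER])

    def calculate_time_to_reset(self):
        """Seconds until the reset timestamp, never negative."""
        return max(0, self.rate_limit_reset_ts - int(self._now()))

    def sleep_for_rate_limit(self):
        """Sleep until reset, or raise if sleeping is disabled."""
        if self.rate_limit is None or self.rate_limit > self.min_rate_to_sleep:
            return
        seconds = self.calculate_time_to_reset()
        if not self.sleep_for_rate:
            raise RateLimitErrorSketch(
                "rate limit exhausted; %s seconds to rate reset" % seconds)
        self._sleep(seconds)


slept = []
handler = RateLimitSketch(sleep_for_rate=True,
                          now=lambda: 1000, sleep=slept.append)
handler.update_rate_limit({RATE_LIMIT_HEADER: '5',
                           RATE_LIMIT_RESET_HEADER: '1030'})
handler.sleep_for_rate_limit()
```

With sleep_for_rate left at False, the same situation raises an error instead of blocking, which lets the caller decide how to proceed.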

perceval.errors module

exception perceval.errors.ArchiveError(**kwargs)[source]

Bases: perceval.errors.BaseError

Generic error for archive objects

message = '%(cause)s'
exception perceval.errors.ArchiveManagerError(**kwargs)[source]

Bases: perceval.errors.BaseError

Generic error for archive manager

message = '%(cause)s'
exception perceval.errors.BackendError(**kwargs)[source]

Bases: perceval.errors.BaseError

Generic error for backends

message = '%(cause)s'
exception perceval.errors.BaseError(**kwargs)[source]

Bases: Exception

Base class for Perceval exceptions.

Derived classes can override the error message by declaring the message property.

message = 'Perceval base error'
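
The message attributes in this module look like printf-style templates filled from keyword arguments. A minimal sketch of that pattern follows; the class names and the way the template is applied in the constructor are assumptions, not the module's confirmed implementation.

```python
class BaseErrorSketch(Exception):
    """Hypothetical base class: formats `message` with the keyword arguments."""

    message = 'base error'

    def __init__(self, **kwargs):
        super().__init__()
        # '%(name)s' placeholders are filled from the keyword arguments
        self.msg = self.message % kwargs

    def __str__(self):
        return self.msg


class RateLimitErrorSketch(BaseErrorSketch):
    """Derived class that only overrides the template."""

    message = '%(cause)s; %(seconds_to_reset)s seconds to rate reset'


text = str(RateLimitErrorSketch(cause='rate limit exhausted',
                                seconds_to_reset=30))
plain = str(BaseErrorSketch())
```

Keeping the template on the class means a new exception type needs nothing more than a message attribute.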
exception perceval.errors.HttpClientError(**kwargs)[source]

Bases: perceval.errors.BaseError

Generic error for HTTP client

message = '%(cause)s'
exception perceval.errors.ParseError(**kwargs)[source]

Bases: perceval.errors.BaseError

Exception raised when a parsing error occurs

message = '%(cause)s'
exception perceval.errors.RateLimitError(**kwargs)[source]

Bases: perceval.errors.BaseError

Exception raised when the rate limit is exceeded

message = '%(cause)s; %(seconds_to_reset)s seconds to rate reset'
seconds_to_reset
exception perceval.errors.RepositoryError(**kwargs)[source]

Bases: perceval.errors.BaseError

Generic error for repositories

message = '%(cause)s'

perceval.utils module

perceval.utils.check_compressed_file_type(filepath)[source]

Check whether the file is a compressed file type supported by the tool.

This function uses magic numbers (first four bytes) to determine the type of the file. Supported types are ‘gz’ and ‘bz2’. When the filetype is not supported, the function returns None.

Parameters: filepath – path to the file
Returns: ‘gz’ or ‘bz2’; None if the type is not supported
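
Magic-number detection can be sketched as below. The example matches on the standard gzip prefix (1f 8b) and bzip2 prefix ("BZh") within the first four bytes; the exact byte sequences compared by the real function may be stricter. The name compressed_type_sketch is hypothetical.

```python
import bz2
import gzip
import os
import tempfile

MAGIC_PREFIXES = {
    b'\x1f\x8b': 'gz',   # gzip member header
    b'BZh': 'bz2',       # bzip2 stream header
}


def compressed_type_sketch(filepath):
    """Return 'gz', 'bz2' or None based on the first four bytes."""
    with open(filepath, 'rb') as f:
        head = f.read(4)
    for prefix, kind in MAGIC_PREFIXES.items():
        if head.startswith(prefix):
            return kind
    return None


# Build three small sample files to try the function on
tmpdir = tempfile.mkdtemp()
gz_path = os.path.join(tmpdir, 'sample.gz')
bz2_path = os.path.join(tmpdir, 'sample.bz2')
txt_path = os.path.join(tmpdir, 'sample.txt')
with gzip.open(gz_path, 'wb') as f:
    f.write(b'hello')
with bz2.open(bz2_path, 'wb') as f:
    f.write(b'hello')
with open(txt_path, 'wb') as f:
    f.write(b'hello')
```

Inspecting the content rather than the extension means a misnamed file is still classified correctly.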
perceval.utils.message_to_dict(msg)[source]

Convert an email message into a dictionary.

This function transforms an email.message.Message object into a dictionary. Headers are stored as key:value pairs, while the body of the message is stored under the body key. The body may contain two keys: ‘plain’, for plain text content, and ‘html’, for HTML encoded content.

The returned dictionary has the type requests.structures.CaseInsensitiveDict because the same header can appear with different letter case in the same message.

Parameters: msg – email message of type email.message.Message
Returns: a dictionary of type requests.structures.CaseInsensitiveDict
Raises: ParseError – when an error occurs transforming the message to a dictionary
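
The shape of the resulting dictionary can be illustrated with the standard email package. This sketch deliberately uses a plain dict instead of requests.structures.CaseInsensitiveDict to stay dependency-free, so repeated headers simply overwrite each other; message_to_dict_sketch is a hypothetical name.

```python
import email


def message_to_dict_sketch(msg):
    """Headers become top-level keys; text parts go under 'body'."""
    result = {key: value for key, value in msg.items()}
    body = {}
    for part in msg.walk():
        if part.get_content_maintype() != 'text':
            continue  # skip multipart containers and attachments
        subtype = part.get_content_subtype()
        if subtype in ('plain', 'html'):
            payload = part.get_payload(decode=True)
            charset = part.get_content_charset() or 'utf-8'
            # Keep the first part found for each subtype
            body.setdefault(subtype, payload.decode(charset, errors='replace'))
    result['body'] = body
    return result


raw = "Subject: Hi\nFrom: dev@example.com\n\nHello world\n"
converted = message_to_dict_sketch(email.message_from_string(raw))
```

For a multipart message with both text/plain and text/html parts, the body would hold both the 'plain' and 'html' keys.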
perceval.utils.months_range(from_date, to_date)[source]

Generate a months range.

Generator of months starting on from_date until to_date. Each returned item is a tuple of two datetime objects like in (month, month+1). Thus, the result will follow the sequence:

((fd, fd+1), (fd+1, fd+2), …, (td-2, td-1), (td-1, td))
Parameters:
  • from_date – generate dates starting on this month
  • to_date – generate dates until this month
Returns:

a generator of months range
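
The sequence above can be reproduced with a short generator. This sketch normalizes both dates to the first day of their month, which is an assumption; the real function may preserve the day and time of from_date. The name months_range_sketch is hypothetical.

```python
import datetime


def months_range_sketch(from_date, to_date):
    """Yield (month, next_month) pairs between the two dates."""

    def first_of_month(d):
        return datetime.datetime(d.year, d.month, 1)

    def next_month(d):
        # December rolls over to January of the following year
        return datetime.datetime(d.year + d.month // 12, d.month % 12 + 1, 1)

    current = first_of_month(from_date)
    end = first_of_month(to_date)
    while current < end:
        following = next_month(current)
        yield current, following
        current = following


pairs = list(months_range_sketch(datetime.datetime(2020, 11, 1),
                                 datetime.datetime(2021, 2, 1)))
```

Each pair shares its second element with the first element of the next pair, so the pairs can be used as consecutive, non-overlapping query windows.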

perceval.utils.remove_invalid_xml_chars(raw_xml)[source]

Remove control and invalid characters from an xml stream.

Looks for invalid characters and substitutes them with whitespace. This solution is based on these two posts: Olemis Lang’s response on StackOverflow (http://stackoverflow.com/questions/1707890) and lawlesst’s on GitHub Gist (https://gist.github.com/lawlesst/4110923), which is based on the previous answer.

Parameters: raw_xml – XML stream
Returns: a purged XML stream
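
A compact way to express this kind of clean-up is a single regular expression. The sketch below replaces every character outside the XML 1.0 Char production with a space; the character ranges used by the real function may differ, and remove_invalid_sketch is a hypothetical name.

```python
import re

# Anything that is NOT allowed by the XML 1.0 Char production:
# tab, newline, carriage return, U+0020-U+D7FF, U+E000-U+FFFD,
# U+10000-U+10FFFF
INVALID_XML_CHARS = re.compile(
    '[^\t\n\r\x20-\ud7ff\ue000-\ufffd\U00010000-\U0010ffff]'
)


def remove_invalid_sketch(raw_xml):
    """Replace each invalid character with a single space."""
    return INVALID_XML_CHARS.sub(' ', raw_xml)


purged = remove_invalid_sketch('a\x00b\x0bc\td')
```

Substituting a space rather than deleting the character keeps the length of the stream, and therefore any reported error offsets, unchanged.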
perceval.utils.xml_to_dict(raw_xml)[source]

Convert an XML stream into a dictionary.

This function transforms an XML stream into a dictionary. The attributes are stored as single elements while child nodes are stored into lists. The text node is stored using the special key ‘__text__’.

This code is based on Winston Ewert’s solution to this problem. See http://codereview.stackexchange.com/questions/10400/convert-elementtree-to-dict for more info. The code was licensed as CC BY-SA 3.0.

Parameters: raw_xml – XML stream
Returns: a dict with the XML data
Raises: ParseError – raised when an error occurs parsing the given XML stream
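
The mapping described above (attributes as single values, children as lists, text under ‘__text__’) can be sketched with xml.etree.ElementTree. The names node_to_dict and xml_to_dict_sketch are hypothetical, and error handling is omitted for brevity.

```python
import xml.etree.ElementTree as ET


def node_to_dict(node):
    """Recursively convert an element into a dictionary."""
    result = dict(node.attrib)            # attributes as single elements
    if node.text is not None:
        result['__text__'] = node.text    # text under the special key
    children = {}
    for child in node:
        # Children with the same tag are collected in one list
        children.setdefault(child.tag, []).append(node_to_dict(child))
    result.update(children)
    return result


def xml_to_dict_sketch(raw_xml):
    return node_to_dict(ET.fromstring(raw_xml))


parsed = xml_to_dict_sketch(
    '<bug id="1"><comment>first</comment><comment>second</comment></bug>'
)
```

Storing children in lists even when there is only one keeps the structure uniform, so callers never need to check whether a key holds a single item or several.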

Module contents