perceval.backends.core package

Submodules

perceval.backends.core.askbot module

class perceval.backends.core.askbot.Askbot(url, tag=None, archive=None)[source]

Bases: perceval.backend.Backend

Askbot backend.

This class retrieves the questions posted on an Askbot site. To initialize this class the URL must be provided. The URL will be set as the origin of the data.

Parameters:
  • url – Askbot site URL
  • tag – label used to mark the data
  • archive – archive to store/retrieve items
CATEGORIES = ['question']
fetch(category='question', from_date=datetime.datetime(1970, 1, 1, 0, 0, tzinfo=tzutc()))[source]

Fetch the questions/answers from the repository.

The method retrieves, from an Askbot site, the questions and answers updated since the given date.

Parameters:
  • category – the category of items to fetch
  • from_date – obtain questions/answers updated since this date
Returns:

a generator of items

fetch_items(category, **kwargs)[source]

Fetch the questions.

Parameters:
  • category – the category of items to fetch
  • kwargs – backend arguments
Returns:

a generator of items

classmethod has_archiving()[source]

Returns whether it supports archiving items during the fetch process.

Returns:this backend supports items archive
classmethod has_resuming()[source]

Returns whether it supports resuming the fetch process.

Returns:this backend supports items resuming
static metadata_category(item)[source]

Extracts the category from an Askbot item.

This backend only generates one type of item which is ‘question’.

static metadata_id(item)[source]

Extracts the identifier from an Askbot question item.

static metadata_updated_on(item)[source]

Extracts the update time from an Askbot item.

The timestamp is extracted from the ‘last_activity_at’ field. This date is a UNIX timestamp but needs to be converted to a float value.

Parameters:item – item generated by the backend
Returns:a UNIX timestamp
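As a minimal sketch of the conversion described above, with a hypothetical item, the field is simply cast to a float:

```python
# Hypothetical Askbot item; 'last_activity_at' holds a UNIX timestamp as a string
item = {'last_activity_at': '1498123456'}

# metadata_updated_on converts it to a float value, as described above
updated_on = float(item['last_activity_at'])
```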
version = '0.6.5'
class perceval.backends.core.askbot.AskbotClient(base_url, archive=None, from_archive=False)[source]

Bases: perceval.client.HttpClient

Askbot client.

This class implements a simple client to retrieve distinct kinds of data from an Askbot site.

Parameters:
  • base_url – URL of the Askbot site
  • archive – an archive to store/read fetched data
  • from_archive – it tells whether to write/read the archive
Raises:

HTTPError – when an error occurs doing the request

API_QUESTIONS = 'api/v1/questions/'
COMMENTS = 's/post_comments'
COMMENTS_OLD = 'post_comments'
HTML_QUESTION = 'question/'
ORDER_API = 'activity-asc'
ORDER_HTML = 'votes'
get_api_questions(path)[source]

Retrieve a question page using the API.

Parameters:path – path of the page to retrieve
get_comments(post_id)[source]

Retrieve the list of comments of a given post.

Parameters:post_id – post identifier
get_html_question(question_id, page=1)[source]

Retrieve a raw HTML question and all its information.

Parameters:
  • question_id – question identifier
  • page – page to retrieve
class perceval.backends.core.askbot.AskbotCommand(*args)[source]

Bases: perceval.backend.BackendCommand

Class to run Askbot backend from the command line.

BACKEND

alias of Askbot

static setup_cmd_parser()[source]

Returns the Askbot argument parser.

class perceval.backends.core.askbot.AskbotParser[source]

Bases: object

Askbot HTML parser.

This class parses a plain HTML document, converting questions, answers, comments and user information into dict items.

static parse_answers(html_question)[source]

Parse the answers of a given HTML question.

The method parses the answers related with a given HTML question, as well as all the comments related to the answer.

Parameters:html_question – raw HTML question element
Returns:a list with the answers
static parse_number_of_html_pages(html_question)[source]

Parse number of answer pages to paginate over them.

Parameters:html_question – raw HTML question element
Returns:an integer with the number of pages
static parse_question_container(html_question)[source]

Parse the question info container of a given HTML question.

The method parses the information available in the question information container. The container can have up to two elements: the first one contains the information related to the user who generated the question and the date (if any). The second one contains the date of the update, and the user who updated it (if not the same who generated the question).

Parameters:html_question – raw HTML question element
Returns:an object with the parsed information
static parse_user_info(update_info)[source]

Parse the user information of a given HTML container.

The method parses all the available user information in the container. If the class “user-info” exists, the method will get all the available information in the container. If not, but a class “tip” exists, it is a wiki post with no user associated. Otherwise, the container can be empty.

Parameters:update_info – beautiful soup answer container element
Returns:an object with the parsed information

perceval.backends.core.bugzilla module

class perceval.backends.core.bugzilla.Bugzilla(url, user=None, password=None, max_bugs=200, max_bugs_csv=10000, tag=None, archive=None)[source]

Bases: perceval.backend.Backend

Bugzilla backend.

This class allows fetching the bugs stored in a Bugzilla repository. To initialize this class the URL of the server must be provided. The URL will be set as the origin of the data.

Parameters:
  • url – Bugzilla server URL
  • user – Bugzilla user
  • password – Bugzilla user password
  • max_bugs – maximum number of bugs requested on the same query
  • tag – label used to mark the data
  • archive – archive to store/retrieve items
CATEGORIES = ['bug']
fetch(category='bug', from_date=datetime.datetime(1970, 1, 1, 0, 0, tzinfo=tzutc()))[source]

Fetch the bugs from the repository.

The method retrieves, from a Bugzilla repository, the bugs updated since the given date.

Parameters:
  • category – the category of items to fetch
  • from_date – obtain bugs updated since this date
Returns:

a generator of bugs

fetch_items(category, **kwargs)[source]

Fetch the bugs.

Parameters:
  • category – the category of items to fetch
  • kwargs – backend arguments
Returns:

a generator of items

classmethod has_archiving()[source]

Returns whether it supports archiving items during the fetch process.

Returns:this backend supports items archive
classmethod has_resuming()[source]

Returns whether it supports resuming the fetch process.

Returns:this backend supports items resuming
static metadata_category(item)[source]

Extracts the category from a Bugzilla item.

This backend only generates one type of item which is ‘bug’.

static metadata_id(item)[source]

Extracts the identifier from a Bugzilla item.

static metadata_updated_on(item)[source]

Extracts and converts the update time from a Bugzilla item.

The timestamp is extracted from the ‘delta_ts’ field. This date is converted to UNIX timestamp format. Because Bugzilla servers ignore the timezone on HTTP requests, it is ignored during the conversion, too.

Parameters:item – item generated by the backend
Returns:a UNIX timestamp
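A minimal sketch of this conversion, assuming a hypothetical ‘delta_ts’ value (the exact format may vary across servers):

```python
from datetime import datetime, timezone

# Hypothetical 'delta_ts' value taken from a bug item
raw = '2015-08-12 18:32:11 +0200'

dt = datetime.strptime(raw, '%Y-%m-%d %H:%M:%S %z')

# Bugzilla servers ignore the timezone on HTTP requests, so the
# conversion ignores it too: the naive date is interpreted as UTC
ts = dt.replace(tzinfo=timezone.utc).timestamp()
```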
static parse_bug_activity(raw_html)[source]

Parse a Bugzilla bug activity HTML stream.

This method extracts the information about activity from the given HTML stream. The bug activity is stored in an HTML table. Each parsed activity event is returned as a dictionary.

If the given HTML is invalid, the method will raise a ParseError exception.

Parameters:raw_html – HTML string to parse
Returns:a generator of parsed activity events
Raises:ParseError – raised when an error occurs parsing the given HTML stream
static parse_buglist(raw_csv)[source]

Parse a Bugzilla CSV bug list.

The method parses the CSV file and returns an iterator of dictionaries. Each one of them contains the summary of a bug.

Parameters:raw_csv – CSV string to parse
Returns:a generator of parsed bugs
static parse_bugs_details(raw_xml)[source]

Parse a Bugzilla bugs details XML stream.

This method returns a generator which parses the given XML, producing an iterator of dictionaries. Each dictionary stores the information related to a parsed bug.

If the given XML is invalid or does not contain any bug, the method will raise a ParseError exception.

Parameters:raw_xml – XML string to parse
Returns:a generator of parsed bugs
Raises:ParseError – raised when an error occurs parsing the given XML stream
version = '0.10.3'
class perceval.backends.core.bugzilla.BugzillaClient(base_url, user=None, password=None, max_bugs_csv=10000, archive=None, from_archive=False)[source]

Bases: perceval.client.HttpClient

Bugzilla API client.

This class implements a simple client to retrieve distinct kinds of data from a Bugzilla repository. Currently, it only supports 3.x and 4.x servers.

When it is initialized, it checks if the given Bugzilla is available and retrieves its version.

Parameters:
  • base_url – URL of the Bugzilla server
  • user – Bugzilla user
  • password – user password
  • max_bugs_csv – max bugs requested per CSV query
  • archive – an archive to store/read fetched data
  • from_archive – it tells whether to write/read the archive
Raises:

BackendError – when an error occurs initializing the client

CGI_BUG = 'show_bug.cgi'
CGI_BUGLIST = 'buglist.cgi'
CGI_BUG_ACTIVITY = 'show_activity.cgi'
CGI_LOGIN = 'index.cgi'
CTYPE_CSV = 'csv'
CTYPE_XML = 'xml'
OLD_STYLE_VERSIONS = ['3.2.3', '3.2.2']
PBUGZILLA_LOGIN = 'Bugzilla_login'
PBUGZILLA_PASSWORD = 'Bugzilla_password'
PBUG_ID = 'id'
PCHFIELD_FROM = 'chfieldfrom'
PCTYPE = 'ctype'
PEXCLUDE_FIELD = 'excludefield'
PLIMIT = 'limit'
PLOGIN = 'GoAheadAndLogIn'
PLOGOUT = 'logout'
PORDER = 'order'
URL = '%(base)s/%(cgi)s'
VERSION_REGEX = re.compile('.+bugzilla version="([^"]+)"', re.DOTALL)
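As a sketch of how the client uses the constants above, the URL template and the version regex can be exercised with hypothetical values (the server URL and XML snippet are made up):

```python
import re

# Class constants reproduced from the client above
URL = '%(base)s/%(cgi)s'
VERSION_REGEX = re.compile('.+bugzilla version="([^"]+)"', re.DOTALL)

# Build the endpoint of a CGI command
url = URL % {'base': 'https://bugzilla.example.com', 'cgi': 'buglist.cgi'}

# Extract the server version from a metadata XML snippet
xml = '<bugzilla version="4.4.6" urlbase="https://bugzilla.example.com/">'
version = VERSION_REGEX.match(xml).group(1)
```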
bug_activity(bug_id)[source]

Get the activity of a bug in HTML format.

Parameters:bug_id – bug identifier
buglist(from_date=datetime.datetime(1970, 1, 1, 0, 0, tzinfo=tzutc()))[source]

Get a summary of bugs in CSV format.

Parameters:from_date – retrieve bugs that were updated since that date
bugs(*bug_ids)[source]

Get the information of a list of bugs in XML format.

Parameters:bug_ids – list of bug identifiers
call(cgi, params)[source]

Run an API command.

Parameters:
  • cgi – cgi command to run on the server
  • params – dict with the HTTP parameters needed to run the given command
login(user, password)[source]

Authenticate a user in the server.

Parameters:
  • user – Bugzilla user
  • password – user password
logout()[source]

Logout from the server.

metadata()[source]

Get metadata information in XML format.

static sanitize_for_archive(url, headers, payload)[source]

Sanitize the payload of an HTTP request by removing the login and password information before storing/retrieving archived items.

Parameters:
  • url – HTTP request URL
  • headers – HTTP request headers
  • payload – HTTP request payload
Returns:url, headers and the sanitized payload
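A minimal sketch of this sanitization, assuming the payload is a plain dict of HTTP parameters:

```python
# Parameter names reproduced from the client above
PBUGZILLA_LOGIN = 'Bugzilla_login'
PBUGZILLA_PASSWORD = 'Bugzilla_password'

def sanitize_for_archive(url, headers, payload):
    """Drop credential parameters before the request is archived (sketch)."""
    if payload:
        payload = {k: v for k, v in payload.items()
                   if k not in (PBUGZILLA_LOGIN, PBUGZILLA_PASSWORD)}
    return url, headers, payload

# Hypothetical login request
url, headers, payload = sanitize_for_archive(
    'https://bugzilla.example.com/index.cgi', None,
    {'Bugzilla_login': 'jdoe', 'Bugzilla_password': 'secret',
     'GoAheadAndLogIn': 1})
```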

class perceval.backends.core.bugzilla.BugzillaCommand(*args)[source]

Bases: perceval.backend.BackendCommand

Class to run Bugzilla backend from the command line.

BACKEND

alias of Bugzilla

static setup_cmd_parser()[source]

Returns the Bugzilla argument parser.

perceval.backends.core.bugzillarest module

class perceval.backends.core.bugzillarest.BugzillaREST(url, user=None, password=None, api_token=None, max_bugs=500, tag=None, archive=None)[source]

Bases: perceval.backend.Backend

Bugzilla backend that uses its API REST.

This class allows fetching the bugs stored in a Bugzilla server (version 5.0 or later). To initialize this class the URL of the server must be provided. The URL will be set as the origin of the data.

Parameters:
  • url – Bugzilla server URL
  • user – Bugzilla user
  • password – Bugzilla user password
  • api_token – Bugzilla token
  • max_bugs – maximum number of bugs requested on the same query
  • tag – label used to mark the data
  • archive – archive to store/retrieve items
CATEGORIES = ['bug']
fetch(category='bug', from_date=datetime.datetime(1970, 1, 1, 0, 0, tzinfo=tzutc()))[source]

Fetch the bugs from the repository.

The method retrieves, from a Bugzilla repository, the bugs updated since the given date.

Parameters:
  • category – the category of items to fetch
  • from_date – obtain bugs updated since this date
Returns:

a generator of bugs

fetch_items(category, **kwargs)[source]

Fetch the bugs.

Parameters:
  • category – the category of items to fetch
  • kwargs – backend arguments
Returns:

a generator of items

classmethod has_archiving()[source]

Returns whether it supports archiving items during the fetch process.

Returns:this backend supports items archive
classmethod has_resuming()[source]

Returns whether it supports resuming the fetch process.

Returns:this backend supports items resuming
static metadata_category(item)[source]

Extracts the category from a Bugzilla item.

This backend only generates one type of item which is ‘bug’.

static metadata_id(item)[source]

Extracts the identifier from a Bugzilla item.

static metadata_updated_on(item)[source]

Extracts the update time from a Bugzilla item.

The timestamp used is extracted from the ‘last_change_time’ field. This date is converted to UNIX timestamp format taking into account the timezone of the date.

Parameters:item – item generated by the backend
Returns:a UNIX timestamp
version = '0.8.3'
class perceval.backends.core.bugzillarest.BugzillaRESTClient(base_url, user=None, password=None, api_token=None, archive=None, from_archive=False)[source]

Bases: perceval.client.HttpClient

Bugzilla REST API client.

This class implements a simple client to retrieve distinct kinds of data from a Bugzilla > 5.0 repository using its REST API.

When user and password parameters are given, it logs into the server. Further requests will use the token obtained during the sign in phase.

Parameters:
  • base_url – URL of the Bugzilla server
  • user – Bugzilla user
  • password – user password
  • api_token – api token for user; when this is provided user and password parameters will be ignored
  • archive – an archive to store/read fetched data
  • from_archive – it tells whether to write/read the archive
Raises:

BackendError – when an error occurs initializing the client

PBUGZILLA_LOGIN = 'login'
PBUGZILLA_PASSWORD = 'password'
PBUGZILLA_TOKEN = 'token'
PEXCLUDE_FIELDS = 'exclude_fields'
PIDS = 'ids'
PINCLUDE_FIELDS = 'include_fields'
PLAST_CHANGE_TIME = 'last_change_time'
PLIMIT = 'limit'
POFFSET = 'offset'
PORDER = 'order'
RATTACHMENT = 'attachment'
RBUG = 'bug'
RCOMMENT = 'comment'
RHISTORY = 'history'
RLOGIN = 'login'
URL = '%(base)s/rest/%(resource)s'
VCHANGE_DATE_ORDER = 'changeddate'
VEXCLUDE_ATTCH_DATA = 'data'
VINCLUDE_ALL = '_all'
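As a sketch of how the client builds endpoints from the constants above, the URL template can be expanded with a hypothetical server URL:

```python
# Class constants reproduced from the client above
URL = '%(base)s/rest/%(resource)s'
RBUG = 'bug'

# Endpoint used to fetch bugs from a (made-up) server
url = URL % {'base': 'https://bugzilla.example.com', 'resource': RBUG}
```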
attachments(*bug_ids)[source]

Get the attachments of the given bugs.

Parameters:bug_ids – list of bug identifiers
bugs(from_date=datetime.datetime(1970, 1, 1, 0, 0, tzinfo=tzutc()), offset=None, max_bugs=500)[source]

Get the information of a list of bugs.

Parameters:
  • from_date – retrieve bugs that were updated since that date; dates are converted to UTC
  • offset – starting position for the search; e.g. to return the 11th element, set this value to 10
  • max_bugs – maximum number of bugs to return per query
call(resource, params)[source]

Retrieve the given resource.

Parameters:
  • resource – resource to retrieve
  • params – dict with the HTTP parameters needed to retrieve the given resource
Raises:

BugzillaRESTError – raised when an error is returned by the server

comments(*bug_ids)[source]

Get the comments of the given bugs.

Parameters:bug_ids – list of bug identifiers
history(*bug_ids)[source]

Get the history of the given bugs.

Parameters:bug_ids – list of bug identifiers
login(user, password)[source]

Authenticate a user in the server.

Parameters:
  • user – Bugzilla user
  • password – user password
static sanitize_for_archive(url, headers, payload)[source]

Sanitize the payload of an HTTP request by removing the login, password and token information before storing/retrieving archived items.

Parameters:
  • url – HTTP request URL
  • headers – HTTP request headers
  • payload – HTTP request payload
Returns:url, headers and the sanitized payload

class perceval.backends.core.bugzillarest.BugzillaRESTCommand(*args)[source]

Bases: perceval.backend.BackendCommand

Class to run BugzillaREST backend from the command line.

BACKEND

alias of BugzillaREST

static setup_cmd_parser()[source]

Returns the BugzillaREST argument parser.

exception perceval.backends.core.bugzillarest.BugzillaRESTError(**kwargs)[source]

Bases: perceval.errors.BaseError

Raised when an error occurs using the API

message = '%(error)s (code: %(code)s)'

perceval.backends.core.confluence module

class perceval.backends.core.confluence.Confluence(url, tag=None, archive=None)[source]

Bases: perceval.backend.Backend

Confluence backend.

This class allows fetching the historical contents (content versions) stored on a Confluence server. Initialize this class passing the URL of this server. The URL will be set as the origin of the data.

Parameters:
  • url – URL of the server
  • tag – label used to mark the data
  • archive – archive to store/retrieve items
CATEGORIES = ['historical content']
fetch(category='historical content', from_date=datetime.datetime(1970, 1, 1, 0, 0, tzinfo=tzutc()))[source]

Fetch the contents by version from the server.

This method fetches the different historical versions (or snapshots) of the contents stored in the server that were updated since the given date. Only those snapshots created or updated after from_date will be returned.

Take into account that the seconds of the from_date parameter will be ignored because the Confluence REST API only accepts dates with hours and minutes precision for timestamp values.

Parameters:
  • category – the category of items to fetch
  • from_date – obtain historical versions of contents updated since this date
Returns:

a generator of historical versions

fetch_items(category, **kwargs)[source]

Fetch the contents.

Parameters:
  • category – the category of items to fetch
  • kwargs – backend arguments
Returns:

a generator of items

classmethod has_archiving()[source]

Returns whether it supports archiving items during the fetch process.

Returns:this backend supports items archive
classmethod has_resuming()[source]

Returns whether it supports resuming the fetch process.

Returns:this backend supports items resuming
static metadata_category(item)[source]

Extracts the category from a Confluence item.

This backend only generates one type of item which is ‘historical content’.

static metadata_id(item)[source]

Extracts the identifier from a Confluence item.

This identifier will be the mix of two fields because a historical content does not have any unique identifier. In this case, ‘id’ and ‘version’ values are combined because it should not be possible to have two equal version numbers for the same content. The value to return will follow the pattern: <content>#v<version> (e.g. 28979#v10).
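A minimal sketch of the composite identifier, using a hypothetical historical content item:

```python
# Hypothetical historical content; real items carry many more fields
item = {'id': '28979', 'version': {'number': 10}}

# Combine content id and version, following the <content>#v<version> pattern
identifier = '%s#v%s' % (item['id'], item['version']['number'])
```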

static metadata_updated_on(item)[source]

Extracts and converts the update time from a Confluence item.

The timestamp is extracted from the ‘when’ field of the ‘version’ section. This date is converted to UNIX timestamp format.

Parameters:item – item generated by the backend
Returns:a UNIX timestamp
static parse_contents_summary(raw_json)[source]

Parse a Confluence summary JSON list.

The method parses a JSON stream and returns an iterator of dictionaries. Each dictionary is a content summary.

Parameters:raw_json – JSON string to parse
Returns:a generator of parsed content summaries.
static parse_historical_content(raw_json)[source]

Parse a Confluence historical content JSON stream.

This method parses a JSON stream and returns a dictionary that contains the data of a historical content.

Parameters:raw_json – JSON string to parse
Returns:a dict with historical content
version = '0.9.2'
class perceval.backends.core.confluence.ConfluenceClient(base_url, archive=None, from_archive=False)[source]

Bases: perceval.client.HttpClient

Confluence REST API client.

This class implements a client to retrieve contents from a Confluence server using its REST API.

Parameters:
  • base_url – URL of the Confluence server
  • archive – an archive to store/read fetched data
  • from_archive – it tells whether to write/read the archive
MSEARCH = 'search'
PCQL = 'cql'
PEXPAND = 'expand'
PLIMIT = 'limit'
PSTART = 'start'
PSTATUS = 'status'
PVERSION = 'version'
RCONTENTS = 'content'
RHISTORY = 'history'
RSPACE = 'space'
URL = '%(base)s/rest/api/%(resource)s'
VCQL = "lastModified>='%(date)s' order by lastModified"
VEXPAND = ['body.storage', 'history', 'version']
VHISTORICAL = 'historical'
contents(from_date=datetime.datetime(1970, 1, 1, 0, 0, tzinfo=tzutc()), offset=None, max_contents=200)[source]

Get the contents of a repository.

This method returns an iterator that manages the pagination over contents. Take into account that the seconds of the from_date parameter will be ignored because the API only works with hours and minutes.

Parameters:
  • from_date – fetch the contents updated since this date
  • offset – fetch the contents starting from this offset
  • max_contents – maximum number of contents to fetch per request
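A sketch of how the CQL query could be built from the VCQL template above, dropping the seconds the API does not accept (the date format shown is an assumption):

```python
from datetime import datetime

# Class constant reproduced from the client above
VCQL = "lastModified>='%(date)s' order by lastModified"

from_date = datetime(2018, 3, 5, 14, 30, 59)

# Seconds are dropped: the REST API only works with hours and minutes
cql = VCQL % {'date': from_date.strftime('%Y-%m-%d %H:%M')}
```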
historical_content(content_id, version)[source]

Get the snapshot of a content for the given version.

Parameters:
  • content_id – fetch the snapshot of this content
  • version – snapshot version of the content
class perceval.backends.core.confluence.ConfluenceCommand(*args)[source]

Bases: perceval.backend.BackendCommand

Class to run Confluence backend from the command line.

BACKEND

alias of Confluence

static setup_cmd_parser()[source]

Returns the Confluence argument parser.

perceval.backends.core.discourse module

class perceval.backends.core.discourse.Discourse(url, api_token=None, tag=None, archive=None)[source]

Bases: perceval.backend.Backend

Discourse backend for Perceval.

This class retrieves the topics posted in a Discourse board. To initialize this class the URL must be provided. The URL will be set as the origin of the data.

Parameters:
  • url – Discourse URL
  • api_token – Discourse API access token
  • tag – label used to mark the data
  • archive – archive to store/retrieve items
CATEGORIES = ['topic']
fetch(category='topic', from_date=datetime.datetime(1970, 1, 1, 0, 0, tzinfo=tzutc()))[source]

Fetch the topics from the Discourse board.

The method retrieves, from a Discourse board, the topics updated since the given date.

Parameters:
  • category – the category of items to fetch
  • from_date – obtain topics updated since this date
Returns:

a generator of topics

fetch_items(category, **kwargs)[source]

Fetch the topics.

Parameters:
  • category – the category of items to fetch
  • kwargs – backend arguments
Returns:

a generator of items

classmethod has_archiving()[source]

Returns whether it supports archiving items during the fetch process.

Returns:this backend supports items archive
classmethod has_resuming()[source]

Returns whether it supports resuming the fetch process.

Returns:this backend supports items resuming
static metadata_category(item)[source]

Extracts the category from a Discourse item.

This backend only generates one type of item which is ‘topic’.

static metadata_id(item)[source]

Extracts the identifier from a Discourse item.

static metadata_updated_on(item)[source]

Extracts the update time from a Discourse item.

The timestamp used is extracted from the ‘last_posted_at’ field. This date is converted to UNIX timestamp format taking into account the timezone of the date.

Parameters:item – item generated by the backend
Returns:a UNIX timestamp
version = '0.9.3'
class perceval.backends.core.discourse.DiscourseClient(base_url, api_key=None, archive=None, from_archive=False)[source]

Bases: perceval.client.HttpClient

Discourse API client.

This class implements a simple client to retrieve topics from any Discourse board.

Parameters:
  • base_url – URL of the Discourse site
  • api_key – Discourse API access token
  • archive – an archive to store/read fetched data
  • from_archive – it tells whether to write/read the archive
Raises:

HTTPError – when an error occurs doing the request

ALL_TOPICS = None
PKEY = 'api_key'
POSTS = 'posts'
PPAGE = 'page'
TJSON = '.json'
TOPIC = 't'
TOPICS_SUMMARY = 'latest'
post(post_id)[source]

Retrieve the post with the post_id identifier.

Parameters:post_id – identifier of the post to retrieve
static sanitize_for_archive(url, headers, payload)[source]

Sanitize the payload of an HTTP request by removing the token information before storing/retrieving archived items.

Parameters:
  • url – HTTP request URL
  • headers – HTTP request headers
  • payload – HTTP request payload
Returns:url, headers and the sanitized payload

topic(topic_id)[source]

Retrieve the topic with the topic_id identifier.

Parameters:topic_id – identifier of the topic to retrieve
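A sketch of the resource path this request could use, assembled from the class constants above (the board URL is hypothetical):

```python
# Class constants reproduced from the client above
TOPIC = 't'
TJSON = '.json'

base_url = 'https://discourse.example.com'  # made-up board URL
topic_id = 1305

# JSON resource for a single topic
url = '%s/%s/%s%s' % (base_url, TOPIC, topic_id, TJSON)
```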
topics_page(page=None)[source]

Retrieve the given page of summaries of the latest topics.

Parameters:page – number of page to retrieve
class perceval.backends.core.discourse.DiscourseCommand(*args)[source]

Bases: perceval.backend.BackendCommand

Class to run Discourse backend from the command line.

BACKEND

alias of Discourse

static setup_cmd_parser()[source]

Returns the Discourse argument parser.

perceval.backends.core.dockerhub module

class perceval.backends.core.dockerhub.DockerHub(owner, repository, tag=None, archive=None)[source]

Bases: perceval.backend.Backend

DockerHub backend for Perceval.

This class retrieves data from a repository stored on the Docker Hub site. To initialize this class, the owner and the repository from which data will be fetched must be provided. The origin of the data will be built with both parameters.

The shortcut owner _, used for official Docker repositories, will be replaced by its long name: library.
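The owner replacement described above can be sketched as follows (the helper name is made up for illustration):

```python
def resolve_owner(owner):
    # Official Docker repositories use the shortcut '_', which is
    # replaced by its long name, 'library'
    return 'library' if owner == '_' else owner

owner, repository = resolve_owner('_'), 'redis'
```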

Parameters:
  • owner – DockerHub owner
  • repository – DockerHub repository owned by owner
  • tag – label used to mark the data
  • archive – archive to store/retrieve items
CATEGORIES = ['dockerhub-data']
fetch(category='dockerhub-data')[source]

Fetch data from a Docker Hub repository.

The method retrieves, from a repository stored in Docker Hub, its data which includes number of pulls, stars, description, among other data.

Parameters:category – the category of items to fetch
Returns:a generator of data
fetch_items(category, **kwargs)[source]

Fetch the Docker Hub items.

Parameters:
  • category – the category of items to fetch
  • kwargs – backend arguments
Returns:

a generator of items

classmethod has_archiving()[source]

Returns whether it supports archiving items during the fetch process.

Returns:this backend supports items archive
classmethod has_resuming()[source]

Returns whether it supports resuming the fetch process.

Returns:this backend supports items resuming
static metadata_category(item)[source]

Extracts the category from a Docker Hub item.

This backend only generates one type of item which is ‘dockerhub-data’.

static metadata_id(item)[source]

Extracts the identifier from a Docker Hub item.

static metadata_updated_on(item)[source]

Extracts and converts the update time from a Docker Hub item.

The timestamp is extracted from the ‘fetched_on’ field. This field is not part of the data provided by Docker Hub; it is added by this backend.

Parameters:item – item generated by the backend
Returns:a UNIX timestamp
static parse_json(raw_json)[source]

Parse a Docker Hub JSON stream.

The method parses a JSON stream and returns a dict with the parsed data.

Parameters:raw_json – JSON string to parse
Returns:a dict with the parsed data
version = '0.4.2'
class perceval.backends.core.dockerhub.DockerHubClient(archive=None, from_archive=False)[source]

Bases: perceval.client.HttpClient

DockerHub API client.

Client for fetching information from the DockerHub server using its REST API v2.

Parameters:
  • archive – an archive to store/read fetched data
  • from_archive – it tells whether to write/read the archive
RREPOSITORY = 'repositories'
repository(owner, repository)[source]

Fetch information about a repository.

class perceval.backends.core.dockerhub.DockerHubCommand(*args)[source]

Bases: perceval.backend.BackendCommand

Class to run DockerHub backend from the command line.

BACKEND

alias of DockerHub

static setup_cmd_parser()[source]

Returns the DockerHub argument parser.

perceval.backends.core.gerrit module

class perceval.backends.core.gerrit.Gerrit(hostname, user=None, port='29418', max_reviews=500, blacklist_reviews=None, disable_host_key_check=False, tag=None, archive=None)[source]

Bases: perceval.backend.Backend

Gerrit backend.

Class to fetch the reviews from a Gerrit server. To initialize this class the hostname of the server must be provided. The hostname will be set as the origin of the data.

Parameters:
  • hostname – Gerrit server Hostname
  • user – SSH user used to connect to the Gerrit server
  • port – SSH port
  • max_reviews – maximum number of reviews requested on the same query
  • blacklist_reviews – exclude the reviews of this list while fetching
  • disable_host_key_check – disable host key controls
  • tag – label used to mark the data
  • archive – archive to store/retrieve items
CATEGORIES = ['review']
fetch(category='review', from_date=datetime.datetime(1970, 1, 1, 0, 0, tzinfo=tzutc()))[source]

Fetch the reviews from the repository.

The method retrieves, from a Gerrit repository, the reviews updated since the given date.

Parameters:
  • category – the category of items to fetch
  • from_date – obtain reviews updated since this date
Returns:

a generator of reviews

fetch_items(category, **kwargs)[source]

Fetch the reviews.

Parameters:
  • category – the category of items to fetch
  • kwargs – backend arguments
Returns:

a generator of items

classmethod has_archiving()[source]

Returns whether it supports archiving items during the fetch process.

Returns:this backend supports items archive
classmethod has_resuming()[source]

Returns whether it supports resuming the fetch process.

Returns:this backend does not support items resuming
static metadata_category(item)[source]

Extracts the category from a Gerrit item.

This backend only generates one type of item which is ‘review’.

static metadata_id(item)[source]

Extracts the identifier from a Gerrit item.

static metadata_updated_on(item)[source]

Extracts and converts the update time from a Gerrit item.

The timestamp is extracted from the ‘lastUpdated’ field. This date is a UNIX timestamp but needs to be converted to a float value.

Parameters:item – item generated by the backend
Returns:a UNIX timestamp
static parse_reviews(raw_data)[source]

Parse a Gerrit reviews list.

version = '0.11.1'
class perceval.backends.core.gerrit.GerritClient(repository, user=None, max_reviews=500, blacklist_reviews=None, disable_host_key_check=False, port='29418', archive=None, from_archive=False)[source]

Bases: object

Gerrit API client.

This class implements a client to retrieve reviews from a Gerrit repository using the SSH API. Currently it supports versions <2.8 and >=2.9 in incremental mode.

See the following link for more information: https://gerrit-documentation.storage.googleapis.com/Documentation/2.12/cmd-query.html

Parameters:
  • repository – Hostname of the Gerrit server
  • user – SSH user to be used to connect to gerrit server
  • max_reviews – max number of reviews per query
  • blacklist_reviews – exclude the reviews of this list while fetching
  • disable_host_key_check – disable host key controls
  • port – SSH port
  • archive – collect reviews already retrieved from an archive
  • from_archive – it tells whether to write/read the archive
CMD_GERRIT = 'gerrit'
CMD_VERSION = 'version'
MAX_RETRIES = 3
RETRY_WAIT = 60
VERSION_REGEX = re.compile('gerrit version (\\d+)\\.(\\d+).*')
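As a sketch, the version regex above extracts the major and minor numbers from the output of the `gerrit version` command (the output string here is hypothetical):

```python
import re

# Class constant reproduced from the client above
VERSION_REGEX = re.compile('gerrit version (\\d+)\\.(\\d+).*')

# Parse a made-up 'gerrit version' output line
m = VERSION_REGEX.match('gerrit version 2.12.2')
major, minor = int(m.group(1)), int(m.group(2))
```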
next_retrieve_group_item(last_item=None, entry=None)[source]

Return the item to start from in next reviews group.

reviews(last_item, filter_=None)[source]

Get the reviews starting from last_item.

static sanitize_for_archive(cmd)[source]

Sanitize the Gerrit command by removing username information before storing/retrieving archived items

Parameters:cmd – Gerrit command
Returns:the sanitized cmd

version

Return the Gerrit server version.

class perceval.backends.core.gerrit.GerritCommand(*args)[source]

Bases: perceval.backend.BackendCommand

Class to run Gerrit backend from the command line.

BACKEND

alias of Gerrit

static setup_cmd_parser()[source]

Returns the Gerrit argument parser.

perceval.backends.core.git module

exception perceval.backends.core.git.EmptyRepositoryError(**kwargs)[source]

Bases: perceval.errors.RepositoryError

Exception raised when a repository is empty

message = '%(repository)s is empty'
class perceval.backends.core.git.Git(uri, gitpath, tag=None, archive=None)[source]

Bases: perceval.backend.Backend

Git backend.

This class allows fetching the commits from a Git repository (local or remote) or from a log file. To initialize this class, you have to provide the repository URI and a value for gitpath. This URI will be set as the origin of the data.

When gitpath is a directory or does not exist, it will be considered as the place where the repository is/will be cloned; when gitpath is a file it will be considered as a Git log file.

Parameters:
  • uri – URI of the Git repository
  • gitpath – path to the repository or to the log file
  • tag – label used to mark the data
  • archive – archive to store/retrieve items
Raises:

RepositoryError – raised when there was an error cloning or updating the repository.

CATEGORIES = ['commit']
fetch(category='commit', from_date=datetime.datetime(1970, 1, 1, 0, 0, tzinfo=tzutc()), to_date=datetime.datetime(2100, 1, 1, 0, 0, tzinfo=tzutc()), branches=None, latest_items=False)[source]

Fetch commits.

The method retrieves from a Git repository or a log file a list of commits. Commits are returned in the same order they were obtained.

When the from_date parameter is given, it returns items committed since the given date.

The list of branches is a list of strings, with the names of the branches to fetch. If the list of branches is empty, no commit is fetched. If the list of branches is None, all commits for all branches will be fetched.

The parameter latest_items returns only those commits which are new since the last time this method was called.

Take into account that from_date and branches are ignored when the commits are fetched from a Git log file or when latest_items flag is set.

The class raises a RepositoryError exception when an error occurs accessing the repository.

Parameters:
  • category – the category of items to fetch
  • from_date – obtain commits newer than a specific date (inclusive)
  • to_date – obtain commits older than a specific date
  • branches – names of branches to fetch from (default: None)
  • latest_items – sync with the repository to fetch only the newest commits
Returns:

a generator of commits

fetch_items(category, **kwargs)[source]

Fetch the commits

Parameters:
  • category – the category of items to fetch
  • kwargs – backend arguments
Returns:

a generator of items

classmethod has_archiving()[source]

Returns whether it supports archiving items on the fetch process.

Returns:this backend does not support items archive
classmethod has_resuming()[source]

Returns whether it supports resuming the fetch process.

Returns:this backend supports items resuming
static metadata_category(item)[source]

Extracts the category from a Git item.

This backend only generates one type of item which is ‘commit’.

static metadata_id(item)[source]

Extracts the identifier from a Git item.

static metadata_updated_on(item)[source]

Extracts the update time from a Git item.

The timestamp used is extracted from ‘CommitDate’ field. This date is converted to UNIX timestamp format taking into account the timezone of the date.

Parameters:item – item generated by the backend
Returns:a UNIX timestamp
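A sketch of the timezone-aware conversion described above, using an illustrative 'CommitDate' value in the format produced by git log --pretty=fuller:

```python
from datetime import datetime

# Illustrative 'CommitDate' value; the format matches `git log --pretty=fuller`
commit_date = 'Tue Aug 14 14:30:13 2012 -0300'

# %z parses the UTC offset, so the resulting datetime is timezone-aware
dt = datetime.strptime(commit_date, '%a %b %d %H:%M:%S %Y %z')
timestamp = dt.timestamp()  # UNIX timestamp; the -0300 offset is taken into account
```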
static parse_git_log_from_file(filepath)[source]

Parse a Git log file.

The method parses the Git log file and returns an iterator of dictionaries. Each one of them contains a commit.

Parameters:

filepath – path to the log file

Returns:

a generator of parsed commits

Raises:
  • ParseError – raised when the format of the Git log file is invalid
  • OSError – raised when an error occurs reading the given file
static parse_git_log_from_iter(iterator)[source]

Parse a Git log obtained from an iterator.

The method parses the Git log fetched from an iterator, where each item is a line of the log. It returns an iterator of dictionaries. Each dictionary contains a commit.

Parameters:iterator – iterator of Git log lines
Raises:ParseError – raised when the format of the Git log is invalid
version = '0.10.2'
class perceval.backends.core.git.GitCommand(*args)[source]

Bases: perceval.backend.BackendCommand

Class to run Git backend from the command line.

BACKEND

alias of Git

static setup_cmd_parser()[source]

Returns the Git argument parser.

class perceval.backends.core.git.GitParser(stream)[source]

Bases: object

Git log parser.

This class parses a plain Git log stream, converting plain commits into dict items.

Not every Git log output is valid to be parsed. The Git log stream must have a specific structure. It must contain raw commits data and stats about modified files. The next excerpt shows an example of a valid log:

commit aaa7a9209f096aaaadccaaa7089aaaa3f758a703
Author:     John Smith <jsmith@example.com>
AuthorDate: Tue Aug 14 14:30:13 2012 -0300
Commit:     John Smith <jsmith@example.com>
CommitDate: Tue Aug 14 14:30:13 2012 -0300

    Commit for testing

:000000 100644 0000000… aaaaaaa… A  aaa/otherthing
:000000 100644 0000000… aaaaaaa… A  aaa/something
:000000 100644 0000000… aaaaaaa… A  bbb/bthing
0  0  aaa/otherthing
0  0  aaa/something
0  0  bbb/bthing

Each commit starts with the ‘commit’ tag that is followed by the SHA-1 of the commit, its parents (two or more parents in the case of a merge) and a list of refs, if any.

commit 456a68ee1407a77f3e804a30dff245bb6c6b872f ce8e0b86a1e9877f42fe9453ede418519115f367 51a3b654f252210572297f47597b31527c475fb8 (HEAD -> refs/heads/master)

The commit line is followed by one or more headers. Each header has a key and a value:

Author:     John Smith <jsmith@example.com>
AuthorDate: Tue Aug 14 14:30:13 2012 -0300
Commit:     John Smith <jsmith@example.com>
CommitDate: Tue Aug 14 14:30:13 2012 -0300

Then, an empty line divides the headers from the commit message.

First line of the commit

Commit message split into one or several lines. Each line of the message starts with 4 spaces.

Commit messages can contain a list of ‘trailers’. These trailers have the same format as headers but their meaning is project dependent. This is an example of a commit message with trailers:

Commit message with trailers

This is the body of the message where trailers are included. Trailers are part of the body so each line of the message starts with 4 spaces.

Signed-off-by: John Doe <jdoe@example.com>
Signed-off-by: Jane Rae <jrae@example.com>

After a new empty line, actions and stats over files can be found. An action line starts with one or more ‘:’ chars and contains data about the old and new permissions of a file, its old and new indexes, the action code and the filepath to the file. In the case of a copied, renamed or moved file, the new filepath to that file is included.

:100644 100644 e69de29… e69de29… R100 aaa/otherthing aaa/otherthing.renamed

Stats lines include the number of lines added and removed, and the name of the file. The new name is also included for moved or renamed files.

10 0 aaa/{otherthing => otherthing.renamed}

The commit ends with an empty line.

Take into account that one empty line is valid at the beginning of the log. This allows empty logs to be parsed without raising exceptions.

This example was generated using the following command:

git log --raw --numstat --pretty=fuller --decorate=full --parents -M -C -c --remotes=origin --all
Parameters:stream – a file object which stores the log
ACTION_PATTERN = '^(?P<sc>\\:+)\n (?P<modes>(?:\\d{6}[ \\t])+)\n (?P<indexes>(?:[a-f0-9]+\\.{,3}[ \\t])+)\n (?P<action>[^\\t]+)\\t+\n (?P<file>[^\\t]+)\n (?:\\t+(?P<newfile>.+))?$'
COMMIT = 1
COMMIT_PATTERN = '^commit[ \\t](?P<commit>[a-f0-9]{40})\n (?:[ \\t](?P<parents>[a-f0-9][a-f0-9 \\t]+))?\n (?:[ \\t]\\((?P<refs>.+)\\))?$\n '
EMPTY_LINE_PATTERN = '^$'
FILE = 4
GIT_ACTION_REGEXP = re.compile('^(?P<sc>\\:+)\n (?P<modes>(?:\\d{6}[ \\t])+)\n (?P<indexes>(?:[a-f0-9]+\\.{,3}[ \\t])+)\n (?P<action>[^\\t]+)\\t+\n (?P<file>[^\\t]+)\n (?:\\t+(?P<newfile>.+))?$', re.VERBOSE)
GIT_COMMIT_REGEXP = re.compile('^commit[ \\t](?P<commit>[a-f0-9]{40})\n (?:[ \\t](?P<parents>[a-f0-9][a-f0-9 \\t]+))?\n (?:[ \\t]\\((?P<refs>.+)\\))?$\n ', re.VERBOSE)
GIT_HEADER_TRAILER_REGEXP = re.compile('^(?P<name>[a-zA-z0-9\\-]+)\\:[ \\t]+(?P<value>.+)$', re.VERBOSE)
GIT_MESSAGE_REGEXP = re.compile('^[\\s]{4}(?P<msg>.*)$', re.VERBOSE)
GIT_NEXT_STATE_REGEXP = re.compile('^$', re.VERBOSE)
GIT_STATS_REGEXP = re.compile('^(?P<added>\\d+|-)[ \\t]+(?P<removed>\\d+|-)[ \\t]+(?P<file>.+)$', re.VERBOSE)
HEADER = 2
HEADER_TRAILER_PATTERN = '^(?P<name>[a-zA-z0-9\\-]+)\\:[ \\t]+(?P<value>.+)$'
INIT = 0
MESSAGE = 3
MESSAGE_LINE_PATTERN = '^[\\s]{4}(?P<msg>.*)$'
STATS_PATTERN = '^(?P<added>\\d+|-)[ \\t]+(?P<removed>\\d+|-)[ \\t]+(?P<file>.+)$'
TRAILERS = ['Signed-off-by']
parse()[source]

Parse the Git log stream.
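The VERBOSE-compiled patterns listed above can be tried on single log lines; a small sketch with two of them, where the sample lines are illustrative:

```python
import re

# Same expressions as GIT_STATS_REGEXP and GIT_HEADER_TRAILER_REGEXP above
GIT_STATS_REGEXP = re.compile(
    r'^(?P<added>\d+|-)[ \t]+(?P<removed>\d+|-)[ \t]+(?P<file>.+)$', re.VERBOSE)
GIT_HEADER_TRAILER_REGEXP = re.compile(
    r'^(?P<name>[a-zA-z0-9\-]+)\:[ \t]+(?P<value>.+)$', re.VERBOSE)

# A stats line: lines added, lines removed, filename
stats = GIT_STATS_REGEXP.match('10\t0\taaa/otherthing')

# A trailer line, as found at the end of a commit message
trailer = GIT_HEADER_TRAILER_REGEXP.match('Signed-off-by: John Doe <jdoe@example.com>')
```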

class perceval.backends.core.git.GitRef(hash, refname)

Bases: tuple

hash

Alias for field number 0

refname

Alias for field number 1

class perceval.backends.core.git.GitRepository(uri, dirpath)[source]

Bases: object

Manage a Git repository.

This class provides access to a Git repository by running some common commands such as clone, pull or log. To create an instance from a remote repository, use the clone() class method.

Parameters:
  • uri – URI of the repository
  • dirpath – local directory where the repository is stored
GIT_PRETTY_OUTPUT_OPTS = ['--raw', '--numstat', '--pretty=fuller', '--decorate=full', '--parents', '-M', '-C', '-c']
classmethod clone(uri, dirpath)[source]

Clone a Git repository.

Make a bare copy of the repository stored in uri into dirpath. The repository may be either local or remote.

Parameters:
  • uri – URI of the repository
  • dirpath – directory where the repository will be cloned
Returns:

a GitRepository class having cloned the repository

Raises:

RepositoryError – when an error occurs cloning the given repository

count_objects()[source]

Count the objects of a repository.

The method returns the total number of objects (packed and unpacked) available on the repository.

Raises:RepositoryError – when an error occurs counting the objects of a repository
is_detached()[source]

Check if the repo is in a detached state.

The repository is in a detached state when HEAD is not a symbolic reference.

Returns:whether the repository is detached or not
Raises:RepositoryError – when an error occurs checking the state of the repository
is_empty()[source]

Determines whether the repository is empty or not.

Returns True when the repository is empty. Under the hood, it checks the number of objects in the repository. When this number is 0, the repository is empty.

Raises:RepositoryError – when an error occurs accessing the repository
log(from_date=None, to_date=None, branches=None, encoding='utf-8')[source]

Read the commit log from the repository.

The method returns the Git log of the repository using the following options:

git log --raw --numstat --pretty=fuller --decorate=full
--all --reverse --topo-order --parents -M -C -c --remotes=origin

When from_date is given, it gets the commits equal to or newer than that date. This date is given in a datetime object.

The list of branches is a list of strings, with the names of the branches to fetch. If the list of branches is empty, no commit is fetched. If the list of branches is None, all commits for all branches will be fetched.

Parameters:
  • from_date – fetch commits newer than a specific date (inclusive)
  • branches – names of branches to fetch from (default: None)
  • encoding – encode the log using this format
Returns:

a generator where each item is a line from the log

Raises:RepositoryError – when an error occurs fetching the log from the repository
show(commits=None, encoding='utf-8')[source]

Show the data of a set of commits.

The method returns the output of Git show command for a set of commits using the following options:

git show --raw --numstat --pretty=fuller --decorate=full
--parents -M -C -c [<commit>…<commit>]

When the list of commits is empty, the command will return data about the last commit, like the default behaviour of git show.

Parameters:
  • commits – list of commits to show data
  • encoding – encode the output using this format
Returns:

a generator where each item is a line from the show output

Raises:RepositoryError – when an error occurs retrieving data from the repository
sync()[source]

Keep the repository in sync.

This method will synchronize the repository with its ‘origin’, fetching the newest objects and updating references. It uses low-level commands which allow keeping track of what has changed in the repository.

The method also returns a list of hashes related to the new commits fetched during the process.

Returns:list of new commits
Raises:RepositoryError – when an error occurs synchronizing the repository
update()[source]

Update repository from its remote.

Calling this method, the repository will be synchronized with the remote repository using the ‘fetch’ command for ‘heads’ refs. Any commit stored in the local copy will be removed; refs will be overwritten.

Raises:RepositoryError – when an error occurs updating the repository

perceval.backends.core.github module

class perceval.backends.core.github.GitHub(owner=None, repository=None, api_token=None, base_url=None, tag=None, archive=None, sleep_for_rate=False, min_rate_to_sleep=10, max_retries=5, sleep_time=1)[source]

Bases: perceval.backend.Backend

GitHub backend for Perceval.

This class allows fetching the issues stored in a GitHub repository.

Parameters:
  • owner – GitHub owner
  • repository – GitHub repository from the owner
  • api_token – GitHub auth token to access the API
  • base_url – GitHub URL in enterprise edition case; when no value is set, the backend will fetch the data from the GitHub public site.
  • tag – label used to mark the data
  • archive – archive to store/retrieve items
  • sleep_for_rate – sleep until rate limit is reset
  • min_rate_to_sleep – minimum rate needed to sleep until it is reset
  • max_retries – number of max retries to a data source before raising a RetryError exception
  • sleep_time – time to sleep in case of connection problems
CATEGORIES = ['issue', 'pull_request']
fetch(category='issue', from_date=datetime.datetime(1970, 1, 1, 0, 0, tzinfo=tzutc()))[source]

Fetch the issues/pull requests from the repository.

The method retrieves, from a GitHub repository, the issues/pull requests updated since the given date.

Parameters:
  • category – the category of items to fetch
  • from_date – obtain issues updated since this date
Returns:

a generator of issues

fetch_items(category, **kwargs)[source]

Fetch the items (issues or pull_requests)

Parameters:
  • category – the category of items to fetch
  • kwargs – backend arguments
Returns:

a generator of items

classmethod has_archiving()[source]

Returns whether it supports archiving items on the fetch process.

Returns:this backend supports items archive
classmethod has_resuming()[source]

Returns whether it supports resuming the fetch process.

Returns:this backend supports items resuming
static metadata_category(item)[source]

Extracts the category from a GitHub item.

This backend generates two types of item which are ‘issue’ and ‘pull_request’.

static metadata_id(item)[source]

Extracts the identifier from a GitHub item.

static metadata_updated_on(item)[source]

Extracts the update time from a GitHub item.

The timestamp used is extracted from ‘updated_at’ field. This date is converted to UNIX timestamp format. As GitHub dates are in UTC the conversion is straightforward.

Parameters:item – item generated by the backend
Returns:a UNIX timestamp
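The conversion described above can be sketched with an illustrative 'updated_at' value; GitHub returns ISO 8601 dates in UTC, so no offset has to be applied:

```python
from datetime import datetime, timezone

# Illustrative 'updated_at' value in the shape returned by the GitHub API
updated_at = '2012-08-14T17:30:13Z'

# GitHub dates are in UTC, so the conversion is straightforward
dt = datetime.strptime(updated_at, '%Y-%m-%dT%H:%M:%SZ').replace(tzinfo=timezone.utc)
timestamp = dt.timestamp()  # UNIX timestamp
```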
version = '0.17.4'
class perceval.backends.core.github.GitHubClient(owner, repository, token, base_url=None, sleep_for_rate=False, min_rate_to_sleep=10, sleep_time=1, max_retries=5, archive=None, from_archive=False)[source]

Bases: perceval.client.HttpClient, perceval.client.RateLimitHandler

Client for retrieving information from the GitHub API

Parameters:
  • owner – GitHub owner
  • repository – GitHub repository from the owner
  • token – GitHub auth token to access the API
  • base_url – GitHub URL in enterprise edition case; when no value is set, the backend will fetch the data from the GitHub public site.
  • sleep_for_rate – sleep until rate limit is reset
  • min_rate_to_sleep – minimum rate needed to sleep until it is reset
  • sleep_time – time to sleep in case of connection problems
  • max_retries – number of max retries to a data source before raising a RetryError exception
  • archive – collect issues already retrieved from an archive
  • from_archive – it tells whether to write/read the archive
calculate_time_to_reset()[source]

Calculate the seconds to reset the token requests by obtaining the difference between the current date and the next date when the token is fully regenerated.
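Not the method's actual code, but a sketch of the computation it describes; the reset value is assumed to be a UNIX timestamp, such as the one carried in GitHub's X-RateLimit-Reset response header:

```python
import time

def seconds_to_reset(rate_limit_reset, now=None):
    """Seconds left until the token requests are fully regenerated.

    `rate_limit_reset` is a UNIX timestamp marking when the rate limit
    resets. Never returns a negative value.
    """
    now = time.time() if now is None else now
    return max(rate_limit_reset - now, 0)
```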

fetch(url, payload=None, headers=None, method='GET', stream=False, verify=True)[source]

Fetch the data from a given URL.

Parameters:
  • url – link to the resource
  • payload – payload of the request
  • headers – headers of the request
  • method – type of request call (GET or POST)
  • stream – defer downloading the response body until the response content is available

Returns:a response object

fetch_items(path, payload)[source]

Return the items from the GitHub API using links pagination

issue_comment_reactions(comment_id)[source]

Get reactions of an issue comment

issue_comments(issue_number)[source]

Get the issue comments from pagination

issue_reactions(issue_number)[source]

Get reactions of an issue

issues(from_date=None)[source]

Get the issues from pagination. Note that issues contain also pull requests.

pull_commits(pr_number)[source]

Get pull request commits

pull_requested_reviewers(pr_number)[source]

Get pull requested reviewers

pull_review_comment_reactions(comment_id)[source]

Get reactions of a review comment

pull_review_comments(pr_number)[source]

Get pull request review comments

pulls(from_date=None)[source]

Get only pull requests

user(login)[source]

Get the user information and update the user cache

user_orgs(login)[source]

Get the user public organizations

class perceval.backends.core.github.GitHubCommand(*args)[source]

Bases: perceval.backend.BackendCommand

Class to run GitHub backend from the command line.

BACKEND

alias of GitHub

static setup_cmd_parser()[source]

Returns the GitHub argument parser.

perceval.backends.core.gitlab module

class perceval.backends.core.gitlab.GitLab(owner=None, repository=None, api_token=None, base_url=None, tag=None, archive=None, sleep_for_rate=False, min_rate_to_sleep=10, max_retries=5, sleep_time=1, blacklist_ids=None)[source]

Bases: perceval.backend.Backend

GitLab backend for Perceval.

This class allows fetching the issues stored in a GitLab repository.

Parameters:
  • owner – GitLab owner
  • repository – GitLab repository from the owner
  • api_token – GitLab auth token to access the API
  • base_url – GitLab URL in enterprise edition case; when no value is set, the backend will fetch the data from the GitLab public site.
  • tag – label used to mark the data
  • archive – archive to store/retrieve items
  • sleep_for_rate – sleep until rate limit is reset
  • min_rate_to_sleep – minimum rate needed to sleep until it is reset
  • max_retries – number of max retries to a data source before raising a RetryError exception
  • sleep_time – time to sleep in case of connection problems
  • blacklist_ids – ids of items that must not be retrieved
CATEGORIES = ['issue', 'merge_request']
fetch(category='issue', from_date=datetime.datetime(1970, 1, 1, 0, 0, tzinfo=tzutc()))[source]

Fetch the issues/merge requests from the repository.

The method retrieves, from a GitLab repository, the issues/merge requests updated since the given date.

Parameters:
  • category – the category of items to fetch
  • from_date – obtain issues updated since this date
Returns:

a generator of issues

fetch_items(category, **kwargs)[source]

Fetch the items (issues or merge_requests)

Parameters:
  • category – the category of items to fetch
  • kwargs – backend arguments
Returns:

a generator of items

classmethod has_archiving()[source]

Returns whether it supports archiving items on the fetch process.

Returns:this backend supports items archive
classmethod has_resuming()[source]

Returns whether it supports resuming the fetch process.

Returns:this backend does not support items resuming
static metadata_category(item)[source]

Extracts the category from a GitLab item.

This backend generates two types of item which are ‘issue’ and ‘merge_request’.

static metadata_id(item)[source]

Extracts the identifier from a GitLab item.

static metadata_updated_on(item)[source]

Extracts the update time from a GitLab item.

The timestamp used is extracted from ‘updated_at’ field. This date is converted to UNIX timestamp format. As GitLab dates are in UTC the conversion is straightforward.

Parameters:item – item generated by the backend
Returns:a UNIX timestamp
version = '0.6.2'
class perceval.backends.core.gitlab.GitLabClient(owner, repository, token, base_url=None, sleep_for_rate=False, min_rate_to_sleep=10, sleep_time=1, max_retries=5, archive=None, from_archive=False)[source]

Bases: perceval.client.HttpClient, perceval.client.RateLimitHandler

Client for retrieving information from the GitLab API

Parameters:
  • owner – GitLab owner
  • repository – GitLab owner’s repository
  • token – GitLab auth token to access the API
  • base_url – GitLab URL in enterprise edition case; when no value is set, the backend will fetch the data from the GitLab public site.
  • sleep_for_rate – sleep until rate limit is reset
  • min_rate_to_sleep – minimum rate needed to sleep until it is reset
  • sleep_time – time to sleep in case of connection problems
  • max_retries – number of max retries to a data source before raising a RetryError exception
  • archive – an archive to store/read fetched data
  • from_archive – it tells whether to write/read the archive
EMOJI = 'award_emoji'
ISSUES = 'issues'
MERGES = 'merge_requests'
NOTES = 'notes'
PROJECTS = 'projects'
RATE_LIMIT_HEADER = 'RateLimit-Remaining'
RATE_LIMIT_RESET_HEADER = 'RateLimit-Reset'
VERSIONS = 'versions'
calculate_time_to_reset()[source]

Calculate the seconds to reset the token requests by obtaining the difference between the current date and the next date when the token is fully regenerated.

emojis(item_type, item_id)[source]

Get emojis from pagination

fetch(url, payload=None, headers=None, method='GET', stream=False)[source]

Fetch the data from a given URL.

Parameters:
  • url – link to the resource
  • payload – payload of the request
  • headers – headers of the request
  • method – type of request call (GET or POST)
  • stream – defer downloading the response body until the response content is available

Returns:a response object

fetch_items(path, payload)[source]

Return the items from the GitLab API using links pagination
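A hypothetical sketch of links pagination: keep requesting the rel="next" URL taken from each response's Link header until none is present. The helper and the header string used in the example are illustrative, not part of the client:

```python
def next_page_url(link_header):
    """Return the rel="next" URL from an HTTP Link header, or None.

    The Link header holds comma-separated entries of the form
    <url>; rel="next" / rel="last" / etc.
    """
    for part in link_header.split(','):
        url, _, params = part.partition(';')
        if 'rel="next"' in params:
            return url.strip().strip('<>')
    return None
```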

issues(from_date=None)[source]

Get the issues from pagination

merge(merge_id)[source]

Get the merge full data

merge_version(merge_id, version_id)[source]

Get merge version detail

merge_versions(merge_id)[source]

Get the merge versions from pagination

merges(from_date=None)[source]

Get the merge requests from pagination

note_emojis(item_type, item_id, note_id)[source]

Get emojis of a note

notes(item_type, item_id)[source]

Get the notes from pagination

static sanitize_for_archive(url, headers, payload)[source]

Sanitize the payload of an HTTP request by removing the token information before storing/retrieving archived items.

Parameters:
  • url – HTTP url request
  • headers – HTTP headers request
  • payload – HTTP payload request
Returns:url, headers and the sanitized payload
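A hypothetical sketch of the sanitizing step described above; the 'private_token' field name is an assumption for illustration, not taken from the client's code:

```python
def sanitize_for_archive(url, headers, payload):
    """Drop the token from the request payload before archiving it.

    The 'private_token' key is an assumed field name used for illustration.
    """
    if payload and 'private_token' in payload:
        payload = {k: v for k, v in payload.items() if k != 'private_token'}
    return url, headers, payload
```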

class perceval.backends.core.gitlab.GitLabCommand(*args)[source]

Bases: perceval.backend.BackendCommand

Class to run GitLab backend from the command line.

BACKEND

alias of GitLab

static setup_cmd_parser()[source]

Returns the GitLab argument parser.

perceval.backends.core.googlehits module

class perceval.backends.core.googlehits.GoogleHits(keywords, tag=None, archive=None, max_retries=5, sleep_time=1)[source]

Bases: perceval.backend.Backend

GoogleHits backend for Perceval.

This class retrieves the number of hits for a given list of keywords via the Google API. To initialize this class a list of keywords is needed.

Parameters:
  • keywords – a list of keywords
  • tag – label used to mark the data
  • archive – archive to store/retrieve items
  • max_retries – number of max retries to a data source before raising a RetryError exception
  • sleep_time – time to sleep in case of connection problems
CATEGORIES = ['google_hits']
fetch(category='google_hits')[source]

Fetch data from Google API.

The method retrieves a list of hits for some given keywords using the Google API.

Parameters:category – the category of items to fetch
Returns:a generator of data
fetch_items(category, **kwargs)[source]

Fetch Google hit items

Parameters:
  • category – the category of items to fetch
  • kwargs – backend arguments
Returns:

a generator of items

classmethod has_archiving()[source]

Returns whether it supports archiving items on the fetch process.

Returns:this backend supports items archive
classmethod has_resuming()[source]

Returns whether it supports resuming the fetch process.

Returns:this backend supports items resuming
static metadata_category(item)[source]

Extracts the category from a GoogleHits item.

This backend only generates one type of item which is ‘google_hits’.

static metadata_id(item)[source]

Extracts the identifier from a GoogleHit item.

static metadata_updated_on(item)[source]

Extracts the update time from a GoogleHit item.

The timestamp is based on the current time when the hit was extracted. This field is not part of the data provided by Google API. It is added by this backend.

Parameters:item – item generated by the backend
Returns:a UNIX timestamp
version = '0.2.1'
class perceval.backends.core.googlehits.GoogleHitsClient(sleep_time=1, max_retries=5, archive=None, from_archive=False)[source]

Bases: perceval.client.HttpClient

GoogleHits API client.

Client for fetching hits data from Google API.

Parameters:
  • sleep_time – time to sleep in case of connection problems
  • max_retries – number of max retries to a data source before raising a RetryError exception
  • archive – an archive to store/read fetched data
  • from_archive – it tells whether to write/read the archive
hits(keywords)[source]

Fetch information about a list of keywords.

class perceval.backends.core.googlehits.GoogleHitsCommand(*args)[source]

Bases: perceval.backend.BackendCommand

Class to run GoogleHits backend from the command line.

BACKEND

alias of GoogleHits

static setup_cmd_parser()[source]

Returns the GoogleHits argument parser.

perceval.backends.core.groupsio module

class perceval.backends.core.groupsio.Groupsio(group_name, dirpath, api_token, verify=True, tag=None, archive=None)[source]

Bases: perceval.backends.core.mbox.MBox

Groups.io backend.

This class allows fetching the messages of a Groups.io group. Initialize this class passing the name of the group and the directory path where the mbox files will be fetched and stored. The origin of the data will be set to the url of the group on Groups.io.

Parameters:
  • group_name – Name of the group
  • dirpath – directory path where the mboxes are stored
  • api_token – Groups.io auth token to access the API
  • verify – allows to disable SSL verification
  • tag – label used to mark the data
  • archive – archive to store/retrieve items
CATEGORIES = ['message']
fetch(category='message', from_date=datetime.datetime(1970, 1, 1, 0, 0, tzinfo=tzutc()))[source]

Fetch the messages from a Groups.io group.

The method fetches the mbox files from a remote Groups.io group and retrieves the messages stored on them.

Parameters:
  • category – the category of items to fetch
  • from_date – obtain messages since this date
Returns:

a generator of messages

fetch_items(category, **kwargs)[source]

Fetch the messages

Parameters:
  • category – the category of items to fetch
  • kwargs – backend arguments
Returns:

a generator of items

classmethod has_archiving()[source]

Returns whether it supports archiving items on the fetch process.

Returns:this backend does not support items archive
classmethod has_resuming()[source]

Returns whether it supports resuming the fetch process.

Returns:this backend supports items resuming
version = '0.1.2'
class perceval.backends.core.groupsio.GroupsioClient(group_name, dirpath, api_token, verify=True)[source]

Bases: perceval.backends.core.mbox.MailingList

Manage mailing list archives stored by Groups.io.

This class gives access to remote and local mbox archives from a mailing list stored by Groups.io. This class also allows keeping them in sync.

Parameters:
  • group_name – Name of the group
  • dirpath – directory path where the mboxes are stored
  • api_token – Groups.io auth token to access the API
  • verify – allows to disable SSL verification
DOWNLOAD_ARCHIVES = 'downloadarchives'
GET_SUBSCRIPTIONS = 'getsubs'
fetch()[source]

Fetch the mbox files from the remote archiver.

Stores the archives in the path given during the initialization of this object. Those archives with an invalid extension will be ignored.

Groups.io archives are returned as a .zip file, which contains one file in mbox format.

Returns:a list of tuples, storing the links and paths of the fetched archives
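The .zip unpacking described above can be sketched with the standard library; the helper and the archive contents used in the example are illustrative:

```python
import io
import zipfile

def extract_mbox(zip_bytes):
    """Return (name, contents) of the single mbox file stored in a zip archive."""
    with zipfile.ZipFile(io.BytesIO(zip_bytes)) as zf:
        name = zf.namelist()[0]
        return name, zf.read(name)
```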
subscriptions(per_page=100)[source]

Fetch the Groups.io paginated subscriptions for a given token

Parameters:per_page – number of subscriptions per page
Returns:an iterator of subscriptions
class perceval.backends.core.groupsio.GroupsioCommand(*args)[source]

Bases: perceval.backend.BackendCommand

Class to run Groupsio backend from the command line.

BACKEND

alias of Groupsio

static setup_cmd_parser()[source]

Returns the Groupsio argument parser.

perceval.backends.core.hyperkitty module

class perceval.backends.core.hyperkitty.HyperKitty(url, dirpath, tag=None, archive=None)[source]

Bases: perceval.backends.core.mbox.MBox

HyperKitty backend.

This class allows fetching the email messages stored on a HyperKitty archiver. Initialize this class passing the URL where the mailing list archiver is and the directory path where the mbox files will be fetched and stored. The origin of the data will be set to the value of url.

Parameters:
  • url – URL to the HyperKitty mailing list archiver
  • dirpath – directory path where the mboxes are stored
  • tag – label used to mark the data
  • archive – archive to store/retrieve items
CATEGORIES = ['message']
fetch(category='message', from_date=datetime.datetime(1970, 1, 1, 0, 0, tzinfo=tzutc()))[source]

Fetch the messages from the HyperKitty mailing list archiver.

The method fetches the mbox files from a remote HyperKitty mailing list archiver and retrieves the messages stored on them.

Take into account that HyperKitty does not yet provide any kind of info to determine which is the first message on the mailing list. For this reason, setting from_date to a date earlier than the one when the first message was sent will cause empty mbox files to be downloaded.

Parameters:
  • category – the category of items to fetch
  • from_date – obtain messages since this date
Returns:

a generator of messages

fetch_items(category, **kwargs)[source]

Fetch the messages

Parameters:
  • category – the category of items to fetch
  • kwargs – backend arguments
Returns:

a generator of items

classmethod has_archiving()[source]

Returns whether it supports archiving items on the fetch process.

Returns:this backend does not support items archive
classmethod has_resuming()[source]

Returns whether it supports to resume the fetch process.

Returns:this backend supports items resuming
version = '0.4.2'
class perceval.backends.core.hyperkitty.HyperKittyCommand(*args)[source]

Bases: perceval.backend.BackendCommand

Class to run HyperKitty backend from the command line.

BACKEND

alias of HyperKitty

static setup_cmd_parser()[source]

Returns the HyperKitty argument parser.

class perceval.backends.core.hyperkitty.HyperKittyList(url, dirpath)[source]

Bases: perceval.backends.core.mbox.MailingList

Manage mailing list archives stored by HyperKitty archiver.

This class gives access to remote and local mboxes archives from a mailing list stored by HyperKitty. This class also allows keeping them in sync.

Notice that this class only works with HyperKitty version 1.0.4 or greater. Previous versions do not export messages in MBox format.

Parameters:
  • url – URL to the HyperKitty archiver for this list
  • dirpath – path to the local mboxes archives
fetch(from_date=datetime.datetime(1970, 1, 1, 0, 0, tzinfo=tzutc()))[source]

Fetch the mbox files from the remote archiver.

This method stores the archives in the path given during the initialization of this object.

HyperKitty archives are accessed month by month and stored following the schema year-month. Archives are fetched from the given month till the current month.

Parameters:from_date – fetch archives that store messages equal or after the given date; only year and month values are compared
Returns:a list of tuples, storing the links and paths of the fetched archives
mboxes

Get the mboxes managed by this mailing list.

Returns the archives sorted by date in ascending order.

Returns:a list of .MBoxArchive objects
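HyperKitty archives are accessed month by month following a year-month schema, from the given month up to the current month. That walk can be sketched as a helper yielding the per-month labels; the exact label format is an assumption for illustration:

```python
import datetime

def month_labels(from_date, until):
    """Yield 'year-month' labels from from_date's month up to (and
    including) until's month, the way HyperKitty archives are grouped."""
    year, month = from_date.year, from_date.month
    while (year, month) <= (until.year, until.month):
        yield "%d-%02d" % (year, month)
        month += 1
        if month > 12:
            month = 1
            year += 1

labels = list(month_labels(datetime.date(2017, 11, 5),
                           datetime.date(2018, 2, 1)))
print(labels)  # ['2017-11', '2017-12', '2018-01', '2018-02']
```

Only year and month values matter, which mirrors how fetch compares from_date against the stored archives.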

perceval.backends.core.jenkins module

class perceval.backends.core.jenkins.Jenkins(url, tag=None, archive=None, blacklist_jobs=None, detail_depth=1, sleep_time=10)[source]

Bases: perceval.backend.Backend

Jenkins backend for Perceval.

This class retrieves the builds from a Jenkins site. To initialize this class the URL must be provided. The url will be set as the origin of the data.

Parameters:
  • url – Jenkins url
  • tag – label used to mark the data
  • archive – archive to store/retrieve items
  • blacklist_jobs – exclude the jobs of this list while fetching
  • detail_depth – control the detail level of the data returned by the API
  • sleep_time – minimum waiting time after a connection timeout exception
CATEGORIES = ['build']
fetch(category='build')[source]

Fetch the builds from the url.

The method retrieves the builds of the jobs hosted on a Jenkins url.

Parameters:category – the category of items to fetch
Returns:a generator of builds
fetch_items(category, **kwargs)[source]

Fetch the contents

Parameters:
  • category – the category of items to fetch
  • kwargs – backend arguments
Returns:

a generator of items

classmethod has_archiving()[source]

Returns whether it supports archiving items on the fetch process.

Returns:this backend supports items archiving
classmethod has_resuming()[source]

Returns whether it supports to resume the fetch process.

Returns:this backend does not support items resuming
static metadata_category(item)[source]

Extracts the category from a Jenkins item.

This backend only generates one type of item which is ‘build’.

static metadata_id(item)[source]

Extracts the identifier from a Build item.

static metadata_updated_on(item)[source]

Extracts the update time from a Jenkins item.

The timestamp is extracted from ‘timestamp’ field. This date is a UNIX timestamp but needs to be converted to a float value.

Parameters:item – item generated by the backend
Returns:a UNIX timestamp
version = '0.11.1'
class perceval.backends.core.jenkins.JenkinsClient(url, blacklist_jobs=None, detail_depth=1, sleep_time=10, archive=None, from_archive=False)[source]

Bases: perceval.client.HttpClient

Jenkins API client.

This class implements a simple client to retrieve jobs/builds from projects in a Jenkins node. The amount of data returned for each request depends on the detail_depth value selected (minimum and default is 1). Note that increasing the detail_depth may considerably slow down the fetch operation and cause broken-connection errors.

Parameters:
  • url – URL of the Jenkins node (e.g., https://build.opnfv.org/ci)
  • blacklist_jobs – exclude the jobs of this list while fetching
  • detail_depth – set the detail level of the data returned by the API
  • sleep_time – minimum waiting time after a connection timeout exception
  • archive – an archive to store/read fetched data
  • from_archive – it tells whether to write/read the archive
Raises:

HTTPError – when an error occurs doing the request

MAX_RETRIES = 5
get_builds(job_name)[source]

Retrieve all builds from a job

get_jobs()[source]

Retrieve all jobs

class perceval.backends.core.jenkins.JenkinsCommand(*args)[source]

Bases: perceval.backend.BackendCommand

Class to run Jenkins backend from the command line.

BACKEND

alias of Jenkins

static setup_cmd_parser()[source]

Returns the Jenkins argument parser.
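The blacklist_jobs parameter excludes the listed jobs while fetching. A minimal sketch of that filtering step, with made-up job names and a simplified job shape:

```python
def filter_jobs(jobs, blacklist_jobs=None):
    """Drop every job whose name appears in the blacklist."""
    blacklist = set(blacklist_jobs or [])
    return [job for job in jobs if job["name"] not in blacklist]

jobs = [{"name": "apex-verify"}, {"name": "doctor-tests"}, {"name": "functest"}]
kept = filter_jobs(jobs, blacklist_jobs=["doctor-tests"])
print([j["name"] for j in kept])  # ['apex-verify', 'functest']
```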

perceval.backends.core.jira module

class perceval.backends.core.jira.Jira(url, project=None, user=None, password=None, verify=True, cert=None, max_issues=100, tag=None, archive=None)[source]

Bases: perceval.backend.Backend

JIRA backend for Perceval.

This class retrieves the issues stored in JIRA issue tracking system. To initialize this class the URL must be provided. The url will be set as the origin of the data.

Parameters:
  • url – JIRA’s endpoint
  • project – filter issues by project
  • user – Jira user
  • password – Jira user password
  • verify – allows to disable SSL verification
  • cert – SSL certificate path (PEM)
  • max_issues – max number of issues per query
  • tag – label used to mark the data
  • archive – archive to store/retrieve items
CATEGORIES = ['issue']
fetch(category='issue', from_date=datetime.datetime(1970, 1, 1, 0, 0, tzinfo=tzutc()))[source]

Fetch the issues from the site.

The method retrieves, from a JIRA site, the issues updated since the given date.

Parameters:
  • category – the category of items to fetch
  • from_date – retrieve issues updated from this date
Returns:

a generator of issues

fetch_items(category, **kwargs)[source]

Fetch the issues

Parameters:
  • category – the category of items to fetch
  • kwargs – backend arguments
Returns:

a generator of items

classmethod has_archiving()[source]

Returns whether it supports archiving items on the fetch process.

Returns:this backend supports items archive
classmethod has_resuming()[source]

Returns whether it supports to resume the fetch process.

Returns:this backend supports items resuming
static metadata_category(item)[source]

Extracts the category from a Jira item.

This backend only generates one type of item which is ‘issue’.

static metadata_id(item)[source]

Extracts the identifier from a Jira item.

static metadata_updated_on(item)[source]

Extracts the update time from a Jira item.

The timestamp used is extracted from ‘updated’ field. This date is converted to UNIX timestamp format taking into account the timezone of the date.

Parameters:item – item generated by the backend
Returns:a UNIX timestamp
static parse_issues(raw_page)[source]

Parse a JIRA API raw response.

The method parses the API response, retrieving the issues from the received page.

Parameters:raw_page – raw page from where to parse the issues
Returns:a generator of issues
version = '0.11.4'
class perceval.backends.core.jira.JiraClient(url, project, user, password, verify, cert, max_issues=100, archive=None, from_archive=False)[source]

Bases: perceval.client.HttpClient

JIRA API client.

This class implements a simple client to retrieve issues from any JIRA issue tracking system.

Parameters:
  • url – URL of the JIRA server
  • project – filter issues by project
  • user – JIRA’s username
  • password – JIRA’s password
  • verify – allows to disable SSL verification
  • cert – SSL certificate
  • max_issues – max number of issues per query
  • archive – an archive to store/read fetched data
  • from_archive – it tells whether to write/read the archive
Raises:

HTTPError – when an error occurs doing the request

EXPAND = 'renderedFields,transitions,operations,changelog'
RESOURCE = 'rest/api'
VERSION_API = '2'
get_fields()[source]

Retrieve all the fields available.

get_issues(from_date)[source]

Retrieve all the issues from a given date.

Parameters:from_date – obtain issues updated since this date
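Because max_issues caps the number of issues per query, retrieval is paginated. The loop below sketches that idea against a stand-in page function; the page shape ('issues', 'startAt', 'total') is an assumption loosely modeled on JIRA search responses, not the client's actual implementation:

```python
def fetch_pages(get_page, max_issues=100):
    """Iterate over paginated results until the server reports no more.

    `get_page(start_at, max_results)` stands in for one API query;
    it returns a dict with 'issues', 'startAt' and 'total' keys.
    """
    start_at = 0
    while True:
        page = get_page(start_at, max_issues)
        yield from page["issues"]
        start_at += len(page["issues"])
        if start_at >= page["total"]:
            break

# Simulate a server holding 5 issues, served 2 per query.
data = ["ISSUE-%d" % i for i in range(1, 6)]

def fake_page(start_at, max_results):
    return {"issues": data[start_at:start_at + max_results],
            "startAt": start_at, "total": len(data)}

issues = list(fetch_pages(fake_page, max_issues=2))
print(issues)
```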
class perceval.backends.core.jira.JiraCommand(*args)[source]

Bases: perceval.backend.BackendCommand

Class to run Jira backend from the command line.

BACKEND

alias of Jira

static setup_cmd_parser()[source]

Returns the Jira argument parser.

perceval.backends.core.jira.filter_custom_fields(fields)[source]

Filter custom fields from a given set of fields.

Parameters:fields – set of fields
Returns:an object with the filtered custom fields
perceval.backends.core.jira.map_custom_field(custom_fields, fields)[source]

Add extra information for custom fields.

Parameters:
  • custom_fields – set of custom fields with the extra information
  • fields – fields of the issue where to add the extra information
Returns:

a set of items with the extra information mapped
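The pair of helpers above can be sketched as follows; the 'customfield_' id prefix and the field shapes are assumptions for illustration, not the exact implementation:

```python
def filter_custom_fields(fields):
    """Keep only the custom fields, indexed by their id."""
    return {f["id"]: f for f in fields if f["id"].startswith("customfield_")}

def map_custom_field(custom_fields, issue_fields):
    """Attach the custom field name next to each raw custom value."""
    mapped = {}
    for field_id, value in issue_fields.items():
        if field_id in custom_fields:
            mapped[field_id] = {"id": field_id,
                                "name": custom_fields[field_id]["name"],
                                "value": value}
    return mapped

fields = [{"id": "summary", "name": "Summary"},
          {"id": "customfield_10400", "name": "Story Points"}]
custom = filter_custom_fields(fields)
extra = map_custom_field(custom, {"summary": "A bug", "customfield_10400": 3})
print(extra)
```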

perceval.backends.core.launchpad module

class perceval.backends.core.launchpad.Launchpad(distribution, package=None, items_per_page=75, sleep_time=300, tag=None, archive=None)[source]

Bases: perceval.backend.Backend

Launchpad backend for Perceval.

This class allows fetching the issues stored in Launchpad.

Parameters:
  • distribution – Launchpad distribution
  • package – Distribution package
  • items_per_page – number of items in a retrieved page
  • sleep_time – time to sleep in case of connection problems
  • tag – label used to mark the data
  • archive – archive to store/retrieve items
CATEGORIES = ['issue']
fetch(category='issue', from_date=datetime.datetime(1970, 1, 1, 0, 0, tzinfo=tzutc()))[source]

Fetch the issues from a project (distribution/package).

The method retrieves, from a Launchpad project, the issues updated since the given date.

Parameters:
  • category – the category of items to fetch
  • from_date – obtain issues updated since this date
Returns:

a generator of issues

fetch_items(category, **kwargs)[source]

Fetch the issues

Parameters:
  • category – the category of items to fetch
  • kwargs – backend arguments
Returns:

a generator of items

classmethod has_archiving()[source]

Returns whether it supports archiving items on the fetch process.

Returns:this backend supports items archive
classmethod has_resuming()[source]

Returns whether it supports to resume the fetch process.

Returns:this backend supports items resuming
static metadata_category(item)[source]

Extracts the category from a Launchpad item.

This backend only generates one type of item which is ‘issue’.

static metadata_id(item)[source]

Extracts the identifier from a Launchpad item.

static metadata_updated_on(item)[source]

Extracts the update time from a Launchpad item.

The timestamp used is extracted from ‘date_last_updated’ field. This date is converted to UNIX timestamp format. As Launchpad dates are in UTC in ISO 8601 (e.g., ‘2008-03-26T01:43:15.603905+00:00’) the conversion is straightforward.

Parameters:item – item generated by the backend
Returns:a UNIX timestamp
version = '0.6.2'
class perceval.backends.core.launchpad.LaunchpadClient(distribution, package=None, items_per_page=75, sleep_time=300, archive=None, from_archive=False)[source]

Bases: perceval.client.HttpClient

Client for retrieving information from Launchpad API

Parameters:
  • distribution – Launchpad distribution
  • package – Distribution package
  • items_per_page – number of items in a retrieved page
  • sleep_time – time to sleep in case of connection problems
  • archive – an archive to store/read fetched data
  • from_archive – it tells whether to write/read the archive
issue(issue_id)[source]

Get the issue data by its ID

issue_collection(issue_id, collection_name)[source]

Get a collection list of a given issue

issues(start=None)[source]

Get the issues using pagination

user(user_name)[source]

Get the user data by URL

user_name(user_link)[source]

Get user name from link

class perceval.backends.core.launchpad.LaunchpadCommand(*args)[source]

Bases: perceval.backend.BackendCommand

Class to run Launchpad backend from the command line.

BACKEND

alias of Launchpad

static setup_cmd_parser()[source]

Returns the Launchpad argument parser.
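The metadata_updated_on conversion described above (a UTC ISO 8601 date such as '2008-03-26T01:43:15.603905+00:00' to a UNIX timestamp) is straightforward with the standard library:

```python
import datetime

def to_unix_timestamp(iso_date):
    """Convert a Launchpad 'date_last_updated' value to a UNIX timestamp."""
    return datetime.datetime.fromisoformat(iso_date).timestamp()

ts = to_unix_timestamp("2008-03-26T01:43:15.603905+00:00")
print(ts)  # 1206495795.603905
```

Because the input carries an explicit +00:00 offset, the resulting datetime is timezone-aware and .timestamp() needs no extra handling.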

perceval.backends.core.mattermost module

class perceval.backends.core.mattermost.Mattermost(url, channel, api_token, max_items=60, tag=None, archive=None, sleep_for_rate=False, min_rate_to_sleep=10, sleep_time=1)[source]

Bases: perceval.backend.Backend

Mattermost backend.

This class retrieves the posts sent to a Mattermost channel. To access the server an API token is required, which must have enough permissions to read from the given channel.

To initialize this class the URL of the server must be provided. The origin of the data will be set using this url plus the channel from which data is obtained (i.e: https://mattermost.example.com/abcdefg).

Parameters:
  • url – URL of the server
  • channel – identifier of the channel where data will be fetched
  • api_token – token or key needed to use the API
  • max_items – maximum number of messages requested on the same query
  • tag – label used to mark the data
  • archive – archive to store/retrieve items
  • sleep_for_rate – sleep until rate limit is reset
  • min_rate_to_sleep – minimum rate needed to sleep until it will be reset
  • sleep_time – minimum waiting time to avoid a too many requests exception
CATEGORIES = ['post']
fetch(category='post', from_date=datetime.datetime(1970, 1, 1, 0, 0, tzinfo=tzutc()))[source]

Fetch the posts from the channel.

This method fetches the posts stored on the channel that were sent since the given date.

Parameters:
  • category – the category of items to fetch
  • from_date – obtain posts sent since this date
Returns:

a generator of posts

fetch_items(category, **kwargs)[source]

Fetch the messages.

Parameters:
  • category – the category of items to fetch
  • kwargs – backend arguments
Returns:

a generator of items

classmethod has_archiving()[source]

Returns whether it supports archiving items on the fetch process.

Returns:this backend supports items archive
classmethod has_resuming()[source]

Returns whether it supports to resume the fetch process.

Returns:this backend does not support items resuming
static metadata_category(item)[source]

Extracts the category from a Mattermost item.

This backend only generates one type of item which is ‘post’.

static metadata_id(item)[source]

Extracts the identifier from a Mattermost item.

static metadata_updated_on(item)[source]

Extracts and converts the update time from a Mattermost item.

The timestamp is extracted from ‘update_at’ field. This field is already a UNIX timestamp but it needs to be converted to float.

Parameters:item – item generated by the backend
Returns:a UNIX timestamp
static parse_json(raw_json)[source]

Parse a Mattermost JSON stream.

The method parses a JSON stream and returns a dict with the parsed data.

Parameters:raw_json – JSON string to parse
Returns:a dict with the parsed data
version = '0.1.0'
class perceval.backends.core.mattermost.MattermostClient(base_url, api_token, max_items=60, sleep_for_rate=False, min_rate_to_sleep=10, sleep_time=1, archive=None, from_archive=False)[source]

Bases: perceval.client.HttpClient, perceval.client.RateLimitHandler

Mattermost API client.

Client for fetching information from a Mattermost server using its REST API.

Parameters:
  • base_url – URL of the Mattermost server
  • api_token – token or key needed to use the API
  • max_items – maximum number of items fetched per request
  • sleep_for_rate – sleep until rate limit is reset
  • min_rate_to_sleep – minimum rate needed to sleep until it will be reset
  • sleep_time – time to sleep in case of connection problems
  • archive – an archive to store/read fetched data
  • from_archive – it tells whether to write/read the archive
API_URL = '%(base_url)s/api/v4/%(entrypoint)s'
PPAGE = 'page'
PPER_PAGE = 'per_page'
RCHANNELS = 'channels'
RPOSTS = 'posts'
RUSERS = 'users'
calculate_time_to_reset()[source]

Number of seconds to wait.

The time is obtained from the difference between the current date and the next date when the token is fully regenerated.

fetch(url, payload=None, headers=None, method='GET', stream=False, verify=True)[source]

Override fetch method to handle API rate limit.

Parameters:
  • url – link to the resource
  • payload – payload of the request
  • headers – headers of the request
  • method – type of request call (GET or POST)
  • stream – defer downloading the response body until the response content is available

Returns:a response object

posts(channel, page=None)[source]

Fetch the history of a channel.

user(user)[source]

Fetch user data.
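Given the API_URL template and the page parameters above, a request URL for a channel's posts can be composed as below. The 'channels/<id>/posts' entrypoint and the channel id are illustrative assumptions:

```python
# Constants taken from the MattermostClient documentation above.
API_URL = '%(base_url)s/api/v4/%(entrypoint)s'
PPAGE = 'page'
PPER_PAGE = 'per_page'

def posts_url(base_url, channel):
    """Build the endpoint URL for a channel's posts."""
    entrypoint = 'channels/%s/posts' % channel
    return API_URL % {'base_url': base_url, 'entrypoint': entrypoint}

url = posts_url('https://mattermost.example.com', 'abcdefg')
params = {PPAGE: 0, PPER_PAGE: 60}
print(url, params)
```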

class perceval.backends.core.mattermost.MattermostCommand(*args)[source]

Bases: perceval.backend.BackendCommand

Class to run Mattermost backend from the command line.

BACKEND

alias of Mattermost

static setup_cmd_parser()[source]

Returns the Mattermost argument parser.

perceval.backends.core.mbox module

class perceval.backends.core.mbox.MBox(uri, dirpath, tag=None, archive=None)[source]

Bases: perceval.backend.Backend

MBox backend.

This class allows fetching the email messages stored in one or several mbox files. Initialize this class passing the directory path where the mbox files are stored. The origin of the data will be set to the value of uri.

Parameters:
  • uri – URI of the mboxes; typically, the URL of their mailing list
  • dirpath – directory path where the mboxes are stored
  • tag – label used to mark the data
  • archive – archive to store/retrieve items
CATEGORIES = ['message']
DATE_FIELD = 'Date'
MESSAGE_ID_FIELD = 'Message-ID'
fetch(category='message', from_date=datetime.datetime(1970, 1, 1, 0, 0, tzinfo=tzutc()))[source]

Fetch the messages from a set of mbox files.

The method retrieves, from mbox files, the messages stored in these containers.

Parameters:
  • category – the category of items to fetch
  • from_date – obtain messages since this date
Returns:

a generator of messages

fetch_items(category, **kwargs)[source]

Fetch the messages

Parameters:
  • category – the category of items to fetch
  • kwargs – backend arguments
Returns:

a generator of items

classmethod has_archiving()[source]

Returns whether it supports archiving items on the fetch process.

Returns:this backend does not support items archive
classmethod has_resuming()[source]

Returns whether it supports to resume the fetch process.

Returns:this backend supports items resuming
static metadata_category(item)[source]

Extracts the category from a MBox item.

This backend only generates one type of item which is ‘message’.

static metadata_id(item)[source]

Extracts the identifier from a MBox item.

static metadata_updated_on(item)[source]

Extracts the update time from a MBox item.

The timestamp used is extracted from ‘Date’ field in its several forms. This date is converted to UNIX timestamp format.

Parameters:item – item generated by the backend
Returns:a UNIX timestamp
static parse_mbox(filepath)[source]

Parse a mbox file.

This method parses a mbox file and returns an iterator of dictionaries. Each one of them contains an email message.

Parameters:filepath – path of the mbox to parse
Returns:a generator of messages; each message is stored in a dictionary of type requests.structures.CaseInsensitiveDict
version = '0.11.0'
class perceval.backends.core.mbox.MBoxArchive(filepath)[source]

Bases: object

Class to access a mbox archive.

MBOX archives can be stored into plain or compressed files (gzip, bz2 or zip).

Parameters:filepath – path to the mbox file
compressed_type
container
filepath
is_compressed()[source]
class perceval.backends.core.mbox.MBoxCommand(*args)[source]

Bases: perceval.backend.BackendCommand

Class to run MBox backend from the command line.

BACKEND

alias of MBox

static setup_cmd_parser()[source]

Returns the MBox argument parser.
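The behaviour of parse_mbox can be approximated with Python's stdlib mailbox module; a sketch using a temporary file with one made-up message:

```python
import mailbox
import os
import tempfile

raw = (b"From alice@example.com Thu Jan  1 00:00:00 1970\n"
       b"Message-ID: <abc@example.com>\n"
       b"Date: Thu, 01 Jan 1970 00:00:00 +0000\n"
       b"Subject: hello\n"
       b"\n"
       b"body\n")

fd, path = tempfile.mkstemp(suffix=".mbox")
os.write(fd, raw)
os.close(fd)

# Each message is exposed as a mapping of headers, much like the
# CaseInsensitiveDict items the backend yields.
messages = [dict(msg.items()) for msg in mailbox.mbox(path)]
os.unlink(path)
print(messages[0]["Message-ID"])
```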

class perceval.backends.core.mbox.MailingList(uri, dirpath)[source]

Bases: object

Manage mailing lists archives.

This class gives access to the local mboxes archives that a mailing list manages.

Parameters:
  • uri – URI of the mailing lists, usually its URL address
  • dirpath – path to the mboxes archives
mboxes

Get the mboxes managed by this mailing list.

Returns the archives sorted by name.

Returns:a list of .MBoxArchive objects

perceval.backends.core.mediawiki module

class perceval.backends.core.mediawiki.MediaWiki(url, tag=None, archive=None)[source]

Bases: perceval.backend.Backend

MediaWiki backend for Perceval.

This class retrieves the wiki pages and edits from a MediaWiki site. To initialize this class the URL must be provided. The origin of the data will be set to this URL.

It uses different APIs to support pre- and post-1.27 MediaWiki versions. The pre-1.27 approach performs better, but it needs different logic for full and incremental retrieval.

In pre-1.27 versions, the incremental approach uses the recent changes API, which only covers MAX_RECENT_DAYS. If the from_date used is older, all the pages must be retrieved and the consumer of the items must do the filtering itself.

Both approaches return a common format: a page with all its revisions. They differ in how the pages list is generated.

The page and revision data downloaded are the standard ones. More data could be gathered using additional properties.

Deleted pages are not analyzed.

Parameters:
  • url – MediaWiki url
  • tag – label used to mark the data
  • archive – archive to store/retrieve items
CATEGORIES = ['page']
fetch(category='page', from_date=datetime.datetime(1970, 1, 1, 0, 0, tzinfo=tzutc()), reviews_api=False)[source]

Fetch the pages from the backend url.

The method retrieves, from a MediaWiki url, the wiki pages.

Parameters:
  • category – the category of items to fetch
  • from_date – obtain pages updated since this date
  • reviews_api – use the reviews API available in MediaWiki >= 1.27
Returns:

a generator of pages

fetch_items(category, **kwargs)[source]

Fetch the pages

Parameters:
  • category – the category of items to fetch
  • kwargs – backend arguments
Returns:

a generator of items

classmethod has_archiving()[source]

Returns whether it supports archiving items on the fetch process.

Returns:this backend supports items archive
classmethod has_resuming()[source]

Returns whether it supports to resume the fetch process.

Returns:this backend does not support items resuming
static metadata_category(item)[source]

Extracts the category from a MediaWiki item.

This backend only generates one type of item which is ‘page’.

static metadata_id(item)[source]

Extracts the identifier from a MediaWiki page.

static metadata_updated_on(item)[source]

Extracts the update field from a MediaWiki item.

The timestamp is extracted from ‘update’ field. This date is a UNIX timestamp but needs to be converted to a float value.

Parameters:item – item generated by the backend
Returns:a UNIX timestamp
version = '0.9.5'
class perceval.backends.core.mediawiki.MediaWikiClient(url, archive=None, from_archive=False)[source]

Bases: perceval.client.HttpClient

MediaWiki API client.

This class implements a simple client to retrieve pages from projects in a MediaWiki node.

Parameters:
  • url – URL of mediawiki site: https://wiki.mozilla.org
  • archive – an archive to store/retrieve the fetched data
  • from_archive – define whether the archive is used to store/read data
Raises:

HTTPError – when an error occurs doing the request

call(params)[source]

Run an API command.

Parameters:
  • cgi – CGI command to run on the server
  • params – dict with the HTTP parameters needed to run the given command
get_namespaces()[source]

Retrieve all contents namespaces.

get_pages(namespace, apcontinue='')[source]

Retrieve all pages from a namespace starting from apcontinue.

get_pages_from_allrevisions(namespaces, from_date=None, arvcontinue=None)[source]
get_recent_pages(namespaces, rccontinue='')[source]

Retrieve recent pages from all namespaces starting from rccontinue.

get_revisions(pageid, last_date=None)[source]
get_version()[source]
class perceval.backends.core.mediawiki.MediaWikiCommand(*args)[source]

Bases: perceval.backend.BackendCommand

Class to run MediaWiki backend from the command line.

BACKEND

alias of MediaWiki

static setup_cmd_parser()[source]

Returns the MediaWiki argument parser.
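The pre-1.27 incremental logic described above hinges on whether from_date still falls inside the recent-changes window. A sketch of that decision; the MAX_RECENT_DAYS value of 30 is an assumption for illustration:

```python
import datetime

MAX_RECENT_DAYS = 30  # assumed window covered by the recent changes API

def use_recent_changes(from_date, now):
    """Return True when the recent changes API can cover from_date;
    otherwise all pages must be retrieved and the consumer must
    do the filtering itself."""
    return (now - from_date).days < MAX_RECENT_DAYS

now = datetime.datetime(2018, 6, 30)
print(use_recent_changes(datetime.datetime(2018, 6, 20), now))  # True
print(use_recent_changes(datetime.datetime(2018, 1, 1), now))   # False
```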

perceval.backends.core.meetup module

class perceval.backends.core.meetup.Meetup(group, api_token, max_items=200, tag=None, archive=None, sleep_for_rate=False, min_rate_to_sleep=1, sleep_time=30)[source]

Bases: perceval.backend.Backend

Meetup backend.

This class allows fetching the events of a group from the Meetup server. Initialize this class passing the API token needed for authentication with the parameter api_token.

Parameters:
  • group – name of the group where data will be fetched
  • api_token – token or key needed to use the API
  • max_items – maximum number of items requested on the same query
  • tag – label used to mark the data
  • archive – archive to store/retrieve items
  • sleep_for_rate – sleep until rate limit is reset
  • min_rate_to_sleep – minimum rate needed to sleep until it will be reset
  • sleep_time – minimum waiting time to avoid a too many requests exception
CATEGORIES = ['event']
fetch(category='event', from_date=datetime.datetime(1970, 1, 1, 0, 0, tzinfo=tzutc()), to_date=None)[source]

Fetch the events from the server.

This method fetches those events of a group stored on the server that were updated since the given date. Comments and RSVPs are included within each event.

Parameters:
  • category – the category of items to fetch
  • from_date – obtain events updated since this date
  • to_date – obtain events updated before this date
Returns:

a generator of events

fetch_items(category, **kwargs)[source]

Fetch the events

Parameters:
  • category – the category of items to fetch
  • kwargs – backend arguments
Returns:

a generator of items

classmethod has_archiving()[source]

Returns whether it supports archiving items on the fetch process.

Returns:this backend supports items archive
classmethod has_resuming()[source]

Returns whether it supports to resume the fetch process.

Returns:this backend supports items resuming
static metadata_category(item)[source]

Extracts the category from a Meetup item.

This backend only generates one type of item which is ‘event’.

static metadata_id(item)[source]

Extracts the identifier from a Meetup item.

static metadata_updated_on(item)[source]

Extracts and converts the update time from a Meetup item.

The timestamp is extracted from ‘updated’ field and converted to a UNIX timestamp.

Parameters:item – item generated by the backend
Returns:a UNIX timestamp
static parse_json(raw_json)[source]

Parse a Meetup JSON stream.

The method parses a JSON stream and returns a list with the parsed data.

Parameters:raw_json – JSON string to parse
Returns:a list with the parsed data
version = '0.11.5'
class perceval.backends.core.meetup.MeetupClient(api_key, max_items=200, sleep_for_rate=False, min_rate_to_sleep=1, sleep_time=30, archive=None, from_archive=False)[source]

Bases: perceval.client.HttpClient, perceval.client.RateLimitHandler

Meetup API client.

Client for fetching information from the Meetup server using its REST API v3.

Parameters:
  • api_key – key needed to use the API
  • max_items – maximum number of items per request
  • sleep_for_rate – sleep until rate limit is reset
  • min_rate_to_sleep – minimum rate needed to sleep until it will be reset
  • sleep_time – time to sleep in case of connection problems
  • archive – an archive to store/read fetched data
  • from_archive – it tells whether to write/read the archive
PFIELDS = 'fields'
PKEY = 'key'
PORDER = 'order'
PPAGE = 'page'
PRESPONSE = 'response'
PSCROLL = 'scroll'
PSIGN = 'sign'
PSTATUS = 'status'
RCOMMENTS = 'comments'
REVENTS = 'events'
RRSVPS = 'rsvps'
VEVENT_FIELDS = ['event_hosts', 'featured', 'group_topics', 'plain_text_description', 'rsvpable', 'series']
VRESPONSE = ['yes', 'no']
VRSVP_FIELDS = ['attendance_status']
VSTATUS = ['cancelled', 'upcoming', 'past', 'proposed', 'suggested']
VUPDATED = 'updated'
calculate_time_to_reset()[source]

Number of seconds to wait. They are contained in the rate limit reset header.

comments(group, event_id)[source]

Fetch the comments of a given event.

events(group, from_date=datetime.datetime(1970, 1, 1, 0, 0, tzinfo=tzutc()))[source]

Fetch the events pages of a given group.

rsvps(group, event_id)[source]

Fetch the rsvps of a given event.

static sanitize_for_archive(url, headers, payload)[source]

Sanitize the payload of an HTTP request by removing the token information before storing/retrieving archived items

Parameters:
  • url – HTTP url request
  • headers – HTTP headers request
  • payload – HTTP payload request
Returns:url, headers and the sanitized payload
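Given the PKEY constant above, the sanitizing step can be sketched as dropping the token from the payload before it reaches the archive; this is a sketch, not the exact implementation:

```python
PKEY = 'key'  # request parameter carrying the API key, per the client docs

def sanitize_for_archive(url, headers, payload):
    """Drop the token from the payload so it is never archived."""
    if payload and PKEY in payload:
        payload = {k: v for k, v in payload.items() if k != PKEY}
    return url, headers, payload

url, headers, payload = sanitize_for_archive(
    'https://api.meetup.com/mygroup/events', None,
    {'key': 'aaaa', 'page': 200})
print(payload)  # {'page': 200}
```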

class perceval.backends.core.meetup.MeetupCommand(*args)[source]

Bases: perceval.backend.BackendCommand

Class to run Meetup backend from the command line.

BACKEND

alias of Meetup

static setup_cmd_parser()[source]

Returns the Meetup argument parser.

perceval.backends.core.nntp module

class perceval.backends.core.nntp.NNTP(host, group, tag=None, archive=None)[source]

Bases: perceval.backend.Backend

NNTP backend.

This class allows fetching the articles published on a news group using NNTP. It is initialized giving the host and the name of the news group.

Parameters:
  • host – host
  • group – name of the group
  • tag – label used to mark the data
  • archive – archive to store/retrieve items
CATEGORIES = ['article']
fetch(category='article', offset=1)[source]

Fetch articles posted on a news group.

This method fetches those messages or articles published on a news group starting on the given offset.

Parameters:
  • category – the category of items to fetch
  • offset – obtain messages from this offset
Returns:

a generator of articles

fetch_items(category, **kwargs)[source]

Fetch the articles

Parameters:
  • category – the category of items to fetch
  • kwargs – backend arguments
Returns:

a generator of items

classmethod has_archiving()[source]

Returns whether it supports archiving items on the fetch process.

Returns:this backend supports items archive
classmethod has_resuming()[source]

Returns whether it supports to resume the fetch process.

Returns:this backend supports items resuming
metadata(item)[source]

NNTP metadata.

This method overrides the metadata decorator to add extra information related to NNTP.

Parameters:item – an item fetched by a backend
static metadata_category(item)[source]

Extracts the category from a NNTP item.

This backend only generates one type of item which is ‘article’.

static metadata_id(item)[source]

Extracts the identifier from a NNTP item.

static metadata_updated_on(item)[source]

Extracts the update time from a NNTP item.

The timestamp is extracted from ‘Date’ field and converted to a UNIX timestamp.

Parameters:item – item generated by the backend
Returns:a UNIX timestamp
static parse_article(raw_article)[source]

Parse a NNTP article.

This method parses a NNTP article stored in a string object and returns a dictionary.

Parameters:raw_article – NNTP article string
Returns:a dictionary of type requests.structures.CaseInsensitiveDict
Raises:ParseError – when an error is found parsing the article
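As a sketch of both steps, the stdlib email parser can turn a raw article into a header dictionary, and the 'Date' header into a UNIX timestamp. The sample article below is hypothetical, and Perceval's actual implementation may differ in detail:

```python
import email.parser
import email.utils

# A hypothetical RFC 5536/2822-style article, for illustration only
raw_article = (
    "From: alice@example.com\r\n"
    "Newsgroups: example.group\r\n"
    "Subject: Hello\r\n"
    "Date: Mon, 27 Jun 2016 12:00:00 +0000\r\n"
    "Message-ID: <1@example.com>\r\n"
    "\r\n"
    "Body of the article\r\n"
)

# Parse headers into a dictionary, keeping the body apart
message = email.parser.Parser().parsestr(raw_article)
article = dict(message.items())
article['body'] = message.get_payload()

# Convert the 'Date' header to a UNIX timestamp, as metadata_updated_on does
dt = email.utils.parsedate_to_datetime(article['Date'])
timestamp = dt.timestamp()
```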
version = '0.5.2'
class perceval.backends.core.nntp.NNTPCommand(*args)[source]

Bases: perceval.backend.BackendCommand

Class to run NNTP backend from the command line.

BACKEND

alias of NNTP

static setup_cmd_parser()[source]

Returns the NNTP argument parser.

class perceval.backends.core.nntp.NNTTPClient(host, archive=None, from_archive=False)[source]

Bases: object

NNTP client

Parameters:
  • host – host
  • group – name of the group
  • archive – an archive to store/read fetched data
  • from_archive – it tells whether to write/read the archive
ARTICLE = 'article'
GROUP = 'group'
OVER = 'over'
article(article_id)[source]

Fetch article data

Parameters:article_id – id of the article to fetch
group(group_name)[source]

Fetch group data

Parameters:group_name – name of the group
over(offset)[source]

Fetch messages data

Parameters:offset – a tuple representing the offset to retrieve
quit()[source]

perceval.backends.core.phabricator module

class perceval.backends.core.phabricator.ConduitClient(base_url, api_token, max_retries=5, sleep_time=1, archive=None, from_archive=False)[source]

Bases: perceval.client.HttpClient

Conduit API Client.

Phabricator uses Conduit as its REST API. This class implements some of its methods to retrieve the contents from a Phabricator server.

Parameters:
  • base_url – URL of the Phabricator server
  • api_token – token to get access to restricted methods of the API
  • max_retries – maximum number of retries to a data source before raising a RetryError exception
  • sleep_time – time to sleep in case of connection problems
  • archive – an archive to store/read fetched data
  • from_archive – it tells whether to write/read the archive
MANIPHEST_TASKS = 'maniphest.search'
MANIPHEST_TRANSACTIONS = 'maniphest.gettasktransactions'
PAFTER = 'after'
PATTACHMENTS = 'attachments'
PCONSTRAINTS = 'constraints'
PHAB_PHIDS = 'phid.query'
PHAB_USERS = 'user.query'
PHIDS = 'phids'
PIDS = 'ids'
PMODIFIED_START = 'modifiedStart'
PORDER = 'order'
PPROJECTS = 'projects'
URL = '%(base)s/api/%(method)s'
VOUTDATED = 'outdated'
phids(*phids)[source]

Retrieve data about PHIDs.

Parameters:phids – list of PHIDs
static sanitize_for_archive(url, headers, payload)[source]

Sanitize the payload of an HTTP request by removing the token information before storing/retrieving archived items.

Parameters:
  • url – HTTP url request
  • headers – HTTP headers request
  • payload – HTTP payload request
Returns:url, headers and the sanitized payload

tasks(from_date=datetime.datetime(1970, 1, 1, 0, 0, tzinfo=tzutc()))[source]

Retrieve tasks.

Parameters:from_date – retrieve tasks that were updated since that date; dates are converted to epoch time.
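As a minimal sketch of that conversion (the constraint and order names come from the constants above, but the exact payload sent by this client is illustrative, not definitive):

```python
import datetime

# Turn a from_date into epoch time for the 'modifiedStart' constraint;
# the payload shape is an illustrative assumption
from_date = datetime.datetime(2016, 6, 27, tzinfo=datetime.timezone.utc)
epoch = int(from_date.timestamp())

params = {
    'constraints': {'modifiedStart': epoch},
    'order': 'outdated',
}
```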
transactions(*phids)[source]

Retrieve tasks transactions.

Parameters:phids – list of task identifiers
users(*phids)[source]

Retrieve users.

Parameters:phids – list of user identifiers
exception perceval.backends.core.phabricator.ConduitError(**kwargs)[source]

Bases: perceval.errors.BaseError

Raised when an error occurs using Conduit

message = '%(error)s (code: %(code)s)'
class perceval.backends.core.phabricator.Phabricator(url, api_token, tag=None, archive=None, max_retries=5, sleep_time=1)[source]

Bases: perceval.backend.Backend

Phabricator backend.

This class allows fetching the tasks stored on a Phabricator server. Initialize this class passing the URL of the server and the API token. The origin of the data will be set to this URL.

Parameters:
  • url – URL of the server
  • api_token – token needed to use the API
  • tag – label used to mark the data
  • archive – archive to store/retrieve items
  • max_retries – maximum number of retries to a data source before raising a RetryError exception
  • sleep_time – time to sleep in case of connection problems
CATEGORIES = ['task']
fetch(category='task', from_date=datetime.datetime(1970, 1, 1, 0, 0, tzinfo=tzutc()))[source]

Fetch the tasks from the server.

This method fetches the tasks stored on the server that were updated since the given date. The transactions data related to each task is also included within them.

Parameters:
  • category – the category of items to fetch
  • from_date – obtain tasks updated since this date
Returns:

a generator of tasks

fetch_items(category, **kwargs)[source]

Fetch the tasks

Parameters:
  • category – the category of items to fetch
  • kwargs – backend arguments
Returns:

a generator of items

classmethod has_archiving()[source]

Returns whether it supports archiving items on the fetch process.

Returns:this backend supports items archive
classmethod has_resuming()[source]

Returns whether it supports to resume the fetch process.

Returns:this backend supports items resuming
static metadata_category(item)[source]

Extracts the category from a Phabricator item.

This backend only generates one type of item which is ‘task’.

static metadata_id(item)[source]

Extracts the identifier from a Phabricator item.

static metadata_updated_on(item)[source]

Extracts and converts the update time from a Phabricator item.

The timestamp is extracted from ‘dateModified’ field. This date is a UNIX timestamp but needs to be converted to a float value.

Parameters:item – item generated by the backend
Returns:a UNIX timestamp
static parse_phids(results)[source]

Parse a Phabricator PHIDs JSON stream.

This method parses a JSON stream and returns a list iterator. Each item is a dictionary that contains the PHID parsed data.

Parameters:results – JSON to parse
Returns:a generator of parsed PHIDs
static parse_tasks(raw_json)[source]

Parse a Phabricator tasks JSON stream.

The method parses a JSON stream and returns a list iterator. Each item is a dictionary that contains the task parsed data.

Parameters:raw_json – JSON string to parse
Returns:a generator of parsed tasks
static parse_tasks_transactions(raw_json)[source]

Parse a Phabricator tasks transactions JSON stream.

The method parses a JSON stream and returns a dictionary with the parsed transactions.

Parameters:raw_json – JSON string to parse
Returns:a dict with the parsed transactions
static parse_users(raw_json)[source]

Parse a Phabricator users JSON stream.

The method parses a JSON stream and returns a list iterator. Each item is a dictionary that contains the user parsed data.

Parameters:raw_json – JSON string to parse
Returns:a generator of parsed users
version = '0.11.0'
class perceval.backends.core.phabricator.PhabricatorCommand(*args)[source]

Bases: perceval.backend.BackendCommand

Class to run Phabricator backend from the command line.

BACKEND

alias of Phabricator

static setup_cmd_parser()[source]

Returns the Phabricator argument parser.

perceval.backends.core.pipermail module

class perceval.backends.core.pipermail.Pipermail(url, dirpath, verify=True, tag=None, archive=None)[source]

Bases: perceval.backends.core.mbox.MBox

Pipermail backend.

This class allows fetching the email messages stored on a Pipermail archiver. Initialize this class passing the URL where the archiver is and the directory path where the mbox files will be fetched and stored. The origin of the data will be set to the value of url.

Parameters:
  • url – URL to the Pipermail archiver
  • dirpath – directory path where the mboxes are stored
  • verify – allows to disable SSL verification
  • tag – label used to mark the data
  • archive – archive to store/retrieve items
CATEGORIES = ['message']
fetch(category='message', from_date=datetime.datetime(1970, 1, 1, 0, 0, tzinfo=tzutc()))[source]

Fetch the messages from the Pipermail archiver.

The method fetches the mbox files from a remote Pipermail archiver and retrieves the messages stored on them.

Parameters:
  • category – the category of items to fetch
  • from_date – obtain messages since this date
Returns:

a generator of messages

fetch_items(category, **kwargs)[source]

Fetch the messages

Parameters:
  • category – the category of items to fetch
  • kwargs – backend arguments
Returns:

a generator of items

classmethod has_archiving()[source]

Returns whether it supports archiving items on the fetch process.

Returns:this backend does not support items archive
classmethod has_resuming()[source]

Returns whether it supports to resume the fetch process.

Returns:this backend supports items resuming
version = '0.9.0'
class perceval.backends.core.pipermail.PipermailCommand(*args)[source]

Bases: perceval.backend.BackendCommand

Class to run Pipermail backend from the command line.

BACKEND

alias of Pipermail

static setup_cmd_parser()[source]

Returns the Pipermail argument parser.

class perceval.backends.core.pipermail.PipermailList(url, dirpath, verify=True)[source]

Bases: perceval.backends.core.mbox.MailingList

Manage mailing list archives stored by Pipermail archiver.

This class gives access to remote and local mbox archives from a mailing list stored by Pipermail. It also allows keeping them in sync.

Parameters:
  • url – URL to the Pipermail archiver for this list
  • dirpath – path to the local mboxes archives
  • verify – allows to disable SSL verification
fetch(from_date=datetime.datetime(1970, 1, 1, 0, 0, tzinfo=tzutc()))[source]

Fetch the mbox files from the remote archiver.

Stores the archives in the path given during the initialization of this object. Those archives without a valid extension will be ignored.

Pipermail archives usually include in their file names the date of the messages they store, following the year-month schema. When the from_date parameter is given, the method will return the mboxes whose year and month are equal to or after that date.

Parameters:from_date – fetch archives that store messages equal or after the given date; only year and month values are compared
Returns:a list of tuples, storing the links and paths of the fetched archives
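The year-month comparison described above can be sketched as follows; the archive filenames used here are hypothetical (real Pipermail archives may be named differently), since only the year and month parts take part in the check:

```python
import datetime
import re

# Hypothetical archive names following the year-month schema
archives = ['2016-05.mbox', '2016-06.mbox', '2016-07.mbox']
from_date = datetime.datetime(2016, 6, 15)

def is_fetched(filename, from_date):
    # Compare only (year, month); the day of from_date is ignored
    m = re.match(r'^(\d{4})-(\d{2})', filename)
    year, month = int(m.group(1)), int(m.group(2))
    return (year, month) >= (from_date.year, from_date.month)

selected = [f for f in archives if is_fetched(f, from_date)]
```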
mboxes

Get the mboxes managed by this mailing list.

Returns the archives sorted by date in ascending order.

Returns:a list of .MBoxArchive objects

perceval.backends.core.redmine module

class perceval.backends.core.redmine.Redmine(url, api_token=None, max_issues=100, tag=None, archive=None)[source]

Bases: perceval.backend.Backend

Redmine backend.

This class allows fetching the issues stored on a Redmine server. Initialize this class passing the URL of the server. Some servers require authentication to access some data; if this is the case, pass the API token in the api_token parameter.

Parameters:
  • url – URL of the server
  • api_token – token needed to use the API
  • max_issues – maximum number of issues requested on the same query
  • tag – label used to mark the data
  • archive – archive to store/retrieve items
CATEGORIES = ['issue']
fetch(category='issue', from_date=datetime.datetime(1970, 1, 1, 0, 0, tzinfo=tzutc()))[source]

Fetch the issues from the server.

This method fetches the issues stored on the server that were updated since the given date. Data about attachments, journals and watchers (among others) are included within each issue.

Parameters:
  • category – the category of items to fetch
  • from_date – obtain issues updated since this date
Returns:

a generator of issues

fetch_items(category, **kwargs)[source]

Fetch the issues

Parameters:
  • category – the category of items to fetch
  • kwargs – backend arguments
Returns:

a generator of items

classmethod has_archiving()[source]

Returns whether it supports archiving items on the fetch process.

Returns:this backend supports items archive
classmethod has_resuming()[source]

Returns whether it supports to resume the fetch process.

Returns:this backend supports items resuming
static metadata_category(item)[source]

Extracts the category from a Redmine item.

This backend only generates one type of item which is ‘issue’.

static metadata_id(item)[source]

Extracts the identifier from a Redmine item.

static metadata_updated_on(item)[source]

Extracts and converts the update time from a Redmine item.

The timestamp is extracted from ‘updated_on’ field and converted to a UNIX timestamp.

Parameters:item – item generated by the backend
Returns:a UNIX timestamp
static parse_issue_data(raw_json)[source]

Parse a Redmine issue JSON stream.

The method parses a JSON stream and returns a dictionary with the parsed data for the given issue.

Parameters:raw_json – JSON string to parse
Returns:a dictionary with the parsed issue data
static parse_issues(raw_json)[source]

Parse a Redmine issues JSON stream.

The method parses a JSON stream and returns a list iterator. Each item is a dictionary that contains the issue parsed data.

Parameters:raw_json – JSON string to parse
Returns:a generator of parsed issues
static parse_user_data(raw_json)[source]

Parse a Redmine user JSON stream.

The method parses a JSON stream and returns a dictionary with the parsed data for the given user.

Parameters:raw_json – JSON string to parse
Returns:a dictionary with the parsed user data
version = '0.9.4'
class perceval.backends.core.redmine.RedmineClient(base_url, api_token=None, archive=None, from_archive=False)[source]

Bases: perceval.client.HttpClient

Redmine API client.

This class implements a client that retrieves issues from a Redmine server. Redmine servers provide a REST API that returns its results in JSON format.

Parameters:
  • base_url – URL of the Redmine server
  • api_token – token to get access to restricted data stored in the server
  • archive – an archive to store/read fetched data
  • from_archive – it tells whether to write/read the archive
CATTACHMENTS = 'attachments'
CCHANGESETS = 'changesets'
CCHILDREN = 'children'
CJOURNALS = 'journals'
CJSON = '.json'
CRELATIONS = 'relations'
CWATCHERS = 'watchers'
PINCLUDE = 'include'
PKEY = 'key'
PLIMIT = 'limit'
POFFSET = 'offset'
PSORT = 'sort'
PSTATUS_ID = 'status_id'
PUPDATED_ON = 'updated_on'
RISSUES = 'issues'
RUSERS = 'users'
URL = '%(base)s/%(resource)s'
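The URL template and resource constants above can be combined into request URLs. A minimal sketch, where the base URL is a hypothetical example:

```python
# Constants as documented for RedmineClient
URL = '%(base)s/%(resource)s'
RISSUES = 'issues'
CJSON = '.json'

# Hypothetical server URL, for illustration only
base_url = 'https://redmine.example.com'

# Build the issues resource URL from the template
url = URL % {'base': base_url, 'resource': RISSUES + CJSON}
```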
issue(issue_id)[source]

Get the information of the given issue.

Parameters:issue_id – issue identifier
issues(from_date=datetime.datetime(1970, 1, 1, 0, 0, tzinfo=tzutc()), offset=None, max_issues=100)[source]

Get the information of a list of issues.

Parameters:
  • from_date – retrieve issues that were updated since that date; dates are converted to UTC
  • offset – starting position for the search
  • max_issues – maximum number of issues to return per query
static sanitize_for_archive(url, headers, payload)[source]

Sanitize the payload of an HTTP request by removing the token information before storing/retrieving archived items.

Parameters:
  • url – HTTP url request
  • headers – HTTP headers request
  • payload – HTTP payload request
Returns:url, headers and the sanitized payload

user(user_id)[source]

Get the information of the given user.

Parameters:user_id – user identifier
class perceval.backends.core.redmine.RedmineCommand(*args)[source]

Bases: perceval.backend.BackendCommand

Class to run Redmine backend from the command line.

BACKEND

alias of Redmine

static setup_cmd_parser()[source]

Returns the Redmine argument parser.

perceval.backends.core.rss module

class perceval.backends.core.rss.RSS(url, tag=None, archive=None)[source]

Bases: perceval.backend.Backend

RSS backend for Perceval.

This class retrieves the entries from an RSS feed. To initialize this class the URL must be provided. The url will be set as the origin of the data.

Parameters:
  • url – RSS url
  • tag – label used to mark the data
  • archive – archive to store/retrieve items
CATEGORIES = ['entry']
fetch(category='entry')[source]

Fetch the entries from the url.

The method retrieves all entries from an RSS url.

Parameters:category – the category of items to fetch
Returns:a generator of entries
fetch_items(category, **kwargs)[source]

Fetch the entries

Parameters:
  • category – the category of items to fetch
  • kwargs – backend arguments
Returns:

a generator of items

classmethod has_archiving()[source]

Returns whether it supports archiving entries on the fetch process.

Returns:this backend supports entries archive
classmethod has_resuming()[source]

Returns whether it supports to resume the fetch process.

Returns:this backend does not support entries resuming
static metadata_category(item)[source]

Extracts the category from an RSS item.

This backend only generates one type of item which is ‘entry’.

static metadata_id(item)[source]

Extracts the identifier from an entry item.

static metadata_updated_on(item)[source]

Extracts the update time from an RSS item.

The timestamp is extracted from ‘published’ field. This date is a datetime string that needs to be converted to a UNIX timestamp float value.

Parameters:item – item generated by the backend
Returns:a UNIX timestamp
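A minimal sketch of this conversion, assuming the RFC 822 date format commonly used by RSS 'published' fields (real feeds may vary):

```python
import email.utils

# A 'published' value in the common RFC 822 format (illustrative sample)
published = 'Mon, 27 Jun 2016 12:00:00 +0000'

# Convert the datetime string to a UNIX timestamp float value
dt = email.utils.parsedate_to_datetime(published)
timestamp = dt.timestamp()
```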
classmethod parse_feed(raw_entries)[source]
version = '0.5.2'
class perceval.backends.core.rss.RSSClient(url, archive=None, from_archive=False)[source]

Bases: perceval.client.HttpClient

RSS API client.

This class implements a simple client to retrieve entries from an RSS node.

Parameters:
  • url – URL of the RSS node; e.g. https://item.opnfv.org/ci
  • archive – an archive to store/read fetched data
  • from_archive – it tells whether to write/read the archive
Raises:

HTTPError – when an error occurs doing the request

get_entries()[source]

Retrieve all entries from a RSS feed

class perceval.backends.core.rss.RSSCommand(*args)[source]

Bases: perceval.backend.BackendCommand

Class to run RSS backend from the command line.

BACKEND

alias of RSS

static setup_cmd_parser()[source]

Returns the RSS argument parser.

perceval.backends.core.slack module

class perceval.backends.core.slack.Slack(channel, api_token, max_items=1000, tag=None, archive=None)[source]

Bases: perceval.backend.Backend

Slack backend.

This class retrieves the messages sent to a Slack channel. To access the server an API token is required, which must have enough permissions to read from the given channel.

The origin of the data will be set to the SLACK_URL plus the identifier of the channel; i.e. ‘https://slack.com/C01234ABC’.

Parameters:
  • channel – identifier of the channel where data will be fetched
  • api_token – token or key needed to use the API
  • max_items – maximum number of messages requested on the same query
  • tag – label used to mark the data
  • archive – archive to store/retrieve items
CATEGORIES = ['message']
fetch(category='message', from_date=datetime.datetime(1970, 1, 1, 0, 0, tzinfo=tzutc()))[source]

Fetch the messages from the channel.

This method fetches the messages stored on the channel that were sent since the given date.

Parameters:
  • category – the category of items to fetch
  • from_date – obtain messages sent since this date
Returns:

a generator of messages

fetch_items(category, **kwargs)[source]

Fetch the messages

Parameters:
  • category – the category of items to fetch
  • kwargs – backend arguments
Returns:

a generator of items

classmethod has_archiving()[source]

Returns whether it supports archiving items on the fetch process.

Returns:this backend supports items archive
classmethod has_resuming()[source]

Returns whether it supports to resume the fetch process.

Returns:this backend does not support items resuming
static metadata_category(item)[source]

Extracts the category from a Slack item.

This backend only generates one type of item which is ‘message’.

static metadata_id(item)[source]

Extracts the identifier from a Slack item.

This identifier will be a mix of two fields because Slack messages do not have any unique identifier. In this case, ‘ts’ and ‘user’ values (or ‘bot_id’ when the message is sent by a bot) are combined because there have been cases where two messages were sent by different users at the same time.

static metadata_updated_on(item)[source]

Extracts and converts the update time from a Slack item.

The timestamp is extracted from ‘ts’ field and converted to a UNIX timestamp.

Parameters:item – item generated by the backend
Returns:a UNIX timestamp
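The two extraction steps above can be sketched as follows. Combining ‘ts’ with ‘user’ (or ‘bot_id’) is shown here with a SHA-1 digest; Perceval’s actual identifier function may combine the fields differently, and the sample item is hypothetical:

```python
import hashlib

# Hypothetical Slack message item, for illustration only
item = {'ts': '1467028800.000002', 'user': 'U024BE7LH', 'text': 'hi'}

# Mix 'ts' and 'user' (falling back to 'bot_id' for bot messages)
# into a single identifier; here via a SHA-1 digest
seed = item['ts'] + item.get('user', item.get('bot_id', ''))
item_id = hashlib.sha1(seed.encode('utf-8')).hexdigest()

# 'ts' is already a UNIX timestamp, encoded as a string
updated_on = float(item['ts'])
```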
static parse_channel_info(raw_channel_info)[source]

Parse a channel info JSON stream.

This method parses a JSON stream, containing the information from a channel, and returns a dict with the parsed data.

Parameters:raw_channel_info – JSON string to parse
Returns:a dict with the parsed information about a channel
static parse_history(raw_history)[source]

Parse a channel history JSON stream.

This method parses a JSON stream, containing the history of a channel, and returns a list with the parsed data. It also returns whether there are more messages not included in this stream.

Parameters:raw_history – JSON string to parse
Returns:a tuple with a list of dicts with the parsed messages and ‘has_more’ value
static parse_user(raw_user)[source]

Parse a user’s info JSON stream.

This method parses a JSON stream, containing the information from a user, and returns a dict with the parsed data.

Parameters:raw_user – JSON string to parse
Returns:a dict with the parsed user’s information
version = '0.7.1'
class perceval.backends.core.slack.SlackClient(api_token, max_items=1000, archive=None, from_archive=False)[source]

Bases: perceval.client.HttpClient

Slack API client.

Client for fetching information from the Slack server using its REST API.

Parameters:
  • api_token – token or key needed to use the API
  • max_items – maximum number of items per request
  • archive – an archive to store/read fetched data
  • from_archive – it tells whether to write/read the archive
PCHANNEL = 'channel'
PCOUNT = 'count'
PLATEST = 'latest'
POLDEST = 'oldest'
PTOKEN = 'token'
PUSER = 'user'
RCHANNEL_HISTORY = 'channels.history'
RCHANNEL_INFO = 'channels.info'
RCONVERSATION_INFO = 'conversations.members'
RUSER_INFO = 'users.info'
URL = 'https://slack.com/api/%(resource)s'
channel_info(channel)[source]

Fetch information about a channel.

conversation_members(conversation)[source]

Fetch the number of members in a conversation; a conversation is a supertype for public and private channels, DMs and group DMs.

Parameters:conversation – the ID of the conversation
history(channel, oldest=None, latest=None)[source]

Fetch the history of a channel.

static sanitize_for_archive(url, headers, payload)[source]

Sanitize the payload of an HTTP request by removing the token information before storing/retrieving archived items.

Parameters:
  • url – HTTP url request
  • headers – HTTP headers request
  • payload – HTTP payload request
Returns:url, headers and the sanitized payload

user(user_id)[source]

Fetch user info.

exception perceval.backends.core.slack.SlackClientError(**kwargs)[source]

Bases: perceval.errors.BaseError

Raised when an error occurs using the Slack client

message = '%(error)s'
class perceval.backends.core.slack.SlackCommand(*args)[source]

Bases: perceval.backend.BackendCommand

Class to run Slack backend from the command line.

BACKEND

alias of Slack

static setup_cmd_parser()[source]

Returns the Slack argument parser.

perceval.backends.core.stackexchange module

class perceval.backends.core.stackexchange.StackExchange(site, tagged=None, api_token=None, max_questions=100, tag=None, archive=None)[source]

Bases: perceval.backend.Backend

StackExchange backend for Perceval.

This class retrieves the questions stored in any of the StackExchange sites. To initialize this class the site must be provided.

Parameters:
  • site – StackExchange site
  • tagged – filter items by question Tag
  • api_token – StackExchange access_token for the API
  • max_questions – max of questions per page retrieved
  • tag – label used to mark the data
  • archive – archive to store/retrieve items
CATEGORIES = ['question']
fetch(category='question', from_date=datetime.datetime(1970, 1, 1, 0, 0, tzinfo=tzutc()))[source]

Fetch the questions from the site.

The method retrieves, from a StackExchange site, the questions updated since the given date.

Parameters:from_date – obtain questions updated since this date
Returns:a generator of questions
fetch_items(category, **kwargs)[source]

Fetch the questions

Parameters:
  • category – the category of items to fetch
  • kwargs – backend arguments
Returns:

a generator of items

classmethod has_archiving()[source]

Returns whether it supports archiving items on the fetch process.

Returns:this backend supports items archive
classmethod has_resuming()[source]

Returns whether it supports to resume the fetch process.

Returns:this backend supports items resuming
static metadata_category(item)[source]

Extracts the category from a StackExchange item.

This backend only generates one type of item which is ‘question’.

static metadata_id(item)[source]

Extracts the identifier from a StackExchange item.

static metadata_updated_on(item)[source]

Extracts the update time from a StackExchange item.

The timestamp is extracted from ‘last_activity_date’ field. This date is a UNIX timestamp but needs to be converted to a float value.

Parameters:item – item generated by the backend
Returns:a UNIX timestamp
static parse_questions(raw_page)[source]

Parse a StackExchange API raw response.

The method parses the API response, retrieving the questions from the received items.

Parameters:raw_page – API response with the items from where to parse the questions
Returns:a generator of questions
version = '0.10.4'
class perceval.backends.core.stackexchange.StackExchangeClient(site, tagged, token, max_questions=100, archive=None, from_archive=False)[source]

Bases: perceval.client.HttpClient

StackExchange API client.

This class implements a simple client to retrieve questions from any StackExchange site.

Parameters:
  • site – StackExchange site
  • tagged – filter items by question Tag
  • token – StackExchange access_token for the API
  • max_questions – max number of questions per query
  • archive – an archive to store/read fetched data
  • from_archive – it tells whether to write/read the archive
Raises:

HTTPError – when an error occurs doing the request

QUESTIONS_FILTER = 'Bf*y*ByQD_upZqozgU6lXL_62USGOoV3)MFNgiHqHpmO_Y-jHR'
STACKEXCHANGE_API_URL = 'https://api.stackexchange.com'
VERSION_API = '2.2'
get_questions(from_date)[source]

Retrieve all the questions from a given date.

Parameters:from_date – obtain questions updated since this date
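A sketch of the kind of request this method may build from the constants above; the exact parameter set Perceval sends is an assumption, shown only to illustrate how the site, tag filter and epoch-encoded date fit together:

```python
import datetime
from urllib.parse import urlencode

# Constants as documented for StackExchangeClient
STACKEXCHANGE_API_URL = 'https://api.stackexchange.com'
VERSION_API = '2.2'

from_date = datetime.datetime(2016, 6, 27, tzinfo=datetime.timezone.utc)

# Illustrative query parameters; the real client's set may differ
params = {
    'site': 'stackoverflow',
    'tagged': 'python',
    'min': int(from_date.timestamp()),  # dates are sent as epoch time
    'order': 'asc',
    'sort': 'activity',
    'pagesize': 100,
}
url = '%s/%s/questions?%s' % (STACKEXCHANGE_API_URL, VERSION_API,
                              urlencode(params))
```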
static sanitize_for_archive(url, headers, payload)[source]

Sanitize the payload of an HTTP request by removing the token information before storing/retrieving archived items.

Parameters:
  • url – HTTP url request
  • headers – HTTP headers request
  • payload – HTTP payload request
Returns:url, headers and the sanitized payload

class perceval.backends.core.stackexchange.StackExchangeCommand(*args)[source]

Bases: perceval.backend.BackendCommand

Class to run StackExchange backend from the command line.

BACKEND

alias of StackExchange

static setup_cmd_parser()[source]

Returns the StackExchange argument parser.

perceval.backends.core.supybot module

class perceval.backends.core.supybot.Supybot(uri, dirpath, tag=None, archive=None)[source]

Bases: perceval.backend.Backend

Supybot IRC log backend.

This class fetches the messages stored by Supybot in log files. Initialize this class providing the directory where those IRC log files are stored.

The log filenames expected by this backend should follow the pattern: #channel_YYYY-MM-DD.log (e.g. #grimoirelab_2016-06-27.log). This is needed to determine the date when messages were sent. Other filenames might work too but the behaviour is unknown.

The format of the messages must also follow a pattern. These patterns can be found in the SupybotParser class documentation.

Parameters:
  • uri – URI of the IRC archives; typically, the URL of their IRC channel
  • dirpath – directory path where the archives are stored
  • tag – label used to mark the data
  • archive – archive to store/retrieve items
CATEGORIES = ['message']
fetch(category='message', from_date=datetime.datetime(1970, 1, 1, 0, 0, tzinfo=tzutc()))[source]

Fetch the messages from the Supybot IRC logger.

The method parses and returns the messages saved on the IRC log files stored by Supybot in dirpath.

Parameters:
  • category – the category of items to fetch
  • from_date – obtain messages since this date
Returns:

a generator of messages

fetch_items(category, **kwargs)[source]

Fetch the messages

Parameters:
  • category – the category of items to fetch
  • kwargs – backend arguments
Returns:

a generator of items

classmethod has_archiving()[source]

Returns whether it supports archiving items on the fetch process.

Returns:this backend does not support items archive
classmethod has_resuming()[source]

Returns whether it supports to resume the fetch process.

Returns:this backend supports items resuming
static metadata_category(item)[source]

Extracts the category from a Supybot item.

This backend only generates one type of item which is ‘message’.

static metadata_id(item)[source]

Extracts the identifier from a Supybot item.

This identifier will be a mix of three fields because IRC messages do not have any unique identifier. In this case, ‘timestamp’, ‘nick’ and ‘body’ values are combined because there have been cases where two messages were sent by the same user at the same time.

static metadata_updated_on(item)[source]

Extracts the update time from a Supybot item.

The timestamp used is extracted from ‘timestamp’ field. This date is converted to UNIX timestamp format taking into account the timezone of the date.

Parameters:item – item generated by the backend
Returns:a UNIX timestamp
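A minimal sketch of this conversion, using the timestamp format described in SupybotParser; the timezone offset carried by the date is honored when computing the UNIX timestamp:

```python
import datetime

# A log timestamp in the ISO format used by Supybot logs,
# including its timezone offset
ts = '2016-06-27T12:00:00+0200'

# Parse with the offset, then convert to a UNIX timestamp;
# 12:00 at +0200 is 10:00 UTC
dt = datetime.datetime.strptime(ts, '%Y-%m-%dT%H:%M:%S%z')
timestamp = dt.timestamp()
```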
static parse_supybot_log(filepath)[source]

Parse a Supybot IRC log file.

The method parses the Supybot IRC log file and returns an iterator of dictionaries. Each one of these contains a message from the file.

Parameters:

filepath – path to the IRC log file

Returns:

a generator of parsed messages

Raises:
  • ParseError – raised when the format of the Supybot log file is invalid
  • OSError – raised when an error occurs reading the given file
version = '0.8.2'
class perceval.backends.core.supybot.SupybotCommand(*args)[source]

Bases: perceval.backend.BackendCommand

Class to run Supybot backend from the command line.

BACKEND

alias of Supybot

static setup_cmd_parser()[source]

Returns the Supybot argument parser.

class perceval.backends.core.supybot.SupybotParser(stream)[source]

Bases: object

Supybot IRC parser.

This class parses a Supybot IRC log stream, converting plain log lines (or messages) into dict items. Each dictionary will contain the date of the message, the type of message (comment or server message), the nick of the sender and its body.

Each line of a log starts with a date in ISO format, including its timezone, followed by two spaces and a message.

There are two types of valid messages in a Supybot log: comment messages and server messages. The first one follows either of these two patterns:

2016-06-27T12:00:00+0000  <nick> body of the message
2016-06-27T12:00:00+0000  * nick waves hello

While a valid server message follows this pattern:

2016-06-27T12:00:00+0000  *** nick is known as new_nick

An exception is raised when any of the lines does not follow any of the above formats.

Parameters:stream – an iterator which produces Supybot log lines
BOT_PATTERN = '^-(?P<nick>(.*?)(!.*)?)-\\s(?P<body>.+)$'
COMMENT_ACTION_PATTERN = '^\\*\\s?(?P<body>(?P<nick>([^\\s\\*]+?)(!.*)?)\\s.+)$'
COMMENT_PATTERN = '^<(?P<nick>(.*?)(!.*)?)>\\s(?P<body>.+)$'
EMPTY_BOT_PATTERN = '^-(.*?)(!.*)?-\\s*$'
EMPTY_COMMENT_ACTION_PATTERN = '^\\*\\s?([^\\s\\*]+?)(!.*)?\\s*$'
EMPTY_COMMENT_PATTERN = '^<(.*?)(!.*)?>\\s*$'
EMPTY_PATTERN = '^\\s*$'
SERVER_PATTERN = '^\\*\\*\\*\\s(?P<body>(?P<nick>(.*?)(!.*)?)\\s.+)$'
SUPYBOT_BOT_REGEX = re.compile('^-(?P<nick>(.*?)(!.*)?)-\\s(?P<body>.+)$', re.VERBOSE)
SUPYBOT_COMMENT_ACTION_REGEX = re.compile('^\\*\\s?(?P<body>(?P<nick>([^\\s\\*]+?)(!.*)?)\\s.+)$', re.VERBOSE)
SUPYBOT_COMMENT_REGEX = re.compile('^<(?P<nick>(.*?)(!.*)?)>\\s(?P<body>.+)$', re.VERBOSE)
SUPYBOT_EMPTY_BOT_REGEX = re.compile('^-(.*?)(!.*)?-\\s*$', re.VERBOSE)
SUPYBOT_EMPTY_COMMENT_ACTION_REGEX = re.compile('^\\*\\s?([^\\s\\*]+?)(!.*)?\\s*$', re.VERBOSE)
SUPYBOT_EMPTY_COMMENT_REGEX = re.compile('^<(.*?)(!.*)?>\\s*$', re.VERBOSE)
SUPYBOT_EMPTY_REGEX = re.compile('^\\s*$', re.VERBOSE)
SUPYBOT_SERVER_REGEX = re.compile('^\\*\\*\\*\\s(?P<body>(?P<nick>(.*?)(!.*)?)\\s.+)$', re.VERBOSE)
SUPYBOT_TIMESTAMP_REGEX = re.compile('^(?P<ts>\\d{4}-\\d{2}-\\d{2}T\\d{2}:\\d{2}:\\d{2}[\\+\\-]\\d{4})\\s\\s\n (?P<msg>.+)$\n ', re.VERBOSE)
TCOMMENT = 'comment'
TIMESTAMP_PATTERN = '^(?P<ts>\\d{4}-\\d{2}-\\d{2}T\\d{2}:\\d{2}:\\d{2}[\\+\\-]\\d{4})\\s\\s\n (?P<msg>.+)$\n '
TSERVER = 'server'
parse()[source]

Parse a Supybot IRC stream.

Returns an iterator of dicts. Each dict contains information about the date, type, nick and body of a single log entry.

Returns:iterator of parsed lines
Raises:ParseError – when an invalid line is found parsing the given stream
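
The documented line formats can be exercised in a minimal standalone sketch. The regular expressions below are copied from the class constants listed above; the dict field names mirror the description (date, type, nick, body) but are an assumption, not the parser's actual output layout:

```python
import re

# Patterns copied from the SupybotParser class constants above.
TIMESTAMP_REGEX = re.compile(
    r'^(?P<ts>\d{4}-\d{2}-\d{2}T\d{2}:\d{2}:\d{2}[+-]\d{4})\s\s(?P<msg>.+)$')
COMMENT_REGEX = re.compile(r'^<(?P<nick>(.*?)(!.*)?)>\s(?P<body>.+)$')
SERVER_REGEX = re.compile(r'^\*\*\*\s(?P<body>(?P<nick>(.*?)(!.*)?)\s.+)$')

def parse_line(line):
    """Split a Supybot log line into timestamp, type, nick and body."""
    m = TIMESTAMP_REGEX.match(line)
    if not m:
        raise ValueError("invalid line: %s" % line)
    ts, msg = m.group('ts'), m.group('msg')

    cm = COMMENT_REGEX.match(msg)
    if cm:
        return {'timestamp': ts, 'type': 'comment',
                'nick': cm.group('nick'), 'body': cm.group('body')}
    sm = SERVER_REGEX.match(msg)
    if sm:
        return {'timestamp': ts, 'type': 'server',
                'nick': sm.group('nick'), 'body': sm.group('body')}
    raise ValueError("invalid message: %s" % msg)

item = parse_line('2016-06-27T12:00:00+0000  <nick> body of the message')
srv = parse_line('2016-06-27T12:00:00+0000  *** nick is known as new_nick')
```

Note that this sketch only covers the comment and server patterns; the real class also handles action, bot and empty lines via the remaining constants.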

perceval.backends.core.telegram module

class perceval.backends.core.telegram.Telegram(bot, bot_token, tag=None, archive=None)[source]

Bases: perceval.backend.Backend

Telegram backend.

The Telegram backend fetches the messages that a Telegram bot can receive. Usually, these messages are direct or private messages, but a bot can be configured to receive every message sent to a channel/group where it is subscribed. Take into account that messages are removed from the Telegram server 24 hours after they are sent. Moreover, once they are fetched using an offset, those messages are also removed. This means that every time this backend is called, the fetched messages will be deleted from the server.

Initialize this class passing the name of the bot and the authentication token used by this bot. The authentication token is provided by Telegram once the bot is created.

The origin of the data will be set to the TELEGRAM_URL plus the name of the bot; i.e. ‘http://telegram.org/mybot’.

Parameters:
  • bot – name of the bot
  • bot_token – authentication token used by the bot
  • tag – label used to mark the data
  • archive – archive to store/retrieve items
CATEGORIES = ['message']
fetch(category='message', offset=1, chats=None)[source]

Fetch the messages the bot can read from the server.

The method retrieves, from the Telegram server, the messages sent with an offset equal to or greater than the given one.

A list of chat, group and channel identifiers can be set using the parameter chats. When it is set, only messages sent to any of them will be returned. Passing an empty list raises a ValueError, since no message could ever match.

Parameters:
  • category – the category of items to fetch
  • offset – obtain messages from this offset
  • chats – list of chat names used to filter messages
Returns:

a generator of messages

Raises:

ValueError – when chats is an empty list
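
The chats filtering can be sketched with a small standalone generator. The nested item layout (message → chat → id) and the helper name are assumptions for illustration; the empty-list behaviour mirrors the Raises section above:

```python
def filter_by_chats(messages, chats=None):
    """Yield only the messages sent to one of the given chats.

    `chats=None` disables filtering; an empty list raises ValueError,
    mirroring the behaviour documented for `fetch`.
    """
    if chats is not None and len(chats) == 0:
        raise ValueError("chats cannot be an empty list")
    for msg in messages:
        chat_id = msg['message']['chat']['id']  # assumed item layout
        if chats is None or chat_id in chats:
            yield msg

msgs = [{'message': {'chat': {'id': 1}, 'text': 'hi'}},
        {'message': {'chat': {'id': 2}, 'text': 'bye'}}]
selected = list(filter_by_chats(msgs, chats=[1]))
```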

fetch_items(category, **kwargs)[source]

Fetch the messages

Parameters:
  • category – the category of items to fetch
  • kwargs – backend arguments
Returns:

a generator of items

classmethod has_archiving()[source]

Returns whether it supports archiving items on the fetch process.

Returns:this backend supports items archive
classmethod has_resuming()[source]

Returns whether it supports resuming the fetch process.

Returns:this backend supports items resuming
metadata(item)[source]

Telegram metadata.

The method takes an item and extends its metadata with extra information related to Telegram.

Currently, it adds the ‘offset’ keyword.

Parameters:item – an item fetched by a backend
static metadata_category(item)[source]

Extracts the category from a Telegram item.

This backend only generates one type of item which is ‘message’.

static metadata_id(item)[source]

Extracts the identifier from a Telegram item.

static metadata_updated_on(item)[source]

Extracts and converts the update time from a Telegram item.

The timestamp is extracted from the ‘date’ field inside the ‘message’ dict and converted to a UNIX timestamp.

Parameters:item – item generated by the backend
Returns:a UNIX timestamp
static parse_messages(raw_json)[source]

Parse a Telegram JSON messages list.

The method parses the JSON stream and returns an iterator of dictionaries. Each of them contains a Telegram message.

Parameters:raw_json – JSON string to parse
Returns:a generator of parsed messages
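
A minimal sketch of this parsing step, assuming the standard Telegram Bot API response envelope ({"ok": ..., "result": [...]}); the function name matches the doc, but the body is illustrative, not the backend's actual code:

```python
import json

def parse_messages(raw_json):
    """Yield each update from a Telegram getUpdates JSON response."""
    result = json.loads(raw_json)
    for msg in result['result']:
        yield msg

raw = '{"ok": true, "result": [{"update_id": 1}, {"update_id": 2}]}'
updates = list(parse_messages(raw))
```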
version = '0.9.3'
class perceval.backends.core.telegram.TelegramBotClient(bot_token, archive=None, from_archive=False)[source]

Bases: perceval.client.HttpClient

Telegram Bot API 2.0 client.

This class implements a simple client to retrieve those messages sent to a Telegram bot. This includes personal messages or messages sent to a channel (when privacy settings are disabled).

Parameters:
  • bot_token – token for the bot
  • archive – an archive to store/read fetched data
  • from_archive – it tells whether to write/read the archive
API_URL = 'https://api.telegram.org/bot%(token)s/%(method)s'
OFFSET = 'offset'
UPDATES_METHOD = 'getUpdates'
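
Putting the constants above together, the getUpdates endpoint is built with old-style string interpolation. The token below is a made-up example value:

```python
# Constants copied from the TelegramBotClient class above.
API_URL = 'https://api.telegram.org/bot%(token)s/%(method)s'
UPDATES_METHOD = 'getUpdates'

# '123456:ABC-DEF' is a dummy token for illustration only.
url = API_URL % {'token': '123456:ABC-DEF', 'method': UPDATES_METHOD}
```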
static sanitize_for_archive(url, headers, payload)[source]

Sanitize the URL of an HTTP request by removing the token before storing/retrieving archived items.

Parameters:
  • url – HTTP request URL
  • headers – HTTP request headers
  • payload – HTTP request payload
Returns:

the sanitized URL, plus the headers and payload
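
The idea behind the sanitization can be sketched as a plain string replacement; the placeholder text and helper name are assumptions, not the method's actual implementation:

```python
def sanitize_url(url, token):
    """Strip the bot token from a request URL before archiving it."""
    # 'XXXXX' is an assumed placeholder; the real method may differ.
    return url.replace(token, 'XXXXX')

safe = sanitize_url('https://api.telegram.org/bot123456:ABC-DEF/getUpdates',
                    '123456:ABC-DEF')
```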

updates(offset=None)[source]

Fetch the messages that a bot can read.

When the offset is given, it will retrieve all the messages with an offset greater than or equal to it. Take into account that, due to how the API works, all messages prior to that offset will be removed from the server.

Parameters:offset – fetch the messages starting on this offset
class perceval.backends.core.telegram.TelegramCommand(*args)[source]

Bases: perceval.backend.BackendCommand

Class to run Telegram backend from the command line.

BACKEND

alias of Telegram

static setup_cmd_parser()[source]

Returns the Telegram argument parser.

perceval.backends.core.twitter module

class perceval.backends.core.twitter.Twitter(query, api_token, max_items=100, sleep_for_rate=False, min_rate_to_sleep=1, sleep_time=30, tag=None, archive=None)[source]

Bases: perceval.backend.Backend

Twitter backend.

This class allows fetching samples of tweets containing specific keywords. Initialize this class passing the API key needed for authentication with the parameter api_token.

Parameters:
  • query – query to fetch tweets
  • api_token – token or key needed to use the API
  • max_items – maximum number of tweets requested in the same query
  • sleep_for_rate – sleep until rate limit is reset
  • min_rate_to_sleep – minimum rate needed before sleeping until the limit is reset
  • sleep_time – minimum waiting time to avoid a ‘too many requests’ exception
  • tag – label used to mark the data
  • archive – archive to store/retrieve items
CATEGORIES = ['tweet']
fetch(category='tweet', since_id=None, max_id=None, geocode=None, lang=None, include_entities=True, tweets_type='mixed')[source]

Fetch the tweets from the server.

This method fetches, from the Twitter Search API, tweets published in the last seven days.

Parameters:
  • category – the category of items to fetch
  • since_id – if not null, it returns results with an ID greater than the specified ID
  • max_id – if not None, it returns results with an ID less than the specified ID
  • geocode – if enabled, returns tweets by users located at latitude,longitude,”mi”|”km”
  • lang – if enabled, restricts tweets to the given language, given by an ISO 639-1 code
  • include_entities – if disabled, it excludes entities node
  • tweets_type – type of tweets returned. Default is “mixed”, others are “recent” and “popular”
Returns:

a generator of tweets

fetch_items(category, **kwargs)[source]

Fetch the tweets

Parameters:
  • category – the category of items to fetch
  • kwargs – backend arguments
Returns:

a generator of items

classmethod has_archiving()[source]

Returns whether it supports archiving items on the fetch process.

Returns:this backend supports items archive
classmethod has_resuming()[source]

Returns whether it supports resuming the fetch process.

Returns:this backend supports items resuming
static metadata_category(item)[source]

Extracts the category from a Twitter item.

This backend only generates one type of item which is ‘tweet’.

static metadata_id(item)[source]

Extracts the identifier from a Twitter item.

static metadata_updated_on(item)[source]

Extracts and converts the update time from a Twitter item.

The timestamp is extracted from ‘created_at’ field and converted to a UNIX timestamp.

Parameters:item – item generated by the backend
Returns:a UNIX timestamp
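
A sketch of the conversion, assuming the date layout used by the Twitter REST API v1.1 (e.g. ‘Wed Oct 10 20:19:24 +0000 2018’); the function name matches the doc, but the body is illustrative:

```python
from datetime import datetime

def metadata_updated_on(item):
    """Convert a tweet's 'created_at' string into a UNIX timestamp."""
    # v1.1 dates look like 'Wed Oct 10 20:19:24 +0000 2018'.
    dt = datetime.strptime(item['created_at'], '%a %b %d %H:%M:%S %z %Y')
    return dt.timestamp()

ts = metadata_updated_on({'created_at': 'Wed Oct 10 20:19:24 +0000 2018'})
```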
version = '0.2.2'
class perceval.backends.core.twitter.TwitterClient(api_key, max_items=100, sleep_for_rate=False, min_rate_to_sleep=1, sleep_time=30, archive=None, from_archive=False)[source]

Bases: perceval.client.HttpClient, perceval.client.RateLimitHandler

Twitter API client.

Client for fetching information from the Twitter server using its REST API v1.1.

Parameters:
  • api_key – key needed to use the API
  • max_items – maximum number of items per request
  • sleep_for_rate – sleep until rate limit is reset
  • min_rate_to_sleep – minimum rate needed before sleeping until the limit is reset
  • sleep_time – time to sleep in case of connection problems
  • archive – an archive to store/read fetched data
  • from_archive – it tells whether to write/read the archive
calculate_time_to_reset()[source]

Calculate the number of seconds to wait; the value is taken from the rate limit reset header.
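
A minimal sketch of the calculation, assuming the reset header carries a UNIX timestamp (as Twitter's ‘x-rate-limit-reset’ header does); the clamping to zero and the signature are assumptions:

```python
import time

def calculate_time_to_reset(reset_epoch, now=None):
    """Seconds to wait until the rate limit window resets.

    `reset_epoch` is the UNIX timestamp from the reset header;
    the result is never negative.
    """
    now = time.time() if now is None else now
    return max(0, int(reset_epoch) - now)
```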

static sanitize_for_archive(url, headers, payload)[source]

Sanitize the payload of an HTTP request by removing the token before storing/retrieving archived items.

Parameters:
  • url – HTTP request URL
  • headers – HTTP request headers
  • payload – HTTP request payload
Returns:

the URL and headers, plus the sanitized payload

tweets(query, since_id=None, max_id=None, geocode=None, lang=None, include_entities=True, result_type='mixed')[source]

Fetch tweets for a given query between since_id and max_id.

Parameters:
  • query – query to fetch tweets
  • since_id – if not null, it returns results with an ID greater than the specified ID
  • max_id – if not null, it returns results with an ID less than the specified ID
  • geocode – if enabled, returns tweets by users located at latitude,longitude,”mi”|”km”
  • lang – if enabled, restricts tweets to the given language, given by an ISO 639-1 code
  • include_entities – if disabled, it excludes entities node
  • result_type – type of tweets returned. Default is “mixed”, others are “recent” and “popular”
Returns:

a generator of tweets
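
The since_id/max_id window can be sketched against a local list of tweet IDs, following the semantics documented above (IDs greater than since_id, less than max_id); this is an illustration, not the client's actual code:

```python
def select_window(tweet_ids, since_id=None, max_id=None):
    """Filter tweet IDs to the (since_id, max_id) window."""
    selected = tweet_ids
    if since_id is not None:
        selected = [i for i in selected if i > since_id]  # strictly greater
    if max_id is not None:
        selected = [i for i in selected if i < max_id]    # strictly less
    return selected

window = select_window([1, 2, 3, 4, 5], since_id=2, max_id=5)
```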

class perceval.backends.core.twitter.TwitterCommand(*args)[source]

Bases: perceval.backend.BackendCommand

Class to run Twitter backend from the command line.

BACKEND

alias of Twitter

static setup_cmd_parser()[source]

Returns the Twitter argument parser.

Module contents