bitcrawler

What is it?

Bitcrawler is a Python package for crawling and scraping the web. It aims to bring simplicity, speed, and extensibility to crawling projects, and it can be extended to add crawling behavior for specific use cases.

Installation

pip install bitcrawler

Dependencies

  • Reppy

  • BeautifulSoup4

  • Requests

Example Crawler

The example below extends the Crawler class and overrides the parse function. The parse function is always called at the end of crawling and is passed all the pages fetched. Here the pages are parsed using BeautifulSoup and each title is printed with its URL.

from bs4 import BeautifulSoup
from bitcrawler import Crawler

class MyCrawler(Crawler):
    def parse(self, webpages):
        for page in webpages:
            # Only handle pages that have a response, a successful (ok) status, and an HTML body.
            # Default to '' so a missing Content-Type header cannot raise AttributeError.
            if page.response and \
               page.response.ok and \
               page.response.headers.get('content-type', '').startswith('text/html'):
                soup = BeautifulSoup(page.response.text, "html.parser")
                print(page.url, "- ", soup.title)

# Initializes the crawler with the configuration specified by parameters.
crawler = MyCrawler(cross_site=True, crawl_depth=2, multithreading=True)
# Crawls pages starting from "http://test.com"
crawled_pages = crawler.crawl("http://test.com")

Submodules

bitcrawler.crawler module

This module provides functionality for crawling the web.

class bitcrawler.crawler.Crawler(user_agent='python-requests', crawl_delay=0, crawl_depth=5, cross_site=False, respect_robots=True, respect_robots_crawl_delay=False, multithreading=False, max_threads=100, webpage_builder=<class 'bitcrawler.webpage.WebpageBuilder'>, request_kwargs=None, reppy_cache_capacity=100, reppy_cache_policy=None, reppy_ttl_policy=None, reppy_args=())

Bases: object

Provides functionality for crawling webpages.

crawl(url, allowed_domains=None, disallowed_domains=None, page_timeout=10)

Crawls webpages by traversing links.

Parameters
  • url (str) – The URL to be crawled.

  • allowed_domains (list(str)) – A list of domains that are allowed to be crawled. Default None. The original URL's domain takes precedence. cross_site must be enabled.

  • disallowed_domains (list(str)) – A list of domains that must not be crawled. Default None. The original URL's domain takes precedence. cross_site must be enabled and allowed_domains must be empty/null.

  • page_timeout (int, optional) – Number of seconds to allow for page retrieval. Default 10.

Returns

Returns a call to the overridable parse function.

Supplies the webpages as input.

Return type

self.parse(webpages)
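
The traversal and domain rules described above can be sketched with the standard library alone. This is not bitcrawler's implementation: domain_allowed and crawl_order are hypothetical helpers, the link graph is made-up data, and the sketch assumes the start page counts as depth 0 and that the domain rules combine in the order the parameter notes suggest.

```python
from collections import deque
from urllib.parse import urlparse


def domain_allowed(url, start_url, cross_site=False,
                   allowed_domains=None, disallowed_domains=None):
    """Hypothetical filter: one reading of the parameter notes above."""
    domain = urlparse(url).netloc
    if domain == urlparse(start_url).netloc:
        return True   # the original URL's domain takes precedence
    if not cross_site:
        return False  # other domains require cross_site
    if allowed_domains:
        return domain in allowed_domains
    if disallowed_domains:
        return domain not in disallowed_domains
    return True


def crawl_order(start_url, links, crawl_depth, **filter_kwargs):
    """Breadth-first walk of an in-memory link graph with a depth limit."""
    seen = {start_url}
    queue = deque([(start_url, 0)])
    visited = []
    while queue:
        url, depth = queue.popleft()
        visited.append(url)
        if depth == crawl_depth:
            continue  # do not follow links past the depth limit
        for link in links.get(url, []):
            if link not in seen and domain_allowed(link, start_url, **filter_kwargs):
                seen.add(link)
                queue.append((link, depth + 1))
    return visited


# A toy link graph standing in for fetched pages (assumed data).
links = {
    "http://a.test/": ["http://a.test/x", "http://b.test/"],
    "http://a.test/x": ["http://a.test/y"],
    "http://b.test/": ["http://c.test/"],
}
print(crawl_order("http://a.test/", links, crawl_depth=1))
# ['http://a.test/', 'http://a.test/x']
print(crawl_order("http://a.test/", links, crawl_depth=2,
                  cross_site=True, disallowed_domains=["c.test"]))
# ['http://a.test/', 'http://a.test/x', 'http://b.test/', 'http://a.test/y']
```

In the second call, c.test is skipped because it is listed in disallowed_domains, while b.test is reached only because cross_site is enabled.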

parse(webpages)

Parses the webpages. Meant to be overridden.

Parameters

webpages (list(webpage.Webpage)) – A list of webpages.

Returns

The crawled webpages.

Return type

list(webpage.Webpage)

bitcrawler.parsing module

Utilities for parsing html.

Extends BeautifulSoup with additional html parsing functionality.

class bitcrawler.parsing.HtmlParser(markup='', features=None, builder=None, parse_only=None, from_encoding=None, exclude_encodings=None, element_classes=None, **kwargs)

Bases: bs4.BeautifulSoup

HtmlParser extends functionality provided by BeautifulSoup.

get_links()

Finds links from anchor tags in the soup.

Parameters

None

Returns

A list of links discovered within the html.

Return type

list(str)

Examples

>>> response = requests.get("http://python.org")
>>> HtmlParser(response.text).get_links()
["http://python.org/search", "/about", ..., "http://python.org/learn"]
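
For readers who want to see the idea without installing anything, the sketch below collects anchor hrefs with the standard library's html.parser. It is a stand-in and not HtmlParser itself: LinkCollector and extract_links are hypothetical names, and the HTML snippet is made up.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin


class LinkCollector(HTMLParser):
    """Collects href values from anchor tags (stdlib stand-in, not HtmlParser)."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)


def extract_links(html):
    """Return the href of every anchor tag in the given HTML string."""
    collector = LinkCollector()
    collector.feed(html)
    return collector.links


html = ('<p><a href="/about">About</a> <a name="top">Top</a> '
        '<a href="http://python.org/learn">Learn</a></p>')
found = extract_links(html)
print(found)  # ['/about', 'http://python.org/learn']

# Relative links can be resolved against the page URL before crawling them.
absolute = [urljoin("http://python.org/", link) for link in found]
print(absolute)  # ['http://python.org/about', 'http://python.org/learn']
```

As in the example output above, extracted links may be relative, so a crawler typically resolves them against the page URL before fetching.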

bitcrawler.robots module

Utilities for fetching and parsing robots.txt files.

Extends the reppy library (https://github.com/seomoz/reppy) for robots.txt fetching and parsing.

class bitcrawler.robots.ReppyUtils

Bases: object

A set of reppy utilities.

classmethod allowed(url, user_agent='python-requests', request_kwargs=None)

Determines if a URL is crawlable for a given user agent.

Parameters
  • url (str) – The url to check for crawlability.

  • user_agent (str, optional) – The user agent to check for in robots.txt. Default ‘python-requests’.

  • request_kwargs (dict, optional) – The keyword arguments to pass into the requests.get call to the robots.txt url. Default None.

Returns

True if the page is allowed to be crawled.

Return type

bool

Examples

>>> ReppyUtils.allowed('http://python.org/test')
True
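
The same kind of check can be tried offline with the standard library. The sketch below uses urllib.robotparser rather than reppy, so it is only an approximation of what ReppyUtils.allowed does; is_allowed and the robots.txt content are illustrative assumptions.

```python
from urllib.robotparser import RobotFileParser

# Made-up robots.txt content, parsed from a string so no network is needed.
ROBOTS_TXT = """\
User-agent: *
Disallow: /private/
"""


def is_allowed(robots_txt, url, user_agent="python-requests"):
    """Return True if robots_txt permits user_agent to fetch url."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


print(is_allowed(ROBOTS_TXT, "http://python.org/test"))       # True
print(is_allowed(ROBOTS_TXT, "http://python.org/private/x"))  # False
```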

classmethod crawl_delay(url, user_agent, request_kwargs=None)

Determines the robots crawl delay for a given user agent.

Parameters
  • url (str) – The url to get a crawl delay for.

  • user_agent (str) – The user agent to get the crawl delay for.

  • request_kwargs (dict, optional) – The keyword arguments to pass into the requests.get call to the robots.txt url. Default None.

Returns

The time to wait between crawling pages (seconds).

Return type

int

Examples

>>> ReppyUtils.crawl_delay('http://python.org/test', 'python-requests')
2

classmethod fetch_robots(robots_url, request_kwargs=None)

Fetches the robots URL.

Parameters
  • robots_url (str) – The robots url to fetch.

  • request_kwargs (dict, optional) – The keyword arguments to pass into the requests.get call to the robots.txt url. Default None.

Returns

The reppy object from fetching the robots.txt file.

Return type

reppy.Robots

classmethod get_robots_url(url)

Gets the URL where the robots file should be stored.

Parameters

url (str) – The url to derive the robots url from.

Returns

The robots.txt url.

Return type

str

Examples

>>> ReppyUtils.get_robots_url('http://python.org/test/path')
'http://python.org/robots.txt'
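
Deriving the robots.txt location amounts to keeping the scheme and host and replacing everything else. A minimal sketch with urllib.parse follows; robots_url_for is a hypothetical helper, not the library's code.

```python
from urllib.parse import urlsplit, urlunsplit


def robots_url_for(url):
    """Keep the scheme and host of url; replace the path with /robots.txt."""
    parts = urlsplit(url)
    return urlunsplit((parts.scheme, parts.netloc, "/robots.txt", "", ""))


print(robots_url_for("http://python.org/test/path"))
# http://python.org/robots.txt
print(robots_url_for("https://example.com:8080/a?b=1#c"))
# https://example.com:8080/robots.txt
```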

class bitcrawler.robots.RobotsCache(capacity, cache_policy=None, ttl_policy=None, *args, **kwargs)

Bases: reppy.cache.RobotsCache

Extends the reppy RobotsCache to include extra functionality.

crawl_delay(url, user_agent='python-requests')

Gets a crawl delay for a given url and user_agent. Note: Crawl delay is the same for all pages under a robots.txt file for a given user agent.

Note: Going to open a PR on reppy to have this built in.

Parameters
  • url (str) – The target URL.

  • user_agent (str, optional) – The user agent. Default “python-requests”.

Returns

The number of seconds the specified user_agent should wait between calls.

Return type

int

Examples

>>> crawl_delay("http://python.org", user_agent="python-requests")
2
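
Because the delay is shared by every page under one robots.txt file, it is natural to parse that file once per host and reuse it. The sketch below illustrates this with urllib.robotparser instead of reppy; CrawlDelayCache and fake_fetch are hypothetical names, and the robots.txt content is made up.

```python
from urllib.parse import urlsplit
from urllib.robotparser import RobotFileParser


class CrawlDelayCache:
    """Hypothetical cache keyed by robots.txt URL; not reppy's RobotsCache."""

    def __init__(self, fetch_robots_txt):
        self._fetch = fetch_robots_txt  # callable: robots URL -> robots.txt text
        self._parsers = {}

    def crawl_delay(self, url, user_agent="python-requests"):
        parts = urlsplit(url)
        robots_url = f"{parts.scheme}://{parts.netloc}/robots.txt"
        if robots_url not in self._parsers:
            # First request for this host: fetch and parse robots.txt once.
            parser = RobotFileParser()
            parser.parse(self._fetch(robots_url).splitlines())
            self._parsers[robots_url] = parser
        return self._parsers[robots_url].crawl_delay(user_agent)


fetched = []


def fake_fetch(robots_url):
    """Stands in for a network request and records what was asked for."""
    fetched.append(robots_url)
    return "User-agent: *\nCrawl-delay: 2\n"


cache = CrawlDelayCache(fake_fetch)
print(cache.crawl_delay("http://python.org/a"))  # 2
print(cache.crawl_delay("http://python.org/b"))  # 2
print(fetched)  # ['http://python.org/robots.txt']
```

The second lookup is answered from the cache, which is why the robots.txt URL appears only once in the list of fetches.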

bitcrawler.webpage module

This module provides functionality for fetching a webpage and stores relevant objects from page retrieval.

class bitcrawler.webpage.Webpage

Bases: object

Webpage provides the ability to fetch a webpage. Stores data from the retrieval of the page.

classmethod fetch(url, **requests_kwargs)

Fetches the webpage for the URL using the requests library.

Parameters
  • url (str) – The target URL.

  • **requests_kwargs (kwargs, optional) – Any additional parameters to pass on to the requests library.

Returns

The response from the web request.

Return type

requests.Response

Raises
  • Exception – Can raise a variety of exceptions. See the requests library for more details.

Parses links from an html document.

Parameters
  • url (str) – The target URL.

  • html (str) – the html document.

Returns

A list containing all valid urls found in the html.

Return type

list

get_page(url, user_agent, request_kwargs=None, respect_robots=True, reppy=None)

Fetches a webpage for the provided URL.

Parameters
  • url (str) – The url for the webpage.

  • user_agent (str) – The user_agent to use during requests. Note: This param overrides any user agent kwargs.

  • request_kwargs (dict, optional) – The page retrieval request kwargs.

  • respect_robots (bool) – If true, robots.txt will be honored.

  • reppy (robots.RobotParser, optional) – A robots parsing object.

Returns

The instance of the Webpage class.

Return type

this
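
The note that user_agent overrides any user agent supplied through the request kwargs could work roughly as sketched below. This is an assumption about the behavior, not the library's code; with_user_agent is a hypothetical helper.

```python
def with_user_agent(user_agent, request_kwargs=None):
    """Return a copy of request_kwargs whose headers carry user_agent."""
    kwargs = dict(request_kwargs or {})
    # Drop any existing User-Agent header, whatever its capitalization.
    headers = {
        name: value
        for name, value in (kwargs.get("headers") or {}).items()
        if name.lower() != "user-agent"
    }
    headers["User-Agent"] = user_agent
    kwargs["headers"] = headers
    return kwargs


original = {"headers": {"user-agent": "other", "Accept": "text/html"}, "timeout": 5}
merged = with_user_agent("python-requests", original)
print(merged["headers"])  # {'Accept': 'text/html', 'User-Agent': 'python-requests'}
print(merged["timeout"])  # 5
```

The caller's dictionary is copied rather than modified, so the same request kwargs can be reused across pages.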

Extracts links from a page.

Only supports documents with a content type of ‘text/html’. TODO: Add further support for other doc types.

Returns

A list of links from the page.

Return type

list(str)

Raises
  • RuntimeError – Raised if this function is called prior to calling Webpage(..).get_page. The response from get_page is required in this function.

classmethod is_allowed_by_robots(url, user_agent, reppy=None, request_kwargs=None)

Determine if a page is crawlable by robots.txt.

Leverages the Reppy library for retrieval and parsing of robots.txt.

Parameters
  • url (str) – The target URL.

  • user_agent (str) – The user agent being used to crawl the page.

  • reppy (robots.RobotParser, optional) – A robots parsing object.

  • request_kwargs (dict, optional) – requests.get kwargs for fetching the robots.txt file.

Returns

True if the page is allowed by robots.txt. Otherwise False.

Return type

bool

classmethod parse_mime_type(mime_type)

Parses a mime type into its content type and parameters.

Parameters
  • mime_type (str) – The mime type to parse.

Returns

A tuple of two strings: the type of the content and the content type parameters.

Return type

tuple

Examples

>>> parse_mime_type("text/html; encoding=utf-8")
("text/html", "encoding=utf-8",)
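
Splitting a mime type this way needs only a single partition on the first semicolon. The sketch below is an illustration rather than the library's implementation; split_mime_type is a hypothetical name.

```python
def split_mime_type(mime_type):
    """Split a mime type string into (content type, parameters)."""
    content_type, _, params = mime_type.partition(";")
    return content_type.strip(), params.strip()


print(split_mime_type("text/html; encoding=utf-8"))  # ('text/html', 'encoding=utf-8')
print(split_mime_type("application/json"))           # ('application/json', '')
```

When no parameters are present, the second element is an empty string.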

class bitcrawler.webpage.WebpageBuilder

Bases: object

Builds a Webpage object by initializing the class and calling the get_page method.

classmethod build(url, user_agent, request_kwargs=None, respect_robots=True, reppy=None)

Builds a Webpage by fetching the provided URL.

Parameters
  • url (str) – The url for the webpage.

  • user_agent (str) – The user_agent to use during requests. Note: This param overrides any user agent kwargs.

  • request_kwargs (dict, optional) – The page retrieval request kwargs.

  • respect_robots (bool) – If true, robots.txt will be honored.

  • reppy (robots.RobotParser, optional) – A robots parsing object.

Returns

The instance of the Webpage class.

Return type

this

Module contents