bitcrawler

What is it?

Bitcrawler is a Python package that provides functionality for crawling and scraping the web. The library brings simplicity, speed, and extensibility to any crawling project, and it can be extended to easily add crawling behavior and functionality for specific use cases.

Installation

pip install bitcrawler

Dependencies

  • Reppy

  • BeautifulSoup4

  • Requests

Example Crawler

Crawling begins by fetching the supplied URL. The crawler then traverses links discovered on the fetched pages until it reaches the specified crawl depth or runs out of links.

A bitcrawler.webpage.Webpage class instance is returned for each page fetched. For more details on the Webpage class, see its documentation (https://bitcrawler.readthedocs.io/en/latest/bitcrawler.html#module-bitcrawler.webpage).

Simple Usage

from bitcrawler.crawler import Crawler

crawler = Crawler()
# Returns a list of bitcrawler.webpage.Webpage instances.
# See the Webpage class for more details on its members.
crawled_pages = crawler.crawl('http://test.com')
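
The returned Webpage objects can be inspected directly; the following is an illustrative sketch (the attributes used are documented in the Webpage class below):

for page in crawled_pages:
    # response is the requests.Response from fetching the page (None if the fetch failed).
    if page.response and page.response.ok:
        print(page.url, page.response.status_code)
    else:
        # message and error describe why a fetch failed or was skipped.
        print(page.url, page.message, page.error)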

Advanced Usage

The example below extends the Crawler class and overrides the parse method. The parse method is always called at the end of crawling and is passed all of the pages fetched. In this example the pages are parsed using BeautifulSoup and each page's title is printed alongside its URL.

from bs4 import BeautifulSoup
from bitcrawler.crawler import Crawler
from bitcrawler import webpage

class MyCrawler(Crawler):
    # Parse is always called by the `crawl` method and is provided
    # a webpage.Webpage class instance for each URL.
    # See the webpage.Webpage class for details about the object.
    def parse(self, webpages):
        for page in webpages:
            # If the page response is not None, the response code is in the 200s,
            # and the document is HTML. Default to '' in case the content-type header is missing.
            if page.response and \
               page.response.ok and \
               page.response.headers.get('content-type', '').startswith('text/html'):
                soup = BeautifulSoup(page.response.text, "html.parser")
                print(page.url, "- ", soup.title)
        return webpages

# Initializes the crawler with the configuration specified by parameters.
crawler = MyCrawler(
    user_agent='python-requests', # The User Agent to use for all requests.
    crawl_delay=0, # Number of seconds to wait between web requests.
    crawl_depth=2, # The max depth from following links (Default is 5).
    cross_site=False, # If true, domains other than the original domain can be crawled.
    respect_robots=True, # If true, the robots.txt standard will be followed.
    respect_robots_crawl_delay=True, # If true, the robots.txt crawl delay will be followed.
    multithreading=True, # If true, parallelizes requests for faster crawling.
    max_threads=100, # If multithreading is true, this determines the number of threads.
    webpage_builder=webpage.WebpageBuilder, # Advanced Usage - Allows the WebpageBuilder class to be overridden to allow modification.
    request_kwargs={'timeout': 10}, # Additional keyword arguments that you would like to pass into any request made.
    reppy_cache_capacity=100, # The number of robots.txt objects to cache. Eliminates the need to fetch robots.txt file many times.
    reppy_cache_policy=None, # Advanced Usage - See docs for details.
    reppy_ttl_policy=None, # Advanced Usage - See docs for details.
    reppy_args=tuple()) # Advanced Usage - See docs for details.

# Crawls pages starting from "http://test.com"
# Returns a list of bitcrawler.webpage.Webpage instances.
# See the Webpage class for more details on its members.
crawled_pages = crawler.crawl(
    url="http://test.com", # The start URL to crawl from.
    allowed_domains=[], # A list of allowed domains. `cross_site` must be True. Ex. ['python.org',...]
    disallowed_domains=[], # A list of disallowed domains. `cross_site` must be True and `allowed_domains` empty.
    page_timeout=10) # The amount of time (in seconds) before a page retrieval/build times out.
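
Cross-site crawling can be enabled and restricted to specific domains with the cross_site and allowed_domains parameters. A minimal sketch (the domains below are placeholders):

from bitcrawler.crawler import Crawler

# Allow the crawler to leave the start domain, but only onto the listed domains.
crawler = Crawler(cross_site=True, crawl_depth=2)
crawled_pages = crawler.crawl(
    url="http://test.com",
    allowed_domains=["test.com", "python.org"])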

Submodules

bitcrawler.crawler module

This module provides functionality for crawling the web.

class bitcrawler.crawler.Crawler(user_agent='python-requests', crawl_delay=0, crawl_depth=5, cross_site=False, respect_robots=True, respect_robots_crawl_delay=False, multithreading=False, max_threads=100, webpage_builder=<class 'bitcrawler.webpage.WebpageBuilder'>, request_kwargs=None, reppy_cache_capacity=100, reppy_cache_policy=None, reppy_ttl_policy=None, reppy_args=())

Bases: object

Provides functionality for crawling webpages.

crawl(urls, allowed_domains=None, disallowed_domains=None, page_timeout=10)

Crawls webpages by traversing links.

Parameters:
  • urls (list) – The URL or list of start URLs to be crawled.

  • allowed_domains (list(str)) – A list of allowed domains to crawl. Default None. The original URL's domain takes precedence. cross_site must be enabled.

  • disallowed_domains (list(str)) – A list of disallowed domains to crawl. Default None. The original URL's domain takes precedence. cross_site must be enabled and allowed_domains must be empty/null.

  • page_timeout (int, optional) – Number of seconds to allow for page retrieval. Default 10.

Returns:

Returns a call to the overridable parse function.

Supplies the webpages as input.

Return type:

self.parse(webpages)
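
For example, an illustrative sketch (the URL is a placeholder and the output is abbreviated):

>>> from bitcrawler.crawler import Crawler
>>> crawler = Crawler(crawl_depth=1)
>>> pages = crawler.crawl(['http://python.org'])
>>> [page.url for page in pages]
['http://python.org', ...]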

parse(webpages)

Parses the webpages. Meant to be overridden.

Parameters:

webpages (list(webpage.Webpage)) – A list of webpages.

Returns:

The crawled webpages.

Return type:

list(webpage.Webpage)
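
A subclass can override parse to post-process the crawl results before they are returned. A minimal sketch (the filtering criterion is only an example):

from bitcrawler.crawler import Crawler

class SuccessfulPagesCrawler(Crawler):
    def parse(self, webpages):
        # Keep only the pages whose HTTP response came back successfully.
        return [page for page in webpages
                if page.response is not None and page.response.ok]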

bitcrawler.parsing module

Utilities for parsing html.

Extends functionality of BeautifulSoup for added html parsing functionality.

class bitcrawler.parsing.HtmlParser(markup='', features=None, builder=None, parse_only=None, from_encoding=None, exclude_encodings=None, element_classes=None, **kwargs)

Bases: BeautifulSoup

HtmlParser extends functionality provided by BeautifulSoup.

get_links()

Finds links from anchor tags in the soup.

Parameters:

None

Returns:

A list of links discovered within the html.

Return type:

list(str)

Examples

>>> response = requests.get("http://python.org")
>>> HtmlParser(response.text).get_links()
["http://python.org/search", "/about", ..., "http://python.org/learn"]

bitcrawler.robots module

Utilities for fetching and parsing robots.txt files.

Extends the reppy library (https://github.com/seomoz/reppy) for robots.txt fetching and parsing.

class bitcrawler.robots.ReppyUtils

Bases: object

A set of reppy utilities.

classmethod allowed(url, user_agent='python-requests', request_kwargs=None)

Determines if a URL is crawlable for a given user agent.

Parameters:
  • url (str) – The url to check for crawlability.

  • user_agent (str, optional) – The user agent to check for in robots.txt. Default ‘python-requests’.

  • request_kwargs (dict, optional) – The keyword arguments to pass into the requests.get call to the robots.txt URL. Default None.

Returns:

True if the page is allowed to be crawled.

Return type:

bool

Examples

>>> ReppyUtils.allowed('http://python.org/test')
True

classmethod crawl_delay(url, user_agent, request_kwargs=None)

Determines the robots crawl delay for a given user agent.

Parameters:
  • url (str) – The url to get a crawl delay for.

  • user_agent – The user agent to get the crawl delay for.

  • request_kwargs (dict, optional) – The keyword arguments to pass into the requests.get call to the robots.txt URL. Default None.

Returns:

The time to wait between crawling pages (seconds).

Return type:

int

Examples

>>> ReppyUtils.crawl_delay('http://python.org/test', 'python-requests')
2

classmethod fetch_robots(robots_url, request_kwargs=None)

Fetches the robots URL.

Parameters:
  • robots_url (str) – The robots url to fetch.

  • request_kwargs (dict, optional) – The keyword arguments to pass into the requests.get call to the robots.txt URL. Default None.

Returns:

The reppy object from fetching the robots.txt file.

Return type:

reppy.Robots
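
An illustrative sketch (the returned object is a reppy.Robots instance, so the allowed call below comes from the reppy library; the result depends on the site's robots.txt):

>>> from bitcrawler.robots import ReppyUtils
>>> robots = ReppyUtils.fetch_robots('http://python.org/robots.txt')
>>> robots.allowed('http://python.org/test', 'python-requests')
True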

classmethod get_robots_url(url)

Gets the URL where the robots.txt file should be located.

Parameters:

url (str) – The url to derive the robots url from.

Returns:

The robots.txt url.

Return type:

str

Examples

>>> ReppyUtils.get_robots_url('http://python.org/test/path')
'http://python.org/robots.txt'
class bitcrawler.robots.RobotsCache(capacity, cache_policy=None, ttl_policy=None, *args, **kwargs)

Bases: RobotsCache

Extends the reppy RobotsCache to include extra functionality.
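
A cache can be constructed and queried directly; an illustrative sketch (the capacity, URL, and returned delay are placeholders):

>>> from bitcrawler.robots import RobotsCache
>>> cache = RobotsCache(capacity=100)
>>> cache.crawl_delay('http://python.org', user_agent='python-requests')
2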

crawl_delay(url, user_agent='python-requests')

Gets a crawl delay for a given url and user_agent. Note: The crawl delay is the same for all pages under a robots.txt file for a given user agent.

Note: Going to open a PR on reppy to have this built in.

Parameters:
  • url (str) – The target URL.

  • user_agent (str, optional) – The user agent. Default 'python-requests'.

Returns:

The number of seconds the specified user_agent should wait between calls.

Return type:

int

Examples

>>> crawl_delay("http://python.org", user_agent="python-requests")
2

bitcrawler.webpage module

This module provides functionality for fetching a webpage and stores relevant objects from page retrieval.

class bitcrawler.webpage.Webpage

Bases: object

Webpage provides the ability to fetch a webpage. Stores data from the retrieval of the page.

url

The URL associated with the webpage.

Type:

str

response

The requests library Response object from fetching the page.

Type:

obj requests.Response

links

A list of the links found on the page.

Type:

list(str)

allowed_by_robots

If true, the page is crawlable by robots.txt.

Type:

bool

message

Message detailing any issues fetching the page.

Type:

str

error

Any error that was raised during page retrieval.

Type:

obj Exception
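
After a crawl, these attributes describe what happened to each page. An illustrative sketch, assuming crawled_pages is the list returned by Crawler.crawl:

for page in crawled_pages:
    if page.allowed_by_robots is False:
        print(page.url, "was blocked by robots.txt")
    elif page.error:
        print(page.url, "failed:", page.message)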

classmethod fetch(url, **requests_kwargs)

Fetches the webpage for the URL using the requests library.

Parameters:
  • url (str) – The target URL.

  • **requests_kwargs (kwargs, optional) – Any additional parameters to pass on to the requests library.

Returns:

The response from the web request.

Return type:

obj requests.Response

Raises:
  • Exception – Can raise a variety of exceptions. See the requests library for more details.
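
An illustrative sketch (keyword arguments such as timeout are forwarded to the requests library; the status code will vary):

>>> from bitcrawler.webpage import Webpage
>>> response = Webpage.fetch('http://python.org', timeout=10)
>>> response.status_code
200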

Parses links from an html document.

Parameters:
  • url (str) – The target URL.

  • html (str) – The html document.

Returns:

A list containing all valid urls found in the html.

Return type:

list

Extracts links from a page.

Only supports documents with a content type of ‘text/html’. TODO: Add further support for other doc types.

Returns:

A list of links from the page.

Return type:

list(str)

classmethod is_allowed_by_robots(url, user_agent, reppy=None, request_kwargs=None)

Determine if a page is crawlable by robots.txt.

Leverages the Reppy library for retrieval and parsing of robots.txt.

Parameters:
  • url (str) – The target URL.

  • user_agent (str) – The user agent being used to crawl the page.

  • reppy (robots.RobotParser, optional) – A robots parsing object.

  • request_kwargs (dict, optional) – requests.get kwargs for fetching the robots.txt file.

Returns:

True if the page is allowed by robots.txt. Otherwise False.

Return type:

bool
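
An illustrative sketch (the result depends on the site's robots.txt):

>>> from bitcrawler.webpage import Webpage
>>> Webpage.is_allowed_by_robots('http://python.org/test', 'python-requests')
True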

classmethod parse_mime_type(mime_type)

Parses a mime type into its content type and parameters.

Parameters:
  • mime_type (str) – The mime type string to parse, e.g. "text/html; encoding=utf-8".

Returns:

A tuple containing the content type (str) and the content type parameters (str).

Return type:

tuple

Examples

>>> parse_mime_type("text/html; encoding=utf-8")
("text/html", "encoding=utf-8",)
class bitcrawler.webpage.WebpageBuilder

Bases: object

Builds a Webpage object by initializing the class and calling the get_page method.

classmethod build(url, user_agent, request_kwargs=None, respect_robots=True, reppy=None)

Builds a Webpage by fetching the provided URL.

Parameters:
  • url (str) – The URL to fetch.

  • user_agent (str) – The user_agent to use during requests. Note: This param overrides any user agent kwargs.

  • request_kwargs (dict, optional) – The page retrieval request kwargs.

  • respect_robots (bool) – If true, robots.txt will be honored.

  • reppy (robots.RobotParser, optional) – A robots parsing object.

Returns:

The instance of the Webpage class.

Return type:

Webpage
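
An illustrative sketch (the URL and user agent are placeholders; the returned object is a Webpage instance):

>>> from bitcrawler import webpage
>>> page = webpage.WebpageBuilder.build('http://python.org', 'python-requests')
>>> page.url
'http://python.org'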

Module contents