bitcrawler¶

What is it?

Bitcrawler is a Python package for crawling and scraping the web. It aims to bring simplicity, speed, and extensibility to crawling projects, and it can be extended to add crawling behavior and functionality for specific use cases.

Installation

pip install bitcrawler

Dependencies

Reppy
BeautifulSoup4
Requests

Example Crawler

The example below extends the Crawler class and overrides its parse function. The parse function is always called at the end of crawling and is passed all of the pages that were fetched. Here each page is parsed with BeautifulSoup, and its title is printed alongside its URL.

from bs4 import BeautifulSoup
from bitcrawler import Crawler

class MyCrawler(Crawler):
    def parse(self, webpages):
        for page in webpages:
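            # Proceed only if a response exists, its status code is below 400,
            # and the document is HTML. The '' default guards against a missing
            # Content-Type header, which would otherwise raise AttributeError.
            if page.response and \
               page.response.ok and \
               page.response.headers.get('content-type', '').startswith('text/html'):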
                soup = BeautifulSoup(page.response.text, "html.parser")
                print(page.url, "- ", soup.title)

# Initializes the crawler with the configuration specified by parameters.
crawler = MyCrawler(cross_site=True, crawl_depth=2, multithreading=True)
# Crawls pages starting from "http://test.com"
crawled_pages = crawler.crawl("http://test.com")

Submodules¶

bitcrawler.crawler module¶

This module provides functionality for crawling the web.

class bitcrawler.crawler.Crawler(user_agent='python-requests', crawl_delay=0, crawl_depth=5, cross_site=False, respect_robots=True, respect_robots_crawl_delay=False, multithreading=False, max_threads=100, webpage_builder=<class 'bitcrawler.webpage.WebpageBuilder'>, request_kwargs=None, reppy_cache_capacity=100, reppy_cache_policy=None, reppy_ttl_policy=None, reppy_args=())¶

Bases: object

Provides functionality for crawling webpages.

crawl(url, allowed_domains=None, disallowed_domains=None, page_timeout=10)¶

Crawls webpages by traversing links.

Parameters

url (str) – The URL to be crawled.
allowed_domains (list(str)) – A list of domains that may be crawled. Default None. The original URL's domain takes precedence. cross_site must be enabled.
disallowed_domains (list(str)) – A list of domains that must not be crawled. Default None. The original URL's domain takes precedence. cross_site must be enabled and allowed_domains must be empty/null.
page_timeout (int, optional) – Number of seconds to allow for page retrieval. Default 10.

Returns

The result of calling the overridable parse function, which is supplied the crawled webpages as input.

Return type

The type returned by self.parse(webpages); list(webpage.Webpage) unless parse is overridden.
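
The precedence rules stated for cross_site, allowed_domains, and disallowed_domains can be read as a small decision function. The sketch below is one plausible interpretation of those rules, not bitcrawler's actual code; should_follow and its arguments are hypothetical names introduced only for illustration.

```python
def should_follow(link_domain, start_domain, cross_site=False,
                  allowed_domains=None, disallowed_domains=None):
    """Hypothetical helper: decide whether a discovered link may be crawled."""
    # The domain of the original URL always takes precedence.
    if link_domain == start_domain:
        return True
    # Other domains are only considered when cross-site crawling is enabled.
    if not cross_site:
        return False
    # When an allow-list is given, it is the only list consulted.
    if allowed_domains:
        return link_domain in allowed_domains
    # The deny-list applies only when no allow-list was supplied.
    if disallowed_domains:
        return link_domain not in disallowed_domains
    return True


print(should_follow("pandas.com", "python.org", cross_site=True,
                    allowed_domains=["pandas.com"]))
```

Under this reading, a deny-list entry for the starting domain has no effect, because the first check returns before either list is consulted.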

parse(webpages)¶

Parses the webpages. Meant to be overridden.

Parameters: webpages (list(webpage.Webpage)) – A list of webpages.
Returns: The crawled webpages.
Return type: list(webpage.Webpage)

bitcrawler.link_utils module¶

Provides tools for interacting with links and URLs.

class bitcrawler.link_utils.LinkUtils¶

Bases: object

Utils for working with URLs and links.

classmethod get_base_url(url)¶

Gets the base url (scheme://netloc) from a url.

Uses Python's urllib.parse module to extract the scheme and netloc.

Parameters: url (str) – A URL.
Returns: The base url generated from the provided url.
Return type: str

Examples

>>> LinkUtils.get_base_url("http://python.org:8000/test/link/path")
'http://python.org:8000'

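For readers who want to see what "scheme://netloc" means in practice, here is a minimal sketch using only the standard library. The function name base_url is a hypothetical stand-in; this is not necessarily how get_base_url is implemented.

```python
from urllib.parse import urlsplit


def base_url(url):
    """Return scheme://netloc for a URL (illustrative stand-in)."""
    parts = urlsplit(url)
    # netloc keeps the port (and any credentials) exactly as written in the URL.
    return f"{parts.scheme}://{parts.netloc}"


print(base_url("http://python.org:8000/test/link/path"))
```
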
classmethod get_domain(url)¶

Gets the domain from a URL.

Generates the domain from the netloc of the urllib.parse result. The domain is extracted from the netloc by stripping the port and subdomain info.

Parameters: url (str) – A URL.
Returns: The domain from the input URL.
Return type: str

Examples

>>> LinkUtils.get_domain("http://subdomain.python.org:8000/test/link/path")
'python.org'

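Stripping the port and subdomain can be approximated by keeping the last two labels of the hostname. The sketch below assumes that naive rule; naive_domain is a hypothetical name, and the library may use a different strategy.

```python
from urllib.parse import urlsplit


def naive_domain(url):
    """Hypothetical helper: keep the last two labels of the hostname."""
    # .hostname drops the port and lowercases the host; it is None if absent.
    host = urlsplit(url).hostname or ""
    labels = host.split(".")
    return ".".join(labels[-2:])


print(naive_domain("http://subdomain.python.org:8000/test/link/path"))
```

A two-label rule misreads multi-part public suffixes: "https://www.example.co.uk/" yields "co.uk". Handling those correctly would require a public suffix list.
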
classmethod is_relative(link)¶

Determines if a link is a relative link.

Parameters: link (str) – A link.
Returns: True if the link is a relative link. Otherwise False.
Return type: bool

Examples

>>> LinkUtils.is_relative("/test/link/path")
True

>>> LinkUtils.is_relative("http://python.org/test/link/path")
False
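
One common way to test for a relative link is to check that it carries neither a scheme nor a network location. The sketch below assumes that definition; looks_relative is a hypothetical name, and is_relative itself may apply a different test.

```python
from urllib.parse import urlsplit


def looks_relative(link):
    """Hypothetical helper: True when the link has no scheme and no netloc."""
    parts = urlsplit(link)
    return not parts.scheme and not parts.netloc


print(looks_relative("/test/link/path"))
print(looks_relative("http://python.org/test/link/path"))
```

With this definition, protocol-relative links such as "//cdn.example.com/lib.js" count as not relative, because they name a host.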

classmethod is_same_domain(url1, url2)¶

Checks if two URLs share the same domain.

Uses get_domain to extract a domain.

Parameters

url1 (str) – The first URL for comparison.
url2 (str) – The second URL for comparison.

Returns

True if the domains match. Otherwise False.

Return type

bool

Examples

>>> LinkUtils.is_same_domain("http://python.org", "https://subdomain.python.org:8000")
True

>>> LinkUtils.is_same_domain("http://python.org", "https://pandas.com")
False

bitcrawler.parsing module¶

Utilities for parsing html.

Extends functionality of BeautifulSoup for added html parsing functionality.

class bitcrawler.parsing.HtmlParser(markup='', features=None, builder=None, parse_only=None, from_encoding=None, exclude_encodings=None, element_classes=None, **kwargs)¶

Bases: bs4.BeautifulSoup

HtmlParser extends functionality provided by BeautifulSoup.

get_links()¶

Finds links from anchor tags in the soup.

Parameters: None
Returns: A list of links discovered within the html.
Return type: list(str)

Examples

>>> response = requests.get("http://python.org")
>>> HtmlParser(response.text).get_links()
["http://python.org/search", "/about", ..., "http://python.org/learn"]
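
To make "links from anchor tags" concrete without requiring BeautifulSoup, the sketch below collects href values using the standard library's html.parser. AnchorCollector and collect_links are hypothetical names, and this is a stand-in for the idea rather than HtmlParser's implementation.

```python
from html.parser import HTMLParser


class AnchorCollector(HTMLParser):
    """Hypothetical stand-in: collects href values from <a> tags."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        # attrs is a list of (name, value) pairs; value may be None.
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)


def collect_links(html):
    collector = AnchorCollector()
    collector.feed(html)
    return collector.links


sample = ('<p><a href="/about">About</a> '
          '<a href="http://python.org/learn">Learn</a> <a>no href</a></p>')
print(collect_links(sample))
```

As in the get_links example output, relative links such as "/about" come back unresolved; turning them into absolute URLs is a separate step.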

bitcrawler.robots module¶

Utilities for fetching and parsing robots.txt files.

Extends the reppy library (https://github.com/seomoz/reppy) for robots.txt fetching and parsing.

class bitcrawler.robots.ReppyUtils¶

Bases: object

A set of reppy utilities.

classmethod allowed(url, user_agent='python-requests', request_kwargs=None)¶

Determines if a URL is crawlable for a given user agent.

Parameters

url (str) – The url to check for crawlability.
user_agent (str, optional) – The user agent to check for in robots.txt. Default 'python-requests'.
request_kwargs (dict, optional) – The keyword arguments to pass into the requests.get call to the robots.txt url. Default None.

Returns

True if the page is allowed to be crawled.

Return type

bool

Examples

>>> ReppyUtils.allowed('http://python.org/test')
True

classmethod crawl_delay(url, user_agent, request_kwargs=None)¶

Determines the robots crawl delay for a given user agent.

Parameters

url (str) – The url to get a crawl delay for.
user_agent (str) – The user agent to get the crawl delay for.
request_kwargs (dict, optional) – The keyword arguments to pass into the requests.get call to the robots.txt url. Default None.

Returns

The time to wait between crawling pages (seconds).

Return type

int

Examples

>>> ReppyUtils.crawl_delay('http://python.org/test', 'python-requests')
2
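
The two questions answered by allowed and crawl_delay (may this user agent fetch this URL, and how long should it wait between requests) can be illustrated offline. The sketch below uses the standard library's urllib.robotparser as a stand-in for reppy, with a made-up robots.txt body, so it shows the concepts rather than the ReppyUtils implementation.

```python
from urllib.robotparser import RobotFileParser

# Made-up robots.txt content, parsed from memory so no network access is needed.
robots_txt = """\
User-agent: *
Crawl-delay: 2
Disallow: /private/
"""

parser = RobotFileParser()
parser.parse(robots_txt.splitlines())

# Is the URL crawlable for this user agent?
print(parser.can_fetch("python-requests", "http://python.org/test"))
print(parser.can_fetch("python-requests", "http://python.org/private/x"))
# How many seconds should this user agent wait between requests?
print(parser.crawl_delay("python-requests"))
```

The delay is attached to the user agent group, not to a path, so it is the same for every page governed by that robots.txt file.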

classmethod fetch_robots(robots_url, request_kwargs=None)¶

Fetches the robots URL.

Parameters

robots_url (str) – The robots url to fetch.
request_kwargs (dict, optional) – The keyword arguments to pass into the requests.get call to the robots.txt url. Default None.

Returns

The reppy object from fetching the robots.txt file.

Return type

reppy.Robots

classmethod get_robots_url(url)¶

Gets the URL where the robots file should be stored.

Parameters: url (str) – The url to derive the robots url from.
Returns: The robots.txt url.
Return type: str

Examples

>>> ReppyUtils.get_robots_url('http://python.org/test/path')
'http://python.org/robots.txt'
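
A robots.txt file lives at the root of the scheme and host that serve the page, so the URL can be derived by replacing the path and dropping the query and fragment. The sketch below shows that derivation with the standard library; robots_url_for is a hypothetical name and not necessarily how get_robots_url works internally.

```python
from urllib.parse import urlsplit, urlunsplit


def robots_url_for(url):
    """Hypothetical helper: build scheme://netloc/robots.txt for a URL."""
    parts = urlsplit(url)
    # Keep scheme and netloc (including any port); discard path, query, fragment.
    return urlunsplit((parts.scheme, parts.netloc, "/robots.txt", "", ""))


print(robots_url_for("http://python.org/test/path"))
```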

class bitcrawler.robots.RobotsCache(capacity, cache_policy=None, ttl_policy=None, *args, **kwargs)¶

Bases: reppy.cache.RobotsCache

Extends the reppy RobotsCache to include extra functionality.

crawl_delay(url, user_agent='python-requests')¶

Gets a crawl delay for a given url and user_agent. Note: Crawl delay is the same for all pages under a robots.txt file for a given user agent.

Note: Going to open a PR on reppy to have this built in.

Parameters

url (str) – The target URL.
user_agent (str, optional) – The user agent. Default 'python-requests'.

Returns: The number of seconds the specified user_agent should wait between calls.
Return type: int

Examples

>>> crawl_delay("http://python.org", user_agent="python-requests")
2

bitcrawler.webpage module¶

This module provides functionality for fetching a webpage and stores relevant objects from page retrieval.

class bitcrawler.webpage.Webpage¶

Bases: object

Webpage provides the ability to fetch a webpage. Stores data from the retrieval of the page.

classmethod fetch(url, **requests_kwargs)¶

Fetches the webpage for the URL using the requests library.

Parameters

url (str) – The target URL.
**requests_kwargs (kwargs, optional) – Any additional parameters to pass on to the requests library.

Returns

The response from the web request.

Return type

requests.Response

Raises

Exception – Can raise a variety of exceptions. See the requests library for more details.

classmethod get_html_links(url, html)¶

Parses links from an html document.

Parameters

url (str) – The target URL.
html (str) – the html document.

Returns

A list containing all valid urls found in the html.

Return type

list
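
Because get_html_links takes the page URL along with the HTML, a natural reading is that discovered links are resolved against that URL before being returned. The sketch below assumes that reading, and further assumes that "valid" means an absolute http or https URL; absolutize is a hypothetical helper, not the library's code.

```python
from urllib.parse import urljoin, urlsplit


def absolutize(page_url, links):
    """Hypothetical helper: resolve links against page_url, keep http(s) only."""
    resolved = []
    for link in links:
        absolute = urljoin(page_url, link)
        # Assumption: only http and https URLs count as crawlable.
        if urlsplit(absolute).scheme in ("http", "https"):
            resolved.append(absolute)
    return resolved


print(absolutize("http://python.org/docs/",
                 ["/about", "intro.html", "mailto:x@python.org"]))
```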

get_page(url, user_agent, request_kwargs=None, respect_robots=True, reppy=None)¶

Fetches a webpage for the provided URL.

Parameters

url (str) – The url for the webpage.
user_agent (str) – The user_agent to use during requests. Note: This param overrides any user agent kwargs.
request_kwargs (dict, optional) – The page retrieval request kwargs.
respect_robots (bool) – If true, robots.txt will be honored.
reppy (robots.RobotParser, optional) – A robots parsing object.

Returns

The instance of the Webpage class.

Return type

Webpage

get_page_links()¶

Extracts links from a page.

Only supports documents with a content type of ‘text/html’. TODO: Add further support for other doc types.

Returns

A list of links from the page.

Return type

list(str)

Raises

RuntimeError – Raised if this function is called prior to calling Webpage(..).get_page. The response from get_page is required in this function.

classmethod is_allowed_by_robots(url, user_agent, reppy=None, request_kwargs=None)¶

Determine if a page is crawlable by robots.txt.

Leverages the Reppy library for retrieval and parsing of robots.txt.

Parameters

url (str) – The target URL.
user_agent (str) – The user agent being used to crawl the page.
reppy (robots.RobotParser, optional) – A robots parsing object.
request_kwargs (dict, optional) – requests.get kwargs for fetching the robots.txt file.

Returns

True if the page is allowed by robots.txt. Otherwise False.

Return type

bool

classmethod parse_mime_type(mime_type)¶

Parses a mime type into its content type and parameters.

Parameters

mime_type (str) – The mime type string to parse.

Returns

A pair of strings: the type of the content, followed by the content type parameters.

Return type

tuple

Examples

>>> parse_mime_type("text/html; encoding=utf-8")
("text/html", "encoding=utf-8",)
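
Splitting a mime type into its content type and parameters amounts to cutting the string at the first semicolon. The sketch below shows that split with str.partition; split_mime_type is a hypothetical name, and the real parse_mime_type may treat edge cases differently.

```python
def split_mime_type(mime_type):
    """Hypothetical helper: split 'type/subtype; params' at the first ';'."""
    content_type, _, params = mime_type.partition(";")
    # Parameters are an empty string when the header carries none.
    return content_type.strip(), params.strip()


print(split_mime_type("text/html; encoding=utf-8"))
print(split_mime_type("application/json"))
```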

class bitcrawler.webpage.WebpageBuilder¶

Bases: object

Builds a Webpage object by initializing the class and calling the get_page method.

classmethod build(url, user_agent, request_kwargs=None, respect_robots=True, reppy=None)¶

Builds a Webpage by fetching the provided URL.

Parameters

url (str) – The url for the webpage.
user_agent (str) – The user_agent to use during requests. Note: This param overrides any user agent kwargs.
request_kwargs (dict, optional) – The page retrieval request kwargs.
respect_robots (bool) – If true, robots.txt will be honored.
reppy (robots.RobotParser, optional) – A robots parsing object.

Returns

The instance of the Webpage class.

Return type

Webpage
