bitcrawler
What is it?
Bitcrawler is a Python package that provides functionality for crawling and scraping the web. The library brings simplicity, speed, and extensibility to any crawling project. It can be extended to add crawling behavior and functionality for specific use cases.
Installation
pip install bitcrawler
Dependencies
Reppy
BeautifulSoup4
Requests
Example Crawler
The example below extends the Crawler class and overrides the parse function. The parse function is always called at the end of crawling and is passed all the pages fetched. Here the pages are parsed using BeautifulSoup and each title is printed with its URL.
from bs4 import BeautifulSoup
from bitcrawler import Crawler


class MyCrawler(Crawler):
    def parse(self, webpages):
        for page in webpages:
            # If page response is not None, response code is in the 200s, and document is HTML.
            # The '' default avoids an AttributeError when the content-type header is missing.
            if page.response and \
               page.response.ok and \
               page.response.headers.get('content-type', '').startswith('text/html'):
                soup = BeautifulSoup(page.response.text, "html.parser")
                print(page.url, "- ", soup.title)
        # crawl() returns whatever parse() returns, so hand the pages back to the caller.
        return webpages


# Initializes the crawler with the configuration specified by parameters.
crawler = MyCrawler(cross_site=True, crawl_depth=2, multithreading=True)
# Crawls pages starting from "http://test.com"
crawled_pages = crawler.crawl("http://test.com")
Submodules
bitcrawler.crawler module
This module provides functionality for crawling the web.
-
class
bitcrawler.crawler.Crawler(user_agent='python-requests', crawl_delay=0, crawl_depth=5, cross_site=False, respect_robots=True, respect_robots_crawl_delay=False, multithreading=False, max_threads=100, webpage_builder=<class 'bitcrawler.webpage.WebpageBuilder'>, request_kwargs=None, reppy_cache_capacity=100, reppy_cache_policy=None, reppy_ttl_policy=None, reppy_args=())
Bases: object
Provides functionality for crawling webpages.
-
crawl(url, allowed_domains=None, disallowed_domains=None, page_timeout=10)
Crawls webpages by traversing links.
- Parameters
url (str) – The URL to be crawled.
allowed_domains (list(str)) – A list of allowed domains to crawl. Default None. The original URL's domain takes precedence. cross_site must be enabled.
disallowed_domains (list(str)) – A list of disallowed domains to skip while crawling. Default None. The original URL's domain takes precedence. cross_site must be enabled and allowed_domains must be empty/null.
page_timeout (int, optional) – Number of seconds to allow for page retrieval. Default 10.
- Returns
The result of calling the overridable parse function with the crawled webpages supplied as input.
- Return type
self.parse(webpages)
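To illustrate what "traversing links" up to a crawl depth can look like, here is a minimal sketch that walks a hypothetical in-memory link graph breadth-first. It is not bitcrawler's implementation: LINK_GRAPH, crawl_sketch, and the way depth is counted (the start URL is depth 0) are all assumptions made for the example.

```python
from collections import deque

# Hypothetical in-memory "web": each URL maps to the links found on that page.
LINK_GRAPH = {
    "http://test.com": ["http://test.com/a", "http://test.com/b"],
    "http://test.com/a": ["http://test.com/a/deep"],
    "http://test.com/b": [],
    "http://test.com/a/deep": ["http://test.com/a/deep/deeper"],
    "http://test.com/a/deep/deeper": [],
}


def crawl_sketch(start_url, crawl_depth):
    """Breadth-first traversal that stops following links at crawl_depth."""
    visited = [start_url]
    queue = deque([(start_url, 0)])
    while queue:
        url, depth = queue.popleft()
        if depth >= crawl_depth:
            continue  # Pages at the depth limit are kept, but their links are not followed.
        for link in LINK_GRAPH.get(url, []):
            if link not in visited:
                visited.append(link)
                queue.append((link, depth + 1))
    return visited


pages = crawl_sketch("http://test.com", crawl_depth=2)
```

With a depth of 2, the page three links away from the start URL is never visited. A real crawl would fetch each page over the network and could also filter links by domain before queueing them.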
-
parse(webpages)
Parses the webpages. Meant to be overridden.
- Parameters
webpages (list(webpage.Webpage)) – A list of webpages.
- Returns
The crawled webpages.
- Return type
list(webpage.Webpage)
-
bitcrawler.link_utils module
Provides tools for interacting with links and URLs.
-
class
bitcrawler.link_utils.LinkUtils
Bases: object
Utils for working with URLs and links.
-
classmethod
get_base_url(url)
Gets the base url (scheme://netloc) from a url.
Uses Python's urllib.parse to obtain the scheme and netloc.
- Parameters
url (str) – A URL.
- Returns
The base url generated from the provided url.
- Return type
str
Examples
>>> get_base_url("http://python.org:8000/test/link/path")
"http://python.org:8000"
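The scheme://netloc extraction described above can be sketched with the standard library alone. The helper name base_url_sketch is hypothetical, and this is an illustration of the idea rather than the library's own code.

```python
from urllib.parse import urlparse


def base_url_sketch(url):
    """Return scheme://netloc for a URL, keeping any port in the netloc."""
    parts = urlparse(url)
    return f"{parts.scheme}://{parts.netloc}"


result = base_url_sketch("http://python.org:8000/test/link/path")
```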
-
classmethod
get_domain(url)
Gets the domain from a url.
Generates the domain from the urllib.parse object's netloc. The domain is extracted from the netloc by stripping the port and subdomain info.
- Parameters
url (str) – A URL.
- Returns
The domain from the input URL.
- Return type
str
Examples
>>> get_domain("http://subdomain.python.org:8000/test/link/path")
"python.org"
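One simple way to strip the port and subdomain is to keep only the last two labels of the hostname. The sketch below assumes that approach; domain_sketch is a hypothetical helper and may differ from how bitcrawler actually derives the domain.

```python
from urllib.parse import urlparse


def domain_sketch(url):
    """Naive domain extraction: drop the port, keep the last two host labels."""
    hostname = urlparse(url).hostname or ""  # .hostname already excludes the port
    labels = hostname.split(".")
    return ".".join(labels[-2:])


domain = domain_sketch("http://subdomain.python.org:8000/test/link/path")
```

The last-two-labels rule is only an approximation: for a host such as www.example.co.uk it yields "co.uk", so a production implementation would need a public suffix list.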
-
classmethod
is_relative(link)
Determines if a link is a relative link.
- Parameters
link (str) – A link.
- Returns
True if the link is a relative link. Otherwise False.
- Return type
bool
Examples
>>> is_relative("/test/link/path")
True
>>> is_relative("http://python.org/test/link/path")
False
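A relative-link check can be sketched by testing whether the link lacks both a scheme and a network location. is_relative_sketch is a hypothetical stand-in; the library's own rule may treat edge cases differently.

```python
from urllib.parse import urlparse


def is_relative_sketch(link):
    """Treat a link as relative when it has neither a scheme nor a netloc."""
    parts = urlparse(link)
    return not parts.scheme and not parts.netloc


relative = is_relative_sketch("/test/link/path")
absolute = is_relative_sketch("http://python.org/test/link/path")
```

Under this rule a protocol-relative link such as //python.org/x counts as not relative, because it still names a host.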
-
classmethod
is_same_domain(url1, url2)
Checks if two urls share the same domain.
Uses get_domain to extract a domain.
- Parameters
url1 (str) – The first URL for comparison.
url2 (str) – The second URL for comparison.
- Returns
True if the domains match. Otherwise False.
- Return type
bool
Examples
>>> is_same_domain("http://python.org", "https://subdomain.python.org:8000")
True
>>> is_same_domain("http://python.org", "https://pandas.com")
False
bitcrawler.parsing module
Utilities for parsing html.
Extends functionality of BeautifulSoup for added html parsing functionality.
-
class
bitcrawler.parsing.HtmlParser(markup='', features=None, builder=None, parse_only=None, from_encoding=None, exclude_encodings=None, element_classes=None, **kwargs)
Bases: bs4.BeautifulSoup
HtmlParser extends functionality provided by BeautifulSoup.
-
get_links()
Finds links from anchor tags in the soup.
- Parameters
None –
- Returns
A list of links discovered within the html.
- Return type
list(str)
Examples
>>> response = requests.get("http://python.org")
>>> HtmlParser(response.text).get_links()
["http://python.org/search", "/about", ..., "http://python.org/learn"]
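For readers who want to see the anchor-tag extraction in isolation, the sketch below collects href values using the standard library's html.parser instead of BeautifulSoup. AnchorCollector is a hypothetical class written for this example and is not part of bitcrawler.

```python
from html.parser import HTMLParser


class AnchorCollector(HTMLParser):
    """Collects the href attribute of every <a> tag it encounters."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)


collector = AnchorCollector()
collector.feed(
    '<a href="http://python.org/search">Search</a>'
    '<a href="/about">About</a>'
    '<a name="top">No href here</a>'
)
links = collector.links
```

As in the documented example output, relative links such as /about are returned as found; resolving them against the page URL is a separate step.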
-
bitcrawler.robots module
Utilities for fetching and parsing robots.txt files.
Extends the reppy library (https://github.com/seomoz/reppy) for robots.txt fetching and parsing.
-
class
bitcrawler.robots.ReppyUtils
Bases: object
A set of reppy utilities.
-
classmethod
allowed(url, user_agent='python-requests', request_kwargs=None)
Determines if a URL is crawlable for a given user agent.
- Parameters
url (str) – The url to check for crawlability.
user_agent (str, optional) – The user agent to check for in robots.txt. Default ‘python-requests’.
request_kwargs (dict, optional) – The keyword arguments to pass into the requests.get call to the robots.txt url. Default None.
- Returns
True if the page is allowed to be crawled.
- Return type
bool
Examples
>>> ReppyUtils.allowed('http://python.org/test')
True
-
classmethod
crawl_delay(url, user_agent, request_kwargs=None)
Determines the robots crawl delay for a given user agent.
- Parameters
url (str) – The url to get a crawl delay for.
user_agent (str) – The user agent to get the crawl delay for.
request_kwargs (dict, optional) – The keyword arguments to pass into the requests.get call to the robots.txt url. Default None.
- Returns
The time to wait between crawling pages (seconds).
- Return type
int
Examples
>>> ReppyUtils.crawl_delay('http://python.org/test', 'python-requests')
2
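The allowed and crawl_delay checks can be tried offline with the standard library's urllib.robotparser, used here as a stand-in for reppy. The robots.txt content below is made up for the example, and the stdlib parser may differ from reppy in edge cases.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content, parsed from memory instead of fetched.
ROBOTS_TXT = """\
User-agent: *
Crawl-delay: 2
Disallow: /private/
"""

parser = RobotFileParser()
parser.parse(ROBOTS_TXT.splitlines())

# Is the user agent allowed to fetch each URL?
allowed = parser.can_fetch("python-requests", "http://python.org/test")
blocked = parser.can_fetch("python-requests", "http://python.org/private/page")

# Seconds to wait between requests for this user agent (None if unspecified).
delay = parser.crawl_delay("python-requests")
```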
-
classmethod
fetch_robots(robots_url, request_kwargs=None)
Fetches the robots URL.
- Parameters
robots_url (str) – The robots url to fetch.
request_kwargs (dict, optional) – The keyword arguments to pass into the requests.get call to the robots.txt url. Default None.
- Returns
The reppy object from fetching the robots.txt file.
- Return type
reppy.Robots
-
classmethod
get_robots_url(url)
Gets the URL where the robots.txt file is expected to be located.
- Parameters
url (str) – The url to derive the robots url from.
- Returns
The robots.txt url.
- Return type
str
Examples
>>> ReppyUtils.get_robots_url('http://python.org/test/path')
'http://python.org/robots.txt'
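Deriving the robots.txt location amounts to keeping the scheme and host and replacing the path. A minimal sketch using urllib.parse follows; robots_url_sketch is a hypothetical name, and the library may delegate this step to reppy instead.

```python
from urllib.parse import urlparse


def robots_url_sketch(url):
    """Build the conventional robots.txt URL for the host serving `url`."""
    parts = urlparse(url)
    return f"{parts.scheme}://{parts.netloc}/robots.txt"


robots_url = robots_url_sketch("http://python.org/test/path")
```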
-
class
bitcrawler.robots.RobotsCache(capacity, cache_policy=None, ttl_policy=None, *args, **kwargs)
Bases: reppy.cache.RobotsCache
Extends the reppy RobotsCache to include extra functionality.
-
crawl_delay(url, user_agent='python-requests')
Gets a crawl delay for a given url and user_agent.
Note: Crawl delay is the same for all pages under a robots.txt file for a given user agent.
Note: Going to open a PR on reppy to have this built in.
- Parameters
url (str) – The target URL.
user_agent (str, optional) – The user agent. Default “python-requests”.
- Returns
The number of seconds the specified user_agent should wait between calls.
- Return type
int
Examples
>>> crawl_delay("http://python.org", user_agent="python-requests")
2
-
bitcrawler.webpage module
This module provides functionality for fetching a webpage and storing relevant objects from page retrieval.
-
class
bitcrawler.webpage.Webpage
Bases: object
Webpage provides the ability to fetch a webpage. Stores data from the retrieval of the page.
-
classmethod
fetch(url, **requests_kwargs)
Fetches the webpage for the URL using the requests library.
- Parameters
url (str) – The target URL.
**requests_kwargs (kwargs, optional) – Any additional parameters to pass on to the requests library.
- Returns
The response from the web request.
- Return type
requests.Response
- Raises
Exception – Can raise a variety of exceptions. See the requests library for more details.
-
classmethod
get_html_links(url, html)
Parses links from an html document.
- Parameters
url (str) – The target URL.
html (str) – the html document.
- Returns
A list containing all valid urls found in the html.
- Return type
list
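Since get_html_links receives both the page URL and the HTML, one plausible reading is that relative links are resolved against the page URL and non-web links are dropped. The sketch below shows that idea with urllib.parse.urljoin. The helper name and the http/https filter are assumptions, not the library's documented behavior.

```python
from urllib.parse import urljoin, urlparse


def resolve_links_sketch(page_url, links):
    """Resolve links against the page URL and keep only http(s) results."""
    resolved = []
    for link in links:
        absolute = urljoin(page_url, link)  # Leaves absolute URLs unchanged.
        if urlparse(absolute).scheme in ("http", "https"):
            resolved.append(absolute)
    return resolved


urls = resolve_links_sketch(
    "http://python.org/about/",
    ["/learn", "team.html", "http://python.org/search", "mailto:someone@example.com"],
)
```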
-
get_page(url, user_agent, request_kwargs=None, respect_robots=True, reppy=None)
Fetches a webpage for the provided URL.
- Parameters
url (str) – The url for the webpage.
user_agent (str) – The user_agent to use during requests. Note: This param overrides any user agent kwargs.
request_kwargs (dict, optional) – The page retrieval request kwargs.
respect_robots (bool) – If true, robots.txt will be honored.
reppy (robots.RobotParser, optional) – A robots parsing object.
- Returns
The instance of the Webpage class.
- Return type
this
-
get_page_links()
Extracts links from a page.
Only supports documents with a content type of ‘text/html’.
TODO: Add further support for other doc types.
- Returns
A list of links from the page.
- Return type
list(str)
- Raises
RuntimeError – Raised if this function is called prior to calling Webpage(..).get_page. The response from get_page is required in this function.
-
classmethod
is_allowed_by_robots(url, user_agent, reppy=None, request_kwargs=None)
Determine if a page is crawlable by robots.txt.
Leverages the Reppy library for retrieval and parsing of robots.txt.
- Parameters
url (str) – The target URL.
user_agent (str) – The user agent being used to crawl the page.
reppy (robots.RobotParser, optional) – A robots parsing object.
request_kwargs (dict, optional) – requests.get kwargs for fetching the robots.txt file.
- Returns
True if the page is allowed by robots.txt. Otherwise False.
- Return type
bool
-
classmethod
parse_mime_type(mime_type)
Parses a mime type into its content type and parameters.
- Parameters
mime_type (str) – The mime type string to parse.
- Returns
The type of the content and the content type parameters, both as strings.
- Return type
tuple
Examples
>>> parse_mime_type("text/html; encoding=utf-8")
("text/html", "encoding=utf-8",)
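Splitting a mime type into its content type and parameters can be sketched with str.partition on the first semicolon. parse_mime_type_sketch is a hypothetical helper and only approximates whatever parsing the library performs.

```python
def parse_mime_type_sketch(mime_type):
    """Split 'type/subtype; params' into (content type, parameter string)."""
    content_type, _, parameters = mime_type.partition(";")
    return content_type.strip(), parameters.strip()


parsed = parse_mime_type_sketch("text/html; encoding=utf-8")
```

When the header carries no parameters, the second element is an empty string.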
-
class
bitcrawler.webpage.WebpageBuilder
Bases: object
Builds a Webpage object by initializing the class and calling the get_page method.
-
classmethod
build(url, user_agent, request_kwargs=None, respect_robots=True, reppy=None)
Builds a Webpage by fetching the provided URL.
- Parameters
url (str) – The url for the webpage.
user_agent (str) – The user_agent to use during requests. Note: This param overrides any user agent kwargs.
request_kwargs (dict, optional) – The page retrieval request kwargs.
respect_robots (bool) – If true, robots.txt will be honored.
reppy (robots.RobotParser, optional) – A robots parsing object.
- Returns
The instance of the Webpage class.
- Return type
this