bitcrawler¶
What is it?
Bitcrawler is a Python package that provides functionality for crawling and scraping the web. The library brings simplicity, speed, and extensibility to any crawling project, and it can be extended to easily add crawling behavior and functionality for specific use cases.
Installation
pip install bitcrawler
Dependencies
Reppy
BeautifulSoup4
Requests
Example Crawler
Crawling begins by fetching the supplied URL. The crawler then traverses the links discovered on each page until it reaches the specified crawl depth or runs out of links.
A bitcrawler.webpage.Webpage class instance is returned for each page fetched. For more detail on the Webpage class, see its documentation (https://bitcrawler.readthedocs.io/en/latest/bitcrawler.html#module-bitcrawler.webpage).
Simple Usage
from bitcrawler.crawler import Crawler
crawler = Crawler()
# Returns a list of bitcrawler.webpage.Webpage instances.
# See the Webpage class for more details on its members.
crawled_pages = crawler.crawl('http://test.com')
Advanced Usage
The example below extends the Crawler class and overrides the parse method. The parse method is always called at the end of crawling and is passed all of the pages fetched. In this example, the pages are parsed with BeautifulSoup and each page's title is printed alongside its URL.
from bs4 import BeautifulSoup
from bitcrawler.crawler import Crawler
from bitcrawler import webpage
class MyCrawler(Crawler):
# Parse is always called by the `crawl` method and is provided
# a webpage.Webpage class instance for each URL.
# See the webpage.Webpage class for details about the object.
def parse(self, webpages):
for page in webpages:
# If the page response is not None, the response code is in the 200s, and the document is HTML.
if page.response and \
page.response.ok and \
page.response.headers.get('content-type', '').startswith('text/html'):
soup = BeautifulSoup(page.response.text, "html.parser")
print(page.url, "- ", soup.title)
return webpages
# Initializes the crawler with the configuration specified by parameters.
crawler = MyCrawler(
user_agent='python-requests', # The User Agent to use for all requests.
crawl_delay=0, # Number of seconds to wait between web requests.
crawl_depth=2, # The max depth of links to follow (default is 5).
cross_site=False, # If true, domains other than the original domain can be crawled.
respect_robots=True, # If true, the robots.txt standard will be followed.
respect_robots_crawl_delay=True, # If true, the robots.txt crawl delay will be followed.
multithreading=True, # If true, parallelizes requests for faster crawling.
max_threads=100, # If multithreading is true, this determines the number of threads.
webpage_builder=webpage.WebpageBuilder, # Advanced Usage - Allows the WebpageBuilder class to be overridden for modification.
request_kwargs={'timeout': 10}, # Additional keyword arguments to pass into any request made.
reppy_cache_capacity=100, # The number of robots.txt objects to cache. Avoids fetching the same robots.txt file many times.
reppy_cache_policy=None, # Advanced Usage - See docs for details.
reppy_ttl_policy=None, # Advanced Usage - See docs for details.
reppy_args=tuple()) # Advanced Usage - See docs for details.
# Crawls pages starting from "http://test.com"
# Returns a list of bitcrawler.webpage.Webpage instances.
# See the Webpage class for more details on its members.
crawled_pages = crawler.crawl(
url="http://test.com", # The start URL to crawl from.
allowed_domains=[], # A list of allowed domains. `cross_site` must be True. Ex. ['python.org',...]
disallowed_domains=[], # A list of disallowed domains. `cross_site` must be True and `allowed_domains` empty.
page_timeout=10) # The amount of time (in seconds) before a page retrieval/build times out.
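Whichever configuration is used, the returned pages can then be filtered and inspected. The sketch below uses stand-in objects in place of real crawl results, relying only on the url and response attributes documented on the Webpage class; it is an illustration, not part of the library:

```python
from types import SimpleNamespace

def successful_pages(pages):
    """Keep only pages that were fetched and returned a non-error response."""
    return [page for page in pages if page.response and page.response.ok]

# Stand-ins for bitcrawler.webpage.Webpage instances.
pages = [
    SimpleNamespace(url="http://test.com/", response=SimpleNamespace(ok=True)),
    SimpleNamespace(url="http://test.com/missing", response=None),
]

for page in successful_pages(pages):
    print(page.url)  # prints "http://test.com/"
```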
Submodules¶
bitcrawler.crawler module¶
This module provides functionality for crawling the web.
- class bitcrawler.crawler.Crawler(user_agent='python-requests', crawl_delay=0, crawl_depth=5, cross_site=False, respect_robots=True, respect_robots_crawl_delay=False, multithreading=False, max_threads=100, webpage_builder=<class 'bitcrawler.webpage.WebpageBuilder'>, request_kwargs=None, reppy_cache_capacity=100, reppy_cache_policy=None, reppy_ttl_policy=None, reppy_args=())¶
Bases:
object
Provides functionality for crawling webpages.
- crawl(urls, allowed_domains=None, disallowed_domains=None, page_timeout=10)¶
Crawls webpages by traversing links.
- Parameters:
urls (str or list(str)) – The URL or list of start URLs to be crawled.
allowed_domains (list(str)) – A list of allowed domains to crawl. Default None. The original URL's domain takes precedence. cross_site must be enabled.
disallowed_domains (list(str)) – A list of disallowed domains to avoid crawling. Default None. The original URL's domain takes precedence. cross_site must be enabled and allowed_domains must be empty/None.
page_timeout (int, optional) – Number of seconds to allow for page retrieval. Default 10.
- Returns:
- Returns a call to the overridable parse function.
Supplies the webpages as input.
- Return type:
self.parse(webpages)
- parse(webpages)¶
Parses the webpages. Meant to be overridden.
- Parameters:
webpages (list(webpage.Webpage)) – A list of webpages.
- Returns:
The crawled webpages.
- Return type:
list(webpage.Webpage)
bitcrawler.link_utils module¶
Provides tools for interacting with links and URLs.
- class bitcrawler.link_utils.LinkUtils¶
Bases:
object
Utils for working with URLs and links.
- classmethod get_base_url(url)¶
Gets the base url (scheme://netloc) from a url.
Uses Python's urllib.parse to extract the scheme and netloc.
- Parameters:
url (str) – A URL.
- Returns:
The base url generated from the provided url.
- Return type:
string
Examples
>>> get_base_url("http://python.org:8000/test/link/path")
"http://python.org:8000"
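The conversion above can be approximated with the standard library alone; this is a simplified sketch of the same idea, not the library's implementation:

```python
from urllib.parse import urlparse

def base_url(url):
    """Rebuild scheme://netloc from a full URL."""
    parts = urlparse(url)
    return f"{parts.scheme}://{parts.netloc}"

print(base_url("http://python.org:8000/test/link/path"))  # http://python.org:8000
```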
- classmethod get_domain(url)¶
Gets the domain from a url.
Generates the domain from the urllib.parse netloc by stripping the port and subdomain info.
- Parameters:
url (str) – A URL.
- Returns:
The domain of the input URL.
- Return type:
str
Examples
>>> get_domain("http://subdomain.python.org:8000/test/link/path")
"python.org"
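The stripping described above can be sketched with urllib.parse. This is a naive reimplementation for illustration only; it mishandles multi-part suffixes such as .co.uk:

```python
from urllib.parse import urlparse

def domain(url):
    """Drop the port, then keep the last two dot-separated labels."""
    netloc = urlparse(url).netloc.split(":")[0]
    return ".".join(netloc.split(".")[-2:])

print(domain("http://subdomain.python.org:8000/test/link/path"))  # python.org
```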
- classmethod is_relative(link)¶
Determines if a link is a relative link.
- Parameters:
link (str) – A link.
- Returns:
True if the link is a relative link. Otherwise False.
- Return type:
bool
Examples
>>> is_relative("/test/link/path")
True
>>> is_relative("http://python.org/test/link/path")
False
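A relative link is one that carries no scheme or host. A minimal stdlib sketch of that check, for illustration rather than the library's actual code:

```python
from urllib.parse import urlparse

def is_relative(link):
    """A link is treated as relative when it has neither a scheme nor a netloc."""
    parts = urlparse(link)
    return not parts.scheme and not parts.netloc

print(is_relative("/test/link/path"))                   # True
print(is_relative("http://python.org/test/link/path"))  # False
```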
- classmethod is_same_domain(url1, url2)¶
Checks if two urls share the same domain.
Uses get_domain to extract each URL's domain.
- Parameters:
url1 (str) – The first URL for comparison.
url2 (str) – The second URL for comparison.
- Returns:
True if the domains match. Otherwise False.
- Return type:
bool
Examples
>>> is_same_domain("http://python.org", "https://subdomain.python.org:8000")
True
>>> is_same_domain("http://python.org", "https://pandas.com")
False
bitcrawler.parsing module¶
Utilities for parsing html.
Extends functionality of BeautifulSoup for added html parsing functionality.
- class bitcrawler.parsing.HtmlParser(markup='', features=None, builder=None, parse_only=None, from_encoding=None, exclude_encodings=None, element_classes=None, **kwargs)¶
Bases:
BeautifulSoup
HtmlParser extends functionality provided by BeautifulSoup.
- get_links()¶
Finds links from anchor tags in the soup.
- Parameters:
None
- Returns:
A list of links discovered within the html.
- Return type:
list(str)
Examples
>>> response = requests.get("http://python.org")
>>> HtmlParser(response.text).get_links()
["http://python.org/search", "/about", ..., "http://python.org/learn"]
bitcrawler.robots module¶
Utilities for fetching and parsing robots.txt files.
Extends the reppy library (https://github.com/seomoz/reppy) for robots.txt fetching and parsing.
- class bitcrawler.robots.ReppyUtils¶
Bases:
object
A set of reppy utilities.
- classmethod allowed(url, user_agent='python-requests', request_kwargs=None)¶
Determines if a URL is crawlable for a given user agent.
- Parameters:
url (str) – The url to check for crawlability.
user_agent (str, optional) – The user agent to check for in robots.txt. Default ‘python-requests’.
request_kwargs (dict, optional) – The keyword arguments to pass into the requests.get call for the robots.txt url. Default None.
- Returns:
True if the page is allowed to be crawled.
- Return type:
bool
Examples
>>> ReppyUtils.allowed('http://python.org/test')
True
- classmethod crawl_delay(url, user_agent, request_kwargs=None)¶
Determines the robots crawl delay for a given user agent.
- Parameters:
url (str) – The url to get a crawl delay for.
user_agent – The user agent to get the crawl delay for.
request_kwargs (dict, optional) – The keyword arguments to pass into the requests.get call for the robots.txt url. Default None.
- Returns:
The time to wait between crawling pages (seconds).
- Return type:
int
Examples
>>> ReppyUtils.crawl_delay('http://python.org/test', 'python-requests')
2
- classmethod fetch_robots(robots_url, request_kwargs=None)¶
Fetches the robots URL.
- Parameters:
robots_url (str) – The robots url to fetch.
request_kwargs (dict, optional) – The keyword arguments to pass into the requests.get call for the robots.txt url. Default None.
- Returns:
The reppy object from fetching the robots.txt file.
- Return type:
reppy.Robots
- classmethod get_robots_url(url)¶
Gets the URL where the robots file should be stored.
- Parameters:
url (str) – The url to derive the robots url from.
- Returns:
The robots.txt url.
- Return type:
str
Examples
>>> ReppyUtils.get_robots_url('http://python.org/test/path')
'http://python.org/robots.txt'
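Since robots.txt always lives at the root of scheme://netloc, the derivation can be sketched as follows (an illustrative reimplementation, not the library's code):

```python
from urllib.parse import urlparse

def robots_url(url):
    """Derive the robots.txt location from any URL on the site."""
    parts = urlparse(url)
    return f"{parts.scheme}://{parts.netloc}/robots.txt"

print(robots_url("http://python.org/test/path"))  # http://python.org/robots.txt
```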
- class bitcrawler.robots.RobotsCache(capacity, cache_policy=None, ttl_policy=None, *args, **kwargs)¶
Bases:
RobotsCache
Extends the reppy RobotsCache to include extra functionality.
- crawl_delay(url, user_agent='python-requests')¶
Gets the crawl delay for a given url and user_agent.
Note: The crawl delay is the same for all pages under a robots.txt file for a given user agent.
Note: Going to open a PR on reppy to have this built in.
- Parameters:
url (str) – The target URL.
user_agent (str, optional) – The user agent. Default 'python-requests'.
- Returns:
The number of seconds the specified user_agent should wait between calls.
- Return type:
int
Examples
>>> crawl_delay("http://python.org", user_agent="python-requests")
2
bitcrawler.webpage module¶
This module provides functionality for fetching a webpage and stores relevant objects from page retrieval.
- class bitcrawler.webpage.Webpage¶
Bases:
object
Webpage provides the ability to fetch a webpage. Stores data from the retrieval of the page.
- url¶
The URL associated with the webpage.
- Type:
str
- response¶
The requests library Response object from fetching the page.
- Type:
obj requests.Response
- links¶
A list of the links found on the page.
- Type:
list(str)
- allowed_by_robots¶
If true, the page is crawlable by robots.txt.
- Type:
bool
- message¶
Message detailing any issues fetching the page.
- Type:
str
- error¶
Any error that was raised during page retrieval.
- Type:
obj Exception
- classmethod fetch(url, **requests_kwargs)¶
Fetches the webpage for the URL using the requests library.
- Parameters:
url (str) – The target URL.
**requests_kwargs (kwargs, optional) – Any additional parameters to pass on to the requests library.
- Returns:
The response from the web request.
- Return type:
obj requests.Response
- Raises:
Exception – Can raise a variety of exceptions. See the requests library for more details.
- classmethod get_html_links(url, html)¶
Parses links from an html document.
- Parameters:
url (str) – The target URL.
html (str) – The html document.
- Returns:
A list containing all valid urls found in the html.
- Return type:
list
- get_page_links()¶
Extracts links from a page.
Only supports documents with a content type of ‘text/html’. TODO: Add further support for other doc types.
- Returns:
A list of links from the page.
- Return type:
list(str)
- classmethod is_allowed_by_robots(url, user_agent, reppy=None, request_kwargs=None)¶
Determine if a page is crawlable by robots.txt.
Leverages the Reppy library for retrieval and parsing of robots.txt.
- Parameters:
url (str) – The target URL.
user_agent (str) – The user agent being used to crawl the page.
reppy (obj robots.RobotParser, optional) – A robots parsing object.
request_kwargs (dict, optional) – requests.get kwargs for fetching the robots.txt file.
- Returns:
True if the page is allowed by robots.txt. Otherwise False.
- Return type:
bool
- classmethod parse_mime_type(mime_type)¶
Parses a mime type into its content type and parameters.
- Parameters:
mime_type (str) – The mime type string to parse.
- Returns:
The content type and the content type parameters.
- Return type:
tuple(str, str)
Examples
>>> parse_mime_type("text/html; encoding=utf-8") ("text/html", "encoding=utf-8",)
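The split described above can be sketched with str.partition; a simplified stand-in for illustration, not the library's implementation:

```python
def parse_mime_type(mime_type):
    """Split a MIME type string into its content type and parameter string."""
    content_type, _, params = mime_type.partition(";")
    return content_type.strip(), params.strip()

print(parse_mime_type("text/html; encoding=utf-8"))  # ('text/html', 'encoding=utf-8')
```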
- class bitcrawler.webpage.WebpageBuilder¶
Bases:
object
Builds a Webpage object by initializing the class and calling the get_page method.
- classmethod build(url, user_agent, request_kwargs=None, respect_robots=True, reppy=None)¶
Builds a Webpage by fetching the provided URL.
- Parameters:
url (str) – The URL to fetch.
user_agent (str) – The user_agent to use during requests. Note: This param overrides any user agent kwargs.
request_kwargs (dict, optional) – The page retrieval request kwargs.
respect_robots (bool) – If true, robots.txt will be honored.
reppy (obj robots.RobotParser, optional) – A robots parsing object.
- Returns:
The instance of the Webpage class.
- Return type:
webpage.Webpage