Module ruya :: Class Config :: Class CrawlConfig

Class CrawlConfig


object --+
         |
        Config.CrawlConfig

Ruya's crawler configuration object stores the settings that apply to a single crawl. It supports all settings needed by a well-behaved crawler that obeys robots conventions.
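A minimal usage sketch follows. The import path `ruya.Config` is inferred from the module header above; the attribute names come from the Instance Variables section, while the values are illustrative, not defaults:

    from ruya import Config

    # Create a configuration populated with the documented defaults.
    cfg = Config.CrawlConfig()

    # Identify the crawler politely via the documented instance variables.
    cfg.useragent = 'MyCrawler/1.0 (+http://crawler.example.com)'
    cfg.crawlfrom = 'webmaster@example.com'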

Instance Methods
__init__(self, useragent='Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 5.1; SV1; .NET ..., crawlfrom='', obeyrobotstxt=True, obeymetarobots=True, acceptencoding='gzip, deflate', crawldelay=120, crawlscope=100001, allowedmimes=['text/html'], allowedextns=['', '.htm', '.html', '.cgi', '.php', '.jsp', '.cfm', '.asp', ..., levels=2, maxcontentbytes=500000, maxcontenttruncate=True, maxretries=3, retrydelay=120)
Constructor.

Inherited from object: __delattr__, __getattribute__, __hash__, __new__, __reduce__, __reduce_ex__, __repr__, __setattr__, __str__

Instance Variables
  useragent
User-agent header to use during crawl - Specify a valid User-agent string identifying your crawler.
  crawlfrom
From header to use during crawl - Specify your email address here.
  obeyrobotstxt
Whether to obey or ignore robots.txt - If possible, always obey robots.txt during crawl.
  obeymetarobots
Whether to obey or ignore meta robots directives - Crawler options can also be specified directly within HTML via the <meta name="robots"> tag.
  acceptencoding
Accept-encoding header to use during crawl - Specify whether to accept gzip-compressed content or plain text only.
  crawldelay
Number of seconds (default 120) to wait before crawling the next url within a website.
  crawlscope
CrawlScope to use while crawling a website.
  allowedmimes
Valid MIME types to accept (default ['text/html']) during crawl.
  allowedextns
Valid extensions to accept during crawl.
  levels
Number of levels deep (default 2) to crawl within a website.
  maxcontentbytes
Upper limit of page size in bytes (default 500000, roughly 500 KB) to download during crawl.
  maxretries
Number of times to retry (default 3) a failed, unavailable url during crawl - A url might be temporarily unavailable, but become available again after a while.
  retrydelay
Number of seconds to wait before retrying (default 120 seconds) a failed, unavailable url.
  maxcontenttruncate
Whether to keep the content downloaded up to max. page-size when a page exceeds it (default True).
  maxcontentdiscard
Whether to discard a page completely if its size exceeds max. page-size (default False).
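The two size-handling flags work together; the sketch below shows the interplay as assumed from the descriptions above. Note that maxcontentdiscard appears only as an instance variable, not a constructor parameter, so it is set after construction; the values are illustrative:

    from ruya import Config

    cfg = Config.CrawlConfig()

    # Lower the page-size cap, and switch from truncating oversized
    # pages to discarding them entirely.
    cfg.maxcontentbytes = 250000
    cfg.maxcontenttruncate = False
    cfg.maxcontentdiscard = True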
Properties

Inherited from object: __class__

Method Details

__init__(self, useragent='Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 5.1; SV1; .NET ..., crawlfrom='', obeyrobotstxt=True, obeymetarobots=True, acceptencoding='gzip, deflate', crawldelay=120, crawlscope=100001, allowedmimes=['text/html'], allowedextns=['', '.htm', '.html', '.cgi', '.php', '.jsp', '.cfm', '.asp', ..., levels=2, maxcontentbytes=500000, maxcontenttruncate=True, maxretries=3, retrydelay=120)
(Constructor)

Constructor. Provides default values for all settings.
Returns: None
Overrides: object.__init__

Note: Please refer to Instance Variables section for details on each parameter.
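For example, a polite-crawl configuration might be built as follows. The parameter names are taken from the signature above, while the sketch and its values are illustrative:

    from ruya import Config

    cfg = Config.CrawlConfig(
        crawlfrom='webmaster@example.com',  # From: header for the crawl
        obeyrobotstxt=True,                 # honour robots.txt
        obeymetarobots=True,                # honour meta robots directives
        crawldelay=120,                     # seconds between urls on a site
        levels=2,                           # crawl depth within a site
        maxretries=3,                       # retries for unavailable urls
        retrydelay=120,                     # seconds before each retry
    )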