Module ruya :: Class Config :: Class CrawlConfig

Class CrawlConfig


object --+
         |
        Config.CrawlConfig

Ruya's crawler configuration object stores the settings that apply to a single crawl. It supports all settings needed by a well-behaved crawler that obeys robots conventions.
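A minimal usage sketch follows. The import path `ruya.Config` is inferred from the module header above; the attribute names come from the Instance Variables section, while the values are illustrative, not defaults:

    from ruya import Config

    # Create a configuration populated with the documented defaults.
    cfg = Config.CrawlConfig()

    # Identify the crawler politely via the documented instance variables.
    cfg.useragent = 'MyCrawler/1.0 (+http://crawler.example.com)'
    cfg.crawlfrom = 'webmaster@example.com'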

Instance Methods
__init__(self, useragent='Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 5.1; SV1; .NET ..., crawlfrom='', obeyrobotstxt=True, obeymetarobots=True, acceptencoding='gzip, deflate', crawldelay=120, crawlscope=100001, allowedmimes=['text/html'], allowedextns=['', '.htm', '.html', '.cgi', '.php', '.jsp', '.cfm', '.asp', ..., levels=2, maxcontentbytes=500000, maxcontenttruncate=True, maxretries=3, retrydelay=120)
Constructor.

Inherited from object: __delattr__, __getattribute__, __hash__, __new__, __reduce__, __reduce_ex__, __repr__, __setattr__, __str__

Instance Variables
  useragent
User-agent header to use during crawl - Specify a valid User-agent string identifying your crawler.
  crawlfrom
From header to use during crawl - Specify your email address here.
  obeyrobotstxt
Whether to obey or ignore robots.txt - If possible, always obey robots.txt during crawl.
  obeymetarobots
Whether to obey or ignore meta robots directives - Crawler options can also be specified directly within HTML via the <meta name="robots"> tag.
  acceptencoding
Accept-encoding header to use during crawl - Specify whether to accept gzip-compressed content or plain text only.
  crawldelay
Number of seconds (default 120) to wait before crawling the next url within a website.
  crawlscope
CrawlScope to use while crawling a website.
  allowedmimes
Valid MIME types to accept (default ['text/html']) during crawl.
  allowedextns
Valid extensions to accept during crawl.
  levels
Number of levels deep (default 2) to crawl within a website.
  maxcontentbytes
Upper limit of page size in bytes (default 500000, roughly 500 KB) to download during crawl.
  maxretries
Number of times to retry (default 3) a failed, unavailable url during crawl - A url might be temporarily unavailable, but become available again after a while.
  retrydelay
Number of seconds to wait before retrying (default 120 seconds) a failed, unavailable url.
  maxcontenttruncate
Whether to keep the content downloaded up to max. page-size when a page exceeds it (default True).
  maxcontentdiscard
Whether to discard a page completely if its size exceeds max. page-size (default False).
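The two size-handling flags work together; the sketch below shows the interplay as assumed from the descriptions above. Note that maxcontentdiscard appears only as an instance variable, not a constructor parameter, so it is set after construction; the values are illustrative:

    from ruya import Config

    cfg = Config.CrawlConfig()

    # Lower the page-size cap, and switch from truncating oversized
    # pages to discarding them entirely.
    cfg.maxcontentbytes = 250000
    cfg.maxcontenttruncate = False
    cfg.maxcontentdiscard = True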
Properties

Inherited from object: __class__

Method Details

__init__(self, useragent='Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 5.1; SV1; .NET ..., crawlfrom='', obeyrobotstxt=True, obeymetarobots=True, acceptencoding='gzip, deflate', crawldelay=120, crawlscope=100001, allowedmimes=['text/html'], allowedextns=['', '.htm', '.html', '.cgi', '.php', '.jsp', '.cfm', '.asp', ..., levels=2, maxcontentbytes=500000, maxcontenttruncate=True, maxretries=3, retrydelay=120)
(Constructor)

Constructor. Provides default values for all settings.
Returns: None
Overrides: object.__init__

Note: Please refer to Instance Variables section for details on each parameter.
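For example, a polite-crawl configuration might be built as follows. The parameter names are taken from the signature above, while the sketch and its values are illustrative:

    from ruya import Config

    cfg = Config.CrawlConfig(
        crawlfrom='webmaster@example.com',  # From: header for the crawl
        obeyrobotstxt=True,                 # honour robots.txt
        obeymetarobots=True,                # honour meta robots directives
        crawldelay=120,                     # seconds between urls on a site
        levels=2,                           # crawl depth within a site
        maxretries=3,                       # retries for unavailable urls
        retrydelay=120,                     # seconds before each retry
    )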