Module ruya :: Class SingleDomainDelayCrawler

Class SingleDomainDelayCrawler

object --+    
         |    
   Crawler --+
             |
            SingleDomainDelayCrawler

Ruya's single-domain delayed crawler is an enhancement of Ruya's base crawler. It is a breadth-first crawler that waits for a delay between successive crawl requests.
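
The idea of a breadth-first crawl with a pause between requests can be illustrated with a minimal sketch. This is not Ruya's implementation: the function name crawl_breadth_first, the in-memory links graph (used in place of real HTTP requests), and the injectable sleep parameter are all assumptions made for illustration.

```python
import time
from collections import deque


def crawl_breadth_first(start, links, max_levels, delay=1.0, sleep=time.sleep):
    """Visit URLs level by level, pausing `delay` seconds between requests.

    `links` is a hypothetical in-memory link graph (url -> list of urls)
    standing in for real HTTP fetches; it is not part of Ruya.
    """
    order = []                    # URLs in the order they were "fetched"
    seen = {start}                # URLs already queued, to avoid repeats
    queue = deque([(start, 0)])   # FIFO queue gives breadth-first order
    while queue:
        url, level = queue.popleft()
        if order:
            sleep(delay)          # wait before every request except the first
        order.append(url)         # stand-in for the actual request
        if level >= max_levels:
            continue              # deepest level reached: do not follow links
        for target in links.get(url, []):
            if target not in seen:
                seen.add(target)
                queue.append((target, level + 1))
    return order


# Usage: record the pauses instead of really sleeping.
pauses = []
graph = {"a": ["b", "c"], "b": ["d"], "c": ["d", "a"], "d": ["e"]}
order = crawl_breadth_first("a", graph, max_levels=2, delay=0.5, sleep=pauses.append)
```

Passing sleep as a parameter keeps the sketch testable without real waiting; a real crawler would simply use time.sleep.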

Nested Classes
    Inherited from Crawler
  CrawlEventArgs
Ruya's Crawler provides an event-based callback mechanism during a crawl that gives clients more control over which URLs are crawled.
  EventArgs
Ruya's Crawler provides an event-based callback mechanism during a crawl that gives clients more control over which URLs are crawled.
  UriIncludeEventArgs
Ruya's Crawler provides an event-based callback mechanism during a crawl that gives clients more control over which URLs are crawled.
Instance Methods
  None   __init__(self, config)
         Constructor.
  tuple  crawl(self, document, level=0)
         The main method where actual crawling is performed.
  tuple  crawlbreadth(self, level, maxlevels, domainuri, documents)
         The main method where actual breadth-first crawling is performed.

Inherited from object: __delattr__, __getattribute__, __hash__, __new__, __reduce__, __reduce_ex__, __repr__, __setattr__, __str__

    Inherited from Crawler
  tuple  aftercrawl(self, document, level=0)
         Performs a GET crawl on the Document's URL.
  tuple  beforecrawl(self, document, level=0)
         Performs a HEAD crawl on the Document's URL.
  None   bind(self, event, eventhandler, addnleventargs)
         Binds an event handler (a callback function) to one of Ruya's events.
  tuple  firevents(self, events, eargs)
         Fires the event handlers (callback functions) bound to one of Ruya's events.
Instance Variables
    Inherited from Crawler
  callbacks
While a URL is being crawled, these events are fired; clients can subscribe to them for finer control over crawling.
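
An event/callback registry of this general kind can be sketched as follows. The class EventHub, its fire method, the dictionary-based event arguments, and the handler skip_pdfs are hypothetical names chosen for illustration; they are not Ruya's API, whose bind and firevents signatures are listed above but whose internals are not documented here.

```python
class EventHub:
    """Hypothetical stand-in for an event/callback registry (not Ruya's code)."""

    def __init__(self):
        # event name -> list of (handler, extra_args) pairs
        self.callbacks = {}

    def bind(self, event, handler, extra_args=None):
        """Register a handler, with optional extra arguments, for an event."""
        self.callbacks.setdefault(event, []).append((handler, extra_args))

    def fire(self, event, eargs):
        """Call each bound handler; handlers may set 'cancel' or 'ignore'."""
        for handler, extra_args in self.callbacks.get(event, []):
            handler(eargs, extra_args)
            if eargs.get("cancel"):
                break  # assumption: a cancelling handler stops later handlers
        return bool(eargs.get("cancel")), bool(eargs.get("ignore"))


def skip_pdfs(eargs, extra_args):
    # Example handler: mark links with a given suffix as ignored.
    if eargs["url"].endswith(extra_args["suffix"]):
        eargs["ignore"] = True


hub = EventHub()
hub.bind("includelink", skip_pdfs, {"suffix": ".pdf"})
result_pdf = hub.fire("includelink", {"url": "http://example.com/a.pdf"})
result_html = hub.fire("includelink", {"url": "http://example.com/a.html"})
```

Returning a (cancel, ignore) pair from fire mirrors the tuple shape documented for the crawl methods, so a caller can decide whether to stop or skip after the handlers have run.
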
Properties

Inherited from object: __class__

Method Details

__init__(self, config)
(Constructor)

Constructor.
Parameters:
Returns: None
Overrides: Crawler.__init__

crawl(self, document, level=0)

The main method where actual crawling is performed.
Parameters:
Returns: tuple
(cancel, ignore) values set either internally or explicitly by event handlers.
Overrides: Crawler.crawl

crawlbreadth(self, level, maxlevels, domainuri, documents)

The main method where actual breadth-first crawling is performed.
Parameters:
  • level (number) - The level on which the Document is crawled.
  • maxlevels (number) - Maximum number of levels to crawl.
  • domainuri (Uri) - Valid instance of a Uri object.
  • documents (list) - List of Documents to which URLs still to be crawled are appended for later crawling.
Returns: tuple
(nextleveldocs, cancel, ignore) values set either internally or explicitly by event handlers.

Attention: The includelink event is not fired for the first Uri, where the crawl starts; however, the beforecrawl event might be fired if the URL is redirected.
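
The (nextleveldocs, cancel, ignore) return shape of crawlbreadth suggests a level-by-level driver loop, which might look like the sketch below. The names crawl_levels, crawl_one_level and fake_level are hypothetical, and the meanings given to the flags (cancel aborts the whole crawl, ignore discards the current level's documents) are assumptions; the page above does not define them.

```python
def crawl_levels(start_docs, max_levels, crawl_one_level):
    """Drive a crawl one level at a time.

    `crawl_one_level(level, docs)` is a hypothetical callable returning
    (next_level_docs, cancel, ignore), mirroring the documented tuple shape.
    """
    crawled = []
    docs = list(start_docs)
    for level in range(max_levels + 1):
        if not docs:
            break                      # nothing left to crawl
        next_docs, cancel, ignore = crawl_one_level(level, docs)
        if cancel:
            break                      # assumption: cancel aborts the whole crawl
        if not ignore:
            crawled.extend(docs)       # assumption: ignore drops this level's docs
        docs = next_docs               # continue with the next level's documents
    return crawled


def fake_level(level, docs):
    # Stand-in for a real crawl step: look up children in a fixed tree
    # and request cancellation once level 2 is reached.
    tree = {"root": ["a", "b"], "a": ["c"], "b": [], "c": ["d"]}
    next_docs = [child for doc in docs for child in tree.get(doc, [])]
    return next_docs, level == 2, False


crawled = crawl_levels(["root"], 5, fake_level)
```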