Module ruya :: Class Crawler

Class Crawler

object --+
         |
        Crawler
Known Subclasses:
SingleDomainDelayCrawler

Ruya's main object is the Crawler. It uses configuration settings and performs a crawl on a given url. Developers can extend Ruya's Crawler to create more sophisticated crawlers, similar to Ruya's SingleDomainDelayCrawler.
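Since Crawler is designed for extension, the subclassing pattern might look like the sketch below. Note that `BaseCrawler` is only a stand-in for ruya.Crawler (not the real implementation), and `PoliteCrawler` with its `delay` parameter is hypothetical, loosely in the spirit of SingleDomainDelayCrawler:

```python
# Illustrative sketch only: BaseCrawler stands in for ruya.Crawler,
# whose constructor takes a config object (see __init__ below).
class BaseCrawler:
    def __init__(self, config):
        self.config = config

    def crawl(self, document, level=0):
        # The real Crawler performs the HEAD/GET crawl here and
        # returns a (cancel, ignore) tuple.
        return (False, False)

class PoliteCrawler(BaseCrawler):
    """Hypothetical subclass in the spirit of SingleDomainDelayCrawler."""
    def __init__(self, config, delay=1.0):
        super().__init__(config)
        self.delay = delay  # assumed: seconds to wait between requests

    def crawl(self, document, level=0):
        # A subclass can add behaviour (delays, scoping, logging)
        # around the base crawl.
        return super().crawl(document, level)
```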

Nested Classes
  EventArgs
  CrawlEventArgs
  UriIncludeEventArgs
These event-argument classes support Ruya's event-based callback mechanism during a crawl, which lets clients control which urls are crawled.
Instance Methods
  __init__(self, config) -> None
      Constructor.
  bind(self, event, eventhandler, addnleventargs) -> None
      Binds an eventhandler (a callback function) to one of Ruya's events.
  firevents(self, events, eargs) -> tuple
      Fires the eventhandlers (callback functions) bound to one of Ruya's events.
  beforecrawl(self, document, level=0) -> tuple
      Performs a HEAD crawl on a Document's url.
  aftercrawl(self, document, level=0) -> tuple
      Performs a GET crawl on a Document's url.
  crawl(self, document, level=0) -> tuple
      The main method where the actual crawling is performed.

Inherited from object: __delattr__, __getattribute__, __hash__, __new__, __reduce__, __reduce_ex__, __repr__, __setattr__, __str__

Instance Variables
  callbacks
During the crawl of a url, these events are invoked; clients can subscribe to them for a finer level of control over crawling.
Properties

Inherited from object: __class__

Method Details

__init__(self, config)
(Constructor)

Constructor.
Parameters:
  • config - Configuration settings used by the crawler during the crawl.
Returns: None
Overrides: object.__init__

bind(self, event, eventhandler, addnleventargs)

Binds an eventhandler (a callback function) to one of Ruya's events. Example:

     def myfunction(caller, eventargs):
        ...

     crawlerobj.bind('beforecrawl', myfunction, None)
Parameters:
  • event (str) - Must be one of the following values: beforecrawl, aftercrawl, includelink.
  • eventhandler (function) - User-defined function with the signature function(caller, eventargs).
  • addnleventargs (list) - Additional event arguments passed when the eventhandler is called.
Returns: None

Note: The eventhandler must have the signature func(caller, eventargs).

See Also: callbacks
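The bind/callback pattern can be illustrated with a self-contained sketch. `MiniCrawler` below is a stand-in for illustration only, not Ruya's actual implementation; the event names and the func(caller, eventargs) handler signature follow this documentation:

```python
# Minimal sketch of the bind/callback pattern: MiniCrawler is a
# stand-in, not ruya's implementation.
class MiniCrawler:
    EVENTS = ('beforecrawl', 'aftercrawl', 'includelink')

    def __init__(self):
        # Maps each event name to its list of (handler, extra-args) pairs.
        self.callbacks = {name: [] for name in self.EVENTS}

    def bind(self, event, eventhandler, addnleventargs=None):
        if event not in self.callbacks:
            raise ValueError('unknown event: %s' % event)
        self.callbacks[event].append((eventhandler, addnleventargs))

seen = []

def myfunction(caller, eventargs):
    # Handlers receive the calling crawler and the event arguments.
    seen.append(eventargs)

crawlerobj = MiniCrawler()
crawlerobj.bind('beforecrawl', myfunction, None)

# Invoking the bound handlers for one event:
for handler, extra in crawlerobj.callbacks['beforecrawl']:
    handler(crawlerobj, 'dummy-eventargs')
```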

firevents(self, events, eargs)

Fires the eventhandlers (callback functions) bound to one of Ruya's events.
Parameters:
  • events - The bound eventhandlers (with their additional arguments) to invoke.
  • eargs - The event-argument instance passed to each eventhandler.
Returns: tuple
(cancel, ignore) values set either internally or explicitly by event handlers.

See Also: bind

Note: While invoking multiple event handlers sequentially, if any handler sets ignore to True, the value is remembered and cannot be reset by a later handler in the chain.
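The sticky-ignore rule in the note above can be sketched as follows; `firevents` here is a simplified stand-in, not Ruya's implementation, and the attribute names on `EventArgs` are assumptions:

```python
# Sketch of the "sticky ignore" rule: when several handlers run in
# sequence, any handler setting ignore=True is remembered and cannot
# be reset by a later handler.  Names here are illustrative.
class EventArgs:
    def __init__(self):
        self.cancel = False
        self.ignore = False

def firevents(handlers, caller, eargs):
    cancel, ignore = False, False
    for handler in handlers:
        handler(caller, eargs)
        cancel = eargs.cancel            # latest value wins
        ignore = ignore or eargs.ignore  # True is sticky
        eargs.ignore = ignore            # later handlers cannot reset it
    return (cancel, ignore)

def h1(caller, eargs):
    eargs.ignore = True   # asks the crawler to skip this url

def h2(caller, eargs):
    eargs.ignore = False  # attempt to reset; has no effect

result = firevents([h1, h2], caller=None, eargs=EventArgs())
```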

beforecrawl(self, document, level=0)

Performs a HEAD crawl on a Document's url. The beforecrawl events are fired before a url is crawled. The method uses headers from the Document instance, and obeys robots.txt rules when deciding whether the url may be crawled.
Parameters:
  • document (Document) - Valid instance of a Document object.
  • level (int) - The current level of the document being crawled.
Returns: tuple
(cancel, ignore) values set either internally or explicitly by event handlers.

Note: As redirects are also handled, the beforecrawl event can be fired multiple times if a url is redirected to another url.

aftercrawl(self, document, level=0)

Performs a GET crawl on a Document's url. The aftercrawl events are fired after a url has been crawled successfully. The attributes of the Document object are extracted in this method, and the CrawlScope is considered before any links are included for the Document.
Parameters:
  • document (Document) - Valid instance of a Document object.
  • level (int) - The current level of the document being crawled.
Returns: tuple
(cancel, ignore) values set either internally or explicitly by event handlers.
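As an illustration of scoping links before they are included, a hypothetical includelink-style handler might restrict the crawl to a single domain, in the spirit of SingleDomainDelayCrawler. The `includelink` attribute on the event arguments below is an assumption for illustration, not Ruya's documented API:

```python
# Hedged sketch: an includelink-style handler that keeps the crawl
# within one domain, similar in spirit to applying a CrawlScope
# before links are included.  Attribute names are assumptions.
from urllib.parse import urlparse

class UriIncludeEventArgs:
    def __init__(self, uri):
        self.uri = uri
        self.includelink = True  # assumed flag a handler may clear

def same_domain_only(root):
    rootdomain = urlparse(root).netloc
    def handler(caller, eargs):
        if urlparse(eargs.uri).netloc != rootdomain:
            eargs.includelink = False  # out of scope: drop the link
    return handler

handler = same_domain_only('http://www.example.com/')
inside = UriIncludeEventArgs('http://www.example.com/page.html')
outside = UriIncludeEventArgs('http://other.example.org/')
handler(None, inside)
handler(None, outside)
```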

crawl(self, document, level=0)

The main method where the actual crawling is performed.
Parameters:
  • document (Document) - Valid instance of a Document object.
  • level (int) - The current level of the document being crawled.
Returns: tuple
(cancel, ignore) values set either internally or explicitly by event handlers.
To Do:
  • URL canonicalization: http://www.archive.org/index.html and http://www.archive.org/ are the same.
  • Tidy?
  • Avoiding slow links? Currently handled by the timeout from httplib.
  • Support Crawl-Delay from robots.txt?
  • Detecting "soft" 404s? e.g. http://blog.dtiblog.com/community-hobby.html => http://blog.dtiblog.com/404.html
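The first todo item, URL canonicalization, could be sketched as below; the set of index-page names treated as defaults is an assumption for illustration:

```python
# Sketch of url canonicalization: normalise urls so that
# http://www.archive.org/index.html and http://www.archive.org/
# compare equal.  DEFAULT_PAGES is an illustrative assumption.
from urllib.parse import urlsplit, urlunsplit

DEFAULT_PAGES = ('index.html', 'index.htm', 'default.htm')

def canonicalize(url):
    scheme, netloc, path, query, fragment = urlsplit(url)
    scheme, netloc = scheme.lower(), netloc.lower()
    # Strip a trailing default index page.
    parts = path.rsplit('/', 1)
    if len(parts) == 2 and parts[1] in DEFAULT_PAGES:
        path = parts[0] + '/'
    if not path:
        path = '/'
    return urlunsplit((scheme, netloc, path, query, ''))  # drop fragment
```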

Instance Variable Details

callbacks

During the crawl of a url, these events are invoked; clients can subscribe to them for a finer level of control over crawling. The events a client can subscribe to are "beforecrawl", "aftercrawl", and "includelink".

See Also: bind