#!/usr/bin/env python
# -*- coding: UTF-8 -*-
import ruya

def test():
    url = 'http://www.python.org/'
    # Create a Document instance representing the start url
    doc = ruya.Document(ruya.Uri(url))
    # Create a new crawler configuration object
    cfg = ruya.Config(ruya.Config.CrawlConfig(levels=1, crawldelay=5),
                      ruya.Config.RedirectConfig(),
                      ruya.Config.LogConfig())
    # Use a single-domain breadth crawler with this crawler configuration
    c = ruya.SingleDomainDelayCrawler(cfg)
    # The crawler raises the following events before crawling a url.
    # Set up callbacks pointing to custom methods where we can control
    # whether to crawl or ignore a url, e.g. to ignore duplicates.
    c.bind('beforecrawl', beforecrawl, None)
    c.bind('aftercrawl', aftercrawl, None)
    c.bind('includelink', includelink, None)
    # Start crawling
    c.crawl(doc)
    # if doc.error is not None:
    #     print repr(doc.error.type) + ': ' + repr(doc.error.value)

# This callback is invoked by the Ruya crawler before a url is included in
# the list of urls to be crawled.
# We can choose to ignore the url here based on our custom logic.
def includelink(caller, eventargs):
    uri = eventargs.uri
    level = eventargs.level
    print 'includelink(): Include "%(uri)s" to crawl on level %(level)d?' % locals()

# Before a url is actually crawled, Ruya invokes this callback to ask
# whether to crawl the url or not.
# We can choose to ignore the url here based on our custom logic.
def beforecrawl(caller, eventargs):
    uri = eventargs.document.uri
    print 'beforecrawl(): "%(uri)s" is about to be crawled...' % locals()

# After a url is crawled, Ruya invokes this callback, where we can examine
# the crawled values of the url.
def aftercrawl(caller, eventargs):
    doc = eventargs.document
    uri = doc.uri
    print 'Url: ' + uri.url
    print 'Title: ' + doc.title
    print 'Description: ' + doc.description
    print 'Keywords: ' + doc.keywords
    print 'Last-modified: ' + doc.lastmodified
    print 'Etag: ' + doc.etag
    # Check whether any errors occurred during the crawl of this url
    if doc.error is not None:
        print 'Error: ' + repr(doc.error.type)
        print 'Value: ' + repr(doc.error.value)
    print 'aftercrawl(): "%(uri)s" has finished crawling...' % locals()

if __name__ == '__main__':
    # Test the Ruya crawler
    test()

For bugs, suggestions, or feedback, please report to the author.
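The includelink() callback in the example above is the natural place to skip duplicate urls. Below is a minimal sketch of such a filter; the SeenFilter class is hypothetical (not part of the ruya API), and only the seen-set bookkeeping is shown, with the ruya-specific wiring assumed to follow the callback signature used in the example:

```python
# Hypothetical helper for duplicate filtering inside includelink().
# Not part of ruya: a plain seen-set that remembers urls already queued.
class SeenFilter(object):
    """Tracks urls already queued so includelink() can skip repeats."""
    def __init__(self):
        self.seen = set()

    def should_crawl(self, url):
        # Normalize trivially so 'http://a/' and 'http://a' compare equal.
        key = url.rstrip('/')
        if key in self.seen:
            return False
        self.seen.add(key)
        return True

f = SeenFilter()
print(f.should_crawl('http://www.python.org/'))  # first sighting: True
print(f.should_crawl('http://www.python.org'))   # duplicate: False
```

A callback bound to 'includelink' could then consult f.should_crawl(eventargs.uri) and ignore the url when it returns False.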
To Do: epydoc-3.0beta1 doesn't support @rtype, @returns for property() yet?
Version: 1.0
Date: 2007-May-06 1441H
Author: NAIK Shantibhushan<qqbb65v59@world.ocn.ne.jp>
Copyright: Copyright (c) 2005 NAIK Shantibhushan<qqbb65v59@world.ocn.ne.jp>
License: Python
Classes:
    CrawlScope: Ruya's configuration object that determines which scope is used for a website while crawling.
    Config: Ruya's crawler uses configuration objects to determine various settings during a crawl.
    Uri: Ruya's Uri object encapsulates an http url used while crawling.
    Document: Ruya's document object represents an html document.
    Crawler: Ruya's main object is the Crawler object.
    SingleDomainDelayCrawler: Ruya's single-domain delayed crawler, an enhancement to Ruya's base crawler.
Generated by Epydoc 3.0beta1 on Sun May 06 20:47:05 2007 (http://epydoc.sourceforge.net)