Module ruya

Ruya (an Arabic name meaning "sight, vision") is a Python-based crawler for crawling English and Japanese websites. It is targeted solely toward developers who want crawling functionality in their own code. Some of this tool's important features are shown in the following example:
  #!/usr/bin/env python
  # -*- coding: UTF-8 -*-

  import ruya

  def test():
     url = 'http://www.python.org/'

     # Create a Document instance representing the start url
     doc = ruya.Document(ruya.Uri(url))

     # Create a new crawler configuration object
     cfg = ruya.Config(ruya.Config.CrawlConfig(levels=1, crawldelay=5),
                       ruya.Config.RedirectConfig(),
                       ruya.Config.LogConfig())

     # Use a single-domain breadth crawler with the crawler configuration
     c = ruya.SingleDomainDelayCrawler(cfg)

     # The crawler raises the following events while crawling.
     # Set up callbacks pointing to custom methods where we can control
     # whether to crawl or ignore a url, e.g. to ignore duplicates.
     c.bind('beforecrawl', beforecrawl, None)
     c.bind('aftercrawl', aftercrawl, None)
     c.bind('includelink', includelink, None)

     # Start crawling
     c.crawl(doc)

     # Check whether any errors occurred during the crawl
     if doc.error is not None:
        print '%s: %s' % (doc.error.type, doc.error.value)

  # This callback is invoked by the Ruya crawler before a url is included
  # in the list of urls to be crawled.
  # We can choose to ignore the url based on our custom logic.
  def includelink(caller, eventargs):
     uri = eventargs.uri
     level = eventargs.level
     print 'includelink(): Include "%(uri)s" to crawl on level %(level)d?' % locals()

  # Before a url is actually crawled, Ruya invokes this callback to ask
  # whether to crawl the url or not.
  # We can choose to ignore the url based on our custom logic.
  def beforecrawl(caller, eventargs):
     uri = eventargs.document.uri
     print 'beforecrawl(): "%(uri)s" is about to be crawled...' % locals()

  # After a url is crawled, Ruya invokes this callback, where we can inspect
  # the crawled values of the url.
  def aftercrawl(caller, eventargs):
     doc = eventargs.document
     uri = doc.uri

     print 'Url: ' + uri.url
     print 'Title: ' + doc.title
     print 'Description: ' + doc.description
     print 'Keywords: ' + doc.keywords
     print 'Last-modified: ' + doc.lastmodified
     print 'Etag: ' + doc.etag

     # Check whether any errors occurred during the crawl of this url
     if doc.error is not None:
        print 'Error: %s' % doc.error.type
        print 'Value: %s' % doc.error.value

     print 'aftercrawl(): "%(uri)s" has finished crawling...' % locals()

  if __name__ == '__main__':
     # Test the Ruya crawler
     test()
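The `bind`/callback mechanism the example relies on can be sketched as a small, standalone event dispatcher. This is an illustrative sketch in modern Python, not Ruya's actual internals; `EventSource`, `EventArgs`, and `raise_event` are hypothetical names (the example above only confirms `bind` and the `eventargs` attributes it reads).

```python
class EventArgs(object):
    """Carries data for one event, like the eventargs passed to callbacks."""
    def __init__(self, **kwargs):
        self.__dict__.update(kwargs)

class EventSource(object):
    """Lets callers bind named callbacks, similar in spirit to Crawler.bind()."""
    def __init__(self):
        self._handlers = {}

    def bind(self, event, callback, state=None):
        # Register a callback (and optional state) under an event name
        self._handlers.setdefault(event, []).append((callback, state))

    def raise_event(self, event, eventargs):
        # Invoke every callback bound to this event, passing the caller first
        for callback, state in self._handlers.get(event, []):
            callback(self, eventargs)

# Usage: bind a callback, then raise the event it listens for
seen = []
src = EventSource()
src.bind('includelink', lambda caller, e: seen.append(e.uri), None)
src.raise_event('includelink', EventArgs(uri='http://www.python.org/', level=0))
# seen is now ['http://www.python.org/']
```

This pattern keeps the crawler decoupled from user code: the crawler only knows event names, and callers decide per-url behavior inside their callbacks.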
For bugs, suggestions, or feedback, please report to the author.


To Do: epydoc-3.0beta1 doesn't yet support @rtype and @returns for property().

Version: 1.0

Date: 2007-May-06 1441H

Author: NAIK Shantibhushan<qqbb65v59@world.ocn.ne.jp>

Copyright: Copyright (c) 2005 NAIK Shantibhushan<qqbb65v59@world.ocn.ne.jp>

License: Python

Classes
  CrawlScope
Ruya's configuration object that determines which scope is used for a website during a crawl.
  Config
Ruya's crawler uses configuration objects to determine various settings during a crawl.
  Uri
Ruya's Uri object encapsulates an http url used while crawling.
  Document
Ruya's Document object represents an html document.
  Crawler
Ruya's main object is the Crawler object.
  SingleDomainDelayCrawler
Ruya's single-domain delayed crawler is an enhancement of Ruya's base crawler.
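The single-domain, level-limited breadth-first traversal that SingleDomainDelayCrawler's description implies can be sketched in a self-contained way. This is a hypothetical illustration, not Ruya's implementation: `breadth_first_crawl` and the in-memory link graph stand in for fetching and parsing real pages, and the crawl delay between requests is omitted.

```python
from collections import deque

def breadth_first_crawl(start, get_links, levels,
                        domain_of=lambda u: u.split('/')[2]):
    """Visit urls breadth-first, staying on the start url's domain and
    stopping at the given level. get_links(url) returns that page's links."""
    start_domain = domain_of(start)
    visited = set([start])
    queue = deque([(start, 0)])
    order = []
    while queue:
        url, level = queue.popleft()
        order.append((url, level))
        if level >= levels:
            continue  # level limit reached; don't expand this page's links
        for link in get_links(url):
            # Single-domain scope: skip off-domain links and duplicates
            if domain_of(link) != start_domain or link in visited:
                continue
            visited.add(link)
            queue.append((link, level + 1))
    return order

# Usage with a fake in-memory link graph standing in for real pages
graph = {
    'http://example.com/': ['http://example.com/a', 'http://other.com/x'],
    'http://example.com/a': ['http://example.com/b'],
    'http://example.com/b': [],
}
result = breadth_first_crawl('http://example.com/',
                             lambda u: graph.get(u, []), levels=1)
# result == [('http://example.com/', 0), ('http://example.com/a', 1)]
```

With `levels=1`, the off-domain link and the level-2 page are both excluded, mirroring the `CrawlConfig(levels=1, ...)` setting in the example above.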