Class Document

object --+
         |
        Document

Ruya's Document object represents an HTML document. It provides ready-to-use access to the document's HTTP headers and to other properties such as the title and keywords. It also gives access to the plain HTML contents in gzipped or bz2-archived form.

Nested Classes

DocumentError
Ruya's document error object represents an error that occurred while crawling a Document.

Instance Methods

__init__(self, uri, lastmodified='', etag='')
Constructor.

getUri(self)
Returns the url for this document.

getNormalizedLinks(self)
Returns all links from this document converted to absolute links with reference to the document's uri.

getZippedContent(self)
Returns gzipped content for this document.

setZippedContent(self, data)
Sets the gzipped content for the document.

getBzippedContent(self)
Returns bz2 archived contents for this document.

setBzippedContent(self, data)
Sets the bz2 archived contents for this document.

getPlainContent(self)
Returns the plain html content for this document.

setPlainContent(self, data)
Sets the plain html content for this document.

getContentHash(self)
Returns the SHA hash for plain contents of this document.

Inherited from object: __delattr__, __getattribute__, __hash__, __new__, __reduce__, __reduce_ex__, __repr__, __setattr__, __str__

Instance Variables

headers
HTTP headers for this document.

_uri
HTTP url of this document.

title
Title of this document obtained from <title> tag.

description
Description of this document obtained from <meta name=description...> tag.

keywords
Keywords of this document obtained from <meta name=keywords...> tag.

lastmodified
Last-Modified header for this document. It can be used to avoid recrawling the document if its contents have not changed.

etag
ETag header for this document. It can be used to avoid recrawling the document if its contents have not changed.

httpstatus
HTTP status obtained while crawling this document.

httpreason
HTTP reason obtained while crawling this document.

contenttype
Content-type header for this document.

contentencoding
Content-encoding header for this document.

_zippedcontent
gzipped contents for this document.

_isZipped
Internal flag to remember if gzip operation is already done for plain contents.

_bzippedcontent
bz2 archived contents for this document.

_isBzipped
Internal flag to remember if bz2 archive operation is already done for plain contents.

_plaincontent
Plain contents for this document.

links
All crawlable links found in this document.

redirecturi
Actual url of this document if this document was redirected from uri.

redirects
Number of times this document was redirected from uri.

redirecturis
All redirected urls (matching redirects) which were crawled while crawling document uri.

error
DocumentError object if an error occurred while crawling this document.

_cleandata
Regular expression to match newlines

Properties

uri
Returns the url for this document.

normalizedlinks
Returns all links from this document converted to absolute links with reference to the document's uri.

zippedcontent
Returns gzipped content for this document.

bzippedcontent
Returns bz2 archived contents for this document.

plaincontent
Returns the plain html content for this document.

hash
Returns the SHA hash for plain contents of this document.

Inherited from object: __class__

Method Details

__init__(self, uri, lastmodified='', etag='')
(Constructor)

Constructor.

Parameters:

uri (Uri) - Valid instance of a Uri object.
lastmodified (str) - Last-modified header value obtained from last crawl, if any.
etag (str) - Etag header value obtained from last crawl, if any.

Returns: None

Overrides: object.__init__
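
The lastmodified and etag parameters carry the validators from a previous crawl. A minimal sketch of how such values are typically sent back to a server as HTTP conditional-request headers; the helper name conditional_headers is hypothetical and is not part of ruya, and whether ruya builds its requests this way is an assumption:

```python
def conditional_headers(lastmodified='', etag=''):
    """Build conditional-request headers from a previous crawl's validators.

    Hypothetical helper (not part of ruya). Empty strings mean "no value known",
    mirroring the constructor defaults.
    """
    headers = {}
    if lastmodified:
        # The server may answer 304 Not Modified if nothing changed since then.
        headers['If-Modified-Since'] = lastmodified
    if etag:
        # The server may answer 304 Not Modified if the entity tag still matches.
        headers['If-None-Match'] = etag
    return headers
```

A first crawl would pass no validators and get an empty dictionary, so the request is unconditional.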

getZippedContent(self)

Returns gzipped content for this document.

Note: The content is gzipped with the maximum compression level of 9.

See Also: gzip
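
To illustrate the note above, here is a minimal sketch of gzipping HTML at compression level 9 with the Python 3 standard library. It does not use ruya itself; the function name gzip_max and the sample data are assumptions made for the example:

```python
import gzip


def gzip_max(data):
    """Gzip a bytes object at the maximum compression level (9)."""
    return gzip.compress(data, compresslevel=9)


# Repetitive sample HTML, invented for this example.
html = b"<html><body>" + b"<p>hello</p>" * 100 + b"</body></html>"
zipped = gzip_max(html)
```

Level 9 trades more CPU time for the smallest output, which suits content that is compressed once and stored.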

setZippedContent(self, data)

Sets the gzipped content for the document.

Note: The content is unzipped assuming a compression level of 9.

See Also: gzip
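
For reference, a small sketch of the reverse direction using the Python 3 gzip module rather than ruya. In that module, decompression takes no compression-level argument, so data written at any level is read back the same way; how ruya's setter behaves internally is not shown here, and the name gzip_roundtrip is hypothetical:

```python
import gzip


def gzip_roundtrip(data, level):
    """Compress data at the given level, then decompress it again."""
    zipped = gzip.compress(data, compresslevel=level)
    # gzip.decompress needs no level: the stream is self-describing.
    return gzip.decompress(zipped)


original = b"<html><title>t</title></html>"
restored = [gzip_roundtrip(original, level) for level in (1, 6, 9)]
```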

getBzippedContent(self)

Returns bz2 archived contents for this document.

Note: The content is bz2 archived with the maximum compression level of 9.

See Also: bz2
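
The bz2 counterpart can be sketched the same way with the Python 3 standard library. This is an illustration of compression level 9 only, not ruya's own code; the function name bzip_max and the sample data are assumptions:

```python
import bz2


def bzip_max(data):
    """Compress a bytes object with bz2 at the maximum level (9)."""
    return bz2.compress(data, compresslevel=9)


# Repetitive sample HTML, invented for this example.
html = b"<html><body>" + b"<p>hello</p>" * 100 + b"</body></html>"
bzipped = bzip_max(html)
plain_again = bz2.decompress(bzipped)
```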

setBzippedContent(self, data)

Sets the bz2 archived contents for this document.

See Also: bz2

setPlainContent(self, data)

Sets the plain html content for this document.

Note: Empty lines are removed from the plain contents.
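
One way to remove empty lines is a multiline regular expression. The sketch below is an assumption about the general technique, not a copy of ruya's _cleandata pattern; the names _EMPTY_LINES and remove_empty_lines are hypothetical:

```python
import re

# Matches a line that holds only spaces or tabs, including its line ending.
_EMPTY_LINES = re.compile(r'^[ \t]*\r?\n', re.MULTILINE)


def remove_empty_lines(html):
    """Return html with blank and whitespace-only lines dropped."""
    return _EMPTY_LINES.sub('', html)


cleaned = remove_empty_lines("<p>a</p>\n\n  \n<p>b</p>\n")
```

Lines that contain markup are left untouched, including their own line endings.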

Property Details

uri

Returns the url for this document.

Get Method: ruya.Document.getUri(self) - Returns the url for this document.

normalizedlinks

Returns all links from this document converted to absolute links with reference to the document's uri.

Get Method: ruya.Document.getNormalizedLinks(self) - Returns all links from this document converted to absolute links with reference to the document's uri.
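
Converting links to absolute form can be sketched with urljoin from the Python 3 standard library. This shows the general technique only; the function name normalize_links and the example URLs are assumptions, and ruya's own normalization may differ:

```python
from urllib.parse import urljoin


def normalize_links(base_uri, links):
    """Resolve each link against base_uri and return absolute URLs."""
    return [urljoin(base_uri, link) for link in links]


absolute = normalize_links(
    "http://example.com/a/b.html",
    ["c.html", "/d", "../e", "http://other.org/x"],
)
```

Links that are already absolute, such as the last one, pass through unchanged.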

zippedcontent

Returns gzipped content for this document.

Get Method: ruya.Document.getZippedContent(self) - Returns gzipped content for this document.
Set Method: ruya.Document.setZippedContent(self, data) - Sets the gzipped content for the document.

Note: The content is gzipped with the maximum compression level of 9.

See Also: gzip
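
A property backed by get and set methods, combined with a flag such as _isZipped, suggests that the gzipped form is computed once and then cached. The sketch below shows that pattern under stated assumptions: the class LazyGzipContent is hypothetical, it reuses the documented attribute names only for readability, and ruya's actual implementation may differ:

```python
import gzip


class LazyGzipContent(object):
    """Hypothetical stand-in showing lazy, cached gzip behind a property."""

    def __init__(self):
        self._plaincontent = b''
        self._zippedcontent = b''
        self._isZipped = False

    def getPlainContent(self):
        return self._plaincontent

    def setPlainContent(self, data):
        self._plaincontent = data
        self._isZipped = False  # any cached gzip output is now stale

    def getZippedContent(self):
        if not self._isZipped:
            # Compress only on first access, then reuse the cached bytes.
            self._zippedcontent = gzip.compress(self._plaincontent, compresslevel=9)
            self._isZipped = True
        return self._zippedcontent

    def setZippedContent(self, data):
        self._zippedcontent = data
        self._plaincontent = gzip.decompress(data)
        self._isZipped = True

    plaincontent = property(getPlainContent, setPlainContent)
    zippedcontent = property(getZippedContent, setZippedContent)
```

With this design, reading zippedcontent twice returns the same cached object, and assigning new plain content invalidates the cache.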

bzippedcontent

Returns bz2 archived contents for this document.

Get Method: ruya.Document.getBzippedContent(self) - Returns bz2 archived contents for this document.
Set Method: ruya.Document.setBzippedContent(self, data) - Sets the bz2 archived contents for this document.

Note: The content is bz2 archived with the maximum compression level of 9.

See Also: bz2

plaincontent

Returns the plain html content for this document.

Get Method: ruya.Document.getPlainContent(self) - Returns the plain html content for this document.
Set Method: ruya.Document.setPlainContent(self, data) - Sets the plain html content for this document.

hash

Returns the SHA hash for plain contents of this document.

Get Method: ruya.Document.getContentHash(self) - Returns the SHA hash for plain contents of this document.
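
A content hash lets a crawler detect whether two fetches returned identical contents. The documentation does not say which SHA variant is used, so the sketch below assumes SHA-1 purely for illustration; the function name content_hash is hypothetical:

```python
import hashlib


def content_hash(plain):
    """Return a hex digest of the plain contents (SHA-1 assumed here)."""
    return hashlib.sha1(plain).hexdigest()


first = content_hash(b"<html><p>same</p></html>")
second = content_hash(b"<html><p>same</p></html>")
changed = content_hash(b"<html><p>different</p></html>")
```

Equal digests indicate unchanged contents, while any edit to the contents produces a different digest.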
