getUri(self)
Returns the URL for this document.

getNormalizedLinks(self)
Returns all links from this document, converted to absolute links
resolved against the document's uri.
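
As a rough illustration of the conversion described above, the sketch below resolves links against a base URL using the standard library's urljoin. The function name normalize_links and its parameters are hypothetical; the actual getNormalizedLinks may filter or normalize links differently.

```python
from urllib.parse import urljoin


def normalize_links(base_uri, links):
    """Resolve every link against base_uri (hypothetical helper)."""
    # urljoin leaves already-absolute links untouched and resolves
    # relative ones ("c.html", "/d", "../e") against base_uri.
    return [urljoin(base_uri, link) for link in links]


if __name__ == "__main__":
    for url in normalize_links("http://example.com/a/b.html",
                               ["c.html", "/d", "http://other.org/x"]):
        print(url)
```

Relative, root-relative and already-absolute links all pass through the same call, so the caller does not need to tell them apart first.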

getPlainContent(self)
Returns the plain HTML content for this document.

getContentHash(self)
Returns the SHA hash of the plain contents of this document.
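
A minimal sketch of such a content hash, assuming SHA-1 (the page only says "SHA", so the exact variant is a guess) and assuming text is encoded as UTF-8 before hashing. The name content_hash is hypothetical.

```python
import hashlib


def content_hash(plain_content):
    """Return a hex SHA-1 digest of the plain contents (assumed variant)."""
    # hashlib works on bytes, so encode text first (UTF-8 is an assumption).
    if isinstance(plain_content, str):
        plain_content = plain_content.encode("utf-8")
    return hashlib.sha1(plain_content).hexdigest()
```

Two fetches of the same page can then be compared by digest instead of by full content, which is one way a crawler can detect unchanged documents.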

Inherited from object: __delattr__, __getattribute__, __hash__, __new__, __reduce__, __reduce_ex__, __repr__, __setattr__, __str__

headers
HTTP headers for this document.

_uri
HTTP URL of this document.

title
Title of this document, obtained from the <title> tag.

description
Description of this document, obtained from the <meta name=description...> tag.

keywords
Keywords of this document, obtained from the <meta name=keywords...> tag.

lastmodified
Last-Modified header for this document; can be used to avoid recrawling the document if its contents have not changed.

etag
ETag header for this document; can be used to avoid recrawling the document if its contents have not changed.
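
One common way to use these two values is an HTTP conditional request. The sketch below only builds the request headers and interprets the status code; the function names are hypothetical and nothing here is taken from the crawler's actual code.

```python
def conditional_headers(lastmodified=None, etag=None):
    """Build conditional request headers from stored validators."""
    headers = {}
    if lastmodified:
        # Ask the server to skip the body if the page is older than this.
        headers["If-Modified-Since"] = lastmodified
    if etag:
        # Ask the server to skip the body if the entity tag still matches.
        headers["If-None-Match"] = etag
    return headers


def is_unchanged(httpstatus):
    """A 304 Not Modified response means the stored copy is still current."""
    return httpstatus == 304
```

If the server answers 304, the previously stored contents can be reused and the document does not need to be downloaded or parsed again.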

httpstatus
HTTP status obtained while crawling this document.

httpreason
HTTP reason obtained while crawling this document.

contenttype
Content-Type header for this document.

contentencoding
Content-Encoding header for this document.

_zippedcontent
Gzipped contents of this document.

_isZipped
Internal flag recording whether the plain contents have already been gzipped.

_bzippedcontent
bz2-compressed contents of this document.

_isBzipped
Internal flag recording whether the plain contents have already been bz2-compressed.

_plaincontent
Plain contents of this document.
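
The flag-plus-cache pairs above suggest that compression is done lazily and at most once. The sketch below shows that pattern under that assumption; the class name and the zipped()/bzipped() accessors are hypothetical, and only the attribute names are taken from this page.

```python
import bz2
import gzip


class CompressedContentSketch:
    """Illustrative only: compress on first request, then reuse the result."""

    def __init__(self, plaincontent):
        self._plaincontent = plaincontent  # bytes
        self._zippedcontent = None
        self._isZipped = False
        self._bzippedcontent = None
        self._isBzipped = False

    def zipped(self):
        # Run gzip only the first time; later calls return the cached bytes.
        if not self._isZipped:
            self._zippedcontent = gzip.compress(self._plaincontent)
            self._isZipped = True
        return self._zippedcontent

    def bzipped(self):
        # Same idea for bz2.
        if not self._isBzipped:
            self._bzippedcontent = bz2.compress(self._plaincontent)
            self._isBzipped = True
        return self._bzippedcontent
```

Keeping a separate boolean flag, rather than testing the cache for None, also covers the case where the compressed result is stored or cleared independently of the flag.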

links
All crawlable links found in this document.

redirecturi
Actual URL of this document if it was redirected from uri.

redirects
Number of times this document was redirected from uri.

redirecturis
All redirected URLs (their count matches redirects) that were crawled while crawling the document's uri.
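
To make the relationship between the three redirect fields concrete, here is a small sketch. It assumes that redirecturis holds every URL visited after the original uri, in order, and that redirecturi is the last of them; the page does not confirm either detail, and summarize_redirects is a hypothetical helper.

```python
def summarize_redirects(hops):
    """Derive the redirect fields from the ordered list of redirect targets.

    hops: URLs visited after the original uri (assumed input shape).
    """
    hops = list(hops)
    return {
        "redirecturis": hops,
        # One entry per redirect, so the count matches the list length.
        "redirects": len(hops),
        # Assumed: the final hop is the document's actual URL.
        "redirecturi": hops[-1] if hops else None,
    }
```

Under these assumptions a document that was never redirected has redirects equal to 0, an empty redirecturis list, and no redirecturi.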

error
DocumentError object if an error occurred while crawling this document.

_cleandata
Regular expression to match newlines.