Module ruya :: Class Uri

Class Uri

object --+
         |
        Uri

Ruya's Uri object encapsulates an HTTP url used while crawling. It provides ready-to-use methods to obtain the robots.txt path, to extract the domains of a url, and to run scope checks between two urls.

Instance Methods

None

__init__(self, url)
Constructor.

getDomainUrl(self)
Returns the domain found after analyzing the url.

getRobotsTxtUrl(self)
Returns the robots.txt path for a url.

getDomains(self)
Returns valid domains found after analyzing the url.

getHashes(self)
Returns valid SHA hashes for url string.

getParts(self)
Returns a tuple consisting of various parts of a url.

Uri

join(self, uri)
Joins two Uri objects and returns a new Uri object.

boolean

issamedomain(self, uri)
Determines if two urls belong to the same domain.

boolean

ishostscope(self, uri)
Determines if two urls belong to the same host

boolean

isdomainscope(self, uri)
Determines if two urls have the same domain, or if either url comes from a sub-domain of the other.

boolean

ispathscope(self, uri)
Determines if two urls belong to the same folder.

str

__str__(self)
String representation of the url.

str

__repr__(self)
Same as string representation.

boolean

__eq__(self, uri)
Determines if two urls are identical by comparing their SHA hashes

boolean

__ne__(self, uri)
Determines if two urls are not identical by comparing their SHA hashes

Inherited from object: __delattr__, __getattribute__, __hash__, __new__, __reduce__, __reduce_ex__, __setattr__

Instance Variables

url
Url with querystring removed.

hash
SHA hash for url.

Properties

parts
Returns a tuple consisting of various parts of a url.

domainurl
Returns the domain found after analyzing the url.

robotstxturl
Returns the robots.txt path for a url.

domains
Returns valid domains found after analyzing the url.

hashes
Returns valid SHA hashes for url string.

Inherited from object: __class__

Method Details

__init__(self, url)
(Constructor)

Constructor.

Parameters:

url (str) - The actual url to be used for representation.

Returns: None

Overrides: object.__init__

getRobotsTxtUrl(self)

Returns the robots.txt path for a url. Usually, http://domain.ext/ has robots.txt placed in its root as http://domain.ext/robots.txt.
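
The rule above (robots.txt lives at the root of the host) can be illustrated with a minimal standalone sketch. It uses Python 3's urllib.parse rather than Ruya itself, and the helper name robots_txt_url is hypothetical; Ruya's own implementation may differ.

```python
from urllib.parse import urlsplit, urlunsplit


def robots_txt_url(url):
    """Return the robots.txt URL at the root of the host serving `url`.

    Hypothetical helper for illustration; not part of Ruya.
    """
    parts = urlsplit(url)
    # Keep the scheme and host, replace the path, drop query and fragment.
    return urlunsplit((parts.scheme, parts.netloc, "/robots.txt", "", ""))


print(robots_txt_url("http://domain.ext/support/page1.htm"))
# prints: http://domain.ext/robots.txt
```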

getDomains(self)

Returns valid domains found after analyzing the url. http://www.domain.ext/ and http://domain.ext/ both point to the same domain, domain.ext, so they must be considered the same. This function assists the crawler when determining if two urls are from the same domain.
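
One way to picture the www/non-www equivalence is a helper that lists both spellings of a host. This is a sketch under assumptions: the name domain_variants and the shape of the return value (a two-item list) are hypothetical, and the actual value returned by getDomains is not documented here.

```python
from urllib.parse import urlsplit


def domain_variants(url):
    """Return the bare and www-prefixed forms of the url's host.

    Hypothetical helper; the return shape is an assumption, not Ruya's.
    """
    host = (urlsplit(url).hostname or "").lower()
    # Strip a leading "www." so both spellings reduce to the same bare domain.
    bare = host[4:] if host.startswith("www.") else host
    return [bare, "www." + bare]


# Both spellings yield the same list of candidate domains.
print(domain_variants("http://www.domain.ext/") == domain_variants("http://domain.ext/"))
# prints: True
```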

getHashes(self)

Returns valid SHA hashes for url string. Two different hashes are returned if the url's domain starts with www, because http://www.domain.ext/ and http://domain.ext/ both point to the same domain, domain.ext.
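
The following sketch shows one reading of that behaviour: hash the url as given, and when the host starts with www, also hash the same url without the prefix. The choice of SHA-1, the hex-digest output, and the name url_hashes are all assumptions; the documentation only says "SHA".

```python
import hashlib
from urllib.parse import urlsplit


def _sha1(text):
    # SHA-1 is an assumption; the documentation does not name the variant.
    return hashlib.sha1(text.encode("utf-8")).hexdigest()


def url_hashes(url):
    """Return one hash, or two when the host starts with "www."."""
    hashes = [_sha1(url)]
    parts = urlsplit(url)
    if parts.netloc.lower().startswith("www."):
        # Also hash the same url with the "www." prefix removed.
        bare = parts._replace(netloc=parts.netloc[4:])
        hashes.append(_sha1(bare.geturl()))
    return hashes


www_hashes = url_hashes("http://www.domain.ext/")
bare_hashes = url_hashes("http://domain.ext/")
# The bare url's hash appears among the www url's hashes.
print(bare_hashes[0] in www_hashes)
# prints: True
```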

getParts(self)

Returns a tuple consisting of various parts of a url.

See Also: urlparse
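
For orientation, this is what the standard library's urlparse returns. In Python 3 the function lives in urllib.parse (the urlparse module name is from Python 2). Whether getParts returns exactly this six-item tuple is an assumption.

```python
from urllib.parse import urlparse

parts = urlparse("http://domain.ext/support/page1.htm?id=1#top")

# The result is a named tuple: scheme, netloc, path, params, query, fragment.
print(tuple(parts))
# prints: ('http', 'domain.ext', '/support/page1.htm', '', 'id=1', 'top')
```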

join(self, uri)

Joins two Uri objects and returns a new Uri object.

Returns: Uri: Joined Uri instance.
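
Joining urls is commonly done with the standard library's urljoin, which resolves a reference against a base url. The sketch below works on plain strings; whether Ruya's join uses urljoin internally is an assumption, and join_urls is a hypothetical name.

```python
from urllib.parse import urljoin


def join_urls(base, reference):
    """Resolve `reference` against `base` (hypothetical helper)."""
    return urljoin(base, reference)


# A relative reference replaces the last path segment of the base.
print(join_urls("http://domain.ext/support/page1.htm", "page2.htm"))
# prints: http://domain.ext/support/page2.htm

# A root-relative reference replaces the whole path.
print(join_urls("http://domain.ext/support/page1.htm", "/index.htm"))
# prints: http://domain.ext/index.htm
```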

issamedomain(self, uri)

Determines if two urls belong to the same domain.

http://domain.ext/page1.htm has the same domain as http://domain.ext/page2.htm.
http://domain.ext/page1.htm has the same domain as http://www.domain.ext/page2.htm, since http://www.domain.ext/ and http://domain.ext/ both point to the same domain, domain.ext.

Parameters:

uri (Uri) - Valid instance of Uri object.

Returns: boolean

True if urls belong to the same domain.
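
The two examples above can be reproduced with a small standalone check that ignores a leading www before comparing hosts. This is a sketch on plain strings, not Ruya's code; is_same_domain and _bare_host are hypothetical names.

```python
from urllib.parse import urlsplit


def _bare_host(url):
    # Lower-case the host and strip a leading "www." before comparing.
    host = (urlsplit(url).hostname or "").lower()
    return host[4:] if host.startswith("www.") else host


def is_same_domain(url_a, url_b):
    """True if both urls share a domain, treating www.X and X as equal."""
    return _bare_host(url_a) == _bare_host(url_b)


print(is_same_domain("http://domain.ext/page1.htm", "http://www.domain.ext/page2.htm"))
# prints: True
```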

ishostscope(self, uri)

Determines if two urls belong to the same host.

http://domain.ext/page1.htm has the same host (domain) as http://domain.ext/page2.htm.
http://domain.ext/page1.htm does not have the same host (domain) as http://otherdomain.ext/page2.htm.

Parameters:

uri (Uri) - Valid instance of Uri object.

Returns: boolean

True if urls belong to the same domain (host).

See Also: issamedomain
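
A host-scope check can be sketched as a strict comparison of host names. The sketch assumes the comparison is exact, so www.domain.ext and domain.ext would count as different hosts; the documentation does not say how Ruya treats that case, and is_host_scope is a hypothetical name.

```python
from urllib.parse import urlsplit


def is_host_scope(url_a, url_b):
    """True if both urls name exactly the same host (case-insensitive)."""
    host_a = (urlsplit(url_a).hostname or "").lower()
    host_b = (urlsplit(url_b).hostname or "").lower()
    # An empty host (e.g. a relative url) never matches.
    return host_a != "" and host_a == host_b


print(is_host_scope("http://domain.ext/page1.htm", "http://otherdomain.ext/page2.htm"))
# prints: False
```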

isdomainscope(self, uri)

Determines if two urls have the same domain, or if either url comes from a sub-domain of the other.

http://domain.ext/page1.htm comes from the same domain as http://domain.ext/page2.htm.
http://example.domain.ext/page1.htm comes from a sub-domain of http://domain.ext/page2.htm, since example.domain.ext is a sub-domain of domain.ext.
http://domain.ext/page1.htm does not come from the same domain, or a sub-domain, as http://otherdomain.ext/page2.htm.

Returns: boolean: True if urls belong to the same domain or either of the urls comes from a sub-domain of the other url.

Note: A sub-domain is detected with a simple suffix check: example.domain.ext ends in domain.ext.
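
The suffix check can be sketched as follows. One design choice in this sketch is to require a dot before the parent domain: a bare "ends in" test would treat otherdomain.ext as a sub-domain of domain.ext, which would contradict the third example above. Whether Ruya's own check includes the dot is not stated; is_domain_scope is a hypothetical name.

```python
from urllib.parse import urlsplit


def _host(url):
    return (urlsplit(url).hostname or "").lower()


def _is_subdomain(child, parent):
    # Require the dot so "otherdomain.ext" is not a sub-domain of "domain.ext".
    return child.endswith("." + parent)


def is_domain_scope(url_a, url_b):
    """True if the hosts match or one is a sub-domain of the other."""
    a, b = _host(url_a), _host(url_b)
    return a == b or _is_subdomain(a, b) or _is_subdomain(b, a)


print(is_domain_scope("http://example.domain.ext/page1.htm", "http://domain.ext/page2.htm"))
# prints: True
```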

ispathscope(self, uri)

Determines if two urls belong to the same folder.

http://domain.ext/support/page1.htm belongs to the same folder, support, as http://domain.ext/support/page2.htm.
http://domain.ext/index.htm does not belong to the same folder, support, as http://domain.ext/support/page2.htm.

Parameters:

uri (Uri) - Valid instance of Uri object.

Returns: boolean

True if urls belong to the same folder.

__str__(self)
(Informal representation operator)

String representation of the url.

Returns: str: String representation of the url.
Overrides: object.__str__

__repr__(self)
(Representation operator)

Same as string representation.

Returns: str: String representation of the url.
Overrides: object.__repr__

__eq__(self, uri)
(Equality operator)

Determines if two urls are identical by comparing their SHA hashes

Parameters:

uri (Uri) - Valid instance of Uri object.

Returns: boolean

True if urls are identical, False otherwise.

__ne__(self, uri)

Determines if two urls are not identical by comparing their SHA hashes

Parameters:

uri (Uri) - Valid instance of Uri object.

Returns: boolean

True if urls are not identical, False otherwise.
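
Hash-based equality can be sketched with a small stand-in class. HashedUrl is hypothetical and is not Ruya's Uri; it assumes SHA-1 and a single hash per url, and it removes the query string before hashing, in line with the url instance variable described earlier on this page.

```python
import hashlib
from urllib.parse import urlsplit, urlunsplit


class HashedUrl:
    """Hypothetical stand-in for illustration; not Ruya's Uri class."""

    def __init__(self, url):
        parts = urlsplit(url)
        # Store the url with its query string removed.
        self.url = urlunsplit(
            (parts.scheme, parts.netloc, parts.path, "", parts.fragment)
        )
        # SHA-1 is an assumption; the documentation only says "SHA".
        self.hash = hashlib.sha1(self.url.encode("utf-8")).hexdigest()

    def __eq__(self, other):
        if not isinstance(other, HashedUrl):
            return NotImplemented
        return self.hash == other.hash

    def __ne__(self, other):
        result = self.__eq__(other)
        return result if result is NotImplemented else not result

    def __hash__(self):
        # Keep instances hashable, consistent with __eq__.
        return hash(self.hash)


a = HashedUrl("http://domain.ext/page1.htm?id=1")
b = HashedUrl("http://domain.ext/page1.htm?id=2")
print(a == b)
# prints: True
```

The two urls compare equal only because the query string is dropped before hashing; with the query string kept, their hashes would differ.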

Property Details

parts

Returns a tuple consisting of various parts of a url.

Get Method:: ruya.Uri.getParts(self) - Returns a tuple consisting of various parts of a url.

See Also: urlparse

domainurl

Returns the domain found after analyzing the url.

Get Method:: ruya.Uri.getDomainUrl(self) - Returns the domain found after analyzing the url.

robotstxturl

Returns the robots.txt path for a url. Usually, http://domain.ext/ has robots.txt placed in its root as http://domain.ext/robots.txt.

Get Method:: ruya.Uri.getRobotsTxtUrl(self) - Returns the robots.txt path for a url.

domains

Get Method:: ruya.Uri.getDomains(self) - Returns valid domains found after analyzing the url.

hashes

Get Method:: ruya.Uri.getHashes(self) - Returns valid SHA hashes for url string.
