Module ruya :: Class Uri

Class Uri


object --+
         |
        Uri

Ruya's Uri object encapsulates an HTTP url used while crawling. It provides ready-to-use methods to obtain the robots.txt path, to extract domains from a url, and to run scope checks between two urls.

Instance Methods

Returns   Method
None      __init__(self, url)
          Constructor.

          getDomainUrl(self)
          Returns the domain found after analyzing the url.

          getRobotsTxtUrl(self)
          Returns the robots.txt path for a url.

          getDomains(self)
          Returns valid domains found after analyzing the url.

          getHashes(self)
          Returns valid SHA hashes for url string.

          getParts(self)
          Returns a tuple consisting of various parts of a url.

Uri       join(self, uri)
          Joins two Uri objects and returns a new Uri object.

boolean   issamedomain(self, uri)
          Determines if two urls belong to the same domain.

boolean   ishostscope(self, uri)
          Determines if two urls belong to the same host.

boolean   isdomainscope(self, uri)
          Determines if two urls have the same domain or either url comes from a sub-domain of the other.

boolean   ispathscope(self, uri)
          Determines if two urls belong to the same folder.

str       __str__(self)
          String representation of the url.

str       __repr__(self)
          Same as string representation.

boolean   __eq__(self, uri)
          Determines if two urls are identical by comparing their SHA hashes.

boolean   __ne__(self, uri)
          Determines if two urls are not identical by comparing their SHA hashes.

Inherited from object: __delattr__, __getattribute__, __hash__, __new__, __reduce__, __reduce_ex__, __setattr__

Instance Variables
  url
Url with querystring removed.
  hash
SHA hash for url.

Properties
  parts
Returns a tuple consisting of various parts of a url.
  domainurl
Returns the domain found after analyzing the url.
  robotstxturl
Returns the robots.txt path for a url.
  domains
Returns valid domains found after analyzing the url.
  hashes
Returns valid SHA hashes for url string.

Inherited from object: __class__

Method Details

__init__(self, url)
(Constructor)

Constructor.
Parameters:
  • url (str) - The actual url to be used for representation.
Returns: None
Overrides: object.__init__

getRobotsTxtUrl(self)

Returns the robots.txt path for a url. Usually, a site at http://domain.ext/ has robots.txt placed in its root, at http://domain.ext/robots.txt.
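
The rule described above can be sketched as follows. This is an illustration only, not Ruya's implementation: it uses Python 3's urllib.parse (the successor of the urlparse module this page refers to), and the helper name robots_txt_url is hypothetical.

```python
from urllib.parse import urlsplit, urlunsplit


def robots_txt_url(url):
    """Return the robots.txt url at the root of the site that serves `url`.

    Hypothetical helper, not part of Ruya.
    """
    parts = urlsplit(url)
    # Keep the scheme and host, replace the path, drop query and fragment.
    return urlunsplit((parts.scheme, parts.netloc, "/robots.txt", "", ""))


print(robots_txt_url("http://domain.ext/support/page1.htm?id=1"))
# http://domain.ext/robots.txt
```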

getDomains(self)

Returns valid domains found after analyzing the url. http://www.domain.ext/ and http://domain.ext/ both point to the same domain, domain.ext, so they must be considered the same. This function helps the crawler determine whether two urls are from the same domain.
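
A minimal sketch of the www handling described above, assuming the "valid domains" are the host with and without a leading www. The function name domain_variants and the shape of its return value are assumptions; the actual return type of getDomains is not stated on this page.

```python
from urllib.parse import urlsplit


def domain_variants(url):
    """Return the bare and the www-prefixed form of the url's host.

    Hypothetical helper, not part of Ruya.
    """
    host = urlsplit(url).hostname or ""  # hostname is already lower-cased
    if not host:
        return []
    bare = host[4:] if host.startswith("www.") else host
    return [bare, "www." + bare]


print(domain_variants("http://www.domain.ext/"))  # ['domain.ext', 'www.domain.ext']
print(domain_variants("http://domain.ext/"))      # ['domain.ext', 'www.domain.ext']
```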

getHashes(self)

Returns valid SHA hashes for the url string. Two different hashes are returned if the url's domain starts with www, since http://www.domain.ext/ and http://domain.ext/ both point to the same domain, domain.ext.
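
One possible reading of the description above is that a hash is computed for each spelling of the host, with and without www. The sketch below follows that reading. The page only says "SHA", so the use of SHA-1 is an assumption, as is the helper name url_hashes; this is not Ruya's code.

```python
import hashlib
from urllib.parse import urlsplit, urlunsplit


def url_hashes(url):
    """Return SHA-1 hex digests for the bare and www spellings of `url`.

    Hypothetical helper; SHA-1 is assumed because the page only says "SHA".
    """
    parts = urlsplit(url)
    host = parts.netloc
    bare = host[4:] if host.lower().startswith("www.") else host
    hashes = []
    for variant_host in (bare, "www." + bare):
        variant = urlunsplit(
            (parts.scheme, variant_host, parts.path, parts.query, parts.fragment)
        )
        hashes.append(hashlib.sha1(variant.encode("utf-8")).hexdigest())
    return hashes


# Both spellings of the same site produce the same pair of hashes.
print(url_hashes("http://www.domain.ext/") == url_hashes("http://domain.ext/"))
```

With this scheme, two urls can be treated as equivalent when their hash lists share a value, regardless of which spelling was crawled first.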

getParts(self)

Returns a tuple consisting of various parts of a url.

See Also: urlparse
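
For orientation, this is the kind of tuple the standard library produces. The example uses Python 3's urllib.parse.urlparse, the successor of the urlparse module referenced above. Whether getParts returns exactly this tuple is an assumption.

```python
from urllib.parse import urlparse

parts = urlparse("http://www.domain.ext/support/page1.htm?id=1#top")

# The result is a 6-item named tuple, so it can be unpacked or read by name.
scheme, netloc, path, params, query, fragment = parts
print(parts.scheme)    # http
print(parts.netloc)    # www.domain.ext
print(parts.path)      # /support/page1.htm
print(parts.query)     # id=1
print(parts.fragment)  # top
```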

join(self, uri)

Joins two Uri objects and returns a new Uri object.
Returns: Uri
Joined Uri instance.
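
The page does not spell out the joining rules. A crawler typically resolves a link found on a page against that page's url, which the standard library's urljoin does. The sketch below shows that behaviour on plain strings, on the assumption that join works along similar lines.

```python
from urllib.parse import urljoin

base = "http://domain.ext/support/page1.htm"

relative = urljoin(base, "page2.htm")
rooted = urljoin(base, "/index.htm")
absolute = urljoin(base, "http://otherdomain.ext/")

print(relative)  # http://domain.ext/support/page2.htm
print(rooted)    # http://domain.ext/index.htm
print(absolute)  # http://otherdomain.ext/
```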

issamedomain(self, uri)

Determines if two urls belong to the same domain.
  • http://domain.ext/page1.htm has the same domain as http://domain.ext/page2.htm.
  • http://domain.ext/page1.htm has the same domain as http://www.domain.ext/page2.htm, since http://www.domain.ext/ and http://domain.ext/ both point to the same domain, domain.ext.
Parameters:
  • uri (Uri) - Valid instance of Uri object.
Returns: boolean
True if urls belong to the same domain.
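
The two cases listed above can be reproduced with a small stand-alone check. This is a sketch on plain url strings rather than Uri objects, and the names bare_host and is_same_domain are hypothetical.

```python
from urllib.parse import urlsplit


def bare_host(url):
    """Return the lower-cased host of `url` without a leading 'www.'."""
    host = urlsplit(url).hostname or ""
    return host[4:] if host.startswith("www.") else host


def is_same_domain(url_a, url_b):
    """True if both urls share a host once a leading 'www.' is ignored."""
    host_a = bare_host(url_a)
    return host_a != "" and host_a == bare_host(url_b)


print(is_same_domain("http://domain.ext/page1.htm", "http://domain.ext/page2.htm"))      # True
print(is_same_domain("http://domain.ext/page1.htm", "http://www.domain.ext/page2.htm"))  # True
print(is_same_domain("http://domain.ext/page1.htm", "http://otherdomain.ext/"))          # False
```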

ishostscope(self, uri)

Determines if two urls belong to the same host.
  • http://domain.ext/page1.htm has the same host (domain) as http://domain.ext/page2.htm.
  • http://domain.ext/page1.htm does not have the same host (domain) as http://otherdomain.ext/page2.htm.
Parameters:
  • uri (Uri) - Valid instance of Uri object.
Returns: boolean
True if urls belong to the same domain (host).

See Also: issamedomain
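
A host-scope check can be sketched as a direct comparison of host names. The version below compares hosts exactly. Whether ishostscope also treats www.domain.ext and domain.ext as the same host is not stated here, so that detail is left out. The function name is hypothetical.

```python
from urllib.parse import urlsplit


def is_host_scope(url_a, url_b):
    """True if both urls have exactly the same (case-insensitive) host."""
    host_a = urlsplit(url_a).hostname  # lower-cased, or None if missing
    host_b = urlsplit(url_b).hostname
    return host_a is not None and host_a == host_b


print(is_host_scope("http://domain.ext/page1.htm", "http://domain.ext/page2.htm"))       # True
print(is_host_scope("http://domain.ext/page1.htm", "http://otherdomain.ext/page2.htm"))  # False
```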

isdomainscope(self, uri)

Determines if two urls have the same domain, or if either url comes from a sub-domain of the other.
  • http://domain.ext/page1.htm comes from the same domain as http://domain.ext/page2.htm.
  • http://example.domain.ext/page1.htm comes from a sub-domain of http://domain.ext/page2.htm, since example.domain.ext is a sub-domain of domain.ext.
  • http://domain.ext/page1.htm does not come from the same domain, or a sub-domain, as http://otherdomain.ext/page2.htm.
Returns: boolean
True if urls belong to the same domain or either url comes from a sub-domain of the other.

Note: A sub-domain is determined simply by checking whether example.domain.ext ends in domain.ext.
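
The suffix test mentioned in the note can be sketched as below. One detail is an assumption on top of the note: the sketch compares against "." + domain rather than the bare domain, because a plain suffix comparison would also accept otherdomain.ext as ending in domain.ext. The function name is hypothetical and this is not Ruya's code.

```python
from urllib.parse import urlsplit


def is_domain_scope(url_a, url_b):
    """True if the hosts are equal or one is a sub-domain of the other."""
    host_a = urlsplit(url_a).hostname or ""
    host_b = urlsplit(url_b).hostname or ""
    if not host_a or not host_b:
        return False
    # Require a dot before the suffix so "otherdomain.ext" is not
    # mistaken for a sub-domain of "domain.ext".
    return (
        host_a == host_b
        or host_a.endswith("." + host_b)
        or host_b.endswith("." + host_a)
    )


print(is_domain_scope("http://example.domain.ext/page1.htm", "http://domain.ext/page2.htm"))  # True
print(is_domain_scope("http://domain.ext/page1.htm", "http://otherdomain.ext/page2.htm"))     # False
```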

ispathscope(self, uri)

Determines if two urls belong to the same folder.
  • http://domain.ext/support/page1.htm belongs to the same folder, support, as http://domain.ext/support/page2.htm.
  • http://domain.ext/index.htm does not belong to the same folder, support, as http://domain.ext/support/page2.htm.
Parameters:
  • uri (Uri) - Valid instance of Uri object.
Returns: boolean
True if urls belong to the same folder.
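
A folder comparison along the lines of the two examples above might look like the following sketch. It assumes "same folder" means the same host and exactly the same directory part of the path. How ispathscope treats nested folders is not stated here. The function name is hypothetical.

```python
import posixpath
from urllib.parse import urlsplit


def is_path_scope(url_a, url_b):
    """True if both urls share a host and the same directory in the path."""
    a, b = urlsplit(url_a), urlsplit(url_b)
    if a.hostname is None or a.hostname != b.hostname:
        return False
    # posixpath.dirname("/support/page1.htm") is "/support".
    return posixpath.dirname(a.path) == posixpath.dirname(b.path)


print(is_path_scope("http://domain.ext/support/page1.htm", "http://domain.ext/support/page2.htm"))  # True
print(is_path_scope("http://domain.ext/index.htm", "http://domain.ext/support/page2.htm"))          # False
```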

__str__(self)
(Informal representation operator)

String representation of the url.
Returns: str
String representation of the url.
Overrides: object.__str__

__repr__(self)
(Representation operator)

Same as string representation.
Returns: str
String representation of the url.
Overrides: object.__repr__

__eq__(self, uri)
(Equality operator)

Determines if two urls are identical by comparing their SHA hashes.
Parameters:
  • uri (Uri) - Valid instance of Uri object.
Returns: boolean
True if urls are identical, False otherwise.

__ne__(self, uri)

Determines if two urls are not identical by comparing their SHA hashes.
Parameters:
  • uri (Uri) - Valid instance of Uri object.
Returns: boolean
True if urls are not identical, False otherwise.
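
Hash-based equality, as described for __eq__ and __ne__ above, can be illustrated with a small stand-in class. HashedUrl is hypothetical and is not the Uri class. SHA-1 is assumed because the page only says "SHA". The explicit __hash__ is needed in Python 3, where defining __eq__ otherwise makes instances unhashable.

```python
import hashlib


class HashedUrl:
    """Hypothetical stand-in for Uri that compares instances by SHA-1 hash."""

    def __init__(self, url):
        self.url = url
        self.hash = hashlib.sha1(url.encode("utf-8")).hexdigest()

    def __eq__(self, other):
        if not isinstance(other, HashedUrl):
            return NotImplemented
        return self.hash == other.hash

    def __ne__(self, other):
        result = self.__eq__(other)
        return result if result is NotImplemented else not result

    def __hash__(self):
        # Keep instances usable in sets and as dict keys.
        return hash(self.hash)


first = HashedUrl("http://domain.ext/page1.htm")
second = HashedUrl("http://domain.ext/page1.htm")
third = HashedUrl("http://domain.ext/page2.htm")
print(first == second)  # True
print(first != third)   # True
```

Comparing fixed-length hashes keeps a crawler's visited set compact, since only the digest has to be stored per url.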

Property Details

parts

Returns a tuple consisting of various parts of a url.
Get Method:
ruya.Uri.getParts(self) - Returns a tuple consisting of various parts of a url.

See Also: urlparse

domainurl

Returns the domain found after analyzing the url.
Get Method:
ruya.Uri.getDomainUrl(self) - Returns the domain found after analyzing the url.

robotstxturl

Returns the robots.txt path for a url. Usually, a site at http://domain.ext/ has robots.txt placed in its root, at http://domain.ext/robots.txt.
Get Method:
ruya.Uri.getRobotsTxtUrl(self) - Returns the robots.txt path for a url.

domains

Returns valid domains found after analyzing the url. http://www.domain.ext/ and http://domain.ext/ both point to the same domain, domain.ext, so they must be considered the same. This function helps the crawler determine whether two urls are from the same domain.
Get Method:
ruya.Uri.getDomains(self) - Returns valid domains found after analyzing the url.

hashes

Returns valid SHA hashes for the url string. Two different hashes are returned if the url's domain starts with www, since http://www.domain.ext/ and http://domain.ext/ both point to the same domain, domain.ext.
Get Method:
ruya.Uri.getHashes(self) - Returns valid SHA hashes for url string.