Urllib module in Python is used to access, and interact with, websites using URL (Uniform Resource Locator). A URL (colloquially termed a web address) is a reference to a web resource that specifies its location on a computer network and a mechanism for retrieving it.
Urilib has below modules for working with URL
- urllib.request for opening and reading URLs
- urllib.error containing the exceptions raised by urllib.request
- urllib.parse for parsing URLs
- urllib.robotparser for parsing robots.txt files
Parsing Urls
urllib.parse module defines interface to break URL strings up in components (addressing scheme, network location, path etc.), to combine the components back into a URL string.
urlparse()
parse a URL into six components. urlsplit()
is similar to urlparse(), but does not split the params from the URL. For example:
import urllib.parse sample_url = "http://example.com:8080/example.html?val1=1&val2=Hello" # Parse URL with urlparse() result = urllib.parse.urlparse(sample_url) print(result) # Output # ParseResult(scheme='http', netloc='example.com:8080', path='/example.html', params='', query='val1=1&val2=Hello', fragment='') print("Scheme : " + result.scheme) print("HostName : " + result.hostname) print("Path : " + result.path) # Output # Scheme : http # HostName : example.com # Path : /example.html print(result.geturl()) # Output # http://example.com:8080/example.html?val1=1&val2=Hello result = urllib.parse.urlsplit(sample_url) print(result) # Output # SplitResult(scheme='http', netloc='example.com:8080', path='/example.html', query='val1=1&val2=Hello', fragment='')
urljoin() construct a absolute URL by combining a base URL with another URL. It uses addressing scheme, the network location to provide missing components in the relative URL. For example:
import urllib.parse # Join URL with urljoin() print(urllib.parse.urljoin('http://example.com:8080/example.html', 'FAQ.html')) # Output # http://example.com:8080/FAQ.html
Quoting URL
URL quoting module provides functions to make program data safe for use as URL components by quoting special characters and appropriately encoding non-ASCII text. It also support reversing these operations to recreate the original data from the contents of a URL component.
quote() function replace special characters in string using the %xx escape. It is intended for quoting the path section of URL. quote_plus()
is similar to quote(), but it replace spaces by plus signs. Plus signs in the original string are escaped unless they are included in safe. To get back the URL from the quoted url, use unquote(). It %xx escapes by their single-character equivalent. unquote_plus()
is similar to unquote(), but also replace plus signs by spaces, as required for unquoting HTML form values. Syntax of above functions
quote(string, safe='/', encoding=None, errors=None) quote_plus(string, safe='', encoding=None, errors=None) unquote(string, encoding='utf-8', errors='replace') unquote_plus(string, encoding='utf-8', errors='replace')
Following example demonstrate quoting of program data, so that they can be used as component of URL.
import urllib.parse sample_string = "Hello El Niño" # Replaces special characters for use in URLs quoteStr = urllib.parse.quote(sample_string) print(quoteStr) # Output # Hello%20El%20Ni%C3%B1o # Replaces special characters for use in URLs quotePlusStr = urllib.parse.quote_plus(sample_string) print(quotePlusStr) # Output # Hello+El+Ni%C3%B1o # Get back actual string from quoted string print(urllib.parse.unquote(quoteStr)) # Output # Hello El Niño print(urllib.parse.unquote_plus(quotePlusStr)) # Output # Hello El Niño
Manipulating Query Parameter
urlencode() convert a mapping object or a sequence of two-element tuples to a percent-encoded ASCII text string. It returns string containing series of key=value pairs separated by ‘&’ characters, where both key and value are quoted. The order of parameters in the encoded string will match the order of parameter tuples in the sequence. To reverse encoding process, use parse_qs() and parse_qsl() to parse query strings into Python data structures. Syntax of the function are
urllib.parse.urlencode(query, doseq=False, safe='', encoding=None, errors=None, quote_via=quote_plus) urllib.parse.parse_qs(qs, keep_blank_values=False, strict_parsing=False, encoding='utf-8', errors='replace', max_num_fields=None) urllib.parse.parse_qsl(qs, keep_blank_values=False, strict_parsing=False, encoding='utf-8', errors='replace', max_num_fields=None)
parse_qs()
parse a query string given as a string argument and returns a dictionary. Dictionary keys are the unique query variable names and the values are lists of values for each name. parse_qsl()
parse a query string given as a string argument and returs as a list of name, value pairs.
import urllib.parse # Use urlencode() to convert maps to parameter strings query_data = { 'name': "Mango", "type": "Fruit", "price": 37 } result = urllib.parse.urlencode(query_data) print(result) # Output # name=Mango&type=Fruit&price=37 print(urllib.parse.parse_qs(result)) # Output # {'name': ['Mango'], 'type': ['Fruit'], 'price': ['37']} print(urllib.parse.parse_qsl(result)) # Output # [('name', 'Mango'), ('type', 'Fruit'), ('price', '37')]