The urllib module in Python is used to access, and interact with, websites via URLs (Uniform Resource Locators). A URL (colloquially termed a web address) is a reference to a web resource that specifies its location on a computer network and a mechanism for retrieving it.

urllib contains the following modules for working with URLs:

  • urllib.request for opening and reading URLs
  • urllib.error containing the exceptions raised by urllib.request
  • urllib.parse for parsing URLs
  • urllib.robotparser for parsing robots.txt files
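As a quick illustration of the urllib.parse module listed above, urlparse() splits a URL into its components (a minimal sketch; the example URL is chosen only for illustration):

```python
from urllib.parse import urlparse

# Split a URL into scheme, network location, path and query parts.
parts = urlparse("http://httpbin.org/xml?format=full")

print(parts.scheme)   # http
print(parts.netloc)   # httpbin.org
print(parts.path)     # /xml
print(parts.query)    # format=full
```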

The urllib.request module defines functions and classes that help in opening URLs. It can also handle basic and digest authentication, redirections, cookies and more.

Fetching URLs

To open a URL, use urlopen(). The syntax of this function is

urllib.request.urlopen(url, data=None, [timeout, ]*, cafile=None, capath=None, cadefault=False, context=None)

Open the URL url, which can be either a string or a Request object. The various parameters are

  • data : An object specifying additional data to be sent to the server, or None if no such data is needed.
  • timeout : An optional parameter that specifies a timeout in seconds for blocking operations like the connection attempt. If not specified, the global default timeout setting is used.
  • context : An ssl.SSLContext instance describing the various SSL options.
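Passing data, for example, turns the request into a POST; it must be bytes, typically produced by urllib.parse.urlencode(). A minimal sketch (httpbin.org/post echoes the request back; network access is required, and the field names here are arbitrary):

```python
import urllib.parse
import urllib.request

# Encode form fields as application/x-www-form-urlencoded bytes.
params = urllib.parse.urlencode({"name": "test", "value": 42}).encode("ascii")

# Supplying data makes urlopen() send a POST request.
with urllib.request.urlopen("http://httpbin.org/post", data=params, timeout=5) as f:
    print(f.read().decode("utf-8"))
```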

The example below demonstrates fetching a response from the server using a URL.

import urllib.request

sample_url = "http://httpbin.org/xml"

# Display the first 300 bytes.
with urllib.request.urlopen(sample_url) as f:
    print(f.read(300))

# Output
# b'<?xml version=\'1.0\' encoding=\'us-ascii\'?>\n\n<!--  A SAMPLE set of slides  -->\n\n<slideshow \n    title="Sample Slide Show"\n    date="Date of publication"\n    author="Yours Truly"\n    >\n\n    <!-- TITLE SLIDE -->\n    <slide type="all">\n      <title>Wake up to WonderWidgets!</title>\n    </slide>\n\n    <!--'


# Decoding the bytes object as utf-8.
with urllib.request.urlopen(sample_url) as f:
    print(f.read(100).decode('utf-8'))

# Output
# <?xml version='1.0' encoding='us-ascii'?>

# <!--  A SAMPLE set of slides  -->

# <slideshow 
#     title=

Parsing Response

urlopen() returns an HTTPResponse instance, which wraps the HTTP response from the server. The response is an iterable object and can be used in a with statement. It provides functions and attributes for inspecting the server's response, including

  • read([amt]) : Reads and returns the response body, or up to the next amt bytes.
  • readinto(b) : Reads up to the next len(b) bytes of the response body into the buffer b. Returns the number of bytes read.
  • getheader(name, default=None) : Return the value of the header name, or default if there is no header matching name.
  • getheaders() : Return a list of (header, value) tuples.
  • version : HTTP protocol version used by the server: 10 for HTTP/1.0, 11 for HTTP/1.1.
  • status : Status code returned by the server.
  • reason : Reason phrase returned by the server.
  • closed : True if the stream is closed.
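Of the functions above, readinto(b) is the one not shown in the larger example that follows; it fills a pre-allocated buffer instead of allocating a new bytes object on each call. A minimal sketch, reusing the same sample URL (network access required):

```python
import urllib.request

sample_url = "http://httpbin.org/xml"

# Pre-allocate a 64-byte buffer; readinto() returns how many bytes it wrote.
buf = bytearray(64)
with urllib.request.urlopen(sample_url) as response:
    n = response.readinto(buf)
    print(n, bytes(buf[:n]))
```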

The example below shows retrieving the various components of an HTTP response.

import urllib.request

sample_url = "http://httpbin.org/xml"

# Create a request to retrieve data using urllib.request
response = urllib.request.urlopen(sample_url, timeout=5)

# Check the status
status_code = response.status
print("Status Code : " + str(status_code))
# Output
# Status Code : 200

print("HTTP Version : " + str(response.version))
# Output
# HTTP Version : 11

# if no error, then read the response contents
if 200 <= status_code < 300:

    # work with response headers
    print("Header : " + str(response.getheaders()))
    # Output
    # Header : [('Date', 'Fri, 11 Sep 2020 06:22:55 GMT'), ('Content-Type', 'application/xml'), ('Content-Length', '522'), ('Connection', 'close'), ('Server', 'gunicorn/19.9.0'), ('Access-Control-Allow-Origin', '*'), ('Access-Control-Allow-Credentials', 'true')]

    print("Content Length : " + response.getheader('Content-Length'))
    # Output
    # Content Length : 522

    print("Content Type : " + response.headers['Content-Type'])
    # Output
    # Content Type : application/xml

    # Read data from the URL
    data = response.read().decode('utf-8')

    print(data)
    # Output
    # <?xml version='1.0' encoding='us-ascii'?>

    # <!--  A SAMPLE set of slides  -->

    # <slideshow 
    #     title="Sample Slide Show"
    #     date="Date of publication"
    #     author="Yours Truly"
    #     >

    #     <!-- TITLE SLIDE -->
    #     <slide type="all">
    #       <title>Wake up to WonderWidgets!</title>
    #     </slide>

    #     <!-- OVERVIEW -->
    #     <slide type="all">
    #         <title>Overview</title>
    #         <item>Why <em>WonderWidgets</em> are great</item>
    #         <item/>
    #         <item>Who <em>buys</em> WonderWidgets</item>
    #     </slide>

    # </slideshow>
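When the server returns an error status or the URL cannot be reached at all, urlopen() raises exceptions defined in urllib.error rather than returning a response. A minimal sketch (httpbin.org/status/404 is a test endpoint that deliberately returns HTTP 404; network access required):

```python
import urllib.error
import urllib.request

try:
    # This endpoint deliberately responds with HTTP 404.
    urllib.request.urlopen("http://httpbin.org/status/404", timeout=5)
except urllib.error.HTTPError as e:
    # HTTPError carries the status code and reason phrase.
    print("HTTP error:", e.code, e.reason)
except urllib.error.URLError as e:
    # URLError covers lower-level failures such as DNS errors.
    print("URL error:", e.reason)
```

Note that HTTPError is a subclass of URLError, so it must be caught first, as above.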