Emulating a Browser in Python with mechanize

    Posted by Rogério Carvalho Schneider
    16 Aug 2009

    It is always useful to know how to quickly instantiate a browser from the command line or inside your Python scripts.

    Whenever I need to automate a task involving web systems, I use this recipe to emulate a browser in Python:

    import mechanize
    import cookielib
    # Browser
    br = mechanize.Browser()
    # Cookie Jar (attach it to the browser so cookies are stored and re-sent)
    cj = cookielib.LWPCookieJar()
    br.set_cookiejar(cj)
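    # Browser options
    br.set_handle_equiv(True)
    br.set_handle_redirect(True)
    br.set_handle_referer(True)
    br.set_handle_robots(False)
    # Follow refresh 0, but do not hang on refresh > 0
    br.set_handle_refresh(mechanize._http.HTTPRefreshProcessor(), max_time=1)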
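    # Want debugging messages? Uncomment these:
    #br.set_debug_http(True)
    #br.set_debug_redirects(True)
    #br.set_debug_responses(True)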
    # User-Agent (this is cheating, ok?)
    br.addheaders = [('User-agent', 'Mozilla/5.0 (X11; U; Linux i686; en-US; rv:1.9.0.1) Gecko/2008071615 Fedora/3.0.1-1.fc9 Firefox/3.0.1')]

    Now you have this br object, which is your browser instance. With it you can open a page and inspect or interact with it:

    # Open some site; let's pick a random one, the first that comes to mind:
    r = br.open('http://google.com')
    html = r.read()
    # Show the source
    print html
    # or
    print br.response().read()
    # Show the html title
    print br.title()
    # Show the response headers
    print r.info()
    # or
    print br.response().info()
    # Show the available forms
    for f in br.forms():
        print f
    # Select the first (index zero) form
    br.select_form(nr=0)
    # Let's search
    br.form['q'] = 'weekend codes'
    br.submit()
    print br.response().read()
    # Looking at some results in link format
    for l in br.links(url_regex='stockrt'):
        print l
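
    To make the url_regex filter above less magical, here is a minimal sketch of roughly the same idea using only the Python 3 standard library. It is not how mechanize is implemented; the LinkCollector class, the links_matching helper and the sample HTML are all hypothetical names made up for this illustration.

```python
import re
from html.parser import HTMLParser


class LinkCollector(HTMLParser):
    """Collect the href of every <a> tag seen while parsing."""

    def __init__(self):
        super().__init__()
        self.urls = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            href = dict(attrs).get("href")
            if href:
                self.urls.append(href)


def links_matching(html, url_regex):
    """Return the hrefs found in html whose URL matches url_regex."""
    collector = LinkCollector()
    collector.feed(html)
    pattern = re.compile(url_regex)
    # search() matches anywhere in the URL, not only at the start
    return [url for url in collector.urls if pattern.search(url)]


# Hypothetical page content, for illustration only
sample_html = """
<a href="http://stockrt.github.com/p/one">Weekend codes</a>
<a href="http://example.com/other">Something else</a>
"""
matches = links_matching(sample_html, "stockrt")
```

    Here matches holds only the first link, because the second URL does not contain "stockrt".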

    If you are about to access a password-protected site (HTTP basic auth):

    # If the protected site doesn't receive the authentication data you
    # end up with a 401 error in your face
    br.add_password('http://safe-site.domain', 'username', 'password')
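
    For the curious, this is a small sketch of what HTTP basic auth puts on the wire: the user and password joined by a colon, base64-encoded, and sent in the Authorization header. It uses only the Python 3 standard library; basic_auth_header is a hypothetical helper name, and the credentials are the placeholder ones from the example above.

```python
import base64


def basic_auth_header(username, password):
    """Build the value of the Authorization header for HTTP basic auth."""
    credentials = "{}:{}".format(username, password).encode("utf-8")
    return "Basic " + base64.b64encode(credentials).decode("ascii")


header = basic_auth_header("username", "password")
```

    Note that base64 is an encoding, not encryption, so basic auth only protects the password when the connection itself is encrypted.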

    Thanks to the Cookie Jar we added earlier, you do not have to worry about session handling for authenticated sites, such as a service that requires a POST (form submit) of user and password. Usually such sites ask your browser to store a session cookie and expect it to send that same cookie back when it accesses the page again. All of this, storing and re-sending the session cookies, is done by the Cookie Jar. Neat!
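
    The store-and-resend cycle can be seen without touching the network. The sketch below uses the Python 3 standard library, where cookielib is named http.cookiejar. FakeResponse, the example.com URLs and the session value are hypothetical stand-ins for a real login response, so treat this as an illustration rather than what mechanize does internally.

```python
import email
from http.cookiejar import CookieJar
from urllib.request import Request


class FakeResponse:
    """Minimal stand-in for an HTTP response carrying some headers."""

    def __init__(self, raw_headers):
        self._headers = email.message_from_string(raw_headers)

    def info(self):
        return self._headers


jar = CookieJar()

# 1. The site answers the login request with a session cookie
login_request = Request("http://example.com/login")
login_response = FakeResponse("Set-Cookie: session=abc123; Path=/\n\n")
jar.extract_cookies(login_response, login_request)

# 2. On the next request the jar adds the stored cookie back
next_request = Request("http://example.com/account")
jar.add_cookie_header(next_request)
cookie_header = next_request.get_header("Cookie")
```

    After step 2, cookie_header contains the session cookie the server handed out in step 1, which is what keeps you logged in between page loads.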

    You can also follow links and work with the browsing history:

    # Testing presence of link (if the link is not found you would have to
    # handle a LinkNotFoundError exception)
    br.find_link(text='Weekend codes')
    # Actually clicking the link (click_link only builds the request)
    req = br.click_link(text='Weekend codes')
    br.open(req)
    print br.response().read()
    print br.geturl()
    # Back
    br.back()
    print br.response().read()
    print br.geturl()

    Downloading a file:

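    # Download; retrieve() returns a (filename, headers) tuple
    f = br.retrieve('http://www.google.com.br/intl/pt-BR_br/images/logo.gif')[0]
    print f
    # Open the downloaded file in binary mode, since it is an image
    fh = open(f, 'rb')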

    Setting a proxy for your http navigation:

    # Proxy and user/password
    br.set_proxies({"http": "joe:password@myproxy.example.com:3128"})
    # Proxy
    br.set_proxies({"http": "myproxy.example.com:3128"})
    # Proxy password
    br.add_proxy_password("joe", "password")
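
    The two proxy forms above differ only in whether the credentials are embedded in the string. As a rough illustration of how such a string breaks down, here is a sketch with the Python 3 standard library; split_proxy is a hypothetical helper, not part of mechanize, and mechanize may parse the string differently.

```python
from urllib.parse import urlsplit


def split_proxy(proxy):
    """Split 'user:password@host:port' into its four parts."""
    # Prefixing '//' makes urlsplit treat the string as a network location
    parts = urlsplit("//" + proxy)
    return parts.username, parts.password, parts.hostname, parts.port


with_auth = split_proxy("joe:password@myproxy.example.com:3128")
without_auth = split_proxy("myproxy.example.com:3128")
```

    When the credentials are absent, the first two fields come back as None, which is the case where a separate call such as add_proxy_password supplies them.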

    But if you just want to quickly open a web page, without the fancy features above, just do this:

    # Simple open?
    import urllib2
    print urllib2.urlopen('http://stockrt.github.com').read()
    # With password?
    import urllib
    opener = urllib.FancyURLopener()
    print opener.open('http://user:password@stockrt.github.com').read()
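
    The snippet above targets Python 2. In Python 3, urllib2 became urllib.request and FancyURLopener is deprecated, so a password-protected open could be sketched as below. This is one possible arrangement, not the only one; the URL and credentials are the placeholders from the example above, and the actual request is left commented out because it needs network access.

```python
from urllib.request import (
    HTTPBasicAuthHandler,
    HTTPPasswordMgrWithDefaultRealm,
    build_opener,
)

url = "http://stockrt.github.com"

# Register the credentials for this URL, whatever realm the server reports
password_mgr = HTTPPasswordMgrWithDefaultRealm()
password_mgr.add_password(None, url, "user", "password")

# Build an opener that answers basic auth challenges with those credentials
opener = build_opener(HTTPBasicAuthHandler(password_mgr))

# Performing the request would look like this:
# print(opener.open(url).read())
```

    For a plain open without a password, urllib.request.urlopen(url).read() plays the role of the urllib2 call.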

    See more on the Python mechanize site, in the mechanize docs and in the ClientForm docs.

    Also, I have written this post to explain how to handle HTML forms and sessions with Python mechanize and BeautifulSoup.

