Browser cache invalidation with JavaScript and querystring

    Posted by Rogério Carvalho Schneider
    30 Aug 2009

    Some time ago I started my blog here at GitHub and noticed that new posts didn’t go live right away when I published them.

    I quickly spotted the problem: GitHub sends HTTP cache headers for index.html and every page it serves, with a 24-hour cache.

    The problem

    $ curl -I http://stockrt.github.com
    
    HTTP/1.1 200 OK
    Server: nginx/0.6.31
    Date: Sat, 22 Aug 2009 01:36:49 GMT
    Content-Type: text/html
    Content-Length: 66829
    Last-Modified: Sat, 22 Aug 2009 01:12:50 GMT
    Connection: keep-alive
    Expires: Sun, 23 Aug 2009 01:36:49 GMT
    Cache-Control: max-age=86400
    Accept-Ranges: bytes

    So, to overcome this “problem” I made this tiny trick and published it so others can take advantage of it, in case you are hosting your pages behind a web server with Expires configured.

    The trick

    Go and clone cache_invalidation and start using the provided JavaScript files in your site, like this:

    <html>
    
    <head>
     <script src="http://your_site/javascripts/querystring.js" type="text/javascript"></script>
     <script src="http://your_site/javascripts/cache_invalidation.js" type="text/javascript"></script>
    </head>
    
    <body>
    </body>
    
    </html>

    Set the desired TTL inside the cache_invalidation.js file:

    // TTL: set your cache threshold here
    var ttl = 300;  // seconds

    And you are all set.

    But why does this happen, and how does the trick work?

    It happens because their web server (the great nginx) is configured with the equivalent of what we call mod_expires in Apache. This module sets the Expires HTTP cache header.
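    In nginx such a policy takes only a couple of lines. A hypothetical sketch (I do not know GitHub’s actual configuration):

    # Hypothetical sketch, not GitHub's real config: cache everything for 24h.
    # The expires directive emits both the Expires header and
    # Cache-Control: max-age.
    location / {
        expires 24h;
    }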

    If you look at the response headers I got before, you will see:

    Date: Sat, 22 Aug 2009 01:36:49 GMT
    Expires: Sun, 23 Aug 2009 01:36:49 GMT

    and:

    Cache-Control: max-age=86400

    Notice that:

    $ bc <<< 86400/3600
    24

    They are telling my browser to use the local copy for the next 24 hours when accessing this site. More precisely, when accessing this site’s index.html.

    I think that, for a blog, this is a pretty long time to keep the user’s cache. This cache header means that if a reader accessed your site just before you posted something and returned after you posted, they would not see any difference. They would only notice your new post the next day.

    But you can bypass that by appending any query string to the site’s address in your browser’s navigation bar (for example, http://stockrt.github.com/?123).

    This tricks the browser into going to the source and fetching the page instead of using the local copy. It would only use a local copy if there is no query string, or if it has already cached that URL with that same query string (say, the second time you visit with the same query string).

    Precisely because the browser would reuse the cache on a second access to the same query string, I made the script vary it on each access. The script also forces a refresh when the query string in use is more than TTL seconds older than the current time, even if the page is already cached from a previous access, say, when clicking a bookmark.
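    The gist of the script is something like this (a minimal sketch of the idea in plain JavaScript, not the exact contents of cache_invalidation.js, which also uses the querystring.js helper):

    // Minimal sketch, not the real cache_invalidation.js source
    var ttl = 300;  // seconds

    // Current time, in seconds
    var now = Math.floor(new Date().getTime() / 1000);

    // Timestamp carried in the current query string, if any
    var stamp = parseInt(window.location.search.substring(1), 10);

    // No timestamp yet, or one more than TTL seconds old?
    // Reload with a fresh query string so the browser fetches the page
    // from the server instead of reusing the cached copy
    if (isNaN(stamp) || now - stamp > ttl) {
        window.location.search = '?' + now;
    }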

    Being the front-end engineer I am, I only pray my web developer colleagues never find this post :)


    Finding the next prime number from a given number

    Posted by Rogério Carvalho Schneider
    29 Aug 2009

    Finding the next prime number online is useful if you do not have time to calculate it but need a good seed for your hash.

    export start_number=250000
    curl -s "http://www.numberempire.com/primenumbers.php?action=next&number=${start_number}" | sed -n -e 's#.*The smallest prime greater than.*<font color=.*>\(.*\)</font></div></td></tr><tr>.*#\1#p'

    This can help you find which number to use when tuning Varnish for better performance with the classic hash algorithm, avoiding bucket collisions when you have a large number of objects and making hash lookups faster.
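    If the site is down, or you just prefer to stay offline, a naive trial-division search is more than enough for numbers in this range. A quick Python sketch of my own, not part of the original tip:

    # Naive trial division: fine for seeds of this size
    def is_prime(k):
        if k < 2:
            return False
        if k % 2 == 0:
            return k == 2
        i = 3
        while i * i <= k:
            if k % i == 0:
                return False
            i += 2
        return True

    def next_prime(n):
        candidate = n + 1
        while not is_prime(candidate):
            candidate += 1
        return candidate

    # The smallest prime greater than the start number
    print next_prime(250000)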


    sed quick tips

    Posted by Rogério Carvalho Schneider
    28 Aug 2009

    A little collection of sed tips.

    The basic one, substitution

    You already know this one:

    sed -i 's/old text/new text/g' file.txt

    Deleting all lines containing a specific text

    sed -i -e '/this line will disappear/d' file.txt

    Deleting blank lines

    sed -i -e '/^$/d' file.txt

    Filtering text between delimiters

    All text matching the first defined group “()” will be printed:

    curl -s -L http://www.terra.com.br | sed -n -e 's#.*\(http://.*\.\(js\|css\)\).*#\1#p'

    Printing only from a given line number to another

    Prints from line 20 to line 30:

    sed -n 20,30p file.txt

    More tips can be found here.

    Do you know some trick that is worth sharing? Please post it as a comment!


    Sunset in Porto Alegre

    Posted by Rogério Carvalho Schneider
    16 Aug 2009

    Beautiful sunset in the capital city of Rio Grande do Sul, Brazil.

    My wife and I decided to take some of our spare time to appreciate it.

    Do you think it was worth it?


    Emulating a Browser in Python with mechanize

    Posted by Rogério Carvalho Schneider
    16 Aug 2009

    It is always useful to know how to quickly instantiate a browser on the command line or inside your Python scripts.

    Every time I need to automate a task involving web systems, I use this recipe to emulate a browser in Python:

    import mechanize
    import cookielib
    
    # Browser
    br = mechanize.Browser()
    
    # Cookie Jar
    cj = cookielib.LWPCookieJar()
    br.set_cookiejar(cj)
    
    # Browser options
    br.set_handle_equiv(True)
    br.set_handle_gzip(True)
    br.set_handle_redirect(True)
    br.set_handle_referer(True)
    br.set_handle_robots(False)
    
    # Follow refresh 0 but do not hang on refresh > 0
    br.set_handle_refresh(mechanize._http.HTTPRefreshProcessor(), max_time=1)
    
    # Want debugging messages?
    #br.set_debug_http(True)
    #br.set_debug_redirects(True)
    #br.set_debug_responses(True)
    
    # User-Agent (this is cheating, ok?)
    br.addheaders = [('User-agent', 'Mozilla/5.0 (X11; U; Linux i686; en-US; rv:1.9.0.1) Gecko/2008071615 Fedora/3.0.1-1.fc9 Firefox/3.0.1')]

    Now you have the br object, your browser instance. With it, it is possible to open a page and inspect or interact with it:

    # Open some site, let's pick a random one, the first that pops in mind:
    r = br.open('http://google.com')
    html = r.read()
    
    # Show the source
    print html
    # or
    print br.response().read()
    
    # Show the html title
    print br.title()
    
    # Show the response headers
    print r.info()
    # or
    print br.response().info()
    
    # Show the available forms
    for f in br.forms():
        print f
    
    # Select the first (index zero) form
    br.select_form(nr=0)
    
    # Let's search
    br.form['q'] = 'weekend codes'
    br.submit()
    print br.response().read()
    
    # Looking at some results in link format
    for l in br.links(url_regex='stockrt'):
        print l

    If you are about to access a password-protected site (HTTP basic auth):

    # If the protected site didn't receive the authentication data you would
    # end up with a 401 error in your face
    br.add_password('http://safe-site.domain', 'username', 'password')
    br.open('http://safe-site.domain')

    Thanks to the Cookie Jar we added before, you do not have to bother with session handling for authenticated sites, such as when you are accessing a service that requires a POST (form submit) of user and password. Usually they ask your browser to store a session cookie and expect your browser to send that same cookie back when re-accessing the page. All this storing and re-sending of session cookies is done by the Cookie Jar. Neat!
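    For example, a form-based login could look like this (the URL and field names below are hypothetical; inspect the real form with br.forms() first):

    # Hypothetical login form (URL and field names are made up)
    br.open('http://some-site.domain/login')
    br.select_form(nr=0)              # pick the login form
    br.form['user'] = 'username'      # field names depend on the real form
    br.form['password'] = 'password'
    br.submit()                       # the session cookie lands in the Cookie Jar
    
    # Subsequent requests send the cookie back automatically
    br.open('http://some-site.domain/members')
    print br.response().read()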

    You can also manage the browsing history:

    # Testing presence of link (if the link is not found you would have to
    # handle a LinkNotFoundError exception)
    br.find_link(text='Weekend codes')
    
    # Actually clicking the link
    req = br.click_link(text='Weekend codes')
    br.open(req)
    print br.response().read()
    print br.geturl()
    
    # Back
    br.back()
    print br.response().read()
    print br.geturl()

    Downloading a file:

    # Download (retrieve returns a tuple: local filename, response headers)
    f = br.retrieve('http://www.google.com.br/intl/pt-BR_br/images/logo.gif')[0]
    print f  # path of the temporary file the image was saved to
    fh = open(f)

    Setting a proxy for your http navigation:

    # Proxy and user/password
    br.set_proxies({"http": "joe:password@myproxy.example.com:3128"})
    
    # Proxy
    br.set_proxies({"http": "myproxy.example.com:3128"})
    # Proxy password
    br.add_proxy_password("joe", "password")

    But if you just want to quickly open a web page, without the fancy features above, just do this:

    # Simple open?
    import urllib2
    print urllib2.urlopen('http://stockrt.github.com').read()
    
    # With password?
    import urllib
    opener = urllib.FancyURLopener()
    print opener.open('http://user:password@stockrt.github.com').read()

    See more at the Python mechanize site, the mechanize docs and the ClientForm docs.

    Also, I have written this post to elucidate how to handle HTML forms and sessions with Python mechanize and BeautifulSoup.
