Coding with Jesse

Parse Accept-Language to detect a user's language

I'm an English-speaking Canadian living in Germany. Quite often I go to a website like Google or Kayak and find myself looking at a German version of the site.

Okay, I do live in Germany, but why assume that everyone within Germany speaks German? What about visitors from other countries, or even people living here who would prefer to use another language?

Presumably these sites are taking my IP address, looking up the geographical location of that address, and choosing the official language for that country. This may work most of the time, but there is an easier and more accurate way to choose a language.

Most browsers send an Accept-Language header. For example, mine is set to:

en-ca,en;q=0.8,en-us;q=0.6,de-de;q=0.4,de;q=0.2

What this basically says is that I prefer (in decreasing order of preference) Canadian English, generic English, US English, German spoken in Germany, and lastly generic German. Each q value is a weight between 0 and 1, and an entry without one (like en-ca here) counts as 1. Any web site I visit is capable of looking at this list and deciding what language I would prefer.

Of course, no matter what assumptions you make about a visitor, give them a chance to change their language if needed. For example, if you use an Internet cafe in Berlin, you shouldn't be stuck viewing websites in German!

One really nice thing: I often see Google Ads and other geographically targeted ads in German, and this makes ignoring the ads much easier! :)

Update: I was inspired to throw together a quick Accept-Language parser in PHP:

$langs = array();

if (isset($_SERVER['HTTP_ACCEPT_LANGUAGE'])) {
    // break up string into pieces (languages and q factors)
    // tags may have several subtags (es-419, zh-hans-cn); q may be 0, 0.5, 1 or 1.0
    preg_match_all('/([a-z]{1,8}(-[a-z0-9]{1,8})*)\s*(;\s*q\s*=\s*(1(?:\.0+)?|0(?:\.[0-9]+)?))?/i', $_SERVER['HTTP_ACCEPT_LANGUAGE'], $lang_parse);

    if (count($lang_parse[1])) {
        // create a list like "en" => 0.8
        $langs = array_combine($lang_parse[1], $lang_parse[4]);
    	
        // set default to 1 for any without q factor
        foreach ($langs as $lang => $val) {
            if ($val === '') $langs[$lang] = 1;
        }

        // sort list based on value	
        arsort($langs, SORT_NUMERIC);
    }
}

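// look through sorted list and use first one that matches our languages
$site = null;

foreach ($langs as $lang => $val) {
    // a q factor of 0 means the language is not acceptable
    if ((float) $val <= 0) continue;

    if (stripos($lang, 'de') === 0) {
        $site = 'de'; // show German site
        break;
    } else if (stripos($lang, 'en') === 0) {
        $site = 'en'; // show English site
        break;
    }
}

if ($site === null) {
    // show default site or prompt for language
}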

This would produce the following structure for my Accept-Language string:

Array
(
    [en-ca] => 1
    [en] => 0.8
    [en-us] => 0.6
    [de-de] => 0.4
    [de] => 0.2
)
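
As a possible next step, the lookup can be wrapped in a reusable function. This is only a rough sketch: the function name choose_language, the $supported list, the $default fallback and the $override parameter (for example a language the visitor picked from a menu and that you stored in a cookie) are all assumptions made for illustration. It expects a sorted array in the same shape as the one shown above.

```php
<?php
// Hypothetical helper: pick the first preferred language that the site supports.
// $langs     - sorted array like the one above, e.g. array('en-ca' => 1, 'en' => '0.8')
// $supported - primary language codes the site offers, e.g. array('en', 'de')
// $override  - language explicitly chosen by the visitor, or null
// $default   - fallback when nothing matches
function choose_language($langs, $supported, $override = null, $default = 'en') {
    // an explicit choice by the visitor always wins
    if ($override !== null && in_array($override, $supported, true)) {
        return $override;
    }

    foreach ($langs as $lang => $q) {
        // a q factor of 0 means "not acceptable"
        if ((float) $q <= 0) {
            continue;
        }

        // compare only the primary subtag, so "en-ca" matches "en"
        $parts = explode('-', strtolower($lang));
        $primary = $parts[0];

        if (in_array($primary, $supported, true)) {
            return $primary;
        }
    }

    return $default;
}

$langs = array('en-ca' => 1, 'en' => '0.8', 'en-us' => '0.6', 'de-de' => '0.4', 'de' => '0.2');

echo choose_language($langs, array('de', 'fr')), "\n";       // de
echo choose_language($langs, array('en', 'de'), 'de'), "\n"; // de
echo choose_language(array(), array('en', 'de')), "\n";      // en
```

Checking the override first keeps the promise made earlier: whatever the header says, a visitor who picked a language gets that language.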

Published on May 4th, 2008. © Jesse Skinner

Twitter

I succumbed to Twitter. If anybody here twitters, feel free to follow me at @jesseskinner or leave your id in the comments.

Published on May 1st, 2008. © Jesse Skinner

Three years of The Future of the Web

Three years ago today, I wrote my first post. I was just about to move to Berlin and was looking for a new job.

A lot has happened since then. I started freelancing a year later, and I couldn't have done it without this site. 100% of my clients come directly through my "hire me" page.

By the time this blog turns four, we should be living back in Canada, and I might set up a web development agency (in an office!)

That's enough about me! What're you guys up to these days?

Published on April 6th, 2008. © Jesse Skinner

Saving data to a file with PHP

Lately, I've been skipping MySQL in situations where I just want to store a few variables, like configuration options, and don't necessarily want the hassle of setting up a database.

You can easily store data in a file by using serialize to turn a PHP value, such as an array, into a string and writing that string to the file. When you read the file back, unserialize turns the string into the original value again.
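
To make the round trip concrete, here is a minimal sketch of what serialize produces and how unserialize reverses it. The $options array and its keys are made up for the example.

```php
<?php
// made-up sample data for illustration
$options = array('width' => 1024, 'theme' => 'dark');

// serialize() turns the array into a plain string that can be written to a file
$string = serialize($options);
echo $string, "\n";
// a:2:{s:5:"width";i:1024;s:5:"theme";s:4:"dark";}

// unserialize() rebuilds the original array from that string
$restored = unserialize($string);
var_dump($restored === $options); // bool(true)
```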

Here are a few functions that do just that:

function get_data($filename) {
    // create file if it doesn't exist
    if (!file_exists($filename)) {
        touch($filename);
    }

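    // an empty or unreadable file unserializes to false, so fall back to an empty array
    $data = unserialize(file_get_contents($filename));
    return is_array($data) ? $data : array();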
}

function get_option($filename, $key) {
    $data = get_data($filename);
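    // return null instead of raising a notice when the option was never set
    return isset($data[$key]) ? $data[$key] : null;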
}

function set_option($filename, $key, $value) {
    $data = get_data($filename);
    $data[$key] = $value;

    // write to disk, holding an exclusive lock so concurrent writes don't interleave
    file_put_contents($filename, serialize($data), LOCK_EX);
}

// probably should put somewhere off the web root
$config = '../config.dat';

set_option($config, 'width', 1024);
echo get_option($config, 'width'); // will echo 1024

So there you have it. Feel free to use or modify this code as much as you like. If anyone has an idea for rewriting it to be cleaner, please share in the comments.

Published on February 24th, 2008. © Jesse Skinner

Easy web scraping with PHP

Web scraping is a technique of web development where you load a web page and "scrape" the data off the page to be used elsewhere. It's not pretty, but sometimes scraping is the only way to access data or content from a web site that doesn't provide RSS or an open API.

I'm not going to discuss the legal aspects of scraping, other than to say it may be considered copyright infringement in some situations. However, there are also perfectly legal reasons to scrape, like when you have permission from the site owner.

To make things really easy, we're going to let the power of regular expressions do all the work for us. If you're not familiar with regular expressions, you may want to google for a tutorial. The PHP manual also documents the regular expression syntax that PHP supports.

First, we start off by loading the HTML using file_get_contents. Next, we use preg_match_all with a regular expression to turn the data on the page into a PHP array.

This example will demonstrate scraping this web site's blog page to extract the most recent blog posts. This is just for demo purposes - of course, the RSS feed is much better suited for this.

// get the HTML (file_get_contents returns false if the page can't be loaded)
$html = file_get_contents("http://www.thefutureoftheweb.com/blog/");

Here is what the HTML looks like for the blog posts:

<ul id="main">
    <li>
        <h1><a href="[link]">[title]</a></h1>
        <span class="date">[date]</span>
        <div class="section">
            [content]
        </div>
    </li>
</ul>

So we will use a regular expression that looks for all the li elements and captures the content using parentheses at the appropriate places (link, title, date & content).

preg_match_all(
    '/<li>.*?<h1><a href="(.*?)">(.*?)<\/a><\/h1>.*?<span class="date">(.*?)<\/span>.*?<div class="section">(.*?)<\/div>.*?<\/li>/s',
    $html,
    $posts, // will contain the blog posts
    PREG_SET_ORDER // formats data into an array of posts
);

foreach ($posts as $post) {
    $link = $post[1];
    $title = $post[2];
    $date = $post[3];
    $content = $post[4];

    // do something with data
}

There's a lot going on inside that regular expression, but there are really only a few "tricks" that are used. Any time I want to say "skip over whatever is between" I use .*?. And any time I want to say "match whatever is in here" I use (.*?). The question mark makes the match lazy, so it stops at the first place the rest of the pattern can match. And lastly, the s modifier at the end tells PHP to allow the dot . to match newlines. That's about all there is to it.
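
A small sketch can show why these details matter. The markup strings below are made up for the demonstration. The sketch compares .*? with the greedy .* and shows what happens when the s modifier is left off.

```php
<?php
// made-up markup for illustration
$html = "<b>one</b>\n<b>two</b>";

// lazy (.*?) stops at the first closing tag, so each element is matched separately
preg_match_all('/<b>(.*?)<\/b>/s', $html, $lazy);
print_r($lazy[1]); // two matches: "one" and "two"

// greedy (.*) runs to the last closing tag and swallows everything in between
preg_match_all('/<b>(.*)<\/b>/s', $html, $greedy);
print_r($greedy[1]); // a single match: "one</b>\n<b>two"

// without the s modifier the dot will not cross a newline
$multiline = "<p>first line\nsecond line</p>";
var_dump(preg_match('/<p>(.*?)<\/p>/', $multiline));  // int(0)
var_dump(preg_match('/<p>(.*?)<\/p>/s', $multiline)); // int(1)
```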

The regular expression will only match blog posts, because they are the only <li> elements that contain an <h1>, <span class="date"> and <div class="section">.

Web scraping is highly unreliable - if the HTML structure were to change, this code would break instantly. However, it's often quite easy to write this code, and it usually produces a perfectly usable hack solution.
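
If that fragility is a concern, one alternative is to parse the page with PHP's DOM extension (DOMDocument and DOMXPath) rather than regular expressions. This is a different technique from the one used above, and the sketch below is only a starting point. It assumes the DOM extension is installed, and the sample markup, links and titles are invented to match the structure shown earlier.

```php
<?php
// invented sample markup in the same shape as the blog page described above
$html = '<ul id="main">
    <li>
        <h1><a href="/blog/1">First post</a></h1>
        <span class="date">May 4th, 2008</span>
        <div class="section">Hello <em>world</em></div>
    </li>
</ul>';

$doc = new DOMDocument();
// real-world HTML is rarely valid, so collect parser warnings instead of printing them
libxml_use_internal_errors(true);
$doc->loadHTML($html);
libxml_clear_errors();

$xpath = new DOMXPath($doc);
$posts = array();

foreach ($xpath->query('//ul[@id="main"]/li') as $li) {
    $a = $xpath->query('.//h1/a', $li)->item(0);
    $date = $xpath->query('.//span[@class="date"]', $li)->item(0);
    $section = $xpath->query('.//div[@class="section"]', $li)->item(0);

    // skip list items that are not blog posts
    if (!$a || !$date || !$section) {
        continue;
    }

    $posts[] = array(
        'link' => $a->getAttribute('href'),
        'title' => trim($a->textContent),
        'date' => trim($date->textContent),
        'content' => trim($section->textContent), // text only, markup is dropped
    );
}

print_r($posts);
```

This still depends on the id and class names staying the same, but it no longer cares about whitespace, attribute order or extra elements between the tags.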

Published on February 17th, 2008. © Jesse Skinner