Bringin’ on the Heartbreak

As web applications stretch beyond borders, they need to adopt strategies to work in multiple languages. Without the right tools or adequate knowledge of Unicode, a programmer will quickly descend into hysteria. The explanations in this post won’t leave you in euphoria but, like those in the previous one, they should adrenalize your efforts to understand character sets.

Previously, we touched on the relative simplicity of music vs. language (in terms of a focus on character sets). Once you know the pattern for the Am pentatonic on a guitar, you can move the fingering around the neck to transpose it to any other key. From there, it’s just a matter of finding a drummer who can count to four and you’re on your way to a band.

Unicode has its own patterns. We’ll eventually get around to discussing those. But first, let’s examine how browsers deal with the narrow set of HTML characters and the wide possibilities of text characters.

It takes an English band to make some great American rock & roll. However, there’s not much to show off between character sets for lyrics from England and America.* Instead, we’ll turn to continental Europe. I was going to choose Lordi as an example, but not nearly as many Finns visit this site as Ukrainians. Plus, neither Lordi nor The Arockalypse requires “interesting” characters with which to demonstrate encodings (sorry, ISO-8859-1, you’re just boring).

Okay. Consider the following HTML (and, hey, look at that doctype, this is official HTML5). We care about the two links. One has so-called “non-English” characters in the query string, the other has them in the path:

<!DOCTYPE html>
<html><head>
<meta http-equiv="content-type" content="text/html; charset=utf-8" >
</head><body>
<a href="http://web.site/music?band=ВопліВідоплясова">query string</a>
<a href="http://web.site/bands/ВопліВідоплясова/?songs=all">path</a>
</body></html>

The charset is explicitly set to UTF-8. Check out what the web server’s logs record for a click on each link:

GET /music?band=%D0%92%D0%BE%D0%BF%D0%BB%D1%96%D0%92%D1%96%D0%B4%D0%BE%D0%BF%D0%BB%D1%8F%D1%81%D0%BE%D0%B2%D0%B0
GET /bands/%D0%92%D0%BE%D0%BF%D0%BB%D1%96%D0%92%D1%96%D0%B4%D0%BE%D0%BF%D0%BB%D1%8F%D1%81%D0%BE%D0%B2%D0%B0/?songs=all
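
Those escapes are nothing more exotic than the UTF-8 bytes of the band name, written out one byte at a time. Here’s a minimal Python sketch that reproduces them, assuming only the standard library (the variable names are mine, not anything the server or browser uses):

```python
from urllib.parse import quote

band = "ВопліВідоплясова"

# quote() encodes the text as UTF-8 by default, then writes each byte as %XX.
encoded = quote(band)
print(encoded)

# Each of these Cyrillic letters takes two bytes in UTF-8,
# so 16 letters turn into 32 percent-escapes.
print(len(band), len(band.encode("utf-8")))
```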

Next, we’ll convert the page using the handy iconv tool:

iconv -f UTF-8 -t KOI8-U utf8.html > koi8u.html

Another tool you should be familiar with is xxd. (No time to cover it here; it’s easy to figure out.) We use it to examine the byte sequences of the converted query string values.
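
If xxd isn’t at hand, a few lines of Python will show the same kind of byte comparison. This is only a sketch, not xxd output: it assumes Python’s built-in koi8_u codec agrees with what iconv calls KOI8-U, and the hexdump helper is a made-up name for illustration.

```python
band = "ВопліВідоплясова"

def hexdump(data: bytes) -> str:
    """Return the bytes as space-separated, two-digit hex values."""
    return " ".join(f"{b:02x}" for b in data)

utf8_bytes = band.encode("utf-8")
koi8u_bytes = band.encode("koi8_u")

print("UTF-8 :", hexdump(utf8_bytes))   # two bytes per letter
print("KOI8-U:", hexdump(koi8u_bytes))  # one byte per letter
```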

The page is converted to KOI8-U, but the <meta> tag still says it’s in UTF-8. This leads to bad bytes if a browser requests either link:

GET /music?band=%EF%BF%BD%EF%BF%BD%EF%BF%BD%CC%A6%EF%BF%BD%EF%BF%BD%EF%BF%BD%EF%BF%BD%EF%BF%BD%EF%BF%BD%EF%BF%BD%EF%BF%BD%EF%BF%BD%EF%BF%BD%EF%BF%BD
GET /bands/%EF%BF%BD%EF%BF%BD%EF%BF%BD%CC%A6%EF%BF%BD%EF%BF%BD%EF%BF%BD%EF%BF%BD%EF%BF%BD%EF%BF%BD%EF%BF%BD%EF%BF%BD%EF%BF%BD%EF%BF%BD%EF%BF%BD/?songs=all
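
Those runs of %EF%BF%BD are U+FFFD, the replacement character, encoded as UTF-8. Here’s a rough Python reproduction, assuming the browser decodes the KOI8-U bytes as UTF-8 and swaps in U+FFFD for each byte that doesn’t fit. Python’s decoder is only a stand-in here and isn’t guaranteed to match every browser’s replacement count.

```python
from urllib.parse import quote

band = "ВопліВідоплясова"
koi8u_bytes = band.encode("koi8_u")

# Pretend to be a browser that trusts the stale charset=utf-8 declaration.
misread = koi8u_bytes.decode("utf-8", errors="replace")

# Only the pair CC A6 happens to be valid UTF-8 (U+0326, a combining comma
# below); every other byte is replaced with U+FFFD.
print([f"U+{ord(ch):04X}" for ch in misread])

# The original letters are already gone before the link is percent-encoded.
print(quote(misread))
```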

If we fix the <meta> tag to set the encoding as KOI8-U, then things improve. However, notice the difference between encodings in the query string vs. those in the path:

GET /music?band=%F7%CF%D0%CC%A6%F7%A6%C4%CF%D0%CC%D1%D3%CF%D7%C1
GET /bands/%D0%92%D0%BE%D0%BF%D0%BB%D1%96%D0%92%D1%96%D0%B4%D0%BE%D0%BF%D0%BB%D1%8F%D1%81%D0%BE%D0%B2%D0%B0/?songs=all

The path becomes UTF-8, but the query string remains in its native character set. This isn’t a quirk of the encoding scheme. It’s a behavior of browsers. To emphasize the point, here’s another example web page:

<!DOCTYPE html>
<html><head>
<meta http-equiv="content-type" content="text/html; charset=utf-8" >
</head><body>
<a href="http://web.site/actors?name=成龍">query string</a>
<a href="http://web.site/actors/成龍/?movies=all">path</a>
</body></html>

When all is UTF-8, the web logs record the bytes we expect:

GET /actors?name=%E6%88%90%E9%BE%8D
GET /actors/%E6%88%90%E9%BE%8D/?movies=all

Now, convert the encoding to GBK:

iconv -f UTF-8 -t GBK utf8.html > gbk.html

And the unchanged <meta> tag produces bad bytes in the logs:

GET /actors?name=%EF%BF%BD%EF%BF%BD%EF%BF%BD
GET /actors/%EF%BF%BD%EF%BF%BD%EF%BF%BD/?movies=all

So, we fix the charset to GBK and the bytes are valid again, with the same split as before (GBK in the query string, UTF-8 in the path):

GET /actors?name=%B3%C9%FD%88
GET /actors/%E6%88%90%E9%BE%8D/?movies=all
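
The pattern in both experiments can be sketched in a few lines of Python. This is an approximation built from the log lines above, not a description of any browser’s actual code; browser_style_request is a hypothetical helper, and it assumes the page charset can represent every character in the query string.

```python
from urllib.parse import quote, urlsplit

def browser_style_request(url: str, page_charset: str) -> str:
    """Percent-encode a link roughly the way the logs suggest browsers do.

    The path is always encoded as UTF-8; the query string is encoded
    with the charset of the page that contained the link.
    """
    parts = urlsplit(url)
    path = quote(parts.path, safe="/", encoding="utf-8")
    query = quote(parts.query, safe="=&", encoding=page_charset)
    return path + ("?" + query if query else "")

print(browser_style_request("http://web.site/actors?name=成龍", "gbk"))
print(browser_style_request("http://web.site/actors/成龍/?movies=all", "gbk"))
print(browser_style_request("http://web.site/music?band=ВопліВідоплясова", "koi8_u"))
```

A character the page charset can’t represent would raise UnicodeEncodeError in this sketch, since quote() encodes strictly by default.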

So, if you were planning to use curl (an excellent tool and about the friendliest mailing list ever) to spider a web site and regexes (pcre, another excellent piece of software) to scrape its content for links, then you’ll have to be careful about character sets once you depart the land of UTF-8. (And you’ll have completely different worries should you ever venture into the just-about-uncharted territory of U+F8D0 – U+F8FF Unicode charts.)
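
For the scraping case, here’s a hedged sketch of what “being careful” might look like. It’s in Python rather than curl and pcre, the helper names are invented for illustration, and the two regexes are deliberately naive (they assume double-quoted href attributes and a charset declared in a <meta> tag).

```python
import re
from urllib.parse import quote, urlsplit

# Hypothetical helpers for illustration; neither curl nor pcre provides them.
META_CHARSET = re.compile(rb'charset=["\']?([\w-]+)', re.IGNORECASE)
HREF = re.compile(r'href="([^"]+)"')

def links_as_requested(page: bytes, fallback: str = "utf-8") -> list:
    """Extract href values and encode them the way a browser would request them."""
    found = META_CHARSET.search(page)
    charset = found.group(1).decode("ascii") if found else fallback
    html = page.decode(charset)  # decode with the declared charset, not a guess
    requests = []
    for href in HREF.findall(html):
        parts = urlsplit(href)
        path = quote(parts.path, safe="/")                       # always UTF-8
        query = quote(parts.query, safe="=&", encoding=charset)  # page charset
        requests.append(path + ("?" + query if query else ""))
    return requests

page = (
    '<meta http-equiv="content-type" content="text/html; charset=gbk">'
    '<a href="http://web.site/actors?name=成龍">query string</a>'
).encode("gbk")
print(links_as_requested(page))
```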

Rather abrupt ending here. Need to wrap up because I’m packing my bags for the Misty Mountains. Bye.

=====

* I think the fabled English reserve also creates better innuendo. Led Zeppelin has “The Lemon Song”, although Def Leppard weren’t exactly subtle with their ultimate “Pour Some Sugar on Me”. Poison simply said, “I Want Action”. Down under is a different story: AC/DC were pretty straightforward with “You Shook Me All Night Long”. Zep aside (for obvious reasons), the 80s were apparently big on hair, not ideas.
