Bringin’ on the Heartbreak

As web applications stretch beyond borders, they need strategies for working in multiple languages. Without the right tools or adequate knowledge of Unicode, a programmer will quickly descend into hysteria. The explanations in this post won’t leave you in euphoria, but, like the previous one, they should adrenalize your efforts to understand character sets.

Previously, we touched on the relative simplicity of music vs. language (in terms of a focus on character sets). Once you know the pattern for the Am pentatonic on a guitar, you can move the fingering around the neck to transpose it to any other key. From there, it’s just a matter of finding a drummer who can count to four and you’re on your way to a band.

Unicode has its own patterns. We’ll eventually get around to discussing those. But first, let’s examine how browsers deal with the narrow set of HTML characters and the wide possibilities of text characters.

It takes an English band to make some great American rock & roll. However, there’s not much to show off between character sets for lyrics from England and America.* Instead, we’ll turn to continental Europe. I was going to choose Lordi as an example, but not nearly as many Finns visit this site as Ukrainians. Plus, neither Lordi nor The Arockalypse requires “interesting” characters with which to demonstrate encodings (sorry, ISO-8859-1, you’re just boring).

Okay. Consider the following HTML (and, hey, look at that doctype, this is official HTML5). We care about the two links. One has so-called “non-English” characters in the query string, the other has them in the path:

<!DOCTYPE html>
<meta http-equiv="content-type" content="text/html; charset=utf-8" >
<a href="/music?band=ВопліВідоплясова">query string</a>
<a href="/bands/ВопліВідоплясова/?songs=all">path</a>

The charset is explicitly set to UTF-8. Check out what the web server’s logs record for a click on each link:

GET /music?band=%D0%92%D0%BE%D0%BF%D0%BB%D1%96%D0%92%D1%96%D0%B4%D0%BE%D0%BF%D0%BB%D1%8F%D1%81%D0%BE%D0%B2%D0%B0
GET /bands/%D0%92%D0%BE%D0%BF%D0%BB%D1%96%D0%92%D1%96%D0%B4%D0%BE%D0%BF%D0%BB%D1%8F%D1%81%D0%BE%D0%B2%D0%B0/?songs=all
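
To see that those percent-escapes are just UTF-8 bytes, here’s a quick Python 3 sketch (the log value is the band name from the page above; `unquote` assumes UTF-8 by default):

```python
from urllib.parse import unquote

# The percent-encoded band name exactly as it appeared in the access log.
logged = ("%D0%92%D0%BE%D0%BF%D0%BB%D1%96%D0%92%D1%96%D0%B4"
          "%D0%BE%D0%BF%D0%BB%D1%8F%D1%81%D0%BE%D0%B2%D0%B0")

# unquote() turns the escapes back into bytes, then decodes them as UTF-8.
print(unquote(logged))  # ВопліВідоплясова
```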

Next, we’ll convert the page using the handy iconv tool:

iconv -f UTF-8 -t KOI8-U utf8.html > koi8u.html

Another tool you should be familiar with is xxd. (No time to cover it here; it’s easy to figure out.) We use it to examine the byte sequences of the converted query string values.
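
In the same spirit, a Python 3 sketch of what those byte sequences look like (`koi8_u` is Python’s codec name for KOI8-U):

```python
name = "ВопліВідоплясова"

# UTF-8 spends two bytes per Cyrillic letter; KOI8-U spends one.
print(name.encode("utf-8").hex(" "))   # d0 92 d0 be d0 bf ... (32 bytes)
print(name.encode("koi8_u").hex(" "))  # f7 cf d0 cc a6 ... (16 bytes)
```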

The page is converted to KOI8-U, but the <meta> tag still says it’s in UTF-8. This leads to bad bytes in the logs if a browser requests either link.

If we fix the <meta> tag to set the encoding as KOI8-U, then things improve. However, notice the difference between encodings in the query string vs. those in the path:

GET /music?band=%F7%CF%D0%CC%A6%F7%A6%C4%CF%D0%CC%D1%D3%CF%D7%C1
GET /bands/%D0%92%D0%BE%D0%BF%D0%BB%D1%96%D0%92%D1%96%D0%B4%D0%BE%D0%BF%D0%BB%D1%8F%D1%81%D0%BE%D0%B2%D0%B0/?songs=all
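
The two log lines can be reproduced with a Python 3 sketch (using the stdlib `urllib.parse.quote` to percent-encode raw bytes):

```python
from urllib.parse import quote

name = "ВопліВідоплясова"

# Browsers percent-encode the path as UTF-8 no matter what the page charset is.
path = quote(name.encode("utf-8"))
# The query string, though, follows the page's declared encoding (KOI8-U here).
query = quote(name.encode("koi8_u"))

print(f"GET /bands/{path}/?songs=all")
print(f"GET /music?band={query}")
```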

The path becomes UTF-8, but the query string remains in its native character set. This isn’t a quirk of the encoding scheme; it’s a behavior of browsers. To emphasize the point, here’s another example web page:

<!DOCTYPE html>
<meta http-equiv="content-type" content="text/html; charset=utf-8" >
<a href="/actors?name=成龍">query string</a>
<a href="/actors/成龍/?movies=all">path</a>

When all is UTF-8, the web logs record the bytes we expect:

GET /actors?name=%E6%88%90%E9%BE%8D
GET /actors/%E6%88%90%E9%BE%8D/?movies=all

Now, convert the encoding to GBK:

iconv -f UTF-8 -t GBK utf8.html > gbk.html

And the unchanged <meta> tag produces bad bytes in the logs:

GET /actors?name=%EF%BF%BD%EF%BF%BD%EF%BF%BD
GET /actors/%EF%BF%BD%EF%BF%BD%EF%BF%BD/?movies=all
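
Those %EF%BF%BD runs are worth decoding. EF BF BD is the UTF-8 encoding of U+FFFD, the replacement character a parser emits when it hits bytes it can’t make sense of:

```python
# The "bad bytes" from the log are just the UTF-8 replacement character.
bad = bytes.fromhex("efbfbd")
ch = bad.decode("utf-8")
print(ch, hex(ord(ch)))  # prints the replacement glyph and 0xfffd
```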

So, we fix the charset to GBK and the logs make sense again. Note that, as before, only the query string uses GBK; the path remains UTF-8:

GET /actors?name=%B3%C9%FD%88
GET /actors/%E6%88%90%E9%BE%8D/?movies=all
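
The same check as before, in Python 3 (`gbk` is a standard codec name; the bytes should match the two log lines above):

```python
from urllib.parse import quote

name = "成龍"
print(quote(name.encode("gbk")))    # %B3%C9%FD%88, the query string
print(quote(name.encode("utf-8")))  # %E6%88%90%E9%BE%8D, the path
```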

So, if you were planning to use curl (an excellent tool with about the friendliest mailing list ever) to spider a web site and regexes (PCRE, another excellent piece of software) to scrape its content for links, then you’ll have to be careful about character sets once you depart the land of UTF-8. (And you’ll have completely different worries should you ever venture into the just-about-uncharted territory of the U+F8D0 – U+F8FF Unicode charts.)

Rather abrupt ending here. Need to wrap up because I’m packing my bags for the Misty Mountains. Bye.


* I think the fabled English reserve also creates better innuendo. Led Zeppelin has “The Lemon Song”, although Def Leppard weren’t exactly subtle with their ultimate “Pour Some Sugar on Me”. Poison simply said, “I Want Action”. Down under is a different story: AC/DC were pretty straightforward with “You Shook Me All Night Long”. Zep aside (for obvious reasons), the ’80s were apparently big on hair, not ideas.


Music has a universal appeal uninhibited by language. A metal head in Istanbul, Tokyo, or Oslo instinctively knows the deep power chords of Black Sabbath — it takes maybe two beats to recognize a classic like “N.I.B.” or “Paranoid.” The same guitars that screamed the tapping mastery of Van Halen or led to the spandex hair excess of ’80s metal also served The Beatles, Pink Floyd, and Eric Clapton. And before them was Chuck Berry, laying the groundwork with the power chords of “Roll Over Beethoven”.

And all this with six strings and five notes: E – A – D – G – B – E. Awesome.

And then there’s the writing on the web. Thousands of symbols, 8 bits, 16 bits, 32 bits. With ASCII, or US-ASCII as RFC 2616 puts it. Or rather ISO-8859-1. But UTF-8 is easier because it’s like an extended ASCII. On the other hand, if you’re dealing with GB2312, then UTF-8 isn’t necessarily for you. Of course, in that case you should really be using GBK instead of GB2312. Or was it supposed to be GB18030? I can’t remember.
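
Some of that rambling can be pinned down with a Python 3 sketch: ASCII is a proper subset of UTF-8, and GB2312, GBK, and GB18030 nest inside one another, at least for characters in the smallest set:

```python
# ASCII text encodes to identical bytes in UTF-8; that's the "extended ASCII" feel.
assert "metal".encode("ascii") == "metal".encode("utf-8")

# GBK extends GB2312, and GB18030 extends GBK, so a simplified character
# like 成 gets the same two bytes in all three.
for enc in ("gb2312", "gbk", "gb18030"):
    print(enc, "成".encode(enc).hex())  # b3c9 each time
```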

What a wonderful world of character encodings can be found on the web. And confusion. Our metal head friends like their own genre of müzik / 音楽 / musikk. One word, three languages, and, in this example, one encoding: UTF-8. Programmers need to know programming languages, but they don’t need to know different spoken languages in order to work them into their web sites correctly and securely. (And based on the email lists and flame wars I’ve seen, rudimentary knowledge of even one spoken language isn’t a prerequisite for some coders.)

You don’t need to speak the language in order to work with its characters, words, and sentences. You just need Unicode. As some random dude (not really) put it, “The W3C was founded to develop common protocols to lead the evolution of the World Wide Web. The path W3C follows to making text on the Web truly global is Unicode. Unicode is fundamental to the work of the W3C; it is a component of W3C Specifications, from the early days of HTML, to the growing XML Family of specifications and beyond.”

Unicode has its learning curve. With Normalization Forms. Characters. Code Units. Glyphs. Collation. And so on. The gist of Unicode is that it’s a universal coding scheme meant to represent every character used in written language, past and yet to come; hopefully never to be eclipsed.
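
One quick taste of that learning curve, normalization, using the stdlib `unicodedata` module in Python 3:

```python
import unicodedata

# Two different code point sequences that render as the same é.
nfc = "\u00e9"       # U+00E9 LATIN SMALL LETTER E WITH ACUTE
nfd = "e\u0301"      # 'e' followed by U+0301 COMBINING ACUTE ACCENT

print(nfc == nfd)                                # False: the code points differ
print(unicodedata.normalize("NFC", nfd) == nfc)  # True: equal after normalizing
```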

The security problems of Unicode stem from the conversion from one character set to another. When home-town fans of 少年ナイフ want to praise their heroes in a site’s comment section, they’ll do so in Japanese. Yet behind the scenes, the browser, web site, or operating system involved might be handling the characters in UTF-8, Shift-JIS, or EUC-JP.

The conversion of character sets introduces the chance for mistakes and broken assumptions. The number of bytes might change, leading to a buffer overflow or underflow. The string may no longer be the C-friendly NUL-terminated array. Unsupported characters cause errors, possibly leading an XSS filter to skip over a script tag. A lot of these concerns have been documented. Some have even been demonstrated as exploitable vulns in the real world (as opposed to the conceptual problems that run rampant through security conferences but never see a decent hack).
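
A couple of those broken assumptions are easy to demonstrate in Python 3 (the strings here are arbitrary examples):

```python
word = "Воплі"
print(len(word))                   # 5 characters
print(len(word.encode("koi8_u")))  # 5 bytes
print(len(word.encode("utf-8")))   # 10 bytes: a one-byte-per-character
                                   # buffer assumption just broke

# UTF-16 scatters NUL bytes through plain ASCII, which truncates C-style strings.
print(b"\x00" in "<script>".encode("utf-16-le"))  # True
```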

Unicode got more popular scrutiny when it was proposed for Internationalized Domain Names (IDN). Researchers warned of “homoglyph” attacks: situations where phishers or malware authors craft URLs that use alternate characters to spoof popular sites. The first attacks didn’t need IDNs; they used trivial tricks like replacing the letter L with the numeral 1. However, IDNs allow more sophistication: domains with harder-to-detect changes, like the dot hiding under the a in deạ…
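
A homoglyph is easy to construct. In this Python 3 sketch the second letter of the fake string is Cyrillic, not Latin (a hypothetical example, not a real attack domain):

```python
import unicodedata

real = "paypal"
fake = "pаypal"  # the second letter is U+0430, not U+0061

print(real == fake)               # False, despite looking identical
print(unicodedata.name(real[1]))  # LATIN SMALL LETTER A
print(unicodedata.name(fake[1]))  # CYRILLIC SMALL LETTER A
```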

What hasn’t been well documented (or at least not anywhere I could find) is the range of support for character set encodings in security tools. The primary language of web security seems to be English (at least based on the popular conferences and books). But useful tools come from all over. Wivet originated in Türkiye (here’s some more UTF-8: Web Güvenlik Topluluğu), but it goes easy on scanners in terms of character set support. Sqlmap and w3af support Unicode. So, maybe this is a non-issue for modern tools.

In any case, it never hurts to have more “how to hack” tools in non-English languages, or test suites to verify that the latest XSS finder, SQL injector, or web tool can deal with sites that aren’t friendly enough to serve content as UTF-8. Or you could help out with documentation projects like the OWASP Development Guide. Don’t be afraid to care. It would be disastrous if an anti-virus, malware detector, WAF, or scanner were tripped up by encoding issues.

Sometimes translation is really easy. The phrase for “heavy metal” in French is “heavy metal” — although you’d be correct to use “Métal Hurlant” if you were talking about the movie. Character conversion can be easy, too. As long as you stick with a single representation. Once you start to dabble in the Unicode conversions from UTF-8, UTF-16, UTF-32, and beyond you’ll be well-served by keeping up to date on encoding concerns and having tools that spare you the brain damage of implementing everything from scratch.
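
For the curious, the “beyond” is mostly about width: the same code point takes a different byte shape in each UTF. A Python 3 sketch (big-endian forms, to skip the byte order mark):

```python
ch = "音"  # U+97F3, borrowed from 音楽 above
for enc in ("utf-8", "utf-16-be", "utf-32-be"):
    print(enc, ch.encode(enc).hex())
# utf-8     e99fb3   (three bytes)
# utf-16-be 97f3     (two bytes)
# utf-32-be 000097f3 (four bytes, half of them zero padding)
```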

p.s. Sorry, Canada, looks like I’ve hit my word count and neglected to mention Rush. Maybe next year.

p.p.s. And eventually I’ll work in a reference to all 10 tracks of DSotM in a single post.