O[Utf-8]12

Music has a universal appeal uninhibited by language. A metal head in Istanbul, Tokyo, or Oslo instinctively knows the deep power chords of Black Sabbath – it takes maybe two beats to recognize a classic like “N.I.B.” or “Paranoid.” The same guitars that screamed the tapping mastery of Van Halen or led to the spandex hair excess of 80s metal also served The Beatles and Pink Floyd. And before them was Chuck Berry, laying the groundwork with the power chords of “Roll Over Beethoven.”

All that from six strings and five notes: E - A - D - G - B - E. Awesome.

Then there’s the written word on the web. Thousands of symbols in 8 bits, 16 bits, and 32 bits. In ASCII, or US-ASCII as RFC 2616 puts it, or rather ISO-8859-1. Or UTF-8, which is easier to adopt because it’s like an extended ASCII. On the other hand, if you’re dealing with GB2312 then UTF-8 isn’t necessarily for you. Of course, in that case you should really be using GBK instead of GB2312. Or was it supposed to be GB18030?

Character encodings get messy and confusing quickly. Our metal head friends like their own genre of müzik / 音楽 / musikk – one word, three languages, many symbols. In this page, those symbols (or glyphs) share one encoding: UTF-8.

You don’t need to speak a language in order to work with its characters, words, and sentences. You just need Unicode. As Tim Berners-Lee put it,

“The W3C was founded to develop common protocols to lead the evolution of the World Wide Web. The path W3C follows to making text on the Web truly global is Unicode. Unicode is fundamental to the work of the W3C; it is a component of W3C Specifications, from the early days of HTML, to the growing XML Family of specifications and beyond.”

Unicode has its learning curve. With Normalization Forms. Characters. Code Units. Glyphs. Collation. And so on. The gist of Unicode is that it’s a universal coding scheme to represent all that’s to come of the characters used for written language; hopefully never to be eclipsed.
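To make the Normalization Forms part of that learning curve concrete, here is a minimal Python sketch. It assumes two spellings of müzik, one with a precomposed ü and one with a plain u followed by a combining diaeresis; the helper name same_text is made up for illustration.

```python
import unicodedata

# Two spellings of the same word: precomposed ü (U+00FC) versus
# u followed by COMBINING DIAERESIS (U+0308). Both render identically.
composed = "m\u00fczik"
decomposed = "mu\u0308zik"

def same_text(a: str, b: str) -> bool:
    """Hypothetical helper: compare two strings after normalizing both to NFC."""
    return unicodedata.normalize("NFC", a) == unicodedata.normalize("NFC", b)

print(composed == decomposed)           # raw code point comparison
print(same_text(composed, decomposed))  # comparison after normalization
print(len(composed), len(decomposed))   # code point counts differ
```

The two strings look the same on screen but only compare equal once both are normalized, which is why comparisons, filters, and lookups should agree on one form first.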

The security problems of Unicode stem from the conversion from one character set to another. When home-town fans of 少年ナイフ want to praise their heroes in a site’s comment section, they’ll do so in Japanese. Yet behind the scenes, the browser, web site, or operating systems involved might be handling the characters in UTF-8, Shift-JIS, or EUC.
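As a rough illustration of that behind-the-scenes variety, the Python sketch below encodes the band's name in all three encodings and then pretends a server mislabeled Shift-JIS bytes as UTF-8. The function names are hypothetical; the codec names are the ones Python's standard library ships.

```python
band = "少年ナイフ"

def encoded_sizes(text: str, encodings=("utf-8", "shift_jis", "euc_jp")) -> dict:
    """Return the byte length of the same text under each encoding."""
    return {name: len(text.encode(name)) for name in encodings}

def decode_as_utf8(raw: bytes):
    """Try to read bytes as UTF-8; return None when they are not valid UTF-8."""
    try:
        return raw.decode("utf-8")
    except UnicodeDecodeError:
        return None

sizes = encoded_sizes(band)
print(sizes)

# Shift-JIS bytes handed to a UTF-8 decoder: same characters, wrong label.
mislabeled = decode_as_utf8(band.encode("shift_jis"))
print(mislabeled)
```

Five characters take 15 bytes in UTF-8 but 10 in Shift-JIS or EUC-JP, and bytes carrying the wrong label fail to decode at all.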

The conversion of character sets introduces the chance for mistakes and broken assumptions. The number of bytes might change, leading to a buffer overflow or underflow. The string may no longer be the C-friendly NUL-terminated array. Unsupported characters cause errors, possibly causing an XSS filter to skip over a script tag. A lot of these concerns have been documented. Some have demonstrable exploits, as opposed to conceptual problems that run rampant through security conferences, but never see a decent hack.
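Two of those broken assumptions fit in a few lines of Python. This is a sketch only: naive_filter and lenient_decode are hypothetical stand-ins for a byte-level XSS filter and a later processing step that silently drops bytes it cannot decode. They are not taken from any real tool.

```python
def naive_filter(raw: bytes) -> bool:
    """Hypothetical filter: accept input unless '<script' appears in the raw bytes."""
    return b"<script" not in raw.lower()

def lenient_decode(raw: bytes) -> str:
    """Hypothetical downstream step that silently drops undecodable bytes."""
    return raw.decode("utf-8", errors="ignore")

# 0xFF is never valid in UTF-8, so it hides the tag from the byte-level check...
payload = b"<scr\xffipt>alert(1)</scr\xffipt>"
passed = naive_filter(payload)

# ...and vanishes once the lenient decoder runs, leaving a working tag.
rendered = lenient_decode(payload)
print(passed, rendered)

# Plain ASCII text converted to UTF-16 gains embedded zero bytes, which
# C-style string functions would treat as the end of the string.
utf16 = "metal".encode("utf-16-le")
print(utf16)
```

The filter and the decoder each behave reasonably alone. The bug is that they disagree about what the bytes mean, so decode strictly first and filter the decoded text.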

Unicode got more scrutiny when it was proposed for Internationalized Domain Names (IDN). Researchers warned of “homoglyph” attacks, situations where phishers or malware authors would craft URLs that used alternate characters to spoof popular sites. (Here’s an example of JavaScript’s early problems.)

The first attacks didn’t need IDNs. They used trivial letter substitution with look-alikes, such as swapping l (the letter L) and 1 (the number one) in dead1iestattacks.com. IDNs provided more sophistication by allowing domains with changes visually harder to detect like deạdliestattacks.com.
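One cheap defense is to surface what the eye misses. The Python sketch below assumes the look-alike character in the spoofed domain is U+1EA1 and uses the standard library's idna codec, which implements the older IDNA 2003 rules. The helper non_ascii_report is a made-up name.

```python
import unicodedata

def non_ascii_report(hostname: str) -> list:
    """List each non-ASCII character in a hostname with its Unicode name."""
    return [(ch, unicodedata.name(ch, "UNKNOWN")) for ch in hostname if ord(ch) > 127]

real = "deadliestattacks.com"
spoof = "de\u1ea1dliestattacks.com"  # assumed: 'a' replaced by U+1EA1

report = non_ascii_report(spoof)
print(report)

# The ASCII-compatible form is what actually goes over the wire in DNS.
ascii_form = spoof.encode("idna")
print(ascii_form)
```

Note that the substitute is still a Latin letter, so a check for mixed scripts alone would not flag it. Showing the xn-- form to the user is the blunt but effective alternative.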

What’s been less well documented (from what I could find) is the range of support for character set encodings in security tools. The primary language of web security seems to be English based on the popular conferences and books. But useful tools come from all over. Wivet originated from Türkiye (here’s some more UTF-8: Web Güvenlik Topluluğu). Sqlmap and w3af support Unicode. So maybe this is a non-issue for modern tools.

In any case, it never hurts to have more “how to hack” tools in non-English languages or test suites to verify that the latest XSS finder, SQL injector, or web tool can deal with sites that aren’t friendly enough to serve content as UTF-8. Or you could help out with documentation projects like the OWASP Developer Guide.

Sometimes translation is really easy. The phrase for “heavy metal” in French is “heavy metal” – although you’d be correct to use “Métal Hurlant” if you were talking about the magazine or movie. Character conversion can be easy, too, as long as you stick with a single representation. Once you start to dabble in conversions among UTF-8, UTF-16, UTF-32, and beyond, you’ll be well-served by keeping up to date on encoding concerns and having tools that spare you the brain damage of implementing everything from scratch.
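For a sense of what those tools handle for you, here is a small Python sketch that pushes the three words for music from earlier through UTF-8, UTF-16, and UTF-32 using only the built-in codecs. The function names are illustrative, and the little-endian variants are chosen so that no byte order mark inflates the counts.

```python
ENCODINGS = ("utf-8", "utf-16-le", "utf-32-le")

def byte_lengths(text: str) -> dict:
    """Byte length of the same text in each Unicode encoding form."""
    return {enc: len(text.encode(enc)) for enc in ENCODINGS}

def round_trips(text: str) -> bool:
    """True when encoding then decoding returns the original text in every form."""
    return all(text.encode(enc).decode(enc) == text for enc in ENCODINGS)

words = ["m\u00fczik", "音楽", "musikk"]

for word in words:
    print(word, byte_lengths(word), round_trips(word))
```

The characters never change, only the number of bytes spent on them, so any length check or buffer sized for one form is wrong for the others.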