When designing security filters against HTML injection you need to outsmart the attacker, not the browser. HTML’s syntax is more forgiving of mis-nested tags, unterminated elements, and entity encoding than formats like XML. This is a good thing, because it ensures a User-Agent renders a best-effort layout for a web page rather than bailing on errors or typos that would leave visitors staring at blank pages or incomprehensible error messages.
It’s also a bad thing, because User-Agents have to make educated guesses about a page author’s intent when they encounter unexpected markup. This is the kind of situation that leads to browser quirks and inconsistent behavior.
One of HTML5’s improvements is a codified algorithm for parsing content. In the past, browsers not only had quirks, but developers would write content specifically to take advantage of those idiosyncrasies – giving us a world where sites worked well with one and only one version of Internet Explorer (or Mozilla, etc.). A great deal of blame lies at the feet of site developers who refused to consider good HTML design patterns in favor of the principle of Code Relying on Advanced Persistent Stubbornness.
Untidy markup is a security hazard. It makes HTML injection vulnerabilities more difficult to detect and block, especially for regex-based countermeasures.
Regular expressions have irregular success as security mechanisms for HTML. While regexes excel at pattern matching, they fare miserably in semantic parsing. Once you start building a state mechanism for element start characters, token delimiters, attribute names, and so on, anything other than a narrowly focused regex becomes unwieldy at best.
First, let’s take a look at some simple elements with uncommon syntax. Regular readers will recognize a favorite XSS payload of mine, the
Spaces aren’t required to delimit attribute name/value pairs when the value is marked by quotes. Also, the element name may be separated from its attributes with whitespace or the forward slash. We’re entering strange parsing territory. For some sites, this will be a trip to the undiscovered country.
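To see why this matters for pattern-based filters, here is a minimal sketch in Python. The regex is a hypothetical filter written for this illustration, not one taken from a real product, and both payloads are illustrative strings built from the syntax rules just described. The filter assumes whitespace always precedes an attribute name, so it flags the conventional form and misses the compact one.

```python
import re

# Hypothetical naive filter: assumes whitespace follows the element name and
# precedes every attribute name, as it would in tidy markup.
NAIVE_HANDLER = re.compile(r"<\w+\s+[^>]*\son\w+\s*=", re.IGNORECASE)


def naive_filter_flags(markup: str) -> bool:
    """Return True when the naive pattern sees an event-handler attribute."""
    return NAIVE_HANDLER.search(markup) is not None


# Illustrative payloads (assumptions for this sketch).
conventional = '<img src="x" onerror=alert(9)>'
compact = '<img/src="x"onerror=alert(9)>'

print(naive_filter_flags(conventional))  # True
print(naive_filter_flags(compact))       # False
```

Loosening the pattern to accept a forward slash or a closing quote as a delimiter is possible, but each such patch moves the regex closer to a hand-rolled parser.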
Delimiters are fun to play with. Here’s a case where empty quotes separate the element name from an attribute. Note the difference in value delineation. The id attribute has an unquoted value, so we separate it from the subsequent attribute with a space. The href has an empty value delimited with quotes. The parser doesn’t need whitespace after a quoted value, so we put onclick immediately after.
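A tokenizer that follows the markup's own delimiter rules is less easily fooled by this than a whitespace-based pattern. The sketch below uses Python's standard html.parser module to list event-handler attributes. Treat it as an illustration under assumptions: HandlerFinder and find_handlers are hypothetical names, the payload is an illustrative variant that keeps a conventional space after the element name, and Python's parser is lenient but is not guaranteed to match any particular browser's behavior.

```python
from html.parser import HTMLParser


class HandlerFinder(HTMLParser):
    """Collects (tag, attribute, value) for attributes whose name starts with 'on'."""

    def __init__(self):
        super().__init__()
        self.handlers = []

    def handle_starttag(self, tag, attrs):
        # HTMLParser lowercases tag and attribute names before calling this.
        for name, value in attrs:
            if name.startswith("on"):
                self.handlers.append((tag, name, value))


def find_handlers(markup: str):
    finder = HandlerFinder()
    finder.feed(markup)
    finder.close()
    return finder.handlers


# Illustrative payload: unquoted id, empty quoted href, onclick with no
# whitespace in front of it.
payload = '<a id=a href=""onclick=alert(9)>foo</a>'
print(find_handlers(payload))  # [('a', 'onclick', 'alert(9)')]
```

The parser reports the onclick attribute because it ends the href value at the closing quote, not at the next space.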
User-Agents try their best to make sites work. As a consequence, they’ll interpret markup in surprising ways. Here’s an example that mixes start and end tag tokens in order to deliver an XSS payload:
We can adjust the end tag if there’s a filter watching for </script>. Note there is a space between the last
Successful HTML injection thrives on bad markup to bypass filters and take advantage of browser quirks. Here’s another case where the browser accepts an incorrectly terminated tag. If the site turns the following payload’s %0d%0a into \r\n (carriage return, line feed) when it places the payload into HTML, then the browser might execute the
Or you might be able to separate the lack of closing > character from the alert function with an intermediate HTML comment:
The way browsers deal with whitespace is a notorious source of security problems. The Samy worm exploited IE’s tolerance for splitting a
<div id=mycode style="BACKGROUND: url('java script:eval(document.all.mycode.expr)')" expr="alert(9)"></div>
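A filter that looks for the literal string javascript: will not see a scheme name that has a line break in the middle of it. The following Python sketch compares a literal check with one that first removes ASCII control characters and spaces. Both function names are hypothetical, and the style value reuses the example above with the break written as \n. It is a blocklist-style illustration only; an allowlist of permitted schemes is generally the sturdier design.

```python
import re

# ASCII control characters and the space character (0x00-0x20).
_IGNORABLE = re.compile(r"[\x00-\x20]+")


def naive_has_js_scheme(value: str) -> bool:
    """Literal substring check, as a simplistic filter might do."""
    return "javascript:" in value.lower()


def normalized_has_js_scheme(value: str) -> bool:
    """Remove whitespace and control characters before checking."""
    collapsed = _IGNORABLE.sub("", value).lower()
    return "javascript:" in collapsed


style = "BACKGROUND: url('java\nscript:eval(document.all.mycode.expr)')"

print(naive_has_js_scheme(style))       # False
print(normalized_has_js_scheme(style))  # True
```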
Or we can throw an entity into the attribute list. The following is bad markup. But if it’s bad markup that bypasses a filter, then it’s a good injection.
User-Agents are often forced to second-guess the intended layout of error-ridden pages. HTML5 brings more sanity to parsing markup. But we still don’t have a mechanism to help browsers distinguish between typos, intended behavior, and HTML injection attacks. There’s no equivalent to prepared statements for SQL.
- Test with multiple browsers. A payload that takes advantage of a rendering quirk for browser A isn’t going to exhibit security problems if you’re testing with browser B.
- Prefer parsing to regex patterns. Regexes may be as effective as they are complex, but you pay a price for complexity. Trying to read someone else’s regex, or even maintaining your own, becomes more error-prone as the pattern becomes longer.
- Encode characters. You’ll be more successful at blocking HTML injection attacks if you consistently apply encoding rules for characters like < and > and prevent quotes from breaking attribute values.
- Enforce rules strictly. Ambiguity for browsers enables them to recover from errors gracefully. Ambiguity for security weakens the system.
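The encoding advice in this list can be sketched with Python's standard library. The render_comment helper and its template are hypothetical and exist only to show user-supplied text landing in both an attribute value and element content. The sketch assumes every attribute value in the template is wrapped in double quotes, since encoding alone does not protect an unquoted value.

```python
import html


def encode_for_html(value: str) -> str:
    # html.escape converts &, <, and >; quote=True also converts " and '.
    return html.escape(value, quote=True)


def render_comment(author: str, body: str) -> str:
    # Attribute values are always quoted, so an encoded quote cannot end them.
    return '<p title="{}">{}</p>'.format(
        encode_for_html(author), encode_for_html(body)
    )


# Illustrative inputs that try to break out of the attribute and the element.
author = '"onmouseover=alert(9) x="'
body = "<script>alert(9)</script>"
print(render_comment(author, body))
# <p title="&quot;onmouseover=alert(9) x=&quot;">&lt;script&gt;alert(9)&lt;/script&gt;</p>
```

Applying the same function at every output point is what makes the encoding consistent; a single unencoded sink is enough for an injection.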
HTML injection attacks try to bypass filters in order to deliver a payload that a browser will render. Security filters should be strict, but not so myopic that they miss “improper” HTML constructs that a browser will happily render.