• When designing security filters against HTML injection you need to outsmart the attacker, not the browser. HTML’s syntax is more forgiving of mis-nested tags, unterminated elements, and entity-encoding compared to formats like XML. This is a good thing, because it ensures a User-Agent renders a best-effort layout for a web page rather than bailing on errors or typos that would leave visitors staring at blank pages or incomprehensible error messages.

    It’s also a bad thing, because User-Agents have to make educated guesses about a page author’s intent when they encounter unexpected markup. This is the kind of situation that leads to browser quirks and inconsistent behavior.

    One of HTML5’s improvements is a codified algorithm for parsing content. In the past, browsers not only had quirks, but developers would write content specifically to take advantage of those idiosyncrasies – giving us a world where sites worked well with one and only one version of Internet Explorer (or Mozilla, etc.). A great deal of blame lies at the feet of site developers who refused to consider good HTML design patterns in favor of the principle of Code Relying on Advanced Persistent Stubbornness.

    Parsing Disharmony

    Untidy markup is a security hazard. It makes HTML injection vulnerabilities more difficult to detect and block, especially for regex-based countermeasures.

    Regular expressions have irregular success as security mechanisms for HTML. While regexes excel at pattern-matching, they fare miserably in semantic parsing. Once you start building a state mechanism for element start characters, token delimiters, attribute names, and so on, anything other than a narrowly-focused regex becomes unwieldy at best.

    First, let’s take a look at some simple elements with uncommon syntax. Regular readers will recognize a favorite XSS payload of mine, the img tag:

    <img/alt=""src="."onerror=alert(9)>
    

    Spaces aren’t required to delimit attribute name/value pairs when the value is marked by quotes. Also, the element name may be separated from its attributes with whitespace or the forward slash. We’re entering strange parsing territory. For some sites, this will be a trip to the undiscovered country.
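
    As a rough illustration of why whitespace assumptions fail, here is a minimal sketch of a filter that expects conventional spacing. The `naiveHandlerFilter` pattern is invented for this example (it is not taken from any real product); it assumes whitespace follows the element name and precedes the event handler.

```javascript
// Hypothetical filter: assumes attributes are always separated by whitespace.
const naiveHandlerFilter = /<[a-z]+\s+[^>]*\s+on[a-z]+\s*=/i;

const conventional = '<img src="." onerror=alert(9)>';
const compact = '<img/alt=""src="."onerror=alert(9)>';

// true: the conventional spacing matches the pattern.
console.log(naiveHandlerFilter.test(conventional));
// false: the payload has no whitespace at all, so the filter never sees the handler.
console.log(naiveHandlerFilter.test(compact));
```

    Loosening the pattern to accept a slash or a quote as a delimiter patches this one payload, but each patch is another hand-written parser state.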

    Delimiters are fun to play with. Here’s a case where empty quotes separate the element name from an attribute. Note the difference in value delineation. The id attribute has an unquoted value, so we separate it from the subsequent attribute with a space. The href has an empty value delimited with quotes. The parser doesn’t need whitespace after a quoted value, so we put onclick immediately after.

    <a""id=a href=""onclick=alert(9)>foo</a>
    

    User-Agents try their best to make sites work. As a consequence, they’ll interpret markup in surprising ways. Here’s an example that mixes start and end tag tokens in order to deliver an XSS payload:

    <script/<a>alert(9)</script>
    

    We can adjust the end tag if there’s a filter watching for </script>. Note there is a space between the last </script and </a>.

    <script/<a>alert(9)</script </a>
    

    Successful HTML injection thrives on bad mark-up to bypass filters and take advantage of browser quirks. Here’s another case where the browser accepts an incorrectly terminated tag. If the site turns the following payload’s %0d%0a into \r\n (carriage return, line feed) when it places the payload into HTML, then the browser might execute the alert function.

    <script%0d%0aalert(9)</script>
    

    Or you might be able to use an intermediate HTML comment to separate the unterminated <script tag from the alert function:

    <script%20<!--%20-->alert(9)</script>
    

    The way browsers deal with whitespace is a notorious source of security problems. The Samy worm exploited IE’s tolerance for splitting a javascript: scheme with a line feed.

    <div id=mycode style="BACKGROUND: url('java script:eval(document.all.mycode.expr)')"
    expr="alert(9)"></div>
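
    The same whitespace trick can be sketched against a string-matching check. This is only an illustration: `hasScriptScheme` and `stripTabsAndNewlines` are hypothetical helpers, and the payload writes the line feed explicitly as \n.

```javascript
// Hypothetical check: looks for the scheme as one contiguous string.
function hasScriptScheme(value) {
  return /javascript:/i.test(value);
}

// Remove the characters a lenient parser might skip inside a scheme.
function stripTabsAndNewlines(value) {
  return value.replace(/[\t\n\r]/g, "");
}

const style = "BACKGROUND: url('java\nscript:eval(document.all.mycode.expr)')";

console.log(hasScriptScheme(style));                       // false
console.log(hasScriptScheme(stripTabsAndNewlines(style))); // true
```

    The check only works when it sees the value the same way the browser will, which is the recurring theme of this section.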
    

    Or we can throw an entity into the attribute list. The following is bad markup. But if it’s bad markup that bypasses a filter, then it’s a good injection.

    <a href=""&amp;/onclick=alert(9)>foo</a>
    

    HTML entities have a special place within parsing and injection attacks. They’re most often used to bypass string-matching. For example, the following three JavaScript schemes use an entity for the “s” character:

    java&#115;cript:alert(9)
    java&#x73;cript:alert(9)
    java&#x0073;cript:alert(9)
    

    The danger with entities and parsing is that you must keep track of the context in which you decode them. But you also need to keep track of the order in which you resolve entities (or otherwise normalize data) and when you apply security checks. In the previous example, if you had checked for “javascript” in the scheme before resolving the entity, then your filter would have failed. Think of it as a time of check to time of use (TOCTOU) problem that’s affected by data transformation rather than the more commonly thought-of race condition.
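
    Here is a minimal sketch of that ordering problem. The `decodeNumericEntities` and `isScriptScheme` helpers are assumptions written for this example; the decoder only handles numeric character references that end with a semicolon, which is far less than a browser accepts.

```javascript
// Decode decimal (&#115;) and hexadecimal (&#x73;) character references.
// Named entities are deliberately out of scope for this sketch.
function decodeNumericEntities(text) {
  return text.replace(/&#(?:x([0-9a-f]+)|([0-9]+));/gi, (match, hex, dec) => {
    const codePoint = hex !== undefined ? parseInt(hex, 16) : parseInt(dec, 10);
    // Leave out-of-range references untouched rather than throwing.
    return codePoint <= 0x10ffff ? String.fromCodePoint(codePoint) : match;
  });
}

const isScriptScheme = (value) => /^\s*javascript:/i.test(value);

const payloads = [
  "java&#115;cript:alert(9)",
  "java&#x73;cript:alert(9)",
  "java&#x0073;cript:alert(9)",
];

for (const payload of payloads) {
  const checkedTooEarly = isScriptScheme(payload);
  const checkedAfterDecode = isScriptScheme(decodeNumericEntities(payload));
  // Each line prints: payload false true
  console.log(payload, checkedTooEarly, checkedAfterDecode);
}
```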

    Security

    User-Agents are often forced to second-guess the intended layout of error-ridden pages. HTML5 brings more sanity to parsing markup. But we still don’t have a mechanism to help browsers distinguish between typos, intended behavior, and HTML injection attacks. There’s no equivalent to prepared statements for SQL.

    • Fix the vulnerability, not the exploit. It’s not uncommon for developers to denylist a string like alert or javascript under the assumption that doing so prevents attacks. That sort of thinking mistakes the payload for the underlying problem. The problem is placing user-supplied data into HTML without taking steps to ensure the browser renders the data as text rather than markup.
    • Test with multiple browsers. A payload that takes advantage of a rendering quirk for browser A isn’t going to exhibit security problems if you’re testing with browser B.
    • Prefer parsing to regex patterns. Regexes may be as effective as they are complex, but you pay a price for complexity. Trying to read someone else’s regex, or even maintaining your own, becomes more error-prone as the pattern becomes longer.
    • Encode characters. You’ll be more successful at blocking HTML injection attacks if you consistently apply encoding rules for characters like < and > and prevent quotes from breaking attribute values.
    • Enforce rules strictly. Ambiguity for browsers enables them to recover from errors gracefully. Ambiguity for security weakens the system.

    HTML injection attacks try to bypass filters in order to deliver a payload that a browser will render. Security filters should be strict, but not so myopic that they miss “improper” HTML constructs that a browser will happily render.
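
    To make the “encode characters” advice concrete, here is a minimal sketch of entity encoding for text nodes and quoted attribute values. The `escapeHtml` name and the character map are assumptions for illustration; most template engines ship an equivalent, and other contexts (URLs, inline JavaScript) need different encoding.

```javascript
// Characters that let text break out into markup or out of a quoted attribute.
const HTML_ENTITIES = {
  "&": "&amp;",
  "<": "&lt;",
  ">": "&gt;",
  '"': "&quot;",
  "'": "&#39;",
};

// Replace every special character in a single pass.
function escapeHtml(text) {
  return String(text).replace(/[&<>"']/g, (ch) => HTML_ENTITIES[ch]);
}

const term = '<img/src="x>"onerror=alert(9)>';
console.log(`<a title="${escapeHtml(term)}">${escapeHtml(term)}</a>`);
```

    The encoded term can’t open a tag or close the title attribute, so the browser has nothing to interpret as markup.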

    • • •
  • I know what you’re thinking.

    “Did my regex block six XSS attacks or five?”

    You’ve got to ask yourself one question: “Do I feel lucky?”

    Well, do ya, punk?

    Maybe you read a few HTML injection (cross-site scripting) tutorials and think a regex solves this problem. Maybe. Let’s revisit that thinking. We’ll need an attack vector. It could be a URL parameter, form field, header, or any other part of an HTTP request.

    Choose an Attack Vector

    Many web apps implement a search functionality. That’s an ideal attack vector because the nature of a search box is to accept an arbitrary string, then display the search term along with any relevant results. It’s the display, or reflection, of the search term that often leads to HTML injection.

    For example, the following screenshots show how Google reflects the search term “html injection attack” at the bottom of its results page, along with the text node created in the HTML source.

    [Screenshots: Google search; Google search results; Google search HTML source]

    Here’s another example that shows how Twitter reflects the search term “deadliestwebattacks” in its results page, along with the text node created in the HTML source.

    [Screenshots: Twitter search; Twitter search HTML source]

    Let’s take a look at another site with a search box. Don’t worry about the text (it’s a Turkish site, the words are basically “search” and “results”). First, we try a search term, “foo”, to check if the site echoes the term into the response’s HTML. Success! It appears in two places: a title attribute and a text node.

    <a title="foo için Arama Sonuçları">Arama Sonuçları : "foo"</a>
    

    Next, we probe the page for tell-tale validation and output encoding weaknesses that indicate the potential for this vulnerability to be present. In this case, we’ll try a fake HTML tag, <foo/>.

    <a title="<foo/> için Arama Sonuçları">Arama Sonuçları : "<foo/>"</a>
    

    The site inserts the tag directly into the response. The <foo/> tag is meaningless for HTML, but the browser recognizes that it has the correct mark-up for a self-closing tag. Looking at the rendered version displayed by the browser confirms this:

    Arama Sonuçları : ""
    

    The <foo/> term isn’t displayed because the browser interprets it as a tag. It creates a DOM node of <foo> as opposed to placing a literal <foo/> into the text node between <a> and </a>.
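
    The probing step can be roughly sketched as a helper that reports how a probe string comes back. `classifyReflection` is a hypothetical name, and the response string is the snippet shown above rather than a live request.

```javascript
// Classify how a probe string is reflected in a response body.
function classifyReflection(responseHtml, probe) {
  const encoded = probe
    .replace(/&/g, "&amp;")
    .replace(/</g, "&lt;")
    .replace(/>/g, "&gt;");
  if (responseHtml.includes(probe)) return "raw";
  if (responseHtml.includes(encoded)) return "encoded";
  return "absent";
}

const response =
  '<a title="<foo/> için Arama Sonuçları">Arama Sonuçları : "<foo/>"</a>';

console.log(classifyReflection(response, "<foo/>")); // "raw"
```

    A “raw” result is the cue to keep testing; an “encoded” result suggests the site is at least treating that context as text.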

    Inject a Payload

    The next step is to find a tag with semantic meaning to the browser. An obvious choice is to try <script> as a search term since that’s the containing element for JavaScript.

    <a title="&lt;[removed]&gt; için Arama Sonuçları">Arama Sonuçları : "<[removed]>"</a>
    

    The site’s developers seem to be aware of the risk of writing raw <script> elements into search results. In the title attribute, they replaced the angle brackets with HTML entities and replaced “script” with “[removed]”.

    A persistent hacker would continue to probe the search box with different kinds of payloads. Since it seems impossible to execute JavaScript within a <script> element, we’ll try JavaScript execution within the context of an element’s event handler.

    Try Alternate Payloads

    Here’s a payload that uses the onerror attribute of an <img> element to execute a function:

    <img src="x" onerror="alert(9)">
    

    We inject the new payload and inspect the page’s response. We’ve completely lost the attributes, but the element was preserved:

    <a title="<img> için Arama Sonuçları">Arama Sonuçları : "<img>"</a>
    

    Let’s tweak the payload. We condense it to a format that remains valid to the browser and HTML spec. This demonstrates an alternate syntax with the same semantic meaning.

    <img/src="x"onerror=alert(9)>
    

    [Screenshot: HTML injection payload]

    Unfortunately, the site stripped the onerror function the same way it did for the <script> tag.

    <a title="<img/src="x"on[removed]=alert(9)>">Arama Sonuçları :
        "<img/src="x"on[removed]=alert(9)>"</a>
    

    Additional testing indicates the site apparently does this for any of the onfoo event handlers.

    Refine the Payload

    We’re not defeated yet. The fact that the site is looking for malicious content implies that it’s relying on a deny list of regular expressions to block common attacks.

    Ah, how I love regexes. I love writing them, optimizing them, and breaking them. Regexes excel at pattern matching, but fail miserably at parsing. And parsing is fundamental to working with HTML.

    So, let’s unleash a mighty regex bypass based on a trivial technique – add a greater than (>) symbol:

    <img/src="x>"onerror=alert(9)>
    

    [Screenshot: HTML injection payload with anti-regex]

    Look what happens to the site. We’ve successfully injected an <img> tag. The browser parses the element, but it fails to load the image called x> so it triggers the error handler, which pops up the alert.

    <a title="<img/src=">"onerror=alert(9)> için Arama Sonuçları">Arama Sonuçları :
        "<img/src="x>"onerror=alert(9)>"</a>
    

    [Screenshot: alert(9)]

    Why does this happen? I don’t have first-hand knowledge of the specific regex, but I can guess at its intention.

    HTML tags start with the < character, followed by an alpha character, followed by zero or more attributes (with tokenization properties that create things like name/value pairs), and close with the > character. It’s likely the regex was only searching for on... handlers within the context of an element, i.e. between the start and end tokens of < and >. A > character inside an attribute value doesn’t close the element.

    <tag attribute="x>" onevent=code>
    

    The browser’s parsing model understood the quoted string was a value token. It correctly handled the state transitions between element start, element name, attribute name, attribute value, and so on. The parser consumed each character and interpreted it based on the context of its current state.

    The site’s poorly-chosen regex didn’t create a sophisticated enough state machine to handle the x> properly. (Regexes have their own internal state machines for pattern matching. I’m referring to the pattern’s implied state machine for HTML.) It looked for a start token, then switched to consuming characters until it found an event handler or encountered an end token – ignoring the possible interim states associated with tokenization based on spaces, attributes, or invalid markup.
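
    To make that guess concrete, here is a sketch of a filter that behaves the way the site appeared to, next to a quote-aware scan for the end of a tag. Both are assumptions for illustration: `guessedFilter` is my reconstruction, not the site’s code, and `findTagEnd` is a simplification of the tokenizer states, not the HTML5 algorithm.

```javascript
// Guess at the filter: rewrite any on* handler found between "<" and the
// next ">" character, wherever that ">" happens to be.
function guessedFilter(input) {
  return input.replace(/(<[^>]*)on\w+(\s*=)/gi, "$1on[removed]$2");
}

// Quote-aware scan: a ">" inside a quoted attribute value does not end the tag.
function findTagEnd(html, start) {
  let quote = null;
  for (let i = start + 1; i < html.length; i++) {
    const ch = html[i];
    if (quote) {
      if (ch === quote) quote = null; // closing quote ends the value
    } else if (ch === '"' || ch === "'") {
      quote = ch; // entering a quoted value
    } else if (ch === ">") {
      return i; // first ">" outside of quotes
    }
  }
  return -1;
}

const blocked = '<img/src="x"onerror=alert(9)>';
const bypass = '<img/src="x>"onerror=alert(9)>';

console.log(guessedFilter(blocked)); // <img/src="x"on[removed]=alert(9)>
console.log(guessedFilter(bypass));  // unchanged: <img/src="x>"onerror=alert(9)>

// The regex stops at the first ">", the quote-aware scan at the last one.
console.log(bypass.indexOf(">"), findTagEnd(bypass, 0));
```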

    This was only a small step into the realm of HTML injection. For example, the web site reflected the payload on the immediate response to the attack’s request. In other scenarios the site might hold on to the payload and insert it into a different page. It’s still reflected by the site, but not on the immediate response. That would make it a persistent type of vuln because the attacker does not have to re-inject the payload each time the affected page is viewed. For example, lots of sites have phrases like, “Welcome back, Mike!”, where they print your first name at the top of each page. If you told the site your name was <script>alert(9)</script>, then you’d have a persistent HTML injection exploit.

    Rethink Defense

    For developers:

    • When user-supplied data is placed in a web page, encode it for the appropriate context. For example, use percent-encoding (e.g. < becomes %3c) for an href attribute; use HTML entities (e.g. < becomes &lt;) for text nodes.
    • Prefer inclusion lists (match what you want to allow) to exclusion lists (predict what you think should be blocked).
    • Work with a consistent character encoding. Unpredictable transcoding between character sets makes it harder to ensure validation filters treat strings correctly.
    • Prefer parsing to pattern matching. However, pre-HTML5 parsing has its own pitfalls, such as browsers’ inconsistent handling of whitespace within tag names. HTML5 codified explicit rules for acceptable markup.
    • If you use regexes, test them thoroughly. Sometimes a “dumb” regex is better than a “smart” one. In this case, a dumb regex would have just looked for any occurrence of “onerror” and rejected it.
    • Prefer to reject invalid input rather than massage it into something valid. This avoids a cuckoo-like attack where a single-pass filter would remove any occurrence of a script tag from a payload like <scr<script>ipt>, unintentionally creating a <script> tag.
    • Reject invalid character code points (and unexpected encoding) rather than substitute or strip characters. This prevents attacks like null-byte insertion, e.g. stripping null from <%00script> after performing the validation check, overlong UTF-8 encoding, e.g. %c0%bcscript%c0%be, or Unicode encoding (when expecting UTF-8), e.g. %u003cscript%u003e.
    • Escape metacharacters correctly.
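
    Two of the items above, rejecting instead of repairing and refusing invalid encodings, can be sketched in a few lines. The function names are hypothetical; `TextDecoder` with the `fatal` option is the standard API available in browsers and Node.js.

```javascript
// Single-pass stripping can assemble the very tag it tried to remove.
function stripScriptTags(input) {
  return input.replace(/<script>/gi, "");
}

// Rejecting leaves nothing for the attacker to reassemble.
function rejectScriptTags(input) {
  if (/<script/i.test(input)) throw new Error("rejected: script tag");
  return input;
}

// Strict UTF-8 decoding: invalid sequences throw instead of being replaced.
function decodeStrictUtf8(bytes) {
  const text = new TextDecoder("utf-8", { fatal: true }).decode(bytes);
  if (text.includes("\u0000")) throw new Error("rejected: null character");
  return text;
}

console.log(stripScriptTags("<scr<script>ipt>")); // <script>

try {
  // 0xC0 0xBC is an overlong encoding of "<".
  decodeStrictUtf8(Uint8Array.from([0xc0, 0xbc, 0x73]));
} catch (err) {
  console.log("rejected overlong sequence:", err.name);
}
```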

    For more examples of payloads that target different HTML contexts or employ different anti-regex techniques, check out the HTML Injection Quick Reference (HIQR). In particular, experiment with different payloads from the “Anti-regex patterns” at the bottom of Table 2.


    • • •
  • Escape from Normality

    Any John Carpenter fan knows the only way you’ll escape from New York is if Snake Plissken is there to get you out. When it comes to web security, don’t bother waiting for Kurt Russell’s help. You’re on your own.

    Maybe you read a book on web security. Maybe you even remembered some of it. Maybe all you know is how to use escape characters in JavaScript Strings. In any case, maybe you should make sure the maximum security application you’ve created is as strong as you think it is.

    The setup is simple: An app has a search box; it accepts queries via parameter q of a form, and rewrites the input box’s value with a one-line JavaScript call. Using JavaScript seems a little more complicated than updating the <input> field’s value directly, which would be as trivial as the following (with “abc” as the search term):

    <input id="searchResult" type="text" name="q" value="abc">
    

    It’s not necessarily a bad idea to update the element’s value with JavaScript. Building HTML on the server with string concatenation is a notorious vector for XSS. Writing the value with JavaScript might be more secure than rebuilding the HTML every time because the assignment avoids several encoding problems. This works if you’re keeping the HTML static and trading JSON messages with the server.

    On the other hand, if you move the server-side string concatenation from the <input> field to a <script> tag, then you’ve shifted the problem without stepping towards a solution. The <input> field’s value was delimited with quotation marks ("). The JavaScript code uses apostrophes (') to delimit the string.

    <script type='text/javascript'>
    document.getElementById('searchResult').value = 'abc';
    </script>
    

    Rather than strip apostrophes from the search variable’s value, the developers have decided to escape them with backslashes. Here’s how it’s expected to work when a user searches for abc'.

    document.getElementById('searchResult').value = 'abc\'';
    

    Escaping the payload’s apostrophe preserves the original string delimiters, prevents the JavaScript syntax from being manipulated, and blocks HTML injection attacks. So it seems. What if the escape is escaped? Say, by throwing in a backslash of your own to search for something like abc\'.

    document.getElementById('searchResult').value = 'abc\\'';
    

    The developers caught the apostrophe, but missed the backslash. When JavaScript tokenizes the string it sees the escape working on the second backslash instead of the apostrophe. This corrupts the syntax, as follows:

    //            ⬇ end of string token
    value = 'abc\\'';
    //             ⬆ dangling apostrophe
    

    From here we just start throwing HTML injection payloads against the app. JavaScript interprets \\ as a single backslash, accepts the apostrophe as the string terminator, and parses the rest of our payload.

    https://web.site/search?q=abc\';alert(9)//

    document.getElementById('searchResult').value = 'abc\\';alert(9)//';
    

    JavaScript’s semantics are lovely from the hacker’s perspective. Here’s an example payload using the String concatenation operator (+) to glue the alert function to the value:

    https://web.site/search?q=abc\'%2balert(9)//

    document.getElementById('searchResult').value = 'abc\\'+alert(9)//';
    

    Or we could try a payload that uses the modulo operator (%) between the String and our alert.

    abc\'%alert(9)//
    

    Maybe the developers added the alert function to a denylist, e.g. a regex for alert\(, by checking for an opening parenthesis. Look up the function in the window object’s property list; this makes it look like a string:

    abc\'%window["alert"](9)//
    

    What happens if the denylist contained the word alert altogether? Build the string character by character:

    abc\'%window[String.fromCharCode(0x61,0x6c,0x65,0x72,0x74)](9)//
    

    By now we’ve turned an evasion of an escaped apostrophe into an exercise in obfuscation and filter bypasses. Check out the HIQR for more anti-regex patterns and JavaScript obfuscation techniques.
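
    The escaping flaw at the heart of this walkthrough can be reproduced in a few lines. This is a sketch under assumptions: `naiveEscape` is my guess at what the app does, `saferEscape` only fixes the backslash problem (it still ignores newlines and a literal </script>), and a stub stands in for alert so the calls can be counted.

```javascript
// The flawed approach: escape apostrophes but not backslashes.
function naiveEscape(term) {
  return term.replace(/'/g, "\\'");
}

// Escape the escape character first, then the delimiter.
function saferEscape(term) {
  return term.replace(/\\/g, "\\\\").replace(/'/g, "\\'");
}

// Build the statement the way the page does, then run it with a stub alert
// so we can count calls instead of popping up a dialog.
function countAlerts(escape, term) {
  let calls = 0;
  const statement = "var value = '" + escape(term) + "';";
  try {
    new Function("alert", statement)(() => { calls += 1; });
  } catch (err) {
    // An error here means the payload broke the syntax without running.
  }
  return calls;
}

const payload = "abc\\';alert(9)//"; // the search term abc\';alert(9)//

console.log(countAlerts(naiveEscape, payload)); // 1
console.log(countAlerts(saferEscape, payload)); // 0
```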

    Let’s do a quick recap of some security concepts. In this case, the clear mistake was forgetting all the permutations of escape sequences in a JavaScript string. Here’s an additional checklist:

    • Normalize the data, whether this entails character set conversion, character encoding, substitution, or removal.
    • Apply security checks, preferring inclusion lists over exclusion lists (it’s a lot easier to guess what’s safe than assume what’s dangerous).
    • In the design phase, be suspicious of string concatenation. Figure out if there’s a safer method to bind user-supplied data to HTML.
    • In the design phase, make sure your security check’s assumptions match the context where the data will be written.
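
    One commonly used way to bind a value to an inline script more safely is to serialize it rather than escape it by hand. The sketch below is not what the app in this example did; `toScriptLiteral` is a hypothetical helper built on JSON.stringify, with extra replacements for characters that the HTML parser or older JavaScript engines treat specially.

```javascript
// Serialize a value as a JavaScript string literal for an inline <script>.
// JSON.stringify handles quotes, backslashes, and control characters; the
// extra replacements keep "<", U+2028, and U+2029 out of the literal.
function toScriptLiteral(value) {
  return JSON.stringify(String(value))
    .replace(/</g, "\\u003c")
    .replace(/\u2028/g, "\\u2028")
    .replace(/\u2029/g, "\\u2029");
}

const term = "abc\\';alert(9)//</script>";
const statement =
  "document.getElementById('searchResult').value = " + toScriptLiteral(term) + ";";

console.log(statement);
```

    Keeping the HTML static and fetching the search term as JSON avoids the inline script entirely, which is the safer design when it’s available.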

    Normalization is an important first step. Any time you transform data you should reapply security checks. Snake Plissken was never one for offering advice. Instead, think of The Hitchhiker’s Guide to the Galaxy and recall Trillian’s report as the Infinite Improbability Drive powers down (p. 61):

    “…we have normality, I repeat we have normality….Anything you still can’t cope with is therefore your own problem.”

    Good luck with normality, and trying to escape the right characters. Security isn’t certain, but one thing is, at least according to Queen – there’s “no escape from reality.”

    • • •