• Just as there can be appsec truths, there can be appsec laws.

    Science fiction author Arthur C. Clarke succinctly described the wondrous nature of technology in what has come to be known as Clarke’s Third Law (from a letter published in Science in January 1968):

    Any sufficiently advanced technology is indistinguishable from magic.

    The sentiment of that law can be found in an earlier short story by Leigh Brackett, “The Sorcerer of Rhiannon,” published in Astounding Science-Fiction Magazine in February 1942:

    Witchcraft to the ignorant… Simple science to the learned.

    With those formulations as our departure point, we can now turn towards crypto, browser technologies, and privacy.

    The Latinate Lex Cryptobellum:

    Any sufficiently advanced cryptographic escrow system is indistinguishable from ROT13.

    Or in Leigh Brackett’s formulation:

    Cryptographic escrow to the ignorant… Simple plaintext to the learned.

    A few Laws of Browser Plugins:

    Any sufficiently patched Flash is indistinguishable from a critical update.

    Any sufficiently patched Java is indistinguishable from Flash.

    A few Laws of Browsers:

    Any insufficiently patched browser is indistinguishable from malware.

    Any sufficiently patched browser remains distinguishable from a privacy-enhancing one.

    For what are browsers but thralls to Laws of Ads:

    Any sufficiently targeted ad is indistinguishable from chance.

    Any sufficiently distinguishable person’s browser has tracking cookies.

    Any insufficiently distinguishable person has privacy.

    Writing against deadlines:

    Any sufficiently delivered manuscript is indistinguishable from overdue.

    Which leads us to the foundational Zeroth Law of Content:

    Any sufficiently popular post is indistinguishable from truth.

    * * *
  • I know what you’re thinking.

    “Did my regex block six XSS attacks or five?”

    You’ve got to ask yourself one question: “Do I feel lucky?”

    Well, do ya, punk?

    Maybe you read a few HTML injection (cross-site scripting) tutorials and think a regex solves this problem. Maybe. Let’s revisit that thinking. We’ll need an attack vector. It could be a URL parameter, form field, header, or any other part of an HTTP request.
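
    For instance, any of these spots in a single request could carry the same probe (the host and parameter names below are just placeholders):

    GET /search?q=%3Cscript%3Ealert(9)%3C%2Fscript%3E HTTP/1.1
    Host: www.example.site
    User-Agent: Mozilla/5.0 <script>alert(9)</script>
    Referer: http://www.example.site/?<script>alert(9)</script>
    Cookie: lang=en<script>alert(9)</script>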

    Choose an Attack Vector

    Many web apps implement search functionality. That’s an ideal attack vector because the nature of a search box is to accept an arbitrary string, then display the search term along with any relevant results. It’s the display, or reflection, of the search term that often leads to HTML injection.

    For example, the following screenshots show how Google reflects the search term “html injection attack” at the bottom of its results page, and the text node created in the HTML source.

    [Screenshots: Google search, Google search results, Google search HTML source]

    Here’s another example that shows how Twitter reflects the search term “deadliestwebattacks” in its results page, and the text node created in the HTML source.

    [Screenshots: Twitter search, Twitter search HTML source]

    Let’s take a look at another site with a search box. Don’t worry about the text (it’s a Turkish site, the words are basically “search” and “results”). First, we try a search term, “foo”, to check if the site echoes the term into the response’s HTML. Success! It appears in two places: a title attribute and a text node.

    <a title="foo için Arama Sonuçları">Arama Sonuçları : "foo"</a>
    

    Next, we probe the page for tell-tale validation and output encoding weaknesses that indicate the potential for this vulnerability to be present. In this case, we’ll try a fake HTML tag, <foo/>.

    <a title="<foo/> için Arama Sonuçları">Arama Sonuçları : "<foo/>"</a>
    

    The site inserts the tag directly into the response. The <foo/> tag is meaningless for HTML, but the browser recognizes that it has the correct markup for a self-closing tag. Looking at the rendered version displayed by the browser confirms this:

    Arama Sonuçları : ""
    

    The <foo/> term isn’t displayed because the browser interprets it as a tag. It creates a DOM node of <foo> as opposed to placing a literal <foo/> into the text node between <a> and </a>.
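
    You can confirm this behavior yourself. Here’s a quick sketch for a browser’s JavaScript console (the markup string is just a stand-in for the site’s response):

    // Parse a snippet similar to the site's response and inspect the result.
    const doc = new DOMParser().parseFromString(
      '<a title="foo">Arama Sonuçları : "<foo/>"</a>', 'text/html');
    console.log(doc.querySelector('foo') !== null); // true: the parser built a <foo> element node
    console.log(doc.body.textContent);              // the literal string <foo/> never appears as text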

    Inject a Payload

    The next step is to find a tag with semantic meaning for a browser. An obvious choice is to try <script> as a search term since that’s the containing element for JavaScript.

    <a title="<[removed]> için Arama Sonuçları">Arama Sonuçları : "<[removed]>"</a>
    

    The site’s developers seem to be aware of the risk of writing raw <script> elements into search results. In the title attribute, they replaced the angle brackets with HTML entities and replaced “script” with “[removed]”.

    A good hacker would continue to probe the search box with different kinds of payloads. Since it seems impossible to inject JavaScript via a <script> element here, we’ll try executing JavaScript from an element’s event handler instead.

    Try Alternate Payloads

    Here’s a payload that uses the onerror attribute of an <img> element to execute a function:

    <img src="x" onerror="alert(9)">
    

    We inject the new payload and inspect the page’s response. We’ve completely lost the attributes, but the element was preserved:

    <a title="<img> için Arama Sonuçları">Arama Sonuçları : "<img>"</a>
    

    So, let’s modify our payload a bit. We condense it to a format that remains valid (i.e. a browser interprets it and it doesn’t violate the HTML spec). This step just demonstrates an alternate syntax with the same semantic meaning.

    <img/src="x"onerror=alert(9)>
    

    [Screenshot: HTML injection payload]

    Unfortunately, the site stripped the onerror handler the same way it did the <script> tag.

    <a title="<img/src="x"on[removed]=alert(9)>">Arama Sonuçları :
        "<img/src="x"on[removed]=alert(9)>"</a>
    

    Additional testing indicates the site does this for any of the onfoo event handlers.

    Refine the Payload

    We’re not defeated yet. The fact that the site is looking for malicious content implies that it’s relying on regular expressions to deny-list common attacks.

    Oh, how I love regexes. I love writing them, optimizing them, and breaking them. Regexes excel at pattern matching, but fail miserably at parsing. And parsing is fundamental to working with HTML.

    So, let’s unleash a mighty anti-regex hack. I’d call for a drum roll to build the suspense, but the technique is too trivial for that. All we do is add a greater than (>) symbol:

    <img/src="x>"onerror=alert(9)>
    

    [Screenshot: HTML injection payload with anti-regex]

    Look what happens to the site. We’ve successfully injected an <img> tag. The browser parses the element, but it fails to load the image called x> so it triggers the error handler, which pops up a friendly alert.

    <a title="<img/src=">"onerror=alert(9)> için Arama Sonuçları">Arama Sonuçları :
        "<img/src="x>"onerror=alert(9)>"</a>
    

    [Screenshot: alert(9) dialog]

    Why does this happen? I don’t have first-hand knowledge of the specific regex, but I can guess at its intention.

    HTML tags start with the < character, followed by an alpha character, followed by zero or more attributes (with tokenization properties that create things like name/value pairs), and close with the > character. It’s likely the regex was only searching for “on…” handlers within the context of an element, i.e. between < and > (the start and end tokens).

    But a > character inside an attribute value doesn’t close the element: <tag attribute="x>" ... onevent=code>. The browser’s parsing model understood the quoted string was a value token. It correctly handled the state transitions between element start, element name, attribute name, attribute value, and so on. The parser consumed each character and interpreted it based on the context of its current state.

    The site’s poorly-formed regex didn’t create a sophisticated enough state machine to handle the x> properly. (Regexes have their own internal state machines for pattern matching. I’m referring to the pattern’s implied state machine for HTML.) It looked for a start token, then switched to consuming characters until it found an event handler or encountered an end token – ignoring the possible interim states associated with tokenization based on spaces, attributes, or invalid markup.
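
    As a rough illustration, here’s a deny-list pattern in the same spirit (my guess at the filter’s intent, not the site’s actual regex) and how the extra > slips past it:

    // Guess at the filter: look for an on… handler between a tag's < and >.
    const denyList = /<\w+[^>]*\bon\w+\s*=/i;

    denyList.test('<img src="x" onerror="alert(9)">');  // true: caught
    denyList.test('<img/src="x"onerror=alert(9)>');     // true: still caught
    denyList.test('<img/src="x>"onerror=alert(9)>');    // false: [^>]* stops at the > inside the
                                                         // attribute value, so onerror is never reached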

    This was only a small step into the realm of HTML injection. For example, the web site reflected the payload on the immediate response to the attack’s request. In other scenarios the site might hold on to the payload and insert it into a different page. It’s still reflected by the site, but not on the immediate response. That would make it a persistent type of vuln because the attacker does not have to re-inject the payload each time the affected page is viewed. For example, lots of sites have phrases like, “Welcome back, Mike!”, where they print your first name at the top of each page. If you told the site your name was <script>alert(9)</script>, then you’d have a persistent HTML injection exploit.

    Rethink Defense

    For developers:

    • When user-supplied data is placed in a web page, encode it for the appropriate context. For example, use percent-encoding (e.g. < becomes %3c) for an href attribute; use HTML entities (e.g. < becomes &lt;) for text nodes. (See the sketch after this list.)
    • Prefer inclusion lists (match what you expect) to exclusion lists (predict what you think should be blocked).
    • Work with a consistent character encoding. Unpredictable transcoding between character sets makes it harder to ensure validation filters treat strings correctly.
    • Prefer parsing to pattern matching. However, pre-HTML5 parsing has its own pitfalls, such as browsers’ inconsistent handling of whitespace within tag names. HTML5 codified explicit rules for acceptable markup.
    • If you use regexes, test them thoroughly. Sometimes a “dumb” regex is better than a “smart” one. In this case, a dumb regex would have just looked for any occurrence of “onerror” and rejected it.
    • Prefer to reject invalid input rather than massage it into something valid. This avoids a cuckoo-like attack where a single-pass filter would remove any occurrence of the word “script” from a payload like <scrscriptipt>, unintentionally creating a <script> tag.
    • Prefer to reject invalid character code points (and unexpected encoding) rather than substitute or strip characters. This prevents attacks like null-byte insertion, e.g. stripping null from <%00script> after performing the validation check, overlong UTF-8 encoding, e.g. %c0%bcscript%c0%bd, or Unicode encoding (when expecting UTF-8), e.g. %u003cscript%u003e.
    • Escape metacharacters correctly.
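
    Here’s the first bullet’s context-aware encoding as a minimal sketch (the helper names are mine, not from any particular library):

    // Encode for an HTML text node: neutralize the characters the parser cares about.
    function encodeForHtmlText(value) {
      return value.replace(/&/g, '&amp;')
                  .replace(/</g, '&lt;')
                  .replace(/>/g, '&gt;')
                  .replace(/"/g, '&quot;');
    }

    // Encode for a URL component, e.g. a query value inside an href.
    function encodeForUrlComponent(value) {
      return encodeURIComponent(value);
    }

    const term = '<img/src="x>"onerror=alert(9)>';
    encodeForHtmlText(term);      // &lt;img/src=&quot;x&gt;&quot;onerror=alert(9)&gt;
    encodeForUrlComponent(term);  // %3Cimg%2Fsrc%3D%22x%3E%22onerror%3Dalert(9)%3E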

    For more examples of payloads that target different HTML contexts or employ different anti-regex techniques, check out the HTML Injection Quick Reference (HIQR). In particular, experiment with different payloads from the “Anti-regex patterns” at the bottom of Table 2.

    * * *
  • [Screenshot: iPhone zombie alert]

    This is how it began. Over two years ago I unwittingly planted the seeds of an undead horde into the pages of my book, Seven Deadliest Web Application Attacks.

    Only recently did I discover the rotted fruit of those seeds festering within the pages of Amazon.

    • Visit the book’s Amazon page.
    • Click on the “Look Inside!” feature.
    • Use the “Search Inside This Book” function to search for zombie.
    • Cower before the approaching mass of flesh-hungry brutes. Or just click OK a few times.

    On page 16 of the book there is an example of an HTML element’s syntax that forgoes the typical whitespace used to separate attributes. The element’s name is followed by a valid token separator, albeit one rarely used in hand-written HTML. The printed text contains this line:

    <img/src="."alt=""onerror="alert('zombie')"/>
    

    [Screenshot: onerror alert('zombie')]

    The “Search Inside” feature lists the matches for a search term. It makes the search term bold (i.e. adds <b> markup) and includes the context in which the search term was found (hence the surrounding text with the full <img/src="."alt="" /> element). Then it just pops the contextual find into the list, to be treated as any other “text” extracted from the book.
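
    A hypothetical sketch of that flow (my guess at the pattern, not Amazon’s actual code): bold the matched term with a string replacement, then hand the raw excerpt to the page.

    // Hypothetical sketch: highlight the term, then write the excerpt
    // into the results list without any HTML encoding.
    function renderSearchResult(container, excerpt, term) {
      const highlighted = excerpt.split(term).join('<b>' + term + '</b>');
      container.innerHTML += highlighted; // raw markup from the book survives intact
    }

    const results = document.createElement('div');
    document.body.appendChild(results);
    renderSearchResult(results, '<img src="." alt="" onerror="alert(\'zombie\')"/>', 'zombie');

    That kind of string-building turns the page 16 example into the markup below, which the browser treats as live HTML rather than book text.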

    <img src="." alt="" onerror="alert('<b>zombie</b>')"/>
    

    Finally, the matched term is placed within an anchor so you can click on it to find the relevant page. Notice that the <img> tag hasn’t been inoculated with HTML entities; it’s a classic HTML injection attack.

    <a ... href="javascript:void(0)">
    <span class="sitbReaderSearch-result-page">Page 16 ...t require spaces to
        delimit their attributes.
        **<img src="." alt="" onerror="alert('<b>zombie</b>')"/>** JavaScript
        doesn't have to...
    

    This has actually happened before. In December 2010 a researcher in Germany, Dr. Wetter, reported the same effect via <script> tags when searching for content in different security books. He even found <script> tags whose src attribute pointed to a live host, which made the flaw infinitely more entertaining.

    [Screenshot: iPad zombie alert]

    In fact, this was such a clever example of an unexpected vector for HTML injection that I included Dr. Wetter’s findings in the new Hacking Web Apps book (pages 40 and 41; the same <img...onerror> example shows up a little later on page 59). Behold, there’s a different infestation on page 31. Try searching for zombie again. This time the server responds with a JSON payload that contains straightforward <script> tags.

    This one was more tedious to track down. The <script> tags don’t appear in the search listing, but they do exist in the excerpt property of the JavaScript object that carries the matched context (and gets the bold tags applied) for each result:

    {...,"totalResults":2,"results":[[52,"Page 31","... encoded characters with
    their literal values:  <a href=\"http://\"/>**<script>alert('<b>zombie</b>')
    </script>**@some.site/\">search again</a>   Abusing the authority component
    of a ...", ...}
    

    I only discovered this injection flaw when I recently searched the older book for references to the living dead. (Yes, a weird – but true – reason.)

    How did this happen?

    One theory is that an anti-XSS filter relied on a deny list to catch potentially malicious tags. In this case, the <img> tag used a valid, but uncommon, token separator that would have confused any filter expecting whitespace delimiters. One common approach to regexes is to build a pattern based on the markup we expect browsers to parse. For example, a quick filter to catch <script> tags or opening tags (e.g. <iframe src...> or <img src...>) might look like this:

    <[[:alpha:]]+(\s+|>)
    

    A payload like <img/src> bypasses the regex and the browser correctly parses the syntax to create an image element. Of course, the src attribute fails to resolve, thereby triggering the onerror event handler, leading to yet another banal alert() declaring the presence of an HTML injection attack.
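
    For a quick check, here’s the same idea in JavaScript regex syntax (with [A-Za-z] standing in for the POSIX [[:alpha:]] class):

    // The filter's assumption: a tag name is followed by whitespace or the closing >.
    const filter = /<[A-Za-z]+(\s+|>)/;

    filter.test('<script>alert(9)</script>');                       // true: caught
    filter.test('<img src="." onerror="alert(9)">');                // true: caught
    filter.test('<img/src="."alt=""onerror="alert(\'zombie\')"/>'); // false: the / separator slips past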

    The <script> example is less clear without knowing more about the internals of the site. Perhaps a sequence of quote stripping and buggy regexes led the filter to treat the href as if it actually contained an authority section? I don’t have a good guess for this one.

    This highlights one problem of relying on regexes to parse a grammar like HTML. Yes, it’s possible to create strong, effective regexes. However, a regex does not represent the parsing state machine of browsers, including their quirks, exceptions, and “fix-up” behaviors. Fortunately, HTML5 brings a measure of sanity to this mess by clearly defining rules of interpretation. On the other hand, web history foretells that we’ll be burdened with legacy modes and outdated browsers for years to come. So, be wary of those regexes.

    No. That’s not it.

    How did this really happen?

    Well, I listen to various music while I write. You might argue that it was the demonic influence (or awesome Tony Iommi riffs) of Black Sabbath that ensorcelled the pages, or that Judas Priest made me do it. Or that March 30, 2010 – right around the book’s release – was a full moon. Maybe in one of Amazon’s vast, diversely stocked warehouses an oil drum with military markings spilled over, releasing a toxic gas that infected the books. Me? I think we’ll never know.

    Maybe one day we’ll be safe from this kind of attack. HTML5 and Content Security Policy make more sandboxes and controls available for implementing countermeasures. But I just can’t shake the feeling that somehow, somewhere, there’re more lurking about.
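
    For example, a response header along these lines (a minimal sketch, not a complete policy) disallows inline script entirely:

    Content-Security-Policy: default-src 'self'; script-src 'self'

    A browser enforcing that policy refuses to run inline event handlers and inline <script> blocks unless the policy opts back in with 'unsafe-inline', so payloads like the alert(9) and alert('zombie') examples above would be stopped even if the markup slipped past a filter.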

    Until then, the most secure solution is to –

    – wait, what’s that noise at the door…?

    They're coming to get you, Barbara.

    * * *