I know what you’re thinking.

“Did my regex block six XSS attacks or five?”

You’ve got to ask yourself one question: “Do I feel lucky?”

Well, do ya, punk?

Maybe you read a few HTML injection (aka cross-site scripting) tutorials and think a regex can solve this problem. Let’s revisit that thinking.

Choose an Attack Vector

Many web apps have a search feature. It’s an ideal attack vector because a search box is expected to accept an arbitrary string and then display the search term along with any relevant results. That rendering of the search term – arbitrary content from the user – is a classic HTML injection scenario.

For example, the following screenshot shows how Google reflects the search term “html injection attack” at the bottom of its results page and the HTML source that shows how it creates the text node to display the term.

Google search Google search results Google search html source

Here’s another example that shows how Twitter reflects the search term “deadliestwebattacks” in its results page and the text node it creates.

Twitter search Twitter search html source

Let’s take a look at another site with a search box. Don’t worry about the text. (It’s a Turkish site. The words are basically “search” and “results”). First, we search for “foo” to check if the site echoes the term into the response’s HTML. Success! It appears in two places: a title attribute and a text node.

<a title="foo için Arama Sonuçları">Arama Sonuçları : "foo"</a>

Next, we probe the page for tell-tale validation and output encoding weaknesses that indicate the potential for a vuln. We’ll try a fake HTML tag, <foo/>.

<a title="<foo/> için Arama Sonuçları">Arama Sonuçları : "<foo/>"</a>

The site inserts the tag directly into the response. The <foo/> tag is meaningless for HTML, but the browser recognizes that it has the correct mark-up for a self-enclosed tag. Looking at the rendered version displayed by the browser confirms this:

Arama Sonuçları : ""

The <foo/> term isn’t displayed because the browser interprets it as a tag. It creates a DOM node of <foo> as opposed to placing a literal <foo/> into the text node between <a> and </a>.

Inject a Payload

The next step is to find a tag with semantic meaning to the browser. An obvious choice is to try <script> as a search term since that’s the containing element for JavaScript.

<a title="<[removed]> için Arama Sonuçları">Arama Sonuçları : "<[removed]>"</a>

The site’s developers seem to be aware of the risk of writing raw <script> elements into search results. In the title attribute, they replaced the angle brackets with HTML entities and replaced “script”  with “[removed]”.

A persistent hacker would continue to probe the search box with different kinds of payloads. Since it seems impossible to execute JavaScript within a <script> element, we’ll try JavaScript execution within the context of an element’s event handler.

Try Alternate Payloads

Here’s a payload that uses the onerror attribute of an <img> element to execute a function:

<img src="x" onerror="alert(9)">

We inject the new payload and inspect the page’s response. We’ve completely lost the attributes, but the element was preserved:

<a title="<img> için Arama Sonuçları">Arama Sonuçları : "<img>"</a>

Let’s tweak the. We condense it to a format that remains valid to the browser and HTML spec. This demonstrates an alternate syntax with the same semantic meaning.

<img/src="x"onerror=alert(9)>

HTML injection payload

Unfortunately, the site stripped the onerror function the same way it did for the <script> tag.

<a title="<img/src="x"on[removed]=alert(9)>">Arama Sonuçları :
    "<img/src="x"on[removed]=alert(9)>"</a>

Additional testing indicates the site apparently does this for any of the onfoo event handlers.

Refine the Payload

We’re not defeated yet. The fact that the site is looking for malicious content implies that it’s relying on a deny list of regular expressions to block common attacks.

Ah, how I love regexes. I love writing them, optimizing them, and breaking them. Regexes excel at pattern matching and fail miserably at parsing. That’s bad since parsing is fundamental to working with HTML.

Now, let’s unleash a mighty regex bypass based on a trivial technique – the greater than (>) symbol:

<img/src="x>"onerror=alert(9)>

HTML injection payload with anti-regex

Look how the app handles this. We’ve successfully injected an <img> tag. The browser parses the element, but it fails to load the image called x> so it triggers the error handler, which pops up the alert.

<a title="<img/src=">"onerror=alert(9)> için Arama Sonuçları">Arama Sonuçları :
    "<img/src="x>"onerror=alert(9)>"</a>

alert(9)

Why does this happen? I don’t have first-hand knowledge of this specific regex, but I can guess at its intention.

HTML tags start with the < character, followed by an alpha character, followed by zero or more attributes (with tokenization properties that create things name/value pairs), and close with the > character. It’s likely the regex was only searching for on... handlers within the context of an element, i.e. between the start and end tokens of < and >. A > character inside an attribute value doesn’t close the element.

<tag attribute="x>" onevent=code>

The browser’s parsing model understood the quoted string was a value token. It correctly handled the state transitions between element start, element name, attribute name, attribute value, and so on. The parser consumed each character and interpreted it based on the context of its current state.

The site’s poorly-chosen regex didn’t create a sophisticated enough state machine to handle the x> properly. (Regexes have their own internal state machines for pattern matching. I’m referring to the pattern’s implied state machine for HTML.) It looked for a start token, then switched to consuming characters until it found an event handler or encountered an end token – ignoring the possible interim states associated with tokenization based on spaces, attributes, or invalid markup.

This was only a small step into the realm of HTML injection. For example, the site reflected the payload on the immediate response to the attack’s request. In other scenarios the site might hold on to the payload and insert it into a different page. That would make it a persistent type of vuln because the attacker does not have to re-inject the payload each time the affected page is viewed.

For example, lots of sites have phrases like, “Welcome back, Mike!”, where they print your first name at the top of each page. If you told the site your name was <script>alert(9)</script>, then you’d have a persistent HTML injection exploit.

Rethink Defense

For developers:

  • When user-supplied data is placed in a web page, encode it for the appropriate context. For example, use percent-encoding (e.g. < becomes %3c) for an href attribute; use HTML entities (e.g. < becomes &lt;) for text nodes.
  • Prefer inclusion lists (match what you want to allow) to exclusion lists (predict what you think should be blocked).
  • Work with a consistent character encoding. Unpredictable transcoding between character sets makes it harder to ensure validation filters treat strings correctly.
  • Prefer parsing to pattern matching. However, pre-HTML5 parsing has its own pitfalls, such as browsers’ inconsistent handling of whitespace within tag names. HTML5 codified explicit rules for acceptable markup.
  • If you use regexes, test them thoroughly. Sometimes a “dumb” regex is better than a “smart” one. In this case, a dumb regex would have just looked for any occurrence of “onerror” and rejected it.
  • Prefer to reject invalid input rather than massage it into something valid. This avoids a cuckoo-like attack where a single-pass filter would remove any occurrence of a script tag from a payload like <scr<script>ipt>, unintentionally creating a <script> tag.
  • Reject invalid character code points (and unexpected encoding) rather than substitute or strip characters. This prevents attacks like null-byte insertion, e.g. stripping null from <%00script> after performing the validation check, overlong UTF-8 encoding, e.g. %c0%bcscript%c0%bd, or Unicode encoding (when expecting UTF-8), e.g. %u003cscript%u003e.
  • Escape metacharacters correctly.

For more examples of payloads that target different HTML contexts or employ different anti-regex techniques, check out the HTML Injection Quick Reference (HIQR). In particular, experiment with different payloads from the “Anti-regex patterns” at the bottom of Table 2.

Page 71