Battling Geologic Time

65 million years ago, dinosaurs ruled the earth. (Which also seems about the last time I wrote something new here.)

In 45 million lines of code, Windows XP dominated the desktop. Yes, it had far too many security holes and people held onto it for far too long, even after Microsoft tried to pull support for the first time. But its longevity is still a testament to a certain measure of success.

Much of today’s web still uses code that dates from the dawn of internet time, some new code is still written by dinosaurs, and even more code is written by the avian descendants of dinosaurs. These birds flock to new languages and new frameworks. Yet, looking at some of the trivial vulns that emerge (like hard-coded passwords and SQL built from string concatenation), it seems the bird brain hasn’t evolved as much security knowledge as we might wish.

I’m a fan of dead languages. I’ve mentioned before my admiration of Latin (as well as Harry Potter Latin). And hieroglyphs have an attractive mystery to them. This appreciation doesn’t carry over to Perl. (I wish I could find the original comment that noted an obfuscated Perl contest is a redundant effort.)

But I do love regular expressions. I’ve crafted, tweaked, optimized, and obscured my fair share of regexes over the years. And I’ve discovered the performance benefits of pcre_study() and JIT compilation mode.

Yet woe betide anyone using regexes as a comprehensive parser (especially for HTML). And if you’re trying to match quoted strings, be prepared to deal with complexities that turn a few-character pattern into a monstrous composition.
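To see how quickly a quoted-string pattern grows, here is a minimal sketch in JavaScript. Both patterns and the sample input are made up for illustration; they aren't taken from any particular scanner.

```javascript
// Naive: a double quote, a run of non-quote characters, a double quote.
const naive = /"[^"]*"/;

// Escape-aware: inside the quotes accept either a character that is neither
// a quote nor a backslash, or a backslash followed by any one character.
const escapeAware = /"(?:[^"\\]|\\.)*"/;

// Sample source text containing a string literal with escaped quotes.
const input = String.raw`say("a \"quoted\" word", x)`;

const naiveMatch = input.match(naive)[0];
const escapeAwareMatch = input.match(escapeAware)[0];

console.log(naiveMatch);       // stops at the first escaped quote
console.log(escapeAwareMatch); // spans the whole string literal
```

Even the second pattern ignores single-quoted strings, multi-line literals, and each language's own escape rules, which is where the monstrous part begins.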

Seeing modern-day humans still rely on poorly written regexes to conduct code scanning made me wonder how little mammals have advanced beyond the dinosaurs of prehistory. They might not be burning themselves with fire, but they’re burning their chances of accurate, effective scans.

That was how I discovered pfff and its companion, sgrep. At the SOURCE Seattle conference this year I spoke a little about lessons learned from regexes and the advancements possible should you desire to venture into the realm of OCaml: SOURCE Seattle 2015 – Code Scanning. Who knows, if you can conquer fire you might be able to handle stone tools.

Bad Code Entitles Good Exploits

I have yet to create a full taxonomy of the mistakes developers make that lead to insecure code.
As a brief note towards that effort, here’s an HTML injection (aka cross-site scripting) example that’s due to a series of tragic assumptions that conspire to not only leave the site vulnerable, but waste lines of code doing so.

The first clue lies in the querystring’s state parameter. The site renders the state‘s value into a title element. Naturally, a first probe for HTML injection would be attempting to terminate that tag. If successful, then it’s trivial to append arbitrary markup such as <script> tags. A simple probe looks like this:

The site responds by stripping the payload’s </title> tag (plus any subsequent characters). Only the text leading up to the injected tag is rendered within the title.


This seems to have effectively countered the attack and not exposed any vuln. Of course, if you’ve been reading this blog for any length of time, you’ll know this trope of deceitful appearances always leads to a vuln. That which seems secure shatters under scrutiny.

The developers knew that an attacker might try to inject a closing </title> tag. Consequently, they created a filter to watch for such things and strip them. This could be implemented as a basic case-insensitive string comparison or a trivial regex.

And it could be bypassed by just a few characters.

Consider the following closing tags. Regardless of whether they seem surprising or silly, the extraneous characters are meaningless to HTML yet meaningful to our exploit because they foil the assumption that regexes make good parsers.

</title id="">

After inspecting how the site responds to each of the tags, it’s apparent that the site’s filter only expected a so-called “good” </title> tag. Browsers don’t care about an attribute on the closing tag. (They’ll ignore such characters as long as they don’t violate parsing rules.)
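A guess at what such a filter might look like, sketched in JavaScript. The function name and the regex are assumptions; the site's real code isn't visible from the outside. The sketch strips a literal closing title tag plus everything after it, which matches the observed behavior.

```javascript
// Hypothetical filter: remove a literal </title> tag and all text after it.
// Flags: i = case-insensitive, s = let "." match newlines too.
function stripClosingTitle(value) {
  return value.replace(/<\/title>.*$/is, "");
}

// The "good" closing tag is caught and the payload is cut off.
const blocked = stripClosingTitle("abc</title><script>alert(9)</script>");

// An attribute on the closing tag means the pattern never matches.
const bypass = stripClosingTitle('abc</title id=""><img src=x onerror=alert(9)>');

console.log(blocked); // abc
console.log(bypass);  // payload passes through untouched
```

The filter only knows one spelling of the closing tag. The browser accepts many.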

Next, we combine the filter bypass with a payload. In this case, we’ll use an image onerror event.

The attack works! We should have been less sloppy and added an opening <TITLE> tag to match the newly orphaned closing one. A good exploit should not leave the page messier than it was before.

<TITLE>abc</title id="a"><img src=x onerror=alert(9)> Vulnerable & Exploited Information Resource Center</TITLE>

The tragedy of this vuln is that it proves the site’s developers were aware of the concept of HTML injection exploits, but failed to grasp the fundamental characteristics of the vuln. The effort spent blocking an attack (i.e. countering an injected closing tag) not only wasted lines of code on an incorrect fix, but left the naive developers with a false sense of security. The code became more complex and less secure.

The mistake also highlights the danger of assuming that well-formed markup is the only kind of markup. Browsers are capricious beasts; they must dance around typos, stomp upon (or skirt around) errors, and walk bravely amongst bizarrely nested tags. This syntactic havoc is why regexes are notoriously worse at dealing with HTML than proper parsers.

There’s an ancillary lesson here in terms of automated testing (or quality manual pen testing, for that matter). A scan of the site might easily miss the vuln if it uses a payload that the filter blocks, or doesn’t apply any attack variants. This is one way sites “become” vulnerable when code doesn’t change, but attacks do.

And it’s one way developers must change their attitudes from trying to outsmart attackers to focusing on basic security principles.
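One such basic principle is encoding output for the context it lands in. The following is a minimal sketch for the title case, assuming the value ends up in a text node; the helper names escapeHtml and renderTitle are made up for this example, and attribute or URL contexts would need their own encoding.

```javascript
// Replace the characters that have syntactic meaning in HTML text and
// attribute values with their entity equivalents.
function escapeHtml(text) {
  const entities = { "&": "&amp;", "<": "&lt;", ">": "&gt;", '"': "&quot;", "'": "&#39;" };
  return String(text).replace(/[&<>"']/g, (ch) => entities[ch]);
}

// Hypothetical rendering step: the state value becomes the title's text.
function renderTitle(state) {
  return "<title>" + escapeHtml(state) + "</title>";
}

const page = renderTitle('abc</title id="a"><img src=x onerror=alert(9)>');
console.log(page);
```

No filter had to guess which closing tags are dangerous. Every angle bracket becomes inert text, so the payload shows up in the title bar instead of the DOM.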

A Lesser XSS Attack Greater Than Your Regex Security

I know what you’re thinking. “Did my regex block six XSS attacks or five?” You’ve got to ask yourself one question: “Do I feel lucky?” Well, do ya, punk?

Maybe you read a few HTML injection (cross-site scripting) tutorials and think a regex solves this problem. Maybe. Let’s revisit that thinking. We’ll need an attack vector. It could be a URL parameter, form field, header, or any other part of an HTTP request.

Choose an Attack Vector

Many web apps implement a search functionality. That’s an ideal attack vector because the nature of a search box is to accept an arbitrary string, then display the search term along with any relevant results. It’s the display, or reflection, of the search term that often leads to HTML injection.

For example, the following screenshots show how Google reflects the search term “html injection attack” at the bottom of its results page, along with the text node created in the HTML source.

[Screenshots: Google search box, Google search results, Google search HTML source]

Here’s another example that shows how Twitter reflects the search term “deadliestwebattacks” in its results page, along with the text node created in the HTML source.

[Screenshots: Twitter search results, Twitter search HTML source]

Let’s take a look at another site with a search box. Don’t worry about the text (it’s a Turkish site, the words are basically “search” and “results”). First, we try a search term, “foo”, to check if the site echoes the term into the response’s HTML. Success! It appears in two places: a title attribute and a text node.

<a title="foo için Arama Sonuçları">Arama Sonuçları : "foo"</a>

Next, we probe the page for tell-tale validation and output encoding weaknesses that indicate the potential for this vulnerability to be present. In this case, we’ll try a fake HTML tag, “<foo/>”.

<a title="<foo/> için Arama Sonuçları">Arama Sonuçları : "<foo/>"</a>

The site inserts the tag directly into the response. The <foo/> tag is meaningless for HTML, but the browser recognizes that it has the correct mark-up for a self-closing tag. Looking at the rendered version displayed by the browser confirms this:

Arama Sonuçları : ""

The “<foo/>” term isn’t displayed because the browser interprets it as a tag. It creates a DOM node of <foo> as opposed to placing a literal “<foo/>” into the text node between <a> and </a>.

Inject a Payload

The next step is to find a tag with semantic meaning for a browser. An obvious choice is to try “<script>” as a search term since that’s the containing element for JavaScript.

<a title="<[removed]> için Arama Sonuçları">Arama Sonuçları : "<[removed]>"</a>

The site’s developers seem to be aware of the risk of writing raw “<script>” elements into search results. In the title attribute, they replaced the angle brackets with HTML entities and replaced “script” with “[removed]”.

A good hacker would continue to probe the search box with different kinds of payloads. Since it seems impossible to execute JavaScript within a <script> element, we’ll try JavaScript execution within the context of an element’s event handler.

Try Alternate Payloads

Here’s a payload that uses the onerror attribute of an <img> element to execute a function:

<img src="x" onerror="alert(9)">

We inject the new payload and inspect the page’s response. We’ve completely lost the attributes, but the element was preserved:

<a title="<img> için Arama Sonuçları">Arama Sonuçları : "<img>"</a>

So, let’s modify our payload a bit. We condense it to a format that remains valid (i.e. a browser interprets it and it doesn’t violate the HTML spec). This step just demonstrates an alternate syntax with the same semantic meaning.


<img/src="x"onerror=alert(9)>

Unfortunately, the site stripped the onerror function the same way it did for the <script> tag.

<a title="<img/src="x"on[removed]=alert(9)>">Arama Sonuçları : "<img/src="x"on[removed]=alert(9)>"</a>

Additional testing indicates the site apparently does this for any of the onfoo event handlers.

Refine the Payload

We’re not defeated yet. The fact that the site is looking for malicious content implies that it’s relying on regular expressions to blacklist common attacks.

Oh, how I love regexes. I love writing them, optimizing them, and breaking them. Regexes excel at pattern matching, but fail miserably at parsing. And parsing is fundamental to working with HTML.

So, let’s unleash a mighty anti-regex hack. I’d call for a drum roll to build the suspense, but the technique is too trivial for that. All we do is add a greater than (>) symbol:


<img/src="x>"onerror=alert(9)>

Look what happens to the site. We’ve successfully injected an <img> tag. The browser parses the element, but it fails to load the image called “x>” so it triggers the error handler, which pops up a friendly alert.

<a title="<img/src=">"onerror=alert(9)> için Arama Sonuçları">Arama Sonuçları : "<img/src="x>"onerror=alert(9)>"</a>


Why does this happen? I don’t have first-hand knowledge of the specific regex, but I can guess at its intention.

HTML tags start with the < character, followed by an alpha character, followed by zero or more attributes (with tokenization properties that create things like name/value pairs), and close with the > character. It’s likely the regex was only searching for “on…” handlers within the context of an element, i.e. between < and > (the start and end tokens). A > character inside an attribute value doesn’t close the element.

<tag attribute="x>"...onevent=code>

The browser’s parsing model understood the quoted string was a value token. It correctly handled the state transitions between element start, element name, attribute name, attribute value, and so on. The parser consumed each character and interpreted it based on the context of its current state.
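The state-tracking idea can be sketched in a few lines of JavaScript. This is a deliberately simplified two-state scanner (inside a quoted value, or not), nowhere near a real HTML tokenizer; the function name findTagEnd is invented for this example.

```javascript
// Return the index of the ">" that ends the tag beginning at `start`,
// skipping any ">" that sits inside a quoted attribute value.
// Returns -1 if the tag is never closed.
function findTagEnd(html, start) {
  let quote = null; // the open quote character, or null when outside a value
  for (let i = start + 1; i < html.length; i++) {
    const ch = html[i];
    if (quote) {
      if (ch === quote) quote = null; // closing quote: leave the value state
    } else if (ch === '"' || ch === "'") {
      quote = ch;                     // opening quote: enter the value state
    } else if (ch === ">") {
      return i;                       // a ">" outside quotes ends the tag
    }
  }
  return -1;
}

const payload = '<img/src="x>"onerror=alert(9)>';
const naiveEnd = payload.indexOf(">");  // first ">" anywhere in the string
const realEnd = findTagEnd(payload, 0); // first ">" outside a quoted value

console.log(naiveEnd, realEnd); // 11 29
```

The naive answer stops inside the src value. The state-aware answer lands on the last character, which is where the browser ends the element too.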

The site’s poorly-formed regex didn’t create a sophisticated enough state machine to handle the “x>” properly. (Regexes have their own internal state machines for pattern matching. I’m referring to the pattern’s implied state machine for HTML.) It looked for a start token, then switched to consuming characters until it found an event handler or encountered an end token — ignoring the possible interim states associated with tokenization based on spaces, attributes, or invalid markup.
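Here's one way the filter could have been written that reproduces the behavior observed above. It is a guess, expressed in JavaScript: the function name and both patterns are assumptions, not the site's actual code.

```javascript
// Hypothetical filter: treat everything from "<" to the next ">" as a tag,
// then neutralize any on* event handler found inside that span.
function stripHandlers(input) {
  return input.replace(/<[^>]*>/g, (tag) =>
    tag.replace(/on\w+\s*=/gi, "on[removed]=")
  );
}

// The handler sits between "<" and the first ">", so it is caught.
const caught = stripHandlers('<img/src="x"onerror=alert(9)>');

// The ">" inside the quoted value ends the regex's idea of the tag early,
// so the handler falls outside the span that gets inspected.
const missed = stripHandlers('<img/src="x>"onerror=alert(9)>');

console.log(caught); // <img/src="x"on[removed]=alert(9)>
console.log(missed); // <img/src="x>"onerror=alert(9)>
```

The [^>]* portion is the implied state machine: it has exactly one state, and a quoted > knocks it out of that state too soon.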

This was only a small step into the realm of HTML injection. For example, the web site reflected the payload on the immediate response to the attack’s request. In other scenarios the site might hold on to the payload and insert it into a different page. It’s still reflected by the site, but not on the immediate response. That would make it a persistent type of vuln because the attacker does not have to re-inject the payload each time the affected page is viewed. For example, lots of sites have phrases like, “Welcome back, Mike!”, where they print your first name at the top of each page. If you told the site your name was “<script>alert(9)</script>”, then you’d have a persistent HTML injection exploit.

Rethink Defense

For developers:

  • When user-supplied data is placed in a web page, encode it for the appropriate context. For example, use percent-encoding (e.g. < becomes %3c) for an href attribute; use HTML entities (e.g. < becomes &lt;) for text nodes.
  • Prefer inclusion lists (match what you expect) to exclusion lists (predict what you think should be blocked).
  • Work with a consistent character encoding. Unpredictable transcoding between character sets makes it harder to ensure validation filters treat strings correctly.
  • Prefer parsing to pattern matching. However, pre-HTML5 parsing has its own pitfalls, such as browsers’ inconsistent handling of whitespace within tag names. HTML5 codified explicit rules for acceptable markup.
  • If you use regexes, test them thoroughly. Sometimes a “dumb” regex is better than a “smart” one. In this case, a dumb regex would have just looked for any occurrence of “onerror” and rejected it.
  • Prefer to reject invalid input rather than massage it into something valid. This avoids a cuckoo-like attack where a single-pass filter would remove any occurrence of the word “script” from a payload like “<scrscriptipt>”, unintentionally creating a <script> tag.
  • Prefer to reject invalid character code points (and unexpected encoding) rather than substitute or strip characters. This prevents attacks like null-byte insertion (e.g. stripping null from <%00script> after performing the validation check), overlong UTF-8 encoding (e.g. %c0%bcscript%c0%be), or Unicode encoding when expecting UTF-8 (e.g. %u003cscript%u003e).
  • Make sure to escape metacharacters correctly.
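The cuckoo-like attack from the list above is easy to reproduce. The sketch below, in JavaScript, contrasts a single-pass stripping filter with a reject-by-default inclusion list. Both functions are hypothetical, and the inclusion pattern is just one plausible rule for a first-name field.

```javascript
// Massaging input: remove every occurrence of "script" in a single pass.
function stripOnce(input) {
  return input.replace(/script/gi, "");
}

// Rejecting input: accept only what a first name is expected to look like
// (letters, spaces, apostrophes, hyphens; 1 to 64 characters).
function isAcceptableName(input) {
  return /^[A-Za-z][A-Za-z '-]{0,63}$/.test(input);
}

const payload = "<scrscriptipt>alert(9)</scrscriptipt>";

// Removing the inner "script" splices the outer fragments back together.
const rebuilt = stripOnce(payload);
console.log(rebuilt); // <script>alert(9)</script>

console.log(isAcceptableName("Mike"));  // true
console.log(isAcceptableName(payload)); // false
```

Stripping hands the attacker a tool for building strings. Rejecting gives them nothing to work with.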

For more examples of payloads that target different HTML contexts or employ different anti-regex techniques, check out the HTML Injection Quick Reference (HIQR). In particular, experiment with different payloads from the “Anti-regex patterns” at the bottom of Table 2.


Regex-based security filters sink without anchors

In June 2010 the Stanford Web Security Research Group released a study of clickjacking countermeasures employed across Alexa Top-500 web sites. It’s an excellent survey of different approaches taken by web developers to prevent their sites from being framed (i.e. subsumed by an <iframe> tag). To better understand the dangers of framing pages, read the paper and check out Chapter Three of The Book.

One interesting point emphasized in the paper is how easily regular expressions can be misused or misunderstood as security filters. Regexes can be used to create positive or negative security models — either match acceptable content (whitelisting) or match attack patterns (blacklisting). Inadequate regexes lead to more vulnerabilities than just clickjacking.

One of the biggest mistakes made in regex patterns is leaving them unanchored. Anchors determine the span of a pattern’s match against an input string. The ‘^‘ anchor matches the beginning of a line. The ‘$‘ anchor matches the end of a line. (Just to confuse the situation, when ‘^‘ appears as the first character inside square brackets it indicates negation, e.g. ‘[^a]+‘ means match one or more characters that are not ‘a‘.)
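A quick illustration of those three uses in JavaScript. The patterns and inputs are arbitrary examples chosen only to show the behavior.

```javascript
const unanchored = /abc/;    // may match anywhere in the input
const anchored = /^abc$/;    // must match the input from start to end
const noLetterA = /^[^a]+$/; // inside brackets, ^ negates: only non-"a" characters

const results = {
  unanchoredHit: unanchored.test("xxabcxx"), // true
  anchoredMiss: anchored.test("xxabcxx"),    // false
  anchoredHit: anchored.test("abc"),         // true
  negatedHit: noLetterA.test("xyz"),         // true
  negatedMiss: noLetterA.test("xaz"),        // false
};

console.log(results);
```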

Consider the document.referrer check shown in Section 3.5 of the Stanford paper. The weak regex is highlighted below:

if(window.self != &&

As the study’s authors point out (and anyone who is using regexes as part of a security or input validation filter should know), the pattern is unanchored and therefore easily bypassed. The site developers intended to check the referrer for links like these:

Since the pattern isn’t anchored, it will look through the entire input string for a match, which leaves the attacker with a simple bypass technique. In the following example, the pattern matches the text in red — clearly not the developers’ intent:


The devs wanted to match a URI whose domain included “”, but the pattern would match anywhere within the referrer string.

The regex would be improved by requiring the pattern to begin at the first character of the input string. The new, anchored pattern would look more like this:
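As a sketch of the difference, here is an unanchored referrer pattern next to an anchored one in JavaScript. The domain example.com is a stand-in chosen for this illustration, as are both patterns and both URLs; none of them come from the paper.

```javascript
// "example.com" is a placeholder domain for illustration only.
const weak = /https?:\/\/[^?\/]+\.example\.com\//;
const anchored = /^https?:\/\/[^?\/]+\.example\.com\//;

const legit = "https://www.example.com/page";
// Attacker-controlled page that smuggles the expected text into its query string.
const framed = "http://evil.lair/?http://www.example.com/";

console.log(weak.test(legit));      // true
console.log(weak.test(framed));     // true: matched inside the query string
console.log(anchored.test(legit));  // true
console.log(anchored.test(framed)); // false: the match must start at character one
```

Anchoring closes this particular hole, but the pattern still only approximates URI grammar. Treat it as an improvement rather than a guarantee.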


The same concept applies to input validation for form fields and URI parameters. Imagine a web developer, we’ll call him Wilberforce for alliterative purposes, who wishes to validate U.S. zip codes submitted in credit card forms. The simplest pattern would check for five digits, using any of these approaches:


At first glance the pattern works. Wilberforce even tests some basic XSS and SQL injection attacks with nefarious payloads like <script src=...> and 'OR 19=19. The regex rejects them all.

Then our attacker, let’s call her Agatha, happens to come by the site. She’s a little savvier and, whether or not she knows exactly what the validation pattern looks like, tries a few malicious zip codes (the five digits are underlined):


Poor Wilberforce’s unanchored pattern finds a matching string in all three cases, thereby allowing the malicious content through the filter and enabling Agatha to compromise the site. If the pattern had been anchored to match the complete input string from beginning to end then the filter wouldn’t have failed so spectacularly:
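Wilberforce's mistake can be reproduced in a few lines of JavaScript. The two patterns and the sample inputs below are illustrative assumptions, not a record of what he or Agatha actually typed.

```javascript
const unanchoredZip = /\d{5}/;  // five digits somewhere in the input
const anchoredZip = /^\d{5}$/;  // five digits and nothing else

const inputs = [
  "90210",                              // legitimate zip code
  "90210<script src=//evil.lair/x.js>", // digits followed by markup
  "' OR 19=19 -- 90210",                // SQL fragment ending in digits
  "9021",                               // too short
];

const unanchoredResults = inputs.map((s) => unanchoredZip.test(s));
const anchoredResults = inputs.map((s) => anchoredZip.test(s));

console.log(unanchoredResults); // [ true, true, true, false ]
console.log(anchoredResults);   // [ true, false, false, false ]
```

His test payloads failed only because they contained no run of five digits. Give the unanchored pattern its five digits and it will wave anything through.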


Unravelling Strings

Even basic string-matching approaches can fall victim to the unanchored problem; after all, they’re nothing more than regex patterns without the syntax for wildcards, alternation, and grouping. Let’s go back to the Stanford paper for an example of a document.referrer check based on the JavaScript String object’s indexOf function. This function returns the first position of the argument in the input string, or -1 if the argument isn’t found:

if(top.location != location) {
  if(document.referrer &&
     document.referrer.indexOf("") == -1)

Sigh. As long as the document.referrer contains the string “” the anti-framing code won’t trigger. For Agatha, the bypass is as simple as putting her booby-trapped clickjacking page on a site with a domain name like “” or maybe using a URI fragment, http://evil.lair/ The developers neglected to ensure that the host from the referrer URI ends in the expected domain rather than merely contains it.

The previous sentence is very important. Read it again. The referrer string isn’t supposed to end in that domain; the referrer’s host is supposed to end with it. That’s an important distinction considering the bypass techniques we’ve already mentioned:
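To make the distinction concrete, here is a small JavaScript sketch. The domain example.com, the function name, and every URL are placeholders invented for this example, and the second bypass uses a query string rather than a fragment.

```javascript
// "example.com" is a placeholder for the site's own domain.
const SITE = "example.com";

// The flawed check: is the domain text present anywhere in the referrer?
function referrerLooksTrusted(referrer) {
  return referrer.indexOf(SITE) !== -1;
}

const genuine = "http://www.example.com/";
const lookalikeHost = "http://example.com.evil.lair/";
const smuggledQuery = "http://evil.lair/?example.com";

console.log(referrerLooksTrusted(genuine));       // true
console.log(referrerLooksTrusted(lookalikeHost)); // true: contained in the host
console.log(referrerLooksTrusted(smuggledQuery)); // true: contained in the query

// Checking the end of the whole string instead is no better.
console.log(smuggledQuery.endsWith(SITE));        // true
```

Both bypasses satisfy "contains", and the second also satisfies "string ends with". Only a comparison against the parsed host can tell them apart from the genuine referrer.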

Parsers Before Patterns

Input validation filters often require an understanding of a data type’s grammar. Sometimes this is simple, such as a five-digit zip code. More complex cases, such as email addresses and URIs, require that the input string be parsed before pattern matching is applied.

The previous indexOf string example failed because it doesn’t actually parse the referrer’s URI; it just looks for the presence of a string. The regex pattern in the example was superior because it at least tried to understand the URI grammar by matching content between the URI’s scheme (http or https) and the first slash (/)1.

A good security filter must understand the context of the pattern to be matched. The improved referrer check is shown below. Notice that the get_hostname_from_url function now uses a regex to extract the host name from the referrer’s URI and the string comparison ensures the host name either exactly matches or ends with “”. (You could quibble that the regex in get_hostname_from_url isn’t anchored, but in this case the pattern works because it’s not possible to smuggle malicious content inside the URI’s scheme. The pattern would fail if it returned the last match instead of the first match. And, yes, the typo in the comment in the killFrames function is in the original JavaScript.)

function killFrames() {
  if (top.location != location) {
    if (document.referrer) {
      var referrerHostname = get_hostname_from_url(document.referrer);
      var strLength = referrerHostname.length;
      if ((strLength == 11) && (referrerHostname != "")){  // to take care of url - length of "" string is 11.
      } else if (strLength != 11 && referrerHostname.substring(referrerHostname.length - 12) != "") {  // length og "" string is 12.
      }
    }
  }
}

function get_hostname_from_url(url) {
  return url.match(/:\/\/(.[^/?]+)/)[1];
}
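With a modern runtime the same idea can be expressed by handing the parsing to the platform. This is a sketch that swaps the site's hand-rolled regex for the standard URL class available in browsers and Node.js; the placeholder domain example.com and the function name hostIsTrusted are assumptions for illustration.

```javascript
// "example.com" is a placeholder for the site's own domain.
const SITE = "example.com";

// Parse first, then compare: the host must equal the domain exactly
// or be a subdomain of it (note the leading dot in the suffix check).
function hostIsTrusted(referrer) {
  let host;
  try {
    host = new URL(referrer).hostname;
  } catch (e) {
    return false; // unparseable referrers are rejected outright
  }
  return host === SITE || host.endsWith("." + SITE);
}

console.log(hostIsTrusted("http://example.com/"));           // true
console.log(hostIsTrusted("https://www.example.com/a?b=c")); // true
console.log(hostIsTrusted("http://example.com.evil.lair/")); // false
console.log(hostIsTrusted("http://evil.lair/?example.com")); // false
console.log(hostIsTrusted("http://notexample.com/"));        // false
```

The leading dot in the suffix comparison matters: without it, a host like notexample.com would pass.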


Regexes and string matching functions are ubiquitous throughout web applications. If you’re implementing security filters with these functions, keep these points in mind:

Normalize the character set. Ensure the string functions and regex patterns match the character encoding, e.g. multi-byte string functions for multi-byte sequences.

Always match the entire input string. Anchor patterns to the start (‘^‘) and end (‘$‘) of input strings. If you expect input strings to include multiple lines, understand how multiline (?m) and single line (?s) flags will affect the pattern — if you’re not sure then treat it as a single line. Where appropriate to the context, the results of string matching functions should be tested to see if the match occurred at the beginning, within, or end of a string.

Prefer a positive security model over a negative one. Whitelist content that you expect to receive and reject anything that doesn’t fit. Whitelist filters should be as strict as possible to avoid incorrectly matching malicious content. If you go the route of blacklisting content, make the patterns as lenient as possible to better match unexpected scenarios — an attacker may have an encoding technique or JavaScript trick you’ve never heard of.

Consider a parser instead of a regex. If you want to match a URI attribute, make sure your pattern extracts the right value. URIs can be complex. If you’re trying to use regexes to parse HTML content…good luck.

Don’t shy away from regexes because their syntax looks daunting; just remember to test your patterns against a wide array of malicious and valid input strings.
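The multiline caveat in the points above deserves a concrete test case. In JavaScript the multiline behavior is switched on with the m flag after the pattern; the zip code pattern and the input below are made up for this example.

```javascript
const singleLine = /^\d{5}$/;  // anchors bind to the start and end of the input
const multiLine = /^\d{5}$/m;  // anchors bind to the start and end of each line

// A valid-looking first line followed by a payload on the second line.
const input = "90210\n<script>alert(9)</script>";

const singleLineResult = singleLine.test(input);
const multiLineResult = multiLine.test(input);

console.log(singleLineResult); // false
console.log(multiLineResult);  // true: the first line alone satisfies the pattern
```

An anchored pattern with the wrong flag is an unanchored pattern with extra steps. Put inputs with embedded newlines in your test set.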

1 Technically, the pattern should match the host portion of the URI’s authority. Check out RFC 3986 for specifics, especially the regexes mentioned in Appendix B.