Login forms

Designing a web application scanner is easy. A good design requires a few sentences; a great design might need two paragraphs or so. It’s easy to find messages on e-mail lists that describe the One True Way to scan a web site.

Implementing a scanner is hard. The core of a web vulnerability scanner performs two functions: find a link, test that link. The task of finding links falls to a crawling engine. The crawler must be fundamentally strong, otherwise links will be missed, and a missed link is an untested link. Untested links leave holes in the site coverage, which raises uncertainty about the state of the site’s security.

That said, it’s rarely necessary to hit every link of a web site in order to adequately scan it. Security testing requires comprehensive coverage of the site’s functionality, which is different from covering every single link. A SQL injection vulnerability in the thread ID of a forum can be found by crawling a few sample discussion threads. It’s not necessary to fully enumerate 100,000 threads about nerfing warlocks or debating Mal vs. Kirk.
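
One way to picture "coverage of functionality" rather than coverage of every link is to group links by shape and keep only a few samples of each shape. The following is a minimal sketch of that idea, not how any particular scanner does it. The helper names (link_signature, sample_links), the forum.example URLs, and the choice of three samples per shape are all assumptions made for illustration.

```python
import re
from collections import defaultdict
from urllib.parse import urlsplit, parse_qsl

def link_signature(url):
    """Reduce a URL to its shape: host, path with digits masked, and sorted parameter names."""
    parts = urlsplit(url)
    path = re.sub(r"\d+", "N", parts.path)
    names = sorted({name for name, _ in parse_qsl(parts.query, keep_blank_values=True)})
    return (parts.netloc.lower(), path, tuple(names))

def sample_links(urls, per_signature=3):
    """Keep at most per_signature URLs for each shape, preserving crawl order."""
    seen = defaultdict(int)
    kept = []
    for url in urls:
        sig = link_signature(url)
        if seen[sig] < per_signature:
            seen[sig] += 1
            kept.append(url)
    return kept

# Hypothetical forum: 100,000 threads that differ only in the tid value.
threads = [f"http://forum.example/viewthread.php?tid={n}" for n in range(1, 100001)]
links = threads + ["http://forum.example/search.php?q=warlock"]

for url in sample_links(links):
    print(url)
```

Here the crawler would visit three sample threads plus the search page. The trade-off is that a shape-based filter can also discard a link that genuinely behaves differently from its siblings, so the sample size is a knob, not a guarantee.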

In addition to crawling strategies, scanners must also be able to crawl a site as an authenticated user. Maintaining an authenticated state requires coordinating several pieces of information (tracking the session cookie, avoiding logout links). But first the scanner must find and submit the login form.

Simple login forms have a text field, password field, and submit button. The HTML standard provides the markup to create these forms. The standard only defines syntax, not usage. This gives web developers leeway to abuse HTML through ignorance, inefficiency, and what can only be termed outright malice.
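
For the simple case, finding the login form can be as plain as looking for a form that contains a password field. Here is a minimal sketch using the HTML parser from Python's standard library. The class and function names (FormCollector, find_login_forms) and the sample page are assumptions for illustration. It reads static markup only, so anything a script adds at runtime is invisible to it.

```python
from html.parser import HTMLParser

class FormCollector(HTMLParser):
    """Collect each form on a page along with its input fields."""

    def __init__(self):
        super().__init__()
        self.forms = []
        self._current = None

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "form":
            self._current = {
                "action": attrs.get("action") or "",
                "method": (attrs.get("method") or "get").lower(),
                "fields": [],
            }
            self.forms.append(self._current)
        elif tag == "input" and self._current is not None:
            self._current["fields"].append({
                "name": attrs.get("name"),
                "type": (attrs.get("type") or "text").lower(),
                "value": attrs.get("value") or "",
            })

    def handle_endtag(self, tag):
        if tag == "form":
            self._current = None

def find_login_forms(html):
    """Return the forms that contain at least one password field."""
    collector = FormCollector()
    collector.feed(html)
    return [
        form for form in collector.forms
        if any(field["type"] == "password" for field in form["fields"])
    ]

page = """
<form action="/search"><input type="text" name="q"><input type="submit" value="Go"></form>
<form action="/login" method="post">
  <input type="text" name="user">
  <input type="password" name="pass">
  <input type="submit" value="Log in">
</form>
"""

login = find_login_forms(page)
print(login[0]["action"], [f["name"] for f in login[0]["fields"]])
```

The heuristic works for well-behaved markup. It starts to fall apart as soon as the password field and the real submission target live in different forms.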

Consider the login form created by Sun’s OpenSSO Enterprise 8.0. The HTML roughly breaks down to the following:

<head>
<script language="JavaScript">
var defaultBtn = 'Submit';
var elmCount = 0;
/** submit form with default command button */
function defaultSubmit() {
  LoginSubmit(defaultBtn);
}
</script>
</head>
<body>
<script language="javascript">
elmCount++;
</script>
<form name="frm1" action="blank" onSubmit="defaultSubmit(); return false;" method="post">
User Name: <input type="text" name="IDToken1" id="IDToken1" value="" class="TxtFld">
</form>
<form name="frm2" action="blank" onSubmit="defaultSubmit(); return false;" method="post">
Password: <input type="password" name="IDToken2" id="IDToken2" value="" class="TxtFld">
</form>
<form name="Login" action="/login/UI/Login?AuthCookie=..."  method="post">
<script language="javascript">
if (elmCount != null) {
  for (var i = 0; i < elmCount; i++) {
    document.write("<input name=\"IDToken" + i + "\" type=\"hidden\">");
  }
  document.write("<input name=\"IDButton" + "\" type=\"hidden\">");
}
</script>
<input type="hidden" name="goto" value="aHR0cHM6Ly93d3cuZGVhZGxpZXN0d2ViYXR0YWNrcy5jb20vc2VjcmV0L2xpbmsvaW4vYmFzZS82NC8=">
<input type="hidden" name="SunQueryParamsString" value="">
<input type="hidden" name="encoded" value="true">
<input type="hidden" name="gx_charset" value="UTF-8">
</form>
</body>

So we have a single page with three forms, two of which have no purpose other than to display a form field, and a final one with JavaScript whose sole purpose is to copy values from the other two forms into its own hidden fields. There’s also a <script> tag dedicated to nothing more than incrementing an element counter. Not only might this offend the sensibilities of JavaScript developers who appreciate more programmatic approaches as found in jQuery or Prototype, it also causes headaches for web security scanners. The first two forms’ actions are “blank” and their onSubmit handlers always return false. That should at least tell the scanner not to submit them directly, but that assumption might prove false since the site may go into an error state if it receives an incomplete set of form fields.
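
A scanner without a JavaScript engine could try to reconstruct the intended submission from the static markup. The sketch below treats any form whose onsubmit handler returns false as a display-only shell and folds its fields into the first form that can actually be posted. It leans on a loud assumption: that the page's script copies each visible field into a hidden field of the same name, which the markup above suggests but does not prove. The names FormScanner, is_display_only, and build_login_post are invented for illustration, and the sample page is a cut-down version of the markup above.

```python
from html.parser import HTMLParser
from urllib.parse import urlencode

class FormScanner(HTMLParser):
    """Record the action, onsubmit handler, and inputs of every form on a page."""

    def __init__(self):
        super().__init__()
        self.forms = []
        self._current = None

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)  # the parser lowercases attribute names
        if tag == "form":
            self._current = {
                "action": attrs.get("action") or "",
                "onsubmit": attrs.get("onsubmit") or "",
                "inputs": [],
            }
            self.forms.append(self._current)
        elif tag == "input" and self._current is not None:
            self._current["inputs"].append(
                (attrs.get("name"), (attrs.get("type") or "text").lower(), attrs.get("value") or "")
            )

    def handle_endtag(self, tag):
        if tag == "form":
            self._current = None

def is_display_only(form):
    """Heuristic: the handler cancels submission, so a browser never posts this form itself."""
    return "returnfalse" in "".join(form["onsubmit"].lower().split())

def build_login_post(html, username, password):
    """Return (action, urlencoded body) for the first postable form, or None if there is none."""
    scanner = FormScanner()
    scanner.feed(html)
    shells = [form for form in scanner.forms if is_display_only(form)]
    targets = [form for form in scanner.forms if not is_display_only(form)]
    if not targets:
        return None
    target = targets[0]
    # Start with the target form's own static fields (hidden inputs and the like).
    data = {name: value for name, _kind, value in target["inputs"] if name}
    # Assumption: the page's script copies visible fields into same-named hidden fields.
    for form in shells:
        for name, kind, _value in form["inputs"]:
            if not name:
                continue
            if kind == "password":
                data[name] = password
            elif kind == "text":
                data[name] = username
    return target["action"], urlencode(data)

page = """
<form name="frm1" action="blank" onSubmit="defaultSubmit(); return false;" method="post">
<input type="text" name="IDToken1" value="">
</form>
<form name="frm2" action="blank" onSubmit="defaultSubmit(); return false;" method="post">
<input type="password" name="IDToken2" value="">
</form>
<form name="Login" action="/login/UI/Login?AuthCookie=..." method="post">
<input type="hidden" name="encoded" value="true">
</form>
"""

action, body = build_login_post(page, "scanner", "s3cret")
print(action)
print(body)
```

Fields that only exist after document.write runs, such as IDButton in the markup above, never show up in a static parse. A scanner that needs them has to execute the page's script or be told about them, which is exactly the kind of customization that erodes a generic login mechanism.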

Even uglier login form patterns exist in the wild. In some cases the login form is wrapped within its own <html> element, nested inside the body of the page:

<html>
<head></head>
<body>
...other content
<div>
<html><form>
...
</form></html>
</div>
...other content
</body>
</html>

Some forms use unnamed input fields and programmatically enumerate them via JavaScript upon submission:

Username: <input type="text" value="">
Password: <input type="password" value="">

Then there’s the __doPostBack function in ASP.NET sites along with their penchant for multiple submit buttons (e.g. one for authentication, another for a search). Now the scanner has to identify the salient fields for authentication and hit the correct submit button; it’s no good to fill out the username and password only to submit the search button.
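
Picking the right button can be approached as a scoring problem. The sketch below assumes the scanner has already flattened a form's controls into (type, name, label) tuples in document order. It rewards buttons whose name or label sounds like a login, penalizes ones that sound like a search, and prefers the nearest button after the password field. The control names imitate ASP.NET naming but are made up, as are the word lists and weights.

```python
import re

# Assumed vocabularies; real sites will need more entries and other languages.
LOGIN_WORDS = re.compile(r"log\s?[io]n|sign\s?in|authenticate", re.IGNORECASE)
SEARCH_WORDS = re.compile(r"search|find", re.IGNORECASE)

def pick_login_button(controls):
    """Choose the submit control most likely to send the credentials.

    controls is a list of (type, name, label) tuples in document order.
    Returns the chosen tuple, or None if there is no password field or no submit button.
    """
    password_index = next(
        (i for i, control in enumerate(controls) if control[0] == "password"), None
    )
    if password_index is None:
        return None
    best, best_score = None, None
    for i, (kind, name, label) in enumerate(controls):
        if kind != "submit":
            continue
        text = f"{name} {label}"
        score = 0
        if LOGIN_WORDS.search(text):
            score += 10
        if SEARCH_WORDS.search(text):
            score -= 10
        # Prefer the closest button after the password field; buttons before it cost extra.
        distance = i - password_index
        score -= distance if distance > 0 else 5 + abs(distance)
        if best_score is None or score > best_score:
            best, best_score = (kind, name, label), score
    return best

controls = [
    ("text", "ctl00$txtSearch", ""),
    ("submit", "ctl00$btnSearch", "Search"),
    ("text", "ctl00$Main$txtUser", ""),
    ("password", "ctl00$Main$txtPass", ""),
    ("submit", "ctl00$Main$btnLogin", "Log In"),
]

print(pick_login_button(controls))
```

With these controls the login button wins on both wording and position. When the labels are images or the buttons are scripted links rather than submit inputs, a scoring approach like this has much less to work with.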

Sure, a user could manually coax the scanner through any of these login processes, but that places an unnecessary burden on the user’s time. This is less of a problem when dealing with a single web site, but becomes overwhelming when trying to scan dozens or even hundreds of web applications.

These types of logins also highlight the difficulty scanners have with understanding the logic of a web page, let alone the logic of a group of pages or some workflow within the site.
It’s still possible to automate the login process for these forms, but doing so requires customization at the expense of having a generic authentication mechanism. In the end, dealing with login pages often provides insight into the madness of HTML editing (it’s hard to call some of these methods programming) and the bizarre steps developers will take just to “make it work.”

Scanners should automate the crawl and test phases as much as possible. After all, it’s dangerous to tie too much of a scan’s effectiveness to the user’s knowledge of web security. It may not be every day that a web developer answers your question about the robots.txt file with, “I don’t know what that is,” but it’s a good idea to have a scanner that will be comprehensive and accurate regardless of whether the user knows the UTF-7 encoding for an angle bracket or wonders why web sites don’t just strip the alert function to prevent XSS attacks.