Web Scanner Evaluation: Accuracy

This is the first in a series of essays describing suggested metrics for evaluating web application security scanners.

Accuracy measures the scanner’s ability to detect vulnerabilities. The basic function of a web scanner is to use automation to identify the same, or most of the same, vulnerabilities as a web security auditor. Rather than focus on the intricate details of specific vulnerabilities, this essay describes two major areas of evaluation for a scanner: its precision and faults.

Precise scanners produce results that not only indicate the potential for compromise of the web site, but provide actionable information that helps developers understand and address security issues.
Comprehensive tests should be measured on the different ways a vulnerability might manifest as opposed to establishing a raw count of payloads. The scope of a test is affected by several factors:
  • Alternate payloads (e.g. tests for XSS within an href attribute or the value attribute of an input element, using intrinsic events, within a JavaScript variable, or that create new elements)
  • Encoding and obfuscation (e.g. employing various character sets or encoding techniques to bypass possible filters)
  • Applicability to different technologies in a web platform (e.g. PHP superglobals, .NET VIEWSTATE)
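
To make the encoding factor concrete, here is a minimal sketch that takes a single probe string and produces a handful of encoded variants. The function name encoding_variants and the particular encodings chosen are assumptions for illustration only; they are not a catalog of what any given scanner sends, and a real test suite would cover far more cases (character sets, mixed case, overlong sequences, and so on).

```python
import html
import urllib.parse


def encoding_variants(payload: str) -> dict:
    """Return a few encoded forms of one payload (illustrative, not exhaustive)."""
    url_once = urllib.parse.quote(payload, safe="")
    return {
        "raw": payload,
        "url": url_once,
        # Encoding twice targets filters that decode only once.
        "double_url": urllib.parse.quote(url_once, safe=""),
        # Named entities for the HTML metacharacters.
        "html_entity": html.escape(payload, quote=True),
        # Numeric character references for every character.
        "html_numeric": "".join(f"&#{ord(c)};" for c in payload),
    }


if __name__ == "__main__":
    for name, value in encoding_variants("<img src=x onerror=alert(1)>").items():
        print(f"{name:12} {value}")
```

The point is that these five strings still represent one test (XSS via an injected element), which is why counting raw payloads says little about how comprehensive a scanner is.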
Robust detection mechanisms correctly confirm the presence of a vulnerability. They typically fall into one of three categories:
  • Signatures such as HTTP response code, string patterns, and reflected content (e.g. matching ODBC errors or looking for alert(1) strings)
  • Inference based on interpreting the results of a group of requests (e.g. “blind” SQL injection that affects the content of a response based on specific SQL query constructions)
  • Time-based tests measure the responsiveness of the web site to various payloads. Not only can they extend inference-based tests, but they can also indicate potential areas of Denial of Service if a payload can cause the site to spend more resources processing a request than an attacker requires to make the request (e.g. a query that performs a complete table scan of a database).
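
The three categories can be sketched as three small decision functions. This is a simplified illustration under stated assumptions: the signature list, the function names, and the 80% timing margin are all hypothetical, and the inference check compares whole response bodies for equality, which a real scanner would replace with something more tolerant of dynamic content.

```python
import re

# Hypothetical signature list; a real scanner would carry many more patterns.
SQL_ERROR_SIGNATURES = [
    re.compile(r"\[ODBC[^\]]*\]", re.IGNORECASE),
    re.compile(r"ORA-\d{5}"),
    re.compile(r"You have an error in your SQL syntax", re.IGNORECASE),
]


def matches_signature(body: str) -> bool:
    """Signature: does the response body contain a known error pattern?"""
    return any(sig.search(body) for sig in SQL_ERROR_SIGNATURES)


def infers_blind_injection(baseline: str, true_case: str, false_case: str) -> bool:
    """Inference: an always-true condition should leave the page unchanged,
    while an always-false condition should change it."""
    return true_case == baseline and false_case != baseline


def suggests_time_delay(baseline_secs, probe_secs, injected_delay: float) -> bool:
    """Time-based: every probe must exceed the slowest baseline response
    by most of the delay the payload asked for (0.8 is an arbitrary margin)."""
    threshold = max(baseline_secs) + 0.8 * injected_delay
    return min(probe_secs) >= threshold
```

Requiring every timed probe to be slow, rather than just one, is a cheap way to avoid blaming a payload for ordinary network jitter.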
Injection vector refers to the areas to which a scanner applies security tests. The most obvious injection points are query string parameters and visible form fields. The web browser should at least be considered an untrusted source if not an adversarial one. Consequently, a web scanner should be able to run security checks via all aspects of HTTP communication including:
  • URI parameters
  • POST data
  • Client-side headers, especially the ones more commonly interpreted by web sites including User-Agent, the-forever-misspelled Referer, and Cookie.
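
As a rough sketch of what enumerating those vectors looks like, the function below lists every place in a single request where a payload could be substituted. The name injection_points, the example.test host, and the decision to test only three headers are assumptions made for illustration; header lookup here is also case-sensitive, which a real implementation would not be.

```python
from typing import Optional
from urllib.parse import parse_qsl, urlsplit

# Headers named in the list above; which others to test is a judgment call.
HEADERS_TO_TEST = ("User-Agent", "Referer", "Cookie")


def injection_points(url: str, post_body: str = "",
                     headers: Optional[dict] = None) -> list:
    """List (location, name) pairs where a payload could be substituted."""
    points = []
    # URI parameters, including ones with empty values.
    for name, _ in parse_qsl(urlsplit(url).query, keep_blank_values=True):
        points.append(("query", name))
    # Form-encoded POST data.
    for name, _ in parse_qsl(post_body, keep_blank_values=True):
        points.append(("post", name))
    # Client-side headers that the request actually carries.
    for name in HEADERS_TO_TEST:
        if headers and name in headers:
            points.append(("header", name))
    return points


if __name__ == "__main__":
    print(injection_points(
        "http://example.test/search?q=shoes&page=2",
        "user=mike&token=",
        {"User-Agent": "sketch/0.1", "Cookie": "sid=1"},
    ))
```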
Errors in a scanner’s results take away from the time-saving gains of automation by requiring users to dig into non-existent vulnerabilities or to spend too much time repeating the scanner’s tests in order to satisfy themselves that certain vulnerabilities do not exist.
False positives indicate insufficient analysis of a potential vulnerability. The cause of a false positive can be hard to discern without intimate knowledge of the scanner’s internals, but often falls into one of these categories:
  • Misdiagnosis of generic error page or landing page
  • Poor test implementation that misinterprets correlated events to infer cause from effect (e.g. changing a profile page’s parameter value from Mike to Mary to view another user’s public information is not a case of account impersonation – the web site intentionally displays the content)
  • Sole reliance on inadequate test signature to claim the vulnerability exists (e.g. a poor regex or stating an HTTP 200 response code for ../../../../etc/passwd indicates the password file is accessible)
  • Web application goes into error state due to load (e.g. database error occurs because the server has become overloaded by scan traffic, not because a double quote character was injected into a parameter)
  • Lack of security impact (e.g. an unauthenticated, anonymous search form is vulnerable to CSRF – search engines like Google, Yahoo!, and Bing are technically vulnerable but the security relevance is questionable)
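
The inadequate-signature case is easy to show in code. The sketch below contrasts a check that trusts the HTTP 200 status with one that also requires the response to look like a passwd file. Both function names and the regular expression are assumptions for illustration, and the stricter pattern only covers the common seven-field format with a root entry.

```python
import re

# A passwd entry has seven colon-separated fields; root normally has UID and GID 0.
PASSWD_LINE = re.compile(
    r"^root:[^:\n]*:0:0:[^:\n]*:[^:\n]*:[^:\n]*$", re.MULTILINE
)


def naive_check(status: int, body: str) -> bool:
    """Flags any HTTP 200, so a friendly error page becomes a false positive."""
    return status == 200


def stricter_check(status: int, body: str) -> bool:
    """Also requires content that looks like the file the payload asked for."""
    return status == 200 and PASSWD_LINE.search(body) is not None


if __name__ == "__main__":
    error_page = "<html><body>Sorry, page not found</body></html>"
    print(naive_check(200, error_page), stricter_check(200, error_page))
```

A site that answers every unknown URL with a 200 and a branded "not found" page will trip the first check on every request and the second on none.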
The effort expended to invalidate an erroneous vulnerability wastes time better spent investigating and verifying actual vulnerabilities. False positives also reduce trust in the scanner.
False negatives expose a more worrisome aspect of the scanner because the web site owner may gain a false sense of security by assuming, incorrectly, that a report with no vulnerabilities implies the site is fully secure. Several situations lead to a missed vulnerability:
  • Lack of test. The scanner simply does not try to identify the particular type of vulnerability.
  • Poor test implementation that too strictly defines the vulnerability (e.g. XSS tests that always contain <script> or javascript: under the mistaken assumption that those are required to exploit an XSS vuln)
  • Inadequate signatures (e.g. the scanner does not recognize SQL errors generated by Oracle)
  • Insufficient replay of requests (e.g. a form submission requires a valid e-mail address in one field in order to exploit an XSS vulnerability in another field)
  • Inability to automate (e.g. the vulnerability is related to a process that requires understanding of a sequence of steps, knowledge of the site’s business logic). The topic of vulnerabilities for which scanners cannot test (or have great difficulty testing) will be addressed separately.
  • Lack of authentication state (e.g. the scanner is able to authenticate at the beginning of the scan, but unknowingly loses its state, perhaps by hitting a logout link, and does not attempt to restore authentication)
  • Link not discovered by the scanner. This falls under the broader scope of site coverage, which will be addressed separately.
The optimistic aspect of false negatives is that a scanner’s test repository can always grow. In this case a good metric is determining the ease with which false negatives are addressed.
Accuracy is an important aspect of a web scanner. Inadequate tests might make a scanner more cumbersome to use than a simple collection of tests scripted in Perl or Python. Too many false positives reduce the user’s confidence in the scanner and waste valuable time on items that should never have been identified. False negatives may or may not be a problem depending on how the web site’s owners rely on the scanner and whether the missed vulnerabilities are due to lack of tests or poor methodology within the scanner.
One aspect not addressed here is measuring how accuracy scales against larger web sites. A scanner might be able to effectively scan a hundred-link test application, but suffer in the face of a complex site with various technologies, error patterns, and behaviors.
Finally, accuracy is only one measure of the utility of a web application scanner. Future essays will address other topics such as site coverage, efficiency, and usability.

Observations on Larry Suto’s Paper about Web Application Security Scanners

Note: I’m the lead developer for the Web Application Scanning service at Qualys and I worked at NTO for about three years from July 2003; both tools were included in this February 2010 report by Larry Suto. Nevertheless, I most humbly assure you that I am the world’s foremost authority on my opinion, however biased it may be.

The February 2010 report, Analyzing the Accuracy and Time Costs of Web Application Security Scanners, once again generated heated discussion about web application security scanners. (A similar report was previously published in October 2007.) The new report addressed some criticisms of the 2007 version and included more scanners and more transparent targets. The 2010 version, along with some strong reactions to it, raises some additional questions on the topic of scanners in general:

How much should the ability of the user affect the accuracy of a scan?

Set aside basic requirements to know what a link is or whether a site requires credentials to be scanned in a useful manner. Should a scan result be significantly more accurate or comprehensive for a user who has several years of web security experience than for someone who just picked up a book in order to have spare paper for origami practice?

I’ll quickly concede the importance of in-depth manual security testing of web applications as well as the fact that it cannot be replaced by automated scanning. (That is, in-depth manual testing can’t be replaced; unsophisticated tests or inexperienced testers are another matter.) Tools that aid the manual process have an important place, but how much disparity should there really be between “out of the box” and “well-tuned” scans? The difference should be as little as possible, with clear exceptions for complicated sequences, hidden links, or complex authentication schemes. Tools that require too much time to configure, maintain, and interpret don’t scale well for efforts that require scanning more than a handful of sites at a time. Tools whose accuracy correlates to the user’s web security knowledge scale at the rate of finding smart users, not at the rate of deploying software.

What’s a representative web application?

A default installation of osCommerce or Amazon? An old version of phpBB or the WoW forums? Web sites have wildly different coding styles, design patterns, underlying technologies, and levels of sophistication. A scanner that works well against a few dozen links might grind to a halt against a few thousand.

Accuracy against a few dozen hand-crafted links doesn’t necessarily scale against more complicated sites. Then there are web sites — in production and earning money no less — with bizarre and inefficient designs such as 60KB .NET VIEWSTATE fields or forms with seven dozen fields. A good test should include observations on a scanner’s performance at both ends of the spectrum.

Isn’t gaming a vendor-created web site redundant?

A post on the Acunetix blog accuses NTO of gaming the Acunetix test site based on a Referer field from web server log entries. First, there’s no indication that the particular scan cited was the one used in the comparison; the accusation has very flimsy support. Second, vendor-developed test sites are designed for the very purpose of showing off the web scanner. It’s a fair assumption that Acunetix created their test sites to highlight their scanner in the most positive manner possible, just as HP, IBM, Cenzic, and other web scanner vendors would (or should) do for their own products. There’s nothing wrong with ensuring a scanner (the vendor’s or any other’s) performs most effectively against a web site offered for no other purpose than to show off the scanner’s capabilities.

This point really highlights one of the drawbacks of using vendor-oriented sites for comparison. Your web site probably doesn’t have the contrived HTML, forms, and vulnerabilities of a vendor-created intentionally-vulnerable site. Nor is it necessarily helpful that a scanner proves it can find vulnerabilities in a well-controlled scenario. Vendor sites help demonstrate the scanner, they provide a point of reference for discussing capabilities with potential customers, and they support marketing efforts. You probably care how the scanner fares against your site, not the vendor’s.

What about the time cost of scaling scans?

The report included a metric that attempted to demonstrate the investment of resources necessary to train a scanner. This is useful for users who need tools to aid in manual security testing or users who have only a few web sites to evaluate.

Yet what about environments where there are dozens, hundreds, or — yes, it’s possible — thousands of web sites to secure within an organization? The very requirement of training a scanner to deal with authentication or crawling works against running scans at a large scale. This is why point-and-shoot comparison should be a valid metric. (In opposition to at least one other opinion.)

Scaling scans doesn’t just require time to train a tool. It also requires hardware resources to manage configurations, run scans, and centralize reporting. This is a point where Software as a Service begins to seriously outpace other solutions.

Where’s the analysis of normalized data?

I mentioned previously that point-and-shoot should be one component of scanner comparison, but it shouldn’t be the only point — especially for tools intended to provide some degree of customization, whether it simply be authenticating to the web application or something more complex.

Data should be normalized not only within vulnerabilities (e.g. comparing reflected and persistent XSS separately, comparing error-based SQL injection separately from inference-based detections), but also within the type of scan. Results without authentication shouldn’t be compared to results with authentication. Another step would be to compare the false positive/negative rates for tests scanners actually perform rather than for checks a tool doesn’t perform. It’s important to note where a tool does or does not perform a check versus other scanners, but not performing a check reflects differently on accuracy than performing a check that still fails to identify a vulnerability.
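
One way to picture that normalization is a per-category tally that leaves untested categories out of the rates entirely. This is only a sketch: the CheckResult structure, the category names, and every count in the usage example are invented for illustration and do not come from the report. The "share" values are the fraction of reported findings that were wrong and the fraction of known vulnerabilities that were missed.

```python
from dataclasses import dataclass


@dataclass
class CheckResult:
    performed: bool        # did the scanner attempt this check at all?
    true_positives: int
    false_positives: int
    false_negatives: int


def normalized_rates(results: dict) -> dict:
    """Compute per-category rates, skipping checks the scanner never performed."""
    rates = {}
    for category, r in results.items():
        if not r.performed:
            continue  # report separately as "not tested", not as a miss
        reported = r.true_positives + r.false_positives
        actual = r.true_positives + r.false_negatives
        rates[category] = {
            # Share of reported findings that turned out to be wrong.
            "false_positive_share": r.false_positives / reported if reported else 0.0,
            # Share of known vulnerabilities the check failed to find.
            "false_negative_share": r.false_negatives / actual if actual else 0.0,
        }
    return rates


if __name__ == "__main__":
    made_up = {
        "reflected_xss": CheckResult(True, 8, 2, 2),
        "persistent_xss": CheckResult(False, 0, 0, 5),
    }
    print(normalized_rates(made_up))
```

Keeping "not tested" out of the denominator separates a coverage gap from an accuracy problem, which is the distinction the paragraph above argues for.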

What’s really going on here?

Designing a web application scanner is easy, implementing one is hard. Web security has complex problems, many of which have different levels of importance, relevance, and even nomenclature. The OWASP Top 10 project continues to refine its list by bringing more coherence to the difference between attacks, countermeasures, and vulnerabilities. The WASC-TC aims for a more comprehensive list defined by attacks and weaknesses. Contrasting the two approaches highlights different methodologies for testing web sites and evaluating their security.

So, if performing a comprehensive security review of a web site is already hard, then it’s likely to have a transitive effect on comparing scanners. Comparisons are useful and provide a service to potential customers, who want to find the best scanner for their environment, and useful to vendors, who want to create the best scanner for any environment. The report demonstrates areas not only where scanners need to improve, but where evaluation methodologies need to improve. Over time both of these aspects should evolve in a positive direction.

Earliest(-ish) hack against web-based e-mail

The book starts off with a discussion of cross-site scripting (XSS) attacks along with examples from 2009 that illustrate the simplicity of these attacks and the significant impact they can have. What’s astonishing is how little many of the attacks have changed. Consider the following example, over a decade old, of HTML injection before terms like XSS became so ubiquitous. The exploit appeared about two years before the blanket CERT advisory that called attention to insecurity of unchecked HTML.

On August 24, 1998 a Canadian web developer, Tom Cervenka, posted a message to the comp.lang.javascript newsgroup that claimed

We have just found a serious security hole in Microsoft’s Hotmail service (http://www.hotmail.com/) which allows malicious users to easily steal the passwords of Hotmail users. The exploit involves sending an e-mail message that contains embedded javascript code. When a Hotmail user views the message, the javascript code forces the user to re-login to Hotmail. In doing so, the victim’s username and password is sent to the malicious user by e-mail.

The discoverers, in apparent ignorance of the 1990s labeling requirements for hacks to include foul language or numeric characters, simply dubbed it the “Hot”Mail Exploit. (They demonstrated further lack of familiarity with disclosure methodologies by omitting greetz, lacking typos, and failing to remind the reader of near-omnipotent skills, surely an anomaly at the time. The hacker did not fail on all aspects. He satisfied the Axiom of Hacking Culture by choosing a name, Blue Adept, that referenced pop culture, in this case the title of a fantasy novel by Piers Anthony.)

The attack required two steps. First, they set up a page on Geocities (a hosting service for web pages distinguished by being free before free was co-opted by the Web 2.0 fad) that spoofed Hotmail’s login.

The attack wasn’t particularly sophisticated, but it didn’t need to be. The login form collected the victim’s login name and password then mailed them, along with the victim’s IP address, to the newly-created Geocities account.

The second step involved executing the actual exploit against Hotmail by sending an e-mail with HTML that contained a rather curious img tag:

<img src="javascript:errurl='http://www.because-we-can.com/users/anon/hotmail/getmsg.htm';

The JavaScript changed the browser’s DOM such that any click would take the victim to the spoofed login page at which point the authentication credentials would be coaxed from the unwitting visitor. The original payload didn’t bother to obfuscate the JavaScript inside the src attribute. Modern attacks have more sophisticated obfuscation techniques and use tags other than the img element. The problem of HTML injection, although well known for over 10 years, remains a significant attack against web applications.
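
For a sense of how little it takes to spot the unobfuscated form of this payload, here is a minimal sketch that parses a message body and flags any attribute whose value starts with the javascript: scheme. The class and function names are hypothetical, the sample markup in the usage example is mine rather than the original exploit, and a check this naive would be defeated by the obfuscation techniques mentioned above.

```python
from html.parser import HTMLParser


class ScriptUriFinder(HTMLParser):
    """Collects (tag, attribute) pairs whose value uses the javascript: scheme."""

    def __init__(self):
        super().__init__()
        self.findings = []

    def handle_starttag(self, tag, attrs):
        for name, value in attrs:
            # Attributes without a value arrive as None.
            if value and value.strip().lower().startswith("javascript:"):
                self.findings.append((tag, name))


def find_script_uris(markup: str) -> list:
    finder = ScriptUriFinder()
    finder.feed(markup)
    finder.close()
    return finder.findings


if __name__ == "__main__":
    print(find_script_uris('<p>Hello</p><img src="javascript:alert(1)">'))
```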

Factor of Ultimate Doom

Vulnerability disclosure presents a complex challenge to the information security community. A reductionist explanation of disclosure arguments need only present two claims. One end of the spectrum goes, “Only the vendor need know so no one else knows the problem exists, which means no one can exploit it.” The information-wants-to-be-free diametric opposition simply states, “Tell everyone as soon as the vulnerability is discovered”.

The Factor of Ultimate Doom (FUD) is a step towards reconciling this spectrum into a laser-focused compromise of agreement. It establishes a metric for evaluating the absolute danger inherent to a vulnerability, thus providing the discoverer with guidance on how to reveal the vulnerability.

The Factor is calculated by simple addition across three axes: Resources Expected, Protocol Affected, and Overall Impact. Vulnerabilities that do not meet any of the Factor’s criteria may be classified under the Statistically Irrelevant Concern metric, which will be explored at a later date.

Resources Expected
(3) Exploit doesn’t require shellcode; merely a JavaScript alert() call
(2) Exploit shellcode requires fewer than 12 bytes. In other words, it must be more efficient than the export PS1=# hack (to which many operating systems, including OS X, remain vulnerable)
(1) Exploit shellcode requires a GROSS sled. (A GROSS sled uses opcode 144 on Intel x86 processors, whereas the more well-known NOP sled uses opcode 0x90.)

Protocol Affected
(3) The Common Porn Interchange Protocol (TCP/IP)
(2) Multiple online rhetorical opinion networks
(1) Social networks

Overall Impact
(3) Control every computer on the planet
(2) Destroy every computer on the planet
(1) Destroy another planet (obviously, the Earth’s internet would not be affected — making this a minor concern)

The resulting value is measured against an Audience Rating to determine how the vulnerability should be disclosed. This provides a methodology for verifying that a vulnerability was responsibly disclosed.

Audience Rating (by Factor of Ultimate Doom*)
(> 6) Can only be revealed at a security conference
(< 6) Cannot be revealed at a security conference
(< 0) Doesn’t have to be revealed; it’s just that dangerous

(*Due to undisclosed software patent litigation, values equal to 6 are ignored.)


So, I was asked to comment about clickjacking today. Technically, it isn’t a new vulnerability (IE6 fixed a variant in 2004, Firefox fixed a variant in September 2008), but a refinement of previous exploits, ennobled with a catchier name. It gained widespread coverage in October 2008 prior to the OWASP NYC conference when Jeremiah Grossman and Robert Hansen first said they would describe the vulnerability, then cancelled their talk for fear of unleashing Yet Another Exploit of Ultimate Doom.* The updated technique combines devious DOM manipulation with well-established attack patterns to make a respectable type of attack.

I still hope that this doesn’t make it into the OWASP Top 10. (I’ll explain why elsewhere.)
Anyway, in the interest of further polluting the internet with opinionated cruft, here’s more information about clickjacking:
Clickjacking tricks a user into clicking on an attacker-supplied page while the user only sees the appearance and effect of clicking on a plain link. The attacker identifies an area in the target HTML that should receive the click event. This HTML is placed within an IFRAME such that the X and Y offsets of the frame place the target area in the upper left-hand corner of the frame’s visible area. This target IFRAME is visually hidden from the user (though the element remains part of the DOM). Then, the IFRAME is set within a second page (the content of which doesn’t matter) beneath the mouse cursor and, very importantly, dynamically moves to always be underneath the mouse. Then, when the user clicks somewhere within the second page the click is actually sent to the target area even though it appears to the user that the mouse is only above some innocuous link.
Essentially, an attacker chooses some web page that, if the victim clicked some point (link, button, etc.) on that page, would produce some benefit to the attacker (e.g. generate click-fraud revenue, change a security setting, etc.). Next, the attacker takes the target page and places a second, innocuous page over it. The trick is to get the victim to make a mouse click on what appears to be the innocuous page, but is actually an invisible element of the target page that has been automatically, but invisibly, placed beneath the cursor.
The attack relies on luring a user to a server under the attacker’s control or a site that has been compromised by the attacker. Web site owners who ensure their site is free of cross-site scripting or other vulnerabilities can prevent their sites from being used as a relay point for the attacker. Yet other successful attacks, such as phishing, also rely on luring users to a server under the attacker’s control. The relative success of phishing implies that just securing web applications at the server isn’t the only solution because users can be tricked into visiting malicious web sites.
The core of the attack occurs in the browser, which is where the real fix needs to appear. The problem is that browsers are intended to handle HTML from many sources and provide mechanisms to manipulate the location and visibility of elements within a web page. Consequently, any solution would have to block this attack while not inhibiting legitimate uses of this functionality.
*Yes, I made you scroll all the way down here to get the link. How evil is that?

The Internet is dead! Long live the Internet!

In 1998, L0pht claimed before Congress that in under 30 minutes their seven-member group could make online porn and Trek fan sites unusable for several days. (That’s all that existed on the Internet in 1998.) In February 2002 an SNMP vulnerability threatened the very fabric of space and time (at least as it related to porn and Trek fan sites; if you still don’t believe me, consider that Google added Klingon language support the same month). More recently, a DNS vulnerability was (somewhat re-)discovered that could enable attackers to redirect traffic going to sites like google.com and wikipedia.com to sites that served porn, even though many people wouldn’t notice the difference. (Dan Kaminsky compiled a list of other apocalyptic vulnerabilities similar to the issues that plagued DNS.)

This year at the OWASP NYC AppSec 2008 Conference Jeremiah Grossman and Robert “RSnake” Hansen shared another vulnerability, clickjacking, in the Voldemort “He Who Must Not Be Named” style. In other words, yet another eschatonic vulnerability existed, but its details could not be shared. This disclosure method continued the trend from Black Hat 2008 prior to which the media and security discussion lists talked about the secretly-held, unsecretly-guessed DNS vulnerability information with the speculation usually retained for important things like when Gn’Fn’R would finally release Chinese Democracy. [If you don’t care about gory details of the disclosure drama and just want to skim the abattoir, then read this summary.]

Yet none of these doom-laden vulnerabilities has caused the Internet to go pfft like a certain parrot that need not be named.

Until now.
I’ve discovered a web-based vulnerability that can be trivially exploited called Cross-Hype Attack Forgery Exploit (CHAFE). It affects all web browsers and can’t be patched (nor will you be protected by Firefox’s NoScript or using lynx). In fact, if you’re reading this entry then I guarantee you can be vulnerable to it. Public release of the details would be self-defeating, but I’m willing to sell the details to the highest bidder, as well as anyone else who wants to pay for the information. To ensure the validity of this vulnerability, consider that it has both “cross” and “forgery” in the name. So, it clearly has a working exploit associated with it. No peer review is necessary to establish the vulnerability’s credibility. To build further confidence, I’ll hint that the vulnerability builds on prior research, but who really cares about dusty problems from 1991 when you can have a working exploit in 2008?
Since I haven’t gotten around to creating a PayPal account yet (although a reminder to update my account information just arrived in my InBox a few moments ago), send an e-mail to chafe@hackculture.com if you’re interested in the details and you have some money from which you’d like to be departed.