Web scanner evaluations collect metrics by comparing scan results against a (typically far too small) field of test sites. One quick way to build the test field might be to collect intentionally vulnerable sites from the Web. That approach, though fast, does a disservice to the scanners and more importantly the real web applications that need to be scanned. After all, does your web application really look like WebGoat or the latest HacmeFoo? Does it even resemble an Open Source application like WordPress or phpBB?
The choice of targets strongly influences results, not necessarily in terms of a scanner’s perceived accuracy or capabilities, but in whether those properties will translate to a completely different[1] site. This doesn’t imply that a scanner that fails miserably against a test site will miraculously work against your own. There should always be a baseline that establishes confidence, however meager, in a scanner’s capabilities. However, success against one of those sites doesn’t ensure equal performance against a site with completely different workflows and design patterns.
Peruse Alexa’s top 500 web sites for a moment. They differ not only in category — adult and arts, science and shopping — but in technology and design patterns. Category influences the types of workflows that might be present. In terms of interaction, a shopping site looks and works differently from a news site. A search portal works differently than an auction site. I, of course, don’t know how the adult sites work, other than they make lots of money and are often laced with malware (even tame ones). Dealing with workflows in general creates problems for web scanners.
Remember that scanners have no real understanding of a site’s purpose, nor visibility into its source code. Scan reports provide identification of, not insight into, a vulnerability. For example, a cross-site scripting bug might be due to a developer who neglected the site’s coding guidelines and used an insecure print() function rather than a centralized, well-tested print_safe() version. Or perhaps the site has no centralized library for displaying user-supplied data in a secure manner. The first case is a mistake in a single page; the second points to a fundamental problem that won’t disappear after fixing one page. Identifying underlying security problems remains the purview of manual testing and analysis (along with great hourly rates).
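To make the distinction concrete, here is a minimal Python sketch. The function names render_unsafe and render_safe are hypothetical, and print_safe stands in for the centralized helper mentioned above; building it on the standard library's html.escape is an assumption about how such a helper might work, not a description of any particular site's code.

```python
import html

def print_safe(user_input: str) -> str:
    """Hypothetical centralized helper: escape HTML metacharacters before output."""
    return html.escape(user_input, quote=True)

def render_unsafe(user_input: str) -> str:
    """Stand-in for the insecure print(): echoes user input verbatim."""
    return "<p>Hello, " + user_input + "</p>"

def render_safe(user_input: str) -> str:
    """Same page, but routed through the centralized helper."""
    return "<p>Hello, " + print_safe(user_input) + "</p>"

payload = "<script>alert(1)</script>"
unsafe_page = render_unsafe(payload)  # script tag survives intact
safe_page = render_safe(payload)      # angle brackets become &lt; and &gt;
```

A scanner can flag the page built by render_unsafe, but the report alone can't say whether print_safe exists and was skipped on this one page, or whether the site has no such helper at all.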
A web application scanner should be agnostic to the mishmash of acronyms and technologies that underpin a site. As long as the server communicates via HTTP and throws up (in varying senses) HTML, then the scanner should be able to look for vulnerabilities. Neither HTTP nor HTML changes in any appreciable way whether a site uses PHP-On-Dot-Rails or some other language du jour. If pages render in a web browser, then it’s a web site. Such is the ideal world of scanners and unicorns.
How well does the scanner scale? If it can deal with a site that has 1,000 links, what happens when it hits 10,000? It might melt LAN ports scanning a site one network hop away that serves pages with an average size of 1KB, only to blow up against a site with a network latency of a few hundred milliseconds and page sizes averaging a few hundred kilobytes.
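A back-of-the-envelope sketch shows how quickly those factors compound. The model below is a deliberate simplification and every number in it is an assumption: one request per link, requests issued one at a time, and a fixed bandwidth of 1,000 KB/s. Real scanners send many requests per link and usually run several in parallel, so treat this as an illustration of the trend rather than a prediction.

```python
def estimate_crawl_seconds(links: int, latency_s: float,
                           page_kb: float, bandwidth_kb_s: float) -> float:
    """Rough time to fetch every link once, sequentially.

    Each request is modeled as round-trip latency plus transfer time.
    """
    per_request = latency_s + page_kb / bandwidth_kb_s
    return links * per_request

# Small, nearby site: 1,000 links, 1 ms latency, 1 KB pages.
nearby = estimate_crawl_seconds(1_000, 0.001, 1, 1_000)

# Large, distant site: 10,000 links, 300 ms latency, 300 KB pages.
distant = estimate_crawl_seconds(10_000, 0.3, 300, 1_000)

print(f"nearby: about {nearby:.0f} s, distant: about {distant:.0f} s")
```

Under these assumptions the first crawl finishes in a couple of seconds while the second takes on the order of 100 minutes, and that is before any attack payloads are sent.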
Scalability also speaks to the time and resources necessary to scan large numbers of sites. This point won’t concern users who are only dealing with a single property, but some organizations have to deal with dozens, hundreds, or possibly a few thousand web properties. Manual approaches, although important for in-depth assessments, do not scale. For its part, automation needs to enhance scalability, not create new hindrances. This applies to managing scan results over time as well as managing the dozens (or more!) of scans necessary. If a scanner requires a high-end desktop system to scan a single site in a few hours, then simple math tells you how long it will take to test N sites that each require M hours on average to complete. You could parallelize the scans, but buying and maintaining more systems incurs additional costs for hardware, software, and maintenance.
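The simple math can be written down directly. This sketch assumes each scanning system handles one site at a time and that every site takes the average M hours; the function name and the example figures (200 sites, 3 hours each) are illustrative, not measurements.

```python
import math

def total_scan_hours(n_sites: int, avg_hours_per_site: float,
                     systems: int = 1) -> float:
    """Wall-clock hours to scan n_sites, one site at a time per system."""
    if systems < 1:
        raise ValueError("need at least one scanning system")
    # Each system works through its share of the sites sequentially.
    rounds = math.ceil(n_sites / systems)
    return rounds * avg_hours_per_site

single = total_scan_hours(200, 3.0)         # one system: 600 hours
parallel = total_scan_hours(200, 3.0, 10)   # ten systems: 60 hours
print(single, parallel)
```

Ten systems cut the wall-clock time by a factor of ten, but the total machine-hours stay the same, and now there are ten systems to buy and maintain.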
A test site (preferably plural) must provide areas where scanners can be evaluated for accuracy. A good field of test sites includes applications that exercise different aspects of web security scanning, especially those most relevant to the types of sites to be secured. Web sites require different levels of emphasis on:
- Customized error handling
- Large numbers of links, including highly redundant areas like product pages
- Large page sizes
- Varying degrees of server responsiveness (there’s a big difference between scanning a site supported by a load-balanced web farm and bludgeoning a Mac Mini with hundreds of requests per second)
A good web security scanner adapts to the peculiarities of its targets, from mistyped markup that browsers silently fix to complex pages rife with bizarre coding styles. HTML may be an established web standard, but few sites follow standard ways of creating and using it. Not only does this challenge web scanner developers, it complicates the process of in-depth scanner comparisons. No example site with a few dozen links will ever be representative or fully expose the pros and cons of a scanner.
[1] At this point we’ve passed the Twitterati’s mystical 140-character limit four times over. So, for those who choose TLDR as a way of life, stop reading and go enjoy something completely different.