The Futility of Web Pen Testing

I previously lamented the death of web scanners1 so it’s only fair to turn this nihilistic gaze to the futility of manual web security testing. This isn’t to say it’s impossible to perform a comprehensive review of a web app. The problem lies in repeating that review after developers modify the app. Or the problem of obtaining consistent results from different testers. Or if we continue to look for problems: Repeating an in-depth review of one app across the few thousand that might exist in an organization.

It’s not controversial to state that web apps need to be tested. The trick is trying to keep up with the pace of development for a single web app, compounded by the pace at which new apps arise. Security testers must deal with the fear that a once-secure app might be crippled by the introduction of a new feature or a change that inadvertently breaks an old one. With this in mind, the fundamental challenges of manual pen testing are managing the effort required to maintain one site’s security and scaling that effort to thousands.

If we look at recent compromises, and accept reasoning from an anecdotal basis, we see simple attacks succeeding against huge, well-known web sites. It’s not like Google, Twitter, Facebook, and similar don’t understand security, don’t have budgets for security testing, or don’t have people testing their apps. Google actively encourages ethical testing against several of its properties.2 This nod to explicit permission to find vulns isn’t a concession to the impossibility of writing secure code; it’s a nod towards the difficulty of scale in manually testing large, complex web applications. It’s also interesting to see the vulns considered worthy of reward versus those deemed cruft or a “vector for petty mischief”.3 This highlights the minefield within the subjectivity of risk when terms like risk, attack, threat, and impact are ambiguously defined or too-broadly applied.

Consider the Sony Playstation Network (PSN) breach4 that compromised user data “by hacking into an application server behind a Web server and two firewalls”.5 This attack6 is a topical example of the impact of (what seems to be7) straightforward web vulns like SQL injection. One could wonder whether the vulnerable web site ever received security testing. If so, the quality of the test must be called into question, especially if the alleged attack vector was so simple. On the other hand, we could ruminate on reasons why the site wasn’t tested. Missed and forgotten because the site was so old? Assumed it had been tested before? Too many other sites considered more important needed to be tested first? We could go on, but the point of the exercise is to express the difficulty of maintaining web security, not second-guessing situations we know too little about.

The presence of a vulnerability is usually indisputable. On the other hand, its risk or exploitation impact often falls to debate. SQL injection is a clear example of a vuln that should be addressed immediately rather than arguing over whether the vuln exposes an empty database or encrypted credit card numbers. SQL injection flaws are due to fundamentally improper programming. Even if an exploit is questionable, there’s bad code sitting on the server that should be fixed.
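To make the “bad code sitting on the server” concrete, here’s a minimal sketch (in Python with sqlite3; the table, function names, and payload are invented for illustration, not taken from any real app) of the same query written vulnerably and then fixed with a bound parameter:

```python
import sqlite3

# Hypothetical in-memory database standing in for an app's backend.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE users (name TEXT, secret TEXT)")
conn.execute("INSERT INTO users VALUES ('alice', 's3cret')")

def find_user_unsafe(name):
    # Vulnerable: attacker-controlled input is concatenated into the SQL.
    return conn.execute(
        "SELECT name FROM users WHERE name = '%s'" % name).fetchall()

def find_user_safe(name):
    # Fixed: a bound parameter keeps input out of the SQL grammar entirely.
    return conn.execute(
        "SELECT name FROM users WHERE name = ?", (name,)).fetchall()

payload = "' OR '1'='1"
print(find_user_unsafe(payload))  # the classic tautology returns every row
print(find_user_safe(payload))    # the payload matches no user; empty result
```

Whether the table holds nothing or card numbers, the string-concatenated version is the defect; the parameterized version is the fix, and no risk debate is needed to justify it.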

CSRF is a different matter, especially depending on how it manifests. Arguments over risk ratings too often devolve into opinions biased by imaginative threats rather than focus on the situation. (For example, mistakenly considering a CSRF countermeasure broken because it can be bypassed by XSS or incorrectly assuming CSRF tokens prevent sniffing attacks.)
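As a hedged illustration of what a CSRF token does and does not defend against, the sketch below (the session store and function names are hypothetical) checks a per-session token on form submission. Note that an XSS bug on the same origin could still read the token, and the token does nothing against network sniffing, which is exactly the kind of confusion mentioned above:

```python
import hmac
import secrets

# Hypothetical session store: session id -> issued CSRF token.
sessions = {}

def issue_token(session_id):
    # Embed this token in each form served to the session.
    token = secrets.token_hex(16)
    sessions[session_id] = token
    return token

def verify_token(session_id, submitted):
    # Constant-time compare; a cross-site forged POST can't supply
    # the token because the attacking page can't read our forms.
    expected = sessions.get(session_id, "")
    return hmac.compare_digest(expected, submitted)

sid = "abc123"
tok = issue_token(sid)
print(verify_token(sid, tok))        # legitimate submission: True
print(verify_token(sid, "guessed"))  # forged request lacking the token: False
```

The countermeasure isn’t “broken” because XSS can bypass it; it was never meant to defend against same-origin script execution in the first place.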

The details of web vulns differ enough that it’s hard, and perhaps unwise, to assign each a static risk. Efforts like CVSS8 try to bring consistency and terminology to describing software vulns, but the scoring system seems infrequently applied within web security. Fortunately, the risk calculations can be avoided without great loss if you treat vulns like the software defects they are. Assign priorities using methodologies already familiar to your dev team. And if the effort required to fix a bug is greater than the effort required to prove it’s really, really, truly a problem, then have a discussion about impact and risk.

Manual testing has a high degree of variance in quality and coverage. I don’t make these statements as if I were blameless myself. In the past I’ve written a chapter or two that omitted an attack variant or didn’t highlight something well enough. These are things I’ve tried to address in revised editions and on this web site. My point is that manual tests have an unavoidable bias in focus that depends on the person conducting them. Rather than passing judgement on quality, this is more of a judgement on coverage and that inevitable human factor of making mistakes. (After all, lots of vulns boil down to mistakes in code, albeit fairly consequential ones.)

CSRF arrived on the OWASP Top 10 in 2007. How many people were testing their web apps before that year? (To be fair, how many apps were being compromised that way? Especially when SQL injection remains(!) so much easier.) I like to pick on CSRF because the vuln is easily misunderstood and its purported impacts and countermeasures vary immensely.

There are two methodologies for addressing test coverage: blackbox (review the deployed app) and whitebox (review the app’s code). Blackbox testing rarely requires knowledge of the app’s underlying language, although cases like PHP show that prior knowledge can be helpful because configuration settings influence code execution. The same code may run securely on one server and be wide open under a different configuration. Blackbox testing is the easiest path because it requires nothing more than a browser to begin. The snag is that blackbox testing won’t necessarily find the dusty corners of the web site where an insecure link or form is hiding.

Then why not just step towards a code review where every corner can be checked? A pen tester could spend an hour investigating different XSS vectors against a web page, whereas reviewing the page’s source code could provide higher confidence about whether it’s secure. After all, not every security fix is securely fixed.9

The drawback is that whitebox testing reduces the population of testers capable of finding security problems. This exacerbates the problem of scale in matching testers to web apps. The tester needs to have good comprehension of the app’s programming language in addition to security concepts. Someone very good at reviewing Java may miss a vuln in C#. Knowledge of good software design translates well between languages. Yet the problem remains that too few people have too many sites to review.

As with the article on web scanner mortality, this one requires a caveat. A vocal group will hurl clichéd invectives, claiming this article asserts that manual testing is useless, never requires tools, or is a worthless endeavor. None of those claims were made here. In fact, manual testing must be part of any exhaustive security testing. Humans have the ability to analyze the design of a web site, where an automated scanner largely focuses on implementation problems. Humans also possess the creative thinking required to turn QA’s use cases into abuse and misuse cases that bypass security.

Web QA testing groups are likely already overloaded with the combinatorial craziness inherent to reviewing UIs. Yet this is a perfect step for identifying security problems alongside bugs in features. Tools like Selenium10 would be prime platforms for security testing. Yet how many times did a web pen test produce findings, provide a PDF, then leave? Why haven’t Selenium scripts (or anything similar) become a lingua franca of pen test results?

One of the biggest changes necessary to manual testing is translating the hands-on tests that a human performs into a script that someone else (or a tool) can repeat.11 Treating pen tests as snapshots in time misses the opportunity to build a repository of security knowledge. Rather than just manage vulns, a group could manage the techniques and scripts used to find such vulns. This not only enables a one-time security review to become repeatable, but also lets the quality of testing improve over time.
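A repeatable check of this kind can be sketched as a small script. Everything below is hypothetical (the probe string, the `reflects_unescaped` helper, and the toy render stubs are invented here), standing in for a test that a tool like Selenium would drive against a live app and that could live in a shared repository rather than a one-time PDF:

```python
import html

# Hypothetical probe: a recognizable marker that should never appear
# unescaped in a page that echoes user input.
XSS_PROBE = '"><script>alert(1)</script>'

def reflects_unescaped(render, field, probe=XSS_PROBE):
    """Repeatable check: submit the probe via `render` (any callable
    that returns the resulting HTML) and flag raw reflection."""
    page = render({field: probe})
    return probe in page  # unescaped reflection suggests XSS

# Two toy render functions standing in for a real app under test.
def vulnerable(params):
    return '<input value="%s">' % params["q"]

def escaped(params):
    return '<input value="%s">' % html.escape(params["q"], quote=True)

print(reflects_unescaped(vulnerable, "q"))  # True: probe came back raw
print(reflects_unescaped(escaped, "q"))     # False: output was encoded
```

The value isn’t in this particular check; it’s that the check is a script anyone can rerun after the next deploy, instead of a finding frozen in a report.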

The pessimism of this article and the previous one isn’t intended to be a capitulation to web vulnerabilities. By identifying the fundamental challenges to security testing it’s possible to start thinking of creative ways to solve them. It’s important to understand a problem well to avoid branching off into solutions that address false issues or have too narrow a focus. In future articles we’ll turn the tables on this bleak landscape and look at effective ways to apply automation and manual testing to web sites.








7 I’ve yet to find definitive explanations of the attack, so reserve a little skepticism for such reports.



10 Selenium uses JavaScript to define tests that can be driven by one of several different programming languages; however, it does not have an explicit security bent.

11 Dinis Cruz recognized this type of problem and started the O2 project. However, O2 focuses more on the tools to implement repeatability than on defining a grammar to describe vuln tests.

Published by Mike Shema

Mike works with product security and DevSecOps teams to build safer applications. He also writes about information security, with an infusion of references to music (80s), sci-fi (apocalyptic), and horror (spooky) to keep the topics entertaining. He hosts the Application Security Weekly podcast.
