Links rot and content drifts; we are all aware of that. But how can one actually detect link rot? Web servers do not always return a proper '404' HTTP status code when the requested page cannot be found. Often the replacement page that tells the user the requested page was not found is served with a '200' status code, signifying everything is OK. This is called a soft-404.
So we can't always rely on the HTTP status code to tell whether a page is available. Since the robustify.js website add-on depends on knowing whether a page can be found, I implemented an algorithm that attempts to detect these soft-404s.
I followed an approach that was suggested to me on Twitter. By sending the server a request with a random URL, we know that the page that comes back must be a '404'. With a technique known as fuzzy hashing, we can then compare this known '404' page with the requested page. If they are identical, or at least very similar, it is very likely that the requested page is also a '404'.
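The core of this check can be sketched as follows. This is an illustrative Python sketch, not the actual robustify.js code: a real deployment would use a true fuzzy hash such as ssdeep, but here the standard library's sequence similarity stands in for it, and the function names, the 0.9 threshold, and the random-URL helper are all assumptions of mine.

```python
import difflib
import uuid


def similarity(page_a: str, page_b: str) -> float:
    """Return a 0..1 similarity score between two HTML documents.

    A stand-in for a fuzzy-hash comparison such as ssdeep's compare().
    """
    return difflib.SequenceMatcher(None, page_a, page_b).ratio()


def random_url(base: str) -> str:
    """Build a URL on the server that almost certainly does not exist,
    so the server's response to it must be its '404' page."""
    return f"{base.rstrip('/')}/{uuid.uuid4().hex}"


def looks_like_soft_404(requested_page: str, known_404_page: str,
                        threshold: float = 0.9) -> bool:
    """Flag the requested page as a soft-404 if it is (nearly) identical
    to the page the server returned for the random, nonexistent URL."""
    return similarity(requested_page, known_404_page) >= threshold
```

Identical pages score 1.0 and are flagged; a genuinely different article page scores well below the threshold and passes.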
There is room for optimizing this algorithm. To start with, the required level of similarity can be tweaked. Robustify.js currently performs the soft-404 test only if the original request results in one or more redirects. However, soft-404s can also be generated at the original request URL, without redirection, so this approach misses some soft-404s. Further, if the random request actually does return a '404' status code, we may assume the server is configured well and skip the comparison.
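Putting those optimizations together, the decision flow could look roughly like this. This is a hypothetical sketch under my own assumptions (function name, threshold, and the use of a simple sequence ratio in place of fuzzy hashing), not the actual robustify.js logic:

```python
import difflib


def effective_status(final_status: int, probe_status: int,
                     page_body: str, probe_body: str,
                     threshold: float = 0.9) -> int:
    """Return 404 if the page is judged a (soft) 404, else the real status.

    probe_status/probe_body come from the request for a random,
    nonexistent URL on the same server.
    """
    if final_status == 404:
        # A hard 404: nothing more to do.
        return 404
    if probe_status == 404:
        # The server answers the random URL with a real 404, so it is
        # configured well; skip the comparison entirely.
        return final_status
    # Otherwise compare the requested page with the forced '404' page.
    ratio = difflib.SequenceMatcher(None, page_body, probe_body).ratio()
    return 404 if ratio >= threshold else final_status
```

Note that this sketch runs the comparison regardless of redirects, which would also catch the soft-404s served at the original request URL mentioned above.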
Additionally, we might try to 'read' the page and check whether it contains strings like 'error' or '404'. Such an approach is clearly less elegant than fuzzy hashing and would require more training (internationalization) and maintenance. Still, it might work well on top of the hashing approach when the similarity between the page and the forced '404' is indecisive.
Improving the soft-404 detection algorithm will necessarily require a lot of manual testing. Without perfect soft-404 detection there is no easy way to create a test set, and without a large enough test set we can't be very effective at improving the algorithm. Since soft-404 detection is valuable not just for robustify.js but also for many other applications, the Heritrix crawler being a fine example, I do hope that with some community effort we can further improve it. A first step might be for all you crawl engineers out there to send suspect crawl artefacts to the statuscode.php service (part of robustify.js) with soft-404 detection enabled and see if it recognizes them as soft-404s.
For example, the test result of http://www.trouw.nl/tr/nl/4324/Nieuws/archief/article/detail/1593578/2010/05/12/Een-hel-vol-rijstkoeken-en-insecten.dhtml is provided by the statuscode.php service, which shows at the end of the JSON output a 100% match between this page and the forced '404' (thus recording a '404' status code).
Without soft-404 detection the script gives a '200' status code.