Friday, September 1, 2017

Harvesting metadata from OAI-PMH repositories

OAI-PMH (Open Archives Initiative Protocol for Metadata Harvesting) is a commonly used protocol for harvesting metadata. Many institutions, like libraries, universities and archives, use OAI-PMH to offer public access to metadata records.
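An OAI-PMH request is just an HTTP GET with the verb and parameters in the query string; a minimal sketch of what such a request URL looks like (the base URL here is a placeholder, not a real repository endpoint):

```shell
# build a typical OAI-PMH ListRecords request URL
# (http://example.org/oai is a placeholder endpoint)
base='http://example.org/oai'
echo "${base}?verb=ListRecords&metadataPrefix=oai_dc"
```

Fetching that URL with curl or wget returns an XML document with a batch of records and, for large sets, a resumptionToken to request the next batch.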

To facilitate harvesting and processing metadata from OAI-PMH repositories I wrote a bash shell script that should run in any Unix-based environment, like Macs and Linux, and probably even on Windows 10 using the Linux subsystem. A special feature of this script, oai2linerec.sh, available through GitHub, is that it stores the harvested records in a single file. Often, a single file will be much easier and faster to process than thousands of separate files. And of course, storing the metadata in a database would make processing and analysing the data much more complex.

The trick of oai2linerec.sh is that it serializes each XML metadata record to a single line. Such a file can be processed with a few lines of bash like:


# set IFS to use only newlines as separator, store prev. IFS:
OLDIFS=$IFS
IFS=$'\n'

# now walk through each line of input.txt
for line in `cat input.txt` ; do
    # this example just sends the record to xmllint
    # to present it nicely formatted:
    echo $line | xmllint --format -
done

# restore IFS
IFS=$OLDIFS
To save space when harvesting big metadata repositories, oai2linerec.sh can optionally also compress each line (record) separately. Processing such a file is as easy as typing zcat instead of cat:
# now walk through each line of input.txt.gz
for line in `zcat input.txt.gz` ; do
    echo $line | xmllint --format -
done
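This per-record compression works because concatenated gzip streams are themselves a valid gzip stream, which zcat decompresses in sequence. A small demonstration (the record content and file name are made up):

```shell
# each record is gzipped separately and appended to the same file
printf '<record>1</record>\n' | gzip >> records.gz
printf '<record>2</record>\n' | gzip >> records.gz
# zcat yields the records back, in order, one per line
zcat records.gz
```

This is also what makes appending to the harvest file cheap: new records can be compressed and added without rewriting the existing archive.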
To further speed up the processing of the metadata, use tools like GNU parallel; have a look at A beginner's guide to processing 'lots of' data.

Friday, March 3, 2017

Fixing Dockerfile issues when running on macOS

I have become a great fan of Docker as an easy way to test software and to run various development environments on the Mac. However, not all Docker images that are available through Docker Hub / Kitematic work out of the box on the Mac. The issues I've come across generally have one of two causes:

  1. Images need to explicitly expose a port when running on the Mac.
  2. Exposed volumes don't always work as expected.
For both issues I've found a solution.

Exposing ports

As stated, Dockerfiles on macOS need to explicitly expose the ports they listen to. So the original Dockerfile MUST have a statement like

EXPOSE 30000
If it doesn't, you can fix this by creating your own Dockerfile and building your own image. In this example we'll do this to fix the ascdc/iiif-manifest-editor Dockerfile. First, create a folder to store the new Dockerfile. Then create a Dockerfile with just this content:
FROM ascdc/iiif-manifest-editor
EXPOSE 3000
That is all! Now just use this to build your image:
docker build -t myimage .
Run this container:
docker run -d -p 3000:3000 --name mycontainer myimage
The container will now be available in your browser on macOS at http://localhost:3000.

Fixing volume-related issues

Directories inside a Docker container can be exposed on macOS through the VOLUME directive in the Dockerfile. Unfortunately, data in this volume can't always be accessed properly in Docker. For example, tenforce/virtuoso exposes a volume that is actually a symbolic link to another folder. Docker on macOS doesn't deal with that properly. My solution was to create my own Dockerfile (as explained above) that exposes not the symlinked directory but the original directory the symlink points to. For virtuoso I also had to explicitly expose the ports, so it ended up like:

FROM tenforce/virtuoso
VOLUME /var/lib/virtuoso/db
WORKDIR /var/lib/virtuoso/db
Given this incompatibility between the filesystems used by Docker and macOS, you will probably run into problems when, for example, you try to have a MySQL container store its data files on the macOS side. Preferably, just keep them inside the container.

Wednesday, October 21, 2015

A beginner's guide to processing 'lots of' data

Working at a library, there is often a need to process a lot of data. Processing data, for example using XSLT or scripts, is one thing, but when there is 'a lot' of data to process, additional rules apply. I consider myself a beginner when it comes to processing larger amounts of data, but here are some basic rules that I've learned so far:

1. Keep it simple, learn to use Unix-tools

For most processing tasks or data analysis jobs, just a small set of very simple tools will do the job. Complex tasks can be executed efficiently by chaining ('piping') simple tools that are each very good at one specific task. On the other hand, using complex tools will often unnecessarily complicate the process, for example by having very high requirements with respect to available system memory. To start with, tools that will bring you a long way are 'grep', 'sed', 'sort', 'uniq' and 'wc -l'.
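As an illustration of such chaining, here is a classic one-liner that counts the distinct values in a stream and lists the most frequent first (the input is just dummy data):

```shell
# count occurrences of each value, most frequent first:
# sort groups identical lines, uniq -c counts each group,
# sort -rn orders the counts from high to low
printf 'a\nb\na\nc\nb\na\n' | sort | uniq -c | sort -rn
```

Swap the printf for a cat or grep on your real data file and the same chain answers questions like "which record types occur most often" without any custom code.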

2. Keep your job running, use 'screen'

Usually, data is not processed on the actual computer that is at your desk but on a remote machine. When you work on a computer through a remote connection, you don't want to run the risk that your processing job quits just because your console lost its connection. Well, there is a tool for that called 'screen'. Before you start your job, start a 'screen' session. Now your process will just keep on running if you lose the connection to the console. When that happens, just log in to the shell again and give the command 'screen -R'. Now you are back in your last session, the processing job still running.

3. Keep your data in one file

When you need to process, for example, one million records, your OS won't be happy if you store all these records as separate files on your filesystem. Even a simple 'ls' command will now take ages to complete. Storing the data in a database might be a good idea but has some downsides, one of them being that it complicates the required scripts and tools and usually just creates technical overkill.
My approach is to store all data in one file. My data harvesting script serializes XML to one line per OAI-PMH record. Now using a very simple shell script I can process each line separately.
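A note of caution: the `for line in \`cat file\`` idiom needs IFS set to newline, as shown in the harvesting post above. A `while read` loop avoids fiddling with IFS globally; a minimal sketch (records.txt and its content are made up for this example):

```shell
# create a small line-per-record file just for this example
printf '<rec>1</rec>\n<rec>2</rec>\n' > records.txt
# process each record (line); IFS= and -r keep whitespace
# and backslashes in the record intact
while IFS= read -r record ; do
    echo "processing: $record"
done < records.txt
```

The redirection at the end feeds the file to the loop, so no subshell or temporary variable juggling is needed.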

4. Keep your cores working using 'parallel'

When processing data, make the best use of available processing power. This can be done 'manually', by splitting up your task into a number of jobs equal to the number of cores on your machine and running your processes in the background (by adding a '&' to the invocation). Besides making better use of available processing power, splitting input files into smaller ones will significantly speed up the work.
The great news is that both parallelizing jobs and splitting input files can be done using one tool, named GNU parallel. An example call:

<in.txt parallel --pipe --blocksize 3000000 './parse.sh > /tmp/out_{#}'
will process all the lines in 'in.txt' using the script 'parse.sh' and store results in separate files in /tmp/.
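If GNU parallel is not installed, `xargs -P` can give a similar effect: split the input first, then run one worker process per chunk. A sketch, where the input data and the trivial 'worker' (here just a cat) are made up for illustration:

```shell
# create a small input file and split it into one-line chunks
printf '%s\n' 1 2 3 4 > in.txt
split -l 1 in.txt chunk_
# run up to 2 workers at once; each worker processes its own chunk
# (here the "processing" is just copying the chunk to an .out file)
ls chunk_* | xargs -P 2 -I{} sh -c 'cat {} > {}.out'
cat chunk_*.out
```

Unlike parallel's --pipe, this variant leaves the chunk files on disk, which can actually be handy when a worker crashes and you want to rerun just that chunk.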

5. Your OS matters

I am experienced with Unix tools. MS Windows does not appear to be designed for the stuff you can easily do using Unix (or Linux or macOS). Fortunately, with Cygwin you can run most Unix tools on Windows. The downside is that, at least in my experience, scripts under Cygwin are much slower than on an equivalent Linux machine.

6. Your hardware matters

Maybe this goes without saying. I would just like to conclude that when it comes to I/O speed, systems with SSDs are vastly superior to systems with traditional drives. Personally, I will only be buying SSDs.

Friday, February 13, 2015

Link rot: detecting soft-404s

Links rot and content drifts; we are all aware of that. But how can one actually detect link rot? Web servers do not always return a proper '404' HTTP status code if the requested page can't be found. Often, the replacement page that tells the user the requested page was not found is accompanied by a '200' status code, signifying everything is OK. This is called a soft-404.

So we can't always rely on the HTTP status code to know whether a page is available. Since the robustify.js website add-on depends on knowing if a page can be found or not, I implemented an algorithm that attempts to detect these soft-404s.

I followed an approach that was suggested to me on Twitter. By sending the server a request with a random url, we know that the returning page must be a '404'. Now with a technique known as fuzzy hashing we can compare this known '404' page with the requested page. If they are identical or at least very similar it is very likely that the requested page is also a '404'.
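The comparison step can be sketched offline. Robustify.js uses fuzzy hashing on pages fetched from the live server, but the idea boils down to measuring how similar the suspect page is to the known '404' page. In this sketch the file names, page content and the crude line-overlap measure are all just for illustration:

```shell
# fake 'known 404' page (what the server returns for a random URL)
printf 'Oops!\nPage not found\n' > known404.html
# fake suspect page (what the server returned for the real request)
printf 'Oops!\nPage not found\n' > suspect.html
# crude similarity: count the lines the two pages share
sort known404.html > known404.sorted
sort suspect.html > suspect.sorted
shared=$(comm -12 known404.sorted suspect.sorted | wc -l)
total=$(wc -l < known404.html)
# if (nearly) all lines match, classify the suspect page as a soft-404
if [ $shared -eq $total ] ; then echo "soft-404" ; fi
```

A real fuzzy hash (such as ssdeep-style context-triggered hashing) is far more robust against small differences like timestamps or the echoed request URL, which is exactly why robustify.js uses it instead of literal comparison.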

There is room for optimizing this algorithm. To start with, the required level of similarity is something that can be tweaked. Robustify.js now only performs the soft-404 test if the original request results in one or more redirects. However, soft-404s can be generated at the original request URL, without redirection, so with this approach we will miss out on some soft-404s. Further, if the random request actually does return a '404' status code, we might assume this server is configured well and skip the comparison.
Additionally, we might try to 'read' the page and see if it contains strings like 'error' or '404'. Such an approach is clearly less elegant than the fuzzy hashing approach and would require more training (internationalization) and maintenance. Perhaps it might work well on top of the hashing approach, if the similarity between the page and the forced '404' is somewhat indecisive.

Improving the soft-404 detection algorithm will necessarily require a lot of manual testing. Without a perfect soft-404 detection there is no easy way to create a test set, and without a large enough test set we can't be very effective at improving the algorithm. Since soft-404 detection is not just valuable for robustify.js but also for many other applications, the heritrix crawler being a fine example, I do hope that with some community effort we can further improve it. A first step might be for all you crawl engineers out there to send suspect crawl artefacts to the statuscode.php service (part of robustify.js) with soft-404 detection enabled and see if it recognizes them as soft-404s.

For example, a test of http://www.trouw.nl/tr/nl/4324/Nieuws/archief/article/detail/1593578/2010/05/12/Een-hel-vol-rijstkoeken-en-insecten.dhtml with the statuscode.php service shows, at the end of the JSON output, that there is a 100% match between this page and the forced '404' (thus recording a '404' status code).

Without soft-404 detection the script gives a 200 status code.

Monday, February 2, 2015

robustify.js: Returning a Memento instead of a “404”

To end link rot on the archaeological site I run (Vici.org), I wanted a tool that would stop sending users to pages that return a “404 File not found” error. So I wrote a nifty little script called robustify.js.

robustify.js checks the validity of each link a user clicks. If the linked page is not available, robustify.js will try to redirect the user to an archived version of the requested page. In how it tries to discover an archived version of the page, the script implements Herbert Van de Sompel's Memento Robust Links - Link Decoration specification (part of the Hiberlink project). By default, it will use the Memento Time Travel service as a fallback. You can easily configure robustify.js on your web pages so that it redirects users to your preferred web archive.

robustify.js can be found on GitHub. A demo can be seen here.

Friday, January 30, 2015

Webarchaeology: finding old pages still online

Here in the Netherlands, the legal framework isn't very supportive of doing a full domain harvest. This is one of the reasons that the KB, the National Library of the Netherlands, follows a selective approach: the owner of each selected site is asked for permission to have the site harvested, stored and made available by the KB.

Almost by definition, a selective approach does not result in as complete a representation of the national web as a domain harvest does. Selecting websites for the archive is also a labour-intensive job. Because of this approach, potentially valuable parts of the Dutch web are at risk.

Regarding the history of the Dutch web, many early sites have disappeared before they were archived by the KB or other organisations like the Internet Archive. However, some remains of that early web are still online. But how to find those pieces to preserve them for the future?

In the days before Facebook, people used personal home pages to publish on the web. Many commercial providers that hosted the early home pages have gone. Other home pages disappeared because people moved to other providers and didn't bother to keep the home page. Others were simply removed, due to a lack of interest, privacy concerns or sadly because their owners died and stopped paying the bills.

In the non-commercial world, the situation has been a bit more stable. Most of the scientific institutions that witnessed the rise of the web still exist. Generally, those organisations have experienced little pressure to save money by removing the unused home pages of employees who moved on or retired. Probably payroll administration wasn't even tied to the administration of user accounts. Often, employees of some of those institutions played a vital role in the early days of the web. Overall this has created a beneficial situation for what might be called 'internet archaeology': using the internet to dig up stuff that was thought to be long gone.

Following this approach, a great deal of pages from the early web were found. Some great examples are the home pages found at CWI, site:homepages.cwi.nl and at NIKHEF, Willem van Leeuwen's homepage.

Another method of finding old pages is to use another early site as 'bait'. One of the early Dutch websites was DDS.nl (see: Internet Archive), a highly successful Freenet. Overwhelmed by its success and lacking sufficient funds, DDS.nl stopped in 1999. Combined, this makes DDS.nl great bait for finding old pages. A Google search for link:dds.nl -site:dds.nl will result in many pages from before the turn of the century.

Tuesday, January 27, 2015

Linkrot and the mset^H^H^H^H data-versiondate attribute

I run an archaeological website (http://vici.org/) that has a database of over 20000 records as its backend. Many of these records provide external links. Of course, every now and then linkrot creeps in: links stop working or direct to a page with content other than intended.

To overcome this I've started creating a tool that will auto-archive all external links. When a user clicks on a link, a JavaScript will invoke a tiny service that returns the HTTP status code of the requested page. If the page is not available (returning a 404), the user will be redirected to a web archive. The aim is that the site will eventually run its own web archive, auto-archiving each newly discovered link.

When directing a user to an archived version of a page, ideally we link to the very version of the page the author had in mind when he created the link. So we need more information than just a hyperlink. This issue can be solved by following an approach originally suggested by Ryan Westphal, Herbert Van de Sompel and Michael L. Nelson in "The mset Attribute". Basically it proposes to enrich hyperlinks with an attribute that provides temporal context, refers to a specific archived copy, or both. Their draft has now been superseded by the Memento Robust Links specification (Robust Links - Link Decoration, see also Robust Links - Motivation).

A hyperlink following this specification could look like:

<a href="http://www.w3.org/spec.html" data-versiondate="2014-03-17">HTML</a>



I intend to implement the data-versiondate attribute in the CMS of the website. When a new link is added to a record, the CMS will insert a data-versiondate attribute using the current date.
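As a sketch of that CMS step, here is how such a decorated link could be generated with today's date (the URL and link text are just examples, not part of the actual CMS):

```shell
# emit a hyperlink decorated with data-versiondate set to the current date;
# date +%F produces the required YYYY-MM-DD format
url='http://www.w3.org/spec.html'
printf '<a href="%s" data-versiondate="%s">HTML</a>\n' "$url" "$(date +%F)"
```

The attribute records when the author saw the page, so a later lookup in a web archive can ask for the Memento closest to that date.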

Update 2015-01-27 17:09 (CET): added the Robust Link specs and changed examples accordingly.

PS: See also the W3.org community on Robustness and Archiving.