Wednesday, 21 October 2015

A beginner's guide to processing 'lots of' data

Working at a library, I often need to process a lot of data. Processing data, for example with XSLT or scripts, is one thing, but when there is 'a lot' of data to process, additional rules apply. I consider myself a beginner when it comes to processing larger amounts of data, but here are some basic rules I have learned so far:

1. Keep it simple, learn to use Unix-tools

For most processing tasks or data analysis jobs, a small set of very simple tools will do the job. Complex tasks can be executed efficiently by chaining ('piping') simple tools that are each very good at one specific task. Complex tools, on the other hand, often complicate the process unnecessarily, for example by demanding large amounts of system memory. To start with, tools that will take you a long way are 'grep', 'sed', 'sort', 'uniq' and 'wc -l'.
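
For example, assuming a hypothetical file 'records.xml' that holds one OAI-PMH record per line (see rule 3 below), a few chained commands already answer common questions:

wc -l records.xml                                                  # how many records are there?
grep -c '<dc:title>' records.xml                                   # how many records contain a title?
grep -o '<setSpec>[^<]*</setSpec>' records.xml | sort | uniq -c    # which sets occur, and how often?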

2. Keep your job running, use 'screen'

Usually, data is not processed on the computer at your desk but on a remote machine. When you work on a computer over a remote connection, you don't want to run the risk that your processing job quits just because your console lost its connection. Well, there is a tool for that, called 'screen'. Before you start your job, start a 'screen' session. Now your process will keep running even if you lose the connection to the console. When that happens, just log in to the shell again and give the command 'screen -R'. You are back in your last session, with the processing job still running.
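
A typical session could look like this (the session and script names are just examples):

screen -S harvest       # start a named screen session
./process.sh            # start the long-running job inside the session
                        # detach with Ctrl-a d; the job keeps running
screen -R               # after logging in again: resume the last session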

3. Keep your data in one file

When you need to process, for example, one million records, your OS won't be happy if you store all these records as separate files on your filesystem. Even a simple 'ls' command will then take ages to complete. Storing the data in a database might be a good idea, but it has some downsides, one of them being that it complicates the required scripts and tools and usually amounts to technical overkill.
My approach is to store all data in one file. My data harvesting script serializes the XML to one line per OAI-PMH record. Using a very simple shell script I can then process each line separately.
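
A minimal sketch of such a shell script, assuming a hypothetical 'parse.sh' that reads one record from standard input:

# feed each serialized record (one per line) to parse.sh
while IFS= read -r record; do
    printf '%s\n' "$record" | ./parse.sh
done < records.xml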

4. Keep your cores working using 'parallel'

When processing data, make the best use of the available processing power. You can do this 'manually' by splitting your task into a number of jobs equal to the number of cores on your machine and running the processes in the background (by adding a '&' to the invocation). Besides making better use of the available processing power, splitting input files into smaller ones will significantly speed up the work.
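
A rough sketch of that manual approach, assuming a 4-core machine and the same hypothetical 'parse.sh':

split -n l/4 in.txt part_            # split in.txt into 4 chunks, keeping lines intact
for f in part_*; do
    ./parse.sh < "$f" > "$f.out" &   # one background job per chunk
done
wait                                 # wait until all background jobs have finished
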
The great news is that both parallelizing jobs and splitting input files can be done with a single tool, GNU parallel. An example call:

 
<in.txt parallel --pipe --blocksize 3000000 './parse.sh > /tmp/out_{#}'

This call processes all the lines in 'in.txt' with the script 'parse.sh' and stores the results in separate files in /tmp/.
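If the order of the results does not matter, the partial outputs can afterwards simply be combined again with something like 'cat /tmp/out_* > out.txt'.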

5. Your OS matters

I am experienced with Unix tools. MS Windows does not appear to be designed for the kind of work that you can easily do on Unix (or Linux or MacOS). Fortunately, with Cygwin you can run most Unix tools on Windows. The downside is that, at least in my experience, scripts run much slower under Cygwin than on an equivalent Linux machine.

6. Your hardware matters

Maybe this goes without saying, but I would just like to conclude that, when it comes to I/O speed, systems with an SSD are vastly superior to systems with traditional hard drives. Personally, I will only be buying SSDs from now on.