Code for Arduino Sensorium and Python serial reader. For other Arduino code, follow links from entries in Moosteria.

Open source

Most of this is old and of dubious value, but in case it helps anyone, here is code that I have open sourced. If you use this code or have any comments on it, email me at

Part of speech taggers

I have written several part of speech taggers based on Hidden Markov Models (HMMs) over the years. I wrote the first one while working at Acorn Computers on Esprit Project 860 (Linguistic Analysis of the European Languages). This was a collaborative project between Acorn and various European universities to develop software for grapheme-phoneme and phoneme-grapheme conversion, and to build tagged corpora in several languages. Our initial corpora were 100,000 words in English, French, German, Dutch, Italian, Spanish and Greek, with the source being EEC regulations. A second corpus of similar size was drawn from newspaper text; for English, we obtained three issues of The Independent as typesetting tapes. These are laughably small by today's standards, but at the time, it was still a challenge to process them efficiently.

In some respects, we were ahead of our time, in that statistical, corpus-based NLP was just beginning to take off. I don't know if the other partners were aware of this. Certainly, in the Acorn team, I don't think we really knew how our work fitted into the broader direction of research in the NLP community. The Acorn group's work on HMM tagging was largely ignored by the more senior partners in the project, and fell by the wayside when the project ended. I think it is also a little sad that those same partners did not make the corpora available outside the project: they would have been a great resource at that point in the development of statistical NLP, being not only tagged but also (for the EEC corpus) translation-equivalent.

After I finished my PhD at the end of 1992, my supervisor Ted Briscoe allocated some funding to me from the Acquilex project, in order to write a new HMM tagger. Writing the program went very quickly, and I used it as a vehicle for conducting experiments on the practicalities of making taggers work well, leading to three publications: one showing that Baum-Welch re-estimation (an expectation-maximization technique) was not as straightforward to use as many people thought, a second on detecting errors in HMM tagging, and a third on the impact of tagset design. Copies are available on my publications page. Variants of the Acquilex tagger were subsequently used in several of Ted's projects, such as RASP, and by several other organizations and research groups.

I have made the Acquilex tagger available as an open source project, under the GNU General Public Licence. You can download it here, in source form. It should compile on Unix, Cygwin and Windows (with Visual Studio 6). This version is a descendant of the original Acquilex tagger, and includes some minor additions and bug fixes. Unlike some taggers such as that of the demonic Eric Brill, it does not come with a tagging model, so you will need to train it from a tagged corpus (or otherwise, as described in my ANLP paper).
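To give a feel for what training an HMM tagger from a tagged corpus involves, here is a minimal Python sketch. It is not the Acquilex tagger's code or its file format: it assumes a plain bigram model with add-one smoothing, and the names train, log_prob and viterbi are made up for illustration.

```python
import math
from collections import Counter, defaultdict

START = "<s>"  # hypothetical marker for the start of a sentence


def train(tagged_sentences):
    """Count tag-to-tag transitions and tag-to-word emissions."""
    trans = defaultdict(Counter)
    emit = defaultdict(Counter)
    vocab = set()
    for sentence in tagged_sentences:
        prev = START
        for word, tag in sentence:
            trans[prev][tag] += 1
            emit[tag][word] += 1
            vocab.add(word)
            prev = tag
    return trans, emit, vocab


def log_prob(counter, key, n_outcomes):
    """Add-one smoothed log probability (an assumption of this sketch)."""
    return math.log((counter[key] + 1) / (sum(counter.values()) + n_outcomes))


def viterbi(words, trans, emit, vocab):
    """Return the most probable tag sequence for words."""
    if not words:
        return []
    tags = sorted(emit)
    n_words = len(vocab) + 1  # one extra outcome for unknown words
    # best[i][tag] = (score of best path ending in tag, previous tag)
    best = [{
        t: (log_prob(trans[START], t, len(tags))
            + log_prob(emit[t], words[0], n_words), None)
        for t in tags
    }]
    for i in range(1, len(words)):
        column = {}
        for t in tags:
            e = log_prob(emit[t], words[i], n_words)
            column[t] = max(
                (best[i - 1][p][0] + log_prob(trans[p], t, len(tags)) + e, p)
                for p in tags
            )
        best.append(column)
    # Follow the back-pointers from the best final tag.
    tag = max(tags, key=lambda t: best[-1][t][0])
    path = [tag]
    for i in range(len(words) - 1, 0, -1):
        tag = best[i][tag][1]
        path.append(tag)
    return path[::-1]
```

Given a list of sentences, each a list of (word, tag) pairs, train collects the counts and viterbi then tags new text. A real tagger needs much better handling of unknown words and sparse counts than add-one smoothing provides.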

There is a second tagger, written in a more modern style, and without the restrictions of the GPL. Download it here. Documentation is included.

Simple Good-Turing estimation

Good-Turing estimation is a technique for taking the observed frequencies of some set of events and smoothing them to reflect the fact that the observations are only a sample. Geoff Sampson and William Gale devised a practical version of the technique, which they dubbed Simple Good-Turing. A description and pointers to relevant publications can be found here. Sampson also made his C implementation of the algorithm available. I recoded the algorithm in C++ and encapsulated it in a class so that it can be used in other programs. You can download the source here. The code has been tested with gcc and Visual C++, at least as far as showing that it gives the same output as Sampson and Gale's version on some test files.
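As a rough illustration of the underlying idea, the Python sketch below computes the basic Turing estimate: the adjusted count is r* = (r + 1) N(r+1) / N(r), where N(r) is the number of distinct items seen exactly r times, and the probability mass reserved for unseen items is N(1) / N. This is not Simple Good-Turing itself, which additionally smooths the N(r) values with a log-linear fit before using them, and it is not a port of the C++ class. The function name and the fallback to the raw count when N(r+1) is zero are assumptions of this sketch.

```python
from collections import Counter


def good_turing(counts):
    """Basic (unsmoothed) Turing estimate.

    counts maps each observed item to its frequency r (r >= 1).
    Returns (p0, adjusted), where p0 is the probability mass reserved
    for unseen items and adjusted maps each r to its adjusted count r*.
    """
    total = sum(counts.values())             # N, the sample size
    freq_of_freq = Counter(counts.values())  # N(r)
    p0 = freq_of_freq.get(1, 0) / total
    adjusted = {}
    for r, n_r in freq_of_freq.items():
        n_r_plus_1 = freq_of_freq.get(r + 1, 0)
        if n_r_plus_1:
            adjusted[r] = (r + 1) * n_r_plus_1 / n_r
        else:
            # Assumption: keep the raw count when N(r+1) is zero.
            # Simple Good-Turing avoids this gap by smoothing N(r).
            adjusted[r] = float(r)
    return p0, adjusted
```

The gaps at higher frequencies, where N(r+1) is often zero, are exactly the problem that the smoothing step in Simple Good-Turing is designed to address.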

Windows Media Utilities

One of the annoying things about Windows Media Player is that if you drag-and-drop a number of files into it, the resulting playlist is not in a sensible order. For example, if you have converted an audio CD to WMA format using Windows Media Player's own Copy From Audio CD function, the tracks have names which consist of a numeric prefix followed by the name of the track, so that if they were sorted lexicographically, they would be in the same order as the tracks on the CD. The playlist does not come out in that order, however. You can fix this by reordering the tracks and then saving the playlist, but it is tedious to do this. I have written two Perl scripts which take one or more directory names and create a playlist in each directory from the WMA files found there, with the tracks in the right order. See the comments at the start of the scripts for details of how to use them. One of them outputs a WPL file, and the other an ASX file.
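The idea is simple enough to sketch in a few lines of Python. This is not a translation of the Perl scripts: the function names are invented, and the WPL skeleton is a minimal one written from memory of the format, so treat it as an assumption rather than a reference.

```python
from pathlib import Path
from xml.sax.saxutils import escape, quoteattr


def ordered_tracks(names):
    """Keep only WMA files and sort them lexicographically."""
    return sorted(n for n in names if n.lower().endswith(".wma"))


def render_wpl(title, tracks):
    """Render a minimal WPL playlist (assumed skeleton) as a string."""
    lines = [
        '<?wpl version="1.0"?>',
        "<smil>",
        f"  <head><title>{escape(title)}</title></head>",
        "  <body>",
        "    <seq>",
    ]
    # quoteattr adds the surrounding quotes and escapes special characters.
    lines += [f"      <media src={quoteattr(t)}/>" for t in tracks]
    lines += ["    </seq>", "  </body>", "</smil>"]
    return "\n".join(lines) + "\n"


def write_playlist(directory):
    """Write <directory name>.wpl inside directory and return its path."""
    folder = Path(directory)
    tracks = ordered_tracks(p.name for p in folder.iterdir() if p.is_file())
    target = folder / (folder.name + ".wpl")
    target.write_text(render_wpl(folder.name, tracks), encoding="utf-8")
    return target
```

Because the track names start with a zero-padded number, a plain string sort is all that is needed to recover the CD order. The media paths are written relative to the playlist, which is why the playlist is saved in the same directory as the tracks.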

Image Organizer

I wrote a utility for helping to organize images, in Java. It's called FEAST, and you can get the jar file for it here, with some rather limited documentation here.

GMail Contacts Converter

I find it useful to export my Gmail contacts as an HTML page, so I can print it or download it. Here is a simple Perl script which allows you to do this. It works on the Gmail-format CSV file that Gmail can export.
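For anyone who would rather not use Perl, the same kind of conversion can be sketched in Python. This is not a port of the script above. The column names "Name" and "E-mail 1 - Value" are assumptions about the export format and may need changing to match the header row of your own file.

```python
import csv
import html
import io


def contacts_to_html(csv_text, name_col="Name", email_col="E-mail 1 - Value"):
    """Turn exported contacts CSV text into a simple HTML table.

    name_col and email_col are assumed column headings; adjust them to
    whatever appears in the first row of the exported file.
    """
    reader = csv.DictReader(io.StringIO(csv_text))
    rows = []
    for record in reader:
        # Escape values so that characters like < and & display correctly.
        name = html.escape(record.get(name_col) or "")
        email = html.escape(record.get(email_col) or "")
        if name or email:
            rows.append(f"<tr><td>{name}</td><td>{email}</td></tr>")
    header = "<tr><th>Name</th><th>E-mail</th></tr>"
    body = "\n".join([header] + rows)
    return f"<html><body><table>\n{body}\n</table></body></html>\n"
```

Read the exported file into a string, pass it to contacts_to_html, and write the result to a .html file that can be opened and printed from a browser.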

PDF Combiner

This little Python program combines PDF files. It looks at files in the directory it is running in, and if it finds a file called Foo.pdf and others called Foo_0001.pdf, Foo_0002.pdf and so on, it will merge them into a single file called Foo.pdf, with the pages from the numbered files preceding those from Foo.pdf itself. The numbered files are then deleted. It does this for all such sets of files in the directory, prompting for confirmation first. I wrote this so that when I scan documents, such as bank statements, I can scan new pages to a numbered file and then prepend them to the existing one. The program requires PDF libraries from