David Elworthy, david-at-friendlymoose.com, April 2004. (Minor revisions, January 2006.)
The C+ tagger is a bigram HMM tagger together with a collection of utilities for training and support. Here I describe how to use the programs and give the major file formats.
The programs are train, which builds a statistical model from tagged corpora; tagger, which assigns tags to corpora; bwtrain, which performs Baum-Welch re-estimation; merge, which combines models; and remodel, which dumps a model's tag and transition data.
Also in this file, you will find a short history, the licence terms, and file formats.
You are welcome to report, and fix, any bugs you find. There may have been some bit rot since I last tested it thoroughly.
I wrote a tagger in C for the Acquilex project (available from http://www.friendlymoose.com/software.htm, under the GPL) in early 1993. Not long after this, I started to shift from using C as my main programming language to C++, and decided to write a tagger as a way of learning the language. As is often the case with newcomers to object-oriented programming, I hugely overdesigned it in terms of its use of inheritance, and was unhappy with the result. The current tagger, the C+ tagger, was originally a halfway point between the C tagger and the C++ one. Most of the tagger was written in early 2001, while I was working as a consultant, and then tidied up in 2004. In terms of structure, it is loosely related to the tagger used in the ANVIL system, which I wrote while at Canon Research in 1995-99, although the programs have no more than a superficial resemblance.
The C+ tagger is released under an open source licence which essentially allows you to use it for any purpose. Here is the licence agreement, which also appears in each source file:
Copyright (C) David Elworthy 2004. All rights reserved.
Redistribution and use in source and binary forms for any purpose, with or without modification, are permitted provided that the following conditions are met:
1. Redistributions of source code must retain the above copyright notice, this list of conditions, and the following disclaimer.
2. Redistributions in binary form must reproduce the above copyright notice, this list of conditions, and the disclaimer that follows these conditions in the documentation and/or other materials provided with the distribution.
THIS SOFTWARE IS PROVIDED ``AS IS'' AND ANY EXPRESSED OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL THE AUTHOR BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.
You may contact me at david-at-friendlymoose.com.
To build the programs, use the makefiles: Makefile for building with gcc, and winmake for Visual C++. The current version has been tested only with gcc on cygwin, and Visual C++ version 6 on Windows XP.
train reads one or more tagged corpora and builds the statistical model needed for tagging. The command line arguments are:
- -t <tagset file>
- -e <endings file>
- -q <equivs file> Read equivalence classes from the file (applied at end)
- -o <output model file>
- -T Write output in textual form
- -i Input corpus format:
  - p Penn treebank format
  - t Tabbed format
- -s <size> Size to allocate for the internal lexicon
- -n Recognise numbers if all digits only (default: any digit)
- -N Recognise numbers if they "parse"
- -g Apply Simple Good-Turing correction
- -G Apply plus one correction
- -l Preserve the case of words as seen in the training data
- -r <threshold> Reduce the lexicon
- -R <file> File for reduced lexicon
- -- Marks end of arguments: followed by corpus list
The tagset defines all the tags and their properties, the endings file lists rules for how to handle unknown words during tagging, and the equivs (equivalences) file allows low frequency lexical entries to be merged. The output file, or model, is normally in binary form, and encodes the statistical model (lexicon and transition probabilities), the tagset, the endings rules, and the equivalences. Option -T outputs the model in a textual form, for human inspection. More details of the files, and in particular their formats, can be found below.
Some of the other options need a little further explanation.
Lexicon size: during processing of the corpus, the program builds up an internal word list (lexicon) in an array. The -s option indicates how much space to allocate for the internal representation of the word list. If you specify a number smaller than the program needs, it will still succeed, but may take longer to run.
Number handling: the tagger treats all numbers as representatives of a single lexical entry, and hence as having the same tags. The options -n and -N tell the program how to decide whether a word is a number. If neither is specified, any word containing a digit is treated as a number; with -n, a word is a number only if it consists of digits and nothing else; and with -N, if the word has the form of an optional + or - at the start, followed by a sequence of digits and commas, followed optionally by a decimal point and more digits. The test with -N is fairly weak. For example, +12,34.00 is treated as a number, even though the comma is in the wrong place.
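As a rough illustration, the -N test as described might be implemented along these lines. This is a sketch of my reading of the rule, not the tagger's actual code, and it reproduces the weakness noted above, since comma placement is not checked:

#include <cctype>
#include <string>

// True if w is an optional + or -, then digits and commas (at least one
// digit), then optionally a decimal point and more digits.
bool looksLikeNumber(const std::string& w) {
    std::string::size_type i = 0, digits = 0;
    if (i < w.size() && (w[i] == '+' || w[i] == '-')) ++i;
    while (i < w.size() && (std::isdigit((unsigned char)w[i]) || w[i] == ',')) {
        if (std::isdigit((unsigned char)w[i])) ++digits;
        ++i;
    }
    if (digits == 0) return false;
    if (i < w.size() && w[i] == '.') {        // optional decimal part
        ++i;
        while (i < w.size() && std::isdigit((unsigned char)w[i])) ++i;
    }
    return i == w.size();                     // the whole word must match
}

On this reading, looksLikeNumber("+12,34.00") returns true, as in the example above.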
Correction options: -g and -G allow a statistical correction to be made to the frequencies observed for tags and transitions, to allow for possible sparseness in the training corpus; -g applies Simple Good-Turing correction, and -G plus one correction. -G is actually of little value, and if correction is wanted, -g is almost always better.
Case preservation: -l means keep the case of words as seen in the training data. Otherwise, everything is converted to lower case. For a large enough corpus, -l is a good idea.
Lexicon reduction: the -r option allows the size of the lexicon to be reduced. Entries which have a frequency less than or equal to the threshold are deleted from the lexicon. Note that this mechanism makes the model slightly approximate, as the transition and total tag frequencies are not adjusted. If the -R option is specified as well, the entries which were removed are written to the indicated file.
Example:
train -t DataFiles/tags.map -o model -i p -n -F -- DataFiles/corpus
The tagger is the main program for assigning tags to corpora. It can label untagged corpora, or can take tagged corpora and test its choice of tag against the manually assigned ones. The command line arguments are:
- -m <model file> (i.e. output of training)
- -i Input corpus format:
  - p Penn treebank format
  - t Tabbed format
  - u Untagged (default)
- -l Enable case hold
- -L Disable case hold (default)
- -o <output file>
- -f Output corpus format:
  - p Penn treebank format (default)
  - a Flagged format
- -a Write all tags, in order (FB only)
- -A Disable -a
- -s Write scores with tags (FB only)
- -S Disable -s
- -t <threshold>
- -T Disable -t
- -c <corpus file> ...
- -v Use Viterbi algorithm (default is FB)
- -F Use FB algorithm.
- -e Evaluate results and report on cerr
The tagger can process multiple input files and switch from one output file to another. Many of the other options, such as the tagging algorithm, can also be switched between files. The command line arguments are processed in order, which is in some cases a little counter-intuitive. For example, you have to specify the output file before the input file, since as soon as the input file argument is encountered, the tagger starts processing it. Similarly, any other arguments which affect the tagging process, such as the model, must be specified before the input file. Error messages are displayed if things really can't be done, but few warnings are given (for example, switching to -v while -s is in effect draws no warning, even though scores apply only to FB).
Both the input and output file can be a dash (-), in which case the standard input and output streams are used. These are also the defaults.
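The order sensitivity falls out naturally if the arguments are dispatched in a single pass, with most options merely recording state and -c acting immediately. Here is a hypothetical sketch of such a loop (Options and tagCorpus are invented names, not the tagger's internals):

#include <iostream>
#include <string>

// Hypothetical state holder and action, purely to illustrate the dispatch.
struct Options { std::string model, output; bool useViterbi = false; };

static void tagCorpus(const std::string& corpus, const Options& o) {
    std::cerr << "tagging " << corpus << " with model " << o.model
              << (o.useViterbi ? " (Viterbi)" : " (FB)") << "\n";
}

int main(int argc, char** argv) {
    Options opts;
    for (int i = 1; i < argc; ++i) {
        std::string arg = argv[i];
        // Most options just record state for later...
        if (arg == "-m" && i + 1 < argc)      opts.model  = argv[++i];
        else if (arg == "-o" && i + 1 < argc) opts.output = argv[++i];
        else if (arg == "-v")                 opts.useViterbi = true;
        // ...but -c acts at once, with whatever state has been set so far.
        else if (arg == "-c" && i + 1 < argc) tagCorpus(argv[++i], opts);
    }
    return 0;
}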
Here are additional explanations of some of the arguments:
Case holding: normally when a token is read, it is looked up in the lexicon first in its raw form, and then in lower case. This may be disabled by enabling case holding (-l), in which case no change to the case is made.
Output format modifiers: -a causes all tags to be written; -A gives just the one with the highest score. -s displays the (normalised) score with each tag; -S disables it. -t sets a threshold on the output score. In conjunction with -a, tags are output only if their score exceeds the threshold. Without -a, the best tag is written only if its score exceeds the threshold. In both cases, the result can be a word with no tag. If the threshold starts with a -, then tags are output until the total score is greater than or equal to the threshold. These options apply only if the tagging algorithm is FB, as the Viterbi algorithm selects a single tag and does not assign a score to it.
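As a sketch of one reading of the threshold behaviour (the types, and the separation into a flag plus a magnitude, are my assumptions; the tags are taken to be already sorted by decreasing score):

#include <vector>

struct ScoredTag { int tag; double score; };

// Tags to output for one word under -a with threshold th. `cumulative`
// corresponds to a threshold written with a leading '-'.
std::vector<ScoredTag> applyThreshold(const std::vector<ScoredTag>& sorted,
                                      double th, bool cumulative) {
    std::vector<ScoredTag> out;
    double total = 0.0;
    for (const ScoredTag& st : sorted) {
        if (cumulative) {
            if (total >= th) break;   // stop once the running total reaches th
            out.push_back(st);
            total += st.score;
        } else if (st.score > th) {   // plain per-tag threshold
            out.push_back(st);
        }
    }
    return out;                       // may be empty: a word with no tag
}

With the cumulative form, the highest scoring tags are emitted first, so the output is the smallest set of top tags whose scores sum to at least the threshold.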
Tagging algorithm: either FB (Forward-Backwards) or Viterbi can be used as the tagging algorithm. Viterbi is slightly faster but makes less information available.
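For reference, Viterbi decoding for a bigram HMM can be sketched as follows. The names (trans, emit, anchor) are illustrative, not the tagger's actual data structures, and a real implementation would work with log probabilities to avoid underflow on long inputs:

#include <vector>

// trans[i][j] = P(tag j | tag i); emit[j][w] = P(word w | tag j);
// anchor is the index of the anchor tag used at the start and end.
std::vector<int> viterbi(const std::vector<int>& words,
                         const std::vector<std::vector<double>>& trans,
                         const std::vector<std::vector<double>>& emit,
                         int anchor) {
    int T = words.size(), n = trans.size();
    std::vector<std::vector<double>> best(T, std::vector<double>(n, 0.0));
    std::vector<std::vector<int>> back(T, std::vector<int>(n, anchor));
    // Initialise from the anchor, as at the start of a corpus.
    for (int j = 0; j < n; ++j)
        best[0][j] = trans[anchor][j] * emit[j][words[0]];
    // Keep only the best path into each tag at each position.
    for (int t = 1; t < T; ++t)
        for (int j = 0; j < n; ++j)
            for (int i = 0; i < n; ++i) {
                double s = best[t-1][i] * trans[i][j] * emit[j][words[t]];
                if (s > best[t][j]) { best[t][j] = s; back[t][j] = i; }
            }
    // Close with a transition back to the anchor, then trace the best path.
    int last = 0;
    double bestScore = -1.0;
    for (int i = 0; i < n; ++i) {
        double s = best[T-1][i] * trans[i][anchor];
        if (s > bestScore) { bestScore = s; last = i; }
    }
    std::vector<int> tags(T);
    for (int t = T - 1; t >= 0; --t) { tags[t] = last; last = back[t][last]; }
    return tags;
}

This makes clear why Viterbi yields only a single tag per word with no usable score: the per-tag values are path maxima, not the per-tag probabilities that FB computes.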
Evaluation: if the input is tagged, the -e option will compare the tagger's decisions and output an accuracy measure.
Example:
tagger -m model -f p -i p -o corpus.out -F -e -c DataFiles/corpus.test
bwtrain is used for Baum-Welch (re-estimation) training, in which an existing model is adapted using an untagged corpus, and a new model is written. The corpus may be passed through several times, as given by the number of iterations (-I). The program accepts tagged corpora, but ignores the tags. Many of the options are similar to those used in train or tagger; a sketch of the re-estimation step appears after the option list.
- -m <input model file>
- -o <output model file>
- -i Input corpus format:
  - p Penn treebank
  - t Tabbed format
  - u Untagged (default)
- -l Enable case hold
- -I <number> Number of iterations
- -s <size> Size to allocate for temporary lexicon (default = 10000)
- -n recognise numbers if all digits only (default: any digit)
- -N recognise numbers if they "parse"
- -q <equivs file> read equivalence classes from the file (applied at end)
- -T write output in textual form
- -r <threshold> Reduce the lexicon
- -R <file> File for reduced lexicon
- -- marks end of arguments: followed by corpus list
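For orientation, one Baum-Welch iteration over a single word sequence can be sketched as below. This is the generic textbook re-estimation with invented names, assuming a uniform start distribution rather than the tagger's anchor mechanism, and updating only the transition probabilities; treat it as an outline of the idea, not as bwtrain's actual implementation:

#include <vector>
using std::vector;

struct HMM {
    int n;                          // number of tags
    vector<vector<double>> trans;   // trans[i][j] = P(tag j | tag i)
    vector<vector<double>> emit;    // emit[j][w]  = P(word w | tag j)
};

// One re-estimation pass over a sequence of word ids. Assumes every word
// has non-zero probability under at least one tag.
void baumWelchIteration(HMM& m, const vector<int>& words) {
    int T = words.size(), n = m.n;
    vector<vector<double>> a(T, vector<double>(n)), b(T, vector<double>(n));
    vector<double> scale(T);
    // Forward pass, rescaling at each step to avoid underflow.
    for (int t = 0; t < T; ++t) {
        scale[t] = 0.0;
        for (int j = 0; j < n; ++j) {
            double s = 0.0;
            if (t == 0) s = 1.0 / n;   // uniform start: an assumption
            else for (int i = 0; i < n; ++i) s += a[t-1][i] * m.trans[i][j];
            a[t][j] = s * m.emit[j][words[t]];
            scale[t] += a[t][j];
        }
        for (int j = 0; j < n; ++j) a[t][j] /= scale[t];
    }
    // Backward pass, reusing the same scale factors.
    for (int i = 0; i < n; ++i) b[T-1][i] = 1.0;
    for (int t = T - 2; t >= 0; --t)
        for (int i = 0; i < n; ++i) {
            double s = 0.0;
            for (int j = 0; j < n; ++j)
                s += m.trans[i][j] * m.emit[j][words[t+1]] * b[t+1][j];
            b[t][i] = s / scale[t+1];
        }
    // Expected transition counts, then renormalise each row.
    vector<vector<double>> num(n, vector<double>(n, 0.0));
    vector<double> den(n, 0.0);
    for (int t = 0; t + 1 < T; ++t)
        for (int i = 0; i < n; ++i)
            for (int j = 0; j < n; ++j) {
                double xi = a[t][i] * m.trans[i][j]
                          * m.emit[j][words[t+1]] * b[t+1][j] / scale[t+1];
                num[i][j] += xi;
                den[i]    += xi;
            }
    for (int i = 0; i < n; ++i)
        if (den[i] > 0.0)
            for (int j = 0; j < n; ++j) m.trans[i][j] = num[i][j] / den[i];
    // A full implementation would re-estimate the lexical probabilities too.
}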
The merge program allows you to combine models. This can be useful if you want to train on several corpora separately and then combine the resulting models. The models must not have been constructed using equivalence classes, although you can apply equivalence classing to merge's output; the same applies to endings rules. In addition, all the models must have been constructed with the same tagset. The command line arguments are as follows; for details see train and tagger.
- -e <endings file>
- -q <equivs file>
- -g Apply Simple Good-Turing correction
- -G Apply plus one correction
- -s <size> Size to allocate for temporary lexicon (default = 10000)
- -o <output model>
- -T Write output in textual form (must appear before -o)
- -r <threshold> Reduce the lexicon
- -R <file> File for reduced lexicon
- -- <model>... List of input models
Merge can be used with a single input model, to change its characteristics.
Remodel reads a binary format model and outputs the tag and transition data, in the form:
<tag1> <tag2> <score>
The command line gives the input and output file names. The input file is binary; the output file is textual.
The tagset file lists all the tags which are to be used, with their properties. Each line of the tagset file consists of a tag, optionally followed by a space and the letter c or a. c means the tag represents a closed category, such as determiners, and will not be used for unknown words. a means the tag is an anchor. Anchors are special words inserted at the start and end of the corpus to initialise the transitions. Every tagset must include exactly one anchor tag; commonly ^ is used. Tags themselves may not contain spaces, and are case sensitive. Here is an extract from an example tagset:
^ a
DT0 c
NN
NNS
VBG
VBZ
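Reading this format is straightforward. A minimal sketch, with hypothetical types, assuming the tag and its flag are separated by whitespace:

#include <fstream>
#include <sstream>
#include <string>
#include <vector>

struct TagDef { std::string name; bool closed = false, anchor = false; };

// Read a tagset file: one tag per line, optionally flagged c or a.
std::vector<TagDef> readTagset(const std::string& path) {
    std::vector<TagDef> tags;
    std::ifstream in(path);
    std::string line;
    while (std::getline(in, line)) {
        std::istringstream fields(line);
        TagDef t;
        if (!(fields >> t.name)) continue;   // skip blank lines
        std::string flag;
        if (fields >> flag) {
            if (flag == "c") t.closed = true;
            else if (flag == "a") t.anchor = true;
        }
        tags.push_back(t);
    }
    return tags;
}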
Endings rules are used when the tagger sees a word which is not in the lexicon. All non-closed class tags are put forward as candidates for the word. If endings rules have been specified, the range of tags can be restricted and a bias applied to the probabilities. The rules test patterns against the endings and other parts of the word.
Rules have the form:
[<num>][:<num>][~]<pattern> [~]<tags>
where <tags> is a list of tags separated by spaces. The rule means: if the word meets the pattern (with ~, fails to meet it), then restrict the tags to the given ones (with ~, eliminate the listed tags). If the rule fires and a :<num> is given, then processing jumps to the rule with label <num>. If the rule fires and no continuation is given, no further rules are tested. If the continuation label does not exist, then nothing happens. The continuations allow rules to be skipped, so that more specific rules can block more general ones.
Patterns can be:
-x|y matching a word ending in y but not xy.
y|x- matching a word starting in y but not yx.
If x is empty, the separator | can be omitted. y may be empty only if x is not. Comparisons are not case sensitive.
Special patterns: I for initial capital, A for all capitals, C for any capital.
Each tag may be followed by a slash and a floating point number to give an initial bias to the probability.
This is a very crude mechanism, but it can be put to good use. It is up to the rule set author to make sure the rules appear in a sensible order. In particular, an all caps rule ought to precede an initial cap or any cap rule. A sketch of the suffix test appears after the example below.
Example:
:1 I JJ JJS JJR NN NP NPS SYM
-l|ly RB
1 -er ~JJS
-est ~JJR
-ing ~VB VBD VBN VBP VBZ
-u|s ~VB VBG VBN VBP VBZ NN NP
-ed ~VB VBG VBZ
:2 ~-s ~NNS NPS
2 ~-ing ~VBG
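The test behind the -x|y patterns is simple: the word must end in y, but not in xy. For instance, the rule -l|ly RB above fires for friendly (which ends in ly but not lly) and not for hilly. A sketch of my reading of the test, with the case folding described above:

#include <algorithm>
#include <cctype>
#include <string>

static bool endsWith(const std::string& w, const std::string& suffix) {
    return w.size() >= suffix.size()
        && std::equal(suffix.rbegin(), suffix.rend(), w.rbegin());
}

// True if word matches the pattern -x|y (ends in y but not in xy).
bool matchEnding(std::string word, std::string x, std::string y) {
    auto lower = [](std::string s) {
        std::transform(s.begin(), s.end(), s.begin(),
                       [](unsigned char c) { return static_cast<char>(std::tolower(c)); });
        return s;
    };
    word = lower(word); x = lower(x); y = lower(y);  // not case sensitive
    return endsWith(word, y) && !endsWith(word, x + y);
}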
Equivalence classes allow entries in the lexicon to share the same tag data. If a word only occurs a few times in the training corpus, the statistics about it can be unreliable. By using equivalence classes, the data from words with similar properties can be merged, so increasing the absolute frequency of the whole set (i.e. equivalence class) of words. This process of merging is carried out at the end of training. The equivalence rules have the form:
<level> <tag> ...
One rule appears on each line of the file. The interpretation is this: after the lexicon has been constructed, locate all entries which have a relative frequency of occurrence in the training data of less than or equal to the indicated level (a floating point number) and which have exactly the tags listed, and combine their data. The lines are applied in the order in which they appear.
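For illustration, applying one such rule might look like the sketch below. The lexicon representation here (relative frequencies and per-tag counts in a map) is an invention for the example; the real tagger shares data through the leader mechanism described under the model file format.

#include <map>
#include <set>
#include <string>
#include <vector>

// Invented representation: relFreq is the word's relative frequency in the
// training data; tagFreq maps each of its tags to a frequency.
struct Entry { double relFreq; std::map<std::string, double> tagFreq; };

// Pool the tag data of all entries at or below `level` whose tag set is
// exactly `tags`, and give every member the pooled data.
void applyEquivRule(std::map<std::string, Entry>& lexicon,
                    double level, const std::set<std::string>& tags) {
    std::map<std::string, double> merged;
    std::vector<Entry*> members;
    for (auto& kv : lexicon) {
        Entry& e = kv.second;
        std::set<std::string> entryTags;
        for (auto& tf : e.tagFreq) entryTags.insert(tf.first);
        if (e.relFreq <= level && entryTags == tags) {
            for (auto& tf : e.tagFreq) merged[tf.first] += tf.second;
            members.push_back(&e);
        }
    }
    for (Entry* e : members) e->tagFreq = merged;  // share the pooled counts
}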
The programs can work with several corpus formats, both for training and for tagging. The format is specified by an option to the program. The formats are:
Penn treebank format: words are separated by white space. Each word is followed by a slash character (/) and one or more tags. If there are multiple tags, a bar character (|) appears between them. Words with no tag are ignored during training. Lines which start with the characters *x* are ignored.
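For illustration, splitting one such token might be done as below (a sketch; in particular, taking the last slash as the separator is my assumption):

#include <sstream>
#include <string>
#include <vector>

// Split "word/tag1|tag2" into the word and its tags. Returns false for a
// token with no tag, which training ignores.
bool parseToken(const std::string& token,
                std::string& word, std::vector<std::string>& tags) {
    std::string::size_type slash = token.rfind('/');
    if (slash == std::string::npos) return false;
    word = token.substr(0, slash);
    std::istringstream tagStream(token.substr(slash + 1));
    std::string tag;
    while (std::getline(tagStream, tag, '|')) tags.push_back(tag);
    return true;
}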
Tabbed format: each line consists of a word, a tab character, and one or more tags separated by a dash character (-). Lines which start with the character < are ignored, and an anchor character is inserted instead. Note that words can contain spaces with this input format.
In untagged format, the corpus simply consists of words. It is split at white space, with no special provision for multi-word units or for separating punctuation from words.
The form of the output corpora from the tagger is modified by the various extra flags. Here we give the general description.
Penn treebank format: each word appears on a separate line. The word is followed by a slash (/), and one or more tags separated by bars (|). Each tag may have a score, which follows the tag, separated from it by a slash (/). The tags are output in order of decreasing score, whether or not the score is displayed.
Flagged format is identical to the Penn treebank format, except that at the start of each line three characters appear, followed by a space. The first character is a if the word was ambiguous (had more than one tag) or space otherwise; the second is u if the word was unknown or a space otherwise; and the third is - if the word was tagged incorrectly compared to the input, or a space if it was correct. All words are marked as correct if the input corpus was untagged.
The model file is written out at the end of the training process and read by the tagger. It is normally kept in a binary form; in outline it contains the list of tags together with their transition probabilities to other tags, the endings rules, and the lexicon, consisting of words with their tags and lexical probabilities (or pointers to other words for equivalence classes).

The model can also be written out in a textual form. In this case the endings rules are omitted. The file starts with a section marked <TAGS>...</TAGS>, containing one line for each tag. The lines consist of the tag, its anchor or closed flag (a, c, or . if neither), and the transition probability to each other tag in order. The lexicon follows the tags, and starts with an entry of the form <NUMBERTEST test>, where test is one of AllDigit, AnyDigit, Parsable and Never, followed by the bulk of the lexicon between <LEX> and </LEX>. Each entry in the lexicon appears on a separate line, and consists of the word, a tab, the leader, a tab, and the tags with their probabilities. The leader marks equivalence classes: it is either 00NONE (for words which are not in an equivalence class), 00SELF (for words which are in an equivalence class, where the data for the class is in the current word's lexical entry), or the name of the word holding the data for the current one; there are no tags in the latter case. Two special entries appear: the unknown entry, which holds tag data for unknown words and consists of the tags and probabilities enclosed by <UNKNOWN>...</UNKNOWN>, and an entry for anchor words, used at the start and end of tagging, which consists of the name of the anchor word (usually ^) enclosed by <ANCHOR>...</ANCHOR>.
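For illustration only, a textual model over a tiny tagset might begin roughly as follows; the numbers are invented, the fields of the lexicon lines are tab separated, and the exact layout may differ from what train -T actually produces.

<TAGS>
^ a 0.0 0.6 0.4
DT0 c 0.0 0.1 0.9
NN . 0.5 0.2 0.3
</TAGS>
<NUMBERTEST AnyDigit>
<LEX>
the	00NONE	DT0 0.98 NN 0.02
dog	00NONE	NN 1.0
</LEX>
<UNKNOWN>NN 1.0</UNKNOWN>
<ANCHOR>^</ANCHOR>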