The "Acquilex" Part-of-speech Tagger
David Elworthy
January 2004
Introduction
The Acquilex tagger is a HMM-based part-of-speech tagger intended both as a practical system for tagging text corpora and as a research tool for
investigating how taggers perform under various conditions. It was originally written for a
short-term research project which I carried out in the first half of 1993, just after completing
my PhD at the University of Cambridge Computer Laboratory. The work was funded
as part of the Acquilex project, led by Ted Briscoe. This version is a
descendant of the original Acquilex tagger, and has not changed much since
1995. Other versions have been branched off and used by a number of
people including Ted Briscoe and John Carroll. One version can be found in the
RASP tools (http://www.cogs.susx.ac.uk/lab/nlp/rasp/).
Three publications are based on experiments conducted using the tagger: see
Elworthy (1994a, 1994b, 1995). Some of these experiments are referred to in the
tagging chapter of Manning and Schütze's book Foundations of Statistical
Natural Language Processing.
I am releasing the tagger now (January 2004) under an open source
licence.
These days, it is no big deal to write a HMM tagger, and there are several which
are available commercially or otherwise. If you use it, please drop me a line at david'at'friendlymoose.com,
and include an acknowledgement in any publications. As noted, the code was
written as a research vehicle, and some parts of the code (particularly to do
with combining parsing and tagging) are probably not much use. I believe the
main part of the code to be free from serious bugs and memory leaks. It is in
ANSI C, and has been tested with GCC and Microsoft Visual C++.
There are several programs in the tagger suite, described in the following
sections. With a HMM tagger, you must first construct a model containing lexical
and transition probabilities, normally using a tagged corpus. The tagger program
can be used to do this, as well as its main task of tagging a corpus. I am not
including any tagged corpora with the software, to avoid potential problems with
copyright or licence infringement. If you need to get hold of a tagged corpus, I
recommend visiting the Linguistic Data
Consortium or the British National
Corpus. The
program can tag a corpus which has already been annotated with the correct tags,
and collect statistics for evaluation purposes, or it can tag a corpus which
contains no tag information. There are a number of secondary programs for
examining and manipulating the tagger model files. I admit that the description of how
to use the programs is not particularly clear; experiment and follow the
examples is my advice.
Licence
The tagger is released under the GNU General Public Licence. Personally, I
have some reservations about the GPL and the GNU Project in general, but I think
it is the most appropriate licence given that the software was originally
written in an academic research environment. The licensing terms do not apply to
any other variants of the Acquilex tagger prior to this one, released in January
2004. Be sure to follow the terms of the licence agreement, or you may just
open the door one day to find Richard Stallman waiting to beat you with a big
stick with nails sticking out of it. The licence terms can be found in a file
called LICENSE. You must include this file if you distribute the tagger further.
If the licence file is missing for some reason, you can obtain a copy from
the GNU Project website. If you need a tagger
that can be used commercially, I may be releasing a separate one under a less
restrictive open source licence; check my
website to see.
User Guide to the Tagger (label)
The tagger is used for training the statistical model and for tagging corpora
on the basis of a model. There is a large number of options for selecting
exactly what the tagger will do, and for specifying variations on the input and output formats and the basic algorithms. Options come in two sorts:
boolean flags, and options which take arguments. Where there is an argument, it may
immediately follow the option letter, or may be separated from it by white space. Arguments are terminated by white space on the command line.
Inconsistent collections of options are reported. Options are case sensitive, and may be preceded by "-", which is ignored.
Note that there are many options for experimental variants, and the descriptions
below may not be particularly clear; in some cases, I now have only the haziest
idea of what I did when I wrote the code 10 years ago. For simple use, see
the examples.
The tagger is invoked with a command line of the form:
label <in-corpus> <options>
where in-corpus is the name of the input corpus, and options specifies options as follows.
Tagging algorithm
These options specify the basic action of the tagger.
- V
Use the Viterbi algorithm.
- F
Use the forward-backward algorithm. The default algorithm is forward-backward, and this option need only be specified if training
is being used and re-estimation is not wanted (i.e. use option "l" for this case).
- B n Baum-Welch re-estimation with n iterations. (Default: 1)
- b
Apply numerical stabilisation to Baum-Welch. This is recommended. See
Cutting (1992) for the method used.
- l
First iteration is a training run, which builds the dictionary and transitions from a
tagged corpus. May be modified by option I. With option B, training counts as the first iteration. If a word list is
specified (option w), the training corpus may be untagged, in which case unknown words (ones not in the word list) are reported and
skipped; in this case the results are not reliable, and these words should be added to the word list and the program
re-executed. The idea of this option is to allow you to train from a corpus
and then see how the accuracy varies as you perform iterations of
re-estimation. Also use it for training from a tagged corpus when no
re-estimation is wanted, in conjunction with option F. Confused? See the examples.
- f
Pick the tag which is the most frequent one for the word in the dictionary. Gives a baseline for performance.
- p
Set transitions to product of from and to scores during training. Default is to use "from" score alone; only makes a difference with a
word list. For the input formats currently supported, this option can be
ignored;
it was used for some early experiments.
- a
Tag sequences delimited by anchors (sentence level tagging). Normally, sequences delimited by unambiguous words are tagged.
Many authors of HMM taggers assume that you should tag an entire sentence at
a time. This is unnecessary, as all you need is a word with a single tag to
anchor the end of the sequence. The option is provided in case you wish to
do sentence-by-sentence tagging. An anchor is the special word ^.
- L n When re-estimating, do not use data from any hypotheses which
fall below the given score.
- g
Apply plus-one correction to transitions in training. This adds one to all transition frequencies before calculating probabilities, and helps
to correct for cases where there is too little training data (for example, it
can improve accuracy by 1-2%). It may very slightly reduce accuracy when there is a large
training corpus.
- G
Apply plus-one correction to lexical probabilities. This tends to be less effective than the transition adjustment.
(Options g and G are so called because I originally described them as
Good-Turing correction, before I knew what it really was.)
Input and output data files
Output data files are produced from training or Baum-Welch re-estimation. Input data files are needed unless training is being used.
- d fn Filename of input dictionary.
- t fn Filename of input transitions.
- r fn Root of input dictionary and transition file names. The
dictionary file name is formed by adding .lex and the transitions file
name by adding .trn. Use this as a fast way of specifying "d" and
"t" names. The dictionary and/or transitions files can be omitted for some run
types.
- w
Treat the dictionary as a word list. This applies only when training. Instead of using the tag specified in the input
corpus for each word, the tags from the dictionary are used, and the corpus
is used to provide the frequencies only.
- D fn Filename of output dictionary.
- T fn Filename of output transitions.
- R fn Root of output dictionary and transition file names. The
full names are formed as for option "r".
- A n Allocate dictionary of given size. If the option is not
specified, and a dictionary is being read, the size allocated is taken from the dictionary file; when training, the default is a fixed constant specified by
the define DICTLEN in the source code.
- i fn Filename of inference rules for tags. Inference rules allow
extra tags to be added to the dictionary; see below for details.
Tag list files
Tag list files are used to list the valid part of speech tags and specify
their properties. More details appear below. Options associated with tag list
files are:
- m fn Filename for tag list. Defaults to
"tags.map".
- M fn Reduce the tag set, using
reduction file fn. The reduction file consists of lines containing pairs
of tags. The tags read from a corpus are looked up in the first column and the tag in the second column used instead. This allows for
experimenting with modified tag sets, as in Elworthy (1995). The dictionary, transitions and output are in terms of the reduced tag set. Lines in the file where the
second (target) tag is "_" (underscore) are ignored: they are there to allow
commenting out. In addition, anything after the second tag on the line is ignored, and may also be used for comments.
Input corpus
Input corpus formats are described in more detail in a later section.
- C n Corpus format. Default is tagged in LOB format, skipping
untagged words. n is one of:
1: Untagged input.
2: LOB format, skipping words with ditto tags.
4: Tagged Penn treebank ("postexts") format.
8: Lancpars format.
- x fn Use an exclusion list. This is a file in the same format as
a dictionary (see file formats description). Words occurring in this list are skipped. For a tagged corpus,
such words are skipped only if the correct tag is one of the ones specified in
the file.
Output corpus
Output corpus formats are described in more detail in a later section.
- o fn Filename for output corpus. Defaults to standard output
(stdout).
- O n Output format. n is either 0, or results from summing codes as
follows. Default value: 1.
0: No output.
1: Standard output format: flags, word, correct tag, chosen tag on error, no scores; may be modified by some of the values that follow.
The flags indicate whether a word was ambiguous, unknown and correctly tagged,
plus some special cases.
2: Output sequences containing errors with respect to tags in the input
corpus. Only sequences of words which contain an incorrect tagging are reported, separated
by lines of dashes. The context consists of the sequence of words which was being tagged.
4: Don't output flags.
8: Output all tags (default: correct tag + error tag).
16: Output scores with tags (default: no scores).
32: Compressed format. Uses more compact output lines.
64: Delimit sentences. Prints a line before each anchor.
128: Output results in a suitable format for analysing errors. Overrides
all other options. Each sequence of words containing one or more errors is output preceded by a separator, and a line of the form:
correct tag/chosen tag (number of errors). A sequence is defined as a collection of ambiguous words, or as a sentence if
option "a" is set. Where more than one error occurs in a sequence, it is
output once for each error. The errors are sorted by correct tag and then by chosen tag.
256: Report observation and tag probabilities. These are in a special format; it is intended that some post-processing is run on the output
corpus to extract the data. The lines start with O for observation, R for relative tag probability, followed by + for correct tagging and -
for incorrect tagging. R is output only for ambiguous words (it is always 1 otherwise). Suppresses other output.
512: One word to a line with the chosen tag (if any). Overrides other output options. Untagged words are not output.
- S Report performance statistics at the end of the output file. Requires
a tagged input corpus. The statistics consist of the total number of words in
various classes, how many were correct and what percentage this represents. In
addition, the average overall observation probability of each sequence tagged
is reported, as is the average perplexity.
- u Report unknown words in the form "word" or
"word_tag" (if the input is tagged) on stderr. The results can be used to build a dictionary, which can them be merged with
the existing dictionary with the dmerge program. The option is intended for tagging
a new corpus with existing information, but avoiding unknown word problems: a first run would be carried out to collect unknown words, which
would then be tagged by hand (if necessary), and added to the dictionary in the manner just described.
- c n Reject hypotheses on ambiguous words which have a relative
tag probability below the threshold n. Such words are marked "r" in the
output, and are not counted towards success statistics, or are counted as being assigned the correct tag by an oracle, depending on a source
code switch.
Unknown words
The performance on unknown words can be improved by specifying rules for
selecting tags based on the form of the word. Normally all non-closed class
tags are used. The unknown word rules use very simple surface analysis of the
word to reduce the range of possible tags. See below for the file format and
how the rules work.
- U fn Read unknown word rules from the specified file.
Numbers
The tagger's performance is improved if all numbers are treated
as instances of the same "word". If neither of the following
options is specified, each distinct number is treated as a separate word; for example,
every number will have a separate entry in the dictionary during training. With
these options, numbers are mapped to a special entry in the tagger dictionary. These options specify how a number is defined.
- n Treat any word containing a digit as a number.
- N Treat words as numbers if they "parse", i.e. if they may have
leading + or *+ (for LOB), followed by a string containing at least
one digit and made up of only digits, commas, points and minuses as
numbers.
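As an illustration of the option N definition (this is my reading of the description above, not the tagger's actual code), a number test might be sketched in C as:

```c
#include <ctype.h>
#include <string.h>

/* Sketch of the option N test: optional leading "+" or "*+" (LOB),
 * then a string containing at least one digit and made up only of
 * digits, commas, points and minuses. Illustration only. */
int parses_as_number(const char *w)
{
    int saw_digit = 0;
    if (strncmp(w, "*+", 2) == 0) w += 2;   /* LOB-style leading *+ */
    else if (*w == '+') w++;
    if (*w == '\0') return 0;
    for (; *w; w++) {
        if (isdigit((unsigned char)*w)) saw_digit = 1;
        else if (*w != ',' && *w != '.' && *w != '-') return 0;
    }
    return saw_digit;
}
```

Under this reading, "1,234.5" and "+42" count as numbers, while "12a" and "abc" do not.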
Initialisation
These override the values in the dictionary and/or
transitions, and are intended for use with Baum-Welch re-estimation where the
initial data is not reliable and hence may need smoothing. These options were
used for the experiments reported in the ANLP paper and summarised in Manning
& Schütze.
- I n Do initialisation as specified by
n. Some options may be
combined; add the codes to achieve this effect. Denoting the raw
frequency from a dictionary entry (i.e. the number of occurrences of a
tag on a word) by d, the raw transition frequency as t, and the raw
count (gamma) for a
tag as g, we normally use d/g and t/g. The option allows some other possibilities. Let
N be the number of tags on a word, T be the number of tags in
the tag list, and s be the total score on a dictionary word. Then the
dictionary and transition scores are set as follows, depending on the argument
n to the I options; n is the sum of the following values:
0: Dict d/g, Trans t/g (default).
+1: Dict d/N.
+2: Dict d/T.
+4: Trans t/T.
+8: Set all dict values to 1 first (in place of d)
+16: Set all trans values to 1 first (in place of t).
+32: Dict d/s.
The
same applies to pi values (i.e. sequence-initial probabilities) as to transition values.
If options 1 or 2 and option 4 are in effect, no transitions file need
be specified. If neither 1 nor 2, or not 4, there must be one.
With the current algorithm, +1 and +2 have the same effect (other
than absolute magnitudes of some quantities). These options do not apply when training from a tagged corpus (option
l).
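The dictionary side of these variants can be illustrated with a small sketch (my reading of the description above, using the symbols defined there; this is not the tagger's code, and only a few of the modes are shown):

```c
/* Illustrative I-option dictionary scores: d = raw tag frequency on the
 * word, g = total count (gamma) for the tag, N = number of tags on the
 * word, T = tagset size. A sketch of the description, not the tagger. */
double dict_score(int mode, double d, double g, double N, double T)
{
    if (mode & 8) d = 1.0;          /* +8: set dict values to 1 first */
    if (mode & 1) return d / N;     /* +1 */
    if (mode & 2) return d / T;     /* +2 */
    return d / g;                   /* 0: default d/g */
}
```

For example, with d = 3, g = 10 and N = 2, the default gives 0.3 while +1 gives 1.5.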
Phrasal tagging
The phrasal tagging code was intended for use in some experiments with
recognizing phrases and incorporating them into the tagging process. The code
here is unlikely to be of much use. The phrase recognition can be done using
finite state machines or context free grammars.
Finite state machines
These options require the programs to be built with the define Use_FSM.
- e fn Read finite state machine definitions from the file
fn.
- E
Trace finite state machines, by printing out the stack after reading
each word.
Parser
These options require the programs to be built with the define Use_Parser.
- q fn Read a grammar from the file
fn.
- Q
Trace parser, by printing out the stack after reading each word.
General phrasal options
- P
Use anchor bracketing. Normally when a phrase has been built, and is
about to be tagged, phrase brackets are inserted around it, so that the
initial transition is from the (internal tag of) the phrase bracket to the
first object in the phrase and similarly at the end of the phrase. Option "P"
uses the anchor tag in place of the phrase tag.
Other options
- X "Special" option : reserved for various additional tests.
- z Verbose: report iterations and number of words processed.
- Z Debug: dump stack after tagging each sequence.
Example command lines
Here are some examples to illustrate typical options to label.
- To build dictionary and transitions from a tagged training corpus
in.txt, writing the results to corp.lex and corp.trn, using the
standard tag list in tags.map. Treats numbers as "parsed". O0
suppresses output.
label in.txt N l Rcorp O0
- To use the dictionary and transitions generated above to tag corpus
test.txt, with output including performance statistics to
test.out, and skipping ditto tags. Use the Viterbi algorithm.
label test.txt N rcorp C2 S V otest.out
- As above, using the forward-backward algorithm with 10 iterations of
Baum-Welch re-estimation, and writing performance statistics only to
standard output.
label test.txt N rcorp C2 S B10 O0
Files
The tagger requires a corpus file and a tag list. Corpus files may appear in a
variety of formats, selected by the input option C, and are used both to
provide the training data to the tagger and for the text to be tagged. The input
corpus must be tagged for training using the forward-backward algorithm; an
untagged corpus can be used with Baum-Welch re-estimation. A tagged corpus is
also required if you want to evaluate the accuracy of the tagger. The tag
list file specifies all of the tags to be used. The main output is a tagged
corpus, which may appear in a variety of formats, selected by option O. A
number of other input and output files are used with certain options. The format
of these files is described in detail in the following sections.
Input files
Corpus files
Several input corpus formats are available.
- tagged in LOB format (with some variants)
- untagged
- tagged with scores
- Penn Treebank ("postexts") format
- Lancpars format (also used for SEC)
In all the formats, the symbol ^ is treated as a sentence boundary
marker. It is always unambiguous, i.e. has a single hypothesised tag.
Tagged LOB format
For tagged LOB format, white space separates words. Words may be followed by
an underscore separator and a tag, for example editorial_NN. A further
underscore may optionally follow the tag. If a word has no following
underscore, or if the tag is empty, it is skipped. The tag is used in
training and in testing performance. This is the default format, and no C
option need be specified for it.
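As a sketch of how such tokens decompose (an illustration of the format, not the tagger's own reader; it assumes the separator is the last underscore):

```c
#include <string.h>

/* Split a tagged-LOB token such as "editorial_NN" into word and tag.
 * Returns 0 if the word has no tag (in which case it is skipped).
 * Assumes the separator is the last underscore; illustration only. */
int split_lob_token(const char *tok, char *word, char *tag)
{
    size_t n = strlen(tok);
    char buf[256];
    const char *u;
    if (n == 0 || n >= sizeof buf) return 0;
    memcpy(buf, tok, n + 1);
    if (buf[n - 1] == '_') buf[--n] = '\0';   /* optional trailing underscore */
    u = strrchr(buf, '_');
    if (u == NULL || u[1] == '\0') return 0;  /* no tag or empty tag: skip */
    memcpy(word, buf, (size_t)(u - buf));
    word[u - buf] = '\0';
    strcpy(tag, u + 1);
    return 1;
}
```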
Untagged format
In untagged format, words are simply separated by white space characters
(tabs, spaces and newlines). Use option C1 for this format.
Tagged LOB with ditto tags
With option C2, the format is the same as tagged LOB, except that if the
tag is a "ditto tag" (used to mark idioms; see Garside et al. (1987)), then
the word is skipped. This is the recommended format with the tagged LOB
corpus.
Penn Treebank format
In Penn Treebank format, option C4, tags are separated from words by "/".
Alternate tags may be specified, separated by -, although the program
ignores all but the first. The character "/" may appear in a word, escaped
by \. Untagged words are skipped. Header lines, starting
*x* are skipped, and lines starting === are treated as sentence
boundaries. Words are separated by white space or ":".
This format covers only the texts in the "postexts" part of the treebank.
Lancpars format
Lancpars format is the same as tagged LOB, except that the input may also
contain phrase brackets, which have the form "[ tag" and " tag]".
Transitions to the first of these and from the second are treated as
transitions to and from the tag. In addition, an internal tag is constructed,
representing transitions from the start of a phrase to the first object within
it, and the corresponding transition at the end of a phrase. Brackets without
tags are skipped.
The tagged format with scores is not implemented at present.
Tag list files
Tag list files (also called tag mapping files) list all of the tags which will
be used in a corpus, one to a line. The line may optionally start with a
hyphen ("-"), which indicates that the tag is a closed class. Closed class tag
are not considered as possible hypotheses for unknown words during tagging;
all other tags are.
Blank lines are skipped. White space at the start of a line, after the hyphen
and at the end of a line is ignored. Note that this means that lines which
contain a tag followed by white space and more text will be (incorrectly)
treated as a single tag, i.e. accidental white space in the middle of a tag
must be avoided.
The tag list file and the dictionary and transitions files must match
up: if the dictionary and transitions files are built with some tag list, the
same tag list must be used for tagging with that dictionary and transitions.
The program makes some attempt at a consistency check, but it cannot always be
guaranteed.
An additional tag ^ is used as an "anchor": it is used to mark
sentence boundaries.
If the program was built with either phrasal option (FSMs or Parser), then the
tags list may also include lines which start with a plus ("+"), indicating
that the tag is, or may be, phrasal. Phrasal tags are treated as being closed
class tags. In addition, an extra internal tag is constructed for each phrasal
tag, used for transitions from the phrase tag to the first object in a phrase,
and for the last object in a phrase to the phrase tag.
Spaces at the start of a line are ignored, and also have the effect of
disabling the special interpretation of "-" and "+", in case these symbols
need to be used as ordinary tags.
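A minimal tag list might look like the following (the tags here are illustrative LOB-style tags only, not a complete set):

```
-ATI
-IN
-PP3
NN
VB
JJ
```

Here ATI, IN and PP3 are marked as closed classes, and the remaining tags are open class.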
Output corpus files
The major output file is the corpus output. A variety of formats is possible, selected by option O. Option S also affects what appears in
the output.
The normal output format -- called verticalised format -- has one word per
line. Lines have the form
XYZ word ctag tags
where:
- X = "a" if the word is ambiguous, and is blank otherwise. When
thresholding is in use, rejected words are flagged "r".
- Y = "u" if the word is unknown, "s" if special (i.e. number
words), and is blank otherwise.
- Z = "x" if the word was not correctly tagged, and is blank otherwise.
(Always blank for untagged input).
O option +4 omits the XYZ flags.
- ctag is the correct tag, and is present only if the input corpus is tagged
and the all tags option (+8) has not been selected.
- tags depends on the output options. If all tags are being reported
(option +8), then it lists all of the candidate tags on the word, indicating
which was chosen with [*] and which is the correct one with [+]. Otherwise the
chosen tag is shown if it differs from the correct one (when using tagged
input). In each case, tags may be optionally followed by their scores (option
+16).
Skipped words are output with no flags or tag information.
Data files
The tagger makes use of two kinds of data file: dictionaries, which list
all known words with their tags and frequencies, and transitions files,
which contain the tag-to-tag transition frequencies, the initial tag
frequencies, and the total tag frequencies. The files are created by the
tagger, and may be merged using the program dmerge.
In general, there is no need to know the format of these files,
and they should not be altered by hand.
Dictionary files
Dictionaries list words with the frequencies of their possible tags.
They are initially built from a tagged corpus using label, and may
also result from Baum-Welch re-estimation in label, and from merging
dictionaries with dmerge. The dictionary format is also used for
exclusion lists.
Format:
- First line: the letter D followed by the size of the dictionary (s).
- s lines of the form
word ntags tags
where ntags is the number of tags, and tags is that many entries,
each of the
form
tag baseScore
where tag is the tag in printing form and baseScore is the base score of the tag.
By convention, dictionary file names end in .lex.
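For illustration, a small dictionary might look like this (made-up words and frequencies, and the exact whitespace layout is a guess from the description above):

```
D 2
editorial 1 NN 7
still 3 RB 40 JJ 12 NN 3
```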
Transitions files
Transitions files specify the frequency of occurrence of tags, tag pairs
and of tags as the start of a sequence. Transitions can be built from a
training corpus, or can result from Baum-Welch re-estimation, and can also be
merged with dmerge.
The transitions file is kept in binary format, and must be read and written as
such on systems that distinguish binary and text files. Format:
- The letter T, followed by the number of tags as an integer (n).
- Array of doubles of size n for pi (sequence-initial probability) values.
- Array of doubles of size n for gamma (normalization) values.
- Array of doubles of size n^2 for transition values.
By convention, transitions file names end in .trn.
Note that different hardware architectures and operating systems write binary
files in different ways, and if a transitions file has been created on one, it
may not work on another.
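The layout can be sketched as follows (an illustration of the description above, not the tagger's actual writer; in particular, exactly how the tag count integer is represented is a guess):

```c
#include <stdio.h>

/* Sketch of the transitions file layout: a 'T' byte, the number of
 * tags, then pi, gamma and the n*n transition matrix as raw doubles.
 * Illustration only; real files may differ in detail, and (as noted
 * above) are not portable across architectures in any case. */
void write_trans(FILE *f, int n, const double *pi,
                 const double *gamma, const double *trans)
{
    fputc('T', f);
    fwrite(&n, sizeof n, 1, f);
    fwrite(pi, sizeof *pi, (size_t)n, f);
    fwrite(gamma, sizeof *gamma, (size_t)n, f);
    fwrite(trans, sizeof *trans, (size_t)n * n, f);
}
```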
Tag Inference files
Option "i" allows tag inference rules to be used. The file consists of a
number of lines of the form
<threshold> <tag>*. The rule is applied to any dictionary entry for which the total score (i.e.
the total score of its tags) relative to the total score of all words in the
dictionary falls below the threshold, and for which all of the tags appear in
the list. The effect of the rule is to add all tags which appear in the
rule to that dictionary entry, giving them a score equal to the total score of
the word divided by the number of tags specified in the rule. The aim in doing
so is to add extra tags to low frequency words, where there may not have been
sufficiently many occurrences in the training corpus to provide reliable
information. The same word may be adjusted by more than one inference rule.
Inference happens after reading the dictionary and before normalisation; the
altered dictionary is not written out.
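A sketch of the firing condition (my reading of the description above, not the tagger's code):

```c
#include <string.h>

/* Does an inference rule fire for a dictionary entry? The entry's share
 * of the total dictionary score must fall below the threshold, and every
 * tag already on the entry must appear in the rule's tag list. If the
 * rule fires, each rule tag is added with score entry_total / n_rule.
 * A sketch of the description above, not the tagger's code. */
int rule_fires(double threshold, double entry_total, double dict_total,
               const char **entry_tags, int n_entry,
               const char **rule_tags, int n_rule)
{
    int i, j;
    if (entry_total / dict_total >= threshold) return 0;
    for (i = 0; i < n_entry; i++) {
        int found = 0;
        for (j = 0; j < n_rule; j++)
            if (strcmp(entry_tags[i], rule_tags[j]) == 0) { found = 1; break; }
        if (!found) return 0;
    }
    return 1;
}
```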
Unknown word rules
In the initial release of the Acquilex tagger, unknown words were treated by
hypothesising all non-closed class tags. It is possible to do better with a simple-minded analysis of the
surface form of the word. It is not as sophisticated as morphological
analysis, but in general does help to bring down the number of tags for unknown
words and so
improve the accuracy.
The action on unknown words is specified by a number of rules. Each rule
consists of a pattern to be matched and an action, which is either to restrict
the hypotheses to certain tags or to eliminate certain tags. The rules are
applied in order. Generally, they are mutually exclusive, so that if one rule
fires, then the remaining ones are not checked. However, it is also possible
to specify rules which may fire and then allow further rules to be tested.
The rules are supplied to the tagger in a file with the following format. Each
rule must appear on a single line. Rules have the
following format:
[<id>][:<next>][~]<pattern> [~]<tags>
<tags> is a list of tags separated by spaces. The rule is read
as "if the word meets the pattern, then restrict the tags to the given
ones". ~ before the pattern changes this to "if the pattern does not
match ...", and ~ before the tags means "... exclude the listed tags
from consideration". Finally, each rule may have a number <id>. If :<next> appears and the rule fires, the next one tested is
the rule with an id value of next, providing a crude way of grouping rules. If no such
continuation is specified, and the current rule does apply, then no further
ones are tested. Continuations must appear after the rule they are referenced
from.
Patterns are either suffixes, prefixes or one of a range of special
patterns. Suffix patterns have the form -x|y where x and y are strings and should be in lower case. The pattern matches a word
ending in y, provided the word does not end in xy. x or y (but not both) may be empty. Prefix patterns are
y|x-, which
matches a word beginning with y but not with yx. The special patterns
are single upper case letters: I to match words with an initial
capital, A for ones all in capitals and C for ones with a
capital letter at any position in the word.
For rules in which the tags are not negated, each tag may be followed by a
floating point number, which is applied as a factor to the base score of the tag. This allows biasing towards certain tags. If more than one rule applies,
the factors accumulate, so they must be used with some caution.
If the rules result in all hypotheses being excluded, then the whole
open-class list is restored, and a warning message is issued.
Note that the syntax checking on the rules file is fairly unsophisticated.
Example:
The following rules apply to the Penn treebank tagset.
:1 I JJ NN NP NPS SYM
-l|ly RB
1 -er ~JJS
-est ~JJR
-ing ~VB VBD VBN VBP VBZ
-u|s ~VB VBG VBN VBP VBZ NN NP
-ed ~VB VBG VBZ
:2 ~-s ~NNS NPS
2 ~-ing ~VBG
The third rule says that if a word ends with "er" then it is not a
superlative adjective, while the second says that words ending with "ly" but
not with "lly" are adverbs. The initial capital rule blocks matching of
"ly", for names like "Billy".
Parser input files
When the parser is being used, a grammar file must be specified (using option
"q"). Each line of the file has the form:
<score> <phrase-tag> -> <tag>*
score is the factor applied to the score of the enclosed phrase when a
rule matched, phrase-tag the tag assigned to the phrase, and the
sequence of tags after the arrow is the body of the rule.
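An illustrative rule, using hypothetical tags and a made-up score, just to show the shape of a line:

```
0.5 NP -> AT JJ NN
```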
Finite state machine files
When FSMs are being used, a FSM definition file must be specified (using option
"e"). The file consists of one or more FSM definitions using the following
syntax:
<fsm> ::= <name> <state>* end
<state> ::= <id> <item>* ;
<item> ::= [<tags>] _ <action>
<action> ::= <id> | <tagSc> | back <tagSc>
<tags> ::= <tag> | <tag> <tags>
<tagSc> ::= <tag> | <tag>_<score>
Spaces, tabs and newlines act as separators. FSMs must end with "end". The
list of items in a state must end with ";". To allow ";" as a tag, escape it
with \. There must be a space before the ";" and the "_" in an
item. FSM names and state ids are any text which is not a tag. The tags to the
left of the "_" cause the action to take place if any one or more of them are
recognised. All of the tags that match are created as hypotheses within a
phrase. An empty tag list matches everything not specified in another tag list
on the item. FSMs are non-deterministic: all actions that can be carried out
are. The actions are:
- id Jump to the state with the given id.
- tagSc Create a phrase with the given tag, and run the tagger on
it. The score is the factor applied to the score of the enclosed phrase,
defaulting to 1.0.
- back tagSc As tagSc, except that the things that matched
this item do not form part of the phrase.
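As an illustration only (a guess at a plausible machine, with hypothetical tags, state names and score), a definition might look like:

```
np
s0  AT _ s1 ;
s1  NN JJ _ s1
    _ back NP_0.8 ;
end
```

Here state s1 loops on nouns and adjectives, and anything else closes an NP phrase without including the current word (the back action).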
Secondary Programs
There is a collection of secondary programs for examination and manipulation
of tagger files, and for miscellaneous operations which were found to be
useful in research based on the tagger. Error checking in some of these
programs is minimal. You may need to modify the makefile in order to build some
of the minor programs.
dmerge
dmerge merges dictionaries and transition matrices. It has two main
purposes: combining the data files constructed by training on two or
more corpora, without having to retrain on the entire corpus, and merging a
dictionary of "unknown" words into an existing dictionary.
Usage:
dmerge <out> <options>
out is a root name used for the output dictionary and/or transitions.
Extensions are appended as for options "r" and "R" of label. The
options are:
- d Merge dictionaries
- t Merge transitions
- m fn Filename for tags list; default "tags.map"
- A n Allocate a dictionary of the given size, as for label.
- r Indicates end of options. The remainder of the command line is a
list of root names for the files to be merged.
If "d" is specified, then the .lex files are read, and if "t" the
.trn ones. "d" and "t" may be specified together. The tag list file must be
the same one for all of the dictionary and transitions files involved.
dtinfo
dtinfo reads a dictionary and transitions file and prints information
about tag distributions.
Usage:
dtinfo <root> <map>
root is the root name of the files. Extensions are added as for options
"r" and "R" of label. map is the name of the tag list file,
defaulting to "tags.map". The output consists of the distribution of
words against number of tags on the word, and the distribution of tags against
the number of transitions from and to a tag.
exdict
exdict allows examination of a dictionary. Words are read from standard
input and their tags with basic and normalised scores are printed out.
Usage:
exdict <dictionary> <trans>
readtr
readtr reads a transitions file and prints it in a textual form.
Usage:
readtr <trans> [<map>]
trans is the name of the transitions file. The default extension is
not added by readtr. map is the name of the tag list file,
defaulting to "tags.map", if not specified. The output consists of the
normalisation value (gamma) for each tag, and all non-zero transitions from
each tag, both un-normalised and normalised.
outcomp
outcomp takes the output of the labeller and tests the accuracy of
phrasal labelling. The output must have been produced from a tagged input
corpus, with one of the phrasal tagging options in use. It reports the number
of cases where there was an exact match, i.e. the tagger correctly predicted a
phrase, where the tagger added an extra phrase, and where the tagger failed to
find a phrase present in the input corpus. The algorithm is given in the
header of outcomp.c. It will not spot cases where, say, the open
bracket of a phrase is in the right place but the close bracket is not: such
cases add one to each of the "extra" and "missed" counts.
Usage: outcomp reads from the standard input (stdin) and writes the totals to the
standard output (stdout); i.e. use it as a filter.
rules
rules builds phrase structure rules and finite state machines from a
corpus in Lancpars format. Rules are written to stdout, and FSM definitions to
stderr.
Usage:
rules <corpus> [<map>]
cmptran
cmptran is a program for comparing tag sequence probabilities. It takes
lines consisting of three tags from standard input and works out the
probability of each sequence and, on the second line of each pair, the ratio
of that probability to the previous one.
Usage:
cmptran <trans>
dephrase
dephrase takes a Lancpars format corpus and removes all phrase tags
except those from a specified set, given by the first argument on the command
line. The corpus is read from standard input and the result written to
standard output.
Usage:
dephrase <tag-file>
Error messages
There is a large collection of error messages relating to illegal combinations
of options, which is not listed here. Of the remaining messages, some indicate
errors in the input data, some a system failure (such as running out of
memory), and some that an internal limit in the program has been exceeded.
The latter class requires changing the limit and recompiling. In addition,
there are a few consistency check messages, which should not occur unless
there is an error in the program. Error recovery is quite poor at present, and
many errors will simply cause the program to abort (via the function
get_out).
Memory errors can sometimes be cured (especially when using the phrasal
packages), by modifying the corpus to include anchors (^) between
sentences.
Array overflow (MaxTags)
Internal limit exceeded: recompile with larger value for MaxTags. (outcomp)
Back action on initial state
Bad score
Duplicate FSM name
Duplicate state ID
Duplicate tag ignored
Jump destination missing
Tag required
Token coincides with a tag
Unexpected end of file
Unexpected end of item definition
"end" of FSM missing
Various errors in the format of a FSM definition. Note that FSM syntax
checking is not very robust.
Bad reduced tags mapping line: ...
The line does not have the required format. See tagger option M for
specification.
Bad total
No hypotheses on word
A consistency check has failed.
The latter error may occur if it was not possible to assign any hypotheses
to unknown words, for example, if all tag classes were marked as closed.
Beta pass consistency fail
Consistency check failed (choose_hyps ...)
Consistency error: PhraseStart/PhraseEnd seen without lancpars
Consistency fail ...
End of rule consistency check failed at ...
Warning: consistency check fails (non-skip with 0 hypotheses)
analyse: consistency error
These errors should never occur in the standard version; they indicate a bug
in the program. (label, outcomp)
Buffer overflow at "..."
Buffer overflow (fetch_word)
Score buffer overflow at "..."
Tag buffer overflow at "..."
Word buffer overflow at "..."
Recompile with buffer size increased.
Cannot open ...
The file could not be opened. Check it exists and was specified correctly.
Corpus line buffer overflow
Needs a larger internal buffer -- recompile with larger MaxLine.
Dictionary full
File is too big for the internal dictionary array. Try using option A to
specify a larger dictionary.
Dictionary size is missing
Error reading tags from dictionary
Map file indicates ... tags, array has ... entries
Wrong file code
Some fault in the format of a file supplied to the program. Check the right
file is being used. Some of these errors may result from using the wrong tag
list file.
Duplicate entry ... in ...
Reported in a number of cases where an object must be unique. The message will
contain additional information to help identify the problem.
Empty prefix pattern at line ...
Empty suffix pattern at line ...
The indicated pattern was expected but not found in an unknown words rule.
Error reading word from dictionary at word ...
Trans write failure
Trans read failure
Input/Output failures. Check for system problems, e.g. that
the file is not corrupt when reading, and that there is enough space when
writing.
Failed to read a tag
No tag was found in a file when one was expected.
File is too big for dictionary
The size of the dictionary being read is greater than the one specified using
the "A" option.
Inference rule with no tags
Self-explanatory.
Line overflows buffer
Line is too long when reading FSM definition or parse rules: recompile with
larger MaxLine.
Missing suffix pattern at line ...
Missing prefix pattern at line ...
The indicated pattern was expected but not found in an unknown words rule.
Missing threshold in inference rule
Self-explanatory.
No valid options: need d or t or both
Self-explanatory. (dmerge)
Out of memory creating ...
The program needs more memory for the given data. There is no cure if more
memory cannot be made available.
Phrase start "..." does not match phrase end "..."
Mismatch between phrase brackets. (rules)
Root name is too long
A buffer in the program is not large enough; either find some other way of
specifying the file name or change the source.
Rule has wrong format at line ...
General syntax error in unknown words rule.
Too few arguments
Check that the command line had the right format. (dmerge)
Too many tags
There is a word in a tagged input corpus with more than the allowed number of
tags (1 in the current version).
Too many tags in inference rule
The internal array for inference rules is not large enough. Recompile,
adjusting MaxTags in io.c.
Unexpected close bracket ...
Unmatched bracket ...
The given close bracket did not have a matching open bracket. (outcomp)
Unexpected end of file in rules
Self-explanatory.
Unexpected phrase end in "..."
End of phrase bracket found with no preceding start of phrase bracket. (dephrase)
Unexpected tag separator in "..."
Tag separator found without a corresponding word. (dephrase)
Unknown tag
The specified tag is not given in the tag list file.
Unknown word
Unknown word seen when using a wordlist: add it to the dictionary.
Warning: terminal missing
Warning: "->" missing
Warning: empty rule
Various errors in the format of a parse rule.
... not found
The word on which information was requested is not in the dictionary. (exdict)
Programmer's Details
How to build the programs
The source code is supplied with a makefile which works with GNU make on Unix
and Cygwin. To make any of the programs,
use the command make program-name. To make them all, type either
make all or simply make. See the makefile for details on compiler and
linker options.
To compile on Windows, use the command nmake -f winmake. This has been
tested only with Microsoft Visual Studio version 6. You may first have to set up
environment variables by executing the batch file VCVARS32.BAT, usually found in
c:\Program Files\Microsoft Visual Studio\VC98\Bin.
To include the FSM and/or phrasal parsing code, change the makefile to define
Use_FSM and/or Use_Parser.
Source files
The source files are:
analyse.c: Functions for analysis output option.
cmptran.c: Top level of cmptran.
common.c: Functions common to whole of tagger.
dephrase.c: Top level of dephrase.
diction.c: Dictionary processing functions.
dmerge.c: Top level of dmerge.
dtinfo.c: Top level of dtinfo.
exdict.c: Top level of exdict.
fsm.c: Finite state machine package.
io.c: Higher level input/output functions for tagger.
label.c: Main tagging functions.
list.c: Low level linked list processing.
low.c: Low level input/output functions.
mainl.c: Top level of label.
map.c: Tag conversion functions.
outcomp.c: Source of outcomp.
parser.c: Chart parser package.
phrase.c: Common phrase handling code.
readtr.c: Top level of readtr.
rules.c: Top level of rules.
stack.c: Functions for maintaining the stack.
trans.c: Transition array processing functions.
The header files are: analyse.h, common.h, diction.h, fsm.h, label.h, list.h,
low.h, map.h, parser.h, phrase.h, stack.h, and trans.h. The files correspond
roughly to the ".c" files with the same names.
See the source code for detailed documentation. Generally, external data
structures and macros are documented in header files. Functions are given a
brief description in the header files, with more details in the source file.
How to change the input and output formats
Input format
The low level input is handled by functions in the files low.c and
io.c. If additional formats for the input corpus are to be added, the
following changes should be made.
- Add an extra code for use with the C option. This is done by adding a
line to common.h of the form:
#define xxx (n)
where xxx is some suitable define name, and n is a power of two not
equal to any of the codes already in use. Whenever the code needs to test
whether the option is set, expressions such as
if (InOpt(xxx))
can be used.
- In low.c, the function corpus_getword must be altered to add
an extra case for the new format. The function takes the following parameters:
FILE *fp: Pointer to the input file.
char *text: Buffer for the text of the word.
int textlen: The length of this buffer.
int max: Size of the following two arrays.
Tag *tag: An array of tags. This and the scores array are needed if the
input corpus is tagged, either because it is training text or because the
correct tag is provided for testing the accuracy of the tagger. The tags
specified in the input should be placed in this array. Where the input corpus
specifies scores for the hypotheses, they should be placed in score,
which should otherwise be initialised to contain 1 in each entry for which
there is a tag.
Score *score: An array of scores.
The function returns an integer, which is -1 if no word could be read, or the
number of tags read otherwise (possibly 0). corpus_getword is called by
the function fetch_word in io.c. Normally this function uses the
number of tags to assign a class to the word; if there are 0 tags on the word,
it is generally marked as "skipped", meaning that it will be ignored for the
purposes of training and tagging. An exception to this is made for untagged
input corpora. There is already an input option for this case (specified by
option C1, untagged_input); see the code for this option for how to
deal with other untagged formats.
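As a sketch of what a new case might look like, the following standalone function parses a hypothetical one-token-per-line "word/TAG" format, following the same return-value contract as corpus_getword. Tag and Score are simplified stand-ins for the tagger's internal types, and parse_token is an invented name, not a function in the source:

```c
/* Sketch of the parsing logic a new corpus_getword case might need,
 * for an assumed "word/TAG" one-token-per-line format. */
#include <stdio.h>
#include <string.h>

typedef char Tag[32];    /* stand-in: the real Tag is an internal type */
typedef double Score;

/* Split "line" into word text and at most max tags. Returns the number
 * of tags read, 0 if the token is untagged, or -1 if no word was read. */
int parse_token(const char *line, char *text, int textlen,
                int max, Tag *tag, Score *score)
{
    const char *slash;
    size_t wlen;

    if (line[0] == '\0' || line[0] == '\n')
        return -1;                     /* no word could be read */
    slash = strchr(line, '/');
    wlen = slash ? (size_t)(slash - line) : strcspn(line, "\n");
    if (wlen >= (size_t)textlen)
        wlen = textlen - 1;            /* truncate into the buffer */
    memcpy(text, line, wlen);
    text[wlen] = '\0';
    if (!slash || max < 1)
        return 0;                      /* untagged word */
    sscanf(slash + 1, "%31s", tag[0]); /* single tag after the slash */
    score[0] = 1.0;                    /* no score in input: default 1 */
    return 1;
}
```

A real case would be added inside corpus_getword itself, guarded by the new format code, with fetch_word then classifying the word from the returned tag count as described above.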
Output format
If additional formats for the output corpus are to be added, the following
changes should be made.
- Add an extra code for use with the O option. This is done by adding a
line to label.h of the form:
#define xxx (n)
where xxx is some suitable define name, and n is a power of two not
equal to any of the codes already in use. Whenever the code needs to test
whether the option is set, expressions such as
if (OutOpt(xxx)) can be used.
- In io.c, the function output_level must be altered to add
an extra case for the new format. The function takes the following parameters:
FILE *outfile: Pointer to the output file.
Node node: Pointer to node structure from which output is to be made.
BOOL output: TRUE if any output is to be produced, FALSE if the word is
just to be logged (used when only performance statistics or error analyses are
to be collected).
BOOL stats: TRUE if performance statistics are to be collected.
Node *node_out: Return pointer to the next word to be output. This
must be set up in output_level, normally to node->succ.
The function must return TRUE if the correct hypothesis was chosen and FALSE
otherwise. This is only needed if the error analysis option is being used. It
is recommended that the existing version of the function be used as an
example. The definition of the node data structure is in stack.h. See
this file and the existing version of output_level for how to find most
of the information about words and hypotheses which is likely to be needed.
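The power-of-two requirement in both the C and O option codes exists because the codes are OR-ed into a single bit mask and tested individually. A minimal sketch of the pattern follows; only untagged_input is named in the source, while the mask variable and macro bodies here are assumptions, not copies of common.h or label.h:

```c
/* Option codes must be distinct powers of two so that they can be
 * combined into one mask and tested independently. */
#define untagged_input (1)  /* existing code, set by option C1 */
#define my_new_format  (4)  /* hypothetical new code: an unused power of two */

static unsigned input_options = 0;

#define SetInOpt(o) (input_options |= (o))
#define InOpt(o)    ((input_options & (o)) != 0)
```

Choosing a value already in use would make InOpt report the new format whenever the clashing option is set, which is why each new code must take a fresh bit.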
Unknown word handling
When an unknown word (one not listed in the dictionary) is encountered, the
tagger considers all tags which are not marked as being closed classes as
possible hypotheses for it. There are often many such tags, and this has a
negative effect on both the accuracy and the execution speed of the tagger.
Some other taggers have overcome this problem by doing a limited morphological
analysis of the word and predicting a more restricted range of tags where
possible. The program may be changed to include such analysis by altering the
code in stack.c. The function copy_unknown is called whenever an
unknown word is pushed onto the stack, and this is where morphological analysis
should take place. The parameters to the function are:
Lexeme lp: A pointer to the lexeme entry.
char *text: The literal text of the word.
Dictword d: The dictionary entry for the word. This will have been set
up to point to a special entry used for unknown words. Amongst other
information, this entry lists all the non-closed class tags.
The function should return a list of the hypotheses constructed, in
"unscored" format; see copy_from_dict for an example of how to do
this. A consistency error is raised if no hypotheses were created.
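As an illustration of the kind of analysis that could go into copy_unknown, the standalone sketch below guesses a single tag from the word's suffix, including an initial-capital check like the one described earlier for "-ly" and names such as "Billy". The tag names and the function guess_unknown_tag are invented for the example; a real version would build hypotheses in "unscored" format as copy_from_dict does:

```c
#include <ctype.h>
#include <string.h>

/* Guess a tag for an unknown word from its suffix, or return NULL to
 * fall back to the full list of open-class tags. Tag names invented. */
const char *guess_unknown_tag(const char *word)
{
    size_t len = strlen(word);
    if (len > 3 && strcmp(word + len - 3, "ing") == 0)
        return "VVG";   /* present participle */
    if (len > 2 && strcmp(word + len - 2, "ly") == 0
            && !isupper((unsigned char)word[0]))
        return "RR";    /* adverb; capitals blocked for names like "Billy" */
    if (len > 1 && word[len - 1] == 's')
        return "NN2";   /* plural noun: a very rough default */
    return NULL;        /* no guess: use all open-class tags */
}
```

Restricting the hypothesis set this way reduces both the search space and the chance of the tagger choosing an implausible tag for the unknown word.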
Internal limits
There are a number of internal limits on sizes of arrays and buffers, which
may need to be altered.
- MAXWORD: Maximum length of the text of a word read from the dictionary
or corpus. (common.h).
- MAXTAG: Maximum length of the text of a tag, read from a corpus or tag
list file. (common.h).
- MAXFN: Maximum length of a file name. (common.h).
- DICTLEN: Default size of a dictionary when training is being used in label or when dictionaries are being merged in
dmerge. (dmerge.c; mainl.c).
- MaxLine: Maximum length of a line read from various input files.
Defined in a number of source files for different inputs: cmptran.c, dephrase.c, fsm.c, low.c,
parser.c.
Bibliography
Cutting, Doug, Julian Kupiec, Jan Pedersen and Penelope Sibun (1992). A Practical
Part-of-Speech Tagger. Proceedings of 3rd ACL Conference on Applied Natural
Language Processing, Trento, Italy.
Elworthy, David (1994a). Automatic
Error Detection in Part of Speech Tagging, Proceedings of the Conference on
New Methods in Language Processing, UMIST.
Elworthy, David (1994b). Does
Baum-Welch Re-estimation Help Taggers?, Proceedings of 4th ACL
Conference on Applied Natural Language Processing, Stuttgart.
Elworthy, David (1995). Tagset
Design and Inflected Languages, Proceedings of EACL SIGDAT workshop
"From Texts to Tags: Issues in Multilingual Language Analysis",
Dublin.
Garside, Roger, Geoffrey Leech and Geoffrey Sampson (1987). The
Computational Analysis of English. Longman.
Manning, Chris and Hinrich Schütze (1999). Foundations of Statistical
Natural Language Processing, MIT Press. Cambridge, MA.