The "Acquilex" Part-of-speech Tagger

David Elworthy

January 2004

Introduction

The Acquilex tagger is a HMM-based part-of-speech tagger intended both as a practical system for tagging text corpora and as a research tool for investigating how taggers perform under various conditions. It was originally written for a short-term research project which I carried out in the first half of 1993, just after completing my PhD at the University of Cambridge Computer Laboratory. The work was funded as part of the Acquilex project, led by Ted Briscoe. This version is a descendant of the original Acquilex tagger, and has not changed much since 1995. Other versions have been branched off and used by a number of people including Ted Briscoe and John Carroll. One version can be found in the RASP tools (http://www.cogs.susx.ac.uk/lab/nlp/rasp/). Three publications are based on experiments conducted using the tagger: see Elworthy (1994a, 1994b, 1995). Some of these experiments are referred to in the tagging chapter of Manning and Schütze's book Foundations of Statistical Natural Language Processing.

I am releasing the tagger now (January 2004) under an open source licence. These days, it is no big deal to write a HMM tagger, and there are several which are available commercially or otherwise. If you use it, please drop me a line at david'at'friendlymoose.com, and include an acknowledgement in any publications. As noted, the code was written as a research vehicle, and some parts of the code (particularly to do with combining parsing and tagging) are probably not much use. I believe the main part of the code to be free from serious bugs and memory leaks. It is in ANSI C, and has been tested with GCC and Microsoft Visual C++.

There are several programs in the tagger suite, described in the following sections. With a HMM tagger, you must first construct a model containing lexical and transition probabilities, normally using a tagged corpus. The tagger program can be used to do this, as well as its main task of tagging a corpus. I am not including any tagged corpora with the software, to avoid potential problems with copyright or licence infringement. If you need to get hold of a tagged corpus, I recommend visiting the Linguistic Data Consortium or the British National Corpus. The program can tag a corpus which has already been annotated with the correct tags, and collect statistics for evaluation purposes, or it can tag a corpus which contains no tag information. There are a number of secondary programs for examining and manipulating the tagger model files. I admit that the description of how to use the programs is not particularly clear; experiment and follow the examples is my advice.

Licence

The tagger is released under the GNU General Public Licence. Personally, I have some reservations about the GPL and the GNU Project in general, but I think it is the most appropriate licence given that the software was originally written in an academic research environment. The licensing terms do not apply to any other variants of the Acquilex tagger prior to this one, released in January 2004. Be sure to follow the terms of the licence agreement, or you might just open the door one day to find Richard Stallman waiting to beat you with a big stick with nails sticking out of it. The licence terms can be found in a file called LICENSE. You must include this file if you distribute the tagger further. If the licence file is missing for some reason, you can obtain a copy here or here. If you need a tagger that can be used commercially, I may be releasing a separate one under a less restrictive open source licence; check my website to see.

User Guide to the Tagger (label)

The tagger is used for training the statistical model and for tagging corpora on the basis of a model. There is a large number of options for selecting exactly what the tagger will do, and for specifying variations on the input and output formats and the basic algorithms. Options come in two sorts: boolean flags, and options which take arguments. Where there is an argument, it may immediately follow the option letter, or may be separated from it by white space. Arguments are terminated by white space on the command line. Inconsistent collections of options are reported. Options are case sensitive, and may be preceded by "-", which is ignored. Note that there are many options for experimental variants, and the descriptions below may not be particularly clear; in some cases, I now have only the haziest idea of what I did when I wrote the code 10 years ago. For simple use, see the examples.

The tagger is invoked with a command line of the form:
    label <in-corpus> <options>
where in-corpus is the name of the input corpus, and options specifies options as follows.

Tagging algorithm

These options specify the basic action of the tagger.

(Options g and G are so called because I originally described them as Good-Turing correction, before I knew what it really was.)

Input and output data files

Output data files are produced from training or Baum-Welch re-estimation. Input data files are needed unless training is being used.

Tag list files

Tag list files are used to list the valid part of speech tags and specify their properties. More details appear below. Options associated with tag list files are:

Input corpus

Input corpus formats are described in more detail in a later section.

Output corpus

Output corpus formats are described in more detail in a later section.

Unknown words

The performance on unknown words can be improved by specifying rules for selecting tags based on the form of the word. Normally all non-closed class tags are used. The unknown word rules use very simple surface analysis of the word to reduce the range of possible tags. See below for the file format and how the rules work.

Numbers

The tagger performance is improved if all numbers are treated as being instances of the same "word". If neither of the following options is specified, each distinct number is treated as a separate word; for example, every number will have a separate entry in the dictionary during training. With these options, numbers are mapped to a special entry in the tagger dictionary. These options specify how a number will be defined.

Initialisation

These override the values in the dictionary and/or transitions, and are intended for use with Baum-Welch re-estimation where the initial data is not reliable and hence may need smoothing. These options were used for the experiments reported in the ANLP paper and summarised in Manning & Schütze.

Phrasal tagging

The phrasal tagging code was intended for use in some experiments with recognizing phrases and incorporating them into the tagging process. The code here is unlikely to be of much use. The phrase recognition can be done using finite state machines or context free grammars.

Finite state machines

These options require the programs to be built with the define Use_FSM.

Parser

These options require the programs to be built with the define Use_Parser.

General phrasal options

Other options

Example command lines

Here are some examples to illustrate typical options to label.

 

Files

The tagger requires a corpus file and a tag list. Corpus files may appear in a variety of formats, selected by the input option C, and are used both to provide the training data to the tagger and for the text to be tagged. The input corpus must be tagged for training using the forward-backward algorithm; an untagged corpus can be used with Baum-Welch re-estimation. A tagged corpus is also required if you want to evaluate the accuracy of the tagger. The tag list file specifies all of the tags to be used. The main output is a tagged corpus, which may appear in a variety of formats, selected by option O. A number of other input and output files are used with certain options. The format of these files is described in detail in the following sections.

Input files

Corpus files

Several input corpus formats are available.

In all the formats, the symbol ^ is treated as a sentence boundary marker. It is always unambiguous, i.e. has a single hypothesised tag.

Tagged LOB format

For tagged LOB format, white space separates words. Words may be followed by an underscore separator and a tag, for example editorial_NN. A further underscore may optionally follow the tag. If a word has no following underscore, or if the tag is empty, it is skipped. The tag is used in training and in testing performance. This is the default format, and no C option need be specified for it.
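As a rough illustration of the word_TAG convention described above, the following C sketch splits one white-space-delimited token into its word and tag. The function name split_lob_token, the buffer handling and the choice of the first underscore as the separator are assumptions made for this example; they are not taken from the tagger source.

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical helper, not from the tagger source: split one token of the
 * form word_TAG (optionally word_TAG_) into its word and tag parts.
 * Returns 1 on success, 0 if the token should be skipped (no underscore,
 * empty tag, or a part too long for its buffer).
 * Assumption: the first underscore in the token is the separator. */
int split_lob_token(const char *token,
                    char *word, size_t word_size,
                    char *tag, size_t tag_size)
{
    const char *sep = strchr(token, '_');
    size_t word_len, tag_len;

    if (sep == NULL)
        return 0;                       /* no separator: skip the word */

    word_len = (size_t)(sep - token);
    tag_len = strlen(sep + 1);
    if (tag_len > 0 && sep[tag_len] == '_')
        tag_len--;                      /* drop the optional trailing underscore */
    if (tag_len == 0)
        return 0;                       /* empty tag: skip the word */
    if (word_len + 1 > word_size || tag_len + 1 > tag_size)
        return 0;                       /* would not fit in the buffers */

    memcpy(word, token, word_len);
    word[word_len] = '\0';
    memcpy(tag, sep + 1, tag_len);
    tag[tag_len] = '\0';
    return 1;
}
```

With this sketch, both editorial_NN and editorial_NN_ give the word "editorial" and the tag "NN", while a bare editorial is reported as a token to skip.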

Untagged format

In untagged format, words are simply separated by white space characters (tabs, spaces and newlines). Use option C1 for this format.

Tagged LOB with ditto tags

With option C2, the format is the same as tagged LOB, except that if the tag is a "ditto tag" (used to mark idioms; see Garside et al. (1987)), then the word is skipped. This is the recommended format with the tagged LOB corpus.

Penn Treebank format

In Penn Treebank format, option C4, tags are separated from words by "/". Alternate tags may be specified, separated by -, although the program ignores all but the first. The character "/" may appear in a word, escaped by \. Untagged words are skipped. Header lines, starting *x* are skipped, and lines starting === are treated as sentence boundaries. Words are separated by white space or ":". This format covers only the texts in the "postexts" part of the treebank.
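The escaping rule for "/" is the fiddly part of this format. The sketch below shows one way to locate the separator between word and tag while passing over any "\/" inside the word. The function find_tag_separator is hypothetical; it does not deal with alternate tags, header lines or sentence boundary lines, and it is not the tagger's own code.

```c
#include <stddef.h>

/* Hypothetical helper: find the '/' that separates a word from its tag,
 * ignoring any character escaped with a backslash (so "\/" stays in the
 * word). Returns a pointer to the separator, or NULL for an untagged
 * token, which the format description says is skipped. */
const char *find_tag_separator(const char *token)
{
    const char *p;

    for (p = token; *p != '\0'; p++) {
        if (*p == '\\' && p[1] != '\0') {
            p++;                /* step over the escaped character */
            continue;
        }
        if (*p == '/')
            return p;
    }
    return NULL;
}
```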

Lancpars format

Lancpars format is the same as tagged LOB, except that the input may also contain phrase brackets, which have the form "[ tag" and " tag]". Transitions to the first of these and from the second are treated as transitions to and from the tag. In addition, an internal tag is constructed, representing transitions from the start of a phrase to the first object within it, and the corresponding transition at the end of a phrase. Brackets without tags are skipped. The tagged format with scores is not implemented at present.

Tag list files

Tag list files (also called tag mapping files) list all of the tags which will be used in a corpus, one to a line. The line may optionally start with a hyphen ("-"), which indicates that the tag is a closed class. Closed class tags are not considered as possible hypotheses for unknown words during tagging; all other tags are. Blank lines are skipped. White space at the start of a line, after the hyphen and at the end of a line is ignored. Note that this means that lines which contain a tag followed by white space and more text will be (incorrectly) treated as a single tag, i.e. accidental white space in the middle of a tag must be avoided. The tag list file and the dictionary and transitions files must match up: if the dictionary and transitions files are built with some tag list, the same tag list must be used for tagging with that dictionary and transitions. The program makes some attempt at a consistency check, but it cannot always be guaranteed. An additional tag ^ is used as an "anchor": it is used to mark sentence boundaries. If the program was built with either phrasal option (FSMs or Parser), then the tag list may also include lines which start with a plus ("+"), indicating that the tag is, or may be, phrasal. Phrasal tags are treated as being closed class tags. In addition, an extra internal tag is constructed for each phrasal tag, used for transitions from the phrase tag to the first object in a phrase, and for the last object in a phrase to the phrase tag. Spaces at the start of a line are ignored, and also have the effect of disabling the special interpretation of "-" and "+", in case these symbols need to be used as ordinary tags.
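The line conventions above can be sketched in C as follows. The names parse_tag_line and tag_kind are invented for this illustration, and the sketch accepts "+" lines unconditionally, whereas the tagger only allows them when built with a phrasal option.

```c
#include <ctype.h>
#include <stddef.h>
#include <string.h>

enum tag_kind { TAG_OPEN, TAG_CLOSED, TAG_PHRASAL };

/* Hypothetical parser for one line of a tag list file.
 * Copies the tag into `tag` and stores its kind in `kind`.
 * Returns 1 if a tag was read, 0 for a blank line or an overlong tag. */
int parse_tag_line(const char *line, char *tag, size_t tag_size,
                   enum tag_kind *kind)
{
    const char *start = line;
    size_t len;

    *kind = TAG_OPEN;
    if (*start == '-') {            /* marker only counts in the first column */
        *kind = TAG_CLOSED;
        start++;
    } else if (*start == '+') {
        *kind = TAG_PHRASAL;
        start++;
    }
    while (*start != '\0' && isspace((unsigned char)*start))
        start++;                    /* white space at line start or after marker */

    len = strlen(start);
    while (len > 0 && isspace((unsigned char)start[len - 1]))
        len--;                      /* trailing white space, including newline */

    if (len == 0 || len + 1 > tag_size)
        return 0;
    memcpy(tag, start, len);
    tag[len] = '\0';
    return 1;
}
```

Because the marker is only recognised in the first column, a line consisting of a space followed by "-" yields an ordinary open-class tag named "-", which is the escape mechanism the paragraph above describes.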

Output corpus files

The major output file is the corpus output. A variety of formats is possible, selected by option O. Option S also affects what appears in the output. The normal output format -- called verticalised format -- has one word per line. Lines have the form
XYZ word ctag tags
where: 

Skipped words are output with no flags or tag information.

Data files

The tagger makes use of two kinds of data file: dictionaries, which list all known words with their tags and frequencies, and transitions files, which contain the tag-to-tag transition frequencies, the initial tag frequencies, and the total tag frequencies. The files are created by the tagger, and may be merged using the program dmerge. In general, there is no need to know the format of these files, and they should not be altered by hand.
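For background, relative-frequency estimation is one standard way of turning such frequencies into HMM transition probabilities. This manual does not say exactly how the tagger normalises or smooths its counts, so the sketch below is generic and is not a description of the tagger's internals; the function name and the flat array layout are assumptions.

```c
#include <stddef.h>

/* Generic relative-frequency estimate, not the tagger's own routine.
 * freq and prob are n-by-n matrices stored row by row, so element (i, j)
 * is at index i * n + j. total[i] is the total frequency of tag i.
 * prob(i, j) = freq(i, j) / total(i); rows with a zero total are set to 0. */
void normalise_transitions(const double *freq, const double *total,
                           double *prob, size_t n)
{
    size_t i, j;

    for (i = 0; i < n; i++) {
        for (j = 0; j < n; j++) {
            if (total[i] > 0.0)
                prob[i * n + j] = freq[i * n + j] / total[i];
            else
                prob[i * n + j] = 0.0;
        }
    }
}
```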

Dictionary files

Dictionaries list words with the frequencies of their possible tags. They are initially built from a tagged corpus using label, and may also result from Baum-Welch re-estimation in label, and from merging dictionaries with dmerge. The dictionary format is also used for exclusion lists. Format: 

By convention, dictionary file names end in .lex.

Transitions files

Transitions files specify the frequency of occurrence of tags, tag pairs and of tags as the start of a sequence. Transitions can be built from a training corpus, or can result from Baum-Welch re-estimation, and can also be merged with dmerge. The transitions file is kept in binary format, and must be read and written as such on systems that distinguish binary and text files. Format:

By convention, transitions file names end in .trn. Note that different hardware architectures and operating systems write binary files in different ways, and if a transitions file has been created on one, it may not work on another.

Tag Inference files

Option "i" allows tag inference rules to be used. The file consists of a number of lines of the form <threshold> <tag>*. The rule is applied to any dictionary entry, for which the total score (i.e. the total score of its tags) relative to the total score of all words in the dictionary falls below the threshold, and for which all of the tags appear in the list. The effect of the rule is to add all tags which appear in the rule to that dictionary entry, giving them a score equal to the total score of the word divided by the number of tags specified in the rule. The aim in doing so is to add extra tags to low frequency words, where there may not have been sufficiently many occurrences in the training corpus to provide reliable information. The same word may be adjusted by more than one inference rule. Inference happens after reading the dictionary and before normalisation; the altered dictionary is not written out.
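The rule application described above can be sketched as follows. The struct layout, the use of integer tag codes and the function name apply_inference are all invented for this example. The sketch also assumes that tags already present on the entry keep their existing score and only the newly added tags receive the share; the description above leaves that point open.

```c
#include <stddef.h>

#define MAX_ENTRY_TAGS 16   /* illustrative limit, not one of the tagger's */

struct entry {
    int    tags[MAX_ENTRY_TAGS];    /* tag codes on this dictionary entry */
    double scores[MAX_ENTRY_TAGS];  /* score of each tag */
    size_t ntags;
};

static int in_list(int tag, const int *list, size_t n)
{
    size_t i;

    for (i = 0; i < n; i++)
        if (list[i] == tag)
            return 1;
    return 0;
}

/* Apply one inference rule to one entry. dict_total is the total score of
 * all words in the dictionary. Returns 1 if the rule fired, 0 otherwise. */
int apply_inference(struct entry *e, double dict_total, double threshold,
                    const int *rule_tags, size_t nrule)
{
    double total = 0.0, share;
    size_t i;

    if (nrule == 0 || dict_total <= 0.0)
        return 0;
    for (i = 0; i < e->ntags; i++) {
        if (!in_list(e->tags[i], rule_tags, nrule))
            return 0;               /* entry has a tag outside the rule */
        total += e->scores[i];
    }
    if (total / dict_total >= threshold)
        return 0;                   /* word is not below the threshold */

    share = total / (double)nrule;  /* word total divided by tags in rule */
    for (i = 0; i < nrule && e->ntags < MAX_ENTRY_TAGS; i++) {
        if (!in_list(rule_tags[i], e->tags, e->ntags)) {
            e->tags[e->ntags] = rule_tags[i];
            e->scores[e->ntags] = share;
            e->ntags++;
        }
    }
    return 1;
}
```

For instance, an entry with a single tag scoring 2 in a dictionary whose total score is 1000 has a relative score of 0.002. A rule with threshold 0.01 listing that tag and one other would fire, and the second tag would be added with a score of 2 / 2 = 1.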

Unknown word rules

In the initial release of the Acquilex tagger, unknown words were treated by hypothesising all non-closed class tags. It is possible to do better with a simple-minded analysis of the surface form of the word. It is not as sophisticated as morphological analysis, but in general does help to bring down the number of tags for unknown words and so improve the accuracy. The action on unknown words is specified by a number of rules. Each rule consists of a pattern to be matched and an action, which is either to restrict the hypotheses to certain tags or to eliminate certain tags. The rules are applied in order. Generally, they are mutually exclusive, so that if one rule fires, then the remaining ones are not checked. However, it is also possible to specify rules which may fire and then allow further rules to be tested. The rules are supplied to the tagger in a file with the following format. Each rule must appear on a single line. Rules have the following format:
[<id>][:<next>][~]<pattern> [~]<tags>

<tags> is a list of tags separated by spaces. The rule is read as "if the word meets the pattern, then restrict the tags to the given ones". ~ before the pattern changes this to "if the pattern does not match ...", and ~ before the tags means "... exclude the listed tags from consideration". Finally, each rule may have a number <id>. If :<next> appears and the rule fires, the next one tested is the rule with an id value of next, so providing a crude way of grouping rules. If no such continuation is specified, and the current rule does apply, then no further ones are tested. Continuations must appear after the rule they are referenced from. Patterns are either suffixes, prefixes or one of a range of special patterns. Suffix patterns have the form -x|y where x and y are strings and should be in lower case. The pattern matches a word ending in y, provided the word does not end in xy. x or y (but not both) may be empty. Prefix patterns have the form y|x-, which matches a word beginning with y but not with yx. The special patterns are single upper case letters: I to match words with an initial capital, A for ones all in capitals and C for ones with a capital letter at any position in the word. For rules in which the tags are not negated, each tag may be followed by a floating point number, which is applied as a factor to the base score of the tag. This allows biasing towards certain tags. If more than one rule applies, the factors accumulate, so they must be used with some caution. If the rules result in all hypotheses being excluded, then the whole open-class list is restored, and a warning message is issued. Note that the syntax checking on the rules file is fairly unsophisticated.

Example: The following rules apply to the Penn treebank tagset.

:1 I JJ NN NP NPS SYM
-l|ly RB
1 -er ~JJS
-est ~JJR
-ing ~VB VBD VBN VBP VBZ
-u|s ~VB VBG VBN VBP VBZ NN NP
-ed ~VB VBG VBZ
:2 ~-s ~NNS NPS
2 ~-ing ~VBG

The third rule says that if a word ends with "er" then it is not a superlative adjective, while the second says that words ending with "ly" but not with "lly" are adverbs. The initial capital rule blocks matching of "ly", for names like "Billy".
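The suffix test itself is simple enough to show in a few lines of C. This is a sketch of the matching condition only (word ends in y but not in xy); the function names are invented, case handling is left out, and the tagger's real implementation may differ.

```c
#include <string.h>

/* Returns 1 if word ends with suffix (an empty suffix always matches). */
static int ends_with(const char *word, const char *suffix)
{
    size_t wl = strlen(word), sl = strlen(suffix);

    return sl <= wl && strcmp(word + wl - sl, suffix) == 0;
}

/* Hypothetical matcher for a suffix pattern -x|y: the word must end in y
 * and must not end in xy. Pass "" for x to express a plain pattern such
 * as -ing. Returns 1 on a match, 0 otherwise. */
int suffix_pattern_matches(const char *word, const char *x, const char *y)
{
    char xy[64];

    if (!ends_with(word, y))
        return 0;
    if (x[0] == '\0')
        return 1;                   /* nothing to exclude */
    if (strlen(x) + strlen(y) + 1 > sizeof xy)
        return 0;                   /* pattern too long for this sketch */
    strcpy(xy, x);
    strcat(xy, y);
    return !ends_with(word, xy);
}
```

Under the rule -l|ly from the example, "quickly" matches because it ends in "ly" but not "lly", whereas "really" does not match.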

Parser input files

When the parser is being used, a grammar file must be specified (using option "q"). Each line of the file has the form:
<score> <phrase-tag> -> <tag>*
score is the factor applied to the score of the enclosed phrase when a rule matched, phrase-tag the tag assigned to the phrase, and the sequence of tags after the arrow is the body of the rule.
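A line in this format could be read along the following lines. The struct, the size limits and the function name parse_grammar_rule are assumptions for illustration, as is the requirement that every token, including "->", is separated by white space. The tag names in the usage note are likewise only examples.

```c
#include <stdlib.h>
#include <string.h>

#define MAX_BODY 16     /* illustrative limits, not the tagger's */
#define MAX_TAG  32

struct grammar_rule {
    double score;                       /* factor applied to the phrase score */
    char   phrase_tag[MAX_TAG];         /* tag assigned to the phrase */
    char   body[MAX_BODY][MAX_TAG];     /* tags after the arrow */
    int    nbody;
};

/* Hypothetical parser for "<score> <phrase-tag> -> <tag>*".
 * Returns 1 on success, 0 on a format error. */
int parse_grammar_rule(const char *line, struct grammar_rule *r)
{
    char buf[256];
    char *tok, *end;

    if (strlen(line) >= sizeof buf)
        return 0;
    strcpy(buf, line);                  /* strtok modifies its input */

    tok = strtok(buf, " \t\n");
    if (tok == NULL)
        return 0;
    r->score = strtod(tok, &end);
    if (end == tok || *end != '\0')
        return 0;                       /* score is not a number */

    tok = strtok(NULL, " \t\n");
    if (tok == NULL || strlen(tok) >= MAX_TAG)
        return 0;
    strcpy(r->phrase_tag, tok);

    tok = strtok(NULL, " \t\n");
    if (tok == NULL || strcmp(tok, "->") != 0)
        return 0;                       /* arrow missing */

    r->nbody = 0;
    while ((tok = strtok(NULL, " \t\n")) != NULL) {
        if (r->nbody == MAX_BODY || strlen(tok) >= MAX_TAG)
            return 0;
        strcpy(r->body[r->nbody++], tok);
    }
    return 1;
}
```

Given the line "0.5 NP -> DT NN", this fills in a score of 0.5, the phrase tag NP and a two-tag body. An empty body is accepted by the sketch, whereas the tagger warns about empty rules.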

Finite state machine files

When FSMs are being used, a FSM definition file must be specified (using option "e"). The file consists of one or more FSM definitions using the following syntax:
<fsm> ::= <name> <state>* end
<state> ::= <id> <item>* ;
<item> ::= [<tags>] _ <action>
<action> ::= <id> | <tagSc> | back <tagSc>
<tags> ::= <tag> | <tag> <tags>
<tagSc> ::= <tag> | <tag>_<score>

Spaces, tabs and newlines act as separators. FSMs must end with "end". The list of items in a state must end with ";". To allow ";" as a tag, escape it with \. There must be a space before the ";" and the "_" in an item. FSM names and state ids are any text which is not a tag. The tags to the left of the "_" cause the action to take place if any one or more of them are recognised. All of the tags that match are created as hypotheses within a phrase. An empty tag list matches everything not specified in another tag list on the item. FSMs are non-deterministic: all actions that can be carried out are. The actions are: 

 

Secondary Programs

There is a collection of secondary programs for examination and manipulation of tagger files, and for miscellaneous operations which were found to be useful in research based on the tagger. Error checking in some of these programs is minimal. You may need to modify the makefile in order to build some of the minor programs.

dmerge

dmerge merges dictionaries and transition matrices. It has two main purposes: for combining the data files constructed by training from two or more corpora, without having to retrain on the entire corpus, and for merging a dictionary of "unknown" words into an existing dictionary. Usage:
dmerge <out> <options>
out is a root name used for the output dictionary and/or transitions. Extensions are appended as for options "r" and "R" of label. The options are:

If "d" is specified, then the .lex files are read, and if "t" the .trn ones. "d" and "t" may be specified together. The tag list file must be the same one for all of the dictionary and transitions files involved.

dtinfo

dtinfo reads a dictionary and transitions file and prints information about tag distributions. Usage:
dtinfo <root> <map>
root is the root name of the files. Extensions are added as for options "r" and "R" of label. map is the name of the tag list file, defaulting to "tags.map". The output consists of the distribution of words against number of tags on the word, and the distribution of tags against the number of transitions from and to a tag.

exdict

exdict allows examination of a dictionary. Words are read from standard input and their tags with basic and normalised scores are printed out. Usage:
exdict <dictionary> <trans>

readtr

readtr reads a transitions file and prints it in a textual form. Usage:
readtr <trans> [<map>]

trans is the name of the transitions file. The default extension is not added by readtr. map is the name of the tag list file, defaulting to "tags.map", if not specified. The output consists of the normalisation value (gamma) for each tag, and all non-zero transitions from each tag, both un-normalised and normalised.

outcomp

outcomp takes the output of the labeller and tests the accuracy of phrasal labelling. The output must have been produced from a tagged input corpus, with one of the phrasal tagging options in use. It reports the number of cases where there was an exact match, i.e. the tagger correctly predicted a phrase, where the tagger added an extra phrase, and where the tagger failed to find a phrase present in the input corpus. The algorithm is given in the header of outcomp.c. It will not spot cases where (say), the open bracket of a phrase is in the right place, but the close bracket is not: such cases will add one to each of the "extra" and "missed" counts.

Usage: reads from the standard input (stdin), producing the totals on the standard output (stdout); i.e. use as a filter.
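The three counts can be illustrated by comparing two lists of phrase spans. This is not the algorithm from the header of outcomp.c, which works on the labeller output directly; it is a simplified sketch with invented names that treats a phrase as a pair of word positions and ignores phrase tags.

```c
#include <stddef.h>

struct span { int start, end; };            /* positions of the two brackets */
struct counts { int exact, extra, missed; };

static int has_span(const struct span *list, size_t n, struct span s)
{
    size_t i;

    for (i = 0; i < n; i++)
        if (list[i].start == s.start && list[i].end == s.end)
            return 1;
    return 0;
}

/* Compare predicted phrases against those in the input corpus.
 * A span counts as exact only if both ends agree. */
struct counts compare_spans(const struct span *gold, size_t ngold,
                            const struct span *pred, size_t npred)
{
    struct counts c = { 0, 0, 0 };
    size_t i;

    for (i = 0; i < npred; i++) {
        if (has_span(gold, ngold, pred[i]))
            c.exact++;
        else
            c.extra++;              /* predicted but not in the corpus */
    }
    for (i = 0; i < ngold; i++)
        if (!has_span(pred, npred, gold[i]))
            c.missed++;             /* in the corpus but not predicted */
    return c;
}
```

A predicted phrase whose open bracket is right but whose close bracket is wrong shows up here as one extra and one missed phrase, matching the behaviour described above.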

rules

rules builds phrase structure rules and finite state machines from a corpus in Lancpars format. Rules are written to stdout, and FSM definitions to stderr. Usage:
rules <corpus> [<map>]

cmptran

cmptran is a program for comparing tag sequence probabilities. It takes lines consisting of three tags from standard input and works out the probability of the sequence, and the ratio of this one to the last (on the second of each pair). Usage:
cmptran <trans>

dephrase

dephrase takes a Lancpars format corpus and removes all phrase tags except those from a specified set, given by the first argument on the command line. The corpus is read from standard input and the result written to standard output. Usage: 
dephrase <tag-file>

 

Error messages

There is a large collection of error messages relating to illegal combinations of options, which is not listed here. Of the remaining messages, some indicate errors in the input data, some a system failure (such as running out of memory), and some that an internal limit in the program has been exceeded. The last class requires changing the limit and recompiling. In addition, there are a few consistency check messages, which should not occur unless there is an error in the program. Error recovery is quite poor at present, and many errors will simply cause the program to abort (via the function get_out). Memory errors can sometimes be cured (especially when using the phrasal packages) by modifying the corpus to include anchors (^) between sentences.

Array overflow (MaxTags)
Internal limit exceeded: recompile with larger value for MaxTags. (outcomp)

Back action on initial state
Bad score
Duplicate FSM name
Duplicate state ID
Duplicate tag ignored
Jump destination missing
Tag required
Token coincides with a tag
Unexpected end of file
Unexpected end of item definition
"end" of FSM missing
Various errors in the format of a FSM definition. Note that FSM syntax checking is not very robust.

Bad reduced tags mapping line: ...
The line does not have the required format. See tagger option M for specification.

Bad total
No hypotheses on word
A consistency check has failed. The latter error may occur if it was not possible to assign any hypotheses to unknown words, for example, if all tag classes were marked as closed.

Beta pass consistency fail
Consistency check failed (choose_hyps ...)
Consistency error: PhraseStart/PhraseEnd seen without lancpars
Consistency fail ...
End of rule consistency check failed at ...
Warning: consistency check fails (non-skip with 0 hypotheses)
analyse: consistency error
These errors should never occur in the standard version; they indicate a bug in the program. (label, outcomp)

Buffer overflow at "..."
Buffer overflow (fetch_word)
Score buffer overflow at "..."
Tag buffer overflow at "..."
Word buffer overflow at "..."
Recompile with buffer size increased.

Cannot open ...
The file could not be opened. Check it exists and was specified correctly.

Corpus line buffer overflow
Needs a larger internal buffer -- recompile with larger MaxLine.

Dictionary full
File is too big for internal dictionary array. Try using option A to specify a larger dictionary.

Dictionary size is missing
Error reading tags from dictionary
Map file indicates ... tags, array has ... entries
Wrong file code
Some fault in the format of a file supplied to the program. Check the right file is being used. Some of these errors may result from using the wrong tag list file.

Duplicate entry ... in ...
Reported in a number of cases where an object must be unique. The message will contain additional information to help identify the problem.

Empty prefix pattern at line ...
Empty suffix pattern at line ...
The indicated pattern was expected but not found in an unknown words rule.

Error reading word from dictionary at word ...
Trans write failure
Trans read failure
Input/Output failures. Check for system problems, e.g. that the file is not corrupt when reading, and that there is enough space when writing.

Failed to read a tag
No tag was found in a file when one was expected.

File is too big for dictionary
The size of the dictionary being read is greater than the one specified using the "A" option.

Inference rule with no tags
Self explanatory.

Line overflows buffer
Line is too long when reading FSM definition or parse rules: recompile with larger MaxLine.

Missing suffix pattern at line ...
Missing prefix pattern at line ...
The indicated pattern was expected but not found in an unknown words rule.

Missing threshold in inference rule
Self explanatory.

No valid options: need d or t or both
Self explanatory. (dmerge)

Out of memory creating ...
The program needs more memory for the given data. There is no cure if more memory cannot be made available.

Phrase start "..." does not match phrase end "..."
Mismatch between phrase brackets. (rules)

Root name is too long
A buffer in the program is not large enough; either find some other way of specifying the file name or change the source.

Rule has wrong format at line ...
General syntax error in unknown words rule.

Too few arguments
Check command line had the right format. (dmerge)

Too many tags
There is a word in a tagged input corpus with more than the allowed number of tags (1 in the current version).

Too many tags in inference rule
The internal array for inference rules is not large enough. Recompile, adjusting MaxTags in io.c.

Unexpected close bracket ...
Unmatched bracket ...
The given close bracket did not have a matching open bracket. (outcomp)

Unexpected end of file in rules
Self explanatory.

Unexpected phrase end in "..."
End of phrase bracket found with no preceding start of phrase bracket. (dephrase)

Unexpected tag separator in "..."
Tag separator found without a corresponding word. (dephrase)

Unknown tag
The specified tag is not given in the tag list file.

Unknown word
Unknown word seen when using a wordlist: add it to the dictionary.

Warning: terminal missing
Warning: "->" missing
Warning: empty rule
Various errors in the format of a parse rule.

... not found
The word on which information was requested is not in the dictionary. (exdict)

 

Programmer's Details

How to build the programs

The source code is supplied with a makefile which works with GNU make on Unix and Cygwin. To make any of the programs, use the command make program-name. To make them all, type either make all or simply make. See the makefile for details on compiler and linker options.

To compile on Windows, use the command nmake -f winmake. This has been tested only with Microsoft Visual Studio version 6. You may first have to set up environment variables by executing the batch file VCVARS32.BAT, usually found in c:\Program Files\Microsoft Visual Studio\VC98\Bin.

To include the FSM and/or phrasal parsing code, change makefile to define either Use_FSM or Use_Parser.

Source files

The source files are:

analyse.c: Functions for analysis output option.
cmptran.c: Top level of cmptran.
common.c: Functions common to whole of tagger.
dephrase.c: Top level of dephrase.
diction.c: Dictionary processing functions.
dmerge.c: Top level of dmerge.
dtinfo.c: Top level of dtinfo.
exdict.c: Top level of exdict.
fsm.c: Finite state machine package.
io.c: Higher level input/output functions for tagger.
label.c: Main tagging functions.
list.c: Low level linked list processing.
low.c: Low level input/output functions.
mainl.c: Top level of label.
map.c: Tag conversion functions.
outcomp.c: Source of outcomp.
parser.c: Chart parser package.
phrase.c: Common phrase handling code.
readtr.c: Top level of readtr.
rules.c: Top level of rules.
stack.c: Functions for maintaining the stack.
trans.c: Transition array processing functions.

The header files are: analyse.h, common.h, diction.h, fsm.h, label.h, list.h, low.h, map.h, parser.h, phrase.h, stack.h, and trans.h. The files correspond roughly to the ".c" files with the same names. See the source code for detailed documentation. Generally, external data structures and macros are documented in header files. Functions are given a brief description in the header files, with more details in the source file.

How to change the input and output formats

Input format

The low level input is handled by functions in the files low.c and io.c. If additional formats for the input corpus are to be added, the following changes should be made.

 

Output format

If additional formats for the output corpus are to be added, the following changes should be made.

 

Unknown word handling

When an unknown word (one not listed in the dictionary) is encountered, the tagger considers all tags which are not marked as being closed classes as possible hypotheses for it. There are often many such tags, and this has a negative effect on both the accuracy and the execution speed of the tagger. Some other taggers have overcome this problem by doing a limited morphological analysis of the word and predicting a more restricted range of tags where possible. The program may be changed to include such analysis by altering the code in stack.c. The function copy_unknown is called whenever an unknown word is pushed onto the stack, and this is where morphological analysis should take place. The parameters to the function are:

Lexeme lp: A pointer to the lexeme entry.
char *text: The literal text of the word.
Dictword d: The dictionary entry for the word. This will have been set up to point to a special entry used for unknown words. Amongst other information, this entry lists all the non-closed class tags.

The function should return a list of the hypotheses constructed, in "unscored" format; see copy_from_dict for an example of how to do this. A consistency error is raised if no hypotheses were created.

Internal limits

There are a number of internal limits on sizes of arrays and buffers, which may need to be altered.

 

Bibliography

Cutting, Doug, Julian Kupiec, Jan Pedersen and Penelope Sibun (1992). A Practical Part of Speech Tagger, Proceedings of 3rd ACL Conference on Applied Natural Language Processing, Trento, Italy.

Elworthy, David (1994a). Automatic Error Detection in Part of Speech Tagging, Proceedings on New Methods in Language Processing, UMIST.

Elworthy, David (1994b). Does Baum-Welch Re-estimation Help Taggers?, Proceedings of 4th ACL Conference on Applied Natural Language Processing, Stuttgart.

Elworthy, David (1995). Tagset Design and Inflected Languages, Proceedings of EACL SIGDAT workshop "From Texts to Tags: Issues in Multilingual Language Analysis", Dublin.

Garside, Roger, Geoffrey Leech and Geoffrey Sampson (1987). The Computational Analysis of English. Longman.

Manning, Chris and Hinrich Schütze (1999). Foundations of Statistical Natural Language Processing, MIT Press. Cambridge, MA.