TnT -- Statistical Part-of-Speech Tagging

Thorsten Brants

Universität des Saarlandes
Computational Linguistics
P.O.Box 151150, D-66041 Saarbrücken, Germany
thorsten at brants dot net

What is TnT?

TnT, short for Trigrams'n'Tags, is a very efficient statistical part-of-speech tagger that is trainable on different languages and virtually any tagset. The component for parameter generation trains on tagged corpora. The system incorporates several methods of smoothing and of handling unknown words.

TnT is not optimized for a particular language. Instead, it is optimized for training on a large variety of corpora. Adapting the tagger to a new language, new domain, or new tagset is very easy. Additionally, TnT is optimized for speed.

The tagger is an implementation of the Viterbi algorithm for second-order Markov models. The main smoothing paradigm is linear interpolation; the respective weights are determined by deleted interpolation. Unknown words are handled by a suffix trie and successive abstraction.
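
In outline, the smoothed probability of a tag given the two preceding tags is a weighted combination of unigram, bigram, and trigram relative frequencies. A minimal sketch of the interpolation named above, where \hat{P} denotes relative frequencies from the training corpus and the lambda notation is mine:

    P(t_3 \mid t_1, t_2) = \lambda_1 \hat{P}(t_3)
                         + \lambda_2 \hat{P}(t_3 \mid t_2)
                         + \lambda_3 \hat{P}(t_3 \mid t_1, t_2),
    \qquad \lambda_1 + \lambda_2 + \lambda_3 = 1

The Viterbi search then selects the tag sequence that maximizes the product of these contextual probabilities and the lexical probabilities P(w_i | t_i), in the usual hidden-Markov-model fashion.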

How Do I Obtain TnT?

TnT is subject to a license agreement that is free of charge for non-commercial research purposes. Please see the license agreement, fill in the form, and fax it to +1-815-846-0652 (in the United States), or scan it and email it to me; I will then send you the details for downloading the program, the parameter files (language models), and the documentation.

Running TnT

TnT comes with two language models, one for German, and one for English. The German model is trained on the Saarbrücker German newspaper corpus using the Stuttgart-Tübingen-Tagset. The English model is trained on the Susanne Corpus. Additionally, there is a pre-compiled model trained on the Penn Treebank. Due to copyright issues, you need a copy of the corpus in order to obtain the model.

TnT can be applied directly using one of these three language models. The input file contains one token per line. In the basic mode, the tagger adds a second column to each line, containing the tag for the word. Optionally, the tagger emits alternative tags for each token, together with a probability distribution. Example:

                 Basic         Optional
Input            Output        Extended Output
----------------+------------+--------------------------------------------------------------------
Der              ART         | ART     1.000000e+00
Mandolinen-Club  NN      *   | NN      1.000000e+00    *
Falkenstein      NE      *   | NE      8.001280e-01    NN      1.998720e-01    *
und              KON         | KON     1.000000e+00
der              ART         | ART     1.000000e+00
Frauenchor       NN      *   | NN      9.828203e-01    NE      1.717975e-02    *
aus              APPR        | APPR    1.000000e+00
dem              ART         | ART     1.000000e+00
sächsischen      ADJA        | ADJA    1.000000e+00
Königstein       NN          | NN      7.762892e-01    NE      2.237108e-01
gestalten        VVINF       | VVINF   1.000000e+00
die              ART         | ART     9.796126e-01    PRELS   1.443545e-02    PDS    5.951974e-03
Feier            NN          | NN      1.000000e+00
gemeinsam        ADJD        | ADJD    1.000000e+00
.                $.          | $.      1.000000e+00
----------------+------------+--------------------------------------------------------------------

Words marked with an asterisk (*) are not in the tagger's lexicon; they are processed by suffix analysis. Tagging speed depends on the average ambiguity rate of the words and on the percentage of unknown words in the text. It typically lies between 30,000 and 60,000 tokens per second on a Pentium 500 running Linux.
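
If you post-process the tagger's output, the column format shown above is easy to read back in. Below is a minimal sketch in Python; it is not part of the TnT distribution, and the file name, the Latin-1 encoding, and the treatment of empty lines as sentence boundaries are my assumptions:

    # Read TnT basic output: one token per line, whitespace-separated
    # columns: word, tag, and optionally "*" for words that were unknown
    # to the lexicon and tagged by suffix analysis.
    def read_tagged(path):
        sentences, current = [], []
        with open(path, encoding="latin-1") as f:     # assumed encoding
            for line in f:
                line = line.strip()
                if not line:                          # assumed sentence boundary
                    if current:
                        sentences.append(current)
                        current = []
                    continue
                fields = line.split()
                word, tag = fields[0], fields[1]
                unknown = "*" in fields[2:]
                current.append((word, tag, unknown))
        if current:
            sentences.append(current)
        return sentences

    # Example use: print each sentence as word/tag pairs.
    for sent in read_tagged("output.tts"):            # hypothetical file name
        print(" ".join(w + "/" + t for w, t, _ in sent))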

We measured tagging accuracy for the different tagsets by dividing each corpus into a 90% training set and a 10% test set. The experiments were repeated ten times with different divisions; training and test sets were guaranteed to be disjoint, so none of the test material was seen during training.

Corpus          | Language | Domain    | Size (Tokens) | Avg. Accuracy | Std. Deviation
----------------+----------+-----------+---------------+---------------+---------------
NEGRA corpus    | German   | Newspaper |       350,000 |         96.7% |           0.29
Penn Treebank   | English  | Newspaper |     1,200,000 |         96.7% |           0.13
Susanne Corpus  | English  | Mixed     |       150,000 |         94.5% |           0.76

Accuracy for the Susanne Corpus is the lowest; this is due to the small size of the corpus (around 150,000 tokens) and its large tagset (around 160 tags, plus multi-token tags). Accuracy for the Penn Treebank is state-of-the-art for English text, and accuracy for the German NEGRA corpus is excellent.
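
The evaluation scheme itself is easy to replicate. A minimal sketch of the repeated 90%/10% evaluation, assuming sentences holds one list of (word, gold tag) pairs per sentence and train and tag stand in for the tagger's training and tagging steps (all names are mine):

    import random
    from statistics import mean, stdev

    # Repeat a random 90%/10% split ten times and report the mean
    # accuracy and its standard deviation, as in the table above.
    def evaluate(sentences, train, tag, runs=10, seed=0):
        rng = random.Random(seed)
        accuracies = []
        for _ in range(runs):
            data = sentences[:]
            rng.shuffle(data)
            cut = int(0.9 * len(data))
            training, test = data[:cut], data[cut:]   # disjoint by construction
            model = train(training)
            correct = total = 0
            for sent in test:
                predicted = tag(model, [w for w, _ in sent])
                correct += sum(p == g for p, (_, g) in zip(predicted, sent))
                total += len(sent)
            accuracies.append(correct / total)
        return mean(accuracies), stdev(accuracies)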

Training Your Own Model

TnT is trainable on languages that separate words by white space, using virtually any tagset that can be represented in ASCII. You need a tagged corpus in the format shown above: one token per line, with the word in the first column and the tag in the second column. This corpus is given to the training module, which creates the appropriate parameter files. Training speed is typically around 100,000 tokens per second on a Pentium 500.
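
To illustrate what the parameter generation step computes, here is a minimal sketch of deleted interpolation over such a two-column corpus, one standard formulation of the weight estimation mentioned earlier; this is my own rendering in Python, not TnT's actual code:

    from collections import Counter

    # Estimate lambda1..lambda3 for the interpolated trigram model.
    # Each trigram votes for whichever estimate (unigram, bigram, or
    # trigram relative frequency) remains most reliable when that
    # trigram occurrence is removed from the counts.
    def deleted_interpolation(tag_sequences):
        uni, bi, tri = Counter(), Counter(), Counter()
        for tags in tag_sequences:        # one list of tags per sentence
            uni.update(tags)
            bi.update(zip(tags, tags[1:]))
            tri.update(zip(tags, tags[1:], tags[2:]))
        n = sum(uni.values())             # corpus size in tokens
        lams = [0.0, 0.0, 0.0]            # [unigram, bigram, trigram]
        for (t1, t2, t3), f in tri.items():
            candidates = [
                (uni[t3] - 1) / (n - 1) if n > 1 else 0.0,
                (bi[(t2, t3)] - 1) / (uni[t2] - 1) if uni[t2] > 1 else 0.0,
                (f - 1) / (bi[(t1, t2)] - 1) if bi[(t1, t2)] > 1 else 0.0,
            ]
            lams[candidates.index(max(candidates))] += f
        total = sum(lams)                 # assumes a non-empty corpus
        return [l / total for l in lams]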

Using TnT

Here is the TnT user's manual: postscript / PDF

Acknowledgements

Many thanks go to Hans Uszkoreit for his support during the development of TnT. Thanks also go to the Deutsche Forschungsgemeinschaft for funding this work through a grant in the Graduiertenkolleg Kognitionswissenschaft Saarbrücken. Large annotated corpora are the prerequisite for developing and testing part-of-speech taggers, and they enable the generation of high-quality language models. Therefore, I would like to thank all the people who were involved in building the Stuttgarter Referenzkorpus, the NEGRA Corpus, the Penn Treebank, and the Susanne Corpus. And, last but not least, I would like to thank the users of TnT who provided me with bug reports and valuable suggestions for improvements.
Last changed: 26 Oct 1998, Thorsten Brants