TnT, the short form of Trigrams'n'Tags, is a very efficient statistical part-of-speech tagger that is trainable on different languages and virtually any tagset. The component for parameter generation trains on tagged corpora. The system incorporates several methods of smoothing and of handling unknown words.
TnT is not optimized for a particular language. Instead, it is optimized for training on a large variety of corpora. Adapting the tagger to a new language, new domain, or new tagset is very easy. Additionally, TnT is optimized for speed.
The tagger is an implementation of the Viterbi algorithm for second-order Markov models. The main paradigm used for smoothing is linear interpolation; the respective weights are determined by deleted interpolation. Unknown words are handled by a suffix trie and successive abstraction.
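The weight estimation can be sketched as follows. This is a minimal illustration of deleted interpolation over a tag sequence, not TnT's actual implementation: for each trigram, the leave-one-out relative frequencies of its unigram, bigram, and trigram estimates are compared, and the trigram's count is added to the weight of whichever order wins. The function name and toy data are illustrative.

```python
from collections import Counter

def deleted_interpolation(tags):
    """Estimate lambda weights for linearly interpolating unigram,
    bigram, and trigram tag probabilities (sketch of the technique)."""
    uni = Counter(tags)
    bi = Counter(zip(tags, tags[1:]))
    tri = Counter(zip(tags, tags[1:], tags[2:]))
    n = len(tags)
    lambdas = [0.0, 0.0, 0.0]  # weights for unigram, bigram, trigram
    for (t1, t2, t3), c in tri.items():
        # Leave-one-out relative frequencies; guard against division by zero.
        cases = [
            (uni[t3] - 1) / (n - 1) if n > 1 else 0.0,
            (bi[(t2, t3)] - 1) / (uni[t2] - 1) if uni[t2] > 1 else 0.0,
            (c - 1) / (bi[(t1, t2)] - 1) if bi[(t1, t2)] > 1 else 0.0,
        ]
        # Credit the order whose leave-one-out estimate is largest.
        lambdas[cases.index(max(cases))] += c
    total = sum(lambdas)
    return [l / total for l in lambdas]
```

The smoothed transition probability is then the lambda-weighted sum of the three maximum-likelihood estimates; because the weights are learned from held-out counts, frequent trigrams dominate while rare contexts fall back to bigram and unigram statistics.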
TnT can be applied directly using one of these three language models. The input file contains one token per line. In the basic mode, the tagger adds a second column to each line, containing the tag for the word. Optionally, the tagger emits alternative tags for each token, together with a probability distribution. Example:
Input           | Basic Output | Extended Output
----------------+--------------+---------------------------------------------------------
Der             | ART          | ART   1.000000e+00
Mandolinen-Club | NN         * | NN    1.000000e+00                                      *
Falkenstein     | NE         * | NE    8.001280e-01  NN     1.998720e-01                 *
und             | KON          | KON   1.000000e+00
der             | ART          | ART   1.000000e+00
Frauenchor      | NN         * | NN    9.828203e-01  NE     1.717975e-02                 *
aus             | APPR         | APPR  1.000000e+00
dem             | ART          | ART   1.000000e+00
sächsischen     | ADJA         | ADJA  1.000000e+00
Königstein      | NN           | NN    7.762892e-01  NE     2.237108e-01
gestalten       | VVINF        | VVINF 1.000000e+00
die             | ART          | ART   9.796126e-01  PRELS  1.443545e-02  PDS  5.951974e-03
Feier           | NN           | NN    1.000000e+00
gemeinsam       | ADJD         | ADJD  1.000000e+00
.               | $.           | $.    1.000000e+00
----------------+--------------+---------------------------------------------------------

Words marked with an asterisk (*) are not in the lexicon of the tagger; they are processed by suffix analysis. Tagging speed depends on the average ambiguity rate of the words and on the percentage of unknown words in the text. It is typically between 30,000 and 60,000 tokens per second on a Pentium 500 running Linux.
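The suffix analysis for unknown words can be sketched in a few lines. This is an illustrative simplification, not TnT's code: tag distributions are collected for every word-final suffix up to a fixed length, and at tagging time the distributions for successively longer suffixes of the unknown word are interpolated (successive abstraction), starting from the overall tag distribution. The smoothing constant `theta` and the toy training pairs are assumptions.

```python
from collections import Counter, defaultdict

def build_suffix_stats(word_tag_pairs, max_len=5):
    """Collect tag counts for every word suffix up to max_len characters,
    including the empty suffix (the overall tag distribution)."""
    stats = defaultdict(Counter)
    for word, tag in word_tag_pairs:
        for i in range(min(len(word), max_len) + 1):
            stats[word[len(word) - i:]][tag] += 1
    return stats

def suffix_tag_probs(word, stats, theta=0.3, max_len=5):
    """Successive abstraction: refine P(tag | suffix) from the empty
    suffix toward the longest observed suffix of `word`."""
    base = stats[""]
    tags = set(base)
    probs = {t: base[t] / sum(base.values()) for t in tags}
    for i in range(1, min(len(word), max_len) + 1):
        suf = word[len(word) - i:]
        if suf not in stats:
            break  # no longer suffix was seen in training
        total = sum(stats[suf].values())
        # Interpolate the suffix estimate with the shorter-suffix estimate.
        probs = {t: (stats[suf][t] / total + theta * probs[t]) / (1 + theta)
                 for t in tags}
    return probs
```

In the full tagger these probabilities are inverted via Bayes' rule into emission probabilities P(word | tag) before entering the Viterbi search; the sketch stops at the conditional tag distribution.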
We measured tagging accuracy for the different tagsets by dividing each corpus into a 90% training set and a 10% test set. The experiments were repeated ten times with different divisions; training and test sets were always disjoint, so none of the test material was seen during training.
Corpus         | Language | Domain    | Size (Tokens) | Avg. Accuracy | Std. Deviation
---------------+----------+-----------+---------------+---------------+---------------
NEGRA corpus   | German   | Newspaper |       350,000 |         96.7% |           0.29
Penn Treebank  | English  | Newspaper |     1,200,000 |         96.7% |           0.13
Susanne Corpus | English  | Mixed     |       150,000 |         94.5% |           0.76
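The evaluation protocol above can be sketched generically. This is a hypothetical harness, not the one used for the reported numbers: `train_and_eval` stands in for training a tagger on one split and returning its accuracy on the other, and the repeated random 90/10 division yields the mean accuracy and standard deviation reported in the table.

```python
import random
import statistics

def repeated_holdout(sentences, train_and_eval, runs=10, train_frac=0.9, seed=0):
    """Repeat a random train/test split `runs` times and report the mean
    and standard deviation of the accuracies.

    `train_and_eval(train, test)` is a placeholder callback that trains a
    tagger on `train` and returns its tagging accuracy on `test`."""
    rng = random.Random(seed)
    accuracies = []
    for _ in range(runs):
        data = sentences[:]
        rng.shuffle(data)  # a fresh division for every run
        cut = int(len(data) * train_frac)
        accuracies.append(train_and_eval(data[:cut], data[cut:]))
    return statistics.mean(accuracies), statistics.stdev(accuracies)
```

Shuffling whole sentences, rather than tokens, keeps each test set disjoint from training at the sentence level, matching the requirement that no test material is seen during training.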