Part-of-speech tagging is the process of assigning grammatical categories to individual words in a corpus. One widely used approach makes use of a statistical technique called a hidden Markov model (HMM). The model is defined by two collections of parameters: the transition probabilities, which express the probability that a tag follows the preceding one (or two, for a second-order model), and the lexical probabilities, giving the probability that a word has a given tag without regard to words on either side of it. To tag a text, the tags with non-zero probability are hypothesised for each word, and the most probable sequence of tags given the sequence of words is determined from the probabilities. Two algorithms are commonly used, known as the Forward-Backward (FB) and Viterbi algorithms. FB assigns a probability to every tag on every word, while Viterbi prunes tags which cannot be chosen because their probability is lower than that of competing hypotheses, with a corresponding gain in computational efficiency. For an introduction to the algorithms, see Cutting et al. (1992), or the lucid description by Sharman (1990). There are two principal sources for the parameters of the model. If a tagged corpus prepared by a human annotator is available, the transition and lexical probabilities can be estimated from the frequencies of pairs of tags and of tags associated with words. Alternatively, a procedure called Baum-Welch (BW) re-estimation may be used, in which an untagged corpus is passed through the FB algorithm with some initial model, and the resulting probabilities used to determine new values for the lexical and transition probabilities. By iterating the algorithm with the same corpus, the parameters of the model can be made to converge on values which are locally optimal for the given text.
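The decoding step described above can be sketched as follows. This is a minimal illustration of first-order HMM tagging with Viterbi-style pruning (keeping only the best predecessor for each tag at each word); the toy tagset, vocabulary, probabilities, and smoothing floor are invented for illustration, not taken from any of the cited taggers.

```python
# A minimal sketch of first-order HMM tagging with Viterbi decoding.
# The tagset, vocabulary, and probabilities below are invented toy data;
# a real tagger would estimate them from a corpus.

import math

tags = ["DET", "NOUN", "VERB"]

# Transition probabilities P(tag_i | tag_{i-1}); "<s>" marks sentence start.
trans = {
    ("<s>", "DET"): 0.6, ("<s>", "NOUN"): 0.3, ("<s>", "VERB"): 0.1,
    ("DET", "NOUN"): 0.9, ("DET", "DET"): 0.05, ("DET", "VERB"): 0.05,
    ("NOUN", "VERB"): 0.6, ("NOUN", "NOUN"): 0.3, ("NOUN", "DET"): 0.1,
    ("VERB", "DET"): 0.5, ("VERB", "NOUN"): 0.4, ("VERB", "VERB"): 0.1,
}

# Lexical probabilities P(word | tag); unseen pairs get a tiny floor.
lex = {
    ("the", "DET"): 0.7,
    ("dog", "NOUN"): 0.1,
    ("barks", "NOUN"): 0.01, ("barks", "VERB"): 0.2,
}
FLOOR = 1e-12

def viterbi(words):
    """Return the most probable tag sequence for `words` under the toy model."""
    # best[t] = (log prob, tag path) of the best partial path ending in tag t.
    best = {t: (math.log(trans.get(("<s>", t), FLOOR))
                + math.log(lex.get((words[0], t), FLOOR)), [t])
            for t in tags}
    for w in words[1:]:
        new = {}
        for t in tags:
            # Pruning step: keep only the single best predecessor for tag t.
            p, path = max(
                (lp + math.log(trans.get((prev, t), FLOOR))
                 + math.log(lex.get((w, t), FLOOR)), path)
                for prev, (lp, path) in best.items())
            new[t] = (p, path + [t])
        best = new
    return max(best.values())[1]

print(viterbi(["the", "dog", "barks"]))  # → ['DET', 'NOUN', 'VERB']
```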
The degree of convergence can be measured using a perplexity measure, based on the entropy -sum(p log2 p) over hypothesis probabilities p, which gives an estimate of the degree of disorder in the model. The algorithm is again described by Cutting et al. and by Sharman, and a mathematical justification for it can be found in Huang et al. (1990). The first major use of HMMs for part-of-speech tagging was in CLAWS (Garside et al., 1987) in the 1970s. With the availability of large corpora and fast computers, there has been a recent resurgence of interest, and a number of variations on and alternatives to the FB, Viterbi and BW algorithms have been tried; see the work of, for example, Church (Church, 1988), Brill (Brill and Marcus, 1992; Brill, 1992), DeRose (DeRose, 1988) and Kupiec (Kupiec, 1992). One of the most effective taggers based on a pure HMM is that developed at Xerox (Cutting et al., 1992). An important aspect of this tagger is that it will give good accuracy with a minimal amount of manually tagged training data. An accuracy of 96% (correct assignment of tags to word tokens, compared with a human annotator) is quoted, over a 500,000-word corpus. The Xerox tagger attempts to avoid the need for a hand-tagged training corpus as far as possible. Instead, an approximate model is constructed by hand, which is then improved by BW re-estimation on an untagged training corpus. In the above example, 8 iterations were sufficient. The initial model is set up so that some transitions and some tags in the lexicon are favoured, and hence have a higher initial probability. Convergence of the model is improved by keeping the number of parameters in the model down. To assist in this, low-frequency items in the lexicon are grouped together into equivalence classes, such that all words in a given equivalence class have the same tags and lexical probabilities, and whenever one of the words is looked up, the data common to all of them is used.
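The equivalence-class lexicon just described can be sketched as follows. The frequency threshold, vocabulary, and counts are invented for illustration; the key idea, as in the Xerox tagger, is that low-frequency words sharing a tag set share one lexicon entry, so evidence observed for any member is available to all of them.

```python
# A sketch of an equivalence-class lexicon: low-frequency words with the
# same set of possible tags share one entry, so re-estimation on any member
# updates the data used by every member. All data here is invented toy data.

from collections import Counter

FREQ_THRESHOLD = 5  # assumed cutoff for "low frequency"

# word -> (corpus frequency, set of possible tags)
lexicon = {
    "the":    (1000, frozenset({"DET"})),
    "dog":    (50,   frozenset({"NOUN"})),
    "gerbil": (2,    frozenset({"NOUN"})),
    "stoat":  (1,    frozenset({"NOUN"})),
    "run":    (3,    frozenset({"NOUN", "VERB"})),
    "whirr":  (2,    frozenset({"NOUN", "VERB"})),
}

def lookup_key(word):
    """Map a word to its own entry if frequent, else to its tag-set class."""
    freq, tagset = lexicon[word]
    return word if freq >= FREQ_THRESHOLD else ("CLASS", tagset)

# Shared tag counts per lexicon entry.
counts = {}
def observe(word, tag):
    counts.setdefault(lookup_key(word), Counter())[tag] += 1

observe("gerbil", "NOUN")
observe("run", "VERB")
observe("whirr", "NOUN")

# "stoat" was never observed directly, but shares the {NOUN} class with
# "gerbil", so looking it up finds that evidence.
print(counts[lookup_key("stoat")])  # → Counter({'NOUN': 1})
# "run" and "whirr" share the {NOUN, VERB} class, so their counts pool.
print(counts[lookup_key("run")])
```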
Re-estimation on any of the words in a class therefore counts towards re-estimation for all of them (the technique was originally developed by Kupiec (Kupiec, 1989)). The results of the Xerox experiment appear very encouraging. Preparing tagged corpora by hand is labour-intensive and potentially error-prone, and although a semi-automatic approach can be used (Marcus et al., 1993), it is a good thing to reduce the human involvement as much as possible. However, some careful examination of the experiment is needed. In the first place, Cutting et al. do not compare the success rate in their work with that achieved from a hand-tagged training text with no re-estimation. Secondly, it is unclear how much the initial biasing contributes to the success rate. If significant human intervention is needed to provide the biasing, then the advantages of automatic training become rather weaker, especially if such intervention is needed on each new text domain. The kind of biasing Cutting et al. describe reflects linguistic insights combined with an understanding of the predictions a tagger could reasonably be expected to make and the ones it could not. The aim of this paper is to examine the role that training plays in the tagging process, by an experimental evaluation of how the accuracy of the tagger varies with the initial conditions. The results suggest that a completely unconstrained initial model does not produce good quality results, and that one accurately trained from a hand-tagged corpus will generally do better than using an approach based on re-estimation, even when the training comes from a different source. A second experiment shows that there are different patterns of re-estimation, and that these patterns vary more or less regularly with a broad characterisation of the initial conditions. The outcome of the two experiments together points to heuristics for making effective use of training and re-estimation, together with some directions for further research.
Work similar to that described here has been carried out by Merialdo (1994), with broadly similar conclusions; we will discuss this work below. The principal contribution of this work is to separate the effect of the lexical and transition parameters of the model, and to show how the results vary with different degrees of similarity between the training and test data. The general pattern of the results presented does not vary greatly with the corpus and tagset used. During the first experiment, it became apparent that Baum-Welch re-estimation sometimes decreases the accuracy as the iteration progresses. From the observations in the previous section, we propose the following guidelines for how to train an HMM for use in tagging: ...able, use BW re-estimation with standard convergence tests such as perplexity. In the end it may turn out there is simply no way of making the prediction without a source of information extrinsic to both model and corpus.
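A convergence test of the kind mentioned in the guidelines can be sketched as follows. This assumes the standard definitions (entropy H = -sum(p log2 p), perplexity 2^H); the hypothesis distributions and the stopping tolerance are invented for illustration, not taken from the experiments reported here.

```python
# A sketch of a perplexity-based convergence test for BW re-estimation:
# compute a mean per-word perplexity over the tag hypothesis distributions
# and stop iterating when it no longer improves by more than a tolerance.
# Distributions and tolerance below are invented toy data.

import math

def entropy(dist):
    """H = -sum(p * log2 p) over one probability distribution."""
    return -sum(p * math.log2(p) for p in dist if p > 0)

def corpus_perplexity(tag_dists):
    """Mean per-word perplexity, 2 ** (mean entropy)."""
    avg_h = sum(entropy(d) for d in tag_dists) / len(tag_dists)
    return 2 ** avg_h

# Tag hypothesis distributions after two successive BW iterations:
before = [[0.5, 0.5], [0.4, 0.3, 0.3]]   # flat, high-disorder model
after  = [[0.9, 0.1], [0.8, 0.1, 0.1]]   # sharper model after one iteration

TOLERANCE = 0.01  # assumed stopping threshold
p0, p1 = corpus_perplexity(before), corpus_perplexity(after)
print(p0, p1)  # perplexity falls as the model's disorder decreases

converged = (p0 - p1) < TOLERANCE
print(converged)  # → False: the model is still improving, keep iterating
```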