A Maximum Entropy Approach to Identifying Sentence Boundaries

We present a trainable model for identifying sentence boundaries in raw text. Given a corpus annotated with sentence boundaries, our model learns to classify each occurrence of ., ?, and ! as either a valid or invalid sentence boundary. The training procedure requires no hand-crafted rules, lexica, part-of-speech tags, or domain-specific information. The model can therefore be trained easily on any genre of English, and should be trainable on any other Roman-alphabet language. Performance is comparable to or better than the performance of similar systems, but we emphasize the simplicity of retraining for new domains. Our statistical system, mxterminator, employs simpler lexical features of the words to the left and right of the candidate period.
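
To make the classification setup concrete, the following is a minimal sketch in Python and not the system's actual implementation. It treats every ., ?, or ! as a candidate, extracts a few lexical features from the text to its left and right, and fits a binary maximum entropy model (equivalent to logistic regression over indicator features). The function names, the specific feature templates, and the use of plain stochastic gradient ascent are all assumptions made for illustration; the abstract does not specify the feature set or the estimation procedure.

```python
import math
import re
from collections import defaultdict

# Candidate boundary characters (assumed set for this sketch).
CANDIDATE = re.compile(r"[.?!]")


def candidates(text):
    """Return the character offset of every '.', '?' and '!' in text."""
    return [m.start() for m in CANDIDATE.finditer(text)]


def features(text, i):
    """Illustrative lexical features around the candidate at offset i.

    These templates are hypothetical; they only look at surface strings,
    with no lexicon, part-of-speech tags, or hand-written rules.
    """
    prefix = re.search(r"\S*$", text[:i]).group()    # token part before candidate
    suffix = re.match(r"\S*", text[i + 1:]).group()  # token part after candidate
    before = text[:i - len(prefix)].split()
    after = text[i + 1 + len(suffix):].split()
    prev_word = before[-1] if before else "<START>"
    next_word = after[0] if after else "<END>"
    return {
        "char=" + text[i],
        "prefix=" + prefix,
        "suffix=" + suffix,
        "prev=" + prev_word,
        "next=" + next_word,
        "next_cap=" + str(next_word[:1].isupper()),
    }


def prob(weights, feats):
    """P(boundary | features) under a binary maximum entropy model."""
    score = sum(weights[f] for f in feats)
    return 1.0 / (1.0 + math.exp(-score))


def train(examples, epochs=200, lr=0.5):
    """Fit feature weights by stochastic gradient ascent on the log-likelihood.

    examples is a list of (feature_set, is_boundary) pairs. This optimizer is
    a stand-in chosen for brevity, not the estimation method of the paper.
    """
    weights = defaultdict(float)
    for _ in range(epochs):
        for feats, is_boundary in examples:
            step = lr * (float(is_boundary) - prob(weights, feats))
            for f in feats:
                weights[f] += step
    return weights


# Toy annotated corpus: one label per candidate, in order of appearance.
corpus = "Mr. Smith arrived. He sat down."
labels = [False, True, True]
model = train([(features(corpus, i), y)
               for i, y in zip(candidates(corpus), labels)])

# Score the candidates in unseen text; values above 0.5 are read as boundaries.
unseen = "Mr. Jones left. He waved."
for i in candidates(unseen):
    print(i, round(prob(model, features(unseen, i)), 3))
```

Because the features are plain strings computed from the surrounding text, retraining for a new genre or language in this sketch only requires supplying a different annotated corpus.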