For the past decade or more, symbolic, linguistically oriented methods and statistical or machine learning approaches to NLP have often been perceived as incompatible or even competing paradigms. While shallow and probabilistic processing techniques have produced useful results in many classes of applications, they have not met the full range of needs for NLP, particularly where precise interpretation is important, or where the variety of linguistic expression is large relative to the amount of training data available. On the other hand, deep approaches to NLP have only recently achieved broad enough grammatical coverage and sufficient processing efficiency to allow the use of precise linguistic grammars in certain types of real-world applications.

In particular, applications of broad-coverage analytical grammars for parsing or generation require the use of sophisticated statistical techniques for resolving ambiguities; the transfer of Head-driven Phrase Structure Grammar (HPSG) systems into industry, for example, has amplified the need for general parse ranking, disambiguation, and robust recovery techniques. We observe general consensus on the necessity for bridging activities, combining symbolic and stochastic approaches to NLP. But although we find promising research in stochastic parsing in a number of frameworks, there is a lack of appropriately rich and dynamic language corpora for HPSG. Likewise, stochastic parsing has so far been focussed on information-extraction-type applications and lacks any depth of semantic interpretation. The Redwoods initiative is designed to fill in this gap.

In the next section, we present some of the motivation for the LinGO Redwoods project as a treebank development process. Although construction of the treebank is in its early stages, we present in Section 3 some preliminary results of using the treebank data already acquired on concrete applications.
We show, for instance, that even simple statistical models of parse ranking trained on the Redwoods corpus built so far can disambiguate parses with close to 80% accuracy.

2 A Rich and Dynamic Treebank

The Redwoods treebank is based on open-source HPSG resources developed by a broad consortium of research groups including researchers at Stanford (USA), Saarbrücken (Germany), Cambridge, Edinburgh, and Sussex (UK), and Tokyo (Japan). Their wide distribution and common acceptance make the HPSG framework and resources an excellent anchor point for the Redwoods treebanking initiative.

The key innovative aspect of the Redwoods approach to treebanking is the anchoring of all linguistic data captured in the treebank to the HPSG framework and a generally-available broad-coverage grammar of English, the LinGO English Resource Grammar (Flickinger, 2000) as implemented with the LKB grammar development environment (Copestake, 2002). Unlike existing treebanks, there is no need to define a (new) form of grammatical representation specific to the treebank.

The LinGO Redwoods Treebank: Motivation and Preliminary Applications

Stephan Oepen, Kristina Toutanova, Stuart Shieber, Christopher Manning, Dan Flickinger, and Thorsten Brants
{oe|kristina|manning|dan}@csli.stanford.edu, shieber@deas.harvard.edu, brants@parc.xerox.com

Abstract

The LinGO Redwoods initiative is a seed activity in the design and development of a new type of treebank.