predicate subcategorization is a key component of a lexical entry, because most, if not all, recent syntactic theories 'project' syntactic structure from the lexicon. therefore, a wide-coverage parser utilizing such a lexicalist grammar must have access to an accurate and comprehensive dictionary encoding (at a minimum) the number and category of a predicate's arguments and ideally also information about control with predicative arguments, semantic selection preferences on arguments, and so forth, to allow the recovery of the correct predicate-argument structure. if the parser uses statistical techniques to rank analyses, it is also critical that the dictionary encode the relative frequency of distinct subcategorization classes for each predicate. several substantial machine-readable subcategorization dictionaries exist for english, either built largely automatically from machine-readable versions of conventional learners' dictionaries, or manually by (computational) linguists (e.g. the alvey nl tools (anlt) dictionary, boguraev et al. (1987); the comlex syntax dictionary, grishman et al. (1994)). unfortunately, neither approach can yield a genuinely accurate or comprehensive computational lexicon, because both rest ultimately on the manual efforts of lexicographers / linguists and are, therefore, prone to errors of omission and commission which are hard or impossible to detect automatically (e.g. boguraev & briscoe, 1989; see also section 3.1 below for an example). furthermore, manual encoding is labour intensive and, therefore, it is costly to extend it to neologisms, information not currently encoded (such as relative frequency of different subcategorizations), or other (sub)languages. these problems are compounded by the fact that predicate subcategorization is closely associated with lexical sense, and the senses of a word change between corpora, sublanguages and/or subject domains (jensen, 1991).
in a recent experiment with a wide-coverage parsing system utilizing a lexicalist grammatical framework, briscoe & carroll (1993) observed that half of parse failures on unseen test data were caused by inaccurate subcategorization information in the anlt dictionary. the close connection between sense and subcategorization and between subject domain and sense makes it likely that a fully accurate 'static' subcategorization dictionary of a language is unattainable in any case. moreover, although schabes (1992) and others have proposed 'lexicalized' probabilistic grammars to improve the accuracy of parse ranking, no wide-coverage parser has yet been constructed incorporating probabilities of different subcategorizations for individual predicates, because of the problems of accurately estimating them. these problems suggest that automatic construction or updating of subcategorization dictionaries from textual corpora is a more promising avenue to pursue. preliminary experiments acquiring a few verbal subcategorization classes have been reported by brent (1991, 1993), manning (1993), and ushioda et al. (1993). in these experiments the maximum number of distinct subcategorization classes recognized is sixteen, and only ushioda et al. attempt to derive relative subcategorization frequency for individual predicates. we describe a new system capable of distinguishing 160 verbal subcategorization classes, a superset of those found in the anlt and comlex syntax dictionaries. the classes also incorporate information about control of predicative arguments and alternations such as particle movement and extraposition. we report an initial experiment which demonstrates that this system is capable of acquiring the subcategorization classes of verbs and the relative frequencies of these classes with comparable accuracy to the less ambitious extant systems.
we achieve this performance by exploiting a more sophisticated robust statistical parser which yields complete though 'shallow' parses, a more comprehensive subcategorization class classifier, and a priori estimates of the probability of membership of these classes. we also describe a small-scale experiment which demonstrates that subcategorization class frequency information for individual verbs can be used to improve parsing accuracy.
the experiment and comparison reported above suggest that our more comprehensive subcategorization class extractor is able both to assign classes to individual verbal predicates and also to rank them according to relative frequency with comparable accuracy to extant systems. brent's (1993) approach to acquiring subcategorization is based on a philosophy of only exploiting unambiguous and determinate information in unanalysed corpora.
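as an illustration of what 'relative frequency of subcategorization classes for individual predicates' means in practice, the following is a minimal sketch in python. it is not the system described here: it assumes that some upstream component has already produced (verb, class) observations, it uses a plain unsmoothed relative-frequency estimate, and the class labels (`NP`, `NP_NP`, `NP_PP`, `SCOMP`), the function names and the toy data are all hypothetical.

```python
from collections import Counter, defaultdict


def relative_frequencies(observations):
    """Map each verb to {class label: relative frequency}.

    `observations` is an iterable of (verb, class_label) pairs. The estimate
    is a plain unsmoothed ratio of counts (an assumption of this sketch).
    """
    counts = defaultdict(Counter)
    for verb, scf in observations:
        counts[verb][scf] += 1
    return {
        verb: {scf: n / sum(c.values()) for scf, n in c.items()}
        for verb, c in counts.items()
    }


def rank_classes(freqs, verb):
    """Return the verb's classes from most to least frequent.

    Ties are broken alphabetically by class label so the order is stable.
    """
    return sorted(freqs.get(verb, {}).items(), key=lambda kv: (-kv[1], kv[0]))


# Toy observations, invented for illustration only.
observations = [
    ("give", "NP_NP"), ("give", "NP_PP"), ("give", "NP_NP"), ("give", "NP"),
    ("believe", "SCOMP"), ("believe", "NP"),
    ("believe", "SCOMP"), ("believe", "SCOMP"),
]

freqs = relative_frequencies(observations)
for verb in ("give", "believe"):
    print(verb, rank_classes(freqs, verb))
```

a ranked list of this kind is the sort of information a statistical parser could consult when choosing between competing analyses of the same verb, although a realistic estimate would also need to handle noisy observations and sparse data.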