broad-coverage lexical resources such as wordnet are extremely useful in applications such as word sense disambiguation (leacock, chodorow, miller 1998) and question answering (pasca and harabagiu 2001). however, they often include many rare senses while missing domain-specific senses. for example, in wordnet, the words dog, computer and company all have a sense that is a hyponym of person. such rare senses make it difficult for a coreference resolution system to use wordnet to enforce the constraint that personal pronouns (e.g. he or she) must refer to a person. on the other hand, wordnet misses the user-interface object sense of the word dialog (as often used in software manuals). one way to deal with these problems is to use a clustering algorithm to automatically induce semantic classes (lin and pantel 2001). many clustering algorithms represent a cluster by the centroid of all of its members (e.g., k means) (mcqueen 1967) or by a representative element (e.g., k-medoids) (kaufmann and rousseeuw 1987). when averaging over all elements in a cluster, the centroid of a cluster may be unduly influenced by elements that only marginally belong to the cluster or by elements that also belong to other clusters. for example, when clustering words, we can use the contexts of the words as features and group together the words that tend to appear in similar contexts. for instance, u.s. state names can be clustered this way because they tend to appear in the following contexts: (list a) ___ appellate court campaign in ___ ___ capital governor of ___ ___ driver's license illegal in ___ ___ outlaws sth. primary in ___ ___'s sales tax senator for ___ if we create a centroid of all the state names, the centroid will also contain features such as: (list b) ___'s airport archbishop of ___ ___'s business district fly to ___ ___'s mayor mayor of ___ ___'s subway outskirts of ___ because some of the state names (like new york and washington) are also names of cities. using a single representative from a cluster may be problematic too because each individual element has its own idiosyncrasies that may not be shared by other members of the cluster. in this paper, we propose a clustering algo rithm, cbc (clustering by committee), in which the centroid of a cluster is constructed by averaging the feature vectors of a subset of the cluster members. the subset is viewed as a committee that determines which other elements belong to the cluster. by carefully choosing committee members, the features of the centroid tend to be the more typical features of the target class. for example, our system chose the following committee members to compute the centroid of the state cluster: illinois, michigan, minnesota, iowa, wisconsin, indiana, nebraska and vermont. as a result, the centroid contains only features like those in list a. evaluating clustering results is a very difficult task. we introduce a new evaluation methodol ogy that is based on the editing distance between output clusters and classes extracted from wordnet (the answer key).we presented a clustering algorithm, cbc, for automatically discovering concepts from text. we introduce a new evaluation methodol ogy that is based on the editing distance between output clusters and classes extracted from wordnet (the answer key). this research was partly supported by natural sciences and engineering research council of canada grant ogp121338 and scholarship pgsb207797. as a result, the centroid contains only features like those in list a. evaluating clustering results is a very difficult task. however, they often include many rare senses while missing domain-specific senses. we generated clusters from a news corpus using cbc and compared them with classes extracted from wordnet (miller 1990). the parameters k and t are usually considered to be small numbers. broad-coverage lexical resources such as wordnet are extremely useful in applications such as word sense disambiguation (leacock, chodorow, miller 1998) and question answering (pasca and harabagiu 2001). five of the 943 clusters discovered by cbc from s13403 along with their features with top-15 highest mutual information and the wordnet classes that have the largest intersection with each cluster. test data. clustering algorithms are generally categorized as hierarchical and partitional. to extract classes from wordnet, we first estimate the probability of a random word belonging to a subhierarchy (a synset and its hyponyms). |