The importance of learning to manipulate monolingual paraphrase relationships for applications like summarization, search, and dialog has been highlighted by a number of recent efforts (Barzilay & McKeown 2001; Shinyama et al. 2002; Lee & Barzilay 2003; Lin & Pantel 2001). While several different learning methods have been applied to this problem, all share a need for large amounts of data in the form of pairs or sets of strings that are likely to exhibit lexical and/or structural paraphrase alternations. One approach [1] that has been successfully used is edit distance, a measure of similarity between strings. The assumption is that strings separated by a small edit distance will tend to be similar in meaning: "the leading indicators measure the economy" and "the leading index measures the economy". Lee & Barzilay (2003), for example, use multi-sequence alignment (MSA) to build a corpus of paraphrases involving terrorist acts. Their goal is to extract sentential templates that can be used in high-precision generation of paraphrase alternations within a limited domain. Our goal here is rather different: our interest lies in constructing a monolingual broad-domain corpus of pairwise aligned sentences. Such data would be amenable to conventional statistical machine translation (SMT) techniques (e.g., those discussed in Och & Ney 2003). [2]

In what follows, we compare two strategies for unsupervised construction of such a corpus, one employing string similarity and the other associating sentences that may overlap very little at the string level. We measure the relative utility of the two derived monolingual corpora in the context of word alignment techniques developed originally for bilingual text. We show that although the edit distance corpus is well-suited as training data for the alignment algorithms currently used in SMT, it is an incomplete source of information about paraphrase relations, which exhibit many of the characteristics of comparable bilingual corpora or free translations. Edit distance identifies sentence pairs that exhibit lexical and short phrasal alternations, and these can be aligned with considerable success. Many of the more complex alternations that characterize monolingual paraphrase, however, such as large-scale lexical alternations and constituent reorderings, are not readily captured by edit distance techniques, which conflate semantic similarity with formal similarity. We conclude that paraphrase research would benefit by identifying richer data sources and developing appropriate learning techniques.

[1] An alternative approach involves identifying anchor points, pairs of words linked in a known way, and collecting the strings that intervene (Shinyama et al. 2002; Lin & Pantel 2001). Since our interest is in learning sentence-level paraphrases, including major constituent reorganizations, we do not address this approach here.
[2] Barzilay & McKeown (2001) consider the possibility of using SMT machinery, but reject the idea because of the noisy, comparable nature of their dataset.
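To make the edit distance assumption concrete, the following minimal sketch computes a word-level Levenshtein distance for the example pair above. The word-level tokenization, the unit edit costs, and the code itself are illustrative assumptions rather than the implementation used to build the corpus.

def edit_distance(a, b):
    """Levenshtein distance between two token sequences."""
    # dp[i][j] = cost of transforming a[:i] into b[:j]
    dp = [[0] * (len(b) + 1) for _ in range(len(a) + 1)]
    for i in range(len(a) + 1):
        dp[i][0] = i                                # delete all of a[:i]
    for j in range(len(b) + 1):
        dp[0][j] = j                                # insert all of b[:j]
    for i in range(1, len(a) + 1):
        for j in range(1, len(b) + 1):
            cost = 0 if a[i - 1] == b[j - 1] else 1
            dp[i][j] = min(dp[i - 1][j] + 1,        # deletion
                           dp[i][j - 1] + 1,        # insertion
                           dp[i - 1][j - 1] + cost) # substitution
    return dp[len(a)][len(b)]

s1 = "the leading indicators measure the economy".split()
s2 = "the leading index measures the economy".split()
print(edit_distance(s1, s2))  # 2: indicators->index, measure->measures

A distance of 2 (two substitutions) flags the pair as a likely paraphrase; where to set the admission threshold is a corpus-construction choice.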
Our two paraphrase datasets are distilled from a corpus of news articles gathered from thousands of news sources over an extended period. The first was built with the string-similarity (edit distance) strategy; the second relied on a discourse-based heuristic, specific to the news genre, to identify likely paraphrase pairs even when they have little superficial similarity. Given a large dataset and a well-motivated clustering of documents, useful datasets can be gleaned even without resorting to more sophisticated techniques.
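For illustration only, the two construction strategies can be contrasted as two pairing rules applied to clusters of news articles covering the same event. The cluster structure, the edit-distance threshold, and the rule of pairing opening sentences below are assumptions made for this sketch (which reuses the edit_distance function above), not the procedure defined later in the paper.

from itertools import combinations

def string_similarity_pairs(cluster_sentences, max_dist=4):
    # Admit any two sentences in a cluster whose word-level edit
    # distance is small; the threshold here is illustrative.
    return [(s1, s2) for s1, s2 in combinations(cluster_sentences, 2)
            if edit_distance(s1.split(), s2.split()) <= max_dist]

def discourse_heuristic_pairs(cluster_articles):
    # Pair the opening sentences of different articles in the same
    # cluster, on the (hypothetical) assumption that lead sentences
    # of stories about one event tend to restate the same content
    # even when they share few words.
    leads = [sentences[0] for sentences in cluster_articles if sentences]
    return list(combinations(leads, 2))

The point of the contrast is that the first rule can only admit pairs that already look alike on the surface, while the second can propose pairs that share almost no words at the string level.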