Extracting Clusters of Specialist Terms from Unstructured Text

Automatically identifying related specialist terms is a difficult and important task, required to understand the structure of less prominent portions of the lexicon. Such terms are often defining features of a particular domain. We develop a corpus-based method of extracting coherent clusters of satellite terminology (terms on the edge of the lexicon) using co-occurrence networks built from unstructured text. Clusters are identified by extracting communities in the co-occurrence graph, after which the largest community is discarded and the words in the remaining groups are ranked by centrality. The method is computationally tractable on large corpora and requires no document structure and only minimal normalization. First, the results suggest that the method does indeed extract coherent groups of satellite terms in corpora with varying content, style and structure. Second, the results show that, in addition to a densely connected core (previously found in dictionary structure), language has systematic, semantically coherent structure on the fringe of the observed vocabulary.
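
The pipeline described above (co-occurrence graph, community extraction, discarding the largest community, ranking by centrality) can be sketched roughly as follows. This is only an illustration under stated assumptions, not the paper's implementation: the abstract does not name a community-detection algorithm, a centrality measure, or a co-occurrence window, so the sketch assumes sentence-level co-occurrence, a simple deterministic label propagation, and degree centrality. All function names and the toy corpus are hypothetical.

```python
from collections import Counter, defaultdict
from itertools import combinations


def cooccurrence_graph(sentences, min_count=1):
    """Undirected graph: words are nodes; an edge links words sharing a sentence.

    Assumption: the sentence is the co-occurrence window, and normalization
    is limited to lowercasing and whitespace tokenization.
    """
    counts = Counter()
    for sentence in sentences:
        words = sorted(set(sentence.lower().split()))
        counts.update(combinations(words, 2))
    graph = defaultdict(set)
    for (a, b), n in counts.items():
        if n >= min_count:
            graph[a].add(b)
            graph[b].add(a)
    return dict(graph)


def label_propagation(graph, max_rounds=100):
    """Assumed stand-in for community detection: each node repeatedly adopts
    the most common label among its neighbours until nothing changes."""
    labels = {node: node for node in graph}
    for _ in range(max_rounds):
        changed = False
        for node in sorted(graph):  # fixed order keeps the result reproducible
            tally = Counter(labels[nb] for nb in graph[node])
            best = max(tally.values())
            # Break ties with the alphabetically smallest label.
            choice = min(lab for lab, n in tally.items() if n == best)
            if choice != labels[node]:
                labels[node] = choice
                changed = True
        if not changed:
            break
    communities = defaultdict(set)
    for node, lab in labels.items():
        communities[lab].add(node)
    return list(communities.values())


def satellite_clusters(sentences, top_n=3):
    """Drop the largest community (the core) and rank words in the remaining
    communities by degree centrality (an assumed choice of centrality)."""
    graph = cooccurrence_graph(sentences)
    communities = label_propagation(graph)
    if not communities:
        return []
    largest = max(communities, key=len)
    ranked = []
    for community in communities:
        if community is largest:
            continue
        by_degree = sorted(community, key=lambda w: (-len(graph[w]), w))
        ranked.append(by_degree[:top_n])
    return sorted(ranked)


# Toy corpus (invented for illustration): a general-vocabulary core plus
# two small groups of specialist terms.
corpus = [
    "the model learns the data",
    "the data trains the model",
    "the model fits data well",
    "gneiss schist",
    "schist feldspar",
    "schist mica",
    "arpeggio ostinato",
]
clusters = satellite_clusters(corpus)
print(clusters)
```

On this toy corpus the general-vocabulary words form the largest community and are discarded, leaving one geological and one musical cluster, with "schist" ranked first in its group because it co-occurs with the most other terms. On a real corpus the fringe communities would not be cleanly disconnected from the core, so the choice of community-detection algorithm would matter far more than it does here.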

For More: Gerow, A. (2014, October). Extracting clusters of specialist terms from unstructured text. Association for Computational Linguistics.