Quantifying the Impact and Extent of Undocumented Biomedical Synonymy

Automated systems that extract and integrate information from the research literature have become common in biomedicine. As the same meaning can be expressed in many distinct but synonymous ways, access to comprehensive thesauri may enable such systems to maximize their performance. Here, we establish the importance of synonymy for a specific text-mining task (named-entity normalization), and we suggest that current thesauri may be woefully inadequate in their documentation of this linguistic phenomenon. To test this claim, we develop a model for estimating the amount of missing synonymy. We apply our model to both biomedical terminologies and general-English thesauri, predicting massive amounts of missing synonymy for both lexicons. Furthermore, we verify some of our predictions for the latter domain through “crowd-sourcing.” Overall, our work highlights the dramatic incompleteness of current biomedical thesauri, and to mitigate this issue, we propose the creation of “living” terminologies, which would automatically harvest undocumented synonymy and help smart machines enrich biomedicine.

For More: Blair, D.R., Wang, K., Nestorov, S., Evans, J.A., Rzhetsky, A., 2014. Quantifying the Impact and Extent of Undocumented Biomedical Synonymy. PLoS Comput Biol 10 (September, 2014)