MedGene and BioGene: Advanced Literature Mining

Development Team: Michael Fiacco and Jin Park, Ph.D.

High-throughput technologies, such as proteomic screening and DNA micro-arrays, produce vast amounts of data requiring comprehensive analytical methods to decipher the biologically relevant results. One approach would be to manually search the biomedical literature; however, this would be an arduous task. We developed two automated literature-mining tools, termed MedGene and BioGene, which can be used to comprehensively search the literature for associations between human genes and disease or biological themes, respectively.


MedGene

MedGene is a text-mining tool that identifies associations between human genes and diseases in the literature (see publications). It searches the titles and abstracts of over 16,000,000 Medline records to identify genes co-cited with human diseases. Normalization is applied to assess the strength of each association, so that for any given human disease MedGene can return a ranked list of associated genes. Although it will eventually be feasible to study the entire human proteome in high-throughput studies, at present practicality often demands a focus on relevant subsets of genes. In the breast cancer studies, for example, it was important to identify a set of 1000 genes with a high likelihood of yielding results in the screening experiments (see figure).
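
The co-citation and ranking idea described above can be illustrated with a minimal sketch. This is not MedGene's actual implementation: the function name rank_genes, the toy records, the plain substring matching, and the use of pointwise mutual information as the normalization are all assumptions chosen only to show how raw co-citation counts can be turned into a ranked gene list.

```python
from collections import Counter
from math import log


def rank_genes(records, disease, gene_terms):
    """Rank genes by co-citation with `disease` across text records.

    records:    iterable of title/abstract strings (hypothetical input)
    disease:    disease term to look for
    gene_terms: gene symbols to look for
    The score is pointwise mutual information (PMI), used here purely as an
    illustrative normalization; it is an assumption, not MedGene's method.
    """
    total = 0
    disease_hits = 0
    gene_hits = Counter()   # records mentioning each gene
    co_hits = Counter()     # records mentioning both the gene and the disease
    disease_lc = disease.lower()

    for text in records:
        total += 1
        text_lc = text.lower()
        has_disease = disease_lc in text_lc
        disease_hits += has_disease
        for gene in gene_terms:
            if gene.lower() in text_lc:
                gene_hits[gene] += 1
                if has_disease:
                    co_hits[gene] += 1

    # PMI = log( P(gene, disease) / (P(gene) * P(disease)) )
    scores = {
        gene: log(both * total / (gene_hits[gene] * disease_hits))
        for gene, both in co_hits.items()
    }
    # Genes never co-cited with the disease are omitted from the ranking.
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)


# Toy records standing in for Medline titles and abstracts.
records = [
    "BRCA1 mutations in breast cancer",
    "BRCA1 and breast cancer risk",
    "TP53 in breast cancer",
    "TP53 in lung cancer",
    "TP53 and colon cancer",
    "EGFR signaling in lung cancer",
]
ranked = rank_genes(records, "breast cancer", ["BRCA1", "TP53", "EGFR"])
for gene, score in ranked:
    print(f"{gene}\t{score:.3f}")
```

In this toy data BRCA1 ranks above TP53 even though both are co-cited with breast cancer, because TP53 is also cited frequently with other diseases. That is the role of normalization: it rewards specificity of the association rather than raw citation counts.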

In addition, the data produced by high-throughput technologies must be analyzed comprehensively to identify the biologically relevant results. A global understanding of gene-disease relationships enables comprehensive comparisons between large experimental data sets and existing knowledge in the medical literature.

Figure: Estimation of the false negative rate by comparison with hand-curated databases. The breast cancer-related genes identified by MedGene were compared with those listed in several other databases, including the Tumor Gene Database (TGD), the Breast Cancer Gene Database (BCG), GeneCards (GC) and Swiss-Prot. Genes were considered false negatives if they were represented in at least one of these other databases but not in MedGene, and their link to breast cancer was supported by at least one literature reference. All literature references were verified by manual review to confirm their validity. The figure indicates the number of genes in each database and the number shared by more than one database. The false negative rate was calculated as the number of genes missed by MedGene (26) divided by the total number of nonoverlapping genes in the other databases (285), or about 9.1%.
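
The calculation in the caption can be written out as a short sketch. The function name false_negative_rate and the toy gene sets are hypothetical; only the counts 26 and 285 come from the text. The sketch assumes each reference set has already been restricted to genes whose breast cancer link is supported by a manually verified literature reference.

```python
def false_negative_rate(tool_genes, reference_dbs):
    """Return (rate, missed) for a tool's gene list against reference databases.

    tool_genes:    genes reported by the literature-mining tool
    reference_dbs: list of gene collections, one per hand-curated database
    """
    # Union of all databases: each gene is counted once, however many list it.
    reference = set().union(*reference_dbs)
    if not reference:
        raise ValueError("reference databases contain no genes")
    missed = reference - set(tool_genes)
    return len(missed) / len(reference), missed


# Toy example with made-up database contents (GENE_X is a placeholder symbol).
tgd = {"BRCA1", "TP53"}
gc = {"TP53", "ERBB2", "GENE_X"}
medgene = {"BRCA1", "TP53", "ERBB2", "ESR1"}
toy_rate, toy_missed = false_negative_rate(medgene, [tgd, gc])

# Counts reported in the caption: 26 missed genes out of 285 nonoverlapping genes.
reported_rate = 26 / 285
print(f"toy rate: {toy_rate:.2f}, reported rate: {reported_rate:.3f}")
```

Taking the union of the databases before counting is what makes the denominator the number of nonoverlapping genes, so a gene listed in several databases does not inflate the total.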


BioGene

BioGene is based on a similar concept to MedGene. Instead of disease terms, BioGene searches for associations between human genes and biological themes in the literature (Hu et al. 2003). It allows broader searches using any biological or chemical MeSH term, such as “cell cycle”, “lipids” and “tetrahydrofolates”.

 


Publications

LaBaer J. Mining the literature and large datasets. Nat Biotechnol. 2003 Sep;21(9):976-7. PMID: 12949554

Hu Y, Hines LM, Weng H, Zuo D, Rivera M, Richardson A, LaBaer J. Analysis of genomic and proteomic data using advanced literature mining. J Proteome Res. 2003 Jul-Aug;2(4):405-12. PMID: 12938930

Hu Y, LaBaer J. Tracking gene-disease relationships for high-throughput functional studies. Surgery. 2004 Sep;136(3):504-10. ISSN 0039-6060. PMID: 15349093
