Description

This track shows the genomic locations of sequences found in publications and mapped back to the genome. Article titles are shown when you hover the mouse cursor over a feature. The track is based on the full text of biomedical research articles. The current data consist of about 600,000 files (main text and supplementary files) from PubMed Central (Open-Access set) and around 6 million text files (main text) from the publisher Elsevier (as part of the Sciverse Apps program).

Methods

Files of all types were converted to text, including XML, raw ASCII, PDFs and various office formats (Excel, Word, PowerPoint). The resulting text was processed with programs that find groups of words that look like DNA/RNA sequences or words that look like protein sequences. These sequences were then mapped with BLAT to the human genome and the genomes of nine other model organisms.
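As an illustration of the sequence-detection step, the following is a minimal sketch of how groups of adjacent words that look like DNA/RNA could be pulled out of converted text. It is not the track's actual software: the function name find_dna_like_sequences, the allowed letter set and the 20-letter minimum length are all assumptions made for this example.

```python
import re

# Assumption: a DNA/RNA-like word consists only of nucleotide letters.
DNA_WORD = re.compile(r"^[ACGTUNacgtun]+$")

# Hypothetical threshold: shorter runs are too likely to be ordinary words.
MIN_LENGTH = 20


def find_dna_like_sequences(text, min_length=MIN_LENGTH):
    """Join runs of adjacent DNA-like words and return the long ones."""
    sequences = []
    current = []  # DNA-like words collected in the current run

    def flush():
        # Close the current run and keep it if it is long enough.
        if current:
            candidate = "".join(current)
            if len(candidate) >= min_length:
                sequences.append(candidate)
            current.clear()

    for word in text.split():
        # Drop surrounding punctuation such as commas and brackets.
        token = word.strip(".,;:()[]\"'")
        if token and DNA_WORD.match(token):
            current.append(token.upper())
        else:
            flush()
    flush()
    return sequences


example = "The forward primer was ACGTACGTAC GGTTAACCGG TTAA, as described."
print(find_dna_like_sequences(example))
```

Joining adjacent words matters because sequences in articles are often printed in blocks separated by spaces. A short English word made only of nucleotide letters (for example "a" or "cat") directly before a sequence would be merged into it by this sketch, so a real pipeline would need stricter filtering.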

The software is available from Google Code.

Credits

Software and processing by Maximilian Haeussler. UCSC Track visualisation by Larry Meyer and Hiram Clawson. Elsevier support by Max Berenstein, Raphael Sidi, Scott Robbins and colleagues. The original version was written in the Casey Bergman Lab at the University of Manchester.

Feedback

Please send ideas, comments or feedback on this track to max@soe.ucsc.edu. We are also very interested in getting access to more articles from publishers for this dataset.

References

Haeussler M, Bergman CM. Annotating genes and genomes with sequences extracted from biomedical articles. Bioinformatics. 2011. See also www.text2genome.org

Aerts S, Haeussler M, van Vooren S, Griffith OL, Hulpiau P, Jones SJM, Montgomery SB, Bergman CM, The Open Regulatory Annotation Consortium. Text-mining assisted regulatory annotation. Genome Biol. 2008;9(2):R31.