Biomedical natural language processing

Tools and resources

This page provides various language resources created from the entire available biomedical scientific literature, a text corpus of over five billion words.

Documentation and tools allowing these and similar resources to be recreated are also provided.


Word vectors

Word vectors were induced from PubMed and PubMed Central (PMC) texts, as well as from their combination, using the word2vec tool. The word vectors are provided in the word2vec binary format.

The word vectors are available for download from the following directory:

We also provide a set of word vectors induced from a combination of PubMed and PMC texts with texts extracted from a recent English Wikipedia dump. To get started with word vectors induced from a large corpus of biomedical and general-domain texts, download these vectors here (4GB file).
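For a quick first look at any of these files, the distance tool that ships with the word2vec package lists the nearest neighbours of a query word interactively. A minimal sketch, assuming the word2vec tools have been compiled and using an illustrative name for the downloaded vector file:

# Interactively query nearest neighbours in a word2vec binary file.
# "PubMed-w2v.bin" is a stand-in name; substitute the file you downloaded.
./distance PubMed-w2v.bin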

See below for tools for working with this data.


N-gram counts

We provide counts and probabilities of all n-grams in the 5.5B tokens of the available biomedical literature.

N-gram counts are provided in the simple TAB-separated values (TSV) format used for Google n-grams, easily understood through an example:

Materials and Methods     2012    44094   31834
no significant difference 2012    19033   11898

Each line contains four TAB-separated values: n-gram, year, total-count, and document-count. total-count is the total number of occurrences of the n-gram in the given year, and document-count is the number of documents in which the n-gram occurs in that year.
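Because the format is plain TSV, the files are easy to process with standard tools. As an illustration, the following sums the total occurrences of one n-gram across all years (the file name is a stand-in for one of the downloadable files):

# Sum total-count (column 3) over all years for a given n-gram (column 1).
awk -F'\t' '$1 == "Materials and Methods" { sum += $3 } END { print sum }' 3-grams.tsv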

The n-grams for PubMed abstracts and PubMed Central full-text documents can be downloaded separately from these two directories:

To fetch all files, you can use the filelist files provided in each of these directories, for example as follows (in bash):

wget http://evexdb.org/pmresources/ngrams/PMC/filelist
for url in $(cat filelist); do wget -c "$url"; done

To download only the 5-grams, you can filter the filelist:

for url in $(grep 5-grams filelist); do wget -c "$url"; done

The n-gram files form a large dataset; please avoid unnecessary downloads. The file sizes are as follows:

         PMC    PubMed
1-grams  168MB  236MB
2-grams  1.6GB  2.4GB
3-grams  6GB    8.3GB
4-grams  13GB   16GB
5-grams  18GB   23GB
6-grams  24GB   29GB
7-grams  28GB   33GB


Language model

A smoothed 5-gram language model of the combination of the PubMed and PubMed Central data was produced using the KenLM language modelling package.

You can download the model in the standard ARPA format from this directory. All you need to do is inject the contents of the five .bz2 files (or fewer, if you need a lower-order model) into the model.arpa file. The resulting model can be processed and queried using KenLM, or any other package supporting the ARPA format.
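One way the decompress-and-inject step might look, assuming illustrative file names and that the sections are simply concatenated onto model.arpa in ascending order (check the download directory for the actual file names and layout):

# Decompress each n-gram section and append it to the model skeleton.
# File names are stand-ins for the actual .bz2 files in the directory.
for f in 1-grams.bz2 2-grams.bz2 3-grams.bz2 4-grams.bz2 5-grams.bz2 ; do
    bzcat "$f" >> model.arpa
done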

The file sizes are as follows:

         PubMed+PMC
1-grams  173MB
2-grams  1.8GB
3-grams  7.9GB
4-grams  19GB
5-grams  29GB
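Once assembled, the model can be queried with KenLM's query tool, which reads sentences from standard input and prints per-word log probabilities. A minimal sketch, assuming KenLM has been built and the model fully assembled (the path to the query binary depends on where KenLM was built):

# Score a sentence with the assembled 5-gram model.
echo "no significant difference was observed" | bin/query model.arpa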


Documentation and tools

To re-create the resources available from this page or create similar ones, see the following:

Source data

These resources were derived from the combination of all publication abstracts from PubMed and all full-text documents from the PubMed Central Open Access subset. Together, these literature databases effectively cover the entire available biomedical scientific literature.

Document preprocessing

To create the resources, it is necessary to extract plain text content from the document data, which is distributed in custom XML formats.

We applied the nxml2txt tool to extract plain ASCII text from the PubMed Central .nxml format. This tool is available for download from the nxml2txt GitHub repository.

Word vector tools

The following tools were used to induce word vectors:

We additionally introduced a tool for working with word vectors created by different methods.

word2vec was run using the skip-gram model with a window size of 5, hierarchical softmax training, and a frequent word subsampling threshold of 0.001 to create 200-dimensional vectors. We refer to the word2vec page for an explanation of these parameters and further information.
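For reference, an invocation of the word2vec tool with these settings might look as follows (the input and output file names are stand-ins):

# Skip-gram (-cbow 0), 200 dimensions, window 5, hierarchical softmax
# (-hs 1, -negative 0), subsampling threshold 0.001, binary output.
./word2vec -train corpus.txt -output vectors.bin \
    -cbow 0 -size 200 -window 5 -hs 1 -negative 0 -sample 1e-3 -binary 1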

N-gram tools

The following tools were used to derive n-gram counts and probabilities.
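The underlying computation is straightforward. As a toy illustration (not the actual tool), total 2-gram counts for a whitespace-tokenized corpus with one document per line could be derived as follows:

# Count 2-gram occurrences; "corpus.txt" is a stand-in file name.
# Deriving document-counts would additionally require per-line deduplication.
awk '{ for (i = 1; i < NF; i++) count[$i " " $(i+1)]++ }
END { for (g in count) print g "\t" count[g] }' corpus.txt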


License

All data on this page is made available under the Creative Commons Attribution (CC BY) license. Please attribute this data by citing Pyysalo et al. (2013).


Hosting

Hosting for this data is generously provided by the EVEX project at the University of Turku. Download speeds are throttled on a per-connection basis to 1MB/sec. Please do not bypass this limit using multiple connections.


References