Biomedical natural language processing

Tools and resources

This page provides various language resources created from the entire available biomedical scientific literature, a text corpus of over five billion words.

Documentation and tools allowing these and similar resources to be recreated are also provided.


Word vectors

Word vectors were induced from PubMed and PubMed Central (PMC) texts, as well as from their combination, using the word2vec tool. The word vectors are provided in the word2vec binary format.

The word vectors are available for download from the following directory:

We also provide a set of word vectors induced from a combination of PubMed and PMC texts with texts extracted from a recent English Wikipedia dump. To get started with word vectors induced from a large corpus of biomedical and general-domain texts, download these vectors here (4GB file).
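For a quick first look at any of these files, the distance tool that ships with the word2vec package lists the nearest neighbours of a query word interactively. A minimal sketch, assuming the word2vec tools have been compiled and using an illustrative name for the downloaded vector file:

# Interactively query nearest neighbours in a word2vec binary file.
# "PubMed-w2v.bin" is a stand-in name; substitute the file you downloaded.
./distance PubMed-w2v.bin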

See below for tools for working with this data.


N-gram counts

We provide counts and probabilities of all n-grams in the 5.5B tokens of the available biomedical literature.

N-gram counts are provided in the simple TAB-separated values (TSV) format used for Google n-grams, easily understood through an example:

Materials and Methods     2012    44094   31834
no significant difference 2012    19033   11898

Each line contains four TAB-separated values: n-gram, year, total-count, and document-count. total-count is the total number of occurrences of the n-gram in the given year, and document-count is the number of documents in which the n-gram occurs in that year.
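Because the format is plain TSV, the files are easy to process with standard tools. As an illustration, the following sums the total occurrences of one n-gram across all years (the file name is a stand-in for one of the downloadable files):

# Sum total-count (column 3) over all years for a given n-gram (column 1).
awk -F'\t' '$1 == "Materials and Methods" { sum += $3 } END { print sum }' 3-grams.tsv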

The n-grams for PubMed abstracts and PubMed Central full-text documents can be downloaded separately from these two directories:

To fetch all files, you can use the filelist files provided in each of these directories, for example as follows (in bash):

wget http://evexdb.org/pmresources/ngrams/PMC/filelist
for url in $(cat filelist); do wget -c "$url"; done

To download only the 5-grams, you can filter the filelist:

for url in $(grep 5-grams filelist); do wget -c "$url"; done

The n-gram files form a large dataset; please avoid unnecessary downloads. The file sizes are as follows:

         PMC    PubMed
1-grams  168MB  236MB
2-grams  1.6GB  2.4GB
3-grams  6GB    8.3GB
4-grams  13GB   16GB
5-grams  18GB   23GB
6-grams  24GB   29GB
7-grams  28GB   33GB


Language model

A smoothed 5-gram language model of the combination of the PubMed and PubMed Central data was produced using the KenLM language modelling package.

You can download the model in the standard ARPA format from this directory. All you need to do is inject the contents of the five .bz2 files (or fewer, if you need a lower-order model) into the model.arpa file. The resulting model can be processed and queried using KenLM, or any other package supporting the ARPA format.
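One way the decompress-and-inject step might look, assuming illustrative file names and that the sections are simply concatenated onto model.arpa in ascending order (check the download directory for the actual file names and layout):

# Decompress each n-gram section and append it to the model skeleton.
# File names are stand-ins for the actual .bz2 files in the directory.
for f in 1-grams.bz2 2-grams.bz2 3-grams.bz2 4-grams.bz2 5-grams.bz2 ; do
    bzcat "$f" >> model.arpa
done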

The file sizes are as follows:

         PubMed+PMC
1-grams  173MB
2-grams  1.8GB
3-grams  7.9GB
4-grams  19GB
5-grams  29GB
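Once assembled, the model can be queried with KenLM's query tool, which reads sentences from standard input and prints per-word log probabilities. A minimal sketch, assuming KenLM has been built and the model fully assembled (the path to the query binary depends on where KenLM was built):

# Score a sentence with the assembled 5-gram model.
echo "no significant difference was observed" | bin/query model.arpa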


Documentation and tools

To re-create the resources available from this page or create similar ones, see the following:

Source data

These resources were derived from the combination of all publication abstracts from PubMed and all full-text documents from the PubMed Central Open Access subset. Together, these literature databases effectively cover the entire available biomedical scientific literature.

Document preprocessing

To create the resources, it is necessary to extract plain text content from the document data, which is distributed in custom XML formats.

We applied the nxml2txt tool to extract plain ASCII text from the PubMed Central .nxml format. This tool is available for download from the nxml2txt GitHub repository.

Word vector tools

The following tools were used to induce word vectors:

We additionally introduced a tool for working with word vectors created by different methods.

word2vec was run using the skip-gram model with a window size of 5, hierarchical softmax training, and a frequent word subsampling threshold of 0.001 to create 200-dimensional vectors. We refer to the word2vec page for an explanation of these parameters and further information.
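For reference, an invocation of the word2vec tool with these settings might look as follows (the input and output file names are stand-ins):

# Skip-gram (-cbow 0), 200 dimensions, window 5, hierarchical softmax
# (-hs 1, -negative 0), subsampling threshold 0.001, binary output.
./word2vec -train corpus.txt -output vectors.bin \
    -cbow 0 -size 200 -window 5 -hs 1 -negative 0 -sample 1e-3 -binary 1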

N-gram tools

The following tools were used to derive n-gram counts and probabilities.
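The underlying computation is straightforward. As a toy illustration (not the actual tool), total 2-gram counts for a whitespace-tokenized corpus with one document per line could be derived as follows:

# Count 2-gram occurrences; "corpus.txt" is a stand-in file name.
# Deriving document-counts would additionally require per-line deduplication.
awk '{ for (i = 1; i < NF; i++) count[$i " " $(i+1)]++ }
END { for (g in count) print g "\t" count[g] }' corpus.txt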


License

All data on this page is made available under the Creative Commons Attribution (CC BY) license. Please attribute this data by citing Pyysalo et al. (2013).


Hosting

Hosting for this data is generously provided by the EVEX project at the University of Turku. Download speeds are throttled on a per-connection basis to 1MB/sec. Please do not bypass this limit using multiple connections.


References