This page provides various language resources created from the entire available biomedical scientific literature, a text corpus of over five billion words.
- Word vectors: vector representations of words
- N-gram counts: counts of word sequence occurrences
- Language models: models of word sequence probabilities
Documentation and tools allowing these and similar resources to be recreated are also provided.
Word vectors
Word vectors were induced from PubMed texts, PMC texts, and their combination using the word2vec tool. The word vectors are provided in the word2vec binary format.
The word vectors are available for download from the following directory:
We also provide a set of word vectors induced on the combination of PubMed and PMC texts together with texts extracted from a recent English Wikipedia dump. To get started with word vectors induced from a large corpus of biomedical and general-domain text, download these vectors here (4GB file).
See below for tools for working with this data.
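For a quick first look at the downloaded vectors, the distance tool that ships with word2vec can query nearest neighbors interactively from a binary vectors file; in this sketch, vectors.bin is a placeholder for the name of the downloaded file:
# List the nearest neighbors of a query word interactively.
# vectors.bin is a placeholder for the downloaded vectors file.
./distance vectors.bin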
N-gram counts
Counts and probabilities of all n-grams from the 5.5B tokens of the available biomedical literature.
N-gram counts are provided in the simple TAB-separated values (TSV) format used for Google n-grams, which is easily understood through an example:
Materials and Methods 2012 44094 31834
no significant difference 2012 19033 11898
Each line contains four TAB-separated values: n-gram, year, total-count, and document-count. total-count is the total number of occurrences of the n-gram in the given year, and document-count is the number of documents in which the n-gram appears in that year.
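For example, the occurrences of a single n-gram can be summed across years with a short awk command; here 3-grams.tsv is a placeholder for whichever count file you have downloaded (decompressed first, if necessary):
# Sum the total-count column (field 3) over all years for one 3-gram.
# 3-grams.tsv is a placeholder for a downloaded, decompressed count file.
awk -F'\t' '$1 == "no significant difference" { total += $3 } END { print total }' 3-grams.tsv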
The n-grams for PubMed abstracts and PubMed Central full-text documents can be downloaded separately from these two directories:
To fetch all files, you can use the filelist files provided in each of these directories, for example as follows (in bash):
wget http://evexdb.org/pmresources/ngrams/PMC/filelist
for url in `cat filelist` ; do wget -c $url ; done
To only download 5-grams, you can filter the filelist:
for url in `cat filelist | grep 5-grams` ; do wget -c $url ; done
The n-gram counts are a large dataset; please avoid unnecessary downloads. The file sizes are as follows:
| | PMC | PubMed |
| --- | --- | --- |
| 1-grams | 168MB | 236MB |
| 2-grams | 1.6GB | 2.4GB |
| 3-grams | 6GB | 8.3GB |
| 4-grams | 13GB | 16GB |
| 5-grams | 18GB | 23GB |
| 6-grams | 24GB | 29GB |
| 7-grams | 28GB | 33GB |
Language model
A smoothed 5-gram language model of the combination of the PubMed and PubMed Central data was produced using the KenLM language modelling package.
You can download the model in the standard ARPA format from this directory. All you need to do is concatenate the decompressed contents of the five .bz2 files (or fewer, if you need a lower-order model) into the model.arpa file. The resulting model can be processed and queried using KenLM, or any other package supporting the ARPA format.
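The following sketches this assembly and a first query, assuming the per-order files decompress with bzcat and that KenLM has been built with its query binary under bin/; the .bz2 file names are placeholders, so check the download directory for the actual ones:
# Append the decompressed per-order files to the downloaded model.arpa.
# The .bz2 file names are placeholders for those in the download directory.
for f in 1-grams.bz2 2-grams.bz2 3-grams.bz2 4-grams.bz2 5-grams.bz2 ; do bzcat $f >> model.arpa ; done
# Score one sentence per input line with KenLM's query tool.
echo "no significant difference" | bin/query model.arpa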
The file sizes are as follows:
| | PubMed+PMC |
| --- | --- |
| 1-grams | 173MB |
| 2-grams | 1.8GB |
| 3-grams | 7.9GB |
| 4-grams | 19GB |
| 5-grams | 29GB |
Documentation and tools
To re-create the resources available from this page or create similar ones, see the following:
Source data
These resources were derived from the combination of all publication abstracts from PubMed and all full-text documents from the PubMed Central Open Access subset. Together, these literature databases effectively cover the entire available biomedical domain scientific literature.
Document preprocessing
To create the resources, it is necessary to extract plain text content from the document data, which is distributed in custom XML formats.
We applied the nxml2txt tool to extract plain ASCII text from the PubMed Central .nxml format. This tool is available for download from the nxml2txt GitHub repository.
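A hypothetical invocation is sketched below; the exact command-line interface may differ, so consult the repository README:
# Hypothetical invocation: convert a single PMC .nxml document to plain text.
# Check the nxml2txt README for the actual arguments and outputs.
./nxml2txt article.nxml article.txt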
Word vector tools
The following tools were used to induce word vectors:
- word2vec by Tomas Mikolov and colleagues at Google.
- random indexing tools by Martin Duneld.
We additionally provide a tool for working with word vectors created by different methods:
- wvlib word vector library
word2vec was run using the skip-gram model with a window size of 5, hierarchical softmax training, and a frequent word subsampling threshold of 0.001 to create 200-dimensional vectors. We refer to the word2vec page for explanation of these parameters and further information.
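The invocation below sketches this configuration using the standard word2vec command-line flags; corpus.txt and vectors.bin are placeholder file names:
# Train 200-dimensional skip-gram vectors (-cbow 0) with a window of 5,
# hierarchical softmax (-hs 1, -negative 0), and a 0.001 subsampling
# threshold, writing output in the word2vec binary format.
./word2vec -train corpus.txt -output vectors.bin -cbow 0 -size 200 -window 5 -hs 1 -negative 0 -sample 0.001 -binary 1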
N-gram tools
The following tools were used to derive n-gram counts and probabilities.
License
All data on this page is made available under the Creative Commons Attribution (CC BY) license. Please attribute this data by citing Pyysalo et al. (2013).
Hosting
Hosting for this data is generously provided by the EVEX project at the University of Turku. Download speeds are throttled on a per-connection basis to 1MB/sec. Please do not bypass this limit using multiple connections.
References
- Sampo Pyysalo, Filip Ginter, Hans Moen, Tapio Salakoski and Sophia Ananiadou (2013). Distributional Semantics Resources for Biomedical Text Processing. In Proceedings of LBM 2013.