So you want to get some sequencing data in NCBI?¶
This is a set of tutorials for working with the NCBI and MG-RAST databases – specifically, to download project specific information.
First, let’s think about how these databases are structured. I am going to create a database for folks to deposit whole genome sequences. What kind of information am I going to store in this? Many of you may be familiar with such a database, hosted by the NCBI. The scripts that complement this tutorial can be downloaded with the following:
git clone https://github.com/adina/scripts-for-ngs2014.git
Let’s come up with a list of things we’d like stored in this database and discuss some of the challenges involved in database creation, management, and access.
So, you’ve been given a list of genomes and been asked to create a phylogenetic tree of these genomes. How big would this list be before you thought about hiring an undergraduate to download sequences?
Say the list is only 3 genomes:
CP000962 CP000967 CP000975
The following are some ways with which I’ve used to grab genome sequences:
- Go to the web portal and look up each FASTA
- Go to the FTP site, find each genome, and download manually
- Use the NCBI Web Services API to download the data
Among these, I’m going to assume many of you are familiar with the first two. This tutorial then is going to focus on using APIs.
What is an API and how does it relate to NCBI?¶
Here’s some answers, among which my favorite is “an interface through which you access someone else’s code or through which someone else’s code accesses yours – in effect the public methods and properties.”
The NCBI has a whole toolkit which they call Entrez Programming Utilities or eutils for short. You can read all about it in the documentation. There are a lot of things you can do to interface with all things NCBI, including publications, etc., but I am going to focus today on downloading sequencing data.
To do this, you’re going to be using one tool in eutils, called efetch. There is a whole chapter devoted to efetch – when I first started doing this kind of work, this documentation always broke my heart. Its easier for me to just show you how to use it.
Open a web browser, and try the following URL to download the nucleotide genome for CB00962:
Note that the NCBI knows a lot about this genome. Check it out here.
If I want to access other kinds of data associated with this genome, I would try the following command:
Do you notice the difference in these two commands? Let’s breakdown the command here:
- <http://eutils.ncbi.nlm.nih.gov/entrez/eutils/efetch.fcgi?> This is command telling your computer program (or your browser) to talk to the NCBI API tool efetch.
- <db=nuccore> This command tells the NCBI API that you’d like it to look in this particular database for some data. Other databases that the NCBI has available can be found here.
- <id=CP000962> This command tells the NCBI API efetch the ID of the genome you want to find.
- <rettype=gb&retmode=text> These two commands tells the NCBI how the data is returned. You’ll note that in the two examples above this command varied slightly. In the first, we asked for only the FASTA sequence, while in the second, we asked for the Genbank file. Here’s some elusive documentation on where to find these “return” objects.
Also, a useful command is also <version=1>. There are different versions of sequences and some times that is useful. For reproducibility, I try to specify versions in my queries, see these comments.
Notice the “&” that comes between each of these little commands, it is necessary and important.
Automating with an API¶
Ok, let’s think of automating this sort of query.
In the shell, you could run the same commands above with the addition of curl on your EC2 instance:
You’ll see it fly on to your screen. Don’t panic - you can save it to a file and make it more useful.:
curl "http://eutils.ncbi.nlm.nih.gov/entrez/eutils/efetch.fcgi?db=nuccore&id=CP000962&rettype=fasta&retmode=text" > CP000962.fa
You could now imagine writing a program where you made a list of IDs you want to download and put it in a for loop, curling each genome and saving it to a file. The following is a script. Thanks to Jordan Fish who gave me the original version of this script before I even knew how and made it easy to use.
To see the documentation for this script:
You’ll see that you need to provide a list of IDs and a directory where you want to save the downloaded files.
To run the script:
/usr/bin/python fetch-genomes.py interesting-genomes.txt genbank-files
You may want to run this on just a few of these IDs to begin with. You can create a smaller list using the head command with the -n parameter in the shell. For example, head -n 3 interesting-genomes.txt > 3genomes.txt.
Let’s take a look inside this script. The meat of this script uses the following code:
url_template = "http://eutils.ncbi.nlm.nih.gov/entrez/eutils/efetch.fcgi?db=nucleotide&id=%s&rettype=gb&retmode=text"
You’ll see that the id here is a string character which is obtained from list of IDs contained in a separate file. The rest of the script manages where the files are being placed and what they are named. It also prints some output to the screen so you know its running.
Exercise - Downloading data¶
Try modifying the fetch_genomes.py script to download just the FASTA sequences of the genes.
Running this script should allow you to download genomes to your heart’s content. But how do you grab specific genes from this data then? Specifically, the challenge was to make a phylogenetic tree of sequences, so let’s target the conserved bacterial gene, 16S ribosomal RNA gene.
Comment on Genbank files¶
Genbank files have a special structure to them. You can look at it and figure it out for the most part, or read about it in detail here. To find out if your downloaded Genbank files contain 16S rRNA genes, I like to run the following command:
This should look somewhat familiar from your shell lesson, but basically we’re looking for anylines that contain the character “16S” in any Genbank file we’ve downloaded. Note that you’ll have to run this in the directory where you downloaded these files.
The structure of the Genbank file allows you to identify 16S genes. For example,
You could write code to find text like ‘rRNA’ and ‘/product=”16S ribosomal RNA”’, grab the location of the gene, and then go to the FASTA file and grab these sequences. I’ve done that before.
You could also use existing packages to parse Genbank files. I have the most experience with BioPython. To begin with, let’s just use BioPython so you can get to using existing scripts without writing scripts.
First, we’ll have to install BioPython on your instance and they’ve made that pretty easy:
Fan Yang (Iowa State University) and I wrote a script to extract 16S rRNA sequences from Genbank files, here. It basically searches for text strings in the Genbank structure that is appropriate for these particular genes. You can read more about BioPython here and its Genbank parser here.
To run this script on the Genbank file for CP000962:
The resulting output file contains all 16S rRNA genes from the given Genbank file.
To run this for multiple files, I use a shell for loop:
There are multiple ways to get this done – but this is how I like to do it.
And there you have it, you can now pretty much automatically grab 16S rRNA genes from any number of genomes in NCBI databases.