Download the latest version of the NCBI SRA Toolkit. Untar or unzip downloaded toolkit file. The “prefetch” is a command line tool available in the SRA toolkit. It takes the cart file as an argument and download the files selected in the previous step. Erro while converting.SRA File to.fastq File with SRA Toolkit Hi, everyone, I have download 6 paired end sequencing data from Ncib, and I am trying to convert. Trimming by fastp for.

What is NCBI Sequence Read Archive (SRA) Toolkit?

  • NCBI SRA toolkit is a set of utilities to download, view and search large volume of high-throughput sequencing data from NCBI SRA database at faster speed

Applications

  • Effectively download the large volume of high-throughput sequencing data (eg. FASTQ, SAM)
  • Convert SRA file into other biological file format (eg. FASTA, ABI, SAM, QSEQ, SFF)
  • Retrieve small subset of large files (eg. sequences, alignment)
  • Search within SRA files and fetch specific sequences

Download and install NCBI SRA toolkit

Download SRA datasets using NCBI SRA toolkit

Note: Current SRA toolkit version 2.10.8 does not support Aspera client (ascp). Even though ascp can run with olderversions, it will download the data by https mode and not by FASP mode.

NOTE: With fastq-dump and fasterq-dump, prefetch step is unncessary and you can directly download sequence data in FASTQ format

Batch download SRA datasets

  • Sometimes, we need to download hundreds or thousands of FASTQ files from the SRA database and it would be inconvenient to directly use the SRA toolkit for batch download
  • I have added a wrapper script for fasterq-dump in bioinfokit (v0.9.7 or later) for easy downloadof a large number of FASTQ files from the SRA database
  • Check bioinfokit documentation for installation and documentation
  • Download test SRA accession file containing accessions for both single and paired-end FASTQ datasets

Validation of downloaded SRA data integrity

It is essential to check the integrity and checksum of SRA datasets to ensure successful download

Customized download of SRA datasets

You can use SRA tools for customized output of large SRA datasets without downloading complete datasets (NOTE: some options are not available in fasterq-dump)

Convert SRA data into other biological formats

SRA tools allow you to convert SRA files into FASTA, ABI, Illumina native (QSEQ), and SFF format

Search within SRA files

You can search specific sequences or subset of sequences in SRA files

NOTE: For every SRA tools, you can check all options by providing -h parameter (eg. fasterq-dump -h)

If you have any questions, comments or recommendations, please email me at reneshbe@gmail.com

Last updated: September 18, 2020


This work is licensed under a Creative Commons Attribution 4.0 International License.

Attention Conservation Notice: This post explains how to run the exceptionally fast RNA-seq k-mer aligner kallisto from the Pachter lab on data you download from NCBI’s Short Read Archive, and then analyze it for differential expression using voom/limma. As with everything in bioinformatics, this will likely be obsolete in months, if not weeks.

Kallisto is really fast and non-memory intensive, without sacrificing accuracy (at least, according to their paper), and therefore has the potential to make your life a lot easier when it comes to analyzing RNA-seq data.

As a test data set, I used the very useful SRA DNA Nexus to search the SRA database for a transcriptome study from human samples with 2-3 biological replicates in 2 groups, so that I could achieve more robust differential expression calls without having to download too much data.

I ended up using SRP006900, which is titled “RNA-seq of two paired normal and cancer tissues in two stage III colorectal cancer patients.” I actually wasn’t able to find a study describing it for differential expression analysis in a 3-5 min search, but since it was public almost four years ago, I’m sure that it has been analyzed in such a manner. It is single-end reads with an average of 65 base pairs per read.

In order to download this data set (which totals ~ 10 GB over its four files, in .fastq format) from the SRA via the command line, I used the SRA toolkit. You’ll want to make sure this is downloaded correctly by running a test command such as the following on the command line:

You’ll also want to download kallisto. In order to install it, I copied it to my usr/local/bin/ folder to make it globally executable, as described on the kallisto website.

You also need to download the fasta file for your transcriptome of interest and create an index file that kallisto can work on with. You can download the hg38 reference transcriptome from the kallisto website or from UCSC.

For pre-reqs, it is helpful to have homebrew installed as a package manager. If you do (or once you do), these three commands did the trick for me:

Next you’ll need to actually download the data from each run, convert them to .fastq, and then run kallisto on them. For extensibility, I did this in the following shell script.

First, a config file with the sample names and average read length (will be needed for kallisto if the data is single-end, in my experience):

(Update 5/13/15: The average read length should be average fragment length, which I’m working on figuring out the best way of estimating from single-end reads (if one is possible). In the meantime, it may be easier to choose a paired-end read sample from SRA instead, from which the average fragment length can be estimated by kallisto. Thanks to Hani Goodarzi for the helpful pointer that I was using the wrong parameter.)

Download Sra Toolkit Mac Terminal Mac

Next, I use a shell script to execute the relevant commands on these sample names to build your abundance.txt files. Note that this data set has single-end reads, so you only call one .fastq file, rather than the two in their example. I put each output in a file called output_$RUNNAME that I can call from R for downstream analyses. Here’s the full shell script (also github):

Download Sra Toolkit Mac Terminal

Once you have these files, you can analyze them in R using limma. Here’s how I went about it. First, read in the files and merge the transcripts per million (tpm) calls into one data frame:

Then make a contrasts matrix for normal vs control samples:

Then compute differential expression using that contrast matrix:

And that’s it. Interestingly, the transcript called as the #1 most differentially expressed using this method, ATP5A1, has been previously implicated as a marker for colon cancer.

Sra Toolkit Tutorial

Update 5/13/15: Fixed two typos, switched one link, and switched “genome” to “transcriptome” (I was sloppy previously) per @pmelsted’s useful comment on twitter.

Download Sra Toolkit

Update 8/10/15: As Seb Battaglia points out on twitter, “voom assumes you cut out out 0s and low intensity tpm. I’d add that step or data hist() will be skewed.” I don’t have the opportunity to re-run the analyses right now (!), but please take this into consideration.