Download the latest version of the NCBI SRA Toolkit. Untar or unzip downloaded toolkit file. The “prefetch” is a command line tool available in the SRA toolkit. It takes the cart file as an argument and download the files selected in the previous step. Erro while converting.SRA File to.fastq File with SRA Toolkit Hi, everyone, I have download 6 paired end sequencing data from Ncib, and I am trying to convert. Trimming by fastp for.
Note: Current SRA toolkit version 2.10.8 does not support Aspera client (ascp). Even though ascp
can run with olderversions, it will download the data by https
mode and not by FASP
mode.
NOTE: With fastq-dump
and fasterq-dump
, prefetch
step is unncessary and you can directly download sequence data in FASTQ format
fasterq-dump
in bioinfokit
(v0.9.7 or later) for easy downloadof a large number of FASTQ files from the SRA databaseIt is essential to check the integrity and checksum of SRA datasets to ensure successful download
You can use SRA tools for customized output of large SRA datasets without downloading complete datasets (NOTE: some options are not available in fasterq-dump
)
SRA tools allow you to convert SRA files into FASTA, ABI, Illumina native (QSEQ), and SFF format
You can search specific sequences or subset of sequences in SRA files
NOTE: For every SRA tools, you can check all options by providing -h
parameter (eg. fasterq-dump -h
)
If you have any questions, comments or recommendations, please email me at reneshbe@gmail.com
Last updated: September 18, 2020
This work is licensed under a Creative Commons Attribution 4.0 International License.
Attention Conservation Notice: This post explains how to run the exceptionally fast RNA-seq k-mer aligner kallisto from the Pachter lab on data you download from NCBI’s Short Read Archive, and then analyze it for differential expression using voom/limma. As with everything in bioinformatics, this will likely be obsolete in months, if not weeks.
Kallisto is really fast and non-memory intensive, without sacrificing accuracy (at least, according to their paper), and therefore has the potential to make your life a lot easier when it comes to analyzing RNA-seq data.
As a test data set, I used the very useful SRA DNA Nexus to search the SRA database for a transcriptome study from human samples with 2-3 biological replicates in 2 groups, so that I could achieve more robust differential expression calls without having to download too much data.
I ended up using SRP006900, which is titled “RNA-seq of two paired normal and cancer tissues in two stage III colorectal cancer patients.” I actually wasn’t able to find a study describing it for differential expression analysis in a 3-5 min search, but since it was public almost four years ago, I’m sure that it has been analyzed in such a manner. It is single-end reads with an average of 65 base pairs per read.
In order to download this data set (which totals ~ 10 GB over its four files, in .fastq format) from the SRA via the command line, I used the SRA toolkit. You’ll want to make sure this is downloaded correctly by running a test command such as the following on the command line:
You’ll also want to download kallisto. In order to install it, I copied it to my usr/local/bin/ folder to make it globally executable, as described on the kallisto website.
You also need to download the fasta file for your transcriptome of interest and create an index file that kallisto can work on with. You can download the hg38 reference transcriptome from the kallisto website or from UCSC.
For pre-reqs, it is helpful to have homebrew installed as a package manager. If you do (or once you do), these three commands did the trick for me:
Next you’ll need to actually download the data from each run, convert them to .fastq, and then run kallisto on them. For extensibility, I did this in the following shell script.
First, a config file with the sample names and average read length (will be needed for kallisto if the data is single-end, in my experience):
(Update 5/13/15: The average read length should be average fragment length, which I’m working on figuring out the best way of estimating from single-end reads (if one is possible). In the meantime, it may be easier to choose a paired-end read sample from SRA instead, from which the average fragment length can be estimated by kallisto. Thanks to Hani Goodarzi for the helpful pointer that I was using the wrong parameter.)
Next, I use a shell script to execute the relevant commands on these sample names to build your abundance.txt files. Note that this data set has single-end reads, so you only call one .fastq file, rather than the two in their example. I put each output in a file called output_$RUNNAME that I can call from R for downstream analyses. Here’s the full shell script (also github):
Once you have these files, you can analyze them in R using limma. Here’s how I went about it. First, read in the files and merge the transcripts per million (tpm) calls into one data frame:
Then make a contrasts matrix for normal vs control samples:
Then compute differential expression using that contrast matrix:
And that’s it. Interestingly, the transcript called as the #1 most differentially expressed using this method, ATP5A1, has been previously implicated as a marker for colon cancer.
Update 5/13/15: Fixed two typos, switched one link, and switched “genome” to “transcriptome” (I was sloppy previously) per @pmelsted’s useful comment on twitter.
Update 8/10/15: As Seb Battaglia points out on twitter, “voom assumes you cut out out 0s and low intensity tpm. I’d add that step or data hist() will be skewed.” I don’t have the opportunity to re-run the analyses right now (!), but please take this into consideration.