Quality assessment of this mRNA Transcript Assembly Software
is described in
Too many transcript assemblies is much better than too few. It allows one then to
apply biological criteria to pick out the best ones. Don't be misled by the
"right number" of transcripts that one or other transcript assembler may produce.
It is the "right sequence" you want, and now the only way to get it is to produce
way too many assemblies on a good RNA data set, with several methods and
several parameter settings.
EvidentialGene tr2aacds.pl is my new, "easy to use" pipeline script for processing
large piles of transcript assemblies, from several methods such as Velvet/O, Trinity,
Soap, .., into the most biologically useful "best" set of mRNA,
classified into primary and alternate transcripts.
It takes as input the transcript fasta produced by any/all of the transcript assemblers,
although just now one should regularize those fasta IDs, which I do separately.
The output is a neat pile of "okay" and "drop"ped transcripts, the okay pile is
close to a biologically real set regardless of how many millions of input assemblies
you start with.
This packages my current methods and does a fair job on all the several transcriptomes
I've tested for plants and animals.
It solves 2 major problems with the protein CD-HIT method I recommended before:
1. That does not classify alternate transcripts of same locus properly,
e.g. alternates may differ in protein quite a bit, but share identical exon parts.
2. That removed paralogs with high identity in protein sequence, which is a problem
for some very interesting gene families. E.g. silent codon3 changes are lost.
How To get Best mRNA Transcript assemblies ,
Please read this brief How-To document that
summarizes my tests on best transcript assembly methods. It points out
tips for assembly parameters, such as using scaffolding and multi-kmer
settings (for Velvet, Soap, others that allow; not Trinity), that will
improve your transcript assemblies.
This really matters:
PLEASE, DON'T SELECT BY LONGEST TRANSCRIPT, as with CD-HIT-EST (a common mistake now, even
by users of Evigene)
Also review this recent comparison of gene assembler accuracy,
Best assembly methods compared for mosquito genes, 2016
EvidentialGene tr2aacds mRNA classifier description
Classification is based mainly on CDS-dna local alignment identity.
Perfect fragment CDS are dropped, those with some CDS base differences are
kept, with longest CDS as primary transcript. UTR identity is ignored (for now) because
many of the mis-assemblies are from joined/mangled genes in UTR region.
Required additional software:
You supply the transcript assemblies, an over-assembly of 100s per locus is desired.
- fastanrdb of exonerate package,
quickly reduces perfect duplicate sequences
- cd-hit, cd-hit-est
clusters protein or nucleotide sequences.
- blastn and makeblastdb of
Basic Local Alignment Search Tool, finds regions of local similarity between sequences.
tr2aacds pipeline algorithm
Prerequisite is that you create transcript assemblies (with any/all useful
methods). This software reads all the transcripts.fasta you have, chews on them
and puts them into good and bad piles (with extras).
0. collect input transcripts.tr,
You supply input transcript sequences in .fasta, an over-assembly with redundant and
variable quality transcripts, as one file. Please consider trformat.pl to regularize IDs,
which are critical to tracking inputs.
tr2aacds produces .cds, .aa sequences from .tr, working mostly on the CDS.
1. perfect redundant removal: exonerate/fastanrdb input.cds > input_nr.cds,
and protein qualities are used for choosing among cds identicals.
2. perfect fragment removal: cd-hit-est -c 1.0 -l $MINCDS ..
3. blastn, basic local align hi-ident subsequences for alternate tr.,
with -perc_identity CDSBLAST_IDENT (98%), to find high-identity exon-sized alignments.
4. classify main/alternate cds, okay & drop subsets, using evigene/rnaseq/asmrna_dupfilter2.pl
.. merges alignment table, protein-quality and identity, to score okay-main, ok-alt, and drop sets.
5. final output files from outclass: okay-main, okay-alts, drops
okayset is for public consumption, drops for data-overload enthusiasts (may contain valids).
evgmrna2tsa.pl is next step, to become part of tr2aacds2, for final publicset/
Other EvidentialGene scripts for trassembly
Use BEFORE tr2aacds to regularize IDs in fasta of Velvet,Soap,Trinity, ensure unique IDs,
add prefixes for parameter sets.
Use AFTER tr2aacds on okayset, add gene function names from UniProt-Ref and CDD blastp.
deltablast -rpsdb $cddb -show_domain_hits -outfmt 7 -db $refdb -query $name.allok.aa -out $name.deblastp
namegenes.pl -cddnames=info.cdd.txt -refnames $refdb.names -blast $name.deblastp
UniVec vector screening and NNN-end trimming, per NCBI or INSDC desires
which describes steps, including asmrna_trimvec, namegenes as parts of evgmrna2tsa pipeline.
make public mRNA gene set, with pubIDs, main/alternates, names and annotation,
and Genbank TSA format for public submission (needs work to simplify/separate parts)
EvidentialGene software is at
One way to get and update is:
with a standard ftp client, wget, curl and some web browsers. Alternatively, find tar archives at
Then extract the tar archive
tar -xf evigene.tar into current folder, preserving run permission.
Run the Perl ".pl" scripts from extracted evigene folder, as they are a package.
E.g., for Unix bash shell:
$evigene/scripts/prot/tr2aacds.pl ..; $evigene/scripts/evgmrna2tsa.pl .. ;
For Unix csh/tcsh, "set evigene=`pwd`/evigene".
Most of the shell ".sh" scripts require editing for your cluster; consider them examples.
These scripts have brief -help, but most of their documentation is perl POD; read the scripts.
This is a complex package, including my working scripts for several genome projects, some
are obsolete now. One needs a cheat-sheet to make sense of what is good and I will work
Please first try this test case with small input data (TAIR10 mRNA) and tr2aacds outputs,
It is worth your time to set up and run this with same input data to see that you get same results.
CLUSTER Run Script
Run on cluster with qsub/PBS batch scheduler as:
env trset=banana1all3.tr datad=`pwd` qsub -q batch tr2aacds_qsub.sh
## env trset=banana1all3.tr datad=`pwd` qsub -q normal tr2aacds_qsub.sh
#PBS -N tr2cds
#PBS -l nodes=1:ppn=32,walltime=18:55:00
#PBS -o tr2cds.$$.out
#PBS -e tr2cds.$$.err
maxmem=50000 # in Megabytes
## Required software
#t2ac: app=cd-hit-est, path= echo MISSING_cd-hit-est
#t2ac: app=fastanrdb, path= echo MISSING_fastanrdb
#t2ac: app=blastn, path= echo MISSING_blastn
if [ "X" = "X$trset" ]; then
echo "missing env trset=xxxx.tr"; exit -1
if [ "X" = "X$datad" ]; then
echo "missing env datad=/path/to/data"; exit -1
$evigene/prot/tr2aacds.pl -NCPU $ncpu -MAXMEM $maxmem -log -cdna $trset
NOTE: main CPU cost is blastn-self, which grows with number of non-redundant input transcripts.
for 1 Mill, 32 cpu, may take 1-2hr;
for 10 Mill, "", may take 10+ hr (esp w/ many hi-identity alternates).
Banana tree example summary
#t2ac: EvidentialGene tr2aacds.pl VERSION 2013.03.11
#t2ac: app=blastn, path=/home/ux455375/bio/ncbi2227/bin/blastn
#t2ac: app=makeblastdb, path=/home/ux455375/bio/ncbi2227/bin/makeblastdb
#t2ac: app=fastanrdb, path=/home/ux455375/bio/exonerate/bin/fastanrdb
#t2ac: app=cd-hit-est, path=/home/ux455375/bio/cdhit461/bin/cd-hit-est
#t2ac: app=cd-hit, path=/home/ux455375/bio/cdhit461/bin/cd-hit
#t2ac: BEGIN with cdnaseq= banana1all3.tr.gz date= Mon Mar 11 16:52:43 PDT 2013
#t2ac: bestorf_cds= banana1all3.cds nrec=1806602 # 1.8 Mill input transcripts
#t2ac: nonredundant_cds= banana1all3nr.cds nrec=1262705 # 30% redundant identicals; 70% differ in CDS
#t2ac: nonredundant_reassignbest= 0 of 0
#t2ac: nofragments_cds= banana1all3nrcd1.cds nrec=873521 # 30% nr-CDS are perfect fragments of others
leaving 48% informative CDS of 1.8 Mill input
#t2ac: blastn_cds= banana1all3nrcd1-self95.blastn # Blastn for near-perfect Basic Local Alignments
#t2ac: asmdupfilter_cds= banana1all3.trclass # interpret blastn and protein identities to
classify into categories below.
main = primary transcript w/ alternates;
alt = alternate transcript, with main identified;
noclass = primary with no alternates;
"hi" = high identity (>=98% dna);
"hi1" (very hi identity) and "a2" (protein identity) are subclasses
"mid" and "mfrag" are lower identity, testing w/ genome mapping
says these are more often paralogs than alternates.
part = fragment alternates
"okay" and "drop" are partitions of the classes to keep and ignore.
# Class Table for banana1all3.trclass
class %okay %drop okay drop
althi 11.5 12.6 100791 110786
althi1 0.9 3.1 7911 27333
althia2 0 12.3 0 108082
altmfrag 2.5 2.7 22579 24216
altmfraga2 0.5 0.4 5068 4002
altmid 9.7 2.7 84966 24036
altmida2 2.4 0.4 21269 3913
main 5.2 4.7 45621 41299
maina2 0.4 0.2 4023 2534
noclass 0.9 8 8710 70025
noclassa2 0 0 44 150
parthi 0 12.4 0 108619
parthi1 0 1.3 0 12053
parthia2 0 4 0 35439
total 34.4 65.5 300982 572487 << 300K okay transcripts in 58,398 loci
# AA-quality for okay set of banana1all3.aa.qual (no okalt)
okay.top n=1000; average=1212; median=1091; min,max=894,5104; gaps=2216,2.2%
okay.all n=58398; average=237; median=161; min,max=40,5104; gaps=156841,2.6%
banana1all3.drop.tr banana1all3.drop.aa banana1all3.drop.cds