euGenes/Arthropods About Arthropods EvidentialGene DroSpeGe

EvidentialGene: tr2aacds, mRNA Transcript Assembly Software

Don Gilbert, gilbertd At indiana edu, 2013 April

About evigene/scripts/prot/tr2aacds.pl

Quality assessment of this mRNA Transcript Assembly Software is described in EvidentialGene_quality.

Too many transcript assemblies is much better than too few. It allows one then to apply biological criteria to pick out the best ones. Don't be misled by the "right number" of transcripts that one or other transcript assembler may produce. It is the "right sequence" you want, and now the only way to get it is to produce way too many assemblies on a good RNA data set, with several methods and several parameter settings.

EvidentialGene tr2aacds.pl is my new, "easy to use" pipeline script for processing large piles of transcript assemblies, from several methods such as Velvet/O, Trinity, Soap, .., into the most biologically useful "best" set of mRNA, classified into primary and alternate transcripts.

It takes as input the transcript fasta produced by any/all of the transcript assemblers, although just now one should regularize those fasta IDs, which I do separately. The output is a neat pile of "okay" and "drop"ped transcripts, the okay pile is close to a biologically real set regardless of how many millions of input assemblies you start with.

This packages my current methods and does a fair job on all the several transcriptomes I've tested for plants and animals. It solves 2 major problems with the protein CD-HIT method I recommended before:
1. That does not classify alternate transcripts of same locus properly, e.g. alternates may differ in protein quite a bit, but share identical exon parts.
2. That removed paralogs with high identity in protein sequence, which is a problem for some very interesting gene families. E.g. silent codon3 changes are lost.

How To get Best mRNA Transcript assemblies ,
Please read this brief How-To document that summarizes my tests on best transcript assembly methods. It points out tips for assembly parameters, such as using scaffolding and multi-kmer settings (for Velvet, Soap, others that allow; not Trinity), that will improve your transcript assemblies. Also review this recent comparison of gene assembler accuracy,
Best assembly methods compared for mosquito genes, 2016

EvidentialGene tr2aacds mRNA classifier description
Classification is based mainly on CDS-dna local alignment identity. Perfect fragment CDS are dropped, those with some CDS base differences are kept, with longest CDS as primary transcript. UTR identity is ignored (for now) because many of the mis-assemblies are from joined/mangled genes in UTR region.

Required additional software:

  • fastanrdb of exonerate package, quickly reduces perfect duplicate sequences
  • cd-hit, cd-hit-est clusters protein or nucleotide sequences.
  • blastn and makeblastdb of NCBI BLAST, Basic Local Alignment Search Tool, finds regions of local similarity between sequences.
You supply the transcript assemblies, an over-assembly of 100s per locus is desired.

tr2aacds pipeline algorithm
EvidentialGene/evigene/scripts/prot/tr2aacds.pl

Prerequisite is that you create transcript assemblies (with any/all useful methods). This software reads all the transcripts.fasta you have, chews on them and puts them into good and bad piles (with extras).


0. collect input transcripts.tr,
You supply input transcript sequences in .fasta, an over-assembly with redundant and variable quality transcripts, as one file. Please consider trformat.pl to regularize IDs, which are critical to tracking inputs. tr2aacds produces .cds, .aa sequences from .tr, working mostly on the CDS.

1. perfect redundant removal: exonerate/fastanrdb input.cds > input_nr.cds,
and protein qualities are used for choosing among cds identicals.

2. perfect fragment removal: cd-hit-est -c 1.0 -l $MINCDS ..

3. blastn, basic local align hi-ident subsequences for alternate tr.,
with -perc_identity CDSBLAST_IDENT (98%), to find high-identity exon-sized alignments.

4. classify main/alternate cds, okay & drop subsets, using evigene/rnaseq/asmrna_dupfilter2.pl
.. merges alignment table, protein-quality and identity, to score okay-main, ok-alt, and drop sets.

5. final output files from outclass: okay-main, okay-alts, drops
okayset is for public consumption, drops for data-overload enthusiasts (may contain valids).
evgmrna2tsa.pl is next step, to become part of tr2aacds2, for final publicset/


Other EvidentialGene scripts for trassembly

evigene/scripts/rnaseq/trformat.pl
Use BEFORE tr2aacds to regularize IDs in fasta of Velvet,Soap,Trinity, ensure unique IDs, add prefixes for parameter sets.
evigene/scripts/prot/namegenes.pl
Use AFTER tr2aacds on okayset, add gene function names from UniProt-Ref and CDD blastp.
deltablast -rpsdb $cddb -show_domain_hits -outfmt 7 -db $refdb -query $name.allok.aa -out $name.deblastp
namegenes.pl -cddnames=info.cdd.txt -refnames $refdb.names -blast $name.deblastp
evigene/scripts/rnaseq/asmrna_trimvec.pl
UniVec vector screening and NNN-end trimming, per NCBI or INSDC desires
evigene/scripts/evgmrna2tsa.pl
See evgmrna2tsa_help, which describes steps, including asmrna_trimvec, namegenes as parts of evgmrna2tsa pipeline. make public mRNA gene set, with pubIDs, main/alternates, names and annotation, and Genbank TSA format for public submission (needs work to simplify/separate parts)

EvidentialGene software is at
http://arthropods.eugenes.org/EvidentialGene/evigene/ or ftp://arthropods.eugenes.org/evigene.tar

One way to get and update is: ftp ftp://arthropods.eugenes.org/evigene.tar
with a standard ftp client, wget, curl and some web browsers. Alternatively, find tar archives at
/EvidentialGene/other/evigene_old/.       Then extract the tar archive
tar -xf evigene.tar into current folder, preserving run permission.

Run the Perl ".pl" scripts from extracted evigene folder, as they are a package. E.g., for Unix bash shell:
export evigene=`pwd`/evigene;
$evigene/scripts/prot/tr2aacds.pl ..; $evigene/scripts/evgmrna2tsa.pl .. ;

For Unix csh/tcsh, "set evigene=`pwd`/evigene". Most of the shell ".sh" scripts require editing for your cluster; consider them examples. These scripts have brief -help, but most of their documentation is perl POD; read the scripts. This is a complex package, including my working scripts for several genome projects, some are obsolete now. One needs a cheat-sheet to make sense of what is good and I will work on that.

TEST DRIVE

Please first try this test case with small input data (TAIR10 mRNA) and tr2aacds outputs,
EvidentialGene/plants/arabidopsis/evigene_tr2aacds_test/
It is worth your time to set up and run this with same input data to see that you get same results.

CLUSTER Run Script

evigene/scripts/prot/tr2aacds_qsub.sh
Run on cluster with qsub/PBS batch scheduler as:
env trset=banana1all3.tr datad=`pwd` qsub -q batch tr2aacds_qsub.sh

#! /bin/bash
## env trset=banana1all3.tr datad=`pwd` qsub -q normal tr2aacds_qsub.sh
#PBS -N tr2cds
#PBS -l nodes=1:ppn=32,walltime=18:55:00
#PBS -o tr2cds.$$.out
#PBS -e tr2cds.$$.err
#PBS -V

ncpu=32
maxmem=50000  # in Megabytes
evigene=$HOME/bio/evigene/scripts

## Required software
#t2ac: app=cd-hit-est, path= echo MISSING_cd-hit-est
export PATH=$HOME/bio/cdhit/bin:$PATH
#t2ac: app=fastanrdb, path= echo MISSING_fastanrdb
export fastanrdb=$HOME/bio/exonerate/bin/fastanrdb
#t2ac: app=blastn, path= echo MISSING_blastn
export PATH=$HOME/bio/ncbi2227/bin:$PATH

if [ "X" = "X$trset" ]; then
  echo "missing env trset=xxxx.tr"; exit -1
fi
if [ "X" = "X$datad" ]; then
  echo "missing env datad=/path/to/data"; exit -1
fi

cd $datad/
$evigene/prot/tr2aacds.pl -NCPU $ncpu -MAXMEM $maxmem -log -cdna $trset


NOTE: main CPU cost is blastn-self, which grows with number of non-redundant input transcripts. for 1 Mill, 32 cpu, may take 1-2hr; for 10 Mill, "", may take 10+ hr (esp w/ many hi-identity alternates).

Banana tree example summary


#t2ac: EvidentialGene tr2aacds.pl VERSION 2013.03.11
#t2ac: app=blastn, path=/home/ux455375/bio/ncbi2227/bin/blastn
#t2ac: app=makeblastdb, path=/home/ux455375/bio/ncbi2227/bin/makeblastdb
#t2ac: app=fastanrdb, path=/home/ux455375/bio/exonerate/bin/fastanrdb
#t2ac: app=cd-hit-est, path=/home/ux455375/bio/cdhit461/bin/cd-hit-est
#t2ac: app=cd-hit, path=/home/ux455375/bio/cdhit461/bin/cd-hit
#t2ac: BEGIN with cdnaseq= banana1all3.tr.gz date= Mon Mar 11 16:52:43 PDT 2013

#t2ac: bestorf_cds= banana1all3.cds nrec=1806602        # 1.8 Mill input transcripts
#t2ac: nonredundant_cds= banana1all3nr.cds nrec=1262705 # 30% redundant identicals; 70% differ in CDS
#t2ac: nonredundant_reassignbest= 0 of 0
#t2ac: nofragments_cds= banana1all3nrcd1.cds nrec=873521 # 30% nr-CDS are perfect fragments of others
                                                           leaving 48% informative CDS of 1.8 Mill input
#t2ac: blastn_cds= banana1all3nrcd1-self95.blastn        # Blastn for near-perfect Basic Local Alignments
#t2ac: asmdupfilter_cds= banana1all3.trclass             # interpret blastn and protein identities to
                                                           classify into categories below.
  TrClasses:
  main = primary transcript w/ alternates; 
  alt = alternate transcript, with main identified;
  noclass = primary with no alternates; 
  "hi" = high identity (>=98% dna); 
  "hi1" (very hi identity) and "a2" (protein identity) are subclasses
  "mid" and "mfrag" are lower identity, testing w/ genome mapping 
    says these are more often paralogs than alternates.
  part = fragment alternates
  "okay" and "drop" are partitions of the classes to keep and ignore.
      
  # Class Table for banana1all3.trclass 
  class           %okay   %drop   okay    drop
  althi           11.5    12.6    100791  110786
  althi1          0.9     3.1     7911    27333
  althia2         0       12.3    0       108082
  altmfrag        2.5     2.7     22579   24216
  altmfraga2      0.5     0.4     5068    4002
  altmid          9.7     2.7     84966   24036
  altmida2        2.4     0.4     21269   3913
  main            5.2     4.7     45621   41299
  maina2          0.4     0.2     4023    2534
  noclass         0.9     8       8710    70025
  noclassa2       0       0       44      150
  parthi          0       12.4    0       108619
  parthi1         0       1.3     0       12053
  parthia2        0       4       0       35439
  ---------------------------------------------
  total           34.4    65.5    300982  572487  << 300K okay transcripts in 58,398 loci
  =============================================
  
# AA-quality for okay set of banana1all3.aa.qual (no okalt)
okay.top         n=1000; average=1212; median=1091; min,max=894,5104; gaps=2216,2.2%
okay.all         n=58398; average=237; median=161; min,max=40,5104; gaps=156841,2.6%
#t2ac: asmdupfilter_fileset= 
  banana1all3.okay.tr banana1all3.okalt.tr
  banana1all3.okay.aa banana1all3.okalt.aa
  banana1all3.okay.cds banana1all3.okalt.cds
  banana1all3.drop.tr banana1all3.drop.aa banana1all3.drop.cds


Developed at the Genome Informatics Lab of Indiana University Biology Department