
Animal and Plant gene set reconstructions with EvidentialGene:
Comparisons to other popular and recent gene reconstructions.

D.G. Gilbert, gilbertd at indiana.edu, 2016/2017

Recent plant and animal EvidentialGene constructions surpass PacBio, MAKER, NCBI and Trinity methods for Arabidopsis, corn, white fly and water flea. Compared with gene sets from these other commonly used methods, the Evigene methods recover genes more accurately, as measured by homology across species and by expression data.

In particular, for 3 plant species, Illumina RNA assemblies built according to Evigene methods surpass PacBio RNA genes not only in total gene set accuracy but also in per-locus accuracy at loci where both methods recover transcripts, for primary, alternate and paralog transcript reconstruction. Trinity-assembled Illumina RNA gene sets are likewise incomplete compared to Evigene's multiple-assembler/reduction approach.

In comparison to genome-modeled gene sets, which are derived from many sources of gene evidence (prediction from chromosomes, RNA, other species' proteins), Evigene's RNA-only constructions often surpass those modeled genes in accuracy. This is likely due to the greater complexity of merging many evidence sources in modeled genes, which increases the chance of mis-modeling.

Evigene Illumina-RNA versus PacBio RNA comparisons include the model plant Arabidopsis and Zea mays corn, both summarized below, as well as pine trees. Evigene versus Trinity-only comparisons include these plants, and animals such as Bemisia white fly, Daphnia water fleas, Aedes and Anopheles mosquitoes, honey bee, mice, fishes and others (including several comparisons by independent authors of animal and plant gene sets). Evigene versus genome-modeled comparisons include gene sets produced by NCBI EGAP, MAKER, AUGUSTUS and similar gene modelers, for Arabidopsis, corn, pine and other plants, and for animals including mosquitoes, water fleas, honey bee and others.


1. Plant model Arabidopsis thal. gene reconstructions ...  evigene2017_arabidopsis 
    Gene assemblies of Illumina RNA-seq vs PacBio

                AtAraport genes    Cacao genes      Introns
  Geneset      Found%  AlignT%    Found%  AlignF%   Found%
  AtAraport      --      --        88.7    70.7      88.1
  AtEvigene     95.4    95.0       89.1    70.3      87.5
  AtOases       90.0    91.2        na      na       81.1
  AtIDBAtr      89.5    89.1        na      na       80.7
  AtSOAPtr      88.9    87.0        na      na       79.1
  AtTrinity     88.4    84.1        na      na       81.4
  AtPacBio      58.1    48.2       64.2    60.5      56.3
 --------------------------------------------------------

2.  Corn Zea mays gene reconstructions ...  evigene2016_corn
  Gene assemblies of Illumina, PacBio, and genes modeled on chromosome assembly
  
            Sorghum genes    Introns
  Geneset   Found%  AlignT%  Found%
  ZmEvigene   82.9    91.1     68.7 
  ZmGramene   81.9    90.3     68.1 
  ZmNCBI      81.3    89.6      na  
  ZmPacBio    78.0    82.4     68.2   
  ZmJgi4      77.6    81.2     68.9
 ------------------------------------

3. White fly gene reconstructions ... evigene2016_whitefly 
    Bemisia tabaci (cotton/crop plant pest)

                 Reference species       RNA
               Pea aphid    Fruit fly    Introns 
  Geneset   Found%  AlnT%  Found% AlnT%  Found% 
  BtEvigene   81.2   88.0    74.1  74.9   68.5   
  BtNCBI      79.7   82.3    73.4  71.6   69.4   
  BtMaker     77.4   73.8    72.1  66.0   57.7   
  BtTrinity   73.5   59.2    68.0  53.2   50.5    
 ----------------------------------------------

4. Water flea Daphnia pulex gene reconstructions ...  evigene2017_daphnia_pulex

                 Reference species       RNA
            Daphnia magna   Fruit fly    Introns 
  Geneset   Found%  AlnT%  Found% AlnT%  Found% 
  DpEvigene  72.0   88.6    67.9  80.3   66.6    
  DpMaker    58.9   69.9    64.3  74.5   46.7    
 ----------------------------------------------

Arabidopsis gene sets
  AtAraport = public gene set of 2016 of Arabidopsis thaliana from Araport.org
  AtEvigene = Evigene classification/reduction of Illumina RNA assemblies
            http://arthropods.eugenes.org/EvidentialGene/plants/arabidopsis/evigene2017_arabidopsis/
  AtOases   = Velvet/Oases assembly of Illumina RNA
  AtIDBAtr  = idba_tran assembly of Illumina RNA
  AtSOAPtr  = SOAP-Trans assembly of Illumina RNA
  AtTrinity = Trinity assembly of Illumina RNA
  AtPacBio  = Pac-Bio "no-assembly" assembly (PacBio xxx method) of Pac-Bio RNA data

Zea mays gene sets
  ZmEvigene = Evigene Zeamay5fEVm 2016 assembly of Illumina RNA-seq, public at
     http://arthropods.eugenes.org/EvidentialGene/plants/corn/evg5corn/
  ZmGramene = Ensembl/Gramene 2016.09 Zm000nnnn
  ZmPacBio  = CSHL/Gramene PacBio gene assemblies of 2016 as SRA entries SRR3147024..054
  ZmNCBI    = NCBI 2014 refgen zeamay
  ZmJgi4    = JGI Rnnotator assembly set of Illumina RNA-Seq, 2014

Bemisia tabaci gene sets 
  BtEvigene = Evigene gene assembly, 2016 update (vers 3), available [soon] at
    http://arthropods.eugenes.org/EvidentialGene/arthropods/whitefly/whitefly3evigene/
  BtNCBI    = NCBI RefSeq gene models, 2016
  BtMaker   = Whitefly genome project genes modeled with MAKER, 2016, whiteflygenomics.org
  BtTrinity = TSA.GBII gene assembly 2015, Trinity of Illumina

Daphnia pulex gene sets      
  DpEvigene = Evigene genes of 2017 from
    http://arthropods.eugenes.org/EvidentialGene/daphnia/daphnia_pulex/daphnia_pulex_genes2017/
  DpMaker   = MAKER genes of 2017 from report of doi:10.1534/g3.116.038638

Measures
  Genes Found%   = percent of reference genes with a significant alignment to the gene set (BLASTp/n of proteins or CDS)
  Genes AlnT%    = percent of reference gene bases that are aligned (AlignT% in tables 1 and 2)
  Introns Found% = percent of evidence introns aligned to gene set exons;
       intron evidence is from Illumina RNA-seq mapped to chromosome assemblies
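
The measures above can be sketched in a few lines of Python. This is only an illustration, not the Evigene scripts themselves: the function names (covered_length, gene_measures, intron_found_pct), the input layouts, and the e-value cutoff of 1e-5 are all assumptions made for the example. It also assumes that AlnT% is taken over the reference genes that were found, and that an evidence intron counts as found when it exactly spans the gap between two consecutive exons of one transcript; the source does not spell out either choice.

```python
from collections import defaultdict


def covered_length(intervals):
    """Total length covered by 1-based inclusive intervals, counting overlaps once."""
    covered = 0
    cur_start = cur_end = None
    for start, end in sorted(intervals):
        if cur_end is None or start > cur_end:
            if cur_end is not None:
                covered += cur_end - cur_start + 1
            cur_start, cur_end = start, end
        else:
            cur_end = max(cur_end, end)
    if cur_end is not None:
        covered += cur_end - cur_start + 1
    return covered


def gene_measures(ref_lengths, hits, max_evalue=1e-5):
    """Return (Found%, AlnT%) for one gene set against one reference.

    ref_lengths: {reference_gene_id: length}
    hits: iterable of (reference_gene_id, ref_start, ref_end, evalue),
          e.g. parsed from a tabular BLAST report (hypothetical layout).
    max_evalue: assumed significance cutoff, not taken from the source.
    """
    spans = defaultdict(list)
    for ref_id, start, end, evalue in hits:
        if ref_id in ref_lengths and evalue <= max_evalue:
            spans[ref_id].append((min(start, end), max(start, end)))
    found_bases = sum(ref_lengths[ref_id] for ref_id in spans)
    aligned_bases = sum(covered_length(iv) for iv in spans.values())
    found_pct = 100.0 * len(spans) / len(ref_lengths) if ref_lengths else 0.0
    # Assumption: AlnT% is computed over found reference genes only.
    aln_pct = 100.0 * aligned_bases / found_bases if found_bases else 0.0
    return found_pct, aln_pct


def intron_found_pct(evidence_introns, transcripts):
    """Percent of evidence introns that match an intron implied by gene set exons.

    evidence_introns: set of (chrom, strand, start, end), 1-based inclusive
    transcripts: list of (chrom, strand, [(exon_start, exon_end), ...])
    """
    model_introns = set()
    for chrom, strand, exons in transcripts:
        exons = sorted(exons)
        for (_, left_end), (right_start, _) in zip(exons, exons[1:]):
            model_introns.add((chrom, strand, left_end + 1, right_start - 1))
    if not evidence_introns:
        return 0.0
    return 100.0 * len(evidence_introns & model_introns) / len(evidence_introns)
```

Merging overlapping hit intervals before counting bases keeps a reference gene that is hit by several overlapping transcripts from being counted more than once.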

Further details are in evigene_plantsanimals_2017.txt

Developed at the Genome Informatics Lab of Indiana University Biology Department