
EvidentialGene: Quality of Animal & Plant Gene Sets


Fig.1 Gene Sizes of Animals & Plants, for longest 1000 [ aasize1k_alleuk PDF ]

Orthology completeness of Arthropod, Plant and Fish gene sets, 2014
EvidentialGene (solid bars) exceed others (hatched), update with more species, tables.


The bar graph shows the average size of the longest 1000 proteins in several mature and new animal and plant gene sets. Gene set versions are shown from the present (2012, solid bars) back to 2005 or 2002 (hatched bars).

The gene sets include those developed in the EvidentialGene project, marked as evgD, evgR (genomic or transcriptome only), and related reference or new species gene sets for comparison. This is not a full display of species, but a fair set for comparison to EvidentialGene sets.

The top label legend for bars lists short name and date (from 2000) of gene sets. The "/evg" mark indicates this project, and "/tsa" is from NCBI TSA. Above each bar is the maximum observed CDS size (in kilobases). These longest 1000 exclude duplicate proteins with over 90% sequence identity. The "U" line marks average size for unique proteins below 70% sequence identity. Details of gene sets are in the data table at

Figures 2a,b show the improvement in gene sets by Evigene methods, as version difference in protein size and homology alignment (blastp) to reference species genes. Homology improves closely with protein size improvements. Figures 4 to 9 show basic protein quality metrics, separated by gene construction method, for genomic and transcriptomic animal and plant gene sets.

mRNA and Protein Assembly Methods:

The methods used for mRNA-seq transcript assembly are now described in EvidentialGene_trassembly_pipe with accompanying software. This encapsulates, or "pipelines", a useful subset of EvidentialGene's larger gene construction tool kit.

Here is an outline of mRNA assembly best practices. These are tips from experiences over several years of work with RNA-seq assembly and gene construction; some of this is encapsulated in scripts in the Evigene/rnaseq/ set (with a guide to be written).

Take-home points of graph:

New gene sets can be improved rapidly now, compared to older projects. The number on each gene set name is the year. Note that human, zebrafish, arabidopsis, drosophila, honeybee have taken 5+ years to improve. Newer updates for aphid, nasonia, daphnia have improved by the same amount over a year's work, with the benefit of much new gene evidence and better software. And new "first-draft" gene sets like cacao, killifish, and some of the bugs are as good as the mature sets, by this size measure. I've other measures using homology, and they give about the same answer: complete/bigger proteins are also stronger orthologs.

Another point is that several of these gene sets (catfish, banana, locust, whitefly, whiteshrimp) are transcript assemblies without a genome. They are about as complete as those with genomes.

A third point, with those transcript assemblies, is that it matters how you process the data. There are some rather poor transcript assemblies shown, from the authors' NCBI TSA entries. I reassembled these into more complete assemblies (catfish, banana, p. beetle, whitefly, white shrimp, tiger shrimp). See Fig 2a for improvement of these TSA, in protein size and homology alignment to reference genes. The best result, as measured by either protein sizes or homology to other species, is obtained by using several component methods and combining the best of their results, eliminating mistakes. This is the same lesson learned over the past decade of genome gene modelling.

Comparing transcriptome and genome gene sets, e.g. Figs 2a and 2b, transcriptome sets can be improved much more, e.g. 5 times more for whitefly versus its relative pea aphid. Transcriptome informatics is relatively immature, lacking aspects of genome-centric gene construction including protein mapping and gene signal detection (introns, transcript start/stops).

A special case is banana, with two independent published gene sets from summer 2012. One is mRNA-seq only (Banana1bt TSA); the second, Banana1bg, is a traditional genome-modelled gene set, which however used much less RNA-seq evidence. Both of these gene sets are inferior to the Evigene assembly of the TSA data, by measures of orthology to Arabidopsis genes, and protein sizes. The improvement for Banana1bg is lower, in the range of other genome-modelled improvements. Note this Evigene assembly did not use banana genome information, only the first mRNA-seq data set. This points out the obvious high value of relatively inexpensive RNA-seq (170 million banana reads) for genome projects.

With enough data and good software, one can now get an essentially complete gene set the first time (it still takes effort, but much less than with earlier genomes). However, transcriptome-only gene sets are deficient in a subset of genes weakly expressed, or unexpressed in sampled conditions (roughly 10% to 15% in the best transcriptomes measured here), and are more ambiguous with respect to loci represented. It is difficult to distinguish the provenance of high-identity transcripts: whether same locus, duplicated loci, or assembly artifact. High identity mRNA assemblies when mapped to genomes for Daphnia, cacao, and others indicate their indefinite mixture. What appear as slight differences of a few bases, or an exon, in mRNA assemblies, can and do map to genomes with best identity sometimes at the same locus or at distinct loci, as well as mis-mapped assembly errors, either transcriptomic or genomic.

Quality Assessment Methods:

Protein sets from current and prior animal and plant genome projects are collected. All protein sets are processed with 90% clustering by CD-Hit to remove identical proteins and their fragments, often alternate transcripts. A second round of CD-Hit clustering at 70% sequence identity is used for unique protein counts.
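
The two clustering rounds can be pictured with a minimal sketch. This is not CD-Hit itself: it assumes a simple longest-first greedy scheme, and it uses Python's difflib as a rough stand-in for aligned sequence identity. The function names identity and greedy_cluster, and the toy sequences, are hypothetical.

```python
from difflib import SequenceMatcher


def identity(a: str, b: str) -> float:
    """Matched residues divided by the shorter length.

    A rough stand-in for alignment identity, not the CD-Hit calculation.
    """
    matcher = SequenceMatcher(None, a, b, autojunk=False)
    matched = sum(block.size for block in matcher.get_matching_blocks())
    return matched / min(len(a), len(b))


def greedy_cluster(proteins: dict, threshold: float) -> dict:
    """Longest-first greedy clustering.

    Each protein joins the first representative it matches at or above
    the threshold; otherwise it becomes a new representative.
    Returns {representative_id: [member_ids]}.
    """
    clusters = {}
    # Longest proteins first, ties broken by ID for a stable order.
    for pid in sorted(proteins, key=lambda p: (-len(proteins[p]), p)):
        for rep in clusters:
            if identity(proteins[rep], proteins[pid]) >= threshold:
                clusters[rep].append(pid)
                break
        else:
            clusters[pid] = [pid]
    return clusters


# Toy input: a full-length protein, a fragment of it, and an unrelated one.
toy = {
    "full": "MKTAYIAKQRQISFVKSHFSRQLEERLGLIEVQ",
    "frag": "MKTAYIAKQRQISFVKSHFS",
    "other": "GGGPPPWWWCCCHHHDDDNNN",
}
nonredundant = greedy_cluster(toy, 0.90)  # round 1: drop duplicates and fragments
unique = greedy_cluster({rep: toy[rep] for rep in nonredundant}, 0.70)  # round 2
```

Running the second round on the representatives of the first mirrors the order described above: the 90% set feeds the size measure, and the 70% set gives the unique protein count.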

Protein sizes, minus gaps, are calculated and the longest 1000 are tabulated for average and median size. This measure of longest 1000 proteins is chosen for several reasons. The longest proteins are often the hardest to fully assemble/model and are more susceptible to artifacts and missing data, and thus are valuable as a quality measure. The metric is independent of orthology scoring, which is relative to nearest neighbor, but correlates well with orthology score (see other stats); longest are usually well-known genes. Ignoring shorter proteins eliminates the "trash-bin" of partial genes one gets with transcript assemblies. It is easy to calculate and apply when assessing new gene sets.
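
As an illustration of how easy the measure is to compute, here is a minimal sketch. It assumes proteins are held as an ID-to-sequence mapping and that gaps appear as X, * or - characters; that gap alphabet and the name longest_n_stats are assumptions, not taken from the Evigene scripts.

```python
from statistics import mean, median


def longest_n_stats(proteins: dict, n: int = 1000, gap_chars: str = "X*-") -> dict:
    """Average and median size of the n longest proteins, not counting gaps."""
    if not proteins:
        raise ValueError("no proteins given")
    # Size of each protein after dropping assumed gap characters.
    sizes = sorted(
        (sum(1 for aa in seq if aa.upper() not in gap_chars) for seq in proteins.values()),
        reverse=True,
    )
    top = sizes[:n]
    return {"count": len(top), "mean": mean(top), "median": median(top), "max": top[0]}


# Toy example with n=2; a real gene set would use the default n=1000.
stats = longest_n_stats({"a": "MKTAYXX", "b": "MKT", "c": "M"}, n=2)
```

The input should already be the 90% non-redundant set, so that duplicated long proteins do not inflate the average.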

This average of longest 1000 has a biological maximum as such proteins are biologically expensive to produce. The observed biological maxima here are Vertebrates: 2400 aa or 7000 CDS, Insects/Crustacea : 2000 aa or 6000 CDS, Plants : 1500 aa or 4500 CDS. For animals, these are often long and repetitive muscle proteins (list), with longest at/above 10,000 aminos. Plants lack these muscle proteins. If an animal or plant transcriptome assembly falls 100s of aa below these maxima, it is likely incomplete, lacking data or the best assembly method.
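
A small sketch of this check follows. The maxima are the observed values quoted above. The 200 aa cutoff is only an assumed reading of "100s of aa", and the name shortfall_report and the clade keys are hypothetical.

```python
# Observed maxima from the text: average size (aa) of the longest 1000 proteins.
OBSERVED_MAX_AA = {"vertebrate": 2400, "insect_crustacean": 2000, "plant": 1500}


def shortfall_report(avg_longest_aa: float, clade: str, cutoff_aa: float = 200) -> dict:
    """Compare a gene set's longest-1000 average with the observed clade maximum.

    cutoff_aa is an assumed threshold, not a value given in the text.
    """
    max_aa = OBSERVED_MAX_AA[clade]
    shortfall = max(0, max_aa - avg_longest_aa)
    return {
        "clade_max_aa": max_aa,
        "shortfall_aa": shortfall,
        "shortfall_cds_bases": 3 * shortfall,  # three coding bases per amino acid
        "likely_incomplete": shortfall >= cutoff_aa,
    }


draft = shortfall_report(1400, "insect_crustacean")
near_max = shortfall_report(1450, "plant")
```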

The data include transcript-assembly only projects. Those marked "/tsa" are taken from NCBI TSA entries. The evgR group are also transcript assemblies. These were first processed to long-orf proteins with evigene/cdna_bestorf. Because transcript-asm entries do not unambiguously distinguish loci and alternate transcripts, other methods are needed to distinguish duplicates, real and artifact. As indicated above, transcripts with over 90% protein identity are removed using CD-Hit. Yet some remaining are clearly alternates at the same locus, or artifact transcripts.
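
The idea of taking a long-orf protein from a transcript can be sketched as below. This is not the cdna_bestorf implementation; it is a simplified illustration that assumes the standard genetic code, scans all six reading frames, and keeps the longest stretch that starts at a methionine and runs to a stop codon or the end of the sequence. The name best_orf is hypothetical.

```python
from itertools import product

# Standard genetic code, codons ordered TTT, TTC, TTA, TTG, TCT, ...
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON = {"".join(c): aa for c, aa in zip(product(BASES, repeat=3), AMINO)}


def revcomp(seq: str) -> str:
    """Reverse complement of an uppercase DNA sequence."""
    return seq.translate(str.maketrans("ACGT", "TGCA"))[::-1]


def translate(seq: str, frame: int) -> str:
    """Translate one frame; codons with ambiguous bases become X."""
    return "".join(CODON.get(seq[i:i + 3], "X") for i in range(frame, len(seq) - 2, 3))


def best_orf(transcript: str) -> str:
    """Longest Met-initiated open reading frame over all six frames."""
    seq = transcript.upper()
    best = ""
    for strand in (seq, revcomp(seq)):
        for frame in range(3):
            for segment in translate(strand, frame).split("*"):
                start = segment.find("M")
                if start >= 0 and len(segment) - start > len(best):
                    best = segment[start:]
    return best


protein = best_orf("ccATGAAAGGGTTTTAAcc")
```

A production tool would also handle partial ORFs that lack a start codon and would weigh UTR lengths; this sketch leaves those out.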

Using RNA sequence identity instead of protein, transcripts of each gene set were self-aligned with blastn, and identity levels measured. Those with 90% or higher identity were matched to genome locus for genome-mapped gene sets of arthropod, plant and vertebrate cases. Roughly half of these high identity transcripts are found to map to distinct loci (on separate scaffolds or tandem duplicates). This doesn't clarify the ambiguity of locus assignment.
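
A minimal sketch of this tally follows, assuming blastn results in the standard 12-column tabular format (query ID, subject ID, percent identity first) and a separate transcript-to-locus table from genome mapping. The function names and the toy locus labels are hypothetical, and alignment coverage is ignored for brevity.

```python
def high_identity_pairs(blast_lines, min_identity: float = 90.0) -> set:
    """Unordered transcript pairs with percent identity >= min_identity, self-hits skipped."""
    pairs = set()
    for line in blast_lines:
        if not line.strip() or line.startswith("#"):
            continue
        fields = line.rstrip("\n").split("\t")
        query, subject, pident = fields[0], fields[1], float(fields[2])
        if query != subject and pident >= min_identity:
            pairs.add(tuple(sorted((query, subject))))
    return pairs


def fraction_distinct_loci(pairs, locus_of: dict):
    """Share of genome-mapped pairs whose two transcripts sit at different loci."""
    mapped = [(a, b) for a, b in pairs if a in locus_of and b in locus_of]
    if not mapped:
        return None
    distinct = sum(1 for a, b in mapped if locus_of[a] != locus_of[b])
    return distinct / len(mapped)


# Toy self-alignment table and locus assignments.
rows = [
    "t1\tt1\t100.0\t900\t0\t0\t1\t900\t1\t900\t0.0\t1600",
    "t1\tt2\t95.5\t850\t38\t0\t1\t850\t1\t850\t0.0\t1300",
    "t2\tt1\t95.5\t850\t38\t0\t1\t850\t1\t850\t0.0\t1300",
    "t3\tt4\t92.0\t700\t56\t0\t1\t700\t1\t700\t0.0\t1000",
    "t1\tt5\t80.0\t400\t80\t0\t1\t400\t1\t400\t1e-50\t300",
]
loci = {"t1": "scaf1:g10", "t2": "scaf1:g10", "t3": "scaf2:g7", "t4": "scaf9:g3"}
pairs = high_identity_pairs(rows)
share = fraction_distinct_loci(pairs, loci)
```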

Since distinct proteins, whether at same locus or not, are biologically interesting, and distinguishing locus assignment without a finished genome assembly is problematic, this is deemed not a priority for assessing gene set quality. Another operational definition of unique proteins, below sequence identity of 70%, is measured with CD-Hit clustering. This 70% unique protein set is marked at the "U" line on gene size bars in Fig.1. Those gene sets without U marking mostly lack alternate transcripts (e.g. older gene sets).

Orthology statistics of transcript assembly proteins are tabulated here
for Locusta migratoria, Daphnia magna, and Theobroma cacao.
EvidentialGene RNA-Seq assembly methods are summarized here

Gene Set Version Improvements

Fig.2a Transcriptome mRNA Assembly Improvements, Protein Size x Homology Alignment
mRNA assembly version improvements for Evigene vs published transcript assembly (TSA). These are de-novo transcriptome assemblies without a genome (evigene mRNA in Fig. 1). Drosophila version 2011 - 2002 (drosmel53) is a reference case. Banana1bt, Banana1bg are a special case. Banana1bg shows Evigene mRNA-seq assembly improvement to an independent genome-modelled gene set (see text), while Banana1bt shows improvement to the published TSA. Improvement is calculated using blastp alignment to the 1000 longest reference species proteins; the old version score is then subtracted from the Evigene score, for aligned bases and protein size.
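
The subtraction can be sketched as follows, assuming the best blastp hit per reference protein has already been reduced to an aligned length and a protein size for each gene set version. Averaging over the reference proteins and counting a missing hit as zero are assumptions of this sketch, as are the names mean_score and version_improvement.

```python
def mean_score(best_hits: dict, reference_ids: list, field: str) -> float:
    """Average of one score over the reference proteins; a missing hit counts as 0."""
    total = sum(best_hits.get(ref, {}).get(field, 0) for ref in reference_ids)
    return total / len(reference_ids)


def version_improvement(new_hits: dict, old_hits: dict, reference_ids: list) -> dict:
    """New-version score minus old-version score, for alignment and protein size."""
    return {
        field: mean_score(new_hits, reference_ids, field)
        - mean_score(old_hits, reference_ids, field)
        for field in ("aligned", "size")
    }


# Toy data for two reference proteins; the old version has no hit to r2.
refs = ["r1", "r2"]
new = {"r1": {"aligned": 900, "size": 1000}, "r2": {"aligned": 700, "size": 800}}
old = {"r1": {"aligned": 500, "size": 600}}
gain = version_improvement(new, old, refs)
```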


Fig.2b Genome Gene Set Improvements, Protein Size x Homology Alignment
Gene set version improvements for Evigene vs prior version, for 1000 longest proteins, with drosmel53 as reference. These are genome gene sets (evigene gDNA in Fig. 1), built with a mixture of protein, mRNA evidence and HMM gene modeling on genome assemblies, using Evigene methods. Locust1bt is the same mRNA assembly as in Fig. 2a, included to scale comparisons between these two figures.

Fig. 3 Protein alignment of mRNA Assemblies matched to longest Reference Genes (draft 2)

Fig. 3 Protein sizes of mRNA Assemblies matched to longest Reference Genes (draft 1)
Details for the average result in Fig 2a.

Fig.4-9 Protein Quality Summary for Gene Set Versions
Coding & Transcript sizes, Completeness, CDS/UTR Ok of 1000 longest

add legend

Panels: Daphnia magna, Cacao tree, Banana tree

Developed at the Genome Informatics Lab of Indiana University Biology Department