The bar graph shows the average size of the longest 1000 proteins in
several mature and new animal and plant gene sets. Gene set versions
from the present (2012, solid bars) back to 2005 or 2002 (hatched bars) are shown.
The gene sets include those developed in the EvidentialGene project,
marked as evgD, evgR (genomic or transcriptome only),
and related reference or new species gene sets for comparison. This
is not a full display of species, but a fair sample to compare
with EvidentialGene sets.
The top label legend for bars lists short name and date (from 2000) of
gene sets. The "/evg" mark indicates this project, and "/tsa" is from NCBI TSA.
Above each bar is the maximum observed CDS size (in kilobases).
These longest 1000 exclude duplicate proteins with over 90% sequence identity.
The "U" line marks average size for unique proteins below 70% sequence identity.
Details of gene sets are in the data table at
Figures 2a,b show the improvement in gene sets by Evigene methods,
as version difference in protein size and homology alignment (blastp)
to reference species genes. Homology improves closely with protein
size improvements. Figures 4 to 9 show basic protein quality
metrics, separated by gene construction method, for genomic and
transcriptomic animal and plant gene sets.
mRNA and Protein Assembly Methods:
The methods used for mRNA-seq transcript assembly are now
with accompanying software. This encapsulates, or "pipelines", a useful subset of EvidentialGene's
larger gene construction tool kit.
Here is an outline of
mRNA assembly best practices. These are tips from experiences over several
years of work with RNA-seq assembly and gene construction; some of this is
encapsulated in scripts in the Evigene/rnaseq/ set (with a guide to be written).
Take-home points of graph:
New gene sets can be improved
rapidly now, compared to older projects. The number on each gene set name
is the year. Note that human, zebrafish, arabidopsis, drosophila, and honeybee have taken
5+ years to improve. Newer updates for aphid, nasonia, and daphnia have
improved by the same amount in a year's work, with the benefit of much
new gene evidence and better software. And new "first-draft" gene
sets like cacao, killifish, and some of the bugs are as good as the mature
sets, by this size measure. I have other measures using homology, and they
give about the same answer: complete, bigger proteins are also stronger orthologs.
Another point is that several of these gene sets (catfish, banana, locust, whitefly, whiteshrimp)
are transcript assemblies without a genome. They are about as complete
as those with genomes.
A third point, with those transcript assemblies, is that it matters
how you process the data. There are some rather poor transcript
assemblies shown, from the authors' NCBI TSA entries. I reassembled
these into more complete assemblies (catfish, banana, p. beetle, whitefly, white
shrimp, tiger shrimp). See Fig 2a for improvement of these TSA, in protein size and
homology alignment to reference genes. The best result, as measured by
either protein sizes or homology to other species, is obtained by
using several component methods and combining the best of their results,
eliminating mistakes. This is the same lesson learned over the past
decade of genome gene modelling.
Comparing transcriptome and genome gene sets (e.g. Figs 2a
and 2b), transcriptome sets can be improved much more, e.g.
5 times more for whitefly versus its relative pea aphid. Transcriptome
informatics is relatively immature, lacking aspects of
genome-centric gene construction including protein mapping and gene
signal detection (introns, transcript start/stops).
A special case is banana, with two independent published gene sets
from summer 2012. One is mRNA-seq only (Banana1bt TSA); the second,
Banana1bg, is a traditional genome-modelled gene set, which however
used much less RNA-seq evidence. Both of these gene sets are inferior
to the Evigene assembly of the TSA data, by measures of orthology to
Arabidopsis genes and protein sizes. The improvement for Banana1bg
is lower, in the range of other genome-modelled improvements. Note this
Evigene assembly did not use banana genome information, only the first
mRNA-seq data set. This points out the obvious high value of
relatively inexpensive RNA-seq (170 million banana reads) for genome
With enough data and good software, one can now get an essentially
complete gene set the first time (it still takes effort, though
much less than with earlier genomes).
However, transcriptome-only gene sets are deficient in a subset of
genes that are weakly expressed, or unexpressed in sampled conditions (roughly
10% to 15% in the best transcriptomes measured here), and are more
ambiguous with respect to the loci represented. It is difficult to
distinguish the provenance of high-identity transcripts: whether same
locus, duplicated loci, or assembly artifact. High-identity mRNA
assemblies, when mapped to genomes for Daphnia, cacao, and others,
turn out to be an indefinite mixture of these cases. What appear as slight
differences of a few bases, or an exon, in mRNA assemblies can and do
map to genomes with best identity sometimes at the same locus and sometimes at
distinct loci, and some are mis-mapped assembly errors, either
transcriptomic or genomic.
Quality Assessment Methods:
Protein sets from current and prior animal and plant genome projects
are collected. All protein sets are clustered at 90% sequence identity
with CD-Hit to remove identical and near-identical proteins and their
fragments, which are often alternate transcripts.
A second round of CD-Hit clustering at 70% sequence identity
is used for unique protein counts.
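The two clustering rounds can be sketched as command lines built in Python. This is a minimal sketch, not the project's actual script: the file names are hypothetical, feeding the 90% output into the 70% round is an assumption, and the word size of 5 is the value the CD-Hit documentation suggests for identity thresholds from 0.7 to 1.0. Only the cd-hit options -i, -o, -c and -n are used.

```python
def cdhit_command(fasta_in, fasta_out, identity):
    """Build a cd-hit command line for protein clustering at `identity` (0 to 1)."""
    if not 0.7 <= identity <= 1.0:
        # Word size 5 (-n 5) is only suggested for thresholds in this range.
        raise ValueError("this sketch only covers identity thresholds 0.7 to 1.0")
    return ["cd-hit", "-i", fasta_in, "-o", fasta_out,
            "-c", f"{identity:.2f}", "-n", "5"]

# Round 1: drop duplicates and fragments above 90% identity (hypothetical file names).
nonredundant = cdhit_command("proteins.aa", "proteins.nr90.aa", 0.90)
# Round 2: count "unique" proteins below 70% identity (assumed to start from round 1 output).
unique = cdhit_command("proteins.nr90.aa", "proteins.uniq70.aa", 0.70)

print(" ".join(nonredundant))
print(" ".join(unique))
```

The lists can be passed to subprocess.run() on a machine where cd-hit is installed.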
Protein sizes, minus gaps, are calculated and the longest 1000 are
tabulated for average and median size. This measure of the longest 1000 proteins is
chosen for several reasons. The longest proteins are often the
hardest to fully assemble or model, and they are more susceptible to
artifacts and missing data, so they are valuable as a quality measure.
The metric is independent of orthology scoring, which is relative to the nearest
neighbor, but correlates well with orthology score (see other stats);
the longest are usually well-known genes. Ignoring shorter proteins
eliminates the "trash-bin" of partial genes one gets with transcript
assemblies. It is easy to calculate and apply when assessing new gene sets.
This average of the longest 1000 has a biological maximum, as such proteins are
biologically expensive to produce. The observed maxima here are
Vertebrates: 2400 aa or 7000 CDS bases; Insects/Crustacea: 2000 aa or 6000 CDS bases;
Plants: 1500 aa or 4500 CDS bases.
For animals, these are often long and repetitive muscle proteins
(list), with the longest at or above 10,000 amino acids. Plants lack these
muscle proteins. If an animal or plant transcriptome assembly falls 100s
of aa below these maxima, it is likely incomplete, lacking data or
the best assembly method.
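The longest-1000 measure and the comparison to the observed maxima can be sketched in a few lines of Python. This is an illustration under stated assumptions, not the Evigene code: "minus gaps" is taken here to mean ignoring X and "-" characters and a stop "*", the clade keys are hypothetical labels for the three groups in the text, and the function names are invented for this example.

```python
from statistics import mean, median

# Observed maxima for the longest-1000 average, in amino acids, from the text.
OBSERVED_MAX_AA = {"vertebrate": 2400, "arthropod": 2000, "plant": 1500}

def read_fasta(lines):
    """Yield (id, sequence) pairs from FASTA-formatted lines."""
    name, chunks = None, []
    for line in lines:
        line = line.strip()
        if line.startswith(">"):
            if name is not None:
                yield name, "".join(chunks)
            name, chunks = (line[1:].split() or [""])[0], []
        elif line:
            chunks.append(line)
    if name is not None:
        yield name, "".join(chunks)

def ungapped_length(seq):
    """Protein size minus gaps (assumed: skip X, '-', and the stop '*')."""
    return sum(1 for aa in seq.upper() if aa not in "X-*")

def longest_n_stats(lengths, n=1000):
    """Return (average, median) size of the n longest proteins."""
    top = sorted(lengths, reverse=True)[:n]
    if not top:
        raise ValueError("no protein lengths given")
    return mean(top), median(top)

def shortfall(avg_aa, clade):
    """Amino acids by which an average falls below the observed maximum."""
    return OBSERVED_MAX_AA[clade] - avg_aa

# Toy input; a real gene set would be read from a protein FASTA file.
toy = [">p1 demo", "MKTXXA", "LLV*", ">p2", "MK", ">p3", "MKTAYIAK"]
lengths = [ungapped_length(seq) for _, seq in read_fasta(toy)]
print(longest_n_stats(lengths, n=2))
```

A shortfall of a few hundred amino acids, as returned by shortfall(), would flag an assembly as likely incomplete in the sense described above; the text gives no exact cutoff.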
The data include transcript-assembly only projects. Those marked "/tsa"
are taken from NCBI TSA entries. The evgR group are also transcript assemblies.
These were first processed to long-orf proteins with evigene/cdna_bestorf.
Because transcript-asm entries do not unambiguously distinguish
loci and alternate transcripts, other methods are needed to
distinguish duplicates, real and artifact. As indicated above,
transcripts with over 90% protein identity are removed using CD-Hit.
Yet some of those remaining are clearly alternates at the same locus, or artifact transcripts.
Using RNA sequence identity instead of protein, transcripts of each
gene set were self-aligned with blastn, and identity levels measured.
Those with 90% or higher identity were matched to genome locus for
genome-mapped gene sets of arthropod, plant and vertebrate cases.
Roughly half of these high identity transcripts are found to
map to distinct loci (on separate scaffolds or tandem duplicates).
This doesn't clarify the ambiguity of locus assignment.
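The self-alignment step can be sketched as a filter over blastn tabular output (the standard 12-column "-outfmt 6" layout: query id, subject id, percent identity, and so on). This is a minimal sketch with an invented function name; it applies only the 90% identity cutoff from the text and makes no assumption about alignment length or coverage, which the text does not specify.

```python
def high_identity_pairs(blast_lines, min_identity=90.0):
    """Collect unordered transcript pairs aligned at or above min_identity.

    Expects blastn tabular lines: qseqid, sseqid, pident, then other columns.
    Self hits (query equals subject) are skipped.
    """
    pairs = set()
    for line in blast_lines:
        if not line.strip() or line.startswith("#"):
            continue
        fields = line.rstrip("\n").split("\t")
        query, subject, pident = fields[0], fields[1], float(fields[2])
        if query != subject and pident >= min_identity:
            pairs.add(tuple(sorted((query, subject))))
    return pairs

# Toy self-alignment table with made-up transcript ids.
toy_hits = [
    "t1\tt1\t100.00\t900\t0\t0\t1\t900\t1\t900\t0.0\t1663",
    "t1\tt2\t97.50\t800\t20\t0\t1\t800\t1\t800\t0.0\t1367",
    "t2\tt1\t97.50\t800\t20\t0\t1\t800\t1\t800\t0.0\t1367",
    "t1\tt3\t82.10\t400\t71\t1\t1\t400\t5\t404\t1e-90\t330",
]
print(high_identity_pairs(toy_hits))
```

The resulting pairs are the candidates that would then be checked against genome locations to see whether they fall at one locus or at distinct loci.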
Since distinct proteins, whether at same locus or not, are
biologically interesting, and distinguishing locus assignment without
a finished genome assembly is problematic, this is deemed not a
priority for assessing gene set quality. Another operational
definition of unique proteins, below sequence identity of 70%, is
measured with CD-Hit clustering. This 70% unique protein set is
marked at the "U" line on gene size bars in Fig.1. Those gene sets without U
marking mostly lack alternate transcripts (e.g. older gene sets).
Orthology statistics of transcript assembly proteins are tabulated here
for Locusta migratoria, Daphnia magna, and Theobroma cacao.
EvidentialGene RNA-Seq assembly methods are summarized here
Gene Set Version Improvements
Fig.2a Transcriptome mRNA Assembly Improvements, Protein Size x Homology Alignment
mRNA assembly version improvements for Evigene vs published
transcript assembly (TSA). These are de-novo transcriptome assemblies
without a genome (evigene mRNA in Fig. 1). Drosophila version 2011 -
2002 (drosmel53) is a reference case. Banana1bt, Banana1bg are a
special case. Banana1bg shows Evigene mRNA-seq assembly improvement over
an independent genome-modelled gene set (see text), while Banana1bt
shows improvement over the published TSA. Improvement is calculated
using blastp alignment to the 1000 longest reference species proteins;
the old version score is then subtracted from the Evigene score, for aligned bases
and protein size.
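The subtraction described in this caption can be sketched as follows. This is an illustration, not the script used for the figure: it assumes each gene set version has already been reduced to a best blastp hit per reference protein, stored as (aligned length, protein size), that a reference protein with no hit scores zero, and that the plotted value is the mean difference over the reference proteins. The function name and data are invented for the example.

```python
from statistics import mean

def mean_improvement(new_hits, old_hits, reference_ids):
    """Mean (new - old) alignment and protein size over reference proteins.

    new_hits and old_hits map a reference protein id to a tuple
    (aligned_length, protein_size) for the best hit in that version.
    Reference proteins without a hit count as (0, 0) (an assumption).
    """
    align_diffs, size_diffs = [], []
    for ref in reference_ids:
        new_align, new_size = new_hits.get(ref, (0, 0))
        old_align, old_size = old_hits.get(ref, (0, 0))
        align_diffs.append(new_align - old_align)
        size_diffs.append(new_size - old_size)
    return mean(align_diffs), mean(size_diffs)

# Toy data: two reference proteins, one of them missed by the old version.
refs = ["r1", "r2"]
evigene = {"r1": (900, 1000), "r2": (500, 600)}
published = {"r1": (700, 800)}
print(mean_improvement(evigene, published, refs))
```

Positive values on both axes mean the newer gene set has longer proteins that also align over more of the reference genes.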
Fig.2b Genome Gene Set Improvements, Protein Size x Homology Alignment
Gene set version improvements for Evigene vs prior version, for 1000 longest proteins,
with drosmel53 as reference. These are genome gene sets (evigene gDNA in Fig. 1),
built with a mixture of protein, mRNA evidence and HMM gene modeling on genome assemblies,
using Evigene methods. Locust1bt is the same mRNA assembly as in Fig. 2a,
included to allow scaled comparison of these two figures.
Fig. 3 Protein alignment of mRNA Assemblies matched to longest Reference Genes (draft 2)
Fig. 3 Protein sizes of mRNA Assemblies matched to longest Reference Genes (draft 1)
Details for the average result in Fig 2a.