Duplicate Genes from chromosome duplication
Reliable homeologous genes (ohnologs) in maize that are conserved with single loci in
rice, sorghum and Arabidopsis were identified by Schnable et al. 2011 (doi:10.1073/pnas.1101368108).
These comprise 1750 locus pairs, with the two members of each pair on separate chromosomes (3500 loci).
Of these, 1661 locus pairs are identified in corn gene sets via alignment to sorghum loci.
Alternate Transcripts of Maize Genes
Alternate transcripts may be reconstructed more accurately by using several methods,
as each locus and its set of alternates have differing properties (size, complexity, amount of
shared exons, expression levels, etc.). The assembler's Kmer size affects both
sequence accuracy (measured by alignment identity to reference alternate transcripts) and
number found (measured by alignments found to distinct reference alternate transcripts).
Reconstruction of alternate transcripts and of gene duplicates is similar in that both face the problem
of high-identity duplicated sequences with variable expression,
and both are more accurate with large-Kmer assemblies.
They differ in that alternates share large sequence spans of common exons.
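The two measures named above (number of distinct reference alternates found, and alignment identity to them) can be sketched as follows. This is a minimal illustration, not the Evigene tabulation itself: the tuple layout, the function name `summarize_alternates` and the `min_identity` cutoff are assumptions made for the example.

```python
def summarize_alternates(alignments, min_identity=0.0):
    """Count distinct reference alternates found and average their best identity.

    alignments: iterable of (assembled_id, ref_transcript_id, pct_identity)
    tuples; this record layout is hypothetical.
    """
    best = {}  # best % identity seen for each reference alternate transcript
    for _assembled_id, ref_id, pct_identity in alignments:
        if pct_identity < min_identity:
            continue
        if pct_identity > best.get(ref_id, -1.0):
            best[ref_id] = pct_identity
    n_found = len(best)
    mean_identity = sum(best.values()) / n_found if n_found else 0.0
    return n_found, mean_identity


# Made-up example: two assembled transcripts hit the same reference alternate,
# so only the better of the two alignments is counted for it.
example = [
    ("asm1", "refLocus1.t1", 99.0),
    ("asm2", "refLocus1.t1", 95.0),
    ("asm3", "refLocus1.t2", 91.0),
]
n_found, mean_identity = summarize_alternates(example)
print(n_found, mean_identity)
```

Keeping only the best alignment per reference transcript means that redundant assemblies of one isoform do not inflate the number found.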
Among the five maize gene sets, the Evigene5 gene assembly of short
Illumina reads has reconstructed substantially more reference alternate
isoforms, with greater alignment identity to reference
proteins, than the other four gene sets: those modeled on
chromosomes by Gramene/Ensembl and by NCBI, the genes assembled from
PacBio long reads, and the JGI gene assembly of short reads.
Summary for Gene Sets in Reconstruction of Maize Gene Alternates, Sorghum ref
Summary for Gene Sets in Reconstruction of Maize Gene Alternates, Arabidopsis ref
Data table for corngenes_alternates_qualsum
KMER Effect on Reconstruction of Maize Gene Alternates, 4 Qualities
Conserved Ohnologs found, nRef=1697, nOhnoHit2=1453, nOhnoref=3150
Alternates found, nRefLoci=6165 (Hit1), nAlternates=2251 (Hit2) for Sorghum reference (Sbicolor_313)
Alternates found, nRefLoci=9290 (Hit1), nAlternates=3841 (Hit2) for Arabidopsis reference (arath15)
Qualities for matching conserved ohnologs and alternates
nHit = number of ohno loci found (1 or 2 copies),
pHit1 = % found 1st copy, pHit2 = % found 2nd (independent) copy,
or, for the alternate summary, pHit1 = % found primary isoform, pHit2 = % found alternates,
Kmer = assembler read-shred size, except for combined gene sets,
pIg = % identity to maize V4 chr assembly, pIr = % identity to Sorghum ref gene CDS, for found loci (nHit),
pIgr = product pIg x pIr / 100, i.e. % identity on both dimensions, reference genes and maize chromosomes,
rpIg, rpIr, rpIgr = same as above, but relative to all ohno loci (nOhnoref),
sumPIGR = sum over genes of the pIgr metric (sumPIGR/nHit = pIgr, sumPIGR/nOhnoref = rpIgr).
A perfect quality score would be 100%, except that maximal % identity to Sorghum gene CDS is ~90%.
The overall quality metric is rpIgr: % identity on both dimensions, ref genes and chromosomes, for
all conserved ohnologs.
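The identity metrics defined above can be sketched as below. This is a minimal illustration under stated assumptions, not the code used to produce the tables: the function name `quality_summary` and the input layout (one `(pIg, pIr)` pair per found locus) are hypothetical, and the example values are made up.

```python
def quality_summary(found_loci, n_ohnoref):
    """Compute pIg, pIr, pIgr and their relative (rp*) forms.

    found_loci: list of (pIg, pIr) pairs, one per found locus, both in percent
                (hypothetical input layout).
    n_ohnoref:  total number of reference ohno loci (nOhnoref).
    """
    n_hit = len(found_loci)
    sum_pig = sum(pig for pig, _pir in found_loci)
    sum_pir = sum(pir for _pig, pir in found_loci)
    # Per-gene pIgr is the product pIg x pIr / 100; sumPIGR sums it over genes.
    sum_pigr = sum(pig * pir / 100.0 for pig, pir in found_loci)

    def per(total, n):
        return total / n if n else 0.0

    return {
        "nHit": n_hit,
        "pIg": per(sum_pig, n_hit),
        "pIr": per(sum_pir, n_hit),
        "pIgr": per(sum_pigr, n_hit),       # sumPIGR / nHit
        "rpIg": per(sum_pig, n_ohnoref),
        "rpIr": per(sum_pir, n_ohnoref),
        "rpIgr": per(sum_pigr, n_ohnoref),  # sumPIGR / nOhnoref
        "sumPIGR": sum_pigr,
    }


# Made-up example: 2 of 4 reference loci were found.
summary = quality_summary([(100.0, 90.0), (96.0, 75.0)], n_ohnoref=4)
for name, value in summary.items():
    print(name, value)
```

Because the relative metrics divide by nOhnoref rather than nHit, a gene set that finds few loci at high identity still scores low on rpIgr, which is why it serves as the overall quality measure.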
Source gene sets
Complete maize gene sets:
NCBIv3G = NCBI-EGAP on V3chr (2014),
JGI14v3R = JGI gene assembly on V3chr (2014),
CSHL6EnsG/CSHL32s4v = Gramene/Ensembl Plants v32 (Sep. 2016) on V4chr,
CSHL6PacR = Gramene PacBio gene assembly (June 2016) on V4chr
Evigene subset assemblies:
idba=cornhi12m3idba, soap=cornhi8m4msoap, trin=cornhi8mtrin, velv=cornhi8m9agvelv