euGenes/Arthropods About Arthropods EvidentialGene DroSpeGe

EvidentialGene: Alignment scores of Gene sets Constructed from
mRNA-Seq Assembly or Genome Predictions/mappings

Fig. 3 Protein alignment of mRNA Assemblies matched to longest Reference Genes.

These tables and graphs show the longest proteins of reference gene sets, and alignment percent for draft constructed or predicted gene sets to those long reference genes. The mRNA assembled genes are constructed without any reference protein information, nor with genome data. The Genome gene sets all use both reference genes and genomes to produce gene models. I.e., the genome genes have the answers in front of them, while mRNAseq genes do not, for this test.

This summary shows that genes constructed from mRNA-seq are often more complete than genes from genome-modelling methods. mRNA-seq genes are more biologically convincing as they don't rely on the reference species genes, predictions or genome assembly, and thus are not subject to artifacts from those sources. When they match the reference genes, it is true discovery. Combine this truth discovery with lower mRNA-seq costs, simpler computational methods, and fewer artifact sources, or "mRNA-sequencing+assembly" versus "genome-sequencing+assembly+gene-prediction+refgene-mapping", and the value of mRNA-seq genes over genome-genes should be apparent.

  
  Summary of mRNA-genes and Genome-genes Alignment 
    to Reference genes for Animals and Plants
                  mRNA       Genome     mRNA 
  Clade           Evigene     Genes     TSA Genes  Organisms
  ---------------------------------------------------------------------------------
  Fishes           78%    83%,71,67,65  39%       tilapia, catfish, medaka, stickleback, killifish
  Plants         93%,61      95%,63     10%       cacao, banana
  Insects        75%,54,48   56%,33     41%-4%    beetles, whitefly, locust
  Crustacea      77%,58,28   66%,58      5%       waterfleas, black & white shrimps
  Ticks            74%       57%,43     56%       zebra-tick, spider-mite, deer-tick
  ---------------------------------------------------------------------------------
Statistic is percent of test genes within 90% of clade maximum alignment to 1000 longest reference genes, from below statistic tables. This statistic approximately answers what portion of genes are accurate. Other statistics such as percent alignment to references have same ordering. mRNA Evigene and TSA (author published) are for same data and species; Genome genes are for the most part other species.

Data tables are given below.
Test Gene set source: GENOME = major portion from gene prediction on genome assembly and/or reference protein mappings.
mRNAseq = genes assembled from mRNA-seq, de-novo, without genome assembly or reference proteins.
mRNAseq assemblies include author-submitted (tsa) and Evigene-produced (evg), using same source mRNA-seq, where the difference is often dramatic (e.g. banana evg versus tsa versus gno-genome, or tiger shrimp penaeus_evg versus penaeus_tsa).
Reference gene sets are chosen from good quality gene sets, distant enough taxonomically from the test species clade to reduce phylogenetic effects (human for fishes, arabidopsis for plants, bee/tribolium for insects and crustacea). Ticks lack a near reference, the reference is median gene per family contained in all of 3 species (human, daphnia, tribolium). The 1000 longest reference proteins are tabulated with aligned test genes, and 200 are plotted. Alignment is scored with 'blastp -query reference.aa -db testspecies.aa'.

Plot bars: The bar plots show percentage alignment to reference gene, for each of 200 longest ref genes, as orange bars. Background blue bars are the maximal alignment for that group/clade of gene sets. Each title lists test gene set, and average percent alignment to 200 reference genes.

Statistic table columns: dref = mean percent difference in alignment from reference length, sdref= std. deviation of mean,
dmax = mean pct. difference in alignment from maximum alignment for this clade group,
nmax90 = number of genes 90% or better of maximal alignment, pmax90 = percent of genes 90% or better,
tmax = t-statistic for dmax less than 100%, pmax = p-value of t.; df = number of items -1


Protein alignment % for RefHuDaTr_median-ticks 
          dref sdref dmax nmax90 pmax90  tmax      pmax  df
ixodes    62.2  28.3 75.2    427   42.8 -28.7 1.45e-132 996   GENOME
tetur     67.1  29.8 80.2    566   56.8 -22.6  6.50e-92 996   GENOME
ztick_evg 74.1  26.9 89.2    742   74.4 -16.3  1.11e-53 996   mRNAseq
ztick_tsa 62.8  33.9 75.0    558   56.0 -23.3  2.40e-96 996   mRNAseq
  ixodes = Ixodes scapularis deer tick, tetur = Tetranychus urticae spider mite, 
  ztick = Rhipicephalus pulchellus zebra tick

ref3a_median1000-ticks.toppaln
Protein alignment % for RefArath-plants 
           dref sdref dmax nmax90 pmax90   tmax     pmax  df
banana_evg 78.7  28.3 82.1    608   61.0 -21.39 3.92e-84 995  mRNAseq
banana_gno 82.4  23.7 85.9    623   62.6 -21.18 8.65e-83 995  GENOME
banana_tsa 45.3  28.1 47.6     97    9.7 -59.05 0.00e+00 995  mRNAseq
cacao3_evr 92.6  15.7 97.6    922   92.6 -10.26 7.51e-24 995  mRNAseq
cacao1_evg 93.3  15.4 98.3    941   94.5  -8.25 2.41e-16 995  GENOME (including same mRNA seq as cacao3_evr)
  cacao = Theobroma cacao chocolate tree, banana = Musa acuminata banana plant

refarath1000-banana.toppaln
Protein alignment % for RefHoneybee-insects 
              dref sdref dmax nmax90 pmax90  tmax      pmax  df
pgbeetle_evg  75.2  28.7 90.0    725   74.8 -15.5  4.59e-49 968 mRNAseq
pgbeetle_tsa  62.6  29.4 74.8    397   41.0 -30.8 2.56e-146 968 mRNAseq
pinebeetle    56.3  32.9 65.8    321   33.1 -33.9 3.34e-167 968 GENOME
tribobeetle   69.6  29.8 82.8    546   56.3 -23.1  5.61e-95 968 GENOME
whitefly1_evg 67.8  31.7 79.9    525   54.2 -22.8  5.72e-93 968 mRNAseq
whitefly1_tsa 31.8  21.0 38.8     38    3.9 -69.1  0.00000  968 mRNAseq
locust1_evg   63.1  32.9 75.2    468   48.3 -25.8 2.72e-112 968 mRNAseq
  tribobeetle = Tribolium castaneum beetle, pgbeetle = Pogonus chalceus beetle, 
  pinebeetle = Dendroctonus ponderosae pine beetle, 
  whitefly =  Bemisia tabaci whitefly, locust = Locusta migratoria

refbee1000-beetles.toppaln

refbee1000-insect3.toppaln
Protein alignment % for RefHuman-fish 
              dref sdref dmax nmax90 pmax90  tmax     pmax  df
catfish_evg   82.1  26.8 91.7    767   77.7 -14.9 1.12e-45 986  mRNAseq
catfish_tsa   61.6  36.6 68.8    383   38.9 -8.05 7.49e-13 986  mRNAseq
killifish_evg 78.4  28.3 86.6    639   64.7 -20.0 1.49e-75 986  GENOME
medaka        77.7  30.7 84.8    662   67.1 -19.0 5.37e-69 986  GENOME
stickleback   79.7  30.1 86.9    704   71.3 -17.2 1.25e-58 986  GENOME
tilapia       84.0  27.6 92.2    818   82.9 -12.5 1.77e-33 986  GENOME

refman1000-fish.toppaln
Protein alignment % for RefBeetle-crusts 
             dref sdref dmax nmax90 pmax90  tmax      pmax  df
daphmag5_evg 70.8  31.1 89.8    740   77.0 -14.1  1.93e-41 960  mRNAseq
daphniamag   65.4  31.5 82.3    555   57.8 -21.4  7.04e-84 960  GENOME/mRNAseq
daphniaplx10 68.1  30.8 86.1    639   66.5 -19.0  6.03e-69 960  GENOME
litova1_evg  50.0  30.9 64.5    273   28.4 -36.7 2.60e-185 960  mRNAseq
penaeus_evg  65.0  31.5 82.3    557   58.0 -21.6  4.04e-85 960  mRNAseq
penaeus_tsa  24.4  23.8 32.1     44    4.6 -74.1  0.00e+00 960  mRNAseq
  penaeus = Penaeus monodon black tiger shrimp, 
  litova = Litopenaeus vannamei pacific white shrimp,
  daphniaplx, daphniamag, daphmag = Daphnia pulex and magna waterfleas 

reftribo_top1000-crustacea4.toppaln

reftribo_top1000-daphnia.toppaln

Data tables

Data Table columns include ref gene ID and aa length, and test gene Id, Bitscore, Identity, Align, and Length (all AA sizes)
E.g. datatables/refman1000-fish.topgene.tab
RefgeneID             RefLen  TestID1    Bits1 Iden1 Align1 Len1   TestID2  Bits2 Iden2 Align2 Len2 ...
HSAPI:ENSG00000155657	33423	cfishevg:socatfishv1k31loc102733t1	18536.9	8832	13118	13376	
    kfishevg:Funhe5EG014955t1	19833.5	12321	23137	28556	
    medaka:ENSORLP00000022735	33417.5	16233	25294	26141	
    stickleback:ENSGACP00000000362	27711.8	14071	23265	26129	
    tilapia:ENSONIP00000007298	25333	12654	20549	22001
HSAPI:ENSG00000131018	8797	cfishevg:socatfishv2k25loc98569t1	9838	5199	8842	8727	
    kfishevg:Funhe5EG001671t1	9640	5076	8778	8743	
    medaka:ENSORLP00000018390	8138	4540	8837	8775	
    stickleback:ENSGACP00000011594	8357	4582	8832	8782	
    tilapia:ENSONIP00000000873	9817	5221	8857	8743

R-Stat Tables

#R-shell ..............
NSTAT <- 1000
# YALN only, pct of ref only
ylab <- "Protein alignment %"; fkey <- "paln"; ymax <- 100

tvars<- c("tid","bits","iden","algn","slen");
tinset = system(paste("ls -1 ",ingdir,"/ref*1000*.topgene.tab",sep=""),intern = T)
inttab <- tinset[1]

for (inttab in tinset) {

  gtab <- read.table(inttab, header=F, na.strings="na")
  nspp <- (ncol(gtab) - 2) / 5
  inghead <- c("refid","rlen");
  for(c in 1:nspp) { inghead <- c(inghead, paste(tvars,c,sep="")) }
  colnames(gtab) <- inghead
  
  gname= sub("refbee","RefHoneybee", sub("reftribo","RefBeetle",
    sub("refman","RefHuman", sub("refarath","RefArath",
    sub("beetleslocustwhifly","insects",
    sub("ref3a","RefHuDaTr", sub("ref3d","RefHuTrWa", sub("1000","", sub("_top1000","",
    sub(".topgene.tab","",sub("/Users.*/","",inttab))))) ))))))

  nrmax <- min(nrow(gtab),NSTAT)
  gmat <- gtab[1:nrmax,grep("rlen|algn",colnames(gtab))]

  labid <- gtab[1,grep("tid",colnames(gtab))] ## missing? YES: check >1; can use levels(labid[[i]])[1]
  for (i in 1:length(labid)) { lid <- levels(labid[[i]])[1]; 
    lid<- sub(":.*","",as.character(lid)); 
    lid <- sub("kfish","killifish",sub("cfish","catfish",lid))
    lid <- sub("whitefly","whitefly1evg",sub("locust","locust1evg",lid));
    lid <- sub("(evg|evr|gno|tsa)","_\\1",lid)
    labid[[i]] <- lid; }
  
  # SUBSET here..
  
  dmat <- 100*(gmat[,2:ncol(gmat)] / gmat[,1]) #  ave pct of ref
  colnames(dmat) <- labid;
  dmat[dmat>100] <- 100;   
  nc<- ncol(dmat)
  nr<- nrow(dmat)
  
  dmax <- apply(dmat,1,max)
  dmean<- mean(dmat); dsd <- sd(dmat)
  maxmat<- 100*dmat/dmax  # use this instead of dmat - dmax ?
  maxmean<- mean(maxmat);
  dstats <- data.frame( matrix(0,nrow=nc,ncol=8), row.names=labid)
  for (j in 1:nc) { 
    dxt <- t.test(maxmat[,j] - 100, alternative="less");
    nmax90 <- sum( maxmat[,j] >= 90); pmax90 <-  round(100 * nmax90 / nr,1)
    rstat <- cbind(dref=dmean[j],sdref=dsd[j],dmax=maxmean[j],
          nmax90=nmax90, pmax90=pmax90,
          tmax=dxt$statistic,pmax=dxt$p.value,df=dxt$parameter )
    dstats[j,] <- rstat
    if(j==1) colnames(dstats)<- colnames(rstat)
  }

cat(ylab, "for", gname,"\n"); 
print( dstats);
cat('-------------------------------------------',"\n\n"); 

}

R-PLOT Ref gene x TrAssembly Top Sizes for multi-species/ref

#R-shell ..............
tvars<- c("tid","bits","iden","algn","slen");
ingdir<- "/Users/gilbertd/Desktop/dspp-work/arthropod/aabugs4/aabugs4qual/tsaevg/aaeval/ref3redo"
tinset = system(paste("ls -1 ",ingdir,"/ref*1000*.topgene.tab",sep=""),intern = T)
inttab <- tinset[1]

NSHOW <- 200 ; plwidth <- 14 # NSHOW <- 100 ; plwidth <- 8;  # 
PSAME <- 0.05; NOSAME <- F # NOSAME <- T ; # 
SUBSET <- "";  subname<-""
SUBSET <- "banana";  SUBSET <- "cacao";  
SUBSET <- "daph"; subname<- "daphnia"
SUBSET <- "beet";  subname<-"beetles";
SUBSET <- "pgbeetl_evg|whitefly|locust";  subname<-"insect3"; # various evgR insects from beetleslocustwhifly
SUBSET <- "daphmag5_evg|litova1_evg|penaeus_evg|penaeus_tsa";  subname<-"crustacea4"; # various evgR insects from beetleslocustwhifly
#..............

  gtab <- read.table(inttab, header=F, na.strings="na")
  nspp <- (ncol(gtab) - 2) / 5
  inghead <- c("refid","rlen");
  for(c in 1:nspp) { inghead <- c(inghead, paste(tvars,c,sep="")) }
  colnames(gtab) <- inghead

  gname= sub("refbee","RefHoneybee", sub("reftribo","RefBeetle",
    sub("refman","RefHuman", sub("refarath","RefArath",
    sub("beetleslocustwhifly","insects",
    sub("ref3a","RefHuDaTr", sub("ref3d","RefHuTrWa", sub("1000","", sub("_top1000","",
    sub(".topgene.tab","",sub("/Users.*/","",inttab))))) ))))))
    
  nrmax <- min(nrow(gtab),NSHOW)
  maint <- "Top Sizes"
  # YALN only, pct of ref only
  ylab <- "Protein alignment %"; fkey <- "paln"; ymax <- 100
  gmat <- gtab[1:nrmax,grep("rlen|algn",colnames(gtab))]

  labid <- gtab[1,grep("tid",colnames(gtab))] ## missing? YES: check >1; can use levels(labid[[i]])[1]
  for (i in 1:length(labid)) { lid <- levels(labid[[i]])[1]; 
    lid<- sub(":.*","",as.character(lid)); 
    lid <- sub("kfish","killifish",sub("cfish","catfish",lid))
    lid <- sub("whitefly","whitefly1evg",sub("locust","locust1evg",lid));
    lid <- sub("(evg|evr|gno|tsa)","_\\1",lid)
    labid[[i]] <- lid; }
  
  if(nchar(SUBSET)>1) {
    colset <- grep(SUBSET,labid) ## FIXME: move this up after read(gmat)
    if(length(colset)>1) { 
      labid <- labid[colset];  gmat <- gmat[,c(1,1+colset)] ;
      if(nchar(subname)<3) subname<- gsub('|','_',SUBSET);
      fkey <- paste(fkey,subname,sep="-") #? gname also
      }
  }  
  
  dmat <- 100*(gmat[,2:ncol(gmat)] / gmat[,1]) # use this == ave pct of ref
  dmat[dmat>100] <- 100;  ## gmat <- dmat  
  dmean= mean(dmat)
  dmax <- apply(dmat,1,max)
  colnames(dmat) <- labid;
  for (j in 1:length(dmean)) { pl<-paste("(",round(dmean[j]),"% ref)",sep=""); 
    labid[j]= paste(labid[j],pl); }
  nx= seq(to=nrow(dmat)) #? got 99 for nrow=100 
  nc<- ncol(dmat)
  
  gcol=terrain.colors(1+nc); # rainbow(nc); #
  gplotfile <- sub("topgene.tab",paste("top",fkey,".pdf",sep=""),inttab)
  labmain=paste("Ref gene x TrAssembly: ",maint,"for",gname)

  pdf( gplotfile,  width=plwidth, height=2.5*nc) 
  layout(matrix(1:nc,nrow=nc))

  for (j in 1:nc) { 
    labmainj= paste(gname, labid[j],sep=": ")
    plot( nx, dmax, type="h", col="skyblue2", lwd=3,  ## pink3 skyblue2 deeppink3 
      ylim=c(1,ymax), ylab = ylab, xlab="Longest Ref Genes", main=labmainj )
    lines( nx, dmat[,j], type="h", col="orangered1", lwd=3)  ## orangered3  gcol[j]
  }

  dev.off()


Developed at the Genome Informatics Lab of Indiana University Biology Department