
Index of /EvidentialGene/arthropods/honeybee/evg3hbee

      Name                    Last modified       Size

[DIR] Parent Directory        30-Jun-2015 22:42   -
[TXT]                         07-Jun-2014 19:02   1k
[TXT] evg3hbee.mrna2tsa.log   05-Jun-2014 13:04   24k
[TXT] evg3hbee.tr2aacds.log   01-May-2014 22:05   13k
[   ] evg3hbee.trclass.gz     01-May-2014 21:53   38.4M
[DIR] hbee_rnaseq/            30-Jul-2014 13:32   -
[DIR] inputset/               30-Jun-2015 22:38   -
[DIR] lola/                   30-Jul-2014 13:32   -
[DIR] publicset/              30-Jun-2015 22:39   -
[   ]                         07-Jun-2014 19:03   2k
[   ]                         01-May-2014 16:53   2k

====== Honey Bee EvidentialGene gene construction set ========
2014-jun-07, by Don Gilbert, gilbertd at indiana edu

This is a 'reference-free' gene set assembled from mRNA-seq, with no reference made to
a genome assembly and no training on, or mapping from, other species' genes. As such it has
different values than genome-based gene sets; an important one is that no external artifacts
or errors contribute to these genes.  Any protein orthology measured has not been influenced
by gene modelling using other species (with their artifacts), nor by genome assembly errors.

#t2ac: EvidentialGene VERSION 2013.07.27
#t2ac: BEGIN with cdnaseq= date= Thu May  1 14:16:23 PDT 2014
#t2ac: bestorf_cds= evg3hbee.cds nrec= 6156631
#t2ac: nonredundant_cds= evg3hbeenr.cds nrec= 2257631
#t2ac: nofragments_cds= evg3hbeenrcd1.cds nrec= 1353185
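
The three record counts in the #t2ac log lines above describe successive reductions of the
ORF set. A minimal sketch of that arithmetic in Python follows; the function name `retention`
and the print layout are illustrative only (not part of EvidentialGene), and the only data used
are the nrec values copied from the log.

```python
# Record counts copied from the #t2ac log lines above.
STEPS = [
    ("bestorf_cds", 6156631),
    ("nonredundant_cds", 2257631),
    ("nofragments_cds", 1353185),
]


def retention(steps):
    """Return rows of (name, nrec, percent of first step, percent of previous step)."""
    first = steps[0][1]
    prev = first
    rows = []
    for name, nrec in steps:
        rows.append((name, nrec, 100.0 * nrec / first, 100.0 * nrec / prev))
        prev = nrec
    return rows


if __name__ == "__main__":
    for name, nrec, pct_first, pct_prev in retention(STEPS):
        print(f"{name:18s} {nrec:8d} {pct_first:6.1f}% of input {pct_prev:6.1f}% of previous step")
```

Run as a script, this prints one line per reduction step with the share of records kept.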

# Class Table for evg3hbee.trclass 
class           okay%   drop%   n_okay  n_drop
althi           4.5     11.4    61969   154724
althi1          8.8     24.4    119671  331456
althia2         0       0.5     0       7687
altmfrag        0.4     0.4     6271    6534
altmfraga2      0       0       696     631
altmid          0.6     0.6     8381    9294
altmida2        0       0       690     498
main            4.4     4.6     60370   62381
maina2          0.3     0.2     4868    3490
noclass         2.3     7.7     32321   105044
noclassa2       0       0       138     166
parthi          0       16.2    0       220332
parthi1         0       9       0       122333
parthia2        0       2.4     0       33187
total           21.8    78.1    295375  1057757  # ok = 5% of
# AA-quality for okay set of evg3hbee.aa.qual (no okalt): all and longest 1000 summary
                 n=1000; average=2024; median=1725; min,max=1362,16948; gaps=9430,9.4
okay.all         n=97697; average=217; median=131; min,max=40,16948; gaps=681261,6.9
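
The class table above gives percentages in its first two number columns and transcript counts
in the last two. A short Python sketch to recompute the totals and percentages from the counts
follows. It assumes the percentages are taken against okay+drop combined and truncated (not
rounded) to one decimal, which is inferred from the numbers rather than stated in the log; the
names `CLASS_COUNTS` and `summarize` are made up for this example.

```python
# (okay, drop) transcript counts per class, copied from the class table above.
CLASS_COUNTS = {
    "althi": (61969, 154724),
    "althi1": (119671, 331456),
    "althia2": (0, 7687),
    "altmfrag": (6271, 6534),
    "altmfraga2": (696, 631),
    "altmid": (8381, 9294),
    "altmida2": (690, 498),
    "main": (60370, 62381),
    "maina2": (4868, 3490),
    "noclass": (32321, 105044),
    "noclassa2": (138, 166),
    "parthi": (0, 220332),
    "parthi1": (0, 122333),
    "parthia2": (0, 33187),
}


def summarize(counts):
    """Return (n_okay, n_drop, {class: (okay_pct, drop_pct)})."""
    n_okay = sum(ok for ok, _ in counts.values())
    n_drop = sum(dr for _, dr in counts.values())
    total = n_okay + n_drop

    def pct(n):
        # Assumption: truncate to one decimal place, as the table appears to do.
        return (1000 * n // total) / 10

    return n_okay, n_drop, {c: (pct(ok), pct(dr)) for c, (ok, dr) in counts.items()}


if __name__ == "__main__":
    n_okay, n_drop, pcts = summarize(CLASS_COUNTS)
    print("total", n_okay, n_drop)
    for name, (p_ok, p_dr) in pcts.items():
        print(f"{name:12s} {p_ok:5.1f} {p_dr:5.1f}")
```

The recomputed column sums should agree with the 'total' row of the table.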

Revised class count table for publicset, after removing 30,000 gut parasite genes (from a euglenoid), 
plus a subset of uninformative, short, fragmented transcripts (no homology, mostly no mapping to genome)

118027 althi      # high identity exon alignment alternates,
  5438 althim     # main/alt swaps
  5297 altmid     # lower identity exon align alts, will contain some paralogs
  2488 altmidfrag # shorter, low ident alts
 59018 main       # main loci with alternates (Apimel3aEVm000000t1), 
 13374 noclass    # and main-noclass without alts (IDt1), adding to 72392 "loci"
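
The locus count quoted above (main + noclass) and the number of alternates per locus can be
checked with a few lines of Python. This is only a sketch of the arithmetic; treating every
class other than main and noclass as an alternate is an assumption of the example, and the
names `PUBLICSET` and `locus_summary` are hypothetical.

```python
# Transcript counts per class, copied from the revised publicset table above.
PUBLICSET = {
    "althi": 118027,
    "althim": 5438,
    "altmid": 5297,
    "altmidfrag": 2488,
    "main": 59018,
    "noclass": 13374,
}

# Classes counted as one transcript per locus, per the table comments.
LOCUS_CLASSES = ("main", "noclass")


def locus_summary(counts, locus_classes=LOCUS_CLASSES):
    """Count loci, alternates, and mean alternates per locus."""
    loci = sum(counts[c] for c in locus_classes)
    # Assumption: everything that is not a locus class is an alternate transcript.
    alts = sum(n for c, n in counts.items() if c not in locus_classes)
    return {
        "loci": loci,
        "alternates": alts,
        "transcripts": loci + alts,
        "alts_per_locus": alts / loci,
    }


if __name__ == "__main__":
    for key, value in locus_summary(PUBLICSET).items():
        print(key, value)
```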

Find useful data in honeybee/evg3hbee/publicset/
  protein, cds and mRNA fasta sequence files
  annotation table with homology Name (from blastp)
  view.gff location table, mapped to apis amel45 genome assembly

This name list contains ortholog gene function names
for NOPATH mRNA loci (that is, loci that do not map to the amel45 genome). There are likely
some true bee genes in this set that are missing from an incomplete genome assembly.
I filtered out obvious contaminant mRNA assemblies, notably 30000 loci of
a euglenoid parasite(?) from gut mRNA.  Non-genome-mapping mRNA with bee-like
orthologs should be investigated by someone.  Many are found in other bees and wasps:
 973 DMELA, 857 wasp, 783 bomimp, 503 AECHI, 424 apimel, 361 megrot, 343 apidor, 154 apiflo, 125 TCAST

** Found Lola 100-alts same hub intron as Nasonia, Nasvit alts map to Apis genome

lola = longitudinals lacking (one of those cute fruitfly gene names); however, this gene
  is active/expressed and mucking around in most tissues, over the course of development,
  including brain/nervous tissue, where many alt proteins may come into play for
  interesting biology.

** Please investigate, you biologists: this may be a Hymenoptera-specific alternate expansion
affecting social/nervous behaviour (or maybe sting/venoms, or whatever.. I don't know)

same place Evigene Apis mRNA map.. 60 hub intron variants found in evg3hbee trasm,
   Group9.12       intron  1489888 -> 1506360..1670677 (200 kb span)
most are locus Apimel3aEVm002442t alts (252 alts listed in publicset, not all map w/ diff hub intron)
.. 4 loci have 250+ alts in pubset,
  Apimel3aEVm002442 = lola, amel45:GB53441 matches Nasvi2EG036900t == wasp lola
    locus Group9.12: 1482542i .. 1489888ihub > 1512692..1670677

  Apimel3aEVm000555 = DSCAM, 314 alts, aa=1674,87%,complete, Split genes
     locus Group4.13:611675 -> 611794..671399

  DSCAM is now a well-known multiply-alternate transcript locus, but it doesn't have
  all that many alt introns, just lots of exons to mix and match..

  more later..

Transcript assemblies and input mRNA seq
  mRNA seq used is all from public data sets found at NCBI SRA  for Apis mel.
  see the hobee_study.list and sra_result.cvs for SRA accessions

  A brief summary of the 6+ million de novo transcript assemblies made from this RNA
  is also there, with the primary statistic for selection and effort in the 'aastat' tables
  of average protein sizes found in each tr-assembly run.  I find this the best way to proceed,
  to learn early on whether one has enough mRNA assemblies for a full animal/plant gene set.
  Size and count proteins, not transcripts.  N50 transcript size is meaningless for mRNA
  gene sets.  Measure the 1000 longest proteins, a statistic that has a biological max and is
  strongly correlated with orthology (all the longest proteins are pretty much well-known now).
  (Use cd-hit to filter duplicates for a quick answer.)
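
  The 'longest 1000 proteins' statistic can be sketched in a few lines of Python. This is
  not the EvidentialGene aastat code, only an illustration under stated assumptions: the input
  is a protein FASTA stream, a '*' at the end of a sequence line is a stop marker and not a
  residue, duplicates have already been filtered (e.g. with cd-hit), and gap counts are ignored.
  The function names are hypothetical.

```python
import io
import statistics


def protein_lengths(handle):
    """Return the length of each sequence in a FASTA stream."""
    lengths = []
    current = None  # length of the record being read; None before the first header
    for line in handle:
        line = line.strip()
        if line.startswith(">"):
            if current is not None:
                lengths.append(current)
            current = 0
        elif current is not None:
            # Do not count a '*' stop marker at the end of a line as a residue.
            current += len(line.rstrip("*"))
    if current is not None:
        lengths.append(current)
    return lengths


def longest_summary(lengths, top=1000):
    """Summarize the `top` longest proteins: n, average, median, min, max."""
    longest = sorted(lengths, reverse=True)[:top]
    if not longest:
        raise ValueError("no protein sequences found")
    return {
        "n": len(longest),
        "average": round(statistics.mean(longest)),
        "median": statistics.median(longest),
        "min": longest[-1],
        "max": longest[0],
    }


if __name__ == "__main__":
    # Tiny in-memory example; replace with open("your_proteins.aa") for real data.
    demo = io.StringIO(">a\nMKT\nLLV*\n>b\nMA\n>c\nMKTAYIAK\n")
    print(longest_summary(protein_lengths(demo), top=2))
```

  Comparing this summary across tr-assembly runs gives the kind of early check described above.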

  As in other work, the 3 main assemblers, in order of value for making orthology-complete
  gene sets, are 1. Velvet/Oases, 2. Soap-denovoTrans, 3. Trinity  (I've learned TransAbyss
  is also useful, ~= Soap, but don't use it yet myself).  If you just run Trinity (or any 1 assembler),
  you are not getting complete assemblies of your mRNA-seq.

  more later on methods...

Developed at the Genome Informatics Lab of Indiana University Biology Department