
Index of /EvidentialGene/other/rnasets_srapublic

      Name                                  Last modified       Size

[DIR] Parent Directory                      21-May-2017 19:09   -
[TXT] sra_rnaseq2_201403.arpods.listan      30-Mar-2014 22:06   2k
[TXT] sra_rnaseq2_201403.csv                30-Mar-2014 12:11   9.2M
[TXT] sra_rnaseq2_201403.flies.listan       30-Mar-2014 22:11   12k
[TXT] sra_rnaseq2_201403.insects.listan     30-Mar-2014 22:10   8k
[TXT] sra_rnaseq2_201403.pe100m.listan      30-Mar-2014 16:49   32k
[TXT] sra_rnaseq2_201403.readme.txt         05-Apr-2014 12:39   4k
[TXT] sra_rnaseq2_201403.readme2.txt        06-Aug-2014 20:27   4k
[TXT] sra_rnaseq2_201408.csv                06-Aug-2014 20:08   20.4M
[TXT] sra_rnaseq2_201408.pe100m.listan      06-Aug-2014 20:22   41k
[TXT] sra_rnaseq2_201411.csv                17-Nov-2014 15:38   19.7M
[TXT] sra_rnaseq2_201411.pe100m.listan      17-Nov-2014 15:50   48k
[TXT] sra_rnaseq2_201411.readme2.txt        18-Jan-2016 20:41   6k
[TXT] sra_rnaseq2_201506.csv                09-Jun-2015 13:34   27.6M
[TXT] sra_rnaseq2_201506.pe100m.listan      09-Jun-2015 14:17   64k
[TXT] sra_rnaseq2_201506.readme2.txt        09-Jun-2015 14:27   6k
[TXT] sra_rnaseq2_201509.csv                15-Sep-2015 15:24   24.9M
[TXT] sra_rnaseq2_201509.pe100m.listan      15-Sep-2015 15:35   70k
[TXT] sra_rnaseq2_201509.readme2.txt        17-Sep-2015 13:54   2k
[TXT] sra_rnaseq2_201601.csv                18-Jan-2016 20:45   35.1M
[TXT] sra_rnaseq2_201601.pe100m.listan      18-Jan-2016 20:55   85k
[TXT] sra_rnaseq2_201601.readme2.txt        18-Jan-2016 21:24   3k


Publicly available RNA-Seq data from NCBI SRA, 2014.03
collected by Don Gilbert, gilbertd At indiana.edu, EvidentialGene at  euGenes.org

These data sets are suitable for assembly into complete species gene sets.  Some of these
arthropod species lack public gene sets, or have only fragmented, low-quality ones.
Assembling these data into good-quality gene sets would be interesting and valuable, and
the EvidentialGene pipeline is now ready to do this.

Please see the subset lists sra_rnaseq*.arpods.listan, *.insects.listan and *.flies.listan:
  arpods  = arthropods other than insects
  insects = insects other than Diptera
  flies   = dipterans (Drosophila, etc.)
Table columns:
  PairSpots   = number of read pairs (100 M minimum in these tables)
  nSets       = number of experiment sets
  Mbases      = megabases of RNA-seq
  Species     = species name
  SpeciesInfo = clade, taxid, common name, taxon lineage
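
To work with the subset lists programmatically, a reader could load them along these lines.
This is only a sketch in Python, not part of the original scripts. It assumes the .listan
files are tab-separated with the columns named above, and that SpeciesInfo is a
comma-separated field whose first two items are clade and taxid. The names ListanRow and
parse_listan are hypothetical.

```python
from typing import Iterable, List, NamedTuple


class ListanRow(NamedTuple):
    pair_spots: int
    n_sets: int
    mbases: int
    species: str
    clade: str
    taxid: str
    other_info: str  # whatever follows clade and taxid in SpeciesInfo


def parse_listan(lines: Iterable[str]) -> List[ListanRow]:
    """Parse tab-separated .listan lines; header and comment lines are skipped."""
    rows = []
    for line in lines:
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 4 or not fields[0].isdigit():
            continue  # not a data line (assumed: data lines start with a count)
        info = fields[4] if len(fields) > 4 else ""
        # Split SpeciesInfo into clade, taxid and the remainder; pad if short.
        parts = info.split(",", 2) + ["", "", ""]
        rows.append(ListanRow(int(fields[0]), int(float(fields[1])),
                              int(float(fields[2])), fields[3],
                              parts[0], parts[1], parts[2]))
    return rows
```

For example, parse_listan(open("sra_rnaseq2_201403.flies.listan")) would return one row
per species, which can then be filtered or sorted by pair_spots.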

Potential collaborators or biologists with interest in species in these RNA sets should
contact Don about this.

  Gene set completeness for mRNA-genes and Genome-genes of Ticks
      Human genes found (n=16631)
geneset         hit%    alnh    alnt    Gene set method, species
................................................................
ixodes.evg      95.7    434     415     mRNA-assembly, deer tick (2014.04 rough draft)
ztick.evg       91.4    416     380     mRNA-assembly, zebra tick
ixodes.gno      89.5    364     326     genome-predict, deer tick
tetur.gno       83.2    399     332     genome-predict, spider mite
................................................................
   hit%= percent of ref genes found
   alnh= alignment average, for hit genes
   alnt= alignment average, for all ref genes
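
The two alignment averages in the table appear to be related through the hit rate. If
reference genes with no hit are counted as zero alignment (an assumption, not stated
above), then alnt should be close to alnh * hit% / 100. The short Python check below,
using only the values from the table, suggests that reading is consistent with the
numbers; it is a sanity check, not the method used to compute them.

```python
# (geneset, hit%, alnh, alnt) copied from the table above
TABLE = [
    ("ixodes.evg", 95.7, 434, 415),
    ("ztick.evg", 91.4, 416, 380),
    ("ixodes.gno", 89.5, 364, 326),
    ("tetur.gno", 83.2, 399, 332),
]


def expected_alnt(hit_pct: float, alnh: float) -> float:
    """Average over all reference genes, assuming missed genes contribute zero."""
    return alnh * hit_pct / 100.0


for name, hit_pct, alnh, alnt in TABLE:
    print(f"{name}\texpected={expected_alnt(hit_pct, alnh):.1f}\treported={alnt}")
```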

#...............................
Tables from http://www.ncbi.nlm.nih.gov/sra
query=
(("biomol transcriptomic"[Properties]) AND "platform illumina"[Properties]) AND "library layout paired"[Properties]
 Taxonomic Groups  n=25171 public set
    eukaryotes (23288)
        animals (17892)
            chordates (14723)
            arthropods (2237)
            nematodes (541)
            more... (391)
        green plants (3899)
            land plants (3705)
            more... (194)
        fungi (1012)
        apicomplexans (208)
        ciliates (46)
        more... (231)
    bacteria (1308)
    unclassified (553)
    viruses (17)
#...................
The output file sra_rnaseq2_201403.csv contains all of the sets counted above.

# Sum spots, sets and Mbases per species; keep species with 100M or more spots
cat sra_rnaseq2_201403.csv | perl -ne \
'chomp; s/^"//; s/"$//; @v=split"\",\""; if($v[0]=~/Experiment/) { @hd=@v; next; }
($sr,$sp,$mb,$nr,$ns,$libsel)=@v[0,2,9,10,11,17]; 
$sp=~s/ sp\..*$//; $mb{$sp}+=$mb; $nr{$sp}+=$nr; $ns{$sp}+=$ns;
END{ print join("\t",qw(PairSpots nSets Mbases Species))."\n";
for $s (sort{ $ns{$b}<=>$ns{$a} or $a cmp $b } keys %ns) {
($ns,$nr,$mb)=($ns{$s},$nr{$s},$mb{$s}); $mb=int($mb); 
print join("\t",$ns,$nr,$mb,$s)."\n" if($ns>99999999); } }' \
 > sra_rnaseq2_201403.pe100m.list
# n=580 at 100M+ spots, includes bacteria
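
For readers who prefer Python, the per-species summary above can be sketched as follows.
This is an illustration, not the author's script. The column positions (2 = species,
9 = Mbases, 10 = sets, 11 = spots) are copied from the Perl one-liner; the function name
summarize_sra is hypothetical, and the csv module is swapped in for the Perl split on
quoted commas.

```python
import csv
import io
import re
from collections import defaultdict

# Column positions taken from the Perl one-liner: @v[2,9,10,11]
SPECIES, MBASES, NSETS, SPOTS = 2, 9, 10, 11


def _num(text):
    """Treat empty or non-numeric fields as zero, as Perl does."""
    try:
        return float(text)
    except ValueError:
        return 0.0


def summarize_sra(csv_text, min_spots=100_000_000):
    """Return (PairSpots, nSets, Mbases, Species) rows for species with enough spots."""
    totals = defaultdict(lambda: [0.0, 0.0, 0.0])  # spots, sets, mbases
    for row in csv.reader(io.StringIO(csv_text)):
        if len(row) <= SPOTS or "Experiment" in row[0]:
            continue  # short line or header line
        species = re.sub(r" sp\..*$", "", row[SPECIES])  # merge "Genus sp. XYZ" entries
        t = totals[species]
        t[0] += _num(row[SPOTS])
        t[1] += _num(row[NSETS])
        t[2] += _num(row[MBASES])
    rows = [(int(s), int(n), int(m), sp)
            for sp, (s, n, m) in totals.items() if s >= min_spots]
    rows.sort(key=lambda r: (-r[0], r[3]))  # most spots first, then by name
    return rows
```

Calling summarize_sra(open("sra_rnaseq2_201403.csv").read()) would give rows in the same
column order as the pe100m.list table.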

# Run the species names through NCBI Taxonomy: species > CommonTree > taxid, then taxid > Batch Entrez > taxres.xml
0. cut the species names from sra.pe100m.list > spp.list
1. spp.list > http://www.ncbi.nlm.nih.gov/Taxonomy/CommonTree/wwwcmt.cgi > taxid.list, commontree.txt
2. taxid.list > http://www.ncbi.nlm.nih.gov/sites/batchentrez + taxonomy > taxres.xml

# Extract TaxId, Division, names and Lineage of each top-level Taxon into a tab-separated table
cat sra_rnaseq2_201403.taxres.xml | perl -ne \
'if(/^.Taxon\b/) { $nt++; $in=1; } elsif(/^\s+.Taxon\b/) { $in++; } elsif(m,^./Taxon\b,) { $in=0; } 
elsif($in==1 and /<(TaxId|ScientificName|GenbankCommonName|CommonName|Division|Lineage)>([^\<]+)/) { 
($t,$v)=($1,$2); $tid=$v if($t eq "TaxId"); $tv{$tid}{$t}.="$v,"; $tn{$t}++; } 
END{ @tn= qw(TaxId Division ScientificName GenbankCommonName CommonName Lineage );  
print join("\t",@tn)."\n"; $k="Division"; $j="Lineage"; 
foreach $t (sort{ $tv{$a}{$k} cmp $tv{$b}{$k} or $tv{$a}{$j} cmp $tv{$b}{$j} or $a <=> $b} keys %tv) {
@v=@{$tv{$t}}{@tn}; map{ s/,$// } @v; print join("\t",@v)."\n"; } }' \
 > sra_rnaseq2_201403.taxres.tab
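
The same extraction can be sketched with an XML parser instead of line matching, which
does not depend on how the file is indented. The Python below is an illustration under
assumptions: it expects a root element containing top-level <Taxon> records, with the
common names either directly under <Taxon> or inside an <OtherNames> child. The function
name taxon_rows is hypothetical.

```python
import xml.etree.ElementTree as ET

# Same column order as the Perl one-liner writes to taxres.tab
COLUMNS = ("TaxId", "Division", "ScientificName",
           "GenbankCommonName", "CommonName", "Lineage")


def taxon_rows(xml_text):
    """Return one tuple per top-level Taxon, sorted by Division, Lineage, TaxId."""
    root = ET.fromstring(xml_text)
    records = []
    for taxon in root.findall("Taxon"):  # direct children only, so nested lineage taxa are skipped
        rec = {}
        for name in COLUMNS:
            # Assumed layout: field is a direct child or sits under <OtherNames>
            found = taxon.findall(name) + taxon.findall("OtherNames/" + name)
            rec[name] = ",".join((e.text or "").strip() for e in found)
        records.append(rec)
    records.sort(key=lambda r: (r["Division"], r["Lineage"], int(r["TaxId"] or 0)))
    return [tuple(r[c] for c in COLUMNS) for r in records]
```

Joining each returned tuple with tabs gives lines in the same layout as taxres.tab.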

# Append Division,taxid,common name to each eukaryote species line of the pe100m list
cat sra_rnaseq2_201403.taxres.tab sra_rnaseq2_201403.pe100m.list | perl -ne \
'if(/; Eukaryota/) { chomp; ($tx,$dv,$sp,$cg,$cn,$ln)=split"\t"; $spv{$sp}="$dv,tx$tx,$cg"; } 
elsif(/^\d/) { ($np,$ns,$mb,$sp)=split"\t"; chomp($sp); if($sv=$spv{$sp}) { s/$/\t$sv/; print; } }
else { print; } '\
 > sra_rnaseq2_201403.pe100m.listan
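
The annotation step is a lookup by species name, which can be sketched in Python as below.
This is a hedged illustration, not the original code: it assumes taxonomy rows in the
taxres.tab column order (TaxId, Division, ScientificName, GenbankCommonName, CommonName,
Lineage) and list rows as (PairSpots, nSets, Mbases, Species). The function name annotate
is hypothetical. As in the Perl, only species whose lineage contains "; Eukaryota" are
kept, and unmatched species are dropped.

```python
def annotate(tax_rows, list_rows):
    """Append 'Division,tx<taxid>,GenbankCommonName' to matching eukaryote species rows."""
    info = {}
    for taxid, division, species, genbank_name, _common, lineage in tax_rows:
        if "; Eukaryota" in lineage:  # same filter as the Perl one-liner
            info[species] = f"{division},tx{taxid},{genbank_name}"
    annotated = []
    for spots, nsets, mbases, species in list_rows:
        if species in info:  # species without a eukaryote taxonomy record are skipped
            annotated.append((spots, nsets, mbases, species, info[species]))
    return annotated
```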
#....................................


Developed at the Genome Informatics Lab of Indiana University Biology Department