Currently, there are over 300,000 Sanger ESTs available for Pinus taeda in NCBI dbEST. In addition, the DOE Joint Genome Institute Pinus taeda EST project has generated 1,868.7 Mb of sequence, which was deposited in the NCBI Short Read Archive (SRA; accession numbers SRX025521, SRX025522, SRX025523, SRX025524 and SRX025525).
Using the aforementioned Sanger and 454 ESTs, we have created our first release of the Pinus taeda transcriptome assembly. Before EST clustering and assembly, the EST data were cleaned and trimmed using SeqClean and GS De Novo Assembler (454 Life Sciences, version 2.6). De novo assembly of the cleaned EST reads was then performed using CLC Genomics Workbench version 4.8. The resulting contig FASTA file and sequence alignment SAM file were processed by CBrowse, our generic software for visualizing and analyzing transcriptome assemblies from FASTA and SAM files. It is worth mentioning that we have also used CBrowse to create and publish PeanutDB. Similarly, we used CBrowse to generate PineDB, which focuses on the transcriptomic data available for loblolly pine, Pinus taeda.
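As a rough illustration of the kind of information that a contig FASTA file and a SAM alignment file carry together, the sketch below computes contig lengths and counts the mapped reads per contig. It is a minimal sketch and not CBrowse's actual implementation; the function names read_fasta_lengths and count_reads_per_contig are hypothetical, and it assumes a plain-text SAM file with the standard tab-separated columns (FLAG in column 2, reference name in column 3).

```python
from collections import Counter


def read_fasta_lengths(lines):
    """Return {contig_name: sequence_length} from an iterable of FASTA lines."""
    lengths = {}
    name = None
    for line in lines:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            # The contig name is the first whitespace-delimited token of the header.
            name = line[1:].split()[0]
            lengths[name] = 0
        elif name is not None:
            lengths[name] += len(line)
    return lengths


def count_reads_per_contig(sam_lines):
    """Count mapped alignment records per reference contig in SAM text."""
    counts = Counter()
    for line in sam_lines:
        if line.startswith("@"):
            continue  # header line
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 11:
            continue  # not a complete alignment record
        flag = int(fields[1])
        rname = fields[2]
        # FLAG bit 0x4 marks an unmapped read; '*' means no reference name.
        if flag & 0x4 or rname == "*":
            continue
        counts[rname] += 1
    return counts


if __name__ == "__main__":
    # Toy data for illustration only; these are not PineDB records.
    fasta = [">contig1 example", "ACGT", "AC", ">contig2", "GGG"]
    sam = [
        "@SQ\tSN:contig1\tLN:6",
        "r1\t0\tcontig1\t1\t60\t4M\t*\t0\t0\tACGT\tIIII",
        "r2\t16\tcontig1\t3\t60\t4M\t*\t0\t0\tGTAC\tIIII",
        "r3\t4\t*\t0\t0\t*\t*\t0\t0\tTTTT\tIIII",
    ]
    print(read_fasta_lengths(fasta))
    print(dict(count_reads_per_contig(sam)))
```

Both functions accept any iterable of lines, so open file handles can be passed directly for real FASTA and SAM files.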
In the current release (PineDB version 1, 2012-06-28), there are 35,550 contigs in our Pinus taeda transcriptome assembly, covering a total of 3,983,264 individual sequence reads. We not only provide GO, KEGG and EC annotations for these contigs, but have also extensively identified potential polymorphisms, both single-base (e.g., SNPs and single-base indels) and multiple-base (e.g., mismatches/indels spanning multiple bases), as well as simple sequence repeats in each contig. In addition, we offer user-friendly web interfaces that facilitate data retrieval, comparison, visualization and mining of the Pinus taeda transcriptome.
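To make the notion of a simple sequence repeat concrete, the following sketch scans a contig sequence for tandem repeats of short motifs using regular expressions. It is an illustration only and not the method used to build PineDB; the function names and the minimum-repeat thresholds (6 for dinucleotides, 5 for trinucleotides, 4 for tetranucleotides) are assumptions chosen for the example.

```python
import re


def is_primitive(motif):
    """True if the motif is not itself a repeat of a shorter unit (e.g. 'ATAT')."""
    return (motif + motif).find(motif, 1) == len(motif)


def find_ssrs(seq, min_repeats=None):
    """Return a sorted list of (start, motif, repeat_count) for tandem repeats.

    min_repeats maps motif length to the minimum number of tandem copies;
    the defaults are hypothetical thresholds, not values from PineDB.
    """
    if min_repeats is None:
        min_repeats = {2: 6, 3: 5, 4: 4}
    seq = seq.upper()
    hits = []
    for motif_len, min_rep in sorted(min_repeats.items()):
        # One captured motif followed by at least (min_rep - 1) further copies.
        pattern = re.compile(r"([ACGT]{%d})\1{%d,}" % (motif_len, min_rep - 1))
        for match in pattern.finditer(seq):
            motif = match.group(1)
            if not is_primitive(motif):
                continue  # already covered by a shorter motif, or a homopolymer
            repeats = len(match.group(0)) // motif_len
            hits.append((match.start(), motif, repeats))
    return sorted(hits)


if __name__ == "__main__":
    # Toy sequence for illustration only.
    example = "GGC" + "AT" * 8 + "CCG" + "AGC" * 5 + "TT"
    for start, motif, repeats in find_ssrs(example):
        print(start, motif, repeats)
```

Start positions are 0-based. Filtering out non-primitive motifs keeps a run such as (AT)8 from being reported a second time as (ATAT)4.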