FY2010 - Oak Ridge National Laboratory
Director’s R&D Fund—<br />
Systems Biology and the Environment<br />
Systems Biology, the Joint Genome Institute (JGI) partnership, Bioremediation, Microbial Sequencing,<br />
and Genome Annotation that will be critically dependent on appropriate computational capabilities for<br />
data management, annotation, and support of experimental direction. The prototype developed in this<br />
project will demonstrate a new approach that could be expanded for these programs and the larger<br />
research community.<br />
Results and Accomplishments<br />
A number of whole genome transcriptome data sets for C. thermocellum were collected under normal and<br />
ethanol stress conditions using older standard gene expression array technology, newer high-density tiling<br />
array technology, and high-throughput sequencing. Analysis methods were developed or integrated and<br />
improved to determine differential expression and quantify genes and other features such as operons.<br />
Tools were also developed to visualize both tiling array data and sequencing-based gene expression<br />
(RNAseq) data alongside the genome annotation. Analysis using these tools shows<br />
that transcription in the bacterial cell is much more complicated than previously known. We were able to<br />
detect the presence of several previously unknown features such as 5′ regulatory RNAs, small regulatory<br />
RNAs, and alternative transcription start sites including ones in the middle of annotated genes.<br />
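As an illustration of what determining differential expression between the normal and ethanol stress conditions can involve, the following minimal Python sketch computes a library-size-normalized log2 fold change per gene. It is not the analysis method developed in this project; the function name `log2_fold_change`, the pseudocount, and the toy read counts are assumptions chosen for illustration only.

```python
import math


def log2_fold_change(control, treated, pseudocount=1.0):
    """Return log2(treated / control) per gene.

    Raw counts are first scaled to counts per million (CPM) so that the two
    samples are comparable even if they were sequenced to different depths.
    The pseudocount avoids division by zero for unexpressed genes.
    """
    control_total = sum(control.values())
    treated_total = sum(treated.values())
    result = {}
    # Only genes measured in both samples can be compared.
    for gene in sorted(control.keys() & treated.keys()):
        control_cpm = control[gene] / control_total * 1e6
        treated_cpm = treated[gene] / treated_total * 1e6
        result[gene] = math.log2(
            (treated_cpm + pseudocount) / (control_cpm + pseudocount)
        )
    return result


# Hypothetical read counts, not data from the project.
toy_control = {"geneA": 100, "geneB": 100}
toy_stress = {"geneA": 300, "geneB": 100}

for gene, value in log2_fold_change(toy_control, toy_stress).items():
    print(f"{gene}: {value:+.2f}")
```

A positive value indicates higher relative expression under the stress condition and a negative value indicates lower relative expression. A real analysis would also need replicates and a statistical test before calling a gene differentially expressed.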
Inference of genetic regulatory circuits depends on many factors, including accurate genome annotation,<br />
correct quantification of the genes in the cell, precise structure of operons including transcription start<br />
sites, quantification of the transcription factors in the cell, and accurate models for the association of<br />
transcription factors to binding sites. Conventional data models (e.g., GenBank files) are inadequate for<br />
assembling data of different types into a computational model of the genetic regulatory network because<br />
they do not adequately describe concepts associated with regulatory networks and the underlying<br />
assumptions of the data used to generate these models. Over the course of this project, we investigated<br />
multiple data representation alternatives including SBML, XML, Chado, and RDF/XML. The goal was to<br />
find a model that was appropriate for representing a genetic regulatory network. We were able to<br />
determine that BioPAX (built on RDF/XML) is a community-adopted data standard capable of<br />
representing genetic regulatory networks. We developed two alternative methods for linking traditional<br />
genome annotations to the BioPAX standard. First, we implemented a proof-of-concept annotation<br />
representation based on the Semantic Web (RDF/XML). This approach allows one to merge the BioPAX<br />
annotation directly with annotation data using the SPARQL query language. Second, we explored the<br />
approach of directly importing all of the concepts stored in raw ontologies and the raw data from<br />
expression experiments into a Chado relational database schema. The relational database approach is<br />
currently better suited to production use, whereas the Semantic Web-based representation is better suited<br />
for sharing of scientific data and transparency. Future advancements in Semantic Web technologies may<br />
also make it suitable for production.<br />
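The relational approach can be pictured with a small, self-contained sketch. The example below uses Python's built-in `sqlite3` module and a deliberately simplified, hypothetical three-table layout (`feature`, `regulation`, `expression`). It is not the actual Chado schema, which is considerably richer, and the rows are made-up placeholders. It only shows how annotation, regulatory links, and expression measurements can be joined in a single query.

```python
import sqlite3

# In-memory database; all table and column names here are hypothetical.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE feature (
    feature_id   INTEGER PRIMARY KEY,
    name         TEXT NOT NULL,
    feature_type TEXT NOT NULL
);
CREATE TABLE regulation (
    regulator_id INTEGER REFERENCES feature(feature_id),
    target_id    INTEGER REFERENCES feature(feature_id)
);
CREATE TABLE expression (
    feature_id       INTEGER REFERENCES feature(feature_id),
    sample_condition TEXT NOT NULL,
    level            REAL NOT NULL
);
""")

# Placeholder rows: one transcription factor regulating one gene.
conn.executemany(
    "INSERT INTO feature VALUES (?, ?, ?)",
    [(1, "tfX", "transcription_factor"), (2, "geneY", "gene")],
)
conn.execute("INSERT INTO regulation VALUES (?, ?)", (1, 2))
conn.executemany(
    "INSERT INTO expression VALUES (?, ?, ?)",
    [(2, "control", 10.0), (2, "ethanol_stress", 40.0)],
)

# For every regulator-target edge, list the target's expression per condition.
rows = conn.execute("""
SELECT r.name, t.name, e.sample_condition, e.level
FROM regulation AS reg
JOIN feature AS r ON r.feature_id = reg.regulator_id
JOIN feature AS t ON t.feature_id = reg.target_id
JOIN expression AS e ON e.feature_id = reg.target_id
ORDER BY e.sample_condition
""").fetchall()

for regulator, target, condition, level in rows:
    print(regulator, "->", target, condition, level)
```

Keeping the regulatory edges in their own table means that new evidence types (for example, binding-site predictions) could be added as further tables without changing the annotation itself.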
Program Development<br />
Since this project began, it has become clear that transcriptomics will be the “next big wave” in systems<br />
biology research, driven by rapid advancements in sequencing technology and RNAseq. With the new<br />
Illumina technology using 100× sample multiplexing, RNAseq data will be cheaper to generate than data from<br />
alternative technologies, so it was fortuitous that we focused on this area. We have discussed with JGI<br />
management incorporating RNAseq transcriptomics into the JGI sequencing and annotation pipeline.<br />
As a preliminary test, they have agreed to sequence 12 samples from some of the Caldicellulosiruptor<br />
genomes being studied by ORNL’s BioEnergy Science Center using their Illumina sequencing machines.<br />
We will process these data using the RNAseq analysis pipeline created as part of this project. Successful<br />
results could lead to additional Caldicellulosiruptor transcriptome RNAseq samples being<br />
sequenced by both JGI and ORNL and, eventually, to an expanded ORNL annotation pipeline<br />
that includes transcriptome analysis as a new product for all researchers. New modifications to the<br />