09.05.2014 Views

FY2010 - Oak Ridge National Laboratory


Director’s R&D Fund: Systems Biology and the Environment

Systems Biology, the Joint Genome Institute (JGI) partnership, Bioremediation, Microbial Sequencing, and Genome Annotation that will be critically dependent on appropriate computational capabilities for data management, annotation, and support of experimental direction. The prototype developed in this project will demonstrate a new approach that could be expanded for these programs and the larger research community.

Results and Accomplishments

A number of whole-genome transcriptome data sets for C. thermocellum were collected under normal and ethanol stress conditions using older standard gene expression array technology, newer high-density tiling array technology, and high-throughput sequencing. Analysis methods were developed, or integrated and improved, to determine differential expression and to quantify genes and other features such as operons. Tools were also developed to visualize both tiling array data and sequencing-based gene expression (RNAseq) data in conjunction with genome annotation. Analysis using these tools shows that transcription in the bacterial cell is much more complicated than previously known. We were able to detect several previously unknown features such as 5′ regulatory RNAs, small regulatory RNAs, and alternative transcription start sites, including ones in the middle of annotated genes.
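
To make the idea of determining differential expression from sequencing data concrete, the following is a minimal sketch and not the analysis method developed in this project. It assumes per-gene read counts are already available for a control and an ethanol-stressed sample with the same set of genes, normalizes each sample to counts per million, and reports a log2 fold change with a pseudocount. The gene names, the counts, and the function names are hypothetical and invented for illustration.

```python
import math


def counts_per_million(counts):
    """Scale raw read counts so that each sample sums to one million."""
    total = sum(counts.values())
    return {gene: 1e6 * n / total for gene, n in counts.items()}


def log2_fold_change(control, treated, pseudocount=1.0):
    """Return log2(treated / control) per gene on CPM-normalized values.

    The pseudocount avoids division by zero for unexpressed genes.
    Both dictionaries are assumed to contain the same genes.
    """
    cpm_c = counts_per_million(control)
    cpm_t = counts_per_million(treated)
    return {
        gene: math.log2((cpm_t[gene] + pseudocount) / (cpm_c[gene] + pseudocount))
        for gene in control
    }


# Hypothetical gene identifiers and read counts, invented for illustration.
control = {"geneA": 100, "geneB": 300, "geneC": 600}
ethanol = {"geneA": 400, "geneB": 300, "geneC": 300}

lfc = log2_fold_change(control, ethanol)
for gene, value in sorted(lfc.items()):
    print(f"{gene}\t{value:+.2f}")
```

A real analysis would also need replicates and a statistical test before calling a gene differentially expressed; the fold change alone only ranks candidates.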

Inference of genetic regulatory circuits depends on many things, including accurate genome annotation, correct quantification of the genes in the cell, precise structure of operons including transcription start sites, quantification of the transcription factors in the cell, and accurate models for the association of transcription factors to binding sites. Conventional data models (e.g., GenBank files) are inadequate for assembling data of different types into a computational model of the genetic regulatory network because they do not adequately describe concepts associated with regulatory networks or the underlying assumptions of the data used to generate these models. Over the course of this project, we investigated multiple data representation alternatives, including SBML, XML, Chado, and RDF/XML. The goal was to find a model that was appropriate for representing a genetic regulatory network. We determined that BioPAX (built on RDF/XML) is a community-adopted data standard capable of representing genetic regulatory networks. We developed two alternative methods for linking traditional genome annotations to the BioPAX standard. First, we implemented a proof-of-concept annotation representation based on the Semantic Web (RDF/XML). This approach allows one to merge the BioPAX annotation directly with annotation data using the SPARQL query language. Second, we explored directly importing all of the concepts stored in raw ontologies, together with the raw data from expression experiments, into a Chado relational database schema. The relational database approach is currently better suited to production use, whereas the Semantic Web-based representation is better suited to sharing of scientific data and transparency. Future advancements in Semantic Web technologies may also make it suitable for production.
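
The appeal of the Semantic Web representation is that two data sets expressed as subject-predicate-object triples can be merged by simple union and then queried together. The sketch below illustrates that idea in plain Python under stated assumptions: it is not the project's implementation, it does not use RDF/XML or a SPARQL engine, and the predicate names (`hasProduct`, `activates`, `locatedOn`) and identifiers are hypothetical rather than BioPAX vocabulary. The `query` function only mimics how a basic graph pattern joins several triple patterns on shared variables.

```python
def match(triples, pattern, binding=None):
    """Yield variable bindings for one (s, p, o) pattern.

    Terms starting with '?' are variables; all other terms must match exactly.
    """
    binding = binding or {}
    for triple in triples:
        new = dict(binding)
        ok = True
        for term, value in zip(pattern, triple):
            if term.startswith("?"):
                # Bind the variable, or check it against an earlier binding.
                if new.setdefault(term, value) != value:
                    ok = False
                    break
            elif term != value:
                ok = False
                break
        if ok:
            yield new


def query(triples, patterns):
    """Join several patterns on their shared variables."""
    bindings = [{}]
    for pattern in patterns:
        bindings = [b2 for b in bindings for b2 in match(triples, pattern, b)]
    return bindings


# Hypothetical genome annotation triples.
annotation = {
    ("gene1", "hasProduct", "proteinX"),
    ("gene2", "hasProduct", "proteinY"),
    ("gene1", "locatedOn", "chromosome"),
}
# Hypothetical regulatory-network triples from a separate source.
regulation = {("proteinX", "activates", "gene2")}

# Merging two triple sets is a set union.
merged = annotation | regulation

# Which gene encodes a factor that activates which target gene?
results = query(
    merged,
    [("?reg_gene", "hasProduct", "?tf"), ("?tf", "activates", "?target")],
)
print(results)
```

Neither source alone can answer the question, since the link from `gene1` to `gene2` only appears once the annotation and regulation triples share one graph. In a relational schema the same join would require agreeing on tables and keys up front, which is one way to read the production-versus-sharing trade-off described above.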

Program Development

Since this project began, it has become clear that transcriptomics will be the “next big wave” in systems biology research, based on rapid advancements in sequencing technology and RNAseq. With the new Illumina technology using 100× sample multiplexing, RNAseq data will be cheaper to generate than data from alternative technologies, so it was fortuitous that we focused on this area. We have discussed with JGI management incorporating RNAseq transcriptomics into the JGI sequencing and annotation pipeline. As a preliminary test, they have agreed to sequence 12 samples from some of the Caldicellulosiruptor genomes being studied by ORNL’s BioEnergy Science Center using their Illumina sequencing machines. We will process this data using the RNAseq analysis pipeline created as part of this project. Successful results could lead to additional Caldicellulosiruptor transcriptome RNAseq samples being sequenced by both JGI and ORNL, and eventually to an expanded ORNL annotation pipeline that would include transcriptome analysis as a new product for all researchers. New modifications to the
