2010 Best Practices Competition

IT & Informatics: HPC

Nominating Company | User Company | Project Title
JPR Communications | Amylin | Virtual Data Center
- | Bristol Myers Squibb, Research & Development | High Content Screening ‐ Road
Cycle Computing | Purdue University | DiaGrid
DataDirect Networks, Inc. | Cornell University Center for Advanced Computing | Scalable Research Storage Archive
FalconStor Software | Human Neuroimaging Lab (HNL) – Baylor College of Medicine | Ensuring a more reliable data storage infrastructure at Baylor College of Medicine's HNL
Isilon Systems | Oklahoma Medical Research Foundation | Transition to Nextgen Sequencing and Virtual Data Center
- | National Institute of Allergy and Infectious Diseases (NIAID) | A Centralized and Scalable Infrastructure Approach to Support Next Generation Sequencing at the National Institute of Allergy and Infectious Diseases
Panasas | Uppsala University | UPPNEX
- | TGen, The Translational Genomics Research Institute | NextGen Data Processing Pipeline

2010 Bio IT Award 

1. Nominating Organization 

Organization name: JPR Communications 

Address: 20750 Ventura Blvd., Ste. 350

City: Woodland Hills 

State: CA 

2. Nominating Contact Person 

Name: Judy Smith 

Title: President 

Phone: 818-884-8282

Email: judys@jprcom.com 

3. User Organization 

Organization name: Amylin Pharmaceuticals 

Address: 9360 Towne Centre Drive 

City: San Diego 

State: CA 

Zip: 92121 

4. Contact Person 

Name: Steve Phillpott 

Title: CIO 

Phone: 858-309-7585 

Email: Steve.Phillpott@amylin.com 

5. Project 

Project Title: Amylin Virtual Data Center 

Category: IT and Informatics 

6. Description of project (4 FIGURES MAXIMUM): 

See slide presentation 

A. ABSTRACT/SUMMARY of the project and results (800 characters max.) 

Amylin Pharmaceuticals is a San Diego-based biopharma company focused on providing first-in-class

therapies for diabetes and obesity. Accomplishing Amylin’s mission of “Challenging

Science and Changing Lives” requires tremendous IT capabilities, and the company has a history 

of being an early adopter of technology. 

In 2008, the company’s need for additional technology investment ran headlong into the 

economic realities of the time. Additionally, Amylin began to pursue a more flexible business 

model, emphasizing partnerships and virtualization over doing everything itself. In short, a core 

philosophy became “access tremendous capabilities, without owning those capabilities”. 

Amylin’s CIO, Steve Phillpott, and his IT leadership team applied this new strategy, developing an 

operating model they called the “Amylin Virtual Data Center”, which utilizes detailed service 

costing and cloud and SaaS capabilities to dramatically lower the cost of IT.

B. INTRODUCTION/background/objectives 

Amylin IT set out to move to a flexible technology model that would allow access to world-class IT 

capabilities, without having to operate each of those capabilities. First, the team spent several 

months preparing a detailed cost analysis for every service they provide. This “cost by service”

model included labor, licensing, maintenance, hardware, data center cost and even power

usage for each service and application. It allowed the team to compare more accurately the

cost of delivering services internally versus externally.
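
The “cost by service” comparison can be illustrated with a minimal Python sketch. The cost categories follow the list above; the class name, the function name and every dollar figure are hypothetical and are not taken from Amylin's analysis.

```python
from dataclasses import dataclass


@dataclass
class ServiceCost:
    """Annual internal cost components for one IT service (all values hypothetical)."""
    name: str
    labor: float
    licensing: float
    maintenance: float
    hardware: float
    data_center: float
    power: float

    def total(self) -> float:
        # Fully loaded internal cost: the sum of every component.
        return (self.labor + self.licensing + self.maintenance
                + self.hardware + self.data_center + self.power)


def compare(service: ServiceCost, external_quote: float) -> dict:
    """Compare the fully loaded internal cost with an external (cloud/SaaS) quote."""
    internal = service.total()
    savings = internal - external_quote
    return {
        "service": service.name,
        "internal": internal,
        "external": external_quote,
        "savings": savings,
        "savings_pct": round(100.0 * savings / internal, 1),
        "cheaper_externally": external_quote < internal,
    }


# Hypothetical example service and quote.
email_archive = ServiceCost("email archive", labor=40_000, licensing=25_000,
                            maintenance=10_000, hardware=15_000,
                            data_center=8_000, power=2_000)
result = compare(email_archive, external_quote=60_000)
print(result)
```

Cost is only one input. As the next paragraph notes, security, performance, architectural fit and vendor capability were weighed alongside it.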

The result was a list of IT services or applications provided by Amylin IT, each of which would be 

assessed to determine whether the same service could be provided at lower cost through utility 

services. Besides cost, other factors were also considered, including security, performance, 

architectural appropriateness for the cloud, and vendor capability. Importantly, the team actively 

looked for opportunities where SaaS and the Cloud would work, rather than enumerating all the 

reasons why cloud doesn’t work. 

Amylin built out a “toolkit” of Cloud and SaaS offerings that their IT staff could use to

enable flexible IT. For Infrastructure as a Service (IaaS), they chose Amazon Web Services

(AWS) and vertical offerings built on AWS. For Platform as a Service (PaaS), they use

Force.com. For cloud storage, they started a relationship with Nirvanix. And finally they began a 

deep investigation of Software as a Service (SaaS) capabilities to meet their application needs. 

In each case, internal IT teams would begin pilot projects, have personal “sandboxes”, and get to 

understand these capabilities on a technical level. In the case of Amazon, Force.com, and 

Nirvanix, initial skepticism turned into a positive response, as the capabilities of these tools were 

understood. Getting tools in the hands of technical people was key to gaining their understanding

and buy-in.

C. RESULTS (highlight major R&D/IT tools deployed; innovative uses of technology). 

As Amylin rolled out their cloud initiatives, they first focused on Amazon EC2 to host a limited 

number of application use cases. Amazon EC2 will continue to grow as a hosting platform for 

Amylin, and additional migrations are planned for this year and 2011. 

Amylin has a number of internal legacy applications, often without the internal resources to 

manage or upgrade them. As initial pilot applications were successful, the team is now planning 

to move legacy applications to Force.com. The component reusability and rich platform led 

Amylin developers to determine they could be more productive in such an environment. 

The third focus was storage and disaster recovery capabilities. Rather than building out an in-house

system, Amylin called upon Nirvanix, a cloud storage partner. Amylin server images and

data are now stored in the Nirvanix cloud, meeting compliance requirements and providing disaster

recovery and backup capability for Amylin’s data.

Finally, Amylin invested significant time understanding the wide range of SaaS offerings 

available. Frequently, they discovered that SaaS offerings were more feature-rich and easier to 

use than internally hosted applications. Currently, Amylin uses more than half a dozen SaaS

applications, and migrations to several more are in progress. These include Workday, Microsoft 

Hosted Exchange, LiveOffice, and Saba. 

Amylin used the following tools (cloud services) to meet their business needs: 

Nirvanix: Nirvanix Storage Delivery Network (SDN) for enterprise cloud storage. The project involved

moving critical validated server images that are used for all business and manufacturing applications

and for drug simulation processes such as BLAST, C‐Path and other genomics simulations. Because these

images are critical and frequently used, they are stored on a Tier I storage platform to ensure high

availability and safety. Nirvanix provided better capabilities and an additional level of protection,

as the images are now stored in the cloud and are protected against any data center or localized

infrastructure failures within Amylin. Further, Nirvanix’s “Plug and Play” architecture enabled them to

integrate the “CloudNAS” seamlessly into their environment without overhauling their existing setup.

The new release of the product also ties into their existing NetBackup and CommVault setup,

further simplifying the backup, recovery and e‐discovery process.

Amazon: Amylin leveraged Amazon for their compute infrastructure services (EC2). Several applications 

have been piloted in EC2, and some are now in full production. Additionally, Amylin expects to 

leverage EC2 and Cycle Computing’s CycleCloud for high performance research computing in the coming 

years. 

LiveOffice: Amylin implemented LiveOffice Mail Archive to store all Amylin email archives, for 

compliance purposes. This saved the significant investment in an in‐house email eDiscovery capability, 

and was available to the business much sooner than building a software solution. 

Symplified: Amylin deployed Symplified’s SaaS identity management package. Amylin found that

deploying SaaS and cloud applications increased the problem of user account management and logins, 

and Symplified provided a fast to deploy and affordable solution. 

D. ROI achieved or expected (1000 characters max.): 

The storage cloud strategy reduced costs by approximately 50% or more compared to the Tier I

solution. In many cases, ROI was achieved within a couple of months of

production use. Further, cloud storage enabled Amylin to achieve a significant business

milestone: a basic data DR solution that stores and protects data in the cloud.

E. CONCLUSIONS/implications for the field (800 characters max.) 

Amylin implemented cloud solutions early, did extensive research, and selected some of the 

leaders in the cloud computing market. Starting a new infrastructure is a learning experience and 

Amylin continues to educate itself on recent cloud advancements and test its current plans. 

Amylin is also looking ahead to possibly launching an internal virtualization and private cloud

with VMware, further complementing their current cloud deployments.

With their four layers of the cloud in place, Amylin is in a solid position and can make sound 

selections based upon cost, control, performance and best fit. 

6. REFERENCES/testimonials/supporting internal documents 

See PowerPoint presentation.

Published Resources for the Life Sciences 

250 First Avenue, Suite 300, Needham, MA 02494 | phone: 781‐972‐5400 | fax: 781‐972‐5425 

BMS is submitting a nomination for Best Practices at Bio-IT World 2010.

The category for Best practices is: 

IT & Informatics: LIMS, High Performance Computing, storage, data visualization, 

imaging technologies 

Following is the nomination form that will be filled out online at the

conference website.

_____________________________________________________________ 

Bio-IT World 2010 Best Practices Awards 

Celebrating Excellence in Innovation 

INSTRUCTIONS and ENTRY FORM 

www.bio‐itworld.com/bestpractices 

DEADLINE FOR ENTRY: January 18, 2010 (Updated deadline: February 19, 2010) 

Bio‐IT World is seeking submissions to its 2010 Best Practices Awards. This prestigious awards 

program is designed to recognize outstanding examples of technology and strategic innovation— 

initiatives and collaborations that manifestly improve some facet of the R&D/drug 

development/clinical trial process. 

The awards attract an elite group of life science professionals: executives, entrepreneurs, innovators, 

researchers and clinicians responsible for developing and implementing innovative solutions for 

streamlining the drug development and clinical trial process. All entries will be reviewed and assessed 

by a distinguished peer‐review panel of judges. 

The winners will receive a unique crystal award to be presented at the Best Practices Awards dinner, 

on Wednesday, April 21, 2010, in conjunction with the Bio‐IT World Conference & Expo in Boston. 

Winners and entrants will also be featured in Bio‐IT World. 

INSTRUCTIONS 

1. Review criteria for entry and authorization statement (below).



A. Nominating Organization 

Organization name: Bristol‐Myers Squibb 

Address: 

B. Nominating Contact Person 

Name: Mohammad Shaikh 

Title: Associate Director 

Tel: (609) 818 3480 

Email: mohammad.shaikh@bms.com 

2. User Organization (Organization at which the solution was deployed/applied) 

A. User Organization 

Organization name: Bristol Myers Squibb, Research & Development 

Address: 311 Pennington‐Rocky Hill Road

Pennington, NJ 08534

B. User Organization Contact Person 

Name: Donald Jackson 

Title: Sr. Research Investigator II 

Tel: 609‐818‐5139 

Email: Donald.jackson@bms.com 

3. Project 

Project Title: High Content Screening ‐ Road 

Team Leader: 

Name: James Gill 

Title: Director 

Tel: 203.677.5708 

Email: james.gill@bms.com 

Team members – Michael Lenard, James Scharpf, Russell Towell, Richard Shaginaw, Normand Cloutier 

4. Category in which entry is being submitted (1 category per entry, highlight your choice) 

Basic Research & Biological Research: Disease pathway research, applied and basic research 

Drug Discovery & Development: Compound‐focused research, drug safety 

Clinical Trials & Research: Trial design, eCTD 

Translational Medicine: Feedback loops, predictive technologies 

Personalized Medicine: Responders/non‐responders, biomarkers 

IT & Informatics: LIMS, High Performance Computing, storage, data visualization, imaging 

technologies



Knowledge Management: Data mining, idea/expertise mining, text mining, collaboration, 

resource optimization 

Health‐IT: ePrescribing, RHIOs, EMR/PHR 

Manufacturing & Bioprocessing: Mass production, continuous manufacturing 

(Bio‐IT World reserves the right to re‐categorize submissions based on submission or in the event that 

a category is refined.) 


A. ABSTRACT/SUMMARY of the project and results (150 words max.) 

High-content screening (HCS) data has unique requirements that are not supported by 

traditional high-throughput screening databases. Effective analysis and interpretation of 

the screen data requires the ability to designate separate positive and negative controls for

different measurements in multiplexed assays. 

The fundamental requirements are the ability to capture information on the cell lines, 

fluorescent reagents and treatments in each assay; the ability to store and utilize 

individual-cell and image data; and the ability to support HCS readers and software from 

multiple vendors along with third-party image analysis tools. The system supports target 

identification, lead discovery, lead evaluation and lead profiling activities. 

The solution was designed using a combination of complementary technologies that later

became part of best practices at Bristol-Myers Squibb’s Research Informatics. HCS processes

have generated over 50 TB of image data in five years, and the volume has grown

exponentially. The database and data logistics were built using Oracle (11g) partitioning

techniques; Isilon storage was used to handle unstructured data and EMC storage for relational

data. The application was built using techniques such as external tables, caching, materialized

views and parallel queries, and used the .NET Framework for business rules and visualizations.

Statistical functions in Oracle API libraries were leveraged for analysis.

INTRODUCTION/background/objectives 

High content screening (HCS) has demonstrated utility at multiple points in the drug 

discovery process including target identification, target validation, lead identification, lead 

evaluation and profiling [1], mechanism of action determination [2] and toxicology

assessment [3]. Within a single organization, HCS may be used for multiple purposes with

distinct groups and even instruments supporting different stages of drug discovery. The



scope of HCS projects can range from large-scale compound and RNAi collections tested 

in high-throughput screens to the detailed characterization of small numbers of 

compounds in multiple assays and cell lines. Despite their different roles, each group has 

common needs for data analysis including: deriving numeric measurements from images; 

connecting results with treatments, cell lines and assay readouts; identifying positive and 

negative controls to normalize data; rejecting failed data points; and selecting hits or 

fitting concentration-response curves. Establishing a common framework for HCS data 

allows users from different groups to analyze their results and share best practices and 

algorithms between users and instruments. 

HCS data can be divided into three types: image data, derived data (e.g. single cell 

measurements and well-level summary statistics), and metadata [4]. This last data type

includes both procedural information (e.g., how the images were acquired and analyzed) 

and experimental annotation (what cell lines, fluorescent probes and treatments were 

used). Procedural metadata is captured by most HCS platforms and by open-source 

projects such as the Open Microscopy Environment (OME) [5]. Experimental annotation

metadata is less well supported even though it is essential for the interpretation and 

analysis of HCS results. The Minimum Information About a Cellular Assay (MIACA) 

standard established guidelines for what experimental annotation should be included in 

scientific publications [6] but is not intended for laboratory data management.

HCS data shares many requirements with other types of high-throughput screening data, 

especially from cell-based assays. In particular, the need to capture assay design 

information in a structured and consistent manner is essential for the analysis and 

reporting of experimental results [7]. Other essential components include a reagent registry

(for compounds, RNAi reagents, and other reagent types), a reagent inventory database 

(with information on plate maps), and tools for hit selection and concentration-response 

analysis [8].

Despite the parallels to HTS data, managing and analyzing HCS data presents distinct 

challenges not encountered with other assay platforms, including single-endpoint cell-based

assays. First, HCS is image-based. Access to the underlying images is essential to

troubleshoot problems, confirm and understand results, and communicate results to 

colleagues. Second, HCS produces large amounts of data. For example, a single 384-well 

plate can produce over 2 GB of images and millions of records of derived data [4]; this scale

of data requires support from information technology experts along with mechanisms to 

systematically identify and delete unneeded data. Third, HCS assays often multiplex 

several distinct biological readouts in the same well. This requires the ability to designate 

separate positive and negative controls for different channels or even measurements so



that assay performance and result normalization can generate meaningful values. Fourth, 

multiple vendors produce HCS readers and image analysis packages, along with third-party

analysis packages such as CellProfiler [9]. Results and images must be converted to a

common format so data and analysis tools can be shared between groups. Finally, HCS 

assays are inherently cell-based. Consistent identification of the cell lines, fluorescent 

dyes or antibody conjugates, and fluorescent proteins used in each assay is essential for 

the proper documentation and long-term mining of HCS results. 

To address these requirements we developed HCS Road, a data management system 

specifically designed for HCS. As the name indicates, HCS Road provides a smooth, 

well-defined route from image quantification to data analysis and reporting. The system 

combines an experiment definition tool, a relational database for results storage, assay 

performance reports, data normalization, and analysis capabilities. HCS Road currently 

supports multiple imaging platforms and provides a common repository for HCS data 

across instruments and user groups. In this work, we describe the approaches we took for 

data storage, experimental annotation, and data analysis and the scientific and business 

reasons for those decisions. We also present an XML schema for HCS data that supports

multiple HCS platforms. 

RESULTS (highlight major R&D/IT tools deployed; innovative uses of technology). 

System Architecture 

Figure 1 shows an overview of the architecture of HCS Road. HCS Road currently 

supports three platforms: the Cellomics Arrayscan, the InCell 1000 (GE Healthcare,

Parsippany, NJ), and the Evotec Opera. Images are analyzed with the appropriate software

and the results are collected in platform-specific files or a platform database such as 

Cellomics Store. An HCS Road service converts data to a common XML format for 

import into the HCS Road database. Once the data is loaded into HCS Road it is merged 

with experimental annotation and treatment plate maps. Data import and merging can be 

performed manually or automatically based on previously registered plate barcodes. QC 

metrics and normalized results are calculated automatically and can be reviewed and 

analyzed using the HCS Road client or exported to third-party applications such as TIBCO 

Spotfire (TIBCO Software, Cambridge, MA). 

Users interact with HCS Road through two client applications. The Data Import 

application enables users to select plates for import from the platform-specific data 

repository (Cellomics database, Opera or InCell file share). Multiple plates can be



transferred in parallel for faster import, and well summary results are imported separately 

from cell-level measurements so users can review well-level results more quickly. A web-based

administration tool controls the number of threaded processes and other data import 

settings. Experimental annotation, data mining and visualization are supported by the 

dedicated Data Explorer client application. Data-intensive operations, including data 

extraction and updates, QC and data analysis are implemented on the servers and the 

database to reduce the amount of data transferred from server to client. The Data Explorer 

also allows users to view images for selected wells or as a ‘poster’ of images for an entire 

plate. Images can also be viewed in third-party applications such as TIBCO Spotfire using 

a web page (Fig. 1). In either case, the image conversion server retrieves images from the 

appropriate platform repository and converts them from proprietary formats to standard 

TIFF or JPEG formats as needed. 
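
The parallel plate import described above, with an administrator-controlled number of threads, might look roughly like the following Python sketch. The function names, the barcode values and the use of a thread pool are assumptions for illustration; the actual Data Import application is not shown in this entry.

```python
from concurrent.futures import ThreadPoolExecutor


def import_plate(barcode: str) -> dict:
    """Placeholder for importing one plate's well-level summary results.

    A real importer would read the platform repository here; this
    hypothetical stub only reports that the plate was handled.
    """
    return {"barcode": barcode, "status": "imported"}


def import_plates(barcodes, max_workers=4):
    """Import several plates in parallel, bounded by max_workers threads.

    max_workers stands in for the thread-count setting that the
    administration tool controls.
    """
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # map() returns results in the same order as the input barcodes.
        return list(pool.map(import_plate, barcodes))


results = import_plates(["PLATE001", "PLATE002", "PLATE003"], max_workers=2)
print(results)
```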

IT Tools & Techniques 

The large volumes of data generated by HCS require particular attention to image and data 

storage and management. 

Storage: The HCS Road system provides scalable and extensible storage that is well suited for

managing large numbers of images. The distributed nature of the system means that input 

and output bandwidth grow in parallel with capacity, avoiding a potential bottleneck. 

Images are stored at or near the site where they were acquired (and where they are likely 

to be analyzed or viewed) to reduce network latency issues. This approach reduced 

storage costs while increasing the bandwidth for image transfer. 

After extensive product evaluation, we decided on Isilon Systems clustered network-attached

storage appliances. We deployed these as a file service, exposing several 

Windows networking file shares to the HCS readers, as well as to researcher workstations. 

The key differentiators influencing our decision for the Isilon NAS cluster were a true unified

namespace, robust data protection algorithms, straightforward scalability using building-block

nodes, ease of administration (FreeBSD CLI), and lower-cost SATA disks.

Data Management 

The large number of data records generated by HCS also presents an informatics 

challenge. We store HCS results in Oracle relational databases, as do other HCS users [10].

These databases can become very large, primarily because of cell level data. We observed 

that as the size of our databases grew, performance deteriorated. To address this, we used 

Oracle’s database partitioning capabilities. We focused our efforts on the two largest 

tables in the database, which both contain cell-level data. Our partitioning scheme



exploits the fact that, once written, cell level data is unlikely to change. Partitioning the 

tables in a coordinated fashion provided 10-fold reductions in data load times and 20-fold 

reductions in query times. Historical partitions are accessed in read-only mode which 

helps to protect data integrity and speeds up database backup and recovery. 
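
The idea behind the partitioning scheme can be sketched in Python as a toy model. This is not Oracle code: the partition key (a load period such as '2009-11'), the class and the row fields are all assumptions, since the actual partition key is not stated here. The sketch shows only the two properties described above: a query scans a single partition, and historical partitions reject writes.

```python
from collections import defaultdict


class PartitionedTable:
    """Toy range-partitioned store keyed by a hypothetical load period."""

    def __init__(self):
        self._partitions = defaultdict(list)
        self._read_only = set()

    def insert(self, period: str, row: dict) -> None:
        # Writes to a frozen (historical) partition are refused.
        if period in self._read_only:
            raise PermissionError(f"partition {period} is read-only")
        self._partitions[period].append(row)

    def freeze(self, period: str) -> None:
        """Mark a historical partition read-only."""
        self._read_only.add(period)

    def query(self, period: str, plate: str) -> list:
        """Scan only the partition for `period`, ignoring all others."""
        return [r for r in self._partitions.get(period, [])
                if r["plate"] == plate]


table = PartitionedTable()
table.insert("2009-11", {"plate": "P1", "cell": 1, "intensity": 10.5})
table.insert("2009-12", {"plate": "P2", "cell": 1, "intensity": 7.2})
table.freeze("2009-11")
rows = table.query("2009-11", "P1")
print(rows)
```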

Experimental annotation 

HCS Road captures information on experimental treatments and conditions in a way that 

enables long-term mining of results across assays and users and enforces consistent 

nomenclature for cell lines, detection reagents, and control or experimental treatments. 

Figure 2 shows the workflow for assay definition, treatment selection, and data import and 

analysis. Much of this information is referenced or imported from other databases. Thus, 

HCS Road imports or references treatment information such as compound structures, 

RNAi targets and sequences, and library plate from existing enterprise databases (green 

box in Fig. 2). Similarly, cell line information is linked to an enterprise registry that tracks 

information on source, tissue type, transgenic constructs, passages and other relevant 

information. This reduces the data entry burden on users, reduces errors, and ensures 

consistency within HCS Road and with data from other platforms. Annotation that cannot 

be imported or referenced is stored in the Road database. For example, information on 

fluorescent probes including probe name, vendor and catalog number, fluorescent 

characteristics and molecular or cellular targets is stored within HCS Road in a way that 

supports re-use across multiple assays. 

The creation of a new assay begins with the selection of the cell line(s) and fluorescent 

probes used in an experiment (yellow box in Fig. 2). Control and reference compounds 

can be selected from the reagent registry or entered manually (as for commercially 

purchased reagents). Business metadata is also collected to enable reports of results 

across multiple assays and to support data retention decisions. Next, one or more ‘master’ 

plates are created with information on cell seeding along with locations and concentrations 

of control and reference treatments and fluorescent probes. HCS Road supports multiple 

plate layouts including 96, 384 and 1536-well; additional custom layouts can be quickly 

defined as needed. Finally, multiple copies of this master plate are created to correspond 

to the physical plates in the assay. Reagents tested in the assay can be entered manually 

(as during assay development) or automatically from existing reagent databases (green 

box in Fig. 2). Assays and plates can also be copied to streamline small changes to 

experimental designs or plate layouts. 

The last step in experimental annotation is the assignment of positive and negative control 

treatments (blue box in Fig. 2). Different treatments can be designated as positive and 

negative controls for different measurements. This provides the flexibility needed to



support multiplexed, multi-parameter HCS assays and provide meaningful performance 

metrics and normalized results. Control status is assigned to treatments (or treatment 

combinations) rather than to well locations. Any wells that receive the control 

treatment(s) become controls for the specified measurement(s). This reduces the amount 

of data users must enter, allows a single analysis protocol to support multiple plate layouts 

(for example, in screening multiple existing reagent collections with different layouts), 

and facilitates the re-use of assay definitions. 
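
Assigning control status to treatments rather than to well locations can be sketched in Python as follows. The data structures, the treatment names and the measurement names are hypothetical and do not reflect the HCS Road schema.

```python
def control_wells(plate_map, control_assignments):
    """Derive control wells per measurement from treatment-level assignments.

    plate_map: well position -> treatment name
    control_assignments: measurement -> {"positive": treatment,
                                         "negative": treatment}
    Returns measurement -> role -> sorted list of wells.
    """
    result = {}
    for measurement, roles in control_assignments.items():
        result[measurement] = {
            role: sorted(w for w, t in plate_map.items() if t == treatment)
            for role, treatment in roles.items()
        }
    return result


# Hypothetical plate layout and control definitions.
plate_map = {
    "A1": "DMSO", "A2": "staurosporine", "A3": "nocodazole",
    "A4": "compound-1", "B1": "DMSO",
}
assignments = {
    "cell_count": {"positive": "staurosporine", "negative": "DMSO"},
    "mitotic_index": {"positive": "nocodazole", "negative": "DMSO"},
}
controls = control_wells(plate_map, assignments)
print(controls)
```

Because the lookup is driven by treatment, the same `assignments` dictionary works unchanged for any plate layout that contains those treatments.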

Data loading and analysis 

Once images have been collected and analyzed, the results are loaded into HCS Road for

analysis (pink box in Fig. 2). Images and numeric results are imported from platform 

repositories using a dedicated, internally developed application. Data can be loaded 

automatically using pre-defined criteria or selected manually after image acquisition and 

analysis are complete. Multiple sets of images and results can be loaded for a single assay 

plate to support kinetic imaging and re-imaging or re-analysis of plates using different 

objectives, filters or analysis algorithms. Results are associated with assay plates 

manually or using barcodes on the assay plates. 

HCS Road calculates multiple quality control metrics and provides tools for rejecting 

failed wells or plates. In addition to the Z’ metric of Zhang et al. [11], the plate mean,

median, standard deviation, minimum and maximum are reported for negative control, 

positive control and sample wells for each plate in a run. Users can review individual 

plates and may choose to reject all measurements from a well or only reject selected 

measurements. The ability to selectively reject measurements is necessary because of the 

multi-parameter nature of HCS assays. For example, a treatment may reduce cell count in 

a multiplexed assay; this is a legitimate result but measurements in other channels may not 

be reliable. 
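
For reference, the Z’ factor is commonly written as Z’ = 1 − 3(σp + σn) / |μp − μn|, where p and n denote the positive and negative control wells. A minimal Python sketch of the per-plate QC statistics follows. The function names and sample values are illustrative, and the use of the sample standard deviation is an assumption.

```python
from statistics import mean, median, stdev


def z_prime(positive, negative):
    """Z' factor: 1 - 3 * (sd_pos + sd_neg) / |mean_pos - mean_neg|."""
    separation = abs(mean(positive) - mean(negative))
    if separation == 0:
        raise ValueError("control means are identical; Z' is undefined")
    return 1.0 - 3.0 * (stdev(positive) + stdev(negative)) / separation


def summarize(values):
    """Plate-level summary statistics for one group of wells."""
    return {
        "mean": mean(values),
        "median": median(values),
        "sd": stdev(values),
        "min": min(values),
        "max": max(values),
    }


# Illustrative control-well values for one measurement on one plate.
positive = [100.0, 102.0, 98.0, 100.0]
negative = [10.0, 12.0, 8.0, 10.0]
zp = z_prime(positive, negative)
stats = summarize(negative)
print(round(zp, 3), stats)
```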

Data analysis 

Commonly used analyses are implemented as fixed workflows within the HCS Road Data 

Explorer application. HCS Road automatically performs multiple normalizations when 

data is loaded. The calculations include percent control, percent inhibition, signal to 

background and z-score [12]. The first analysis we implemented was concentration-response

curve fitting. Curves are fit using a 4-parameter logistic regression with XLFIT equation

205 (ID Business Solutions, Guildford, UK). A graphic view shows the fit line and data

points for an individual compound. Data points are linked to the corresponding images so 

users can review the images for a well and choose to reject it and recalculate the fit. The 

resulting IC50 values were consistent with those produced by our existing HTS analysis 

tools (not shown).
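
The normalizations and the curve model named above can be sketched in Python using commonly used definitions. The exact formulas implemented in HCS Road are not given in this entry, so each definition below is an assumption, as is the particular 4-parameter logistic parameterization (whether it matches the XLFIT equation exactly has not been verified).

```python
from statistics import mean, stdev


def percent_of_control(value, negative):
    """Value as a percentage of the mean negative-control signal."""
    return 100.0 * value / mean(negative)


def percent_inhibition(value, negative, positive):
    """0% at the negative-control mean, 100% at the positive-control mean."""
    neg, pos = mean(negative), mean(positive)
    return 100.0 * (neg - value) / (neg - pos)


def signal_to_background(signal, background):
    """Ratio of mean signal wells to mean background wells."""
    return mean(signal) / mean(background)


def z_score(value, samples):
    """Distance from the sample mean in units of sample standard deviation."""
    return (value - mean(samples)) / stdev(samples)


def four_pl(x, a, b, c, d):
    """One common 4-parameter logistic form: a + (b - a) / (1 + (c / x) ** d).

    c is the concentration giving the half-maximal response.
    """
    return a + (b - a) / (1.0 + (c / x) ** d)


negative = [100.0, 110.0, 90.0]   # illustrative control values
positive = [10.0, 12.0, 8.0]
print(percent_of_control(55.0, negative))
print(percent_inhibition(55.0, negative, positive))
print(signal_to_background(negative, positive))
print(z_score(55.0, [40.0, 50.0, 60.0]))
print(four_pl(1.0, a=0.0, b=100.0, c=1.0, d=1.0))
```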



We also identified a need to export results and annotation from HCS Road to third party 

applications so researchers can perform calculations and generate visualizations that are 

not part of a common workflow. We use TIBCO Spotfire for many of our external 

visualizations because: it can retrieve data directly from the HCS Road database; it 

supports multiple user-configurable visualizations; it provides tools for filtering and 

annotating data, and it can perform additional analyses using internal calculations or by 

communicating with Accelrys Pipeline Pilot (San Diego, CA). Figure 3 shows a Spotfire

visualization for analyzing RNAi screening results. This workflow retrieves results and 

treatment information from the HCS Road database. The user is presented with 

information on the distribution of normalized values for each endpoint and can select 

wells that pass the desired activity threshold. Additional panels identify RNAi reagents 

where multiple replicate wells pass the threshold and genes where multiple different RNAi 

reagents scored as hits, an analysis that is unique to RNAi screening. Within Spotfire, 

HCS assay results can be cross-referenced with other information such as mRNA 

expression profiling to identify RNAi reagents whose phenotype correlates with levels of 

target expression in the assay cell line (not shown). 
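
The two-level RNAi hit selection described above (reagents with multiple passing replicate wells, then genes with multiple hit reagents) can be sketched in Python. The gene and reagent names, the threshold direction and the minimum counts are assumptions for illustration; the actual analysis is a Spotfire workflow.

```python
from collections import defaultdict


def rnai_hits(wells, threshold, min_replicates=2, min_reagents=2):
    """Select reagent-level and gene-level hits.

    wells: iterable of (gene, reagent, normalized_value) tuples.
    A well passes when its value is >= threshold (assumed direction).
    """
    passing = defaultdict(int)   # reagent -> number of passing replicate wells
    gene_of = {}                 # reagent -> targeted gene
    for gene, reagent, value in wells:
        gene_of[reagent] = gene
        if value >= threshold:
            passing[reagent] += 1

    reagent_hits = sorted(r for r, n in passing.items() if n >= min_replicates)

    per_gene = defaultdict(set)
    for reagent in reagent_hits:
        per_gene[gene_of[reagent]].add(reagent)
    gene_hits = sorted(g for g, rs in per_gene.items()
                       if len(rs) >= min_reagents)
    return reagent_hits, gene_hits


# Hypothetical normalized results: two reagents per gene, two wells each.
wells = [
    ("PLK1", "siPLK1_a", 80.0), ("PLK1", "siPLK1_a", 75.0),
    ("PLK1", "siPLK1_b", 90.0), ("PLK1", "siPLK1_b", 85.0),
    ("KIF11", "siKIF11_a", 70.0), ("KIF11", "siKIF11_a", 72.0),
    ("KIF11", "siKIF11_b", 20.0), ("KIF11", "siKIF11_b", 65.0),
]
reagent_hits, gene_hits = rnai_hits(wells, threshold=60.0)
print(reagent_hits, gene_hits)
```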

Cell-level data 

Managing and analyzing cell-level data was a high priority in the development of HCS 

Road. Cell level data enables the analysis of correlations between measurements at the 

cellular level, the use of alternative data reduction algorithms such as the Kolmogorov-Smirnov

distance [13, 14], classification of subpopulations by cell cycle phase [15], and other

approaches beyond basic well-level statistics [16]. However, the volume of cell data in an

HCS experiment can be very large. Storing cell data as one row per measurement per cell 

creates a table with large numbers of records and slows down data loading and retrieval. 

Because cell data is typically used on a per-plate/feature basis for automated analyses and 

for manual inspection, we chose to store it in files on the HCS Road file share (Fig. 1) 

rather than in the database. When cell data is needed, it is automatically imported into a 

database table using Oracle’s bulk data loading tools. When the cell measurements are no 

longer needed the records are deleted from the Road database (but are still retained in 

files). This controls database growth and improves performance compared to retaining 

large numbers of records in the database. 
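
As an illustration of the kind of cell-level data reduction mentioned above, the following minimal sketch computes a two-sample Kolmogorov-Smirnov distance between the per-cell values of a treated well and a control well. It is not taken from HCS Road: the function name and the sample intensities are hypothetical, and only the Python standard library is used.

```python
from bisect import bisect_right


def ks_distance(sample_a, sample_b):
    """Two-sample Kolmogorov-Smirnov distance: the largest absolute gap
    between the empirical cumulative distribution functions (ECDFs)."""
    if not sample_a or not sample_b:
        raise ValueError("both samples must contain at least one cell")
    a = sorted(sample_a)
    b = sorted(sample_b)
    max_gap = 0.0
    # The ECDFs only change at observed values, so checking those is enough.
    for x in a + b:
        cdf_a = bisect_right(a, x) / len(a)
        cdf_b = bisect_right(b, x) / len(b)
        max_gap = max(max_gap, abs(cdf_a - cdf_b))
    return max_gap


# Hypothetical per-cell intensities for one control well and one treated well.
control_cells = [10.1, 11.4, 9.8, 10.7, 10.2, 11.0]
treated_cells = [14.9, 15.3, 13.8, 10.5, 16.1, 15.0]
print(ks_distance(control_cells, treated_cells))
```

A distance near 0 means the two cell populations are distributed alike; a distance near 1 means they barely overlap. Unlike a well-level mean, this value is sensitive to shifts in only part of the population.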

ROI achieved: 

HCS Road currently supports target identification, lead identification and lead profiling 

efforts across multiple groups within BMS Applied Biotechnology. Scientists can analyze 

their experiments more rapidly and the time needed to load, annotate and review 

experiments has been reduced from days to hours. Integration with existing databases



reduces the amount of data users must enter, reduces errors and facilitates integration with 

results from other assay platforms. HCS Road enables new types of experiments that were 

not supported by our previous data management tools, including 1536-well HCS assays 

and cell cycle analysis based on DNA content measures for individual cells. HCS Road 

provides a single source for data from Cellomics ArrayScan, GE InCell and Evotec Opera 

instruments. Finally, HCS Road facilitates the sharing of assays and analysis tools 

between groups. Users can review assay data from other groups, determine whether a cell 

line or fluorescent probe has been used before, and see how a hit from their assay 

performed in previous experiments. 

The data management solutions we implemented allow us to handle the large volumes of 

data that HCS generates. Database partitioning reduces backup times and improves query 

performance; network attached storage systems enable the storage and management of 

large numbers of images; and the use of file-based storage with transient database loading 

for cell-level data allows us to analyze this unique result type while minimizing database 

size. 
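
The file-based storage with transient database loading described above can be sketched as follows. This is an illustration under stated assumptions, not the HCS Road implementation: SQLite stands in for Oracle and its bulk loading tools, and the table name, column names and file name are hypothetical.

```python
import csv
import os
import sqlite3
import tempfile


def load_cell_file(conn, path, plate_id):
    """Load one plate's cell-level file into the transient table."""
    with open(path, newline="") as handle:
        rows = [(plate_id, r["well"], r["feature"], float(r["value"]))
                for r in csv.DictReader(handle)]
    conn.executemany("INSERT INTO cell_data VALUES (?, ?, ?, ?)", rows)
    conn.commit()
    return len(rows)


def purge_plate(conn, plate_id):
    """Delete the transient rows; the file on the share is left untouched."""
    conn.execute("DELETE FROM cell_data WHERE plate_id = ?", (plate_id,))
    conn.commit()


# Write a small hypothetical cell-level file, standing in for the file share.
workdir = tempfile.mkdtemp()
cell_file = os.path.join(workdir, "plate_0001_cells.csv")
with open(cell_file, "w", newline="") as handle:
    writer = csv.writer(handle)
    writer.writerow(["well", "feature", "value"])
    writer.writerows([["A01", "dna_content", 2.1],
                      ["A01", "dna_content", 3.9],
                      ["A02", "dna_content", 2.0]])

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE cell_data (plate_id TEXT, well TEXT, feature TEXT, value REAL)")

# Load on demand, analyze, then purge so the table does not keep growing.
loaded = load_cell_file(conn, cell_file, "plate_0001")
mean_a01 = conn.execute(
    "SELECT AVG(value) FROM cell_data WHERE plate_id = ? AND well = ?",
    ("plate_0001", "A01")).fetchone()[0]
purge_plate(conn, "plate_0001")
remaining = conn.execute("SELECT COUNT(*) FROM cell_data").fetchone()[0]
print(loaded, mean_a01, remaining, os.path.exists(cell_file))
```

The design choice is that the database holds cell records only while an analysis needs them, and the files remain the long-term copy.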

CONCLUSIONS. 

Successfully developing an enterprise-wide data management system for HCS results 

presents challenges. The diversity of instruments, users and projects raises the question of 

whether it is better to develop isolated systems tailored to the requirements of a single 

group or instrument type. We concluded that the benefits of an integrated system were 

worth the effort required. HCS Road currently supports multiple imaging platforms and 

research groups and provides a single point of access for results and experimental 

annotation. It facilitates the sharing of assays and data analysis methods between groups 

and provides a rich and structured model for annotating cell-based assays. 

We chose to develop our own system for HCS data management so that we could 

accommodate our needs and workflows and could integrate it with other enterprise 

databases. A consequence of this integration is that no two organizations' solutions will 

look exactly the same. Large organizations will wish to accommodate their existing 

workflows and databases whereas smaller organizations may need to implement some of 

those functions within their HCS data management system. We believe that the 

requirements and solutions we identified will be informative to other HCS users looking to 

develop or purchase their own data management solution. 

The system was built using technologies from multiple vendors, who made several updates to 

their architectures to optimize the performance and reliability of the solution. The 

partitioning techniques first deployed at BMS for this application were later adopted and 

standardized by Cellomics. 

BMS was one of the first in the pharmaceutical industry to use Isilon storage for managing 

structured as well as unstructured lab data. Isilon Systems incorporated several 

suggestions from the BMS design team into its firmware and architecture, which benefited many 

other use cases. At BMS, use of Isilon storage was later extended to manage neuroscience 

video files, mass spectrometry raw and result files, NMR data, bright-field images, HPLC 

LIMS contents, Non-chrome LIMS contents and Oracle recovery files generated by 

RMAN and Flash recovery systems.



[Figure 1 diagram. Labels shown: instruments and platform data repositories (ArrayScan, Cellomics Store database, image share; Opera, image + data share; InCell 1000, image + data share); reagent and cell line registries; enterprise results repository; HCS Road (database, file share, services, image conversion); HCS Road Data Explorer; third-party tools (TIBCO Spotfire).] 

FIG. 1. Overview of HCS Road components showing data flow from HCS instruments through 

the HCS Road database and file share to data analysis and visualization tools. Blue icons 

designate instrument-specific databases and file shares. Green arrows and green box indicate 

HCS Road components. Gray arrows indicate data import or export to existing enterprise 

databases or third-party analysis tools.



[Figure 2 diagram. Workflow steps by functional group: 

Library Definition (external): register reagents; define plate maps; register barcodes. 

Assay Definition: define or select cell line; select fluorescent probes; enter business metadata (client group, program); define or select additional compounds; define master plate layout (cell line(s), seeding density, probes, control/reference treatments); create assay plates. 

Analysis Definition: select measurements for analysis and designate control treatments for measurement (well-level, cell-level). 

Data Loading & Analysis: images and data from HCS reader/software; create imaged plates; calculate QC metrics (Z', mean, CV); review results; reject outliers; analyze data; publish results to enterprise results database.] 

FIG. 2. Workflow for experiment definition, data import and analysis. White boxes show 

workflow steps and colored boxes indicate functional subsets of the process. Black arrows 

indicate workflow progression and dependencies between steps.
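
The QC metrics named in the workflow (Z', mean and CV) can be computed from control wells as in the sketch below. Z' follows the definition of Zhang et al. (reference 11). The function names and control values are hypothetical, and the use of the sample standard deviation is an assumption, not a statement of how HCS Road calculates these values.

```python
from statistics import mean, stdev


def z_prime(positive_controls, negative_controls):
    """Z' = 1 - 3 * (sd_pos + sd_neg) / |mean_pos - mean_neg|."""
    mu_p, mu_n = mean(positive_controls), mean(negative_controls)
    sd_p, sd_n = stdev(positive_controls), stdev(negative_controls)
    if mu_p == mu_n:
        raise ValueError("control means are identical; Z' is undefined")
    return 1.0 - 3.0 * (sd_p + sd_n) / abs(mu_p - mu_n)


def coefficient_of_variation(values):
    """CV as a percentage of the mean."""
    mu = mean(values)
    if mu == 0:
        raise ValueError("mean is zero; CV is undefined")
    return 100.0 * stdev(values) / mu


# Hypothetical percent-inhibition values for control wells on one plate.
pos_wells = [98.0, 100.0, 102.0]
neg_wells = [-2.0, 0.0, 2.0]
print(z_prime(pos_wells, neg_wells), coefficient_of_variation(pos_wells))
```

A Z' close to 1 indicates well-separated controls with little spread, which is why it is a common plate-level acceptance criterion.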



[Figure 3 image: TIBCO Spotfire screenshot. Panel titles: Available Measurements; Normalized data distribution (Value PctInh for Runs 47, 48 and 55, colored by status NEG, POS or SAMPLE); Normalized data statistics; Wells per treatment; Normalized results for hits; Treatments per gene; Number of hits (338 wells, 61 treatments, 20 genes).] 

Figure 3: TIBCO Spotfire workflow for hit selection from RNAi screens from HCS Road 

showing: (top left) table of available measurements; (top center) histograms of cell count percent 

inhibition for control and library wells across multiple runs; (top right) table of summary 

statistics for normalized cell count for control and library reagents; (middle left) bar chart of 

numbers of wells per RNAi reagent with normalized values above a user-defined threshold (blue 

shading indicates hit reagents where at least 4 of 6 replicate wells passed the threshold); (bottom 

left) bar chart of numbers of individual RNAi reagents per gene where 4 or more replicate wells 

passed the normalized value threshold (red shading indicates hit genes where 3 or more 

independent RNAi reagents for the same gene were selected as hits); (middle right) table of 

median cell count percent inhibition values for all hit genes; (bottom right) numbers of wells, 

RNAi reagents and genes selected as hits.
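
The two-stage hit selection described in the caption can be sketched as follows: a reagent is a hit when at least 4 of its replicate wells exceed the threshold, and a gene is a hit when 3 or more independent reagents for it are hits. The function name, data layout and sample values are hypothetical; the actual analysis is performed in Spotfire.

```python
from collections import defaultdict


def select_hits(wells, threshold, min_wells=4, min_reagents=3):
    """wells: iterable of (gene, reagent, normalized_value) tuples.
    Returns (hit_reagents, hit_genes)."""
    passing = defaultdict(int)
    for gene, reagent, value in wells:
        if value > threshold:
            passing[(gene, reagent)] += 1
    # Stage 1: reagents with enough replicate wells above the threshold.
    hit_reagents = {key for key, count in passing.items() if count >= min_wells}
    # Stage 2: genes supported by enough independent hit reagents.
    reagents_per_gene = defaultdict(set)
    for gene, reagent in hit_reagents:
        reagents_per_gene[gene].add(reagent)
    hit_genes = {gene for gene, reagents in reagents_per_gene.items()
                 if len(reagents) >= min_reagents}
    return hit_reagents, hit_genes


def replicate_wells(gene, reagent, values):
    return [(gene, reagent, v) for v in values]


# Hypothetical percent-inhibition values, six replicate wells per reagent.
wells = (
    replicate_wells("GeneA", 1, [80, 85, 90, 75, 88, 20])
    + replicate_wells("GeneA", 2, [72, 91, 83, 79, 30, 95])
    + replicate_wells("GeneA", 3, [77, 81, 10, 86, 92, 74])
    + replicate_wells("GeneB", 1, [90, 85, 15, 71, 78, 5])
    + replicate_wells("GeneB", 2, [95, 12, 30, 82, 8, 25])
)
hit_reagents, hit_genes = select_hits(wells, threshold=70)
print(sorted(hit_reagents), sorted(hit_genes))
```

Requiring several independent reagents per gene guards against calling a gene a hit on the basis of a single reagent.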



7. REFERENCES/testimonials/supporting internal documents (If necessary; 5 pages max.) 

1. Agler M, Prack M, Zhu Y, Kolb J, Nowak K, Ryseck R, Shen D, Cvijic ME, Somerville J, Nadler S, Chen 

T: A high-content glucocorticoid receptor translocation assay for compound mechanism-of-action 

evaluation. J Biomol Screen 2007; 12:1029-1041. 

2. Ross-Macdonald P, de Silva H, Guo Q, Xiao H, Hung CY, Penhallow B, Markwalder J, He L, Attar RM, 

Lin TA, Seitz S, Tilford C, Wardwell-Swanson J, Jackson D: Identification of a nonkinase target mediating 

cytotoxicity of novel kinase inhibitors. Molecular cancer therapeutics 2008; 7:3490-3498. 

3. Zock JM: Applications of high content screening in life science research. Combinatorial chemistry & high 

throughput screening 2009; 12:870-876. 

4. Dunlay RT, Czekalski WJ, Collins MA: Overview of informatics for high content screening. Methods in 

molecular biology (Clifton, NJ) 2007; 356:269-280. 

5. Goldberg IG, Allan C, Burel JM, Creager D, Falconi A, Hochheiser H, Johnston J, Mellen J, Sorger PK, 

Swedlow JR: The Open Microscopy Environment (OME) Data Model and XML file: open tools for 

informatics and quantitative analysis in biological imaging. Genome biology 2005; 6:R47. 

6. MIACA Draft Specification. Retrieved from http://cdnetworks-us-2.dl.sourceforge.net/project/miaca/Documentation/MIACA_080404/MIACA_080404.pdf. 

7. Palmer M, Kremer A, Terstappen GC: A primer on screening data management. J Biomol Screen 2009; 

14:999-1007. 

8. Ling XB: High throughput screening informatics. Combinatorial chemistry & high throughput screening 

2008; 11:249-257. 

9. Carpenter AE, Jones TR, Lamprecht MR, Clarke C, Kang IH, Friman O, Guertin DA, Chang JH, Lindquist 

RA, Moffat J, Golland P, Sabatini DM: CellProfiler: image analysis software for identifying and quantifying 

cell phenotypes. Genome biology 2006; 7:R100. 

10. Garfinkel LS: Large-scale data management for high content screening. Methods in molecular biology 

(Clifton, NJ) 2007; 356:281-291. 

11. Zhang JH, Chung TD, Oldenburg KR: A Simple Statistical Parameter for Use in Evaluation and Validation 

of High Throughput Screening Assays. J Biomol Screen 1999; 4:67-73. 

12. Malo N, Hanley JA, Cerquozzi S, Pelletier J, Nadon R: Statistical practice in high-throughput screening data 

analysis. Nature biotechnology 2006; 24:167-175. 

13. Giuliano KA, Chen YT, Taylor DL: High-content screening with siRNA optimizes a cell biological 

approach to drug discovery: defining the role of P53 activation in the cellular response to anticancer drugs. J 

Biomol Screen 2004; 9:557-568. 

14. Perlman ZE, Slack MD, Feng Y, Mitchison TJ, Wu LF, Altschuler SJ: Multidimensional drug profiling by 

automated microscopy. Science (New York, NY) 2004; 306:1194-1198. 

15. Low J, Huang S, Blosser W, Dowless M, Burch J, Neubauer B, Stancato L: High-content imaging 

characterization of cell cycle therapeutics through in vitro and in vivo subpopulation analysis. Molecular 

cancer therapeutics 2008; 7:2455-2463. 

16. Collins MA: Generating 'omic knowledge': the role of informatics in high content screening. Combinatorial 

chemistry & high throughput screening 2009; 12:917-925.

Bio‐IT World 2010 Best Practices Awards 

Nominating Organization name: Cycle Computing 

Nominating Organization address: 456 Main Street 

Nominating Organization city: Wethersfield 

Nominating Organization state: CT 

Nominating Organization zip: 06109 

Nominating Contact Person: Ashleigh Egan 

Nominating Contact Person Title: Account Executive, Articulate 

Communications 

Nominating Contact Person Phone: 212‐255‐0080 x12 

Nominating Contact Person Email: aegan@articulatepr.com 

User Organization name: Purdue University 

User Organization address: 504 Northwestern Ave. 

User Organization city: West Lafayette 

User Organization state: IN 

User Organization zip: 47907 

User Organization Contact Person: John Campbell 

User Organization Contact Person Title: Associate Vice President of 

Information Technology 

User Organization Contact Person Phone: 212‐255‐0080 x12 

User Organization Contact Person Email: aegan@articulatepr.com 

Project Title: 

DiaGrid 

Team Leaders name: 

Team Leaders title: 

Team Leaders Company: 

Team Leaders Contact Info: 

Team Members name: 

Team Members title: 

Team Members Company: 

Entry Category: 

IT & Informatics 

Abstract Summary: 

Introduction: The demand for computational power at Purdue for 

scientific, quantitative and engineering research was rapidly outpacing the 

budget for new space, power and servers to run them. At the same time, most 

machines across campuses, enterprises or government agencies are used less 

than half of the time. The challenge was to harness these unused computational 

cycles for multiple colleges/departments while building a framework that 

maintains scalability, management and ease of use. 

Purdue wanted to build a grid of idle campus computers/servers and provide the 

computational capacity to researchers throughout the nation. By collaborating 

with several other campuses, including Indiana University, University of Notre 

Dame (Ind.), Indiana State University, Purdue’s Calumet and North Central 

campuses and Indiana University‐Purdue University Fort Wayne, Purdue was able to 

increase the total capacity to more than 177 teraflops – the equivalent of a $3 

million supercomputer requiring several thousand square feet of datacenter space.

Results: Purdue selected the free, open‐source Condor distributed 

computing system developed by the University of Wisconsin and the CycleServer 

compute management tool from Cycle Computing. Computers in the pool run client 

software that efficiently and securely connects them to front-end servers, to which 

jobs are submitted and then parceled out to pool machines when those machines are idle. In this 

way, tens of thousands of processors can be brought to bear on problems from 

various researchers. The work is automatically reshuffled when the owner of a 

machine needs it. Using Condor’s flexible policy features, technical staff can 

control when and how their machines are used (on idle, evenings only, etc.). 

Today, with more than 28,000 processors, DiaGrid offers more than two million 

compute hours per month. The research clusters within the DiaGrid pool average 

about 1‐2 percent idle – providing one of the highest utilization levels. 
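
The owner policies mentioned above (run only when idle, run evenings only) can be pictured with the sketch below. It is written in Python purely for illustration: Condor expresses such rules in its own configuration language, and the class, field names and default values here are hypothetical, not Condor or CycleServer settings.

```python
from dataclasses import dataclass


@dataclass
class OwnerPolicy:
    """Hypothetical per-machine policy chosen by the machine's owner."""
    min_idle_minutes: int = 15      # required keyboard/mouse idle time
    evenings_only: bool = False     # restrict grid jobs to evening hours
    evening_start_hour: int = 18    # 6 pm
    evening_end_hour: int = 6       # 6 am


def may_start_job(policy, idle_minutes, hour):
    """Return True when a grid job may start on this machine."""
    if idle_minutes < policy.min_idle_minutes:
        return False
    if policy.evenings_only:
        return hour >= policy.evening_start_hour or hour < policy.evening_end_hour
    return True


def must_vacate(owner_active):
    """Work is moved elsewhere as soon as the owner needs the machine."""
    return owner_active


lab_desktop = OwnerPolicy(evenings_only=True)
print(may_start_job(lab_desktop, idle_minutes=30, hour=10))
print(may_start_job(lab_desktop, idle_minutes=30, hour=22))
```

Keeping the policy with the machine owner, rather than with the grid operator, is what lets desktops and lab machines join the pool without disrupting their primary users.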

Purdue was able to: 

• Squeeze every bit of performance out of each hardware dollar already 

spent. Desktop machines are continually providing computational cycles during 

off hours and the research clusters average only 1‐2 percent idle. 

• Avoid purchasing additional computational capacity by harvesting more 

than 177 Teraflops, for two million compute hours a month using hardware it 

already owns. Purchasing equivalent cycles would cost more than $3 million. 

• Build installation packages that easily pull information from the 

CycleServer centralized management tool. 

• Achieve something no one has tried before: pooling the variety of 

hardware represented in DiaGrid, including computers in campus computer labs, 

offices, server rooms and high‐performance research computing clusters running a 

variety of operating systems. 

• Easily manage policy configuration information with CycleServer, using 

repeated templates for machines across various pools of resources with more than 

28,000 processors – and a goal of eventually hitting 120,000 processors across 

many universities. 

• Put owners' policies in place for when machines could run calculations. 

• Get status, reporting and management capabilities across pools of 

resources on many campuses. 

• Enable creative uses of computation. For example, DiaGrid is used in 

creating a virtual pharmacy clean room for training student pharmacists; 

rendering fly‐through animation of a proposed satellite city to serve as a refuge 

for Istanbul, Turkey, in the event of a catastrophic earthquake; and animating 

scenes for “Nano Factor,” a game designed for junior-high-aged kids interested 

in science and engineering. 

ROI achieved: 

Conclusions: 

References:


Nominating Organization name: DataDirect Networks, Inc. 

Nominating Organization address: 9351 Deering Avenue 

Nominating Organization city: Chatsworth 

Nominating Organization state: CA 


Nominating Contact Person: Jeffrey Denworth 

Nominating Contact Person Title: VP, Marketing 

Nominating Contact Person Phone: 1‐856‐383‐8849 

Nominating Contact Person Email: jeffdenworth@hotmail.com 

User Organization name: Cornell University Center for Advanced 

Computing 

User Organization address: 512 Frank H. T. Rhodes Hall 

User Organization city: Ithaca 

User Organization state: NY 


User Organization Contact Person: David A. Lifka, PhD 

User Organization Contact Person Title: Director, Cornell University 

Center for Advanced Computing 

User Organization Contact Person Phone: 607‐254‐8621 

User Organization Contact Person Email: lifka@cac.cornell.edu 


Project Title: Scalable Research Storage Archive 

Team Leaders name: 

Team Leaders title: 

Team Leaders Company: 

Team Leaders Contact Info: 

Team Members name: Dr. Jaroslaw Pillardy 

Team Members title: Sr. Researcher at Cornell’s Computational Biology 

Service Unit 

Team Members Company: Cornell University 




Introduction: The Cornell Center for Advanced Computing (CAC) is a 

leader in high‐performance computing system, application, and data solutions that 

enable research success. As an early technology adopter and rapid prototyper, CAC 

helps researchers accelerate scientific discovery. 

Located on the Ithaca, New York campus of Cornell University, CAC serves faculty 

and industry researchers from dozens of disciplines, including biology, 

behavioral and social sciences, computer science, engineering, geosciences, 

mathematics, physical sciences, and business. 

The center operates Linux, Windows, and Mac‐based HPC clusters and the staff 

provides expertise in HPC systems and storage; application porting, tuning, and 

optimization; computer programming; database systems; data analysis and workflow 

management; Web portal design, and visualization.

CAC network connectivity includes the national NSF TeraGrid and New York State 

Grid. 

The DataDirect Networks S2A9700 storage system is used as the central storage 

platform for a number of departments and applications. Initially deployed for 

backup and archival storage, the S2A9700 is increasingly used by CAC as front-line 

storage for applications such as genome sequencing. 

Since CAC provides services to a wide range of Cornell departments and 

applications, implementing centralized storage platforms is critical in ensuring 

an efficient, reliable and cost‐effective infrastructure. 

Cornell researchers were considering buying commodity, off‐the‐shelf storage 

solutions to locally store their research data. While the cost of such technology 

initially appeared low, the lack of coordination, data protection and system 

reliability detracted from the long-term value of this approach. Because research 

productivity and access to data are directly correlated, the primary focus of 

the storage solution had to be high reliability and scalability. 

It was clear that an affordable, centrally managed, highly available research 

storage system was needed in order to control costs and also to ensure that 

researchers remained productive. Accommodating a variety of applications and 

departments would prove a challenge for ordinary storage systems, but the DDN 

S2A9700 proved capable even beyond the initial scope of the project. 

Results: The center selected an S2A9700 storage system from DDN with 

40TB unformatted capacity in RAID-6 configurations. DDN partnered with Ocarina 

Networks to provide transparent, content‐aware storage optimization at CAC, 

reducing the overall capacity need by more than 50 percent. For some Microsoft 

SQL database applications, a compression rate of up to 82 percent was achieved. 

DDN storage technology enables massive scalability and capacity optimization 

through storage collaboration. Compared to other storage technologies in its 

class, the S2A9700 features industry-leading throughput (over 2.5GB/s per 

system), capacity (scalable up to 2.4 petabytes in a single system) and data center 

efficiency (DDN systems are the densest in the industry, housing up to 600 hard drives 

in a single data center rack, and feature Dynamic MAID power management technology). The combination 

of the S2A9700 system scale and the data center optimized configuration proved to 

Cornell that installing and adding capacity could be done very cost‐effectively 

and the system could scale to meet the Center's evolving storage volume 

requirements without a forklift upgrade. 

"We have been very impressed with the performance DDN's S2A9700 delivers," 

said David A. Lifka, CAC director. "For genomics research, Cornell uses Solexa 

Sequencers and the DDN storage system is directly connected to the compute 

cluster, while at the same time continuing to provide backup and archive storage 

for our other projects and departments." 


Ocarina’s ECOsystem platform uses an innovative approach to data reduction. The 

ECOsystem first extracts files into raw binary data and applies object boundaries

to the data. It then applies object dedupe and content‐aware compression to the 

natural semantic objects found within. 

The object dedupe approach finds object duplicates in compressed, encoded data 

that would never be found using standard block dedupe. After processing object 

duplicates, the ECOsystem then applies content specific compression to the 

remaining unique objects. This dual approach provides better space savings than 

either block dedupe or generic compression alone would. Ocarina’s ECOsystem 

includes multiple data compressors for the types of files commonly found in 

research computing environments and includes over 100 algorithms that support 600 

file types. 
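
The general idea of decoding files before deduplicating them, then compressing only the unique objects, can be sketched as follows. This is not Ocarina's algorithm: the codec table and function name are hypothetical, whole-file SHA-256 digests stand in for object boundaries, and generic zlib compression stands in for content-aware compressors.

```python
import bz2
import gzip
import hashlib
import zlib

# Decoders for the container formats this sketch understands (illustrative).
DECODERS = {"gzip": gzip.decompress, "bz2": bz2.decompress}


def store_with_object_dedupe(files):
    """files maps a name to (codec, encoded_bytes).
    Returns (index, store): index maps name -> content digest,
    store maps digest -> compressed unique object."""
    index, store = {}, {}
    for name, (codec, encoded) in files.items():
        raw = DECODERS[codec](encoded)              # extract to raw data
        digest = hashlib.sha256(raw).hexdigest()    # identify the object
        index[name] = digest
        if digest not in store:                     # keep unique objects only
            store[digest] = zlib.compress(raw, 9)   # then compress them
    return index, store


# The same payload encoded two ways: the encoded bytes differ, so comparing
# encoded blocks would not reveal that the contents are identical.
payload = b"ACGT" * 1000
files = {
    "run1.gz": ("gzip", gzip.compress(payload)),
    "run1_copy.bz2": ("bz2", bz2.compress(payload)),
    "other.gz": ("gzip", gzip.compress(b"TTGACA" * 500)),
}
index, store = store_with_object_dedupe(files)
print(len(files), "files ->", len(store), "stored objects")
```

Because the duplicate is detected on the decoded content, the two differently encoded copies of the same data collapse to one stored object.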

ROI achieved: 

As compared to the alternative of disparate storage "islands" managed by various 

independent departments, Cornell experienced a substantial ROI through the 

consolidation and optimization of a globally accessible storage pool. 

By deploying scalable, high‐speed DDN S2A Storage with intelligent Ocarina data 

optimization software, Cornell projected a nearly full return on investment 

within as little as one year. Aggregate capacity requirements were reduced, 

administration was consolidated and economies of scale were gained. It is 

expected that the savings associated with a cost‐effective 

(capacity‐optimized) petabyte‐scalable storage pool, in addition to the FTE 

savings the University realized, will have fully paid for the new system within 

12 months. 

Conclusions: 

As multi‐departmental and multi‐application organizations adopt higher fidelity 

research tools and engage in high‐throughput research, storage requirements will 

balloon across the enterprise. As evidenced at Cornell, a well-planned storage 

consolidation, optimization and deployment strategy not only allows 

researchers to focus on research, but also aids organizations through substantial 

cross-departmental budgetary relief. Scalable storage systems from DataDirect 

Networks, coupled with intelligent file‐format‐aware Ocarina Networks storage 

optimization software, have proven to enable consolidation, savings and 

simplification with tools optimized for the life sciences researcher. 

References: DDN Case Study: 

http://www.datadirectnet.com/index.php?id=246 

Drug Discovery News Article: 

http://www.drugdiscoverynews.com/index.php?newsarticle=2787 

GenomeWeb Article: 

http://www.genomeweb.com/informatics/ocarina‐pitches‐content‐aware‐compressionapproach‐storing‐life‐science‐data?page=1




1. Nominating Organization (Fill this out only if you are nominating a group other than your own.) 

A. Nominating Organization 

Organization name: FalconStor Software 

Address: 

Name: Kathryn Ghita 

Title: PR 

Tel: 617-236-0500 

Email: Kathryn.ghita@metiscomm.com 




Address: Human Neuroimaging Lab – Baylor College of Medicine, 1 Baylor Place, Houston, TX 77030 

Name: Justin King 

Title: Systems Administrator 

Tel: 713-798-4035 

Email: jking@hnl.bcm.edu 

3. Project 

Team Leader Name: Justin King 

Title: Systems Administrator 

Tel: 713-798-4035 

Email: jking@hnl.bcm.edu 

Team members – name(s), title(s) and company (optional): 

4. Category in which entry is being submitted (1 category per entry, highlight your choice) 

Drug Discovery & Development: Compound-focused research, drug safety 

Personalized Medicine: Responders/non-responders, biomarkers 

IT & Informatics: LIMS, High Performance Computing, storage, data visualization, imaging technologies 

Knowledge Management: Data mining, idea/expertise mining, text mining, collaboration, resource optimization 

Health-IT: ePrescribing, RHIOs, EMR/PHR 

(Bio-IT World reserves the right to re-categorize submissions based on submission or in the event that a category is refined.) 

5. Description of project (4 FIGURES MAXIMUM): 

A. ABSTRACT/SUMMARY of the project and results (150 words max.) 

The Human Neuroimaging Laboratory (HNL), part of the Department of Neuroscience at Baylor 

College of Medicine, concentrates on research projects covering neuroscience, psychology, political 

science and economics. This groundbreaking research requires a reliable infrastructure to match 

the speed of discovery. Previously relying on standard tape and disk-to-disk backups, the HNL was 

handcuffed by cumbersome management and disk space constraints. With a small IT staff, the HNL set 

out to enhance its storage management processes, without disruption, to accomplish the goals of 

improving reliability, increasing retention and becoming less dependent on tape. Through the use of 

technologies such as virtual tape libraries (VTL) and data deduplication, the HNL was able to protect 

its invaluable data more efficiently and keep up with the daily demands of cutting-edge neuroscience 

research. 


As one of the top 10 medical and research institutions, the HNL focuses on researching social 

interaction through hyperscanning, a method by which multiple subjects, each in a separate MRI 

scanner, can interact with one another while their brains are simultaneously scanned. Scientists use the 

Internet to control multiple scanners, even if they are located thousands of miles apart in different centers, 

to scan and monitor brain activity simultaneously while the subjects interact with each other. 

Researchers at the HNL run hyperscans concurrently, so a solution was needed to capture each scan as it 

was completed and back it up consistently. Experiments are extremely difficult, time-consuming and 

expensive to reproduce, so the data storage solution needed to save the data quickly and reliably. Once 

the scans were completed, three copies of each file would be made for three different types of analysis, 

creating a glut of similar data on the system. 

The HNL needed a more reliable data storage infrastructure to store these multiple scans during data 

analysis and to ensure that none of the information was lost. Previously, the HNL used a physical 

tape backup solution that required swapping tapes during a backup and limited the length of time 

data could be retained. 

In addition, Systems Administrator Justin King was often called upon to fix tape backup issues and 

to constantly switch out tapes. As a result, King lost valuable research time that could have been spent 

updating and perfecting the hyperscanning software. King was determined to find a simpler solution that 

could run without his constant attention and grow with the HNL's storage demands while providing much 

more reliable, quicker data protection. 


King’s goal in finding a new solution was to end the reliance on tape for data protection. Tape 

was proving too faulty and unreliable. Although he could have bought more disks and tapes to 

continue with the same data protection, King felt that a different solution would scale better with 

the HNL in the future and increase reliability. 

After researching various data backup solutions, King chose a virtual tape library (VTL) solution with 

deduplication that would integrate easily into the existing VMware environment. The FalconStor VTL 

with data deduplication allowed King to complete faster, more reliable backups, while the data 

deduplication feature reduced the amount of data that needed to be stored on disk. The 

VTL solution was implemented with little to no change to the backup environment. Because 

no extensive architecting or hardware changes were needed, King was able to get the new VTL 

solution running quickly. 

Prior to the VTL solution, all the information was backed up regardless of similar data and files. The 

deduplication feature has greatly increased the number of files that can be retained, with a 15:1 ratio: out of 15 

similar files, 1 is processed and stored for backup. The HNL’s storage footprint was greatly reduced so 

that more data could be stored for longer lengths of time. The additional data storage time allows for 

quicker and deeper research into the discovery process of the brain. 

The FalconStor VTL solution with deduplication greatly reduced the backup issues, freeing King’s time to 

focus on improving the hyperscanning software and other research topics. At any given time there may be 

multiple people running MRIs or analyzing the scans, so each hyperscan is extremely important to 

achieving a greater understanding of the brain and how individuals react to one another. The VTL with data 

deduplication ensures that no information is lost regardless of the number of people using the data or new 

scans being added to the system. 

D. ROI achieved or expected (200 words max.): 

The greatest value of the VTL with data deduplication solution has been the simplification of HNL’s
data protection solution. King has since achieved a six‐fold increase in the data retention period
for the hyperscans, from one month to six months, with the ability to extend this to a full 12 months
if needed. The improved retention time allows for more in‐depth analysis, social interaction research
and a greater overall understanding of brain functions and processes.

The improved reliability of the virtual solution over physical tape allowed King to focus fully on
the research needed for major conditions such as personality disorders and others.
His time is no longer spent switching out tapes or fixing problems that resulted in a faulty backup.

The data deduplication ROI is seen in the ratio of data files to those actually processed and saved
for backup. The 15:1 ratio means that 150 TB of logical data could be stored on a 10 TB disk. With
more information stored on less disk, data retention times increased substantially, allowing the
researchers longer access to the data with the aim of learning more about various brain functions,
brain disorders and other issues.
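
The arithmetic behind these figures can be sketched in a few lines of Python. This is only an illustration: the function names are hypothetical, and it assumes the 15:1 ratio applies uniformly to all backed-up data, which a real deduplication engine would not guarantee.

```python
def logical_capacity_tb(physical_tb: float, dedup_ratio: float) -> float:
    """Logical (pre-deduplication) data that fits on a given physical capacity."""
    if physical_tb < 0 or dedup_ratio <= 0:
        raise ValueError("capacity must be non-negative and ratio positive")
    return physical_tb * dedup_ratio


def physical_needed_tb(logical_tb: float, dedup_ratio: float) -> float:
    """Physical capacity needed to hold a given amount of logical data."""
    if logical_tb < 0 or dedup_ratio <= 0:
        raise ValueError("data size must be non-negative and ratio positive")
    return logical_tb / dedup_ratio


def retention_gain(old_months: float, new_months: float) -> float:
    """Factor by which the retention period grew."""
    if old_months <= 0:
        raise ValueError("old retention period must be positive")
    return new_months / old_months


if __name__ == "__main__":
    # Figures quoted in the text: 15:1 ratio, 10 TB disk, 1 -> 6 months.
    print(logical_capacity_tb(10, 15))   # logical TB that fit on 10 TB
    print(physical_needed_tb(150, 15))   # physical TB needed for 150 TB
    print(retention_gain(1, 6))          # retention growth factor
```

With the numbers from the text, 10 TB of disk at 15:1 holds 150 TB of logical data, and moving from one month to six months of retention is a six-fold gain.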

E. CONCLUSIONS/implications for the field. 

The most compelling aspect of the HNL’s story is that there are solutions on the market with which
one person, such as King, can run a lab while also conducting important research into the brain.
As a successful implementation within a data‐intensive lab, it is a proof point for other labs
or research firms looking for a scalable, reliable data protection solution that can be
quickly installed with minimal environment change. As the FalconStor data protection solution is an
out‐of‐the‐box solution for most environments, King was able to install it and forget about it within
a short period of time.

The HNL’s research is vital to understanding the brain and how it processes information in a
variety of environments. This research may help lead to breakthroughs in a number of areas,
including conditions such as Parkinson’s, schizophrenia and autism, as well as other disorders.
With a secure data protection solution in place, the HNL can focus on what it does best:
conducting groundbreaking research into the brain and creating better measurement
and research solutions.

6. REFERENCES/testimonials/supporting internal documents (If necessary; 5 pages max.)


Nominating Organization name: Isilon Systems 

Nominating Organization address: 3101 Western Ave 

Nominating Organization city: Seattle 

Nominating Organization state: WA 


Nominating Contact Person: Lucas Welch 

Nominating Contact Person Title: PR Manager 

Nominating Contact Person Phone: 206‐315‐7621 

Nominating Contact Person Email: lucas.welch@isilon.com 

User Organization name: Oklahoma Medical Research Foundation 

User Organization address: 825 NE 13th Street 

User Organization city: Oklahoma City 

User Organization state: OK 


User Organization Contact Person: Stuart Glenn 

User Organization Contact Person Title: Software Engineer 

User Organization Contact Person Phone: 405‐271‐7933 x35287 

User Organization Contact Person Email: stuart‐glenn@omrf.org 


Transition to Nextgen Sequencing and Virtual Data Center 

Team Leaders name: Stuart Glenn 

Team Leaders title: Software Engineer 

Team Leaders Company: OMRF 

Team Leaders Contact Info: 405‐271‐7933 x35287, stuart‐glenn@omrf.org 

Team Members name: 

Team Members title: 

Team Members Company: 




Introduction: Oklahoma Medical Research Foundation (OMRF), a leading 

nonprofit biomedical research institute, experienced an unprecedented influx of 

mission‐critical, genetic information with the introduction of a high‐powered, 

next‐generation Illumina Genome Analyzer and server virtualization. To maximize 

both its infrastructure investment and the value of its genetic data, OMRF needed 

a storage solution capable of keeping pace with its tremendous data growth while 

still powering its virtual data center without the burden of costly upgrades and 

tedious data migrations.

In its efforts to identify more effective treatments for human disease, OMRF 

generates tremendous amounts of mission‐critical genomic information. 

This data is then processed and analyzed using Linux computer servers running the 

VMware ESX virtualization software application. With its previous NAS system, 

OMRF would have been forced to migrate genetic information back and forth between 

disparate data silos, slowing sequencing runs and depriving its virtual servers 

of the data access and high throughput necessary to realize the full potential of 

virtualized computing.

Results: Using scale‐out NAS from Isilon Systems, OMRF has unified both 

its DNA sequencing pipeline and virtualized computing infrastructure into a 

single, high performance, highly scalable, shared pool of storage, simplifying 

its IT environment and significantly speeding time‐to‐results. 

OMRF can now scale its storage system on‐demand to meet the rapid data growth and 

unique performance demands of its mission‐critical workflow, increasing 

operational efficiency and decreasing costs in an effort to identify genetic 

precursors to diseases such as Alzheimer’s, Lupus and Sjögren’s Syndrome. 

With its scale‐out NAS solution, OMRF has created a single, highly reliable 

central storage resource for both its entire next‐generation sequencing workflow 

and its virtual computing infrastructure, dramatically simplifying storage 

management and streamlining data access across its organization. Today, OMRF can 

cost‐effectively manage rapid data growth from a single file system, eliminating 

data fragmentation caused by traditional NAS in virtual environments and 

maximizing the performance of both its virtual servers and its DNA sequencing 

workflow. 

By deploying a second Isilon system off‐site and using Isilon’s SyncIQ® 

asynchronous data replication software to replicate data between its primary and 

off‐site clusters, OMRF also has a highly reliable solution in place to ensure 

its data is immediately available even in the case of IT failure or natural 

disaster. 

ROI achieved: 

Conclusions: 

References:




1. Nominating Organization (Fill this out only if you are nominating a group other than your own.) 



Address: 


Name: 

Title: 

Tel: 

Email: 



Organization name: National Institute of Allergy and Infectious Diseases (NIAID) 

Address: 10401 Fernwood Rd., Bethesda, MD 20892 


Name: Nick Weber 

Title: Scientific Informatics & Infrastructure Analyst 

Tel: 301.594.0718 

Email: webermn@niaid.nih.gov 

3. Project 

Project Title: A Centralized and Scalable Infrastructure Approach to Support Next Generation Sequencing 

at the National Institute of Allergy and Infectious Diseases 

Team Leader 

Name: Nick Weber (Lockheed Martin Contractor) 

Title: Scientific Informatics & Infrastructure Analyst 

Tel: 301.594.0718 

Email: webermn@niaid.nih.gov 


• Vivek Gopalan – Scientific Infrastructure Lead (Lockheed Martin Contractor) 

• Mariam Quiñones – Computational Molecular Biology Specialist (Lockheed Martin Contractor) 

• Hugo Hernandez – Senior Systems Administrator (Dell Perot Systems Contractor) 

• Robert Reed – Systems Administrator (Dell Perot Systems Contractor) 

• Kim Kassing – Branch Chief, Operations and Engineering Branch (NIAID Employee)



• Yentram Huyen – Branch Chief, Bioinformatics and Computational Biosciences Branch (NIAID 

Employee) 

• Michael Tartakovsky – NIAID Chief Information Officer and Director of Office of Cyber Infrastructure 

and Computational Biology (NIAID Employee) 







IT & Informatics: LIMS, High Performance Computing, storage, data visualization, imaging technologies 

Knowledge Management: Data mining, idea/expertise mining, text mining, collaboration, resource 

optimization 



(Bio‐IT World reserves the right to re‐categorize submissions based on submission or in the event that a category 

is refined.) 



Recent advances in the “next generation” of sequencing technologies have enabled high‐throughput 

sequencing to expand beyond large specialized facilities and into individual research labs. Improved 

chemistries, more powerful software, and parallel sequencing capabilities have led to the creation of many 

terabytes of data per instrument per year that will serve as the basis for diverse genomic research. 

In order to manage the massive amounts of data, many researchers will require assistance from IT experts and 

bioinformaticians to store, transfer, process, and analyze all the data generated in their labs. The Office of 

Cyber Infrastructure and Computational Biology (OCICB) at the National Institute of Allergy and Infectious 

Diseases (NIAID) has developed a centralized and scalable infrastructure to support Next Generation 

Sequencing efforts across the Institute. Primary goals of this approach are to standardize practices for data 

management and storage and to capitalize on the efficiencies and cost savings of a shared high‐performance 

computing infrastructure. 


The Office of Cyber Infrastructure and Computational Biology (OCICB) manages technologies supporting NIAID 

biomedical research programs. The Office provides a spectrum of management, technologies development, 

applications/software engineering, bioinformatics support, and professional development. Additionally, OCICB 

works closely with NIAID intramural, extramural, and administrative staff to provide technical support, liaison, 

coordination, and consultation on a wide variety of ventures. These projects and initiatives are aimed at 

ensuring ever‐increasing interchange and dissemination of scientific information within the Federal



Government and among the worldwide scientific network of biomedical researchers. Both the Operations and 

Engineering Branch (OEB) and the Bioinformatics and Computational Biosciences Branch (BCBB) are branches 

of the OCICB. 

The OEB provides technical and tactical cyber technologies management and support for NIAID extramural 

biomedical research programs. OEB delivers essential and assured services to facilitate communication using 

electronic systems and a collegial, authorized, and accessible framework for automated information sharing 

and collaboration. The BCBB provides three suites of scientific services and resources for the NIAID research 

community and its collaborators: Biocomputing Research Consulting, Bioinformatics Software Development, 

and Scientific Computing Infrastructure. 

The primary objectives of the ‘Centralized and Scalable NIAID Infrastructure’ project include the following: 

• To assist NIAID laboratories in assessing their infrastructure needs for data storage and analysis of 

massively‐parallel sequencing. 

• To procure, operate, and maintain computing hardware that supports the data storage and processing 

needs for Next Generation Sequencing across the Institute. 

• To procure, build, and assist in the use of third‐party applications to be hosted on the NIAID Linux High 

Performance Computing Cluster. 

• To provide a robust, reliable, cost‐effective, and scalable cyber infrastructure that will serve as the 

foundation to support Next Generation Sequencing at the NIAID. 

A secondary objective of this project is to develop a standardized process for handling infrastructure requests 

for similar high‐performance computing endeavors that will require access to large amounts of data storage 

and processing. 

Project responsibilities of the OCICB Operations and Engineering Branch include: 

• Designing and provisioning appropriate resources to meet the scientific and business goals of the 

Institute 

• Consulting regularly with clients to assess performance and modify the core facility to maintain 

appropriate performance 

• Selecting and managing the operating system, grid engine, and parallelizing software for computing 

resources 

• Selecting, developing, maintaining, and managing computing resources pursuant to effective processing 

of associated data 

• Selecting, developing, maintaining, and managing the enterprise storage components 

• Selecting, developing, maintaining, and managing effective networking components 

• Managing the security of the data, operating systems, appliances, and applications 

• Provisioning user accounts necessary for user applications 

• Collaborating with the Bioinformatics and Computational Biosciences Branch to ensure appropriate 

resources are provisioned that enable effective use of the facility 

The Bioinformatics and Computational Biosciences Branch’s responsibilities include: 

• Facilitating coordination and communications among OCICB groups and NIAID laboratories



• Maintaining a shared intranet portal for collaboration and document sharing between the OCICB and 

NIAID laboratories 

• Documenting minimum requirements for software applications that will be hosted on the NIAID Linux 

High Performance Computing Cluster (in order to aid OEB in the determination of hardware specifications 

for the cluster) 

• Working with the NIAID laboratories to analyze and document workflows/pipelines for downstream data 

analysis 

• Installing, maintaining, upgrading, and supporting software applications on the NIAID Linux High 

Performance Computing Cluster 

• Providing user‐friendly, web‐based interfaces to software applications hosted on the NIAID Linux High 

Performance Computing Cluster 

• Evaluating and selecting a Laboratory Information Management System (LIMS) to assist with end‐to‐end 

processing and analysis of Next Generation Sequencing data 


The OCICB’s Operations and Engineering Branch (OEB) has made several significant investments to support 

Next Generation Sequencing research, including improvements in the NIAID network, in data storage and 

processing hardware, and in the personnel required to build and maintain this infrastructure. Specific upgrades 

include the following: 

• Expansion of network bandwidth from 1 to 10 gigabits per second to support increased network traffic 

between NIAID research labs and the NIAID Data Center 

• Construction of a high‐speed and highly dense enterprise storage system, originally built at
300‐terabyte capacity but rapidly scalable to up to 1.2 petabytes

• Creation of a high‐performance Linux computing cluster hosting many third‐party applications that 

enables efficient data processing on a scalable and high‐memory pool of resources 

• Deployment of a localized mirror of the UCSC Genome Browser for rapid data visualization and sharing 
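
To give a feel for what the network and storage upgrades listed above mean in practice, the following Python sketch estimates transfer times at 1 and 10 gigabits per second and the headroom between the initial and maximum storage capacity. It is a back-of-the-envelope illustration only: the function names, the 1 TB example dataset, and the assumption of a fully utilized link (no protocol overhead) are not from the NIAID project.

```python
def transfer_hours(data_tb: float, link_gbps: float, efficiency: float = 1.0) -> float:
    """Hours needed to move data_tb (decimal terabytes) over a link_gbps link.

    efficiency is the assumed fraction of nominal bandwidth actually achieved.
    """
    if link_gbps <= 0 or not 0 < efficiency <= 1:
        raise ValueError("link speed must be positive and efficiency in (0, 1]")
    bits = data_tb * 1e12 * 8                      # terabytes -> bits
    seconds = bits / (link_gbps * 1e9 * efficiency)
    return seconds / 3600


def scale_factor(initial_tb: float, maximum_tb: float) -> float:
    """How many times larger the maximum capacity is than the initial build."""
    if initial_tb <= 0:
        raise ValueError("initial capacity must be positive")
    return maximum_tb / initial_tb


if __name__ == "__main__":
    dataset_tb = 1.0  # hypothetical size of one sequencing data set
    for gbps in (1, 10):
        print(f"{dataset_tb} TB at {gbps} Gb/s: {transfer_hours(dataset_tb, gbps):.2f} h")
    # 300 TB initial build, scalable to 1.2 PB (1200 TB)
    print(f"storage headroom: {scale_factor(300, 1200):.0f}x")
```

Under these idealized assumptions the tenfold bandwidth increase cuts transfer time by the same factor, and the storage system can grow to four times its initial size.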

In addition to these upgrades, the OCICB’s Bioinformatics and Computational Biosciences Branch (BCBB) will 

provide bioinformatics collaboration and support to researchers. Specific resources that will be provided 

include the following: 

• End‐to‐end laboratory information management system (LIMS) to support sample preparation and 

tracking; task assignment; interaction with the instrument; downstream analysis and custom pipelines 

between applications; data sharing; and data publication/visualization 

• Training on the use of bioinformatics applications and development of custom workflows and 

application pipelines to streamline data analysis 

• Collaboration on the data integration, analysis, and annotation/publication processes 

Some policy decisions for using the centralized infrastructure have yet to be made, including formalizing 

procedures for long‐term data retention as well as balancing data privacy/security requirements while 

concurrently facilitating data sharing and publication. Nevertheless, NIAID’s centralized approach highlights 

the need for a cooperative partnership between bench researchers, computational scientists, and IT 

professionals in order to advance modern scientific exploration and discovery.



D. ROI achieved or expected (200 words max.): 

Expected returns on this investment are many and include the tangible and intangible benefits and cost 

avoidance measures listed below: 

Tangible Benefits: 

• Cost savings through reduction of people‐hours for IT development, application deployment, system 

maintenance, and customer support for centralized implementation (versus distributed 

implementations to support labs separately) 

Intangible Benefits: 

• Improved security/reduced risk by managing a single, centralized pool of infrastructure resources 

(includes enterprise‐level security, storage, and back‐up; dedicated virtual LAN; failover/load‐sharing 

file services cluster and scheduler; and a single, formal disaster recovery and continuity of operations 

plan) 

• Increased awareness of bioinformatics resources available to labs at NIAID and other NIH Institutes 

• Elevated access to single, integrated team of subject matter experts including system administrators, 

infrastructure analysts, bioinformatics developers, and sequence analysis experts 

• Enhanced collaboration with research organizations external to NIAID that will take advantage of 

high‐performance computing environment 

• Improved research productivity to work toward combating/eradicating critical diseases 

Cost Avoidance: 

• Efficient use of centralized storage and computing resources used at higher capacity 

• Leveraged energy efficiency of data center power and cooling systems 

• Estimated 5‐fold savings in software licensing fees for shared deployment on cluster 

• Limited consolidation and migration costs for systems/data in centralized implementation 


Genomic research is a rapidly growing field with broad implications at the NIAID and in the global 

research community in general. Rather than having laboratory staff attempt to develop the requisite 

storage, network, and computing capacity themselves, NIAID’s Chief Information Officer has made a 

significant investment to centralize infrastructure resources in order to maximize efficiency and minimize 

cost and risk. Major network and storage upgrades, in addition to the construction of a powerful and 

scalable Linux computing cluster, are the most visible parts of this investment. However, additional 

personnel – including an experienced Linux Systems Administrator and bioinformatics support staff – 

have also been acquired. By utilizing the centralized infrastructure and resources, researchers doing 

important and influential work in immunology, vaccinology, and many other research areas that are 

immensely beneficial to the public will be better able to conduct their research. 

Large datasets and powerful multi‐core computers are not unique to Next Generation Sequencing. Other 

research areas of interest at the NIAID will also benefit from the new high‐performance computing 

resources. The NIAID has been able to reuse many of the successful development, procurement, and
communications processes from this project to continue to foster cooperation between bench researchers,

bioinformaticians, and IT professionals. Sharing this experience as a best practice – including highlighting 

the hurdles and setbacks in addition to the progress – can provide a strong starting point for other 

organizations that plan to increase their Next Generation Sequencing and high‐performance computing 

capabilities. 

1. REFERENCES/testimonials/supporting internal documents (If necessary; 5 pages max.)




1. Nominating Organization (Fill this out only if you are nominating a group other than your own.) 


Organization name: Panasas 

Address: 6520 Kaiser Drive, Fremont, CA 94555 


Name: Angela Griffo, Trainer Communications (agency contact) 


Tel: 949‐240‐1749 

Email: agriffo@trainercomm.com 



Organization name: Uppsala University 

Address: P.O. Box 256 SE‐751 05 Uppsala, Sweden 


Name: Ingela Nystrom PhD 

Title: Director of Uppsala Multidisciplinary Center for Advanced Computational Science (UPPMAX) 

Tel: +46 70 1679045 

Email: Ingela.Nystrom@cb.uu.se 

3. Project 

Project Title: UPPNEX 

Team Leader 

Name: Ingela Nystrom 


Tel: +46 70 1679045 

Email: Ingela.Nystrom@cb.uu.se 


Professor Kerstin Lindblad‐Toh, Broad Institute/Uppsala University 

PhD Jukka Komminaho, Systems expert manager of UPPMAX, Uppsala University 

Jonas Hagberg, Systems expert of UPPMAX, Uppsala University 


Basic Research & Biological Research: Disease pathway research, applied and basic research







IT & Informatics: LIMS, High Performance Computing, storage, data visualization, imaging technologies 

Knowledge Management: Data mining, idea/expertise mining, text mining, collaboration, resource 

optimization 






Uppsala Multidisciplinary Center for Advanced Computational Science (UPPMAX) is Uppsala
University's resource for high‐performance computing.

In recent years, Swedish researchers have become overwhelmed with data from next‐generation 

sequencing machines. UPPMAX’s challenge was to provide the researchers with a centralized 

compute and storage facility, capable of handling multiple terabytes of new bioinformatics data per 

week. 

In 2008 the Knut and Alice Wallenberg Foundation granted research funding for a national IT facility 

dedicated to the compute and storage of genomic data. UPPMAX therefore had a new project, 

‘UPPmax NEXt generation sequence Cluster & Storage’ (UPPNEX). 

Today, a centralized resource for the compute and storage of next‐generation sequencing data is in
place, resulting in faster conclusions for scientific research. Since the introduction of UPPNEX,
project times have decreased by several months!

Groundbreaking research, using UPPNEX resources, has already developed improvements in 

agriculture processes and the understanding of human growth and obesity. 


The UPPMAX facility was founded in 2003 at Uppsala University. UPPMAX is part of the Swedish 

National Infrastructure for Computing (SNIC). Since its establishment, UPPMAX has provided 

researchers (both locally and nationally) with access to a number of high‐performance computing 

(HPC) systems.



UPPMAX’s users traditionally come from research areas such as physics, chemistry, and computer 

science. Lately, however, the number of Life Sciences users has increased dramatically. This is 

mainly due to the technical advances, affordability and increased deployment of next‐generation 

sequencing (NGS) machines. 

By 2008 it had become apparent to Swedish researchers that the tsunami of data from NGS

systems created a problem that individual research grants could not solve. In many cases, Life 

Sciences research teams were trying to manage the problem themselves. However, due to the 

sheer volume of data, they wasted a lot of time copying data between systems, waiting for others 

to complete their computing before they could start their own ‐ and often writing custom code to 

manage jobs that would typically max‐out system resources. In short, the teams often spent as 

much time solving computing challenges as they did on scientific research! 

It was for these reasons that, in 2008, a national consortium of life sciences researchers was formed 

to address the challenges presented by this massive increase in bioinformatics data. These 

researchers would normally compete for resources and research funding. However, it had become 

apparent that a centralized facility was required. The computation and data storage requirements 

of NGS data created a workload that, at peak processing times and for long‐term data archiving, had 

to be handled by a larger facility. 

The consortium therefore submitted an application to SNIC and the Knut and Alice Wallenberg 

Foundation to fund a centralized life sciences compute and storage facility to be hosted at UPPMAX. 

The united conviction of the consortium was that a sufficient compute and storage facility would

ultimately strengthen their attempts to combat disease. 

The application was successful, with the Knut and Alice Wallenberg Foundation noting that the

consortium’s collaborative effort was a major advantage. 

And so, the “UPPmax NEXt generation sequence Cluster & Storage (UPPNEX)” project was formed. 

Today, a 150‐node (1200‐core) compute cluster from HP with InfiniBand as interconnect is in
production with half a petabyte (500 TB) of Panasas parallel storage. The solution passed a
one‐month acceptance period at the first attempt and entered production in October 2009.

The objectives of the UPPNEX solution were to provide Life Sciences Researchers throughout 

Sweden with: 

1. Sufficient high‐performance computing resources to cover their regular and peak project 

requirements



The key challenge was to provide a compute system with enough performance and resources to 

handle the massively parallel software algorithms required to process the genomic data. 

A further challenge was to provide a sufficiently high‐performance storage solution that could
handle the large number of clients with concurrent I/O requests.

2. Longer‐term data storage facilities to provide a centralized, national data repository 

With multiple terabytes of new data being received by UPPNEX on a weekly basis, the storage 

solution had to scale capacity, without incremental complexity and management costs. To 

protect the data, the storage solution had to be highly‐available (with failover and redundancy 

features built in). Additionally, the storage had to be compatible with UPPMAX’s existing backup 

infrastructure. 


In order to address the challenges of the massive ingest of bioinformatics data, UPPNEX leverages a 

parallel storage solution from Panasas. Panasas was born out of a 1990s US DOE research project

into Petascale computing and the file‐system technologies required to process and manage massive 

amounts of data. 

Since Panasas was formed in 1999, the company has developed its modular storage hardware 

platform in unison with its parallel file‐system, PanFS. With strong initial success in traditional HPC 

markets, Panasas has complemented its performance with enterprise class features and easy 

management. The past few years have seen Panasas at an inflection point, where the company’s 

solutions have been gaining swift traction in data‐intensive workflows such as seismic processing, 

computational fluid dynamics and life sciences (in particular around next‐generation sequencing 

and medical imaging). 

UPPNEX chose Panasas parallel storage because it provided the performance required by their HPC
system when processing massively parallel life sciences applications. Additionally, Panasas provided
a lower‐cost (yet highly reliable) storage pool for the longer‐term storage requirement. The unique
aspect of the Panasas solution is that both of these storage pools sit under the same management
layer. It is therefore easy to manage both storage pools, which significantly reduces the
administration overhead of UPPNEX compared to a traditional NFS‐based solution.

It is anticipated that the long‐term storage pool for UPPNEX will grow by 250 Terabytes in 2010. 

However, unlike alternative NAS solutions, the management complexity of the Panasas solution will 

not grow as the storage capacity grows. The Panasas solution scales to tens of Petabytes in a single 

management layer and additional capacity is added with zero loss in productivity. 
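
As a rough consistency check, the anticipated 250 TB of growth in 2010 can be converted into a weekly ingest rate and compared with the "multiple terabytes per week" quoted earlier. The Python sketch below does this and also estimates how long a given amount of free capacity would last. The function names, the even spread of growth over 52 weeks, and the free-capacity figure used in the example are assumptions made for illustration, not UPPNEX planning data.

```python
WEEKS_PER_YEAR = 52


def weekly_ingest_tb(annual_growth_tb: float) -> float:
    """Average weekly ingest implied by a yearly growth figure (even spread assumed)."""
    if annual_growth_tb < 0:
        raise ValueError("annual growth must be non-negative")
    return annual_growth_tb / WEEKS_PER_YEAR


def weeks_until_full(free_tb: float, ingest_tb_per_week: float) -> float:
    """Weeks until the given free capacity is consumed at a constant ingest rate."""
    if ingest_tb_per_week <= 0:
        raise ValueError("ingest rate must be positive")
    return max(free_tb, 0.0) / ingest_tb_per_week


if __name__ == "__main__":
    rate = weekly_ingest_tb(250)  # 250 TB growth anticipated for 2010
    print(f"implied ingest: {rate:.1f} TB/week")
    # Hypothetical example: 100 TB currently free in the long-term pool
    print(f"100 TB free lasts about {weeks_until_full(100, rate):.0f} weeks")
```

The implied rate of just under 5 TB per week is in line with the multiple terabytes per week that UPPNEX reports receiving.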

D. ROI achieved or expected (200 words max.):



Technology ROI: 

Individual research groups no longer have to over‐spec IT solutions to meet peak requirements. By 

moving towards centralized solutions, there are substantial gains thanks to the coordination of 

staff, computer halls, etc. 

Research ROI: 

An example research project that leveraged UPPNEX has reduced its time‐to‐completion by several
months. The project focused on gaining a deeper understanding of the relationship between

genetic variation and phenotypic variation. Through whole genome resequencing, the researchers 

distinguished key genes causing the differences between wild and domestic chickens. They have 

identified candidate mutations that cause special effects on the phenotype. This is an efficient 

strategy to increase our understanding of how different genes control different traits. 

One gene, associated with the fast growth of broiler chickens, is associated with obesity in humans. 

The study established a new animal model that can be used to explore the mechanics of how this 

gene influences human growth and obesity. 

Lastly, the domestic chicken is the most important global source of animal protein. The research has 

established the possibility to develop domestic chickens that are extremely efficient producers of 

animal proteins, namely eggs and meat. 


The recent technological advancements, affordability and wide deployment of NGS machines are
feeding a tsunami of digital data. The information technology infrastructure required to compute
and store such vast amounts of data is beyond the funding of individual research groups.

Centralized HPC and data‐storage facilities are being deployed at regional, national and global level 

to provide researchers with access to the IT infrastructure they require. 

The challenge for the centralized facilities is to provide sufficient compute and data‐storage 

resources to fuel multiple research projects simultaneously. With ever‐increasing amounts of 

digital data being ingested, how do they process, manage and store the data both reliably and 

efficiently?

Traditional storage technologies cannot keep pace. Their limitations on capacity encourage data 

silos, multiple copies of data, system administration headaches and an escalating management 

overhead. Clustered storage technologies struggle to address diverse performance requirements 

within the life sciences workflow, again encouraging data silos and disparate storage management 

layers.



Panasas parallel storage caters for the diverse performance, reliability and cost requirements across 

the life sciences workflow. Scaling to tens of petabytes under a single management layer, Panasas 

users can scale storage with zero loss in productivity. 

The industry is at an inflection point that goes beyond the capabilities of traditional storage 

technologies. Centralized facilities such as UPPNEX are blazing a trail and deploying innovative 

technologies to enhance national scientific discovery that ultimately benefits the global community. 

1. REFERENCES/testimonials/supporting internal documents (If necessary; 5 pages max.) 

Can we add here a link to the paper on the chicken? 

The journal Nature is very strict that links to manuscripts should not be spread prior to release. On the other hand, the
paper will be out very soon, so please ask the PI of the project whether it is possible: Professor Leif Andersson

leif.andersson@imbim.uu.se. 

Can we add a link to anything about the research consortium or the grant approval? 

UPPMAX: 

www.uppmax.uu.se 

UPPNEX: 

www.uppnex.uu.se (very soon available) 

Uppsala University’s press-release of the grant approval: 

http://www.uu.se/news/news_item.php?id=534&typ=pm 

SNIC announcement of the grant (in Swedish): 

http://www.snic.vr.se/news-events/news/kaw-och-snic-30-miljoner-kronor-till-storskaliga



Organization name: Translational Genomics Research Institute 

Address: 445 N 5th Street, Phoenix, AZ 85004


Name: James Lowey 

Title: Director HPBC 

Tel: 480‐343‐8455 

Email: jlowey@tgen.org 

3. Project 

Project Title: NextGen Data Processing Pipeline 

Team Leader James Lowey 

Name: James Lowey 

Title: Director HPBC 

Tel: 602‐343‐8455 

Email: jlowey@tgen.org 

Team members – name(s), title(s) and company (optional): Carl Westphal – IT Director, Dr.

Waibhav Tembe – Sr Scientific Programmer, Dr. David Craig – Associate Director of the

Neurogenomics Division, Dr. Ed Suh – CIO







x IT & Informatics: LIMS, High Performance Computing, storage, data visualization, imaging 

technologies

Knowledge Management: Data mining, idea/expertise mining, text mining, collaboration, 

resource optimization 



Abstract Evolving NextGen sequencing requires high throughput scalable Bio-IT infrastructure. 

Organizations committed to using this technology must remain nimble and design workflows and IT 

infrastructures that are capable of adapting to the dramatic increase in demands driven by changes in 

NextGen sequencing technology. TGen, as an early adopter of multiple NextGen sequencing platforms,

has experienced the evolution first hand and has implemented infrastructure and best practices that have

enabled our scientists to effectively leverage this technology. This paper will provide an overview of the 

challenges presented by NextGen sequencing, the associated impact in terms of informatics workflow 

and IT infrastructure, and will discuss what TGen has done to address this challenge. 

[Introduction] 

Beginning of the data deluge in 2009 

In late 2008, NextGen sequencing at TGen was just beginning. One Illumina SOLEXA and a single 

ABI/Lifetech SOLiD sequencer were the initial NextGen platforms brought into TGen. At that point, just 

one whole-genome alignment with SOLiD had been successfully completed using ABI/LifeTech’s corona 

software pipeline on TGen’s large parallel cluster supercomputer, and the SOLEXA Genome Analyzer

pipeline was running on a smaller internal cluster. A team of bioinformaticians was still going through the

steep learning curve involved in getting familiarized with the technology, file types, analytical challenges, 

and data mining opportunities. In January 2009, TGen investigators began work on a SOLiD NextGen 

sequencing processing and analysis project. This project needed to demonstrate, before March 2009, the

capability to align 4x SOLiD pilot data from about 110 samples against the whole genome, carry

out the required annotation, and disseminate the results to collaborating centers. The sheer volume and

computational resource requirements for processing this data within 90 days presented a formidable 

challenge. Turning this challenge into an opportunity, the TGen IT team, working in conjunction with

bioinformaticians, designed and implemented a customized version of the corona pipeline configured to

make maximal use of the available computational horsepower of TGen’s High Performance Computing

(HPC) cluster [1]. The NextGen data processing pipeline depicted in Figure 1 distributed the

computational task of data alignment over multiple cores, while analyzing and annotating both

single-fragment and mate-pair data using several custom scripts. This set-up proved to be sufficient to

successfully carry out this project. However, it was quickly realized that a radically different IT

infrastructure was required to meet the computing and infrastructure challenges to make NextGen data 

analysis a standardized service to the scientists.

Figure 1 Data processing Pipeline (March 2009) 

Challenges Faced 

TGen’s NextGen sequencing demand is growing at an unparalleled pace, requiring large scale storage 

infrastructure, high performance computing and high throughput network connectivity. This demand 

places considerable strain upon conventional analysis tools and scientific data processing infrastructure. 

The volume of data being generated for NextGen sequencing depends on the specific technology,

instrument version, sample preparation, experimental design, and sequencing chemistry. Each

experimental run typically generates between 25 and 250 GB of data consisting of sequenced bases and

quality scores. Each such dataset must be moved from the sequencer to longer term storage, and also be 

made available to computational resources for alignment and other tasks, such as variant detection and 

annotation. Results need to be written back to long-term storage and optionally be made available to 

external collaborators over the Internet. 
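To put these figures in perspective, the following minimal sketch estimates the ideal time needed to move one run's output over a link of a given nominal speed. It assumes decimal units (1 GB = 10^9 bytes), ignores protocol overhead and contention, and uses a hypothetical function name and an illustrative `efficiency` parameter that do not come from the TGen pipeline.

```python
def transfer_time_seconds(size_gb: float, link_mbps: float, efficiency: float = 1.0) -> float:
    """Ideal time to move size_gb gigabytes over a link of link_mbps megabits per second.

    efficiency is an illustrative fudge factor (0 < efficiency <= 1) for the share
    of nominal bandwidth that is actually usable; 1.0 means a perfect link.
    """
    if size_gb < 0 or link_mbps <= 0 or not 0 < efficiency <= 1:
        raise ValueError("size must be >= 0, link speed > 0, efficiency in (0, 1]")
    bits_to_move = size_gb * 1e9 * 8
    usable_bits_per_second = link_mbps * 1e6 * efficiency
    return bits_to_move / usable_bits_per_second


if __name__ == "__main__":
    # The 25 GB and 250 GB bounds are the per-run sizes quoted in the text.
    for size_gb in (25, 250):
        for link_mbps in (100, 1000, 10000):
            hours = transfer_time_seconds(size_gb, link_mbps) / 3600
            print(f"{size_gb:>4} GB over {link_mbps:>5} Mb/s: {hours:6.2f} h")
```

Under these assumptions, a 250 GB run needs at least about 33 minutes on a fully available 1 Gb/s link and roughly 5.6 hours at 100 Mb/s, before any real-world overhead is counted.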

Some of the specific challenges TGen had to overcome in the early days of NextGen sequencing at TGen 

are as follows: 

1. Fair allocation of resources: The analysis of one sample from the NextGen sequencers takes 3-4 

days on the HPC cluster. Processing and analyzing 110 samples in 90 days

required running multiple jobs using hundreds of processing cores. However, the HPC cluster

was a shared system being used by hundreds of users so it was necessary to ensure that jobs 

were properly prioritized in order to meet the requirements of the project. 

2. System optimization: Multiple instances of the software being used in the sequence data analysis 

pipeline pushed the limits of the I/O capabilities of the underlying Lustre [2] file system on the 

HPC cluster. Manual intervention from the system administrators was required to build custom job 

processing queues to allow the system to reallocate its resources in order for the HPC cluster to 

continue functioning optimally. 

3. Evolving tools: The software tools for converting the output sequence files into the deliverable 

format were evolving. It was necessary to maintain sequence data in a variety of different formats 

to test and validate the software tools used for converting the output sequence files. This required 

TGen to keep intermediate data files resulting in a considerable demand for storage resources. 

4. Data deluge and transfer: The amount of data being generated from the sequencers and post-processing

led to the challenge of managing tens of terabytes of data. The volume of

computational processing pushed the limits of the existing 80 TB Lustre file system on the HPC

cluster. In addition, transferring terabytes of data for data processing and sharing over a 1Gb link

was a bottleneck in the sequence data processing pipeline. 

5. User Education and support: The bioinformatics team dedicated to the analysis was relatively 

new to SOLiD data processing and to using the full functionality of the available HPC cluster

resources. Therefore, end-user education and 24x7 help on data analysis tasks were

necessary.

These factors hindered the implementation of a fully automated data processing pipeline, and manual

supervision of every single analysis was necessary. Next-generation sequencing was being increasingly

adopted by TGen investigators and more sequencing projects were in the pipeline for 2009. This required 

TGen to build and provide scalable sequence data processing infrastructure within the given financial and 

time constraints. 
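The scheduling pressure behind the first challenge above can be estimated with simple arithmetic. The sketch below is only an illustration: it assumes each analysis occupies one "slot" of cluster resources for its full 3-4 days, that analyses in a slot run back to back, and that there are no failures, reruns or queue waits. The function name and the notion of a slot are hypothetical.

```python
import math


def min_concurrent_analyses(samples: int, days_per_sample: float, deadline_days: float) -> int:
    """Smallest number of analyses that must run side by side to meet the deadline."""
    if samples <= 0 or days_per_sample <= 0 or deadline_days <= 0:
        raise ValueError("all arguments must be positive")
    if days_per_sample > deadline_days:
        raise ValueError("a single analysis does not fit inside the deadline")
    # Number of analyses one slot can finish back to back before the deadline.
    per_slot = math.floor(deadline_days / days_per_sample)
    return math.ceil(samples / per_slot)


if __name__ == "__main__":
    # 110 samples, 3-4 days each, 90 days: the figures quoted in the text.
    for days in (3, 4):
        slots = min_concurrent_analyses(110, days, 90)
        print(f"{days} days per sample -> at least {slots} concurrent analyses")
```

With the quoted figures this gives four to five analyses running at all times for the whole 90 days, which helps explain why job prioritization on a shared cluster mattered.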

In response to these challenges, IT worked closely with the scientific community to design a new

internal workflow and deploy an advanced IT infrastructure for NextGen sequencing data processing.

The new software and hardware infrastructure accelerates data processing and analysis, and enables 

scientists to better leverage the NextGen sequencing platform. The following section provides the 

lessons learned from the challenges we faced. 

Lessons Learned: 

The challenges above provided the opportunity to learn many valuable lessons in how to construct and 

provide a NextGen sequencing data processing pipeline. The following is a summary of the key lessons 

we have learned to date.

The Impact of I/O: We quickly learned that it is possible for a 4000 CPU core cluster to be rendered 

nearly useless by less than 1/3 of the nodes saturating the file system with I/O operations. Many small I/O 

requests can quickly overwhelm the cache on disk controllers, which causes a large queue of requests to

accumulate, having a negative impact on performance. The Lustre-based file system remains intact;

however, the delay of each I/O operation on the shared file system increases and essentially all operations on

the cluster grind to a halt. The key is to actively manage and schedule computational jobs in such a way 

as to prevent select jobs from overwhelming the system and impacting other jobs. 
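The idea of capping how many I/O-heavy jobs hit the shared file system at once can be illustrated in a few lines. The sketch below is not TGen's scheduler configuration (on a real cluster this would normally be done with batch-queue limits); it is an in-process analogy using a semaphore, and every name in it is hypothetical.

```python
import threading
import time
from concurrent.futures import ThreadPoolExecutor


class IoSlotLimiter:
    """Caps how many I/O-heavy jobs may use the shared file system at the same time."""

    def __init__(self, max_io_jobs: int) -> None:
        if max_io_jobs < 1:
            raise ValueError("max_io_jobs must be at least 1")
        self._slots = threading.BoundedSemaphore(max_io_jobs)
        self._lock = threading.Lock()
        self._active = 0
        self.peak = 0  # highest number of jobs seen inside the I/O phase

    def run(self, job, *args):
        # Block until an I/O slot is free, then run the job while holding it.
        with self._slots:
            with self._lock:
                self._active += 1
                self.peak = max(self.peak, self._active)
            try:
                return job(*args)
            finally:
                with self._lock:
                    self._active -= 1


def fake_io_job(job_id: int) -> int:
    """Stand-in for an alignment step that hammers the file system."""
    time.sleep(0.01)
    return job_id


def run_batch(n_jobs: int, max_io_jobs: int, workers: int = 16):
    """Submit n_jobs with many workers, but let only max_io_jobs do I/O at once."""
    limiter = IoSlotLimiter(max_io_jobs)
    with ThreadPoolExecutor(max_workers=workers) as pool:
        results = list(pool.map(lambda i: limiter.run(fake_io_job, i), range(n_jobs)))
    return results, limiter.peak


if __name__ == "__main__":
    results, peak = run_batch(n_jobs=40, max_io_jobs=4)
    print(f"completed {len(results)} jobs, peak concurrent I/O jobs: {peak}")
```

All jobs still complete, but the number doing I/O at any moment never exceeds the configured cap, which is the property a dedicated job queue provides on the cluster.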

WAN Transport & System Tuning: TGen’s initial sequence data processing pipeline included 

transferring the raw sequenced data via a 100Mb WAN Ethernet link from the sequencer to the HPC 

cluster environment that is located at our off-site data center. Despite upgrading the 100Mb WAN 

Ethernet link to 1Gb, the data transfer time of NFS over TCP over a 12-mile distance was still slow. This was

due to the effect of latency on TCP acknowledgments. Basically, the round trip time for packets meant that

every acknowledgment took upwards of 4.5 ms to arrive, resulting in a fairly substantial

delay between each window of data. In order to mitigate this, we fine-tuned Linux kernel network parameters, such

as TCP_Window_Size. We used open source tools such as iperf [3] to test the effects of kernel tuning 

which showed dramatic increases in throughput. However, the performance of data transfer over NFS 

was still unsatisfactory. Due to the variety and number of hosts that required connections across the 

Ethernet link, performing individual kernel tuning on each host was impractical. The solution to the data 

transfer issues was to perform the NFS mounts over UDP. This introduced the new issue of silent data

corruption because UDP lacks TCP’s reliable, ordered delivery guarantees. This meant that MD5 checksums had to be

generated for data files being transferred to ensure data integrity. The key lesson learned was that careful

attention should be paid to performance tuning measures. There is a lot of benefit to be gained by taking 

the time to understand and optimize system parameters. Doing so may reduce costs associated with 

unnecessary bandwidth upgrades that may not deliver the expected performance improvement. 
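Two calculations underlie this lesson, and both can be sketched briefly. The first shows why a faster link alone did not help: a single TCP stream can send at most one window of data per round trip. The second is the MD5 check used to confirm that a transferred file arrived intact. The sketch treats the 4.5 ms figure quoted above as the round trip time and assumes a 64 KiB window purely for illustration; the function names are hypothetical.

```python
import hashlib


def bandwidth_delay_product_bytes(link_bps: float, rtt_s: float) -> float:
    """Bytes that must be in flight to keep a link of link_bps busy at a given RTT."""
    return link_bps * rtt_s / 8


def window_limited_throughput_bps(window_bytes: int, rtt_s: float) -> float:
    """Upper bound on single-stream TCP throughput: one window per round trip."""
    return window_bytes * 8 / rtt_s


def md5_of_file(path: str, chunk_size: int = 1 << 20) -> str:
    """Hex MD5 digest of a file, read in chunks so large files fit in memory."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


if __name__ == "__main__":
    rtt_s = 0.0045            # 4.5 ms, the delay quoted in the text
    window = 64 * 1024        # assumed window size, for illustration only
    bdp = bandwidth_delay_product_bytes(1e9, rtt_s)
    cap = window_limited_throughput_bps(window, rtt_s)
    print(f"bytes in flight needed to fill 1 Gb/s: {bdp:,.0f}")
    print(f"throughput cap with a 64 KiB window: {cap / 1e6:.1f} Mb/s")
```

Under these assumptions, a 1 Gb/s link needs roughly 560 KB in flight to stay full, while a 64 KiB window caps a single stream at about 117 Mb/s, so upgrading from 100Mb to 1Gb changes little until the window is enlarged. For integrity checking, the digest computed at the source with `md5_of_file` is compared with the digest computed at the destination.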

LAN Data Transport Capacity: Moving data off of the sequencers to storage and computational 

resources became a very time consuming task. Having multiple sequencers producing and transporting 

data simultaneously quickly overwhelmed 1Gb LAN segments. Fortunately, TGen had previously invested

in 10Gb core network components, enabling us to extend 10Gb networking to key systems and resources

in the data processing pipeline, thus eliminating bottlenecks on the LAN. As a result, we learned or validated

the importance of fully exploiting the capabilities of the infrastructure available and the importance of 

having a flexible network architecture. 

Internet Data Transport & Collaboration: As TGen began to exchange sequenced data with external 

collaborators, it became immediately apparent that traditional file transfer methods such as FTP would 

not be practical as the data sets were simply too large and the transfer times were not acceptable. This 

problem could not be addressed by simply increasing bandwidth as TGen has no control over the 

bandwidth available at collaboration sites. Internet latency issues became magnified when attempting to 

transfer large data sets. This project required TGen to receive sequenced data from other organizations, 

perform analysis, and make the results available to the other organizations. After researching various 

approaches and exchanging ideas with others at the Networld Interop conference, TGen chose to 

implement the Aspera FASP file transfer product. Aspera enabled scientists to send and receive data at 

an acceptable rate, and enhanced TGen’s ability to participate in collaborative research projects involving 

NextGen sequencing. Lesson: actively seek out best practices and leverage the experiences of others in

your industry. Participating in user groups and other industry related forums can reduce the time it takes 

to identify and implement significant improvements to your infrastructure or workflow. 

Data Management: The sheer volume of NextGen sequencing data had an immediate and significant 

impact on our file management and backup infrastructure and methods. Scientists were initially hesitant 

to delete even raw image data until they were comfortable with the process of regenerating the 

information. This resulted in scientists keeping multiple versions of large data which quickly consumed 

backup and storage capacity. TGen’s IT department worked collaboratively with the scientific community 

to optimize data management methods. This involved achieving consensus on what is “essential data”, 

defining standard naming conventions, and establishing mutually agreed upon rules regarding the

location and retention of key data. Specifically, IT took the following steps to improve the data 

management process and accelerate the scientific workflow: 

• Dedicated NFS storage for raw reads, attached to back-up tape library 

• Dedicated NFS storage for all results, attached to back-up tape library 

• Automated backup process for “key” files 

• User education on how to mount/unmount the storage space 

• Configured Aspera server to read directly from designated NFS mount points eliminating 

unnecessary data moves 

• Weekly cron jobs for monitoring and informing users about storage resource capacity 

• Automated monitoring of user jobs utilizing the HPC Cluster 

• Established a SharePoint based web portal to share NextGen project related information 

These changes had to be synchronized and communicated across multiple scientific divisions as well as 

within the IT department. The end result was a more streamlined scientific workflow, improved data

management environment and reduced impact on the storage, backup and network infrastructure. 

Lesson: be flexible with regard to data management procedures and the supporting infrastructure. Rapidly

advancing technologies such as NextGen sequencing can render your current methods obsolete and you 

must be willing to make dramatic changes in response to the needs of the scientific community and the 

demands of the technology. 
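The weekly capacity monitoring listed among the steps above could be approximated with a short script run from cron. This is a minimal sketch rather than TGen's actual job: the 85% warning threshold, the function names and the report wording are assumptions, and delivering the report to users (for example by e-mail) is left out.

```python
import shutil


def format_report(mount_point: str, total: int, used: int, free: int,
                  warn_fraction: float = 0.85) -> str:
    """Build a one-line status message from raw byte counts."""
    if total <= 0:
        return f"UNKNOWN: {mount_point} reports no capacity"
    used_fraction = used / total
    status = "WARNING" if used_fraction >= warn_fraction else "OK"
    return f"{status}: {mount_point} is {used_fraction:.0%} full, {free / 1e12:.2f} TB free"


def capacity_report(mount_point: str, warn_fraction: float = 0.85) -> str:
    """Query the file system behind mount_point and describe how full it is."""
    usage = shutil.disk_usage(mount_point)
    return format_report(mount_point, usage.total, usage.used, usage.free, warn_fraction)


if __name__ == "__main__":
    # A hypothetical weekly crontab entry might look like:
    #   0 8 * * 1  python3 /path/to/capacity_report.py
    # The current directory is used here only so the sketch runs anywhere.
    print(capacity_report("."))
```

Separating the formatting from the file-system query keeps the threshold logic easy to check without needing a nearly full volume.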

Benchmarking: 

Alignment of billions of reads to reference genomes is computationally expensive. An effort was initiated 

to benchmark sequence alignment tools. TGen’s IT team was actively involved in this process by 

providing several performance measurement and tuning tools and creating automated scripts for 

collecting data about computing resource utilization associated with six popular sequence alignment 

programs. IT used performance measurement tools for cluster computing environments to benchmark 

the speed, CPU utilization and input-output bandwidth needed by each program. This information is now

being used for selecting the best tool for various projects and planning the resource requirements for

future NextGen sequencing projects. Lesson: time spent benchmarking can provide significant benefit in

terms of reducing the cost and effort associated with the “trial & error” approach to selecting and using

complex technology such as sequence alignment tools.
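A very small version of such a measurement script is sketched below. It is not the tooling TGen used: it records only wall-clock time and CPU time for one command on a single host, does not measure I/O bandwidth, and relies on child CPU accounting that is meaningful on Unix-like systems. The function and class names are hypothetical.

```python
import os
import subprocess
import sys
import time
from dataclasses import dataclass
from typing import Sequence


@dataclass
class RunStats:
    wall_s: float     # elapsed wall-clock seconds
    cpu_s: float      # user + system CPU seconds consumed by the child process
    returncode: int   # exit status of the command


def benchmark_command(argv: Sequence[str]) -> RunStats:
    """Run a command to completion and report its wall-clock and CPU time."""
    before = os.times()
    start = time.perf_counter()
    completed = subprocess.run(
        list(argv), stdout=subprocess.DEVNULL, stderr=subprocess.DEVNULL, check=False
    )
    wall_s = time.perf_counter() - start
    after = os.times()
    # Child CPU time is credited to this process once the child has been waited for.
    cpu_s = (after.children_user - before.children_user) + (
        after.children_system - before.children_system
    )
    return RunStats(wall_s=wall_s, cpu_s=cpu_s, returncode=completed.returncode)


if __name__ == "__main__":
    # Stand-in workload; in practice argv would be an aligner's command line.
    stats = benchmark_command([sys.executable, "-c", "sum(range(10**6))"])
    print(f"wall {stats.wall_s:.3f} s, cpu {stats.cpu_s:.3f} s, exit {stats.returncode}")
```

Running the same input through each candidate tool and comparing the resulting figures gives a first, rough basis for the kind of tool selection and resource planning described here.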

[Results] 

Key Technologies & Supporting Methodologies 

The TGen High Performance Bio-Computing Center (HPBC) manages a diverse collection of HPC 

systems, storage and networking resources, including two large supercomputers. The first supercomputer 

is called Saguaro2, and is a Dell Linux cluster. This system consists of ~4000 Intel x86-64 processor 

cores, with 2 GB RAM per core. This system has a shared parallel 250 TB (Lustre) file system that allows 

massive amounts of concurrent input/output operations spread across many compute nodes. This system 

is very effective at running thousands of concurrent discrete processing jobs, or at running very large 

parallel processing workloads. This large HPC cluster system is installed at the Arizona State University 

campus in the Fulton High Performance Computing Initiative (HPCI) center and was funded via NIH grant 

S10 RR25056-01.

Figure 2 Saguaro2 supercomputer 

In addition to the Saguaro2 cluster system, TGen also has a large memory Symmetric Multi-Processor 

(SMP) system available. This system is an SGI Altix 4700 consisting of 48 Intel IA-64 cores and 576 GB

of globally shared memory. The SGI system is well suited for solving memory intensive problems, or 

algorithms that are not easily parallelized. With the resources available on this system, it can run several 

concurrent memory intensive jobs, without having a performance penalty inflicted due to the architecture 

of both the processors and the I/O backplanes on this system. This system was funded via NIH Grant 

S10 RR023390-01. 

Updated NextGen sequencing workflow: 

Learning from the experience and systematically identifying the resource requirements at various stages 

of the NextGen data analysis and transfer, TGen developed and installed a significantly improved 

NextGen sequencing data processing pipeline (Figures 3 & 4). The updated data processing pipeline

utilizes several customized scripts, tailored to the software implementation underlying various data

analysis tools, which improve the effectiveness of using HPC for analyses. By

identifying the critical files at various stages, redundancy of storage has been minimized and policies

have been established to delete intermediate files automatically after a fixed time. Several compute

systems have been dedicated to local data processing, such as annotation and parsing. Involving PIs in 

the infrastructure design process and educating their research staff has helped significantly in creating a 

team of proficient and more mindful users of the data processing pipeline.
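A retention policy of the kind described above, in which intermediate files are deleted automatically after a fixed time, can be sketched as follows. The file suffixes, the retention period and the function names are hypothetical, since the text does not state them, and the sketch defaults to a dry run so that nothing is removed unless explicitly requested.

```python
import time
from pathlib import Path
from typing import Iterable, List, Optional


def expired_files(root: str, suffixes: Iterable[str], max_age_days: float,
                  now: Optional[float] = None) -> List[Path]:
    """List files under root with a matching suffix whose mtime is older than the cutoff."""
    wanted = set(suffixes)
    current = time.time() if now is None else now
    cutoff = current - max_age_days * 86400
    found = [
        path
        for path in Path(root).rglob("*")
        if path.is_file() and path.suffix in wanted and path.stat().st_mtime < cutoff
    ]
    return sorted(found)


def purge_intermediates(root: str, suffixes: Iterable[str], max_age_days: float,
                        dry_run: bool = True) -> List[Path]:
    """Return the expired files and, unless dry_run is set, delete them."""
    targets = expired_files(root, suffixes, max_age_days)
    if not dry_run:
        for path in targets:
            path.unlink()
    return targets


if __name__ == "__main__":
    # Hypothetical policy: report ".tmp" files older than 30 days under the current directory.
    for path in purge_intermediates(".", {".tmp"}, max_age_days=30):
        print(f"would delete {path}")
```

Restricting the sweep to an explicit list of suffixes is one way to keep the agreed "critical files" out of reach of the automatic cleanup.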

Figure 3 Scientific data workflow (Feb. 2010) 

Scalable storage 

The dedicated storage capacity for NextGen projects has increased from ~80TB to over 200TB with a 

new scalable Isilon storage system with a single name space file system. This system provides robust 

performance, redundancy and scalability. Being able to manage the very large amounts of storage

required to support biomedical research with minimal IT support allows researchers to

concentrate on their research and IT to concentrate on building better IT infrastructure in support of

scientific programs. The Isilon system uses a modular architecture and symmetric clustered file system so

that tasks such as adding storage to the storage cluster are as simple as plugging in additional

storage arrays. This helps to minimize costs while providing a solution that can grow as the data storage

requirements continue to increase. 

Backup optimization 

In addition to the Isilon storage system, TGen used Ocarina storage optimization appliances to compress 

data before backup, saving considerable overhead on the backup systems. This makes it feasible to 

back up more of the sequencing data.

File sharing 

File sharing with external collaborators and other partners is accomplished using the Aspera FASP file 

transfer technology. This technology allows optimal use of the network bandwidth to achieve high 

throughput file transfer across the Internet.

[ROI Expected or Achieved] 

Figure 4 March 2010 Nextgen sequencing infrastructure pipeline 

Highly scalable IT infrastructure supporting high-throughput NextGen sequencing data processing and 

analysis 

1. High-speed, shared file transfer infrastructure that enables TGen scientists to participate in large-scale

collaborations involving NextGen sequencing-based research

2. Improved data management procedures resulting in a more cost effective use of storage and 

other infrastructure resources 

3. Efficient scientific data processing workflow including computational tools that can be leveraged 

to expedite research 

4. Robust HPC infrastructure that is capable of supporting large-scale NextGen sequencing projects 

As a result of the above benefits, TGen is better positioned to compete in large-scale grants and 

contracts involving NextGen sequencing technology. 

[Conclusions] 

In spite of resource limitations, infrastructure constraints and a relatively short time to carry out the large-scale

sequencing data analysis, TGen has successfully aligned approximately 270 Giga-bases out of 550

Giga-bases processed against the human genome. 

Throughout 2009, several new research groups at TGen incorporated NextGen sequencing technologies 

into their research; consequently, the number of bioinformatics personnel carrying out NextGen data

analysis is increasing. Concurrently, the number of sequencers at TGen has gone from two to seven (two 

SOLEXA and five SOLiD). TGen expects to add six more SOLiD sequencers in early 2010. The 

throughput of each sequencer at TGen has more than doubled relative to early 2009 and this trend is 

expected to continue or even accelerate. Large volumes of data generated by external collaborators and

industrial partners are being processed at TGen. The increase in throughput and data volume 

necessitates scalable storage, HPC, and high-bandwidth network connectivity to store, manage and 

process sequencing data. These challenges will continue to provide opportunities for IT to play an 

increasingly important role in scientific research. 

[REFERENCES] 

[1] Saguaro supercomputer (http://www.top500.org/system/9789) 

[2] Lustre Parallel File system (http://www.oracle.com/us/products/servers‐storage/storage/storagesoftware/031855.htm) 

[3] iperf (http://sourceforge.net)
