
2010 Best Practices Competition IT & Informatics HPC

IT Informatics - Cambridge Healthtech Institute


the cluster grind to a halt. The key is to actively manage and schedule computational jobs in such a way as to prevent select jobs from overwhelming the system and impacting other jobs.
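The write-up does not name the scheduler or the specific policy used. As a minimal sketch of the idea, assuming a PBS/Torque-style batch system where qsub submits jobs and qstat lists them, a submission wrapper can hold new work back until a user's in-flight job count drops below a cap:

```python
import subprocess
import time

MAX_RUNNING = 20  # hypothetical per-user cap; a real policy would be site-specific


def running_jobs(user):
    """Count this user's queued/running jobs via qstat (PBS/Torque-style output)."""
    out = subprocess.run(["qstat", "-u", user],
                         capture_output=True, text=True).stdout
    # qstat prints header lines first; job lines start with a numeric job id.
    return sum(1 for line in out.splitlines() if line[:1].isdigit())


def throttled_submit(script, user):
    """Hold a submission until the user's in-flight count drops below the cap."""
    while running_jobs(user) >= MAX_RUNNING:
        time.sleep(60)  # back off rather than flooding the scheduler
    subprocess.run(["qsub", script], check=True)
```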

WAN Transport & System Tuning: TGen's initial sequence data processing pipeline included transferring the raw sequenced data via a 100Mb WAN Ethernet link from the sequencer to the HPC cluster environment located at our off-site data center. Despite upgrading the 100Mb WAN Ethernet link to 1Gb, NFS transfers over TCP across the 12-mile distance remained slow. This was due to the combined effect of latency and TCP checksums: the round-trip time for packets meant that every checksum verification took upwards of 4.5 ms to complete, introducing a substantial delay between frames. To mitigate this, we fine-tuned Linux kernel network parameters such as TCP_Window_Size. We used open source tools such as iperf [3] to test the effects of the kernel tuning, which showed dramatic increases in throughput.
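The window sizes involved follow from the bandwidth-delay product: the link can only be kept full if the sender is allowed that many unacknowledged bytes in flight. A quick calculation with the figures above (a 1Gb link and a roughly 4.5 ms round trip) shows why an untuned window throttles throughput; the 64 KiB default below is an assumption typical of untuned systems, not a figure from the text:

```python
# Bandwidth-delay product: bytes that must be "in flight" to fill the link.
link_bps = 1_000_000_000      # 1 Gb WAN link, from the text
rtt_s = 0.0045                # ~4.5 ms round trip, from the text

bdp_bytes = (link_bps / 8) * rtt_s
print(f"BDP: {bdp_bytes / 1024:.0f} KiB")            # ~549 KiB

# With a window smaller than the BDP, throughput is capped at window / RTT.
default_window = 64 * 1024    # assumed untuned window; real defaults vary
max_bps = default_window / rtt_s * 8
print(f"Capped at: {max_bps / 1e6:.0f} Mb/s")        # ~117 Mb/s on a 1000 Mb link
```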

However, the performance of data transfer over NFS was still unsatisfactory. Because of the variety and number of hosts that required connections across the Ethernet link, performing individual kernel tuning on each host was impractical. The solution to the data transfer issues was to mount NFS over UDP. This introduced a new issue, silent data corruption, because UDP lacks TCP's acknowledged, checksum-verified delivery, so damaged transfers can pass unnoticed. This meant that MD5 checksums had to be generated for data files being transferred to ensure data integrity.
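The text does not say which checksumming tool was used. A minimal sketch of the generate-and-verify step using Python's standard hashlib, with illustrative function names:

```python
import hashlib


def md5sum(path, chunk_size=1 << 20):
    """Stream a file through MD5 so multi-gigabyte runs need not fit in RAM."""
    digest = hashlib.md5()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def verify(path, expected_hex):
    """Re-compute on the receiving side after the UDP transfer; retransmit on mismatch."""
    return md5sum(path) == expected_hex
```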

The key lesson learned was that careful attention should be paid to performance tuning: there is a lot of benefit to be gained by taking the time to understand and optimize system parameters, and doing so may avoid the cost of bandwidth upgrades that would not deliver the expected performance improvement.

LAN Data Transport Capacity: Moving data off the sequencers to storage and computational resources became a very time-consuming task. Having multiple sequencers producing and transporting data simultaneously quickly overwhelmed 1Gb LAN segments. Fortunately, TGen had previously invested in 10Gb core network components, enabling us to extend 10Gb networking to key systems and resources in the data processing pipeline and eliminate the bottlenecks on the LAN. As a result, we learned (or validated) the importance of fully exploiting the capabilities of the available infrastructure and of having a flexible network architecture.
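The run sizes involved are not given in the text, but the arithmetic behind the bottleneck is simple; the data-set sizes and the 70% protocol-efficiency factor below are purely hypothetical:

```python
def transfer_hours(dataset_gb, link_gbps, efficiency=0.7):
    """Rough wall-clock time for one run; efficiency discounts protocol overhead."""
    bits = dataset_gb * 8 * 1e9
    return bits / (link_gbps * 1e9 * efficiency) / 3600


# Hypothetical run sizes; with several sequencers sharing one 1Gb segment,
# the per-run times below compound into a standing backlog.
for gb in (500, 1000):
    print(f"{gb} GB: {transfer_hours(gb, 1):.1f} h at 1 Gb, "
          f"{transfer_hours(gb, 10):.2f} h at 10 Gb")
```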

Internet Data Transport & Collaboration: As TGen began to exchange sequenced data with external collaborators, it became immediately apparent that traditional file transfer methods such as FTP would not be practical: the data sets were simply too large and the transfer times were unacceptable. The problem could not be addressed by simply increasing bandwidth, as TGen has no control over the bandwidth available at collaboration sites, and Internet latency issues became magnified when transferring large data sets. This project required TGen to receive sequenced data from other organizations, perform analysis, and make the results available to those organizations. After researching various approaches and exchanging ideas with others at the Networld Interop conference, TGen chose to implement the Aspera FASP file transfer product. Aspera enabled scientists to send and receive data at an acceptable rate and enhanced TGen's ability to participate in collaborative research projects involving NextGen sequencing. The lesson: actively seek out best practices and leverage the experiences of others in your industry. Participating in user groups and other industry forums can reduce the time it takes to identify and implement significant improvements to your infrastructure or workflow.
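FASP itself is proprietary, so no attempt is made to sketch it here; the calculation below only illustrates why adding bandwidth could not fix FTP, since a single untuned TCP stream is capped at roughly window size divided by round-trip time regardless of link speed (the window and RTT values are illustrative assumptions):

```python
# Single-stream TCP throughput is bounded by window_size / RTT,
# independent of the raw link speed.
window_bytes = 64 * 1024        # a common untuned receive window (assumption)
for rtt_ms in (5, 40, 80):      # metro, cross-country, intercontinental (illustrative)
    cap_mbps = window_bytes * 8 / (rtt_ms / 1000) / 1e6
    print(f"RTT {rtt_ms:3d} ms -> at most {cap_mbps:6.1f} Mb/s per stream")
```

At an 80 ms round trip the ceiling is under 7 Mb/s per stream, which is why latency, not link capacity, dominated Internet transfers.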

Data Management: The sheer volume of NextGen sequencing data had an immediate and significant impact on our file management and backup infrastructure and methods. Scientists were initially hesitant to delete even raw image data until they were comfortable with the process of regenerating the information. As a result, scientists kept multiple versions of large data sets, which quickly consumed backup and storage capacity. TGen's IT department worked collaboratively with the scientific community to optimize data management methods. This involved achieving consensus on what constitutes "essential data", defining standard naming conventions, and establishing mutually agreed-upon rules regarding the
