2010 Best Practices Competition IT & Informatics HPC
IT Informatics - Cambridge Healthtech Institute
the cluster grind to a halt. The key is to actively manage and schedule computational jobs in such a way
as to prevent select jobs from overwhelming the system and impacting other jobs.

WAN Transport & System Tuning: TGen’s initial sequence data processing pipeline included
transferring the raw sequenced data via a 100 Mb/s WAN Ethernet link from the sequencer to the HPC
cluster environment located at our off-site data center. Despite upgrading the 100 Mb/s WAN
Ethernet link to 1 Gb/s, data transfer with NFS over TCP across the 12-mile distance was still slow. This was
due to the effect of latency on TCP's acknowledgment and windowing behavior. A sender can keep only one
window of unacknowledged data in flight, and each acknowledgment took upwards of 4.5 ms to make the round
trip, so a small window left the link idle for much of the time. To mitigate this, we tuned Linux kernel
network parameters, such as the TCP window size. We used open source tools such as iperf [3] to test the effects of kernel tuning,
which showed dramatic increases in throughput. However, the performance of data transfer over NFS
was still unsatisfactory. Because of the number and variety of hosts that required connections across the
Ethernet link, tuning the kernel on each host individually was impractical. The solution to the data
transfer issues was to run the NFS mounts over UDP. This introduced a new risk of silent data
corruption, because UDP provides no reliable-delivery guarantees and only a weak (and, over IPv4, optional)
checksum. MD5 checksums therefore had to be generated for the data files being transferred and verified
after transfer to ensure data integrity. The key lesson learned was that careful
attention should be paid to performance tuning measures. There is much to be gained by taking
the time to understand and optimize system parameters. Doing so may avoid the cost of
bandwidth upgrades that do not deliver the expected performance improvement.
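
As a rough illustration of why the bandwidth upgrade alone did not help, and of the MD5 verification step, the Python sketch below works through the window arithmetic and hashes a file in chunks. The 4.5 ms round trip and the 1 Gb/s link speed come from the text above. The 64 KiB window is an assumed untuned value chosen for illustration, and the function names are hypothetical; none of this is TGen's actual tooling.

```python
import hashlib


def bandwidth_delay_product_bytes(link_bps: float, rtt_s: float) -> float:
    """Bytes that must be in flight to keep a link of link_bps fully used."""
    return link_bps * rtt_s / 8


def window_limited_throughput_bps(window_bytes: int, rtt_s: float) -> float:
    """Upper bound for one TCP stream: at most one window per round trip."""
    return window_bytes * 8 / rtt_s


def md5_of_file(path: str, chunk_size: int = 1 << 20) -> str:
    """Return the hex MD5 digest of a file, read in chunks to bound memory use."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


if __name__ == "__main__":
    rtt_s = 0.0045            # 4.5 ms round trip (from the text)
    link_bps = 1_000_000_000  # 1 Gb/s WAN link (from the text)
    window_bytes = 64 * 1024  # ASSUMPTION: untuned 64 KiB window, not from the source

    ceiling = window_limited_throughput_bps(window_bytes, rtt_s)
    needed = bandwidth_delay_product_bytes(link_bps, rtt_s)
    print(f"Single-stream ceiling: {ceiling / 1e6:.1f} Mb/s")
    print(f"Window needed to fill the link: {needed:.0f} bytes")
```

Under these assumptions a single stream tops out at roughly 117 Mb/s regardless of link speed, while filling the 1 Gb/s link needs about 562,500 bytes in flight. For the integrity check, `md5_of_file` would be run on the source file before transfer and on the copy afterwards, and the two digests compared.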

LAN Data Transport Capacity: Moving data off the sequencers to storage and computational
resources became a very time-consuming task. Multiple sequencers producing and transporting
data simultaneously quickly overwhelmed 1 Gb/s LAN segments. Fortunately, TGen had previously invested
in 10 Gb/s core network components, which enabled us to extend 10 Gb/s networking to key systems and resources
in the data processing pipeline and so eliminate bottlenecks on the LAN. This validated
the importance of fully exploiting the capabilities of the available infrastructure and of
having a flexible network architecture.

Internet Data Transport & Collaboration: As TGen began to exchange sequenced data with external
collaborators, it became immediately apparent that traditional file transfer methods such as FTP would
not be practical: the data sets were simply too large and the transfer times were not acceptable. This
problem could not be addressed by simply increasing bandwidth, as TGen has no control over the
bandwidth available at collaboration sites. Internet latency issues were magnified when attempting to
transfer large data sets. This project required TGen to receive sequenced data from other organizations,
perform analysis, and make the results available to those organizations. After researching various
approaches and exchanging ideas with others at the Networld Interop conference, TGen chose to
implement the Aspera FASP file transfer product. Aspera enabled scientists to send and receive data at
an acceptable rate, and enhanced TGen’s ability to participate in collaborative research projects involving
NextGen sequencing. The lesson: actively seek out best practices and leverage the experiences of others in
your industry. Participating in user groups and other industry-related forums can reduce the time it takes
to identify and implement significant improvements to your infrastructure or workflow.

Data Management: The sheer volume of NextGen sequencing data had an immediate and significant
impact on our file management and backup infrastructure and methods. Scientists were initially hesitant
to delete even raw image data until they were comfortable with the process of regenerating the
information. As a result, scientists kept multiple versions of large data sets, which quickly consumed
backup and storage capacity. TGen’s IT department worked collaboratively with the scientific community
to optimize data management methods. This involved achieving consensus on what constitutes “essential data”,
defining standard naming conventions, and establishing mutually agreed-upon rules regarding the