ScientificComputing.com
Technologies for Science and Engineering
ISC’15 Special Edition
Accelerating Scientific Discovery INFORMATICS If These Walls Could Talk: The IoT Linking an Instrument to a Tablet MHRA Data Integrity Guidance for Industry
HIGH PERFORMANCE COMPUTING ISC’15 Preview Optimizing Workflows in Globally Distributed, Heterogeneous Environments Thoughts on the Exascale Race
DATA ANALYSIS Partek Genomics Suite 6.6 Review Survival Models
Contents INFORMATICS 6 If These Walls Could Talk The Internet of Things will save lives, time and treasure
8 Linking an Instrument to a Tablet Still a bridge too far?
9 Review and Critique of the MHRA Data Integrity Guidance for Industry
HIGH PERFORMANCE COMPUTING 11 Advanced Computation Plays Key Role in Accelerating Life Sciences Research ISC’15 Preview
18 Optimizing Workflows in Globally Distributed, Heterogeneous HPC Computing Environments This complex task requires significant software support
20 Thoughts on the Exascale Race HPC has become a mature market
DATA ANALYSIS 22 Software Review: Partek Genomics Suite 6.6 Combining easy-to-use statistics with interactive graphics
25 Survival Models An important technique employed in medical and engineering sciences
EDITORIAL
Understanding Health, Disease and the Human Brain
As the start of the 2015 ISC High Performance conference approaches, Scientific Computing is excited to present a sneak peek at two life sciences sessions taking place this year in Frankfurt. In our cover story, “Advanced Computation Plays Key Role in Accelerating Life Sciences Research,” Manuel Peitsch and Thomas Lippert provide exclusive previews of the “Supercomputing and the Human Brain Project — a 10-year Quest” and “Understanding Health and Disease” sessions — two excellent examples of the trend toward increased reliance on advanced computation to accelerate research. Also in this issue, our expert contributors share their insights on topics ranging from the Internet of Things, to the MHRA Data Integrity Guidance for Industry, to optimizing workflows in globally distributed, heterogeneous HPC computing environments. In “If These Walls Could Talk,” William Weaver, an associate professor in the Department of Integrated Science, Business, and Technology at La Salle University, talks about how the future Internet of Things will save lives, time and treasure. Is linking an instrument to a tablet still a bridge too far? Peter Boogaard, founder of Industrial Lab Automation and chairman of the Paperless Lab Academy, brings us up to speed on challenges faced in today’s laboratory, while R.D. McDowall provides a review and critique of the MHRA Data Integrity Guidance for Industry.
The complex task of “Optimizing Workflows in Globally Distributed, Heterogeneous HPC Computing Environments” requires significant software support, and HPC expert Rob Farber walks us through it. Meanwhile, as the HPC community hurtles toward the exascale era, Steve Conway, Research VP, HPC at IDC, shares his “Thoughts on the Exascale Race.” John Wass, a statistician based in Chicago, IL, reviews Partek Genomics Suite 6.6, and Mark Anawis, a Principal Scientist and ASQ Six Sigma Black Belt at Abbott, talks about “Survival Models” — an important technique employed in medical and engineering sciences. Finally, a special four-page insert showcases “The Big 5 at ISC High Performance 2015” and features highlights from this year’s program. As an added resource, Scientific Computing’s ISC Web page offers a one-stop destination featuring comprehensive information on all things ISC. The page is specifically designed to help you quickly find everything you need, and to arm you with useful resources. Be sure to check it out at http://www.scientificcomputing.com/ISC. Enjoy the issue!

Suzanne Tracy, Editor in Chief
editor@ScientificComputing.com
General Manager: David A. Madonia, 973-920-7048, david.madonia@advantagemedia.com
Editorial Director: Bea Riemschnider, bea.riemschnieder@advantagemedia.com
Editor in Chief: Suzanne Tracy, suzanne.tracy@advantagemedia.com
Contributing Editors: Mark Anawis, Peter Boogaard, Steve Conway, Michael Elliott, Rob Farber, Wolfgang Gentzsch, Ph.D., John R. Joyce, Ph.D., Mike Martin, Bob McDowall, Ph.D., Siri Segalstad, Caitlin Smith, Ph.D., John A. Wass, Ph.D., Bill Weaver, Ph.D.
editor@ScientificComputing.com
EDITORIAL BOARD
Steve Conway, Research Vice President, HPC, IDC
Michael H. Elliott, President, Atrium Research
John R. Joyce, Ph.D., Laboratory Informatics Specialist
Horst D. Simon, Ph.D., Deputy Director, Lawrence Berkeley National Laboratory
John A. Wass, Ph.D., Statistician/Consultant
INDUSTRY ADVISORY COMMITTEE
Sean Fitzgerald, VP of Technology, Visual Numerics
Stephen Fried, President, Microway
Randy C. Hice, Director of Strategic Consulting, STARLIMS
Joe Przechocki, Product Marketing Manager, OriginLab
Wayne Verost, President, QSI
NORTH AMERICAN SALES OFFICE
NEW ENGLAND: Luann Kulbashian, 973-920-7685, luann.kulbashian@advantagemedia.com
MID-ATLANTIC: Joy DeStories, 973-920-7112, joy.destories@advantagemedia.com
MID-ATLANTIC: Traci Marotta, 973-920-7182, traci.marotta@advantagemedia.com
MID-ATLANTIC: Greg Renaud, 973-920-7189, greg.renaud@advantagemedia.com
MIDWEST: Tim Kasperovich, 973-920-7192, tim.kasperovich@advantagemedia.com
MIDWEST: Jolly Patel, 973-920-7743, jolly.patel@advantagemedia.com
WEST: Fred Ghilino, 973-920-7163, fred.ghilino@advantagemedia.com
Reprints: The YGS Group, 1-800-290-5460, reprints@theygsgroup.com
List Rentals: INFOGROUP TARGETING SOLUTIONS
Senior Account Manager: Bart Piccirillo, 402-836-6283, bart.piccirillo@infogroup.com
Senior Account Manager: Michael Costantino, 402-863-6266, michael.costantino@infogroup.com
For subscription-related matters, please contact ABM@omeda.com, phone 847-559-7560, or fax requests to 847-291-4816
100 Enterprise Drive, Suite 600, Rockaway, NJ 07866-0912
1-973-920-7535, Fax: 1-973-920-7542
Chief Executive Officer: Jim Lonergan
Chief Operating Officer/Chief Financial Officer: Theresa Freeburg
Chief Content Officer: Beth Campbell
All Things ISC Introducing Scientific Computing’s New International Supercomputing Resource Site Scientific Computing’s ISC Conference resource site is a one-stop destination that offers comprehensive information on all things ISC, collected together in one place, making it easy to locate resources on topics such as:
• Everything You Need to Know About Supercomputers • Usable Quantum Computing • HPC in the Life Sciences • The HPCAC-ISC Student Cluster Competition • Return on Investment and Market Competitiveness • International Collaboration • Industry Innovation through HPC • The Human Brain Project • Extreme Computing Challenges • Emerging Trends for Big Data Updates will include the latest news on what’s happening at this year’s International Supercomputing events, articles and blogs authored by technology experts, conference videos, profiles of keynote speakers and ISC fellows, and more.
Visit www.scientificcomputing.com/ISC today.
INFORMATICS
If These Walls Could Talk The Internet of Things will save lives, time and treasure
William Weaver, Ph.D.
On April 15, 2014, Charles Allen, Jr., a Wilmington, DE, business owner, was driving home on the I-495 bypass, as he had done for 25 years. While traveling on the twin three-lane bridges that span the Christiana River, he noticed that the normally parallel bridges were offset by nearly 18 inches in height and the apparent listing of one of the lanes had created a large gap between the bridges through which the ground was visible, some four stories below. Unable to reach the Delaware Department of Transportation (DelDOT) after business hours, he called the 911 dispatcher to alert officials to the problem. The bridges continued to carry over 90,000 vehicles per day until R. David Charles, a geotechnical engineer with Duffield Associates, sent photos of the tilted bridge supports to DelDOT on May 29, 2014. Bridge inspectors followed up on this report four days later on June 2, 2014, and immediately closed the bridges to traffic. On June 6, 2014, U.S. Senator Chris Coons, D-Delaware, visited the closed bridge and commended DelDOT Secretary Shailen Bhatt for the “prompt and effective and responsible action” his office took to address the problem, some 48 days after Allen’s initial 911 emergency call. After affixing electronic tilt sensors to the bridge supports, engineers were able to determine that a 55,000-ton stockpile of topsoil placed near the bridge had cracked eight footings and caused four pairs of supports to list as much as four degrees from vertical. Had low-power, wireless tilt sensors been attached a season earlier, the damage to the bridge could have been detected in real time and repairs initiated before conditions deteriorated. Earlier, in March of 2014, Malaysia Airlines Flight 370 was lost, and our thoughts go out to all those involved. The Boeing 777 was outfitted with twin Rolls-Royce Trent 800 engines that were equipped with sensors that periodically report real-time
operating conditions through the Aircraft Communications Addressing and Reporting System (ACARS) via satellite to the manufacturer’s Global Engine Health Monitoring Center in Derby, UK. Designed to alert ground-based mechanics of engine maintenance requests upon landing, the ACARS data was used to determine the flight path of the missing airplane for several hours after communication with the flight crew was lost. As connected monitoring devices continue to seep into consumer products that are less costly than jet engines and interstate bridges, the term “Internet of Things” begins to apply. Coined in 1999 by Kevin Ashton while serving as the Executive Director of the Auto-ID Center, an RFID technology development effort of the Massachusetts Institute of Technology, the IoT can be deployed inexpensively and with immense return on investment (ROI). Unlike the industrial Supervisory Control and Data Acquisition (SCADA) systems that involve humans in the information and analysis loop, the IoT merges existing low-cost sensors, protocols and network technologies with automated analysis and cloud-based machine learning. While the sensor data is useful for monitoring long-term trends, real-time value is realized by anticipating maintenance and avoiding equipment failure, as evidenced by the use of ACARS since 1978. One new device in the IoT space was funded by a 2013 Kickstarter campaign that raised 280 percent of the funding requested to develop a “home intelligence” sensor known as “Neurio.” This harmonica-sized device from Vancouver-based Neurio Technology plugs into an electric breaker panel and monitors the voltage, current and frequencies of the entire electric circuit connecting every electrical device in the home. When an appliance is switched on, it adds its unique power “footprint” to the circuit. Neurio
sends these measurements into the “Neurio cloud” via WiFi where they are analyzed to detect the operation of specific appliances, such as heaters, air conditioners and room lights, which themselves do not contain connected sensors. Users can choose to act on Neurio-detected events through apps on their smartphones or Web devices. In addition to serving as an energy usage monitoring and optimization system, Neurio’s public application program interface (API) is allowing developers to use the data to trigger home automation events. Conceivably, this system or one with increased sensitivity could be used to monitor the health and performance of electronic
appliances in the home. In addition to alerting users of a forgotten hot clothing iron, Neurio could report on the decreasing efficiency of its heating element and predict the amount of time remaining before a trip to the local big box store is required. Elected officials of all stripes correctly pontificate on the need to address the declining condition of our nation’s infrastructure. Perhaps, instead of awarding contracts to projects based on their shovel-readiness, the increased deployment of sensors connected by the Internet of Things will not only advise the prioritization of projects but help to avoid the tragic and costly effects of catastrophic failures.
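The appliance “footprint” idea described above can be made concrete with a small example. The sketch below is not Neurio’s algorithm or API; the signature table, tolerance and sample readings are invented for illustration. It only shows how a step change in whole-home power draw might be matched against known appliance signatures.

```python
# Toy illustration of appliance "footprint" detection from whole-home power readings.
# Not Neurio's algorithm or API; all signatures, thresholds and samples are made up.

# Hypothetical steady-state power signatures, in watts
SIGNATURES = {
    "space heater": 1500,
    "clothes iron": 1100,
    "room lights": 60,
}

def detect_events(samples, tolerance=50):
    """Return (timestamp, appliance, 'on'/'off') for step changes matching a signature."""
    events = []
    for (t_prev, w_prev), (t_curr, w_curr) in zip(samples, samples[1:]):
        delta = w_curr - w_prev                      # change in total household load
        for name, watts in SIGNATURES.items():
            if abs(abs(delta) - watts) <= tolerance:
                events.append((t_curr, name, "on" if delta > 0 else "off"))
    return events

if __name__ == "__main__":
    # (timestamp_seconds, total_household_watts) -- invented sample data
    readings = [(0, 320), (1, 325), (2, 1430), (3, 1435), (4, 330)]
    for t, name, state in detect_events(readings):
        print(f"t={t}s: {name} switched {state}")
```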
RESOURCES
• I-495 April 2014 Bridge Closure: www.delawareonline.com/story/news/traffic/2014/06/06/april-call-crazy-emergencybridge/10095489/
• Kevin Ashton: www.howtoflyahorse.com/about-the-author/
• Neurio Technologies, Inc.: www.neurio.io/
• Neurio Installation Video: www.youtube.com/watch?v=ky8vmwRFGuk
William Weaver is an associate professor in the Department of Integrated Science, Business, and Technology at La Salle University. He may be contacted at editor@ScientificComputing.com.
Robot Revolution Explores Visionary World through Cutting-edge Robots by Suzanne Tracy, Editor in Chief “They’re here … to help and improve our lives,” The Museum of Science and Industry, Chicago announces on its Web site. MSI is hosting a new national touring exhibit, Robot Revolution, which explores how robots, created by human ingenuity, will ultimately be our companions and colleagues, changing how we play, live and work together. bit.ly/1HWJ7M2
Prostate Cancer Jungle: Navigating Diagnosis and Treatment Options is Daunting by Randy C. Hice What do Rudolph Giuliani, Robert De Niro, Dennis Hopper, James Brown, Arnold Palmer, Joe Torre, Dan Fogelberg, Colin Powell, John Kerry, Johnny Ramone, Francois Mitterrand, Robert Frost, and Frank Zappa have in common? They have lived with, or died from, prostate cancer. Every 19 minutes, an American man dies from prostate cancer. It is the second leading cause of cancer death among men and is the most common cancer in men... bit.ly/1HWJ6HY
Deriving Real Time Value from Big Data by Pat McGarry, Ryft Systems Everyone has heard the old adage that time is money. In today’s society, business moves at the speed of making a phone call, looking something up online via your cell phone, or posting a tweet. So, when time is money (and can be a lot of money), why are businesses okay with waiting weeks or even months to get valuable information from their data? bit.ly/1FHRgpZ
Software and Moore’s Drumbeat (Moore’s Law) by James Reinders, Intel Moore’s Law recently turned 50 years old, and many have used the milestone to tout its virtues, highlight positive results that stem from it, as well as advance suggestions on what the future dividends will be and boldly project the date for its inevitable demise. Moore’s Law is an observation that has undoubtedly inspired us to innovate to the pace it predicts. It has challenged us to do so. Therefore, I think of it as Moore’s drumbeat. bit.ly/1R15iET
Cost of LIMS: True Pricing includes more than Purchase, Implementation and Annual Licensing by Siri H. Segalstad The real benefit of laboratory information management systems (LIMS) is difficult to calculate. Let’s take a look at some key considerations, starting with the question of whether to build the LIMS yourself or buy a commercial LIMS… Advocates for building a new LIMS themselves usually state that their lab is so unique, they cannot use a commercial LIMS. However, very few labs are truly unique ... bit.ly/1JUff2P
Computerized Systems in the Modern Laboratory: A Practical Guide by John R. Joyce, Ph.D. Scientific Computing periodically features special informatics focus articles that attempt to help you with such complex tasks as selecting a laboratory information management system or interfacing systems together. Unfortunately, there is only a limited amount of information that one can cram into one of these articles. Where we are limited to just a few pages for each of our attempts, Joe Liscouski has written a whole book on the subject.
bit.ly/1cTXKph
INFORMATICS
Linking an Instrument to a Tablet Still a bridge too far?
Peter J. Boogaard
In its simplest form, an electronic laboratory notebook (ELN) can be thought of as an electronic embodiment of what is currently being done in a paper laboratory notebook. It is a tool that facilitates the workflows that play out in your particular laboratory. However, laboratory information management system (LIMS), electronic laboratory notebook and lab execution system (LES) applications all support this basic definition to a greater or lesser extent. The capabilities of all these product categories are blurring. Today’s market perception is that an ELN is being used on a tablet and not on a traditional computer. An interesting detail worth mentioning is that, when the term “electronic lab notebook” was introduced in the late 90s, tablets didn’t exist at all! Many laboratory operations are still predominantly paper-based. Even with the enormous potential to increase data integrity
for compliance and global efficiency gains when working within virtual teams, significant barriers to implementing successful paperless processes still remain. In 2012, Michael Elliott mentioned in his Scientific Computing column1 that “the explosive growth of tablet computing has taken many by surprise” and that “vendors are starting to test the waters.” An interesting statement, since it demonstrates how our industry lacks true innovation. We are masters of finding ways to be different, and we use all our scientific creativity to prove that we are right. So, what could be the reason why, in other industries, the tablet phenomenon was accepted so rapidly, and how can we learn from and adopt it? In recent years, the electronic banking, travel, patient and healthcare, and retail industries have adopted new ways of working and have created
Fully automated internet/Wi-Fi versus slow manual or semi-manual RS232/Bluetooth connectivity
significant benefits for their consumers. At the same time, the impact of this transformation resulted in a 20- to 40-percent cost reduction and improved throughput time.2 Many of these industries also have to comply with regulations, and with safety and security standards. Instead of learning from other industries, adopting best practices and optimizing them, we (suppliers, regulators, industry, etcetera) are all struggling to make the fundamental mindset shift needed to eliminate the barriers that stand in the way of broad adoption of new technologies. Examples…
• When did you last wait in line to check in for a flight at an airport?
• How often have you been able to reserve your favorite seat in advance?
• Have you mailed a check to initiate a financial transaction lately?
• Have you adjusted the way you communicate with your peers and colleagues during the last 10 years?
The bottom line is that we constantly adjust our own processes and mindsets to maximize our personal way of working. So, how much has changed in the way we capture experimental data from instruments in our daily laboratory work?
Review and Critique of the MHRA Data Integrity Guidance for Industry
R.D. McDowall, Ph.D.
This new series of four articles takes a look at the UK’s Medicines and Healthcare products Regulatory Agency (MHRA) guidance for industry on data integrity. The focus of these articles is an interpretation and critique of the second version of the MHRA data integrity guidance for laboratories working to European Union GMP regulations, such as analytical development in R&D and quality control in pharmaceutical manufacturing. In doing so, some of the main differences between the first and second versions of the document will be highlighted and discussed.
Part 1: Overview Topics in this introductory article include the global data integrity problem, drowning in integrity definitions, an overview of the MHRA data integrity guidance (setting the scene), and a discussion of the wrong definition of data and integrity criteria. There is a discussion of “1 in 28” member states of the European Union; why is the UK not co-ordinating with the rest of the European Union? • Read Part 1: http://bit.ly/1eDIehP
Part 2: Data Governance System In the second part, we look at the MHRA requirement for a data governance system — is there a basis for this when interpreting EU GMP Chapter 1? The main elements for a data governance system are presented and discussed — we also compare with an extreme example of a data governance system set up by a company following a consent decree with the FDA. • Read Part 2: http://bit.ly/1eDIj4X
Part 3: Data Criticality and Data Life Cycle This section of the guidance first looks at data generation: data integrity risk and criticality via different ways of generating data from observation to an electronic computerized system using electronic signatures. In addition, we consider the components of a laboratory data lifecycle and look at some of the issues surrounding this. • Read Part 3: http://bit.ly/1FeFLzR
Part 4: System Design, Definitions and Overall Assessment In this final part of the series, we look at the section on system design and discuss the key definitions that constitute the bulk of the guidance document. The series finishes with an overall assessment of the guidance document: the good parts and the poor parts. • Read Part 4: http://bit.ly/1RvwETY
R.D. McDowall is Director, R D McDowall Limited. He may be reached at editor@ScientificComputing.com.
Over 75 percent of a laboratory experiment or analysis starts with some kind of manual process, such as weighing. The majority of the results of these measurements are still written down manually on a piece of paper or re-typed into a computer or tablet. ELN and mobile devices like tablets are married to each other. However, to connect a balance, you need to be an IT professor. Many modern ELN and LES systems do allow electronic connection to a network. However, to integrate simple instruments like pH meters, balances, titrators and Karl Fischer instruments with mobile devices, a simpler approach is required in order to achieve mainstream adoption. Complex protocols, expensive technical consultancy and difficult-to-validate processes are often avoided, resulting in retyping the same information over and over again. While the perception is that we reduce transcription errors, the reality in our daily laboratory activities is often the opposite. In many other industries, LEAN “waste” processes are de facto standard and have been adopted, resulting in significant reduction in what is required to “do it right” the first time, as well as reduction in unnecessary movements of people and parts between processes. It is amazing how mainstream modern cars have adopted modern GPS and wireless technologies. Almost any smartphone from almost any brand can be connected to almost any car manufactured around the world. No need to be an IT professor to connect
these devices. No need to be a computer expert to test and operate these (complex) devices. No need to buy expensive software updates or applications to enjoy music, get the shortest road to your favorite location or dial any phone number via your audio system. To lock your car — wireless and secured!

“Waste is anything other than the minimum amount of equipment, materials, parts and working time which is absolutely essential to add value to the product or service.” — Taiichi Ohno, father of the LEAN quality improvement process

So, why is it that, in our scientific industry, instead of adoring these great new (cheap) technologies, we are finding ways to avoid using them or ignoring them completely? During some of my recent surveys, I learned that new graduates are not familiar with RS232 connectivity. However, this connector still dominates the way laboratory instruments connect to external devices. Explaining that the USB port is the modern replacement and that Bluetooth is the wireless equivalent opened their eyes. During the latest SmartLab Exchange in Philadelphia and Paperless Lab Academy in Barcelona,3 I challenged the audience with this observation and invited suppliers, IT professionals and end users to participate in a trend watch discussion: Why are we not able to adopt well-accepted, low-cost industry standard technologies in our industry? I was shocked that some laboratories are taking photos of a balance display and then using optical character recognition (OCR) to translate the photo into a number on their tablet. What happens if a comma is interpreted as a decimal point? We all know that in the USA and Europe the use of a . (dot) or , (comma) has a different meaning!
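For readers who have never touched RS232, the sketch below shows roughly what capturing a single reading from a serial balance can look like in Python with the pyserial package. The port name, baud rate, command string and response format are all assumptions made for illustration; every vendor defines its own protocol, so the instrument manual (or a middleware product) remains the authoritative source.

```python
# Minimal sketch: read one weight from an RS232 balance using pyserial.
# Port, baud rate, command and response framing are assumptions, not a real
# vendor protocol; consult the instrument manual for the actual command set.
import serial  # pip install pyserial

PORT = "/dev/ttyUSB0"   # e.g. COM3 on Windows; assumed
BAUD = 9600             # a common default, but instrument-specific

def read_weight():
    with serial.Serial(PORT, BAUD, timeout=2) as link:
        link.write(b"P\r\n")          # hypothetical "print current value" command
        raw = link.readline().decode("ascii", errors="replace").strip()
        # Hypothetical response such as "+  12.3456 g"
        value = float(raw.replace("g", "").replace("+", "").strip())
        return value, raw

if __name__ == "__main__":
    weight, raw = read_weight()
    print(f"Balance reported {weight} g (raw: '{raw}')")
```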
The point of no return has been passed. The acceptance of tablets and mobile devices will exponentially expand in the laboratory. The question is: When will we be able to integrate manual processes as seamlessly as we integrate our mobile devices in our personal lives? There is no difference between securely pairing a Bluetooth device to a car or computer and pairing a balance to an ELN application on a tablet. There is no need for high-speed WiFi when we only want to transfer some simple digits from a pH meter to a tablet. There will be no need for expensive consultancy to make simple connections. A plug-and-play standardized protocol, similar to those used elsewhere, is required. To make this happen, a united change of mindset is also required. Finally, cheating is allowed: re-align our scientific industry by taking advantage of what has been learned in other industries and adopting lean processes. I would like to challenge the industry: pairing a balance with a tablet using an ELN application should be as simple as connecting a phone in your car. Some of us may remember the days of connecting a printer in the pre-plug-and-play era…
REFERENCES
1. Michael Elliott, “Tablets and ELN – A Honeymoon,” Scientific Computing, August 2012.
2. McKinsey, ISPE Annual Meeting, Las Vegas, October 2014.
3. Paperless Lab Academy, Barcelona 2016, www.paperlesslabacademy.com
A plug-and-play standardized protocol will simplify processes.
Peter Boogaard is the founder of Industrial Lab Automation and chairman of the Paperless Lab Academy. He may be reached at editor@ScientificComputing.com.
HIGH PERFORMANCE COMPUTING
Advanced Computation Plays Key Role in Accelerating Life Sciences Research
A 3-D model of the human brain, which considers cortical architecture, connectivity, genetics and function. Courtesy of Research Centre Juelich

Life scientists are increasingly reliant on advanced computation to advance their research. Two very prominent examples of this trend will be presented this summer at the ISC High Performance conference, which will feature a five-day technical program focusing on HPC technologies and their application in scientific fields, as well as their adoption in commercial environments. On Tuesday, July 14, Manuel Peitsch, VP of Biological Systems Research at Philip Morris International and Chairman of the Executive Board at The Swiss Institute of Bioinformatics, will chair a session on computational methods used to understand health and disease, while on Wednesday, July 15, Juelich Supercomputing Centre director Thomas Lippert will direct a session on the range of work being done under the EU’s Human Brain Project (HBP). In this article, Manuel Peitsch and Thomas Lippert provide exclusive previews of what each of these sessions will cover.

SUPERCOMPUTING AND THE HUMAN BRAIN PROJECT — A 10-YEAR QUEST
Thomas Lippert, Ph.D.
The goal of the FET Flagship Human Brain Project (HBP), funded by the European Commission, is to develop and leverage information and communication technology fostering a global collaboration on understanding the human brain by using models and data-intensive simulations on supercomputers. These models offer the prospect of a new path for the understanding of the human brain, but also of completely new computing and robotics technologies. Given the opportunity to take advantage of the ISC High Performance conference (ISC 2015) as a discussion platform that will accompany the HBP during its project duration, this year’s HBP session aims at reporting on the first 18 months of this global collaborative effort and at highlighting
some underlying technologies, such as the Web-accessible portal and the information and technology platforms, as well as aspects of the neural networks simulations employed by the project and beyond. We especially look forward to focusing on the design of the high performance computing platform, which will make a major contribution to further improve the simulation of neural networks. Launched in October 2013, the HBP is one of the two European flagship projects foreseen to run for a 10-year period. It aims at creating a European neuroscience-driven infrastructure for simulation and big-data-aided modeling and research. The HBP research infrastructure will be based on a federation of supercomputers contributing to specific requirements of neuroscience in a complementary manner. It will encompass a variety of simulation services created by the HBP, ranging from simulations on a molecular level towards the synaptic and neuronal level, up to cognitive and robotics models.
HIGH PERFORMANCE COMPUTING sively include more European partners and will be made available step-by-step to the pan-European neuroscience community. Based on this, this year’s HBP session is divided thematically into three sections: the HPC platform, simulation activities and neuromorphic computing approaches. It will be complemented by an external speaker from the new Cortical Learning Center (CLC) at IBM Research on approaches that are being pursued outside the project. One of the talks is dedicated to the collaboratory, service-oriented Web portal serving as a central unified access point to the six HBP platforms (neuroinformatics, brain simulation, HPC, medical informatics, neuromorphic and neurorobotics platforms) for the researchers inside the flagship project. Another talk will highlight the UNICORE middleware, which plays an important role in the HPC platform as an interface technology, creating a uniform view of compute and data resources. The design philosophy and the implementation details used both in the creation of the HBP’s HPC platform and for the integration of the platform components will also be addressed. Brain simulation is a main objective of the HBP. The project aims to reconstruct and simulate biologically detailed multilevel models of the brain displaying emergent structures and behaviors at different levels appropriate to the state of current knowledge and data, the computational power available for simulation, and the demands of researchers. A specific talk will focus on two aspects of brain simulation,
the model space of relevant neurosimulations, and how this may lead to different computational requirements. Existing neuroscientific simulation codes are large, scalable software packages that are used successfully by a vast community. The inclusion of advanced numerical methods to simulate effects beyond the scope of the currently used methods requires the design of minimally invasive methods that require only small changes in the existing code and that use the present communication routines as much as possible. The inclusion of advanced numerical schemes for the integration of systems of ordinary differential equations in the NEST software package without harming the scalability will also be presented. With respect to neuromorphic computing approaches, the session will highlight two large-scale complementary neuromorphic computing systems. The status of construction and commissioning of both phase-1 systems is summarized, and an overview of the roadmap until 2023 will be given. The HBP session will end with an overview of the new IBM CLC, its motivation, and technology, which is based on the Hierarchical Temporal Memory cortical model. Early applications and hardware plans will also be discussed.
• Session: Supercomputing & Human Brain Project – Following Brain Research & ICT on 10-Year Quest, Wednesday, July 15, 2015, 8:00 a.m. Chaired by Thomas Lippert http://bit.ly/1JQg7c8
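To give a flavor of the ordinary-differential-equation work mentioned above, here is a self-contained toy integration of a single leaky integrate-and-fire neuron using a simple exponential-Euler step. It is purely illustrative: the parameters are arbitrary, and this is neither NEST code nor the minimally invasive schemes the speakers will present.

```python
# Toy leaky integrate-and-fire neuron, integrated with an exponential-Euler step.
# Illustrative only; parameter values are arbitrary and this is not NEST code.
import math

def simulate(t_stop=100.0, dt=0.1, tau_m=10.0, v_rest=-65.0, v_thresh=-50.0,
             v_reset=-70.0, r_m=10.0, i_ext=2.0):
    """Return spike times (ms) for a constant input current i_ext (nA)."""
    v = v_rest
    decay = math.exp(-dt / tau_m)          # exact membrane decay over one step
    spikes = []
    for step in range(int(t_stop / dt)):
        v_inf = v_rest + r_m * i_ext       # steady-state voltage for this input
        v = v_inf + (v - v_inf) * decay    # exponential-Euler update
        if v >= v_thresh:
            spikes.append(step * dt)
            v = v_reset                    # reset after a spike
    return spikes

if __name__ == "__main__":
    spike_times = simulate()
    print(f"{len(spike_times)} spikes; first few at {spike_times[:3]} ms")
```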
UNDERSTANDING HEALTH AND DISEASE
Manuel Peitsch, Ph.D.
The session on “Computational Approaches to Understanding Health and Disease” at the ISC High Performance 2015 conference will address computational aspects of systems biology. The session will also provide a practical example of the application of systems biology in the field of toxicology and show how experimental approaches can be integrated with sophisticated computational methods to evaluate the toxicological profile of a potentially reduced-risk alternative to cigarettes. Systems biology is a highly multidisciplinary approach that considers the biological systems as a whole and combines large-scale molecular measurements with advanced computational methods. It aims to create knowledge of the dynamic interactions between the many diverse molecular and cellular entities of a biological system and of how interfering with these interactions, for instance with genome mutations, drugs or other chemicals, leads to adverse reactions and disease. This knowledge is captured in richly annotated biological network models, which are represented as graphs of nodes (entities) and edges (relationships). Systems biology will, thus, elucidate the fundamental rules and, ultimately, the laws that describe and explain the emergent properties of biological systems — in other words, life itself. Like rules and laws in physics, biological network models may well be the description of complex systems that will enable our understanding of therapeutic interventions and how they influence disease progression.
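As a deliberately simplified illustration of the network idea, the sketch below encodes a tiny cause-and-effect graph and scores how well a set of measured gene-expression changes agrees with it. The graph, the gene labels as used here, the signs and the scoring rule are all invented for this example; they are not the models or algorithms used by the HBP, Philip Morris International or any published systems biology platform.

```python
# Tiny, invented example of a causal biological network and a perturbation score.
# Nodes are entities; signed edges are relationships ("A increases B" = +1,
# "A decreases B" = -1). The scoring rule is illustrative, not a published method.

# edge: (upstream_node, downstream_gene) -> expected sign of the effect
NETWORK = {
    ("TNF", "NFKB1"): +1,
    ("TNF", "CASP3"): +1,
    ("NFKB1", "IL6"): +1,
    ("NFKB1", "BCL2"): +1,
    ("TP53", "BCL2"): -1,
}

def perturbation_score(node, fold_changes):
    """Score one upstream node: +1 per downstream change matching the expected
    sign, -1 per contradiction, normalized by the number of downstream genes."""
    hits, total = 0, 0
    for (src, gene), sign in NETWORK.items():
        if src != node or gene not in fold_changes:
            continue
        total += 1
        observed = 1 if fold_changes[gene] > 0 else -1
        hits += 1 if observed == sign else -1
    return hits / total if total else 0.0

if __name__ == "__main__":
    # Hypothetical measured log2 fold changes from an 'omics experiment
    measured = {"NFKB1": 1.8, "CASP3": 0.9, "IL6": 2.1, "BCL2": -0.4}
    for upstream in ("TNF", "NFKB1", "TP53"):
        print(upstream, round(perturbation_score(upstream, measured), 2))
```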
Three-dimensional polarized light imaging (3D-PLI) represents a novel neuroimaging technique to map nerve fibers, i.e. myelinated axons, and their pathways in human postmortem brains with a resolution at the sub-millimeter scale, i.e. at the mesoscale. This is not a simulation, but a technique that helps to understand the human brain. Courtesy of Research Centre Juelich
The application of systems biology, therefore, leads to new approaches in pharmacology, toxicology and diagnostics, as well as drug development. It promises to pave the way for the development of a more precise and personalized approach to medicine, both from a therapeutic and a prevention perspective. To reach these ambitious goals, the first objective of systems biology is to elucidate the relevant biological networks and to understand how perturbing their normal function leads to disease. The second objective is to develop mathematical models with predictive power. Towards this end, both experimental and computational methods are needed, and employing them has led to great progress over the last decade. Firstly, the identification of the entities in a network and the elucidation of their interactions — or their connectivity — requires highly accurate large-scale molecular measurement (‘omics’) methods, which permit the quantification of most components of a biological system in relation to physiological observations gathered at the cell, tissue and organ level. Consequently, key measurement technologies have been developed which enable the systematic and comprehensive interrogation of a biological system by producing large sets of quantitative data of all major classes of biological molecules including proteins, gene expression and metabolites. Secondly, the complexity, diversity and sheer amount of data produced by these ‘omics’ methods require sophisticated HPC-enabled data analysis methods that yield the knowledge to build and refine the biological network models. These representations require not only a wealth of quantitative and dynamic data about the entities and their interactions, but also
dictionaries describing their entities, a specialized computer language to capture the nature of their interactions, and mathematical descriptions of their dynamic behaviors. Clearly, integrated approaches to HPC-enabled big data analysis still represent a challenge, and much work is needed to develop more efficient methods. Nevertheless, the combination of recent developments in these areas is yielding novel computational methods that allow quantification of how much healthy biological networks are perturbed by active substances, such as toxic chemicals. Furthermore, we are seeing the emergence of mathematical models that are aimed at predicting the effect of substances on skin sensitization, drug-induced liver injuries and embryonic development. The past decade has witnessed massive developments in measurement technologies, as well as in HPC and scientific computing. These developments will continue, providing systems biology with increasingly accurate quantitative molecular and physiological data, biological networks and predictive models of biological mechanisms involved in human diseases. With the gradual development of reliable and validated methods for molecular measurements in the clinical setting, systems biology will likely enable the enhancement of such disease models with personal molecular data and, hence, is likely to play a major role in personalized medicine. Indeed, the “personalization” of disease-relevant biological network models would enable the selection of the most effective drug, or combination of drugs, for the treatment of certain diseases. Therefore, by placing the mechanistic understanding of adverse effects and disease at the center of drug discovery, medicine and toxicology, systems biology plays a major role in driving the paradigm shift in biomedical sciences from “correlation” to “causation.” For instance, the development of novel molecular diagnostics increasingly relies on systems biology. Not only does it enable the identification of molecules associated with disease to unprecedented levels of comprehensiveness, but it also drives the selection of the most diagnostic
combination of molecules to be measured. This selection process will increasingly be based on the understanding of disease-causing mechanisms. In systems toxicology, the quantitative analysis of large networks of molecular and physiological changes enables the toxicological evaluation of substances and products with unprecedented precision. For example, Philip Morris International is developing a range of potentially reduced-risk alternatives to cigarettes. To conduct the non-clinical assessment of these potentially reduced-risk products and determine whether they are indeed lowering the development of disease, we have implemented a systems toxicology-based approach. HPC enables our data analysis processes and building of biological network models, which are used to compare mechanism-by-mechanism the biological impact of novel product candidates with that of conventional cigarettes. While systems biology is starting to yield practical applications in diagnostics, drug discovery and toxicology, much research is needed to leverage this approach fully and to increase its range of applications. The computational enablers that need most attention are the development of computational approaches to big data analysis and the development of HPC systems that are optimized for the complexity of long-term multi-level simulations. While still in their infancy, multi-level mathematical models of organs, such as the heart and brain, hold the promise to enable future drug discovery. Their development, however, requires unprecedented efforts, not only in terms of experimental data collection, but also in computer science and scientific computing. It is clearly through the collaboration between computer scientists, scientific computing specialists and biologists that such developments will come to fruition.
• Session: Computational Approaches to Understanding Health & Disease, Tuesday, July 14, 2015, 1:45 p.m. Chaired by Manuel C. Peitsch http://bit.ly/1MFydLP
For more information about ISC 2015, visit http://www.scientificcomputing.com/ISC.
THE BIG 5 AT ISC HIGH PERFORMANCE 2015
By Julie Dirosa By the time you read this piece, you’ll either be making your final preparations to get to Frankfurt or you are on the show floor flipping through this magazine, the ISC 2015 edition that Scientific Computing magazine has published specially for this show. Whether you are an academician, researcher, vendor, or simply an HPC techie, you have definitely chosen the right event to attend this year. Alongside 2,600 like-minded peers, colleagues and complete strangers, you’ll be jumping into sessions, meetings, and exploring the show floor with one common intention: to make the best out of your time at the conference. We say carpe diem to you!
This is the largest show in our 30 years of organizing this conference. We are happy to offer you 25 HPC topics, 67 sessions, 400 speakers, 160 exhibitors and a plethora of networking opportunities. Here are five things that we think you should absolutely not miss out on at this year’s show. We bet these are also the items that will be most talked about in conversations and on social media.

#1 THE NEW TOP500 LIST
For the last 23 years the TOP500 authors have been compiling, analyzing and sharing their statistics on the world’s 500 fastest supercomputers with the general public. This list, published twice a year, is very valuable to the HPC industry because it is the only consistently historical effort that tracks the evolution of supercomputers. Again this year, project co-founder, Dr. Erich Strohmaier of Lawrence Berkeley National Laboratory, will present the 45th list, revealing how well the selected top systems have performed with regard to the Linpack benchmark, and how much energy they require to achieve their scores. Be there on Monday, July 13 at the opening session to witness the awarding of the world’s three fastest machines.
#2 THE VENDOR SHOWDOWN
This interactive format aims to give the audience a unique perspective of the HPC market. Leading players in both hardware and software will be given exactly 11 minutes to reveal their organizations’ newest products and R&D efforts, before the judges (industry experts) start probing them with insightful questions. The best part is that the audience will be rating the presentations to decide which two companies will be awarded as the best vendors of the year. This year we have selected 20 exhibitors to take part in the showdown. Participating companies will be divided into two groups to present on the afternoon of July 13.
This event isn’t only entertaining; it is also highly informative, as you’ll gain a good overview of some of the latest products and solutions available in the market. Don’t mistake this session for a sales pitch – it is not!
#3 THE ISC KEYNOTES (MONDAY - WEDNESDAY)
No conference is complete without a keynote address, and at ISC 2015 we offer you three – one on each conference day, focusing on present innovations and future challenges. This year’s conference keynote will be about some of the classiest cars on the road today! Juergen Kohler, the head of NVH CAE and Vehicle Concepts at Daimler AG, will share how his business unit at Mercedes-Benz employs high-capacity computer clusters to digitally predesign each component. The goal is to fulfill functional requirements such as passive and active safety and driving comfort, as well as to prevent manufacturing variations during the actual production. This talk, titled “High-Performance Computing – Highly Efficient Development – Mercedes-Benz Cars,” will be delivered on the morning of Monday, July 13. Professor Dr. Yutong Lu, the Director of the System Software Laboratory at the National University of Defense Technology, China, will
deliver her keynote on Tuesday, July 14. In this presentation “Applications Leveraging Supercomputing Systems,” she will analyze the key issues of application scalability for HPC systems in the post-petascale era and will highlight a co-design approach for the R&D activities to deliver a system capable of efficient computing at scale. The Wednesday July 15 keynote will be delivered by ISC perennial favorite, Dr. Thomas Sterling, who is currently the Professor of Informatics & Computing at Indiana University. Once again he will be focusing on “HPC Achievement and Impact,” sharing with attendees the important advances in HPC system capabilities, new results of technology research, and ambitious plans towards the realization of future extreme-scale computing. Please come early since the room usually gets packed rather quickly for this particular talk.
#4 THE ISC EXHIBITION
A visit to the ISC Exhibition is a must! With 160 HPC vendors, international research organizations, laboratories and universities exhibiting in over 2,000 square meters of booth space, this is easily the largest HPC exhibition in Europe. Apart from a couple of HPC companies missing out on the show, the rest of the world’s biggest players in the industry will be here to connect with attendees.
ISC High Performance Conference (ISC 2015) will be held from July 12–16, 2015, in Frankfurt, Germany
Feel free to enjoy the talks at the Exhibitor Forum, which will be held daily on the show floor. This is where exhibitors will be showcasing their cutting-edge technologies, plus taking questions from the audience. The talks will focus on system architectures, interconnects, processors, memory, storage, applications and system software.
#5 PROST TO 30 YEARS
The ISC welcome parties are legendary, guaranteed to put you in a good mood. This year it will be extra special because we’re celebrating our 30th anniversary. So, please come join the fun on the exhibition floor at 6:30 pm on Monday, July 13. Expect an evening with plenty to eat and drink and, of course, live music. Also use this opportunity to network with your peers and the exhibitors.

Finally, to browse the full program, please use our interactive agenda planner at http://www.isc-events.com/isc15_ap/ or download the mobile app, ISC 2015 Agenda App, from Google Play, iTunes and the Windows Phone store, for free. This app will also give you an update on the exhibition.

Willkommen in Frankfurt and at ISC 2015!
ISC Exhibitors A*STAR Computational Resource Centre ACAL BFi Germany GmbH Adaptive Computing AIC Allinea Software Altair Altera Europe Amazon Web Services AMD ANSYS Germany GmbH Applied Micro Circuits Corporation Asetek ASRock Rack Automation N.V. Avnet Technology Solutions GmbH Barcelona Supercomputing Center (BSC) Boston Limited Bright Computing BV Bull CADFEM GmbH CALYOS SA Cavium Inc. CEA Chenbro Europe BV CHPC (CSIR) ClusterVision BV CoCoLink Corp COMSOL Multiphysics concat AG CoolIT Systems Inc CPU 24/7 GmbH Cray CSC - IT Center for Science Ltd. CSCS and hpc-ch DataDirect Networks Dell Deutsches Klimarechenzentrum (DKRZ) DINI Group Dr. Markus Blatt - HPC-Simulation-Software & Services D-Wave Systems Dynatron Corporation E4 Computer Engineering EMC Emulex EPCC, Edinburgh University ETP4HPC European Exascale Projects European Open File System (EOFS) ExaScaler Inc. EXTOLL Fabriscale Technologies Finisar Corporation Fivetech Technology Inc.
Fraunhofer Institut SCAI Fraunhofer Institute for Industrial Mathematics ITWM FUJIFILM Recording Media GmbH Fujitsu Limited Gauss Centre for Supercomputing e. V. GENCI GiDEL GIGABYTE Technology Gigalight Technology Co., Ltd Globus - University of Chicago Go Virtual Nordic AB Goopax GRAU DATA AG Greek Research and Technology Network S.A. Heidelberg University (URZ) Hessisches Kompetenzzentrum für Hochleistungsrechnen (HKHLR) Hewlett-Packard HGST, a Western Digital Company HLRN HLRS Stuttgart HPC Advisory Council HPC in Latin America HPC Today Huawei IBM Iceotope IEEE Spectrum Inspur Intel Irish Centre for High End Computing (ICHEC) IT4Innovations National Supercomputing Center JARA-HPC Jülich Supercomputing Centre Kalray KISTI KIT / SCC Kitware SAS Leibniz Supercomputing Centre Lenovo Megware Computer Mellanox Micron Technology Microtronica - A DIVISION OF ARROW Moscow State University Nallatech National Center for High-performance Computing (NCHC) National Computational Infrastructure National University of Defense Technology (NUDT)
NEC Deutschland NICE Numascale Numerical Algorithms Group (NAG) Omnibond One Stop Systems Optomark Oracle Panasas Pawsey Supercomputing Centre Penguin Computing Percona PRACE QCT (Quanta Cloud Technology) Q-Leap Networks GmbH RapidIO.org Rausch Netzwerktechnik GmbH Red Oak Consulting Riken Rogue Wave Software RSC Samsung Semiconductor Europe SanDisk International LTD Scality scapos AG Scientific Computing World Scilab Enterprises Seagate SGI SLURM Spectra Logic Stäubli Tec-Systems GmbH STFC – Hartree Centre Sugon Information Industry Supermicro Tabor Communications Inc. Technische Universität Dresden Teraproc The Platform The Portland Group Tokyo Institute of Technology TOP500 Toshiba Electronics Europe GmbH T-Platforms transtec AG T-Systems UNICORE Forum e.V. UNIVA University of Nizhny Novgorod University of Tokyo 10Gtek Transceivers Co., LTD 2CRSI
HIGH PERFORMANCE COMPUTING
Optimizing Workflows in Globally Distributed, Heterogeneous HPC Computing Environments This complex task requires significant software support
Rob Farber
Optimization of workflows in a modern HPC environment is now a globally distributed, heterogeneous-hardware-challenged task for users and systems administrators. Not only is this a mouthful to say, it is also a complex task that requires significant software support. In the old days, job schedulers were only tasked with running jobs efficiently on relatively similar hardware platforms inside a single data center. Not that optimizing hardware utilization was a simple task back then, but current
HPC users and systems administrators must now manage workflows that may include a mix-n-match collection of heterogeneous hardware like CPUs, GPUs and Intel Xeon
Phi devices installed in multiple, distributed clusters that may be located within an organization, physically located across multiple data centers around the world, and may even include resources nebulously contained within the cloud. IBM, for example, offers their Platform LSF suite of tools built on top of the well-known LSF job scheduler that has been a core component in HPC centers for many years. Platform LSF provides the IBM Platform Session Scheduler and IBM Platform Data Manager tools to create ‘virtual private clusters’ that can asynchronously run jobs on a local cluster, a geographically distant cluster, or inside the cloud. Jobs running within these virtual private clusters need only communicate with the scheduler inside the virtual private cluster. This means users can submit large volumes of tasks within the virtual private cluster that are able to run asynchronously on the remote hardware without needing to wait for the main scheduler’s approval. In this way, IBM’s Platform LSF is able to avoid communications limitations and speed-of-light latency — even across long distances — to deliver extreme scaling within the job scheduler. Similarly, the IBM Platform LSF Data Manager is used to stage data across distributed clusters via localized smart caches to eliminate data access delays as much as possible. Such software tools help in creating and running tasks that can scale to run in these asynchronous, geospatially distributed environments — even with the added caveats
that the environment can dynamically change through the addition and removal of cloud resources and clusters. These same software tools help users and systems administrators optimize their workflows and job scheduling to efficiently utilize systems that contain massively parallel accelerators and coprocessors, as well as address more ‘mundane’ hardware differences, such as variations in memory capacity and CPU type. In the IBM Platform ecosystem, workflow creation is supported via a graphical user interface (GUI) that lets users draw the data flow and computational interactions. People interact much more naturally with a GUI, as it lets them graphically visualize the overall computational work and data flows. A well-designed GUI (and set of GUI templates) can abstract the workflow sufficiently so that script generators — much like a compiler for a parallel computer — can then create the scripts that contain the complex task and command invocations that implement the user workflow. Further, these scripts can be targeted to run on a specific hardware configuration (again much like a compiler generating code for multiple CPU architectures), be it for a local cluster or an aggregation of multiple clusters and cloud environments containing a number of asynchronous ‘virtual private clusters.’ Optimizing resource utilization means the systems team needs to see what is happening inside their globally distributed, asynchronously running multi-cluster HPC environment in real-time — a non-trivial data collection and visualization task by itself. Further, both users and the systems management team need to be able to analyze the performance of the HPC center so users can improve the efficiency of their workflows over the short term, and both users and the systems team can collaborate on HPC upgrades and new procurements to improve efficiency over the long term. Both real-time data acquisition and the analysis of aggregate HPC datacenter information are big-data tasks that might be larger than some of the scientific questions being investigated! Think of the amount of monitoring and profile data that can be generated by many thousands of nodes in real-time, or the amount of data that must be gathered and stored for later analysis
IBM Platform Analytics: Cluster utilization (jobs) over time. Courtesy of IBM
from those same nodes over the lifetime of the hardware. However, targeted data-driven decision-making is an essential part of data center operations and the procurement process, be it for a new system, system upgrade or to quantify cost and runtime machine requirements when contracting with a cloud-based service. Balance ratios, as discussed in my 2007 Scientific Computing article, “HPC Balance and Common Sense,” are a commonly used set of metrics that can extrapolate the characteristics of a newer, faster machine that can run a job mix efficiently based on the hardware characteristics of an existing system. The TOP500 site uses balance metrics based on synthetic benchmarks to compare systems. By extension, balance ratios and other metrics based on historical workload data for a site can be — and are — an invaluable tool for workload optimization and procurement planning. In short, balance ratios can distill a tremendous amount of ‘big data’ HPC performance data into a few numbers. They are but a few of the many analytic tools (many of which are not so concise) that can be used to analyze and optimize HPC data center procurements and operations. Packages such as IBM’s Platform LSF are nice in that they provide an integrated experience, from the user through to the systems management team. Other robust and respected job scheduling packages such as SLURM are also available. The SLURM ecosystem
also provides a number of similar tools, including the ability to run applications in a distributed environment such as the TeraGrid. Alternative profiling and analysis packages also exist. One example is the free NWperf tool set discussed in my February 2015 Scientific Computing article “Using Profile Information for Optimization, Energy Savings and Procurements.”1 The commercial Allinea MAP profiler also provides information programmers need to optimize their HPC workflows. Regardless, people need the ability to find quantifiable, data-driven answers to their questions about application, workload and data center efficiency. The increasing size and dynamic nature of global HPC operations along with the inclusion of heterogeneous hardware just means people need additional help in monitoring and optimizing workflows and data center operations.
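As a small, concrete illustration of how balance ratios condense system characteristics into a few numbers, the calculation below compares two invented machines on memory, network and I/O bandwidth delivered per unit of peak compute. All of the figures are made up for the example; a real analysis would use measured workload data or vendor specifications.

```python
# Illustrative balance-ratio calculation for two invented systems.
# All numbers are made up; real procurements would use measured or vendor data.

SYSTEMS = {
    "current cluster": {
        "peak_tflops": 900.0,        # peak double-precision Tflop/s
        "mem_bw_tbs": 300.0,         # aggregate memory bandwidth, TB/s
        "injection_bw_tbs": 40.0,    # aggregate network injection bandwidth, TB/s
        "filesystem_gbs": 500.0,     # parallel file system bandwidth, GB/s
    },
    "proposed upgrade": {
        "peak_tflops": 4000.0,
        "mem_bw_tbs": 900.0,
        "injection_bw_tbs": 90.0,
        "filesystem_gbs": 1200.0,
    },
}

def balance_ratios(spec):
    flops = spec["peak_tflops"]
    return {
        "memory bytes/flop": spec["mem_bw_tbs"] / flops,
        "network bytes/flop": spec["injection_bw_tbs"] / flops,
        "I/O GB/s per Tflop/s": spec["filesystem_gbs"] / flops,
    }

if __name__ == "__main__":
    for name, spec in SYSTEMS.items():
        print(name)
        for metric, value in balance_ratios(spec).items():
            print(f"  {metric}: {value:.3f}")
```

In this invented case the upgrade gains raw flops faster than it gains memory and network bandwidth, so its balance ratios shrink, which is exactly the kind of compute-centric drift the article describes.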
REFERENCE
1. "Using Profile Information for Optimization, Energy Savings and Procurements," Scientific Computing, February 2015. www.scientificcomputing.com/articles/2015/02/using-profile-information-optimization-energy-savings-and-procurements
Rob Farber is an independent HPC expert to startups and Fortune 100 companies, as well as government and academic organizations. He may be reached at editor@ScientificComputing.com.
HIGH PERFORMANCE COMPUTING
Thoughts on the Exascale Race
HPC has become a mature market
Steve Conway
As the HPC community hurtles toward the exascale era, it's good to pause and reflect. Here are a few thoughts…

The DOE CORAL procurement signaled that extreme-performance supercomputers from the U.S., Japan, China and Europe should reach the 100-300 PF range in 2017-2018. That's well short of DOE's erstwhile stretch goal of deploying a trim, energy-efficient peak exaflop system in 2018 or so, but still impressive. It would appear to leave room for one more pre-exascale generation before full-exascale machines begin dotting the global landscape in the 2020-2024 era.

An exaflop is an arbitrary milestone, a nice round figure with the kind of symbolic lure the four-minute mile once held. And as NERSC Director Horst Simon pointed out many moons ago, there are three temporal stages to these computing milestones that have occurred about once a decade. First will come peak exaflop performance, then a Linpack/TOP500 exaflop, and finally the one that counts most but will likely be celebrated least: sustained exaflop performance on a full, challenging 64-bit user application.

A peak exascale system is merely an "exasize" computer, to cite the term Chinese experts used in an SC13 conference talk. It's a show dog without a repertoire of tricks. A system that completes a Linpack run at exascale shows at least that a major fraction of the system can be engaged to tackle a dense system of linear equations. The path to the third stage — sustained exaflop performance on challenging user applications
— is where many of the biggest hurdles lie. Prominent among these, as is well known, are scaling the software ecosystem, providing enough reliability and resiliency to finish exa-jobs, and supplying enough IO to keep the heterogeneous processing elements busy. These are the same challenges advanced users face today, only more so.

The IO challenge is particularly nasty. In recent decades, HPC systems have become extremely compute-centric ("f/lopsided"). This increasing imbalance has aggravated the memory wall and narrowed the breadth of applicability of each succeeding generation of high-end supercomputers, especially for data-intensive simulation and for increasingly important advanced analytics. Fortunately, strategies are under way to alleviate (but not fix) this issue, including more capable interconnect fabrics, burst buffers and NVRAM, tighter linkages between CPUs and accelerators, clever data reduction methods, and more besides. But no one should expect supercomputers to return to the more balanced status of yesteryear. IDC vendor studies show that the basic architecture of HPC systems is unlikely to change in the next five to seven years, although configurations and some components will shift.

Not long ago, a fundamental premise underlying advanced supercomputer development was that evolutionary market forces were too slow and governments needed to stimulate revolutionary progress. The idea was that the government would do the heavy lifting to pave the way, and the mainstream HPC market would follow
to take advantage of the revolutionary advances. In our annual HPC predictions, IDC back then pointed out the risk that the government-supported high-end HPC market might split off as its own ecological niche, while the mainstream market continued to evolve on its own inertial path. That split, though still possible, has not happened. Instead, government officials, for the most part, have realized that they are no longer the primary drivers of HPC. Market forces have usurped that role. The worldwide HPC market's diversification and ten-fold expansion over the past three decades, from $2 billion to more than $20 billion, has removed the government from the kingpin position it once held. Government officials in most HPC-exploiting countries have adjusted their strategies to take better advantage of market forces, especially technology commoditization and open standards.

The fact that governments have met HPC market forces partway is ultimately a good thing for all parties. It means that many of the government-supported advances for exascale computing will sooner or later benefit the mainstream HPC market, including SMEs that buy only a rack or two of technical servers. That, in turn, means that savvy government officials can help justify the skyrocketing investments needed for extreme-scale supercomputers by pointing to ROI that benefits the large mainstream market, including industry and commerce. Government-driven advances can be used both to out-compute and to out-compete.

So, it appears that at least through the early exascale era, vendors will continue to build Linpack machines, because most government buyers will continue to see superior Linpack performance as a mark of leadership. Things might develop differently if more leading sites followed the example of the NCSA "Blue Waters" procurement, where the overwhelming stress was on the
assessed needs of user applications and Linpack performance was not even reported. That was a deliberate decision, because "Blue Waters" is also a competent Linpack machine at heart and could have recorded impressive Linpack results. The point here is that, if lots of buyers gave primary consideration to user requirements in their procurements, this should lead to better system balance and wider applicability over time.

At the high end of the supercomputer market, money talks, too. Government funding appetites will play a major role in determining the sequence in which the entrants cross the exascale finish line. In earlier times, the global supercomputer race pitted "muscle cars" from the U.S. and Japan against each other, and these monoliths featured lots of custom technology. But today, as a successful Arnold Schwarzenegger once advised a neophyte bodybuilder, "it's not the size of your muscles that counts; it's the size of your wallet." Among governments, the U.S. is still the largest funder, and the Obama Administration's budget request puts a high priority on exascale funding — although Congress has not approved this yet. The EU has been ramping up exascale funding, although not as fast as China, and Japan is likely to give everyone a run for their money.

World-leading supercomputers have not exactly morphed from muscle cars to family sedans yet, but they've been on that path — and it's generally a healthy one. The adoption of industry standards has been necessary for the expansion and democratization of the HPC industry, for broader collaboration, for better reliability, and for preserving and leveraging investments in software and hardware development. It's hard to imagine how vendors could make exascale muscle cars affordable, even for government buyers with the deepest pockets. The "Blue Waters" and CORAL procurements, among others, prove that, in the era of evolutionary HPC systems, important innovations can be pursued on behalf of users.

Governments around the world have increasingly recognized that HPC is a transformational technology that can boost not only scientific leadership, but also industrial and economic competitiveness. Accompanying this recognition is the notion that HPC is too strategic to outsource to another country, meaning to the U.S. in most cases. Exascale initiatives in Asia and Europe are promoting the development of indigenous technologies, often in conjunction with non-native components.

I've been talking so far about hardware, but we've said for some years at IDC that software advances will be more important than hardware progress in determining future HPC leadership. It's gratifying to see national and regional exascale initiatives increase funding for exascale software development, although the amounts still seem unequal to the task.

The long-term good news is that HPC has become a mature market, one driven by market forces. That gives strong assurance that the market will behave rationally over time. Demand, in the form of buyer and user requirements, will increasingly win out.

Steve Conway is Research VP, HPC at IDC. He may be reached at editor@ScientificComputing.com.
DATA ANALYSIS
Software Review: Partek Genomics Suite 6.6
Combining easy-to-use statistics with interactive graphics
John A. Wass, Ph.D.
Your corresponding editor really loves to review these genomics programs, as genomics (the study of the entire gene complement in an organism) is his area of research, and an exciting one at that. It is now at the center of a cutting-edge movement within the area of personalized medicine, as we have come to realize that no two humans are exactly alike and that drugs and other treatments that are useful in one patient may not help (or may actively hurt) another. The software for doing this is highly advanced, in that its functioning mates the precision of mathematics and statistics with the variability of biology. Now on to Partek's latest version...

Version 6.6 is unique in that it can easily integrate data from a variety of sources, assays and vendors into a single study, which is very useful from the biological interpretation standpoint. Unfortunately, we have no genomics software that goes the full
gamut of integrating all known inputs to new drug discovery (e.g., pathway analysis, proteomics, post-translational events, epigenetic events and drug metabolism, to name but a few). Researchers in the area quickly learned that there is much more involved in drug discovery than simply gene expression. Still, in the strictly genomic arena, the software has much to recommend it, including ease-of-use features, a gradual learning curve, customization of many features, excellent help materials, and a strong tech support department, all within a menu-driven base! Enhanced features in this latest version include:
Microarray
• gene expression
• miRNA expression
• exon expression
• copy number
• allele-specific copy number
• loss of heterozygosity (LOH)
• association
• trio
• ChIP-chip

Next-generation sequencing
• RNA-Seq
• miRNA-Seq
• ChIP-Seq
• DNA-Seq
• Methylation
All statistical features standard to genomic analysis and graphics are included:
• three-dimensional PCA plots to visualize data distribution and outliers
• hierarchical clustering to classify vast numbers of genes into revealing patterns
• profile trellis to identify groups of differentially expressed genes
• Venn diagrams to identify common and unique sample characteristics
• motif discovery to determine binding-site motifs of sequences
• gene ontology enrichment to determine gene grouping based on their molecular functions
• chromosome viewer to visualize how samples map to the reference genome, with customizable tracks for SNP detection, differential expression, peak detection, gene annotation and methylation regions, all integrated into a single study

As per usual, there are import and data formatting functions that are a pain until the user gains a bit of familiarity with the system. As the dialog boxes get more intelligent, they keep asking for further information. There are also nomenclature problems, as some software asks for the same thing as others, but with different verbiage. For example, when I used my old WinZip program, I could ask it to Unzip a file; now it asks me to Extract the file. When certain programs ask for the "type" of data, some specify the precision, some the mathematical type (real, integer, text), while others, such as my venerable JMP, classify data as continuous, ordinal or nominal! It takes a bit of practice to translate from one to the other. These are actually minor quibbles that disappear once the user acquires the necessary knowledge to address the software's idiosyncrasies (it doesn't take long; the easiest way is to call the help desk if you are impatient). The tools are actually a delight to use with a little practice and are
comprehensive, especially in the exploratory analysis and in assisting with the biological interpretations. Specifically, these include:
• Venn diagrams
• clustering with "heat maps"
• parametric and non-parametric ANOVA
• principal components analysis
• correlation matrices (with an automatic "find correlated variables" function)
• parametric and non-parametric one- and two-sample tests
• automatic removal of batch effects
• multiple test correction (a generic sketch of this step appears below)

Now, as to the graphics, it is easy to generate quality assessment (QA) graphs of the chips with the 'postImport QC spreadsheet' (Figure 1).
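To illustrate what the multiple test correction in the list above typically does with the thousands of p-values a gene-expression comparison produces, here is a minimal, generic sketch of the Benjamini-Hochberg false discovery rate procedure in Python; it is not Partek's implementation, and the p-values are made up:

```python
import numpy as np

def benjamini_hochberg(pvals):
    """Return Benjamini-Hochberg adjusted p-values (FDR) for raw p-values."""
    p = np.asarray(pvals, dtype=float)
    n = p.size
    order = np.argsort(p)                          # ascending raw p-values
    scaled = p[order] * n / np.arange(1, n + 1)    # p_(i) * n / i
    # enforce monotonicity from the largest rank downward
    scaled = np.minimum.accumulate(scaled[::-1])[::-1]
    adjusted = np.empty(n)
    adjusted[order] = np.minimum(scaled, 1.0)
    return adjusted

print(benjamini_hochberg([0.001, 0.008, 0.04, 0.20, 0.90]))
# -> approximately [0.005, 0.02, 0.067, 0.25, 0.9]
```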
The chip itself can be visualized as a pseudochip picture to further assist with QA and spotting patterns (Figure 2). Finally, a group separation may be had with the PCA graphic (under the review title, page 1 of review). This graphic is rotatable in three dimensions and is interactive with the workflow sheet. Furthermore, it is possible to color and group the points in the PCA plot (Figure 3). We may also pull up the obligatory line plot to see lines of probe intensity versus
frequency of probe intensity. Again, this graphic is interactive with the workflow sheet and may be customized to user preferences (Figure 4). The 'Sources of Variation' plot dramatically illustrates the major inputs that affect variation in the experiment (Figure 5). It should be noted that the variation in a single gene may be quickly produced by right-clicking on the row header and asking for the Sources of Variation in the resulting pop-up. The two remaining obligatory genomics plots are the Volcano plot and the hierarchical clustering heat map, which assist in identifying significantly regulated genes and patterns across genes and samples (Figures 6 & 7).
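As a generic illustration of what a volcano plot encodes (this is simulated data and plain NumPy/SciPy, not Partek's code or output), each gene is positioned by its log2 fold change and the negative log10 of its p-value, so strongly and significantly regulated genes stand out in the upper corners:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
control = rng.lognormal(mean=5.0, sigma=0.3, size=(1000, 6))   # 1,000 genes x 6 samples
treated = control * rng.lognormal(mean=0.0, sigma=0.4, size=(1000, 6))
treated[:50] *= 4.0          # pretend the first 50 genes are truly up-regulated

log2_fc = np.log2(treated.mean(axis=1) / control.mean(axis=1))            # x-axis
pvals = stats.ttest_ind(np.log2(treated), np.log2(control), axis=1).pvalue
neg_log10_p = -np.log10(pvals)                                            # y-axis

hits = (np.abs(log2_fc) > 1) & (pvals < 0.05)
print(f"{hits.sum()} genes pass |log2 fold change| > 1 and p < 0.05")
```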
If we need to look at a graphic of the intensity values and distribution for a single gene in a sample vs. control grouping, this is accomplished by requesting a Dot plot of the original data (Figure 8). In this example, it is easy to discern the difference between the control patient and one with Down Syndrome.

In summary, the software combines easy-to-use statistics with interactive graphics and is available for Windows, Macintosh and Linux. Interested parties should request the free two-week trial version. Downloading and setup are quick and easy, and the novice will find many exciting and extremely helpful features.
AVAILABILITY
Partek Incorporated
624 Trade Center Boulevard
St. Louis, Missouri 63005
314-878-2329; Fax: 314-275-8453
www.partek.com

John Wass is a statistician based in Chicago, IL. He may be reached at editor@ScientificComputing.com.
DATA ANALYSIS
Survival Models
An important technique employed in medical and engineering sciences
Mark Anawis
Carl Sagan once said: "Extinction is the rule. Survival is the exception." In either case, survival analysis is a method in which the time to an event, such as death or equipment failure, is measured and modeled. Whether the event has occurred or not (the event status) is also noted.

Observations in a study are prone to censoring. Most common is right censoring, where the patient or equipment does not experience the event (i.e. the patient is still alive, or the equipment is still functioning, at the end of the study). A patient who drops out of a study is also considered right censored. Less common is left censoring, where we follow a patient after a positive test for an illness but do not know the time of exposure. Truncation is a related phenomenon, whereby a patient with a lifetime less than a threshold is not observed at all; this is a typical situation in actuarial work. Both censored and uncensored data are used to estimate the parameters of a model.
A cumulative distribution function F(t) describes the probability of observing a survival time T less than or equal to a time t. The corresponding density function f(t), obtained from F(t) as its derivative, describes the relative likelihood of observing the event at time t.
Figure 1
There are two main functions used in survival analysis: the survival function and the hazard function. The survival function gives, at every time t, the probability of surviving (or of the equipment not failing) up to that time. The hazard function gives the potential of the event occurring per unit time at time t, given survival up to that time. Their relationship to each other and to the distribution function can be seen in Figure 1 and is written out below. At the start of a study, the distribution function f(t), the survival function S(t), and the hazard rate h(t) are high, while the cumulative hazard rate H(t) is low. As time increases, the survival function S(t) moves toward a minimum, whereas the cumulative hazard function H(t) moves toward a maximum.

A quantile of survival interest, such as the median survival time, can be calculated from the hazard or survival function. The relationship of a factor, such as a drug, to the time to event can be evaluated in the presence of covariates such as age, weight and gender.
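Since Figure 1 is not reproduced here, the relationships it summarizes can be stated compactly; these are the standard textbook definitions, not anything specific to a particular software package:

```latex
F(t) = \Pr(T \le t), \qquad f(t) = \frac{d}{dt}F(t), \qquad S(t) = \Pr(T > t) = 1 - F(t), \\
h(t) = \frac{f(t)}{S(t)}, \qquad H(t) = \int_0^t h(u)\,du, \qquad S(t) = e^{-H(t)}.
```

Any one of F, f, S, h or H determines the others, which is why software can report whichever form is most convenient.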
There are three main approaches to model building: parametric, nonparametric and semiparametric. The parametric approach uses linear regression for both the location and scale parameters. Linear regression can be used because the typical distributions used for survival (Weibull, lognormal, exponential, Fréchet, loglogistic and Gompertz) can be made linear through transformation. A goodness-of-fit Chi-square statistic can be calculated by comparing the likelihood of the fitted distribution with that of a null model that allows a different hazard rate for each interval. This is shown in Figure 2 using the JMP Parametric Survival Fit platform for a lognormal fit, where the Chi-square value is statistically significant with a probability less than 0.05. A plot of the 0.1, 0.5 and 0.9 quantiles as a function of the regressor is displayed.
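As a deliberately simplified sketch of the linearization idea, the example below fits a lognormal model by ordinary least squares on the log survival times. It ignores censoring entirely (which a real parametric survival fit, such as JMP's, must handle), and the times and doses are invented:

```python
import numpy as np

# Hypothetical, fully observed (uncensored) survival times and one regressor.
time = np.array([12.0,  9.0, 30.0, 25.0, 55.0, 40.0, 80.0, 66.0])   # days
dose = np.array([ 8.0,  8.0,  4.0,  4.0,  2.0,  2.0,  1.0,  1.0])

# Lognormal model: log(T) = b0 + b1*dose + sigma*Z with Z standard normal,
# so ordinary least squares on log(T) recovers the location parameters.
X = np.column_stack([np.ones_like(dose), dose])
beta, *_ = np.linalg.lstsq(X, np.log(time), rcond=None)
sigma = (np.log(time) - X @ beta).std(ddof=2)

print(f"intercept = {beta[0]:.2f}, dose effect = {beta[1]:.2f}, scale = {sigma:.2f}")
# The median predicted survival at dose d is exp(b0 + b1*d); quantile curves
# like those in Figure 2 are the same idea with censoring handled properly.
```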
A nonparametric approach, typified by the Kaplan-Meier method, calculates a survival function from continuous survival times, where each time interval contains one case. The estimate of this function is the product-limit estimator. An advantage of this method is that it is not dependent on the grouping of the data into different intervals (a bare-bones sketch of the calculation is given below).

A comparison of survival times for two or more groups can be done using a test such as the log-rank, Wilcoxon, Gehan's generalized Wilcoxon, Peto and Peto's generalized Wilcoxon, Cox's F or Cox-Mantel test. There are no hard and fast rules on which test to use in a given situation. However, where there are no censored observations, the samples are from a Weibull or exponential distribution, and the sample size is less than 50 per group, Cox's F test is more powerful than Gehan's generalized Wilcoxon test. Regardless of censoring, where the samples are from a Weibull or exponential distribution, the log-rank and Cox-Mantel tests are more powerful than Gehan's generalized Wilcoxon test. Multiple-sample versions of the log-rank, Gehan's generalized Wilcoxon, and Peto and Peto's generalized Wilcoxon tests also exist. Using Mantel's procedure, a score is assigned to each survival time and a Chi-square value is calculated based on the sums for each group. An example of survival between males and females is shown in Figure 3 using the JMP Survival platform, where both the log-rank and Wilcoxon tests are statistically significant with a probability less than 0.05.
Figure 3
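For readers who want to see the product-limit arithmetic itself, here is a bare-bones sketch in plain Python/NumPy (not the JMP implementation); the durations and censoring flags are made up. At each observed event time, the estimate is multiplied by the fraction of subjects at risk who survive it, while censored subjects simply leave the risk set:

```python
import numpy as np

def kaplan_meier(time, event):
    """Product-limit estimate of S(t). event = 1 for failure, 0 for censored."""
    time = np.asarray(time, dtype=float)
    event = np.asarray(event, dtype=int)
    surv, s = [], 1.0
    for t in np.unique(time[event == 1]):          # distinct event times
        at_risk = np.sum(time >= t)                # still under observation at t
        deaths = np.sum((time == t) & (event == 1))
        s *= 1.0 - deaths / at_risk
        surv.append((t, s))
    return surv

# Hypothetical durations (months); 0 marks a right-censored observation.
durations = [5, 8, 8, 12, 16, 23, 27, 30, 33, 40]
observed  = [1, 1, 0,  1,  0,  1,  1,  0,  1,  0]
for t, s in kaplan_meier(durations, observed):
    print(f"t = {t:4.0f}   S(t) = {s:.3f}")
```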
A semiparametric approach, typified by the Cox proportional hazards regression model, makes no assumptions about the baseline hazard function; a nonlinear relationship between the hazard function and the predictors is assumed. However, the proportional hazards assumption needs to be checked. It states that the hazard ratio comparing any two observations is constant over time, provided the predictors themselves do not vary with time. Using Kaplan-Meier curves, a graph of log(-log(survival)) versus the log of survival time should show parallel curves. An example of a Cox proportional hazards fit of drug data is shown in Figure 4. The whole-model test is a Chi-square test of the hypothesis that there is no difference in survival time among the effects. For categorical parameter estimates, a confidence interval that does not include zero indicates that the difference between that level and the average of all levels is significant.
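The article's figures come from JMP, but as one hedged, open-source illustration of the same kind of fit, the Python lifelines package provides a Cox proportional hazards model; the data frame below is entirely made up:

```python
import pandas as pd
from lifelines import CoxPHFitter   # open-source survival analysis package

# Hypothetical study data: time on study, event indicator and covariates.
df = pd.DataFrame({
    "weeks": [ 6, 13, 21, 25, 31, 38, 45, 52, 52, 60],
    "event": [ 1,  1,  1,  0,  1,  1,  0,  1,  0,  0],   # 0 = right-censored
    "drug":  [ 0,  0,  1,  1,  0,  1,  0,  1,  1,  1],
    "age":   [70, 55, 68, 61, 49, 72, 63, 50, 66, 58],
})

cph = CoxPHFitter()
cph.fit(df, duration_col="weeks", event_col="event")
cph.print_summary()   # hazard ratios, confidence intervals and p-values
```

The graphical check described above (parallel log(-log(survival)) curves) can be made from any Kaplan-Meier output and does not depend on this particular package.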
Survival analysis is an important technique used in the medical, engineering, social and economic sciences and is closely related to reliability analysis. Evaluation of the data distribution, consideration of censoring, and an approach suited to the type of model selected are key components of the evaluation process.

Note: All graphs were generated using JMP v.11.2.0 software.

Mark Anawis is a Principal Scientist and ASQ Six Sigma Black Belt at Abbott. He may be reached at mark.anawis@abbott.com.

Need a Solution to a Problem?
You are not alone. In this highly complex computing world, we all have a need for timely, creative solutions. Our FREE webinars and live events are developed by our expert editors, who select hot industry topics and lead the discussion on compelling, trending issues that might just provide the answer to your problem. Join us for one of these events, where you will hear about the latest innovations in computing and analytics. Just sign up to take advantage of this amazing resource. Visit www.scientificcomputing.com/edu and sign up today!