TCC Training, Education, and Workforce Development 2016-2017


BD2K: Big Data to Knowledge

Training, Education, and Workforce Development 2016 - 2017



Table of Contents

BD2K Centers ... 2
BD2K Coordination Centers ... 2
BD2K Initiative's Training Efforts ... 2
Geographical Distribution of Centers and Training Grants ... 3
Distribution of Training Efforts ... 5
Breakdown of Scientific Training and Career Development (K01) Grants ... 7
Breakdown of Resource Development (R25) and Diversity (dR25) Grants ... 7
Breakdown of BD2K Centers (U54) Grants ... 8
Breakdown of Institutional Training (T32/T15) Grants ... 8
BD2K Training Events 2017 ... 9
Upcoming: Imagining Tomorrow's University ... 10
Training Coordinating Center (U24) Grant ... 11
Resource Development (R25) Grants ... 14
Enhancing Diversity (dR25) Grants ... 53
Scientific Training and Career Development (K01) Grants ... 60
Institutional Training (T32 and T15) Grants ... 91
BD2K Centers (U54) Grants ... 111
Index ... 131


[Cover graphic: word cloud of the BD2K grant mechanisms (R25, dR25, K01, T15/T32, U24, U54) and related topics, including Big Data, data science, data management, data representation, data exploration, data modeling, data visualization and communication, distributed/cloud computing, genomics, proteomics, informatics, neuroimaging, biomedicine, health, career paths, tools, skills, and discovery.]

BD2K Centers

The BD2K Training Coordinating Center (TCC) promotes and supports training and educational activities across the NIH-funded Big Data to Knowledge (BD2K) program. In particular, the TCC brings an innovative approach to the exploration of educational and training resources for biomedicine in the context of big data science. Nationwide, there are 13 BD2K Centers (funded through the U54 grant mechanism), including 11 Centers of Excellence, that work in collaboration with other BD2K grantees. Their primary objective is to advance biomedical data science research by providing training and by developing new methods, tools, and resources for Big Data science.

BD2K Coordination Centers

The BD2K Centers Coordination Center (BD2KCCC) provides administrative support for the BD2K Centers program, promoting collaboration among Centers, facilitating Working Group operations, and organizing the consortium activities of the BD2K Centers. In addition to the TCC and the BD2KCCC, BD2K awarded a Data Discovery Index Coordination Consortium (DDICC) grant to the Biological and HealthCare Data Discovery and Indexing Ecosystem (bioCADDIE) project. BioCADDIE seeks to develop a prototype Data Discovery Index that will enable the accessibility and discoverability of biomedical Big Data. To bring together information and guidance about biomedical data standards, the BD2K Standards Coordinating Center (BD2K SCC) enables community engagement and provides tools to educate the community on standards in research.

BD2K Initiative's Training Efforts

BD2K funds research and training activities through several grant programs. These training efforts are categorized by grant mechanism, with the overall aim of supporting the advancement of biomedical data science through education and training, as well as the development and dissemination of courses, training materials, and open educational resources.

Resource Development (R25) Grants

Through these grants, online educational opportunities are expanded and made available to a greater audience of scientists, researchers, and trainees. These educational resources include MOOCs and short courses, and contribute to the foundation for the future of biomedical Data Science education and training.

Scientist Training and Career Development (K01) Grants

These grants support mentored training of scientists who will gain the skills and knowledge necessary to develop new technologies, methods, and tools in the field of Big Data.

Enhancing Diversity (dR25) Grants

With research experiences for undergraduate students as a primary focus, these grants support educational activities with hands-on exposure to Big Data research. In addition, these grants allow faculty to expand their Big Data knowledge base. Through a partnership with one or more of the BD2K Centers or the Data Discovery Index Coordination Consortium, undergraduate institutions develop Big Data curricula and novel instructional approaches.

Institutional Training (T15/T32) Grants

These grants support the next generation of biomedical data scientists by providing integrated training in computer science, the quantitative sciences, and biomedicine.


Geographical Distribution of Centers and Training Grants

[Map: nationwide locations of the BD2K Training Coordinating Center, the BD2K Centers of Excellence, and the Scientist Training and Career Development (K01) grants, at the institutions listed below. The list combines the map's legend columns, so some institutions appear more than once across categories.]

Institution | Number of grants
Columbia University Health Sciences | 1
Duke University | 1
Emory University | 1
Fred Hutchinson Cancer Research Center | 1
Harvard Medical School | 1
HudsonAlpha Institute for Biotechnology | 1
Broad Institute, Inc. | 1
Johns Hopkins University | 1
Icahn School of Medicine at Mount Sinai | 1
Medical College of Wisconsin | 1
Harvard Medical School | 1
Pennsylvania State University-Univ Park | 1
Stanford University | 2
Stanford University | 2
University of California Los Angeles | 1
University of California, San Francisco | 1
University of California Santa Cruz | 1
University of Hawaii at Manoa | 1
University of Michigan | 1
University of Illinois Urbana-Champaign | 1
University of Pittsburgh at Pittsburgh | 1
University of Memphis | 1
University of Utah | 1
University of Pittsburgh at Pittsburgh | 1
University of Washington | 1
University of Southern California | 2
Weill Cornell Medical College | 1
University of Wisconsin-Madison | 1
University of Southern California | 1


Resource Development (R25) Grants and Institutional Training (T15/T32) Grants

[The list combines the map's legend columns for these programs; some institutions hold awards under more than one mechanism.]

Institution | Number of grants
Boston University (Charles River Campus) | 1
California State Univ, Monterey Bay | 1
California State University, Fullerton | 1
Duke University | 1
Fisk University | 1
Georgetown University | 1
Harvard School of Public Health | 1
Icahn School of Medicine at Mount Sinai | 1
Jackson Laboratory | 2
Johns Hopkins University | 2
Columbia University Health Sciences | 1
Mount Desert Island Biological Lab | 1
Dartmouth College | 1
New York University School of Medicine | 2
Harvard School of Public Health | 1
Northwestern University | 1
Northwestern University at Chicago | 1
Oregon Health & Science University | 2
Pennsylvania State University-Univ Park | 1
Purdue University | 1
Stanford University | 1
Rutgers, The State Univ of N.J. | 1
University of California Berkeley | 1
University of California Irvine | 1
University of California Los Angeles | 1
University of California Los Angeles | 2
University of Missouri-Columbia | 1
University of California San Diego | 2
University of North Carolina Chapel Hill | 1
University of Michigan | 1
University of Texas, Austin | 1
University of Puerto Rico Rio Piedras | 1
University of Virginia | 1
University of Utah | 1
University of Washington | 1
University of Washington | 1
University of Wisconsin-Madison | 1
Washington University | 1
Vanderbilt University | 1
Weill Cornell Medical College | 1


Distribution of Training Efforts

Grant Program Type | Number of Grants
Enhancing Diversity (dR25) | 4
Scientist Training and Career Development (K01) | 17
Resource Development (R25) | 27
Institutional Training (T15/T32) | 15
Training Coordinating Center (U24) | 1
BD2K Centers (U54) | 13
Total | 77


Training programs across the BD2K enterprise represent a broad range of undergraduate, graduate, and post-doctoral programs, career path development, in-person workshops and seminars, virtual events, and video lectures, among other unique activities. While funded through a variety of NIH grant mechanisms, these training programs are, in fact, part of an integrated, collective whole. Through close interactions with these programs, the NIH and the TCC seek to promote data science as a 21st-century response to the need for more scientists with the computational tools to take on our nation's most serious biomedical research challenges.



Breakdown of Scientific Training and Career Development (K01) Grants

Scientific Background:

Scientific Background | Number of Grants
Physicians | 7
Computational or Quantitative | 7
Blend of biomedical and computational | 2
Behavioral or Social Scientists | 1

Research Data Focus:

Research Area | Number of Grants
Cancer | 4
Neurological Disorders | 4
Infectious Diseases | 4
Behavioral and Social Sciences | 2
Other | 3

Breakdown of Resource Development (R25) and Diversity (dR25) Grants

Funding Purpose:

Purpose | Number of Grants
Research Education: MOOC on Data Management | 1
Research Education: Open Educ. Resources for Sharing, Annotating and Curating | 3
Open Educational Resources | 9
Short Courses for Skill Development | 14
Enhancing Diversity | 4

Awards by Data Type:

Data Type | Number of Grants
Multiple data types, for the general biomedical audience | 16
Clinical or Population | 4
Imaging | 3
Genomics | 8


R25 and dR25 Grants (continued)

Awards by Participant:

Participants | Number of Grants
Course Instructors | 6
Undergraduate Students | 7
Graduate and Senior Faculty | 18

Breakdown of BD2K Centers (U54) Grants

Biomedical Data Focus:

Focus Area | Number of Grants
Imaging and genomics | 4
Environmental, Behavioral | 7
Computational prediction, data-mining, machine learning | 4
Biomedical (cells, drugs, molecules) | 2

Breakdown of Institutional Training (T32/T15) Grants

Type of Training Program and Competencies:

Training Type | Competencies | Number of Grants
Master's degree | Computational biology, quantitative genomics | 1
New pre-doctoral training (biomedical) | Bioinformatics, molecular genetics, medicine, genomics, imaging, proteomics, epidemiology, neuroscience | 7
New pre-doctoral training (quantitative) | Statistics, computer science, machine learning, data-mining | 6
Revisions to training program | Bioinformatics | 1


BD2K Training Events 2017

The BD2K Centers and training grantees conduct various training and educational events, such as webinars, symposia, workshops, and open courses, throughout the year, all working toward the overall aim of the BD2K training initiative. The TCC helps advocate for these events. Depicted below are the in-person workshops and events hosted by the different BD2K training efforts during the year. The full list of events is available on the BigDataU website.

[Timeline of events, February through November 2017:]

Feb 3: Women in Data Science Conference, Stanford, CA
May 9: NIH: Machine Learning with MATLAB, Bethesda, MD
May 16-18: BD2K-LINCS Data Science Symposium, Cincinnati, OH
Jun 5 - Aug 11: Summer Research Training Program in Biomedical Big Data Science, New York, Cincinnati, Miami
Jun 12-15: Causal Discovery from Biomedical Data, Pittsburgh, PA
Jun 15-16: Causal Discovery Datathon, Pittsburgh, PA
Aug 6-12: mHealth Training Institute, Los Angeles, CA
Sep 14-16: BD2K California Meeting

Additional timeline events: Data Science Innovation Lab; Summer School for Computational Genomics (TBA, CA, and Beverly, MA); Meeting of the Committee on Applied and Theoretical Statistics, Washington, DC; Undergraduate Summer Program on Big Data, University of Michigan, Ann Arbor, MI; Big Data Training for Translational Omics Research Boot-camp, West Lafayette, IN; Spring School for RNA-seq Analysis, New York, NY; Replicathon; Data Carpentry Instructor Training Workshop, San Juan, PR.


Upcoming: Imagining Tomorrow's University: Rethinking Scholarship, Education, and Institutions for an Open, Networked Era

In the 21st century, research is increasingly data- and computation-driven. Partly due to this trend, researchers, funders, and the larger community today emphasize openness and reproducibility. "Imagining Tomorrow's University" is a one-day, invitation-only workshop, funded by NIH BD2K and NSF, where researchers who practice open science and key university administrators will come together to start a new dialogue. The research world has changed, and the university needs to change too. But how? How should it adapt its structure, mission, infrastructure, education, and recruitment plans? Do we need new educational programs? New disciplines or new departments? How can universities recognize the value of new types of research outputs, such as software and data? Does research staffing need to change? Do research data engineers or research software engineers have a place? What are the measures of success for faculty active in open science and open research? The crucial question is: how do universities make themselves competitive enough to attract the best students, staff, and faculty in this new world? Workshop participants are invited to reimagine scholarship, education, and institutions for an open, networked era, and to alert university leaders to new opportunities to create value and serve society. The envisioned output of the workshop is a set of principles for how universities can thrive in this new world.

Imagining Tomorrow's University: March 8-9, Rosemont, IL
http://www.ncsa.illinois.edu/Conferences/ImagineU/


BD2K Training Effort Grant Information

Training Coordinating Center (U24) Grant




1 U24 ES026465-02

Training Coordinating Center
Big Data U: Empowering Modern Biomedicine via Personalized Training
Van Horn, John Darrell (jvanhorn@usc.edu)
University of Southern California
bigdatau.org

Abstract: In our rapidly evolving information era, methods for handling large quantities of data obtained in biomedical research have emerged as powerful tools for confronting critical research questions, with significant impacts in diverse domains ranging from genomics to health informatics to environmental research. The NIH's Big Data to Knowledge (BD2K) Training Consortium is expected to empower current and future generations of researchers with a comprehensive understanding of the data science ecosystem: the ability to explore, prepare, analyze, visualize, and interpret Big Data. To these ends, we propose a novel Training Coordinating Center (TCC) to coordinate the diverse activities occurring within the BD2K Training Consortium into a synergistic training effort. The TCC will create an inclusive and collaborative virtual environment, entitled "Big Data U," serving trainees from a wide spectrum of educational backgrounds and scientific domains. Big Data U will make personalized educational resources easily accessible and facilitate novel research collaborations through scientific rotations. We will harvest the web to automatically identify, model, and incorporate online resources into an Educational Resource Discovery Index (ERuDIte) and a Big Data U Knowledge Map. This unique system will alleviate the burden of sifting through hundreds of educational resources and searching across multiple research and training program websites, allowing users to easily determine which resources are didactically significant and correspond to the appropriate scientific domain of interest, level of education, and learning objective. Over the long term, our efforts will cultivate a diverse network of data scientists who can propagate their knowledge and experience for generations to come.

Our PI and team have a demonstrated commitment to training in biomedical data science. The University of Southern California is ideally suited to host this NIH BD2K effort, having a strong history of data science training and having recently founded two new master's programs relevant to Big Data biomedicine. The TCC is the logical extension of our outstanding track record in data science, and we will leverage our comprehensive experience and infrastructure in developing the TCC.

Public Health Relevance: A novel Training Coordinating Center (TCC) will be developed to coordinate the diverse activities occurring within the BD2K Training Consortium into a synergistic training effort. The TCC will create an inclusive and collaborative virtual environment, entitled "Big Data U," serving trainees from a wide spectrum of educational backgrounds and scientific domains, and facilitate novel research collaborations through scientific rotations.

Terms: Achievement; Award; base; Big Data; Biomedical Research; California; career development; citizen science; Collaborations; Communication; Communities; computer science; Consensus; Controlled Vocabulary; cost; Data; data mining; design; Documentation; E-learning; Ecosystem; educational atmosphere; Educational Background; Educational workshop; Elements; empowered; Ensure; Environment; Evaluation; experience; Feedback; Future Generations; Generations; Genomics; Group Processes; Harvest; indexing; Institutes; instrumentation; interest; Internet; Knowledge; Learning; Maps; member; Metadata; Methods; Modeling; novel; Ontology; Outcome; outreach; Pathway Analysis; Patient Self-Report; Positioning Attribute; Process; programs; Public Health Informatics; public health relevance; Recording of previous events; Research; Research Infrastructure; Research Personnel; Research Training; Resources; Rotation; Science; Scientist; Semantics; Services; social; Statistical Models; Students; success; Support System; System; Techniques; Technology; tool; Training; Training Activity; Training and Education; Training Programs; Training Support; United States National Institutes of Health; Universities; Update; virtual; web site; working group.

[Figure: the ERuDIte knowledge map, a clustered word cloud of online data science course titles organized by knowledge domain, including machine learning, supervised learning, statistics, mathematics, linear algebra, optimization, artificial intelligence, computer science, information retrieval, business analytics, biology, genetics, social sciences, healthcare, education, economics, history, engineering, physics, and ethics.]

BD2K Training Effort Grant Information

Resource Development (R25) Grants



1 R25 GM114821-03

An Open Resource For Collaborative Biomedical Big Data Training
Amaro, Rommie E; Altintas De Callafon, Ilkay (ramaro@ucsd.edu, altintas@sdsc.edu)
University of California San Diego
https://biobigdata.ucsd.edu

Abstract: Researchers increasingly rely on Big Data, Computational Data Science, and High Performance Computing (HPC) to solve problems within scientific domains and to explore them in new ways. However, it is often difficult to apply these approaches effectively in practice: they draw heavily from domains, including computer science and applied mathematics, in which most researchers are not deeply trained. Moreover, petascale and exascale datasets require new data science techniques to process and manage, and many practitioners have not encountered these techniques in their studies or work. Traditional training programs cannot keep up with current demand or with the rapid changes in technologies and practices. Students are not entering the field with enough real-world skills, and there are not enough programs to rapidly effect a change in the talent gap. Our vision is to cultivate an online community focused narrowly on data science and computing in biomedicine that fosters high-quality, well-informed, and freely accessible knowledge. Our community, the Biomedical Big Data Training Collaborative (BBDTC), targets the development of technical skills as well as education. Our effort will assist educators by developing and communicating best practices for content development and deployment, along with adaptive learning, assessment metrics, and testing practices. The BBDTC is a community-oriented platform that encourages high-quality knowledge dissemination, with the aim of growing a well-informed biomedical big data community through collaborative efforts on training and education. The central design idea behind the BBDTC is to accelerate open knowledge dissemination in the biomedical community.

The BBDTC is an e-learning platform that supports the biomedical community in accessing, developing, and deploying open training materials. The BBDTC supports Big Data skill training for biomedical scientists at all levels and from varied backgrounds. The natural hierarchy of courses allows them to be broken into and handled as modules. Modules can be reused in the context of multiple courses and reshuffled, producing a new and different, dynamic course called a playlist. The BBDTC enables course content personalization through this playlist feature: users may create playlists to suit their learning requirements and share them with individual users or the wider public. The BBDTC leverages the maturity and design of the HUBzero content-management platform for delivering educational content. Hands-on training software packages, i.e., toolboxes, are supported through Amazon EC2 and VirtualBox virtualization technologies, and they are available as: (i) downloadable lightweight VirtualBox images, and (ii)




remotely accessible Amazon EC2 Virtual Machines. The BBDTC supports cross-platform operability to reduce effort duplication and encourage innovation. To facilitate the migration of existing content, the BBDTC supports importing and exporting course material from the edX platform. Migration tools will be extended in the future to support other platforms. The framework tracks content consumption and usability through usage statistics and user scenarios. In brief, by focusing on technical training as well as education, providing real-world datasets, and presenting packaged toolkits that are simple enough to learn yet powerful enough to perform real work, the BBDTC differentiates itself from other online learning providers. The framework is live and running at https://biobigdata.ucsd.edu.

Publications: doi:10.1016/j.procs.2016.05.454

Keywords: Big Data, Virtualization, Distributed Data-Parallel, Biomedical Engineering and Biophysics, Immunology, Cancer, Computational Biology



1 R25 GM114827-02

An open, online course in neuronal data analysis for the practicing neuroscientist
Bohland, Jason W; Eden, Uri Tzvi; Kramer, Mark Alan (jbohland@bu.edu, tzvi@bu.edu, mak@bu.edu)
Boston University (Charles River Campus)

Abstract: Advances in technology for measuring neuronal activity at ever-larger scales and with increasing spatial and temporal resolution, concomitant with decreasing costs of data storage, are driving a revolution in neuroscience. As a flood of neuronal data continues to accumulate, a new challenge faces the global neuroscience community: how to make sense of these complex data to drive basic biological insight and to shed new light on neurological and neuropsychiatric disorders. This new, data-driven era of neuroscientific research demands that investigators master the fundamental methods of time series and image analysis and know when and how to apply these methods appropriately, either in custom applications or in existing software packages. Accessible yet rigorous resources for hands-on training and experience with such modern data analysis techniques are lacking in neuroscience. Our BD2K-sponsored training program directly addresses this challenge. Through a case-study approach composed of a series of individual learning modules, we use real-world data from neuroscience to motivate the study of modern quantitative analysis methods. Principled development of theory and methods, applied to small and mid-sized examples of neurophysiological data (field-, spike-, and image-based datasets), provides the necessary foundation for the analysis of big data. In collaboration with Boston University's Digital Learning Initiative, we are developing an Open, Online Course ("OOC") consisting of 16 modules, to be released on the BU edX platform in 2017. This platform will allow students to interact directly with MATLAB and the provided datasets in their browser.

The first three modules focus on the use of MATLAB for neural data analysis and on informatics and computational methods for large datasets. Each subsequent module focuses on one category of neuroscience case-study data and the development and application of appropriate, modern data analysis techniques. Modules consist of introductory videos, narrated screencasts, MATLAB examples, embedded test questions, discussion forums, and assignments. A series of three "Big Data" modules will extend the analysis methods for specific data types to large public datasets and describe the new methods needed for tackling them. To facilitate individual learning paths, the self-contained modules are designed to be maximally independent and accessible in any order, at any time. Recognition for completing individual modules and groups of modules will be awarded through a system of badges and certificates.




The edX platform provides a powerful system for the quantitative evaluation of educational effectiveness. We will utilize this platform to perform self-assessment of learners, and to analyze participant performance in each module. We will also compare participant performance in the online course with a traditional, in-person lecture and lab format at Boston University. These results will facilitate the ultimate goal of improving online retention and participant understanding of materials presented in the OOC. Assessments will be used to drive improvements in course materials over the duration of the project. The proposed OOC will prepare students and researchers with the fundamental skills required for the analysis of neuronal big data and elevate the general competencies in data usage and analysis across the research workforce. Keywords: Image Analysis, Machine/statistical learning, Mathematical statistics, Multivariate Methods, Signal processing, Spatiotemporal Modeling, Statistical Analysis, Visualization, Data analysis, Neuroscience



1 R25 EB020378-03

Big Data education for the masses: MOOCs, modules, & intelligent tutoring systems Caffo, Brian Scott bcaffo1@johnshopkins.edu Johns Hopkins University https://www.coursera.org/specializations/genomics In this program, we are creating open and low-cost educational content for those who want to work on genomics, neuroimaging, and statistics for big data applications. The program supported the Genomic Data Science specialization on Coursera (https://www.coursera.org/specializations/genomic-data-science) as well as numerous courses in neuroimaging and statistics. The grant has supported the development of free textbooks and freely available modular content. It has also supported the development of the Executive Data Science specialization (https://www.coursera.org/specializations/executive-data-science), which offers training for those who manage data scientists. The program has also helped develop content and language translation for swirl (http://swirlstats.com/), a unique intelligent tutoring system for learning R in R, developed by Nick Carchedi, Sean Kross, and others at Johns Hopkins and in the open source community. Some notable courses created directly by the resource and currently running include:

1. Introduction to genomic technologies
2. Genomic data science with Galaxy
3. Python for genomic data science
4. Algorithms for DNA sequencing
5. Command line tools for genomic data science
6. Bioconductor for genomic data science
7. Statistics for genomic data science
8. Genomic data science capstone
9. Principles of fMRI 1
10. Principles of fMRI 2
11. Neurohacking in R
12. Advanced linear models for data science 1

Look for more courses in development from the Johns Hopkins team arriving this fall! Keywords: Data mining, Image Analysis, Information science, Bayesian Methods, Machine/statistical learning, Mathematical statistics, Medical informatics, Multivariate Methods, Population Genetics, Predictive analytics, Causal Analysis, Signal processing, Statistical Analysis, Computational Genomics, Computer science, Data analysis, Johns Hopkins University


1 R25 EB022365-02

Big Genomic Data Skills Training for Professors Chuang, Jeffrey Hsu-Min jeff.chuang@jax.org Jackson Laboratory https://www.jax.org/education-and-learning/education-calendar/2016/may/big-genomic-data-skills-training-for-professors Abstract: The Jackson Laboratory requests NIH R25 support for an innovative three-part educational program, Big Genomic Data Skills Training for Professors. The program will 1) train undergraduate college and regional university faculty across biology, mathematics, and computer science departments; 2) develop a flexible and modular curriculum that faculty can implement at their institutions; and 3) engage a diverse student group through dynamic annual data challenges. The biomedical research enterprise requires a workforce that can access, manipulate, integrate, and analyze big data; it is essential to train the next generation of scientists to fill this need. The proposed education program will stimulate big data skills training across dozens of institutions and hundreds if not thousands of students and provide a template for genomic big data training at colleges needing improved expertise. An intensive one-week workshop will provide skills training to professors from small colleges and regional universities. Jackson Laboratory scientists and participants will use a collaborative, multidisciplinary approach to develop, deliver, and publish online a big data curriculum for implementation in undergraduate courses. Annual data challenges will foster team-based science in which undergraduates compete across and between institutions and simultaneously gain research experience. Our needs assessment clearly demonstrates demand for this program across diverse institutions, including Historically Black Colleges and Universities, Institutional Development Award (INBRE) institutions, and small colleges.
The Big Genomic Data Skills Training for Professors program will shift the focus of undergraduate training towards big data skills and raise the competencies of the biomedical workforce. To provide large-scale dissemination, all educational program content will be made available on a public, online resource. Public Health Relevance: The Big Genomic Data Skills Training for Professors program is highly relevant to the biomedical training mission of the Big Data to Knowledge (BD2K) initiative. The multi-faceted program will 1) train faculty at less research-intensive institutions, 2) integrate big data skills into predominantly undergraduate institutions, and 3) engage young scientists in the shared quest to manage and leverage biomedical big data. The annual big data challenge will facilitate team-driven research and teach participants that sharing data, rather than hoarding and protecting it, represents a new paradigm for biomedical research. Address; Arts; Award; base; Big Data; Biological; Biological Models; Biology; Biomedical Research; college; Computational Biology; computer science; computing resources; Connecticut; Data; Data Analyses; Data Set; Development; Education; Educational Curriculum; Educational process of instructing; Educational workshop; experience; Faculty; flexibility; Fostering; Genomics; Goals; Historically Black Colleges and Universities; Home environment; Hour; Immersion Investigative Technique; improved; innovation; Institution; Internet; Knowledge; Laboratory Scientists; Maine; Mammalian Genetics; Mathematics; Medicine; Mission; Needs Assessment; next generation; open source; Participant; Postdoctoral Fellow; professor; programs; public health relevance; Publishing; Research; Research Project Grants; research study; Resources; Science; Scientist; sharing data; skills; skills training; Students; Teacher Professional Development; The Jackson Laboratory; Time; tool; Training; undergraduate student; United States; United States National Institutes of Health; Universities; virtual



1 R25 EB020379-03

OHSU Informatics Analytics BD2K Skills Course Dorr, David A; Haendel, Melissa A; Mcweeney, Shannon K dorrd@ohsu.edu Oregon Health & Science University http://www.ohsu.edu/xd/education/schools/school-of-medicine/departments/clinical-departments/dmice/research/bd2k.cfm The basic goals of the BD2K Skills Courses are twofold: first, to create a learning environment that will entice all levels of learners to participate. For the environment, we pose a series of challenges that require didactics, data search strategies, the use of carefully (but artificially) created large datasets whose metadata matches that of standard big datasets available within institutions and from large repositories such as the NIH, and, finally, reproducibility in informatics workflows. The second goal is to share the course with as many learners as possible, offering it at least four times during the study period and making it available to others during and after the grant period. We also endeavor to recruit diverse learners and early learners, to encourage more people to become data scientists. To date, we have offered five courses to interns, students, staff, and faculty at OHSU. The first course offering, the "Big Data to Knowledge Skills Course," was held in July 2015. It was offered to a group of undergraduates interested in data-related research. These students attended a week-long, 35-hour session on the basics of effectively finding, wrangling, analyzing, and presenting data for research. Ethical principles and research failures were highlighted to impress on them the importance of curation, sharing, and reproducibility in data use. The second, "Data after Dark," was offered to the graduate and early faculty community at OHSU in January 2016. It provided a briefer and more intense approach to understanding similar issues in the use and misuse of data in formulating problems, analysis, and presentation.
An "advanced" course was offered over four nights in May 2016, which expanded on these principles and offered a data challenge to participants. Two "Data and Donuts" courses, offered in June and July 2016, were targeted towards early learners and OHSU summer interns at the main and west campuses. Keywords: Data cleaning, Exploratory Data Analysis, Data Standards, Big Data, Electronic Health Record, Genetics and Genomics, Clinical Research



Training and Educational Development (R25)

Dorr, David A. (Oregon Health & Science University)
Shojaie, Ali (University of Washington)
Pathak, Jyotishman (Mayo Clinic Rochester)
Recht, Michael P. (New York University School of Medicine)
Kovatch, Patricia (Icahn School of Medicine at Mount Sinai)
Mukherjee, Bhramar (University of Michigan)
Hoffmann, Alexander (University of California Los Angeles)
Chuang, Jeffrey Hsu-min (Jackson Laboratory)
Fowlkes, Charless (University of California-Irvine)
Shaw, Joseph R. (Mount Desert Island Biological Lab)
Zhang, Min (Purdue University)
Martin, Elaine R (Univ of Massachusetts Med Sch Worcester)
Haddad, Bassem R (Georgetown University)
Surkis, Alisa (New York University School of Medicine)
Lawson, Catherine L (Rutgers, The State Univ of N.J.)
Seymour, Anne (Johns Hopkins University)
Caffo, Brian Scott (Johns Hopkins University)
Irizarry, Rafael Angel (Harvard School of Public Health)
Pevzner, Pavel A (University of California San Diego)
Hersh, William R (Oregon Health & Science University)
Amaro, Rommie E (University of California San Diego)
Lee, Christopher (University of California Los Angeles)
Bohland, Jason W (Boston University, Charles River Campus)
Elgin, Sarah C.R. (Washington University)



1 R25 GM119157-02

A Genome Browser On-Ramp to Engage Biologists with Big Data Elgin, Sarah C R selgin@biology.wustl.edu Washington University The G-OnRamp project has two complementary foci: (1) to address the immediate need for a genome annotation platform that can be used by biomedical scientists with little informatics background to analyze large genomic datasets (e.g., RNA-Seq, ChIP-Seq), and (2) to develop a resource for biology faculty members to bring bioinformatics into the graduate and undergraduate curriculum by introducing students to a genome annotation project. G-OnRamp is a collaborative project based on two successful efforts, the Genomics Education Partnership (GEP; http://gep.wustl.edu) and the Galaxy Project (https://galaxyproject.org). We are developing a customized version of Galaxy called G-OnRamp that will enable biologists to annotate any sequenced eukaryotic genome, a task that can serve as an introduction to other big data biomedical analyses. Genome annotation, identifying functionally active regions within a genome, requires the use of diverse datasets and tools, including sequence similarity to known genes, gene prediction models, and high-throughput genomics data. One way to facilitate the interpretation of these potentially contradictory lines of evidence is to display the appropriate evidence tracks on a genome browser, such as protein sequence alignments from a well-annotated nearby species, the results from ab initio gene predictors, and RNA-Seq coverage data showing transcriptionally active regions. Galaxy provides a platform for creating and sharing analysis workflows and results, in this case performing the bioinformatics analyses and creating the needed evidence tracks to be viewed in a genome browser. The GEP is a consortium of over 100 colleges/universities that provides classroom undergraduate research experiences in genomics for students at all levels.
Students are performing primary research on selected regions of Drosophila genomes using genomic databases (e.g., FlyBase) and bioinformatics tools (e.g., BLAST) while learning about gene structure, evolution, underlying algorithms for the most important tools (e.g., Hidden Markov Models), and more. The GEP faculty would like to set up additional annotation problems, looking at a variety of organisms, reflecting their diverse research interests. G-OnRamp will extend Galaxy by providing (a) analysis pipelines for functional genomic data (e.g., ChIP-Seq, RNA-Seq); (b) interactive visual analytics to view and annotate a genome (e.g., create UCSC Assembly Hubs); and (c) tools for collaborative genome annotation. The GEP will serve as a key use case to validate and refine G-OnRamp, ensuring that it satisfies real educational needs. In a recent beta-testers workshop, we successfully used G-OnRamp to produce UCSC Assembly Hubs for five genomes provided by workshop participants, including a green alga (Chlamydomonas reinhardtii), two species of fish (Sebastes rubrivinctus and Kryptolebias marmoratus), the Puerto Rican parrot (Amazona vittata), and the African clawed frog (Xenopus laevis). These assembly hubs provide a custom genome browser for each organism. They include evidence tracks for the tblastn alignments of proteins from a nearby species, gene predictions from Augustus, GlimmerHMM and SNAP, and RNA-Seq results from HISAT and StringTie. Further development is underway to improve the user interface for building Assembly Hubs, to create an interface for subproject management, and to create tools for collaborative annotations. G-OnRamp is freely available at https://github.com/goeckslab. Publications: doi: 10.1093/nar/gkw343, doi: 10.1187/cbe-13-08-0152, doi: 10.1534/g3.114.015966 Keywords: Data mining, Scientific Workflow Environments, undergraduate education, Genetics and Genomics, Chromosome Biology, Computational Biology
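The RNA-Seq coverage evidence track mentioned above is conceptually simple: for each base in a region, count how many aligned reads overlap it. The following is a toy Python sketch of that idea with invented coordinates; real pipelines compute such tracks with the aligners and assemblers named in the abstract (HISAT, StringTie), not code like this:

```python
def coverage_track(read_intervals, region_length):
    """Per-base read depth over a region: the kind of evidence track a
    genome browser displays for RNA-Seq data. Intervals are 0-based,
    half-open (start, end) pairs."""
    depth = [0] * region_length
    for start, end in read_intervals:
        for pos in range(max(start, 0), min(end, region_length)):
            depth[pos] += 1
    return depth

# Three overlapping "reads" on a 10 bp region (illustrative coordinates).
track = coverage_track([(0, 5), (3, 8), (3, 10)], 10)
print(track)  # [1, 1, 1, 3, 3, 2, 2, 2, 1, 1]
```

Regions where the depth stays high across many reads are the transcriptionally active stretches a student would mark during annotation.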





1 R25 EB022366-02

The Big DIPA: Data Image Processing and Analysis Fowlkes, Charless; Digman, Michelle fowlkes@ics.uci.edu University Of California-Irvine http://bigdipa.ccbs.uci.edu

The primary goal of this BD2K training effort is to provide an accessible entry point for scientists (graduate students, postdocs, and established researchers in academia and industry) wishing to gain the practical skills needed to extract knowledge from large-scale, complex image data, by providing high-level instruction in the theoretical mathematics and statistics of image processing and spatial analysis as well as hands-on participation in the practical aspects of core competencies (e.g., computational processing, visualization). A secondary goal is to provide opportunities to foster the interdisciplinary interactions that are a critical component of effective big data research and to help build the community of experts able to drive research in these fields. The program consists of an on-site modular short course (~40 participants) that progresses through the complete research pipeline, from problem definition and data acquisition through data processing to visualization and interpretation. This in-person component will be most effective in providing sufficient depth and proficiency in the appropriate research competencies to participants from diverse backgrounds and at different career stages. We also intend to develop an online repository of course materials, including lecture materials, short video tutorials, and example datasets from the course, which will provide an organized progression of the major themes for self-study and reference. The course content is designed to provide instruction on a coherent progression of keystone concepts that will enhance the theoretical understanding and practical core competencies of trainees for independent research with large image datasets. In particular, the course design provides a mix of strategies to help participants:


Identify appropriate methods and limitations of big data image acquisition and instrumentation

Define and address considerations of image formats, storage and handling (hardware and software considerations)

Identify and understand mathematical and statistical frameworks for image processing

Implement computational algorithms on appropriate architectures (e.g. GPUs and parallel high performance computing platforms)

Visualize processed data (using tools such as dimensionality reduction and appropriate data-rich graphical representations)

Keywords: High performance computing, Image Analysis, Machine/statistical learning, Pattern recognition and learning, Spatio-temporal Modeling, Statistical Analysis, Visualization, Data analysis, Molecular Biology and Biochemistry, Neuroscience, Systems Biology, Developmental Biology
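A minimal example of the image-processing competencies listed above is spatial smoothing, which suppresses pixel-level noise before downstream analysis. This is an illustrative pure-NumPy sketch; at the scales the course targets, the same operation would run on GPUs or HPC platforms with optimized libraries:

```python
import numpy as np

def mean_filter(img, k=3):
    """Smooth a 2-D image with a k x k box (mean) filter, leaving the
    border pixels unchanged. A deliberately simple reference version:
    production code would use vectorized or GPU convolution instead."""
    pad = k // 2
    out = img.astype(float).copy()
    for i in range(pad, img.shape[0] - pad):
        for j in range(pad, img.shape[1] - pad):
            out[i, j] = img[i - pad:i + pad + 1, j - pad:j + pad + 1].mean()
    return out

noisy = np.zeros((5, 5))
noisy[2, 2] = 9.0           # a single bright "noise" pixel
smooth = mean_filter(noisy)
print(smooth[2, 2])         # 1.0: the spike is averaged over its 3x3 box
```

Even this toy case shows the trade-off the course discusses: smoothing attenuates noise spikes but also blurs genuine fine structure.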



1 R25 EB022365-02

Curriculum Development and Training for Systems Genetics Churchill, Gary A gary.churchill@jax.org Jackson Laboratory Abstract: Biomedical research has been recast into a data intensive science by digital technologies that produce enormous quantities of data. New capabilities in biomedical research have emerged as a result of this transformation; however, training to realize these new capabilities has lagged. New research capabilities require a workforce with sophisticated training in methods for Big Data analysis. We propose to provide advanced training in statistical inference, modeling, and high-throughput sequence analysis that biomedical researchers need to reach new breakthroughs in medicine and health. Our goal is to produce biomedical researchers skilled in advanced statistical methods and to nurture interdisciplinary collaboration. The Churchill group at The Jackson Laboratory will leverage its expertise in statistical analysis of large-scale data to create a cutting-edge Curriculum and Training for Systems Genetics. Our specific aims are to 1) develop a modular curriculum for advanced training in genomic data analysis, 2) test and evaluate the curriculum at The Jackson Laboratory with an audience of practicing scientists, postdoctoral fellows, and graduate students, and 3) disseminate the curriculum through Software Carpentry's global instructor network. Software Carpentry teaches basic laboratory skills for scientific computing. We will train members of Software Carpentry's community of volunteer scientist-instructors to teach our curriculum. Trained instructors will then broadcast advanced instruction in statistical analysis of Big Data to researchers at numerous biomedical institutions. Data Carpentry, a sibling organization of Software Carpentry, will assist us in integrating our curriculum with its genomics workshop to create seamless domain-specific training from introductory to advanced content.
The full curriculum will be available on the web under an open license for teaching and self-study. Public Health Relevance: Biomedical research has been transformed by technologies that produce immense quantities of data. This transformation brings the potential for new and innovative solutions to the most difficult problems in understanding human health and disease. This potential can be realized once a critical mass of researchers armed with highly developed computational and quantitative skills is in place. At present a lack of researchers with computational and quantitative skills stalls progress and leaves the potential of Big Data untapped. To remedy this, we propose an advanced curriculum and training in statistical analysis of Big Data to equip biomedical researchers with skills that will lead to breakthroughs in research. Address; American Association for the Advancement of Science; analytical method; arm; big biomedical data; Big Data; Bioinformatics; biological systems; Biomedical Research; Biometry; Chromosome Mapping; Cloud Computing; Collaborations; Communities; Computational Biology; Computer Simulation; Computer software; Computing Methodologies; Country; curriculum development; Data; Data Analyses; data management; Data Science; digital; Discipline; Disease; doctoral student; Educational Curriculum; Educational process of instructing; Educational workshop; Genetic; Genome; genomic data; Genomics; Goals; graduate student; Health; Heterogeneity; High-Throughput Nucleotide Sequencing; Human; innovation; Institution; Instruction; instructor; interdisciplinary collaboration; Internet; Investigation; Knowledge; Laboratories; Lead; Learning; Licensing; Medicine; member; Methods; Modeling; National Institute of General Medical Sciences; notch protein; novel strategies; Positioning Attribute; Postdoctoral Fellow; programs; Research; Research Personnel; Resources; Science; scientific computing; Scientist; Sequence Alignment; Sequence Analysis; Siblings; Site; skills; Statistical Data Interpretation; Statistical Methods; symposium; System; Teacher Professional Development; Teaching Materials; Technology; Testing; The Jackson Laboratory; tool; Training; Update; volunteer


1 R25 LM012285-02

Demystifying Biomedical Big Data: A User's Guide Haddad, Bassem R; Gusev, Yuriy; Mcgarvey, Peter haddadb1@georgetown.edu Georgetown University Abstract: Biomedical Big Data has become a sign of our times in this new genomics era, marked by a major paradigm shift in biomedical research and clinical practice. Advances in genomics have led to the generation of massive amounts of data. However, the usefulness of these data to the basic scientist or to the clinical researcher, to the physician or ultimately to the patient, is highly dependent on understanding their complexity and extracting relevant information about specific questions. The challenge is to facilitate the comprehension and analysis of big datasets and make them more "user friendly". Towards this end we propose to develop a Massive Open Online Course (MOOC), "Demystifying Biomedical Big Data: A User's Guide," that aims to facilitate the understanding, analysis, and interpretation of biomedical big data for basic and clinical scientists, researchers, and librarians with limited or no significant experience in bioinformatics. This course will be a resource freely available to users, at no cost. Given the continuous progress in the field, our plan is for the course to be a "living resource," regularly revised and updated to maintain its relevance. Public Health Relevance: The proposed study aims to develop a Massive Open Online Course (MOOC), "Demystifying Biomedical Big Data: A User's Guide," that will facilitate the understanding, analysis, and interpretation of biomedical big data for basic and clinical scientists, researchers, and librarians with limited or no significant experience in bioinformatics. This course will be a resource freely available to users, at no cost.
Address; Big Data; Bioinformatics; Biological; Biomedical Research; Clinical; clinical practice; Communities; Comprehension; cost; Data; Data Set; experience; Generations; Genomics; Knowledge; Librarians; Life; Methods; Patients; Physicians; public health relevance; Research; Research Personnel; Resources; Science; Scientist; sharing data; Students; Technology; Time; tool; Training and Education; Update; userfriendly



1 R25 GM114820-03

Adding Big Data Open Educational Resources to the ONC Health IT Curriculum Hersh, William R; Haendel, Melissa A; Mcweeney, Shannon K hersh@ohsu.edu Oregon Health & Science University http://skynet.ohsu.edu/bd2k

The overall goal of this project is to develop open educational resources (OERs) for use in courses, programs, workshops, and related activities as part of the National Institutes of Health (NIH) Big Data to Knowledge (BD2K) program. The materials are focused on those needing to learn at the advanced introductory level, including but not limited to beginning informatics graduate students, established investigators and senior trainees seeking to learn more about data science to expand their research programs, advanced undergraduates exploring future career paths into data science, as well as a variety of established professionals who need to understand and apply knowledge of the application of BD2K concepts in their present jobs. Examples of the latter include university administrators, librarians, public health practitioners, clinician leaders, and computer scientists. We have used a modification of the approach and format of the highly successful Office of the National Coordinator for Health IT (ONC) curriculum materials. Similar to the ONC materials, we provide materials both "out of the box" that can be used with minimum modification and as source materials (slides, exercises, etc.) that can be used to develop custom curricular content. We differ from the ONC materials in using more flexible formats and pedagogic approaches. We also do not use (expensive) professional narrators.

Keywords: Information science, Medical informatics, Search engines, Clinical Research





1 R25 EB022364-02

NGS Data Analysis Skills for the Biosciences Pipeline Hoffmann, Alexander; Papp, Jeanette Christine ahoffmann@ucla.edu University Of California Los Angeles Abstract: Next Generation Sequencing (NGS) is powering the field of Genomics and increasingly permeating all aspects of biosciences research and clinical practice. This development has triggered a huge need for data analysis skills among researchers and professionals. As biosciences graduate programs and also biomedical professional schools are rapidly adapting their curricula towards a greater emphasis on quantitative analysis skills, it has become apparent that the current pool of graduate and medical/pharmacy school applicants who have an understanding of and hands-on experience in quantitative analysis of `omic datasets remains exceedingly shallow. Indeed, without prior computational biosciences research experience it is difficult to evaluate applicants for their aptitude in quantitative analysis skills. In this post-genome era of biosciences research, a lack of aptitude in quantitative analysis skills will hamper graduate students and biomedical researchers throughout their careers. The proposed research education plan addresses this need by providing potential applicants of graduate and professional schools with a Summer Program focused on NGS data analysis and statistics, culminating in a capstone research project in the area of computational genomics to apply these skills and document proficiency. The proposed undergraduate summer course in hands-on computational genomics leverages existing infrastructure at UCLA's QCB Collaboratory and adapts and extends it to train both the likely applicants of graduate and biomedical professional schools, as well as the next generation of Big Data educator-scientists, by training postdoctoral fellows in workshop teaching and by disseminating the tools and infrastructure to other institutions.
Public Health Relevance: Next Generation Sequencing (NGS) is powering the field of Genomics and increasingly permeating all aspects of biosciences research and clinical practice. This development has triggered a huge need for data analysis skills among researchers and professionals. As biosciences graduate programs and biomedical professional schools are rapidly adapting their curricula towards a greater emphasis on quantitative analysis skills, the proposed research education plan aims to 1) to develop a summer workshop for undergraduates that will inspire and empower a diverse group to tackle the coming needs for genomic analysis, and 2) to develop and distribute educational materials and resources that can be used by biomedical students, researchers and educators for instruction in sharing, mining, and analyzing NGS and other types of genomic data, which in turn will advance the understanding, diagnosis, and treatment of disease, and promote public health. Address; Aptitude; Area; Big Data; career; clinical practice; collaboratory; Data; Data Analyses; Data Set; Development; Diagnosis; Disease; Education; education planning; Educational Curriculum; Educational Materials; Educational process of instructing; Educational workshop; empowered; Evaluation; experience; Exposure to; Faculty Workshop; flexibility; Future; Genome; Genomics; Graduate Education; graduate student; innovation; Institution; Instruction; Medical; Mentors; Mining; next generation; next generation sequencing; Pharmacy Schools; Postdoctoral Fellow; Professional Education; programs; public health medicine (field); public health relevance; Quantitative Reasoning; Readiness; Research; Research Infrastructure; Research Personnel; Research Project Grants; Resources; Schools; Scientist; skills; Software Tools; Speed (motion); Statistical Data Interpretation; statistics; Students; success; Testing; tool; Training
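To ground the kind of hands-on NGS data wrangling such a summer program begins with, a first exercise is often a simple sequence statistic such as GC content. This is an illustrative Python sketch with made-up reads, not material from the UCLA course:

```python
def gc_content(reads):
    """Fraction of G/C bases across a collection of sequencing reads:
    one of the simplest NGS quality metrics computed when first
    exploring FASTQ-style data (toy sketch, illustrative sequences)."""
    gc = total = 0
    for read in reads:
        seq = read.upper()
        gc += seq.count("G") + seq.count("C")
        total += len(seq)
    return gc / total if total else 0.0

reads = ["ACGTACGT", "GGCC", "ATAT"]
print(gc_content(reads))  # 0.5
```

From here a course would move to real FASTQ parsing, quality scores, and alignment, but the workflow pattern (iterate over reads, accumulate a statistic) stays the same.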



1 R25 GM114818-03

Biomedical Data Science Online Curriculum on HarvardX Irizarry, Rafael Angel rafa@jimmy.harvard.edu Harvard School Of Public Health http://genomicsclass.github.io/book/pages/classes.html We have completed the development of version two of our biomedical data science curriculum. During the first funding period we expanded our existing MOOC, Data Analysis for Genomics, created by Dr. Irizarry and Dr. Michael Love, into eight MOOCs. After reviewing the data for the class and the class evaluation, we decided to create two separate series and expand some of the courses. During version two we expanded these courses and created two XSeries. The first one is titled Data Analysis for the Life Sciences and includes the following courses:

1. Statistics and R
2. Introduction to Linear Models and Matrix Algebra
3. Statistical Inference and Modeling for High-throughput Experiments
4. High-Dimensional Data Analysis

The XSeries web page can be viewed here: https://www.edx.org/xseries/data-analysis-life-sciences The first course serves as an introduction to the basic statistical concepts and R programming skills necessary for analyzing data in the life sciences. In the second course, we learn to apply linear models to analyze data. In the third course we focus on the techniques commonly used to perform statistical inference on high-throughput data, such as the multiple comparison problem, false discovery rates, and hierarchical models. Finally, in the fourth course, we provide the basics needed to understand techniques that are widely used in the analysis of high-dimensional data, such as distance, clustering, machine learning, principal component analysis, factor analysis, batch effects, and exploratory analysis. The second series focuses on genomics and is therefore titled Genomics Data Analysis. The series is composed of three courses:

1. Introduction to Bioconductor: Annotation and Analysis of Genomes and Genomic Assays
2. High-performance Computing for Reproducible Genomics
3. Case Studies in Functional Genomics

The XSeries web page can be viewed here: https://www.edx.org/xseries/genomics-data-analysis In the first course we provide an introduction to the structure, annotation, normalization, and interpretation of genome-scale assays in Bioconductor. In the second we learn how to bridge from diverse genomic assay and annotation structures to data analysis and research presentations via innovative approaches to computing. Finally, in the last course we put it all together and explore data analysis of several experimental protocols, using R and Bioconductor. We are also close to completing a new course for the curriculum: Python for Data Analysis. This course is not a Python programming course but rather a data analysis and software engineering course that uses Python.
Keywords: High performance computing, Bayesian Methods, Machine/statistical learning, Probability, Scientific Workflow Environments, Statistical Analysis, Uncertainty modeling, Visualization, Computational Genomics, Computer science, Data analysis, Genetics and Genomics, Cancer, Computational Biology
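The false discovery rate control mentioned in the third course is usually taught via the Benjamini-Hochberg procedure: sort the p-values and reject all hypotheses up to the largest rank k with p(k) <= k*alpha/m. This is a minimal Python sketch with invented p-values; the courses themselves use R/Bioconductor implementations:

```python
def benjamini_hochberg(pvals, alpha=0.05):
    """Benjamini-Hochberg FDR procedure: return a boolean list marking
    which hypotheses are rejected at FDR level alpha (minimal sketch;
    R's p.adjust and Bioconductor provide production versions)."""
    m = len(pvals)
    order = sorted(range(m), key=lambda i: pvals[i])
    # Largest rank k (1-based) whose sorted p-value clears k*alpha/m.
    cutoff = 0
    for rank, i in enumerate(order, start=1):
        if pvals[i] <= rank * alpha / m:
            cutoff = rank
    rejected = [False] * m
    for rank, i in enumerate(order, start=1):
        if rank <= cutoff:
            rejected[i] = True
    return rejected

pvals = [0.001, 0.008, 0.039, 0.041, 0.042, 0.06, 0.074, 0.205]
print(benjamini_hochberg(pvals, alpha=0.05))  # only the first two rejected
```

Note that every hypothesis at or below the cutoff rank is rejected, even if some intermediate sorted p-value failed its own threshold; that step-up rule is what distinguishes BH from a naive per-test comparison.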



1 R25 EB020393-02

Community Research Education and Engagement for Data Science (CREEDS) Kovatch, Patricia; Claudio, Luz; Sharp, Andrew James patricia.kovatch@mssm.edu Icahn School Of Medicine At Mount Sinai http://icahn.mssm.edu/education/non-degree/creeds The Community Research Education and Engagement for Data Science (CREEDS) program represents our commitment to the overall goal of fostering practical skills for a national, diverse, and interdisciplinary community of early career researchers. Our motivations are: 1) to provide biomedical researchers with the practical skills and insight needed to accelerate scientific discovery; 2) to develop an online social environment to facilitate the exchange of big data, approaches, and techniques between novices and experts; and 3) to enhance the diversity of the biomedical big data workforce through targeted recruitment and retention of disadvantaged and underrepresented student populations. We are engaging 120 graduate students through an intensive, self-tailored, two-week summer school in NYC that showcases interesting, collaborative case studies in activities at schools throughout NYC. Participants employ active learning techniques to develop their skills in specific new methods and tools through both individual and group tasks on real-life large data sets. The training raises the skills of students of varied backgrounds to a sufficient level for additional graduate research and does not require any prior computing experience. Students also receive experience and materials to help them teach others when they return to their home institutions. We plan on reaching more individuals by placing the summer school on Coursera and making all materials available as open educational resources. Additionally, we are mentoring another 30 NYC-based graduate students for team participation in four month-long DREAM Challenges. The goal of the DREAM Challenges is to teach interdisciplinary collaboration and advanced programming skills.
In small groups, faculty mentor teams to develop computational solutions to real-life biological questions over a four-month period. As a result of CREEDS, we will have trained over 150 people in the team skills needed to understand, select and use genomics data tools and approaches. Armed with this knowledge, we expect the next generation of genomics scientists will be better placed to design, analyze, and interpret high-throughput genomics assays. This is the critical need for research education that we aim to address. Keywords: Data mining, High performance computing, Medical informatics, Predictive analytics, Scientific Workflow Environments, Computational Genomics, Computer programming, Computer science, Data analysis, Genetics and Genomics, Computational Biology



R25 Grants



1 R25 LM012286-02

Enabling Data Science in Biology Lawson, Catherine L cathy.lawson@rutgers.edu Rutgers, The State Univ Of N.J. http://edsb.rcsb.org This NIH-funded R25 project (Sept 2015-Aug 2017) will develop, pilot test, and deliver an open access modular educational curriculum covering concepts, approaches and requirements for developing and managing the full data pipeline for a curated public archive of biological experimental data contributed by a community of data providers. The project faculty are experienced structural biologists, data scientists, and educators drawn from the Research Collaboratory for Structural Bioinformatics (RCSB), which develops and manages the Protein Data Bank, EMDataBank, the Structural Biology Knowledgebase, and the Nucleic Acid Database. The RCSB built the infrastructure for these data resources and has successfully managed them over the past 20 years, an era of rapid expansion in Structural Biology. The “Enabling Data Science in Biology” (eDSB) open online curriculum will make best-practices recommendations for data resource management based on the extensive experience accumulated by the RCSB team. The intended audience includes librarians and information specialists, who will be able to use the materials as a basis for training and services offered by their organizations, and scientists seeking self-instruction. In addition, the eDSB curriculum will help to catalyze formation of a proposed federated system of model and data archives that will accelerate progress in Integrative Structural Biology. We plan to produce eight Modules that can either be studied separately or assembled into a complete set as an open online course, as numbered in the overview. Keywords: Databases/data warehousing, Managing a Data Archive, Molecular Biology and Biochemistry, Structural Biology, Chemical Biology, Computational Biology



1 R25 GM114822-03

The BD2K Concept Network: open sharing of active-learning and tools online Lee, Christopher leec@chem.ucla.edu University Of California Los Angeles http://ceils.ucla.edu/index.php/education-projects Abstract: The proposed project will create online services, shared teaching materials, and training for instructors and students to 1) expand and tailor Big Data to Knowledge (BD2K) learning for new audiences in bioinformatics, medical informatics and biomedical applications; 2) use active-learning to greatly increase conceptual understanding and real-world problem-solving ability; 3) directly measure learning effectiveness; and 4) boost the number of students that successfully complete BD2K courses. Tailoring the core concepts for BD2K success to teach diverse biomedical audiences is crucial both because these interdisciplinary concepts are a key barrier to entry, and because they are vital for real-world BD2K problem-solving ability. The UCLA/UCSD project team will: 1) provide an open, online repository where BD2K instructors worldwide can find, author, and share peer-reviewed active-learning exercises such as concept tests (already over 600), and immediately use them in class (with students answering with their smartphones or laptops); 2) catalyze the development, usage and validation of candidate BD2K concept inventories for rigorously measuring learning gains, via an accelerated approach of open-response concept testing and online data collection; 3) provide BD2K instructors a collaborative, peer-reviewed sharing and remixing platform for active-learning materials such as algorithm projects, hands-on data mining projects (via convenient "cloud projects"), exercises and problems, as well as "courselet" recording tools that automatically record video and audio on the instructor's laptop while they teach; 4) provide students anywhere free online courselets, each about one key BD2K concept, consisting of brief videos tightly integrated with concept tests and all the
active-learning exercises described above, and designed as an online persistent-learning community unified by concepts, in which students learn from the community's consolidated error models (common errors for a specific BD2K concept), effective remediations and counter-examples for each error model. Testing of this instructional approach for 3 years has doubled successful student completions of a BD2K methods course at UCLA, by reducing attrition, while simultaneously increasing conceptual understanding (mean exam scores). This approach will also be disseminated by: 1) pilot projects with BD2K instructors at UCLA and partner institutions, with detailed evaluation studies to identify critical success factors; 2) workshops (both online and onsite) for training instructors how to teach effectively with these tools in their BD2K courses; 3) online services and courselets. Public Health Relevance: Big Data to Knowledge (BD2K) education means bringing sophisticated data mining skills and thinking to researchers and clinicians throughout the biomedical enterprise, a challenging interdisciplinary learning curve. This cannot succeed without the kinds of hands-on learning exercises that are hard to find in BD2K textbooks, but that students need, such as data-mining projects with real datasets and real computational power tools, concept tests and concept inventories that rigorously teach and measure conceptual understanding, and algorithm projects where students prove their understanding of a challenge problem by writing code that can correctly solve any test case thrown at it.
We will provide BD2K instructors a collaborative, peer- reviewed sharing platform for immediately using all of these kinds of active-learning materials in class (currently containing over 2000 BD2K exercises and related materials), and BD2K students free online courselets each about one key BD2K concept, consisting of brief videos tightly integrated with concept tests and all the active-learning exercises described above. Active Learning; Algorithms; Big Data; Bioinformatics; Biological; candidate validation; Classification; Code; Collaborations; Collection; Communities; Computers; Data; Data Collection; data mining; data modeling; Data Set; design; Development; Discipline; Education; Educational process of instructing; Educational workshop; Effectiveness; Equipment and supply inventories; Evaluation Studies; Exercise; Guidelines; Individual; Institution; instructor; Interdisciplinary Education; Knowledge; laptop; Learning; lectures; Link; Measures; Medical Informatics; method development; Methods; Modeling; novel; Peer Review; Pilot Projects; Plug-in; PrePost Tests; Printing; Problem Solving; problems and exercises; Procedures; programs; Protocols documentation; public health relevance; remediation; repository; Research Personnel; response; Secure; Services; skills; Slide; Source; Students; success; System; Teaching Materials; Testing; Textbooks; Thinking, function; tool; Training; Validation; Video Recording; virtual; Work; Writing



1 R25 EB022363-02

Transforming Analytical Learning in the Era of Big Data Mukherjee, Bhramar; Johnson, Timothy D; Mozafari, Barzan; Nguyen, Long bhramar@umich.edu University Of Michigan https://sph.umich.edu/bdsi/ Abstract: In this dawning era of `Big Data', it is vital to recruit and train the next generation of biomedical data scientists. The collection of `Big Data' in the biomedical sciences is growing rapidly and has the potential to help solve many of today's pressing medical needs, including personalized medicine, eradication of disease, and curing cancer. Realizing the benefits of Big Data will require a new generation of leaders in (bio)statistical and computational methods who will be able to develop the approaches and tools necessary to unlock the information contained in large heterogeneous datasets. There is a great need for scientists trained in this specialized, highly heterogeneous, and interdisciplinary new field. Thus, the recruitment of talented undergraduates in science, technology, engineering and mathematics (STEM) programs is vital to our ability to tap into the potential that `Big Data' offers and the challenges that it presents. The University of Michigan Undergraduate Summer Institute: Transforming Analytical Learning in the Era of Big Data will draw from the expertise and experience of faculty from four different departments within four different schools at the University of Michigan: Biostatistics in the School of Public Health, Computer Science in the School of Engineering, Statistics in the College of Literature, Science, and the Arts, and Information Science in the School of Information. The faculty instructors and mentors have backgrounds in Statistics, Computer Science, Information Science and Biological Sciences.
They have active research programs in a broad spectrum of methodological areas including data mining, natural language processing, statistical and machine learning, large-scale optimization, matrix computation, medical computing, health informatics, high-dimensional statistics, distributed computing, missing data, causal inference, data management and integration, signal processing and imaging. The diseases and conditions they study include obesity, cancer, diabetes, cardiovascular disease, neurological disease, kidney disease, injury, macular degeneration and Alzheimer's disease. The areas of biology include neuroscience, genetics, genomics, metabolomics, epigenetics and socio-behavioral science. Undergraduate trainees selected will have strong quantitative skills and a background in STEM. The summer institute will consist of a combination of coursework, to raise the skills and interests of the participants to a level sufficient to consider pursuing graduate studies in `Big Data' science, along with an in-depth mentoring component that will allow the participants to research a specific topic/project utilizing `Big Data'. We have witnessed tremendous enthusiasm and response for our pilot offering in 2015, with 153 applications for 20 positions and a yield rate of 80% from the offers we extended. We plan to build on the success of this initial offering in the next three-year funding cycle of this grant (2016-2018). The overarching goal of our summer institute in big data is to recruit and train the next generation of big data scientists using a nontraditional, action-based learning paradigm. This six-week summer institute will recruit a group of approximately 30 undergraduates nationally and expose them to diverse techniques, skills and problems in the field of Big Data. They will be taught and mentored by a team




of interdisciplinary faculty, reflecting the shared intellectual landscape needed for Big Data research. At the conclusion of the program there will be a capstone symposium showcasing the research of the students via poster and oral presentations. There will be lectures by UM researchers and outside guests, and a professional development workshop to prepare the students for graduate school. The resources developed for the summer institute, including lectures, assignments, projects, template code and datasets, will be freely available through a wiki page so that this format can be replicated anywhere in the world. This democratic dissemination plan will provide undergraduate students across the world with access to teaching and training materials in this new field. Public Health Relevance: We propose a six-week summer institute, "Transforming Analytical Learning in the Era of Big Data," to be held at the Department of Biostatistics, University of Michigan, Ann Arbor, with a group of approximately 30 undergraduates recruited nationally, from 2016-2018. We plan to expose them to diverse techniques, skills and problems in the field of Big Data. They will be taught and mentored by a team of interdisciplinary faculty from Biostatistics, Statistics, Computer Science and Engineering, reflecting the shared intellectual landscape needed for Big Data research. At the conclusion of the program there will be a capstone symposium showcasing the research of the students via poster and oral presentations. There will be lectures by UM researchers and outside guests, and a professional development workshop to prepare the students for graduate school. The resources developed for the summer institute, including lectures, assignments, projects, template code and datasets, will be freely available through a wiki page so that this format can be replicated anywhere in the world.
This democratic dissemination plan will provide access to teaching and training materials in this new field across the world. The overarching goal of our summer institute in big data is to recruit and train the next generation of big data scientists using a nontraditional, action-based learning paradigm. Adverse drug effect; Alzheimer's Disease; Area; Arts; base; Behavioral Sciences; Big Data; Biological Markers; Biological Sciences; Biology; Biometry; burden of illness; Cardiovascular Diseases; Case Study; cluster computing; Code; Collection; college; computer science; Computing Methodologies; Data; data integration; data management; data mining; Data Set; design; Development; Diabetes Mellitus; Disease; Educational process of instructing; Educational workshop; Engineering; Epigenetic Process; experience; Faculty; Funding; Generations; Genetic; Genomics; Goals; graduate student; Grant; Image; Imagery; Information Sciences; Injury; Institutes; instructor; interest; Kidney Diseases; Lead; Learning; lectures; Literature; Machine Learning; Macular degeneration; Malignant Neoplasms; Medical; meetings; member; Mentors; metabolomics; Methodology; Methods; Michigan; Natural Language Processing; nervous system disorder; Neurosciences; next generation; Obesity; open source; Oral; Participant; personalized medicine; Pharmaceutical Preparations; Positioning Attribute; posters; Prevention; programs; Public Health Informatics; public health relevance; Public Health Schools; Recruitment Activity; Research; Research Personnel; Resources; response; Schools; Science; Science, Technology, Engineering and Mathematics; Scientist; signal processing; skills; Statistical Methods; statistics; Students; success; symposium; Talents; Techniques; tool; Training; undergraduate student; Universities; wiki; Work



1 R25 EB020381-04

Big Data Coursework for Computational Medicine Pathak, Jyotishman; Chute, Christopher G; Neuhauser, Claudia M jyp2001@med.cornell.edu Weill Cornell Medical College http://bdc4cm.org/

The Big Data Coursework for Computational Medicine (BDC4CM) is a joint Weill Cornell Medicine, University of Minnesota, and Johns Hopkins University week-long training program supported by a grant awarded by the U.S. National Institutes of Health (R25 EB020381). The program emphasizes how to navigate the interface between research and practice by offering participants in-depth lectures, case studies, and hands-on training and demonstrations from leading researchers. These sessions highlight how precision medicine's promise to deliver exemplary treatment relies on the ability of practitioners and researchers to extract information from high-dimensional data sets that combine traditional clinical data in electronic health records with data generated by high-throughput laboratory and mobile technologies. The lecturers demonstrate how new approaches for data representation, integration, analysis, visualization and sharing need to be developed collaboratively by quantitative scientists, biomedical researchers, clinicians, and bioethicists. Topics include data and knowledge representation standards; information extraction and natural language processing; visualization analytics; data mining and predictive modeling; privacy and ethics; and mHealth and participatory medicine. The 25-30 trainees selected through a competitive application process have a variety of backgrounds and experiences, ranging from faculty and clinicians to scientists, post-doctoral fellows and researchers, with varying experience and related degrees in health information technology, computer science, statistics or bioinformatics, and represent a multitude of healthcare programs and institutions from around the country.
All receive tailored, in-depth instruction, case studies, and a survey of the most relevant research domains for big data in healthcare, while interacting with distinguished scholars and world-renowned experts from academia, who also provide advice concerning career options in the field of computational medicine. Travel stipends are provided to those participants who are not based in New York City. Keywords: Data engineering, Databases/data warehousing, Information science, Medical informatics, Operations research, Statistical Analysis, Visualization, Computational Genomics, Computer science, Data analysis, Biomedical Engineering and Biophysics, Genetics and Genomics, Systems Biology, Clinical Research, Computational Biology






1 R25 EB023928-01

A hands-on, integrative next-generation sequencing course: design, experiment, and analysis Owzar, Kouros; Chan, Cliburn C kouros.owzar@duke.edu Duke University Abstract: The proposed six-week summer course will provide the knowledge, practical skills and experience needed to train the next generation of biomedical researchers in planning, conducting and analyzing inter-disciplinary genomics experiments. Participants will be recruited nationally from advanced undergraduates, early graduate students and postdoctoral fellows from both biological and quantitative disciplines. In addition to learning genomics, statistics, informatics and programming from didactic lectures, participants will also gain invaluable experience working in small diverse groups, mentored by faculty instructors with expertise in both the biological and quantitative sciences. The workshop emphasizes recall and self-testing in gaining these skills in the hands-on computing and wet lab sessions, following pedagogical best practices of interleaved learning, spacing and low-stakes testing for learning. The workshop also emphasizes best practices in reproducible analysis, including literate programming, use of version control and automation via scripting from raw data to final report. Finally, students will be introduced to platforms for Big Data, including the use of cloud computing platforms and libraries for distributed computing such as Spark. Public Health Relevance: Next generation sequencing (NGS) platforms offer an unprecedented opportunity to push the envelope in biomedical discovery. To remain the leading country in this research domain, the next generation of NGS researchers needs to be identified, encouraged and trained in the requisite disciplines.
The proposed summer course will teach participants - primarily advanced undergraduates, early stage graduate and medical students - fundamental principles and concepts in statistics, biology, computing, bioinformatics and medical informatics needed for inter-disciplinary research using NGS. Adherence; Automation; Big Data; Bioinformatics; Biological; Biology; biomedical scientist; Boxing; career; Cloud Computing; cluster computing; Code; Computer software; computing resources; Country; Data; Data Analyses; design; Discipline; Educational process of instructing; Educational workshop; Environment; experience; Experimental Designs; Faculty; faculty mentor; Foundational Skills; Foundations; Generations; Genomics; Goals; graduate student; Immersion Investigative Technique; Informatics; insight; instructor; Knowledge; Laboratories; Lead; Learning; lectures; Libraries; literate; Mechanics; Medical Informatics; Medical Students; Mentors; Methods; Modeling; next generation; next generation sequencing; Noise; Participant; Postdoctoral Fellow; Process; programs; Protocols documentation; Reading; Recruitment Activity; Reporting; Research; Research Personnel; research study; Risk; Science; Scientist; sequencing platform; skills; sound; Staging; statistics; Students; teaching laboratory; Techniques; Technology; Testing; Thinking; tool; Training; Walking; Work
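The scripted, raw-data-to-report analysis style described above can be previewed in miniature: counting k-mers across sequencing reads is the canonical map/reduce computation that distributed engines such as Spark parallelize. A minimal pure-Python sketch (the reads, the value of k, and the function name are illustrative, not course material):

```python
from collections import Counter

def kmer_counts(reads, k):
    """Tally every length-k substring across a set of reads.

    The same map (emit k-mers) / reduce (sum counts) pattern is what
    frameworks such as Spark parallelize over billions of reads.
    """
    counts = Counter()
    for read in reads:
        for i in range(len(read) - k + 1):
            counts[read[i:i + k]] += 1
    return counts

reads = ["ATGCG", "GCGTA", "CGTAC"]
counts = kmer_counts(reads, 3)
print(counts["CGT"])  # 2: CGT occurs in the second and third reads
```

In a distributed setting the inner loop would become a flatMap over reads followed by a reduceByKey; the single-machine version above is enough to convey the idea.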



1 R25 GM114819-03

Open Educational Resources for Biomedical Big Data Pevzner, Pavel A ppevzner@ucsd.edu University Of California San Diego http://bioinformaticsalgorithms.com The largest activity associated with our work has been the launch of the Bioinformatics Specialization on Coursera (http://coursera.org/specializations/bioinformatics), a mini-degree consisting of six BBD courses and a capstone project. The Bioinformatics Specialization was announced in summer 2015, and the final three of these courses launched in late 2015 and early 2016; all six courses have been running full-time ever since. Most students who complete the courses follow a “biologist track”, and their main source of assessment is in the form of “Bioinformatics Application Challenges” that guide them through a series of tools in a given area of bioinformatics, building upon the theory that they have learned in the class, and covering standard approaches used every day by many biologists. In one challenge, we guide students through constructing an evolutionary tree of ebolaviruses and fitting viruses taken from patients in the recent West Africa outbreak into this tree, identifying the source of the virus. A second class of learners follows a “hacker track”, in which students complete automatically graded coding assessments testing their ability to implement the algorithms that they encounter in the course. The Bioinformatics Specialization has recently shifted to on-demand; new “cohorts” of students are constantly beginning the course concurrently. The capstone project is part of an industry partnership with Illumina, which will be providing their BaseSpace platform (http://basespace.illumina.com) for student use. This capstone, which is launching in summer 2016, will guide students through Application Challenges on genome assembly, measuring gene expression with RNA-seq, and comparing whole genome and whole exome sequencing.
Students in the capstone who have experience with programming will take part in a coding competition to build a genome assembler from the ground up. We will run students' assemblers on a collection of datasets in order to score them. In this way, we view the assembly challenge as a simulated research project. We have also developed an online course called “Biology Meets Programming: Bioinformatics for Beginners” (http://coursera.org/learn/bioinformatics). This course tackles the problem of teaching learners in bioinformatics who have no programming background (such as many biology students), without reducing the material to a software toolkit. To help students learn the ins and outs of their first programming language, we have plugged in remedial Python exercises from the successful Python track on Codecademy (https://www.codecademy.com/learn/python), which has reached 2.5 million learners. We published the 2nd edition of Bioinformatics Algorithms: An Active Learning Approach, the bestselling print companion to our Specialization, which has been adopted in 40 institutions worldwide. The 2nd edition saw 100 pages' worth of changes based on thousands of discussion forum posts in our online classes. The largest component of these changes centered on the addition of “Charging Stations”, dozens of additional modules explaining how to implement especially difficult algorithms. The book is accompanied by programming challenges hosted on Rosalind (http://rosalind.info), a site for learning bioinformatics through problem solving that has 37,000 learners and has been used by 100 professors. The book is also accompanied by lecture playlists on our YouTube channel (http://youtube.com/user/bioinfalgorithms).
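To give a flavor of the capstone's assembly challenge: assembling a genome from error-free reads reduces to walking a de Bruijn graph built from k-mers. The sketch below is a deliberately simplified illustration, assuming a toy sequence whose (k-1)-mers are all unique; a real assembler (and the students' competition entries) must also handle repeats, sequencing errors, and uneven coverage:

```python
from collections import defaultdict

def assemble(kmers):
    """Reconstruct a sequence by walking its de Bruijn graph.

    Each k-mer contributes an edge from its (k-1)-mer prefix to its
    (k-1)-mer suffix; with no repeated (k-1)-mers the Eulerian path
    collapses to a simple walk from the unique start node.
    """
    graph = {}
    indegree = defaultdict(int)
    for kmer in kmers:
        prefix, suffix = kmer[:-1], kmer[1:]
        graph[prefix] = suffix
        indegree[suffix] += 1
    # The start node is the one prefix that never appears as a suffix.
    node = next(p for p in graph if indegree[p] == 0)
    sequence = node
    while node in graph:
        node = graph[node]
        sequence += node[-1]
    return sequence

reads = ["ATGGC", "TGGCG", "GGCGT", "GCGTG", "CGTGC"]
print(assemble(reads))  # ATGGCGTGC
```

The restriction to unique (k-1)-mers is what lets the walk replace a full Eulerian-path search; lifting it is exactly where the hard algorithmic work of real assemblers begins.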
Publications: doi:10.1145/2686871, http://coursera.org/specializations/bioinformatics Keywords: Graph Theory, Machine/statistical learning, Probability, Statistical Analysis, Computational Genomics, Computer programming, Computer science, Data analysis, Genetics and Genomics, Computational Biology



1 R25 EB020389-02

Discovering the Value of Imaging: A Collaborative Training Program in Biomedical Big Data and Comparative Effectiveness Research for the Field of Radiology Recht, Michael P; Aliferis, Constantin F; Braithwaite, Ronald Scott michael.recht@nyumc.org New York University School Of Medicine http://www.med.nyu.edu/courses/cme/voice A key component of value-based health care is the measurement and dissemination of outcomes information through the performance of comparative effectiveness research (CER). There has been limited research and relatively few publications in imaging CER despite the fact that imaging costs, particularly those related to advanced imaging such as MR, CT and PET-CT, are a significant component of increasing health care spending and radiation exposure. One reason for the lack of research in this area is that training in CER, and in the use of biomedical big data that is inherent to CER in imaging, is available only to a very small subset of medical imagers, and only at significant cost. Creating effective training in CER and biomedical big data for the broader community of medical imagers is beyond the capabilities of individual imaging training programs and therefore is not currently included in their curriculum. To ensure the appropriate and cost-effective use of imaging to improve patient outcomes, there is a critical need to develop far-reaching CER and big data training for imagers. We propose to address this critical need by developing a collaborative training program in CER and biomedical big data that will be accessible to a large number of imagers and imaging trainees. Our long-term goal is to develop CER and big data analytic skills in the next generation of imagers in order to promote imaging that not only improves patient outcomes, but potentially reduces health care costs. Our program has two goals. Goal 1 is to educate imaging residents in the fundamentals of biomedical big data and CER.
This will be accomplished by incorporating an introductory set of lectures into the curriculum of the American Institute for Radiologic Pathology, a course attended by 90-95% of United States and Canadian radiology residents. This program will begin in August 2016. Goal 2 is to create an advanced training and mentorship program in CER and the use of biomedical big data, directed toward and available to the entire medical imaging community. To gain the greatest participation, we will use a unique hybrid educational model with both on-line learning and in-person interactive sessions with expert course directors. The program will consist of five courses: Principles of Big Data Analytics; Big Data Analytics - Applications; Decision Analysis and Implementation Science; Value Assessment and Cost Effectiveness Analysis; and Evidence Synthesis and Systematic Review. Each course will begin with asynchronous online didactic material followed by a two-day, sixteen-hour interactive, in-person component, including didactic material and a final problem set and/or final exam. Following completion of the five-course program, selected attendees will be matched and work with a CER and/or big data mentor for a one-year period on a CER and/or big data analytics research project in imaging. Each mentee will present the results of their research project at a symposium at the end of their mentorship year. This course will begin in January 2017. Keywords: Data mining, Databases/data warehousing, Medical informatics, Statistical Analysis, Data analysis, Cancer, Clinical Research






1 R25 LM012288-02

Training & Tools for Informationists to Facilitate Sharing of Next Generation Sequencing Data Seymour, Anne; Lehmann, Harold P aseymou5@johnshopkins.edu Johns Hopkins University The heart of the genomic boom is the sequencing of patients' genetic material, that is, finding out the specific sequences of genetic components (nucleotides) that define a person's genetic makeup. Current next generation sequencing (NGS) produces huge amounts of data. An important aspect of such data is that, for us to learn from them, other scientists and researchers need to compare different sequences, so individual researchers should upload their data to large databases that enable such comparisons. We proposed creating 3 online modules that move researchers closer to such sharing. First is a module that teaches an overall perspective of the entire process of NGS research. Second is one that teaches the ethical issues related to such research in general, and sharing in particular. Third is one that teaches the specific knowledge and skills needed to share the datasets themselves. In each module, there will be examples that show the concepts being taught, that enable the user to try the concepts out, and that enable the user to be tested on the skills and knowledge of that module. Since October 2015, regularly scheduled monthly meetings with our subject matter expert, Dr. Jonathan Pevsner, have been invaluable in educating the three instructors about information useful to all three audiences (informationists, researchers and clinicians) for each of the modules being created for this project. The instructors have been developing their respective modules simultaneously. Each module has established learning objectives and assessment questions that reflect the learning objectives. Relevant content related to the learning objectives continues to be developed for each module by the instructor, working in consultation with Dr. Pevsner and other experts.
After considering several options, we have determined that the software Articulate Storyline will be used to build the modules. We are working with the new Office of Online Education to create the online modules. Module one, the framework module, presents a scientific experimental workflow that introduces the learner to NGS, and provides the learner with the ability to recognize and articulate its steps. There are 21 steps in the process, and each is being examined to determine the content most important to convey to a new learner at that step. The steps may include information on process considerations or important resources, or both. The Sharing module will provide the learner with information focused on the principles and concepts behind sharing and the important issues of ownership, privacy and accuracy. ClinVar




and dbGaP are both identified as repositories that will be mentioned in the sharing module, but deep dives into the methods for submitting files to these repositories are beyond the feasible scope of this module. The Ethics module focuses on the concepts behind, and the practical implications of, informed consent, privacy, confidentiality, ownership and consequences for children. The instructor has been in consultation with an expert in the field of ethics of genomic data, and this module is the furthest along in content development. Keywords: Information science, Medical informatics, Genetics and Genomics



1 R25 EB022367-02

Establishing a Network of Skilled BD2K Practitioners: The Summer Workshop on Population-Scale Genomic Studies of Environmental Stress Shaw, Joseph R joeshaw@indiana.edu Mount Desert Island Biological Lab Abstract: The application of high-throughput technologies in biomedical research has become widespread and requires specialized skills to effectively design experiments, analyze large datasets, and integrate new data with existing large datasets. These technologies are increasingly being applied in environmental health sciences to provide comprehensive and timely mechanistic knowledge of how the environment affects human health. With the increased application of these technologies, more researchers need training to conceptually develop, properly design, and implement comprehensive, large-scale big data studies. Accordingly, our proposed program in Population-Scale Genomic Studies of Environmental Stress has the long-term goal of training a network of Big Data to Knowledge (BD2K) practitioners in the application of modern sequencing technologies, computational approaches, and biostatistical methods. The program couples three annual training workshops with networking tools aimed at keeping participants trained, engaged, and connected. Workshops will feature a faculty of prominent researchers who provide the training necessary to maximize the application of these technologies. Each workshop will feature novel datasets of model organisms that participants create and analyze to link gene-environment interactions with the fitness of individuals. Hands-on training in a number of bioinformatics tools will be provided. Within this inquiry-based framework, faculty will lecture on a diverse set of topics including ecological genomics, experimental design, genome sequencing, and population genetics. Workshops will include a module on responsible conduct of research.
The proposed program builds upon our existing Environmental Genomics Course at MDI Biological Laboratory, first established in 2010. To our knowledge, it is the only course of its kind in the U.S. that provides a highly interactive, hands-on research experience for researchers interested in studying gene-environment interactions using natural populations. The proposed workshop training is modeled after our 2014 Environmental Genomics Course and forms the foundation on which to build a network of expertly trained BD2K practitioners. The proposed BD2K practitioner network will ensure long-term benefits for program participants, especially as new technologies and analysis methods arise in this rapidly changing field. Public Health Relevance: The long-term goal of the proposed NIH Big Data to Knowledge (BD2K) training initiative in Environmental Genomics is to increase the number of BD2K practitioners and build a virtual network of big data scientists. We will couple three annual workshops that focus on population-scale genomic studies of environmental stress with networking tools aimed at keeping participants connected, trained, and engaged in the application of modern sequencing technologies, computational approaches, and biostatistical methods.
Project terms: Accounting; Active Learning; Address; Affect; Animal Model; Area; base; Big Data; Bioinformatics; Biological; Biology; Biomedical Research; Biostatistical Methods; career; Couples; Coupling; Data; Data Set; design; Discipline; Ecology; Educational workshop; Ensure; Environment; Environmental Health; experience; Experimental Designs; Faculty; fitness; Foundations; gene environment interaction; gene function; Generations; genome sequencing; Genomics; Goals; Health; high throughput technology; Human; Individual; Informatics; interest; Knowledge; Laboratories; lectures; Link; Manuals; Methods; Mission; Modeling; multidisciplinary; new technology; novel; Participant; Physiology; Population; Population Genetics; programs; public health relevance; Publications; rapid growth; Research; Research Personnel; Research Project Grants; research study; responsible research conduct; Science; Scientist; Series; skills; Staging; Stress; Students; Technology; tool; Training; transcriptome sequencing; United States National Institutes of Health; Variant; virtual



1 R25 EB020380-03

Summer Institute for Statistics of Big Data Shojaie, Ali; Witten, Daniela ashojaie@uw.edu, dwitten@uw.edu University of Washington http://www.biostat.washington.edu/suminst/sisbid The Summer Institute for Statistics of Big Data (SISBID) at the University of Washington provides workshops on the statistical and computational skills needed to access, process, manage, and analyze large biomedical data sets. The program consists of five 2.5-day in-person courses, or modules, taught at the University of Washington each July. An individual participant can register for a subset of these modules. The five modules are as follows: (1) Big Data Wrangling; (2) Data Visualization; (3) Supervised Methods for Statistical Machine Learning; (4) Unsupervised Methods for Statistical Machine Learning; and (5) Reproducible Research for Biomedical Big Data. Each module consists of a combination of formal lectures and hands-on computing labs. Participants work together in teams to apply the skills that they develop in each module to important problems drawn from relevant case studies. The audience for SISBID consists of (i) biomedical scientists who would like to develop the statistical and computational training needed to make use of Biomedical Big Data; and (ii) individuals with stronger statistical or computational backgrounds but little exposure to biology, who will learn how to apply their skills to problems associated with Biomedical Big Data. Each of the five modules is co-taught by two instructors from top universities and research centers across the U.S., selected for research expertise and excellence in teaching. Lecture videos and slides are made freely available online so that individuals who are unable to attend SISBID in person can still benefit from the program. Keywords: Data mining, Machine/statistical learning, Multivariate Methods, Statistical Analysis, Visualization, Data analysis, Genetics and Genomics, Systems Biology, Computational Biology
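The contrast between modules (3) and (4) can be sketched in a few lines (the expression values, labels, and seed centers below are invented for illustration; the actual SISBID modules teach full statistical machine learning methods, not this simplification):

```python
from collections import defaultdict
from statistics import mean

# Hypothetical 1-D expression values for six samples.
values = [1.0, 1.2, 0.9, 5.0, 5.2, 4.8]
labels = ["healthy", "healthy", "healthy", "tumor", "tumor", "tumor"]

# Supervised: labels are known, so learn a mean per class and
# classify new samples by the nearest class mean.
groups = defaultdict(list)
for v, lab in zip(values, labels):
    groups[lab].append(v)
centroids = {lab: mean(vs) for lab, vs in groups.items()}

def classify(x: float) -> str:
    return min(centroids, key=lambda lab: abs(x - centroids[lab]))

print(classify(1.1))  # healthy
print(classify(4.9))  # tumor

# Unsupervised: no labels, so assign the same values to two clusters
# by distance to two fixed seed centers (one k-means-style step).
centers = [min(values), max(values)]
clusters = ([], [])
for v in values:
    i = 0 if abs(v - centers[0]) <= abs(v - centers[1]) else 1
    clusters[i].append(v)
print(clusters)  # ([1.0, 1.2, 0.9], [5.0, 5.2, 4.8])
```

The supervised step needs the label column; the unsupervised step recovers the same grouping from the values alone, which is the distinction the two modules build on.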



1 R25 LM012283-02

Preparing Medical Librarians to Understand and Teach Research Data Management Surkis, Alisa; Read, Kevin surkia01@nyumc.org New York University School of Medicine http://hslguides.med.nyu.edu/data_management?hs=a Abstract: The proposed project is to create two curricula to prepare librarians to teach research data management to researchers. The first will be a web-based curriculum that uses interactive educational technologies to provide librarians with a better understanding of research data management, as well as the practice and culture of research. The second curriculum will be designed as a toolkit to be used by librarians for in-person teaching of researchers and will consist of slides reviewed against evidence-based instructional design and cognitive learning theories, scripts, evaluation tools, and instructions. Some of the modules in the toolkit will allow for guided customization on topics where institution-specific material is important (e.g., data storage). Modules will contain existing and to-be-created short "edutainment" videos, which we have found to be extremely effective in delivering material. Evaluation of educational materials will include pre- and post-assessments of knowledge gain, satisfaction, comfort level with the material, and intent to use, for both the online and in-person modules. For the former, assessments will be incorporated into the modules; for the latter, they will be administered through a centralized online survey before and after the in-person class. Prior to broad dissemination, both curricula will be piloted. Curricular material will then be revised based on the pre- and post-assessments described above and on qualitative data obtained through interviews with, and observations of, the librarians. Following piloting, these curricula will be disseminated to the broader biomedical librarian community for use at institutions across the United States to facilitate biomedical data management, sharing, and reuse.
Public Health Relevance: Billions of dollars of public money are spent supporting research that ends in a publication, often with results that can never be reproduced due to poor research data management practices. Sound research data management (RDM) provides a) the means to reproduce the results of an experiment or determine why they are not reproducible, b) the potential to reanalyze a dataset or more effectively build on a given researcher's results, and c) the ability to use standards to create interoperability between data, creating the potential for larger datasets from which stronger conclusions can be drawn. To support the movement toward improving the quality and reproducibility of research data, this project will deliver web-based modules for librarians to learn about RDM, and subsequently provide librarians with focused curricula to teach RDM to researchers to improve their current RDM practices. Project terms: Affect; base; Cognitive; Communities; Data; data management; Data Set; Data Storage and Retrieval; design; Educational Curriculum; Educational Materials; Educational process of instructing; Educational Technology; Effectiveness; Evaluation; evidence base; flexibility; follow up assessment; improved; Institution; Instruction; interoperability; Interview; Knowledge; Learning; Librarians; Medical; Movement; Online Systems; Persons; Process; public health relevance; Publications; Reproducibility; Research; Research Personnel; research study; Research Support; research to practice; satisfaction; Slide; sound; success; Surveys; theories; Time; tool; Training; United States



1 R25 EB023929-01

Summer School: Big Data and Statistics for Bench Scientists Vitek, Olga ovitek@stat.purdue.edu Northeastern University Abstract: Northeastern University requests funds for a Summer School entitled Big Data and Statistics for Bench Scientists. The target audience for the School is graduate and postgraduate life scientists who work primarily in wet labs and who generate large datasets. Unlike other educational efforts that emphasize genomic applications, this School targets scientists working with other experimental technologies. Mass spectrometry-based proteomics and metabolomics are our main focus; however, the School is also appropriate for scientists working with other assays, e.g., nuclear magnetic resonance (NMR) spectroscopy, protein arrays, etc. This large community has been traditionally under-served by educational efforts in computation and statistics. This proposal aims to fill that void. The Summer School is motivated by feedback from smaller short courses previously co-organized or co-instructed by the PI, and will cover theoretical and practical aspects of design and analysis of large-scale experimental datasets. The Summer School will have a modular format, with eight 20-hour modules scheduled in two parallel tracks during two consecutive weeks. Each module can be taken independently. The planned modules are (1) Processing raw mass spectrometric data from proteomic experiments using Skyline, (2) Beginner's R, (3) Processing raw mass spectrometric data from metabolomic experiments using OpenMS, (4) Intermediate R, (5) Beginner's guide to statistical experimental design and group comparison, (6) Specialized statistical methods for detecting differentially abundant proteins and metabolites, (7) Statistical methods for discovery of biomarkers of disease, and (8) Introduction to systems biology and data integration. Each module will introduce the necessary statistical and computational methodology and contain extensive practical hands-on sessions.
Each module will be organized by instructors with extensive interdisciplinary teaching experience and supported by several teaching assistants. We anticipate the participation of 104 scientists, each taking on average 2 modules. Funding is requested for three yearly offerings of the School, and includes funds to provide US participants with 62 travel fellowships per year and 156 registration fee waivers per module. All the course materials, including videos of the lectures and of the practical sessions, will be publicly available free of charge. Public Health Relevance: Northeastern University proposes to organize a Summer School, `Big Data and Statistics for Bench Scientists'. The Summer School will train life scientists and computational scientists in designing and analyzing large-scale experiments relying on proteomics, metabolomics, and other high-throughput biomolecular assays. The training will enhance the effectiveness and reproducibility of biomedical research, such as discovery of diagnostic biomarkers for early diagnosis of disease, or prognostic biomarkers for predicting therapy response.
Project terms: base; Big Data; Biological Assay; biomarker discovery; Biomedical Research; Charge; Clinical Research; Communities; comparison group; Computing Methodologies; Data; data integration; Data Set; design; diagnostic biomarker; Disease; Early Diagnosis; Educational process of instructing; Effectiveness; experience; Experimental Designs; Feedback; Fees; Fellowship; Funding; Genomics; Hour; instructor; Interdisciplinary Study; interest; Investigation; learning materials; lectures; Life; Mass Spectrum Analysis; metabolomics; Methods; NMR Spectroscopy; Participant; Persons; Process; Prognostic Marker; Protein Array; protein metabolite; Proteomics; Reproducibility; Research; research study; response; Schedule; Schools; Scientist; Series; Statistical Methods; statistics; success; Systems Biology; teaching assistant; Technology; Time; Training; Travel; Universities; Vocabulary; Work



1 R25 EB023930-01

Curriculum in Biomedical Big Data: Skill Development and Hands-On Training Samore, Matthew H matthew.samore@hsc.utah.edu University of Utah Abstract: Advances in genomics and biomedicine generate huge amounts of data at the molecular level, while widespread implementation of electronic medical records has dramatically increased digital documentation of healthcare information. The objective of this proposal is to design, develop, and implement a cross-disciplinary, short-term training program in methods for utilizing Big Data in the health and biomedical science domain. Our curriculum will be organized as three two-week summer sessions: 1) "Biomedical Data Science Boot Camp"; 2) "Targeted Learning with Case-Based Interdisciplinary Collaborative Learning"; and 3) "Integration, Exploration, and Analysis of Heterogeneous Biomedical Data Sets for Prediction and Discovery". Our program will cover five knowledge categories: domain knowledge, computation, data representation, data visualization, and data analysis. These five areas will be integrated within an interactive learning environment that builds on principles of case-based learning and follows a flipped-classroom paradigm. Program participants will be recruited from among trainees in biomedical informatics, molecular biology, population health sciences, and clinical investigation. Our curriculum will enable collaborating scientists to better comprehend the scientific language used in other disciplines. This will enhance communication and facilitate development of a synergistic environment, leading to improved overall quality of research studies. Upon completion of our short-term training, participants will be equipped to bootstrap their own big data research projects, participate in teams tackling big data problems in biomedicine, and pursue additional training in the biomedical data sciences.
The University of Utah is an ideal place for conducting this training, with its superb collaborative academic environment. The University is home to the oldest academic biomedical informatics department in the country, Intermountain Healthcare is recognized as a world leader in integrating informatics into clinical care, and the Salt Lake VA is the center of the VA's major informatics initiatives. Researchers from these three sites are being brought together to train the next generation of big data professionals equipped to propel data-driven analytics in medicine. Public Health Relevance: The big data revolution has transformed the healthcare industry and is poised to be the conduit that brings bench research to the bedside. The benefits of big data are innumerable, from personalizing medicine to improving overall quality and reducing the costs of healthcare. However, the number of skilled professionals proficient in handling big, heterogeneous, fragmented, incomplete, unstructured biomedical data is inadequate to keep pace with the rapidly increasing data volumes. We propose a curriculum encompassing foundations in biomedicine and computational technologies, catering to the needs of both clinicians and quantitative technologists who come together in collaborative and innovative research using big data.
Project terms: Area; big biomedical data; Big Data; biomedical informatics; Camping; Case Based Learning; Case Study; case-based; Categories; clinical care; clinical decision-making; clinical investigation; Clinical Sciences; Collaborations; Communication; Computerized Medical Record; Country; Data; Data Analyses; Data Reporting; Data Science; Data Set; data visualization; design; Development; digital; Discipline; Documentation; educational atmosphere; Educational Curriculum; Environment; Equilibrium; Evaluation; evidence base; Faculty; faculty mentor; Feedback; flipped classroom; formative assessment; Foundations; Genomics; Goals; hands-on learning; Health; Health Care Costs; Health Sciences; Healthcare; Healthcare Industry; Home environment; Imagery; improved; Informatics; innovation; Interview; Knowledge; Language; Learning; learning outcome; Measures; Medicine; meetings; Methods; Molecular; Molecular Biology; Monitor; next generation; Participant; personalized medicine; population health; Process; Program Effectiveness; programs; Recruitment Activity; Research; Research Personnel; Research Project Grants; research study; Science; Scientist; Site; skill acquisition; skills; Skills Development; Sodium Chloride; Students; summer program; Surveys; Technology; Testing; Training; Training Programs; Universities; Utah



1 R25 EB022368-02

Big Data Training for Translational Omics Research Zhang, Min minzhang@purdue.edu Purdue University http://www.stat.purdue.edu/bigtap/ Abstract: The explosion of biomedical big data (e.g., imaging, clinical records, and "omic" analyses) that capture multiple levels of complexity has the potential to dramatically accelerate the translation of knowledge from bench to bedside. However, the effective use of these data requires skills in computer science, statistics, and bioinformatics, as well as detailed knowledge of biology and medicine to aid in the interpretation of the data analysis. Unfortunately, biomedical researchers are not trained in the computational and statistical methods needed to handle high-density biomedical big data. As a result, many biomedical scientists are frustrated by their inability to (a) analyze big data, (b) utilize the valuable public resources containing big data, and (c) effectively communicate with computer scientists, statisticians, and bioinformaticians. These barriers have significantly hampered the translational application of the large body of big data accumulated thus far. To overcome these challenges, this team proposes to create a summer training course that is built upon case studies and specifically designed for biomedical researchers who are novices in big data analysis. The investigators identified the need for this course in a survey of administrators and researchers at Midwest and Big Ten universities. This course will raise awareness of the potential uses of biomedical big data and will develop skills for locating, accessing, managing, visualizing, analyzing, and integrating the various types of big data that are publicly available.
The proposed big data training program has three goals: (1) introduce the fundamental concepts of big data in biomedical research to raise awareness of the value of this research approach, (2) provide face-to-face instruction that develops the technical competency needed for big data science, and (3) develop educational and data analysis resources using the HUBzero platform to aid our face-to-face instruction and provide post-instruction opportunities for reinforcing and expanding technical skills. The course will exploit available big data resources and tools so that biologists can productively explore big data within a short time. The educational program will target graduate students, postdoctoral trainees, physician-scientists and biomedical scientists, with strong biomedical backgrounds but who have limited advanced coursework in statistics, bioinformatics, and computer science. This course will be centered at Purdue University, a large public university with recognized strengths in statistics and computer science, with a goal to serve scientists in the Midwest area. Also, the HUBzero platform, a unique technology developed at Purdue, will be used to house computational tools and deliver the educational program, and to lower the technical barriers that challenge participants. This approach will complement the classical curricula in biomedical training programs and serve as a foundation for more advanced training. The proposed course is directly responsive to RFA-HG-14-008 because it will enable biomedical researchers to more confidently explore existing biomedical big data, implement their own data collection and analysis plans, and communicate within research teams. Public Health Relevance: Biomedical Big Data is a collection of high density information from many different sources, and it requires sophisticated statistical and computational tools to analyze and integrate. 
The proposed course has the potential to effectively train biomedical researchers to develop critical skills for locating, accessing, managing, visualizing, analyzing, and integrating various types of big data. This will accelerate the utilization of big data to improve human health. Project terms: Address; Administrator; Archives; Area; Awareness; base; bench to bedside; Big Data; Bioconductor; Bioinformatics; Biological; Biology; Biomedical Research; biomedical scientist; Case Study; Clinical; Collaborations; Collection; Communities; Complement; computer science; computerized tools; Computers; Computing Methodologies; Data; Data Analyses; Data Collection; density; design; Education; Educational Curriculum; Educational Materials; epigenome; experience; Explosion; Foundations; Gene Expression Profile; Genome; Goals; graduate student; Health; Home environment; Housing; Human; Image; improved; Institution; Instruction; interest; Knowledge; knowledge translation; Medical; Medicine; Participant; Physicians; Population; Positioning Attribute; prevent; programs; Proteome; public health relevance; Records; repository; Research; Research Personnel; Resources; response; Schools; Science; Scientist; skills; Source; Statistical Methods; statistics; Surveys; Technology; The Cancer Genome Atlas; Time; tool; Training; Training Programs; Universities



BD2K Training Effort Grant Information

Enhancing Diversity (dR25) Grants



1 R25 MD010391-02

Innovative Research Education and Articulation in the Preparation of Under-represented and First Generation Students for Careers in Biomedical Big Data Science Canner, Judith Elena jcanner@csumb.edu California State Univ, Monterey Bay https://csumb.edu/bd2k In 1994, California State University, Monterey Bay (CSUMB), a federally classified Hispanic Serving Institution, was established as a new university in the California State University System, with the specific vision to be distinctive in serving the diverse people of California, especially the working class and historically undereducated and low-income populations (CSUMB Vision Statement, 1994). With that vision in mind, we are building a research education program at CSUMB in biomedical data science that emphasizes 1) research experience, 2) disciplinary training, and 3) professional development for both CSUMB students and faculty. In general, the goals of the Biomedical Data Science Program at CSUMB are: 1) To establish a summer research program for CSUMB students at the Center for Big Data in Translational Genomics at UC Santa Cruz. The center works to help the biomedical community use genomic information to better understand human health and disease. Visiting CSUMB students will spend the summer working side-by-side with UCSC scientists and data specialists, learning research skills to manage and interpret genomic data. 2) To develop new programs, such as an interdisciplinary statistics major, that will include math, statistics, biology, behavioral and computer sciences to prepare students for graduate school and careers in research or industry. 3) To create opportunities to extend CSUMB faculty training and research in biomedical data science in collaboration with UCSC faculty members and researchers.

Over the past year, CSUMB has made considerable progress toward the above goals. In summer 2016, eight CSUMB students participated in an 8-week research program at the Center for Big Data in Translational Genomics at UC Santa Cruz. The students, whose backgrounds include mathematics/statistics, biology, and computer science, are working on a variety of projects associated with the Center. In addition, they have participated in professional development activities including training on the Genome Browser, cloud computing, abstract writing, and others. Upon their return to CSUMB, students will work to disseminate their research, continue their professional development, and apply for graduate programs related to biomedical data science. In the past year, faculty from fields related to biomedical data science have received support to develop the new programs and courses as well as to attend training opportunities that extend their training in bioinformatics, genomics, supervised and unsupervised statistical learning, and cloud computing/programming. Faculty also offered three new upper-division courses related to biomedical data science: Bioinformatics, Practical Computing for Scientists, and Data Mining. In addition, we are now in the planning process for a Statistics Major, a Bioinformatics Concentration in the Biology Major, a Data Science Concentration in the Computer Science Major, and a Data Science Minor. The Biomedical Data Science Program at CSUMB will continue to evolve and grow over the next four years, with increasing focus on recruitment of native and transfer students into the program, dissemination of curriculum and research, and research experiences for students and faculty, all designed to attract, support, and retain underrepresented groups in biomedical data science.
Keywords: Machine/statistical learning, Multivariate Methods, Population Genetics, Statistical Analysis, Computational Genomics, Computer science, Genetics and Genomics, Computational Biology



1 R25 MD010397-02

Big Data Discovery and Diversity Through Research Education Advancement Partnerships McEligot, Archana Jaiswal amceligot@fullerton.edu California State University, Fullerton The expansion of Big Data science has presented a compelling need to train a diverse biomedical workforce that has the ability to generate Big Data, as well as to utilize and apply them in various fields of study. California State University, Fullerton (CSUF), a primarily undergraduate institution classified as a Hispanic-serving institution, has partnered with the University of Southern California (USC) Big Data for Discovery Science (BDDS) NIH BD2K Center of Excellence. Our NIH-funded Big Data Discovery and Diversity through Research Education Advancement and Partnerships (BD3-REAP) program aims to: (i) train and engage three cohorts of six predominantly under-represented students; (ii) train CSUF faculty on Big Data research; and (iii) develop and integrate Big Data science curricula within and between departments at CSUF. For the first cohort, the program received a total of 22 applications (n=8 males, n=14 females), in which 59% of the applicants are first-generation college students and 82% are eligible for financial aid. The ethnic background of the applicants consists of 45% Hispanic/Latino, 5% African-American, 5% White, 27% Asian-American, 5% Filipino-American, 9% Middle-Eastern, and 5% Other or mixed race. Curricula development has included: a) formulating key Big Data science student learning objectives, and b) new course syllabus/curricula development via program faculty feedback, with approvals by department faculty and university administration for course offering in Spring 2016. Student learning objectives include general comprehension of brain health, identifying challenges and sources of Big Data, and subsequently analyzing and synthesizing Big Data.
Additionally, CSUF faculty, along with their USC collaborators, have developed a thirty-minute instructional video that complements a PowerPoint slide presentation used to introduce and integrate Big Data science into the curricula of various courses in the Colleges of Health and Human Development and of Natural Sciences and Mathematics, and potentially two others. Subsequently, the cohort of six undergraduate students will undertake independent research projects for two years and will be enrolled in two required courses in spring 2017; in summer 2017, the cohort will spend several weeks in further training in neuroimaging and Big Data analysis at USC. The program participants are expected to disseminate their work through conference attendance and peer-reviewed publication. Overall, it is feasible to engage and promote diverse student interest in Big Data science programs; project aims are being accomplished according to the proposed timeline, and additional progress toward the above-mentioned goals is expected in the coming months. Publications: McEligot, A. J., Behseta, S., Cuajungco, M. P., Van Horn, J. D., & Toga, A. W. (2015). Wrangling Big Data Through Diversity, Research Education and Partnerships. Californian Journal of Health Promotion, 13(3). Keywords: Data mining, Differential Equation Modeling, Image Analysis, Bayesian Methods, Mathematical statistics, Population Genetics, Predictive analytics, Statistical Analysis, Visualization, Data analysis, Health Disparities, Neuroscience, Epidemiology




CAREER PATHS: Epidemiology; Biostatistics; Professor/Academia; Public Health; Biology; Researcher; Software Developers; Computer Science; Information Technology; Data Science; Informatics

PROJECT TEAM:
- Dr. Archana McEligot, Professor of Health Science at CSUF
- Dr. Math Cuajungco, Professor of Biological Science at CSUF
- Dr. Sam Behseta, Professor of Mathematics at CSUF
- Dr. Arthur Toga, Professor of Medicine at USC
- Dr. Jack Van Horn, Professor of Neurology at USC

QUESTIONS? Contact Harman Chauhan, Project Coordinator: hchauhan@csu.fullerton.edu

Big Data Discovery and Diversity Through Research Education Advancement and Partnerships (BD3-REAP)

For programmatic questions, contact Dr. Archana J. McEligot, Program Director: amceligot@fullerton.edu

Providing enriching research experiences and opportunities through exploration and understanding big data sources, diversity, analytics and neuroimaging in efforts to improve health.

BD3-REAP: BIG DATA DISCOVERY AND DIVERSITY THROUGH RESEARCH EDUCATION ADVANCEMENT AND PARTNERSHIPS. SUPPORTED BY A GRANT FROM THE NIH/NIMHD, #1R25MD010397-01



1 R25 MD010396-02

Fisk University/UIUC-Mayo KnowENG BD2K Center R25 Partnership Qian, Lei lqian@fisk.edu Fisk University http://www.knoweng.org/R25-Partnership The overall goal of the Fisk-UIUC KnowEnG R25 program is to recruit and retain a cadre of underrepresented minority scientists prepared to compete for PhD training in biomedical research, with confidence already acquired in the use of Big Data. The partnership with the KnowEnG BD2K Center at UIUC will permit curricular enhancements and summer research opportunities for Fisk trainees. At the same time, it will reciprocally train natural science and mathematics majors in complementary computer and informatics sciences, and provide computer science and mathematics undergraduates at Fisk University with essential systems, molecular, and cell biology/biochemistry background, giving them context for cutting-edge genomics, proteomics, and individualized medicine research reliant on Big Data. In addition to curricular and research training program elements, Fisk students will have remote access to seminar courses to increase efficacy in communicating BD2K-based technologies and their applications. Didactic work and undergraduate research experiences will be complemented by an individualized student development plan for honing professional skills, building a deep understanding of the responsible conduct of research, and providing wraparound mentoring to assure subsequent successful entry into competitive, BD2K-aligned PhD-granting programs. UIUC-hosted summer workshops for faculty will increase confidence in the use of Big Data tools, leading to innovations in STEM courses that embrace Big Data and impacting all Fisk STEM undergraduates. Research collaborations between Fisk and BD2K partner faculty also will be fostered.
The proposed program will increase both didactic and research experiences in Big Data for Fisk University undergraduates while preparing them for successful entry into PhD-granting programs in related disciplines at research-intensive universities. Our KnowEnG partnership also will increase Fisk faculty capacity in Big Data use and foster faculty research collaborations, thus introducing Big Data into course-embedded research and impacting all Fisk University STEM majors. Reciprocally, our KnowEnG UIUC faculty partners will enrich their holistic mentoring skills for URM trainees through interactions with Fisk R25 mentors, of value for their broader education and research training goals at UIUC and Mayo. Publications: doi: 10.1186/s12859-016-1117-3. Keywords: Information science, Machine/statistical learning, Pattern recognition and learning, Computational Genomics, Computer science, Data analysis, Genetics and Genomics, Chemical Biology, Computational Biology



1 R25 MD010399-02

Increasing Diversity in Interdisciplinary BD2K (IDI-BD2K) Garcia-Arraras, Jose E; Ordonez-Franco, Patricia; Pérez, Maria-Eglee jegarcia@hpcf.upr.edu University Of Puerto Rico Rio Piedras http://idi-bd2k.hpcf.upr.edu/

The Initiative for Increasing Diversity In Interdisciplinary BD2K (IDI-BD2K) is set to increase the diversity of the BD2K scientific community. It will do so by joining NIH in its effort to increase the number of underrepresented researchers, both students and faculty, in Big Data Science and its applications to biomedical research. Our IDI-BD2K project presents a series of activities that will attract students to the field of Big Data biomedical research, provide them with the courses and training necessary to perform Big Data research, direct them to participate in “hands-on” interdisciplinary biomedical Big Data research experiences, and create a community of BD2K researchers at UPR Río Piedras. The IDI-BD2K program objective is to enhance student preparation and support so that students continue on to graduate studies in Big Data research and eventually enter the academic and professional community of investigators doing Big Data biomedical research. To achieve our goals, UPR-RP has partnered with BD2K centers at Harvard University, the University of Pittsburgh, and the University of California Santa Cruz to offer summer research experiences to at least six of our undergraduate students per year. The program will also offer opportunities for faculty to develop their research and expertise in Big Data research through sabbaticals and workshops with the BD2K center faculty. In addition, the development and integration of new courses, workshops, seminars, and online course modules in Big Data Science into the undergraduate curricula of the Natural Sciences College will serve to expand the research infrastructure and capabilities of UPR-RP researchers and increase UPR-RP's contribution to the field of Big Data Science, which permeates all of modern science and technology.
Keywords: Bayesian Methods, Machine/statistical learning, Medical informatics, Multivariate Methods, Causal Analysis, Statistical Analysis, Visualization, Computational Genomics, Computer science, Data analysis, Genetics and Genomics, Health Disparities, Microbiology and Infectious Diseases, Cancer, Computational Biology



BD2K Training Effort Grant Information

Scientific Training and Career Development (K01) Grants



1 K01 ES026834-02

Advancing Outcome Metrics in Trauma Surgery Through Utilization of Big Data Callcut, Rachael A callcutr@sfghsurg.ucsf.edu University Of California, San Francisco Abstract: My goal in seeking a K01 Award is to acquire the necessary training to become an independently funded investigator focused on exploiting the power of biomedical Big Data Science to improve outcomes following severe injury. I am a trauma surgeon at San Francisco General Hospital, one of the nation's leading trauma centers, and an Assistant Professor of Surgery at the University of California San Francisco (UCSF). UCSF has recently entered into collaboration with the National Laboratories to study the use of biomedical Big Data in complex clinical conditions, and my main mentor, Dr. Mitchell J. Cohen, is the lead investigator at UCSF for this collaboration. Given the complexity of the factors that likely affect trauma outcome, including patient injury patterns, medical co-morbidities, patient biology, and the system of care, I believe trauma provides a solid foundation for studying the utility of Big Data Science in solving complex medical questions. To facilitate my growth as an expert in this field, I propose to develop a framework for integrating the multiple data sources necessary to forecast patient outcomes following trauma. These novel datasets, combined with biologic data and metadata, will then be utilized to create improved metrics that better predict complication risk from modifiable and non-modifiable factors. The net result of this work is a new approach to data ascertainment for measuring outcome, leveraging new data types to improve prediction of patient trajectory, and creating a platform that interfaces with existing information technology, ultimately to be used as an early warning detection system for patients at risk of complications.
The long-term goal of this work is to identify early the patients predicted to do poorly and then apply refinements to the process of care to minimize complication development. The creation of early warning detection systems has significant theoretical potential to improve quality and ultimately decrease costs. Nearly $30 billion per year is spent in the US on care for the traumatically injured, and the development of post-traumatic complications is believed to be a major contributor to the overall costs of care. The ability to report performance has been hampered by a lack of standard definitions, reporting bias, limited access to datasets, and analysis techniques that fail to account for the highly confounded relationships contributing to patient outcome. This K01 award will provide me with the support necessary to accomplish the following goals: (1) to become an expert in applying biologic big data to trauma care; (2) to elucidate the relationship of modifiable factors affecting complication development;



K01 Grants

(3) to gain experience with advanced biostatistical techniques and bioinformatics; and (4) to develop an independent clinical research career. To achieve these goals, I have assembled a multidisciplinary team including Dr. Cohen, a national expert in trauma systems biology and biologic big data, and two co-mentors: Dr. Michael Matthay, a translational research expert in complications after severe illness, and Dr. Alan Hubbard, an expert in advanced biostatistical techniques including biologic big data analysis. Public Health Relevance: In the US, trauma is the leading cause of death for those under 45 years old, and many of the patients who survive their initial injuries develop complications such as blood clots or pneumonia that contribute to both death and the long-term effects of the trauma. By leveraging the power of biomedical Big Data, an integrated approach to measuring outcome will be developed utilizing biologic, clinical, and electronic medical record (EMR) data. The goal of this project is to lay the groundwork for developing integrated EMR early warning detection systems that could identify those at risk of complications early, with the intent of ultimately refining the process of care for this group to minimize complication development.
Publications: doi: 10.1097/TA.0000000000000930; doi: 10.1097/TA.0000000000001313. Keywords: Accounting; Address; Affect; Age; Algorithms; American; base; Benchmarking; Big Data; Bioinformatics; Biology; Biometry; Blood coagulation; California; care systems; career; Caring; Cause of Death; Cellular Phone; Cessation of life; Characteristics; Clinical; Clinical Research; cloud based; Collaborations; Comorbidity; Complex; Complication; Computerized Medical Record; cost; Data; Data Analyses; Data Element; Data Set; Data Sources; Demographic Aging; Detection; Development; disability; Effectiveness; experience; follow-up; Foundations; Funding; Future; General Hospitals; Geographic Locations; Goals; Growth; High Performance Computing; Hospitalization; Hospitals; Hybrids; improved; Individual; Informatics; Information Technology; injured; Injury; Knowledge; Laboratories; Lead; Long-Term Effects; Machine Learning; Medical; Mentored Research Scientist Development Award; Mentors; Metadata; Methods; Modeling; modifiable risk; Morbidity - disease rate; Mortality Vital Statistics; multidisciplinary; Myocardial Ischemia; novel; novel strategies; Operative Surgical Procedures; Outcome; Outcome Measure; Patient Care; Patients; Pattern; Performance; Physiological; Play; Pneumonia; Process; professor; public health relevance; Reporting; Research; Research Personnel; Risk; Risk Factors; San Francisco; Science; Severities; Socioeconomic Status; Solid; Source; Statistical Methods; Surgeon; System; Systems Biology; Techniques; Technology; Testing; Time; Training; Translational Research; Trauma; trauma care; trauma centers; Trauma patient; Universities; Work



1 K01 ES026837-02

Data-Mining Clinical Decision Support from Electronic Health Records Chen, Jonathan Hailin jonc101@stanford.edu Stanford University Public Health Motivation: National healthcare quality is compromised by undesirable variability, reflected in different locales having anywhere from 20-80% compliance with evidence-based guidelines. Much of this is due to uncertainty, with half of clinical practice guidelines lacking adequate evidence to confirm their efficacy. This is unsurprising when clinical trials cost >$15 million to answer individual clinical questions. The result is medical practice routinely driven by individual opinion and anecdotal experience. While Big Data has revolutionized how society processes internet-scale information, the status quo in clinical decision making remains the manual interpretation of literature and isolated decision aids. The adoption of electronic health records (EHR) creates a new opportunity to answer a “grand challenge in clinical decision support (CDS).” In a learning health system, we could automatically adapt knowledge from the collective expertise embedded in the EHR practices of real clinicians and close the loop by disseminating that knowledge back as executable decision support. Candidate Goals and Objectives: The unifying goal of this BD2K K01 proposal is the mentored career development of Jonathan H. Chen, MD, PhD. This proposal will accelerate his transition into an independent physician scientist, towards his long-term goals to produce Big Data technologies that answer such grand challenges in clinical decision support. His near-term objective is developing methods to translate EHR data into useful knowledge in the form of patient-specific, point-of-care clinical order recommendations for acute medical hospitalizations. His doctoral background in computer science gives him the technical capability to achieve these objectives, while his medical training will ensure clinically meaningful results.
His preliminary work to build an order recommender, analogous to commercial product recommenders, demonstrates the proposal’s overall feasibility. Research Aims: The overriding hypothesis of the proposal is that clinical knowledge reflected in clinical order patterns from historical EHR data can improve medical decision making when adapted into functional clinical decision support. The specific aims each address components of this concept, as they seek to: (1) Develop the algorithms to learn clinical order patterns from historical EHR data, building on a preliminary recommender system; (2) Assess how underlying clinician proficiency affects the quality of those learned clinical order patterns through observational data inference against external standards; and (3) Determine the impact of automatically
learned clinical decision support (CDS) on (simulated) clinical workflows through a randomized controlled crossover trial of human-computer interfaces with real clinicians. Expected Results and General Significance: By the completion of the proposed work, Dr. Chen will answer the grand challenge in clinical decision support (CDS) by automating much of the CDS production process, and will have direct translational impact with a prototype system. This will advance the field with new paradigms for generating and disseminating clinical knowledge, which can then improve the consistency and quality of healthcare delivery. With this applied research experience and career development, Dr. Chen can compete for R01 funding and become an independent physician scientist developing Big Data approaches to solve national healthcare problems in clinical decision making. Publications: doi: 10.1093/jamia/ocv091; doi: 10.1093/jamia/ocw136; doi: 10.1001/jamainternmed.2015.6831; doi: 10.1186/s12909-016-0665-6; doi: http://dx.doi.org/10.1142/9789814749411_0019; doi: 10.1016/j.amjmed.2016.03.023. Keywords: Data mining, Machine/statistical learning, Medical informatics, Operations research, Pattern recognition and learning, Predictive analytics, Search engines, Computer science, Social and Behavioral Sciences, Clinical Research, Epidemiology
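The recommender idea described in this abstract, which mines clinical order patterns from historical EHR data by analogy to commercial product recommenders, can be illustrated with a minimal co-occurrence model. The encounter data and order names below are invented for illustration; the actual system developed under this award may work quite differently.

```python
from collections import Counter
from itertools import combinations

# Toy historical data: each inner list holds the orders placed during one
# hypothetical hospital encounter (names are illustrative, not real data).
encounters = [
    ["cbc", "bmp", "blood_culture", "ceftriaxone"],
    ["cbc", "bmp", "troponin", "ecg"],
    ["cbc", "blood_culture", "ceftriaxone", "lactate"],
    ["cbc", "troponin", "ecg", "aspirin"],
    ["cbc", "bmp", "lactate", "blood_culture"],
]

# Count how often each order occurs and how often each pair co-occurs.
item_counts = Counter()
pair_counts = Counter()
for orders in encounters:
    unique = sorted(set(orders))
    item_counts.update(unique)
    pair_counts.update(combinations(unique, 2))

def recommend(order, top_n=3):
    """Rank other orders by P(other | order), estimated from co-occurrence."""
    scores = {}
    for (a, b), n_ab in pair_counts.items():
        if order == a:
            scores[b] = n_ab / item_counts[order]
        elif order == b:
            scores[a] = n_ab / item_counts[order]
    return sorted(scores.items(), key=lambda kv: (-kv[1], kv[0]))[:top_n]

print(recommend("blood_culture"))
```

Given an entered order, the model suggests the orders that most often accompanied it historically; real clinical recommenders would add patient context, timing, and safeguards against learning poor practice.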



1 K01 ES025437-03

Novel Methods to Identify Momentary Risk States for Stress & Physical Inactivity Coffman, Donna Lynn dcoffman@temple.edu Temple University Behavioral risk factors, such as physical inactivity, poor stress management, poor diet, and smoking, are responsible for about 80% of coronary heart disease and cerebrovascular disease (WHO, 2011), and they are partly responsible for other negative health outcomes, such as high lipids, high blood pressure, cancer, diabetes, and obesity. Thus, helping individuals to change unhealthy behaviors and maintain healthy ones can decrease morbidity and mortality from cardiovascular disease, diabetes, and cancer. It can also help individuals manage symptoms of osteoarthritis, menopause, and other physical issues that impact health and quality of life. This BD2K K01 proposal is for the purpose of establishing myself as an independent researcher in the development and application of big data methods for health-behavior change and maintenance. I aspire to improve public health by developing, extending, and applying big data methods in biobehavioral health research to help individuals develop and maintain positive health behaviors, specifically those related to physical activity, diet, and stress management. To achieve this goal, my training plan focuses on developing expertise in (a) statistical methods for the analysis of large, complex, high-dimensional data; (b) computer science and informatics, along with advanced high-performance computing topics for accessing, managing, and processing big data; and (c) the theories and measurement of health behavior change, specifically with regard to intensive assessment of physical activity and stress management.
In order to examine individuals’ engagement in physical activity, stress management, and other health behaviors, I will use this funding to develop and apply big data methods that can integrate data across multiple time scales and studies, better infer causality, and account for dependencies (e.g., time-structure, dyads) in the data. I will publish manuscripts on these methods in both technical and health-behavior journals, and I will disseminate software to clinical researchers so that they can use these methods in their work. At the completion of this grant, I will be prepared to make important contributions as the data scientist on interdisciplinary teams that develop health behavior interventions. This work will have broad implications for public health, in particular for the development of adaptive, individualized, health-behavior interventions delivered in real-time, real-world contexts. PUBLIC HEALTH RELEVANCE: Healthy behaviors such as physical activity can decrease the risk of cardiovascular disease, diabetes, and other adverse health outcomes. Increasingly, researchers collect vast amounts of data related to these behaviors and outcomes, but new statistical methods are needed to take advantage of the spectacular opportunities these data present. The proposed research will develop and apply big data methods to promote health behavior change; these methods will have broad implications for public health, particularly for the development of adaptive, individualized, health-behavior interventions that will deliver a specific intervention at the specific moment when it is needed, thereby increasing efficiency and effectiveness of interventions while decreasing participant burden. Publications: doi: 10.1007/s10742-016-0157-5. Keywords: Multivariate Methods, Causal Analysis, Statistical Analysis, Data analysis, Social and Behavioral Sciences, Epidemiology
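One standard building block behind the "better infer causality" goal above is inverse-probability weighting: each observation is reweighted by the probability of the exposure it actually received, so a confounder no longer distorts the exposure-outcome comparison. The sketch below uses fabricated data and a propensity model known by construction (in practice it would be estimated, e.g. by logistic regression); it illustrates the general technique only, not the methods proposed in this grant.

```python
import random

random.seed(0)

# Fabricated observational data: a confounder (baseline stress) influences
# both the exposure (physical activity) and the outcome.
data = []
for _ in range(5000):
    stress = random.random()                       # confounder in [0, 1)
    p_active = 0.8 - 0.6 * stress                  # stressed people exercise less
    active = random.random() < p_active
    outcome = 2.0 * active - 3.0 * stress + random.gauss(0, 0.5)
    data.append((stress, active, outcome))

def propensity(stress):
    """P(active | stress): known here by construction, estimated in practice."""
    return 0.8 - 0.6 * stress

def ipw_effect(data):
    """Difference of inverse-probability-weighted outcome means by exposure."""
    num1 = den1 = num0 = den0 = 0.0
    for stress, active, y in data:
        p = propensity(stress)
        if active:
            w = 1.0 / p
            num1 += w * y
            den1 += w
        else:
            w = 1.0 / (1.0 - p)
            num0 += w * y
            den0 += w
    return num1 / den1 - num0 / den0

print(ipw_effect(data))  # should land near the true effect of 2.0
```

A naive comparison of means would be biased here, because active individuals tend to have lower stress; the weighting removes that imbalance.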


1 K01 ES025432-01

Imaging genomics bases of pediatric executive functioning Avants, Brian avants@grasp.cis.upenn.edu University Of Pennsylvania Abstract: We propose to establish new integrative and data-driven methods for building systems neuroscience models of executive functioning during childhood. We will develop these methods by employing public datasets that contain complementary quantitative biological metrics of genomics, structural and functional neuroimaging, and cognitive performance. By studying the multi-dimensional predictors of executive functioning during childhood, the Principal Investigator will gain invaluable protected time that will enhance his career as a big data scientist. The candidate will also gain experience in the new science of integrating imaging with genomics and in the statistical methods necessary to optimally construct sparse and interpretable low-dimensional models from high-dimensional data. We will determine validity by comparing models between training and independent testing datasets. Ultimately, this research will identify a fusion of genomic and imaging predictors of executive functioning in normal subjects and show how these relationships are modified by environment. Public Health Relevance: This project will train an accomplished imaging scientist to perform big data science based on new quantitative biomarkers. Using large datasets, this study will reveal mechanisms that may relate to early signs of neuropsychiatric disorders or risk for such disorders. This research will inform prevention and intervention strategies that may improve childhood outcomes. Publications: doi: 10.1016/j.dadm.2016.03.001; doi: 10.1016/j.neuroimage.2014.10.052; doi: 10.1038/nmeth.3797; doi: 10.1016/j.ymeth.2014.10.016; doi: 10.1016/j.neuroimage.2014.05.026.
Keywords: Academic Training; Acceleration; Address; Adolescence; Adolescent; Affect; Age; Algorithms; Area; Attention deficit hyperactivity disorder; base; Big Data; Biological; Biological Markers; Brain; career; Child; Childhood; Cognition; Cognitive; cohort; Confidence Intervals; Data; Data Set; Development; Disease; Environment; Environmental Risk Factor; Equilibrium; executive function; experience; functional/structural genomics; Genetic; Genomics; Genotype; Healthcare; Image; imaging modality; improved; innovation; insight; Intervention; Knowledge; Lead; Linear Algebra; Magnetic Resonance Imaging; Maps; Mathematics; Measurement; Mental Health; Mentors; Methods; Metric; Modality; Modeling; Neurocognition; neurodevelopment; neuroimaging; neuropsychiatry; Neurosciences; novel; Outcome; Performance; Philadelphia; Population; Predictive Value; Prevention strategy; Principal Investigator; Psyche structure; Psychometrics; public health relevance; Randomized; relating to nervous system; Relative (related person); Research; Risk; Sampling; Science; Scientist; Single Nucleotide Polymorphism; Site; Socioeconomic Status; Software Validation; Statistical Methods; success; System; Testing; Thinking, function; Time; Training; trait; Validation; white matter; Work



1 K01 ES026835-02

New Tools for the interpretation of Pathogen Genomic Data with a focus on Mycobacterium tuberculosis Farhat, Maha mrfarhat@partners.org Massachusetts General Hospital Infectious diseases continue to be a major cause of morbidity and mortality. Despite the availability of effective antimicrobials, pathogens are successfully evolving new disease phenotypes that allow them to resist killing by these drugs or in other instances cause more severe disease manifestations or wider chains of transmission. Drug resistance (DR) is now common and some bacteria have even become resistant to multiple types or classes of antibiotics. A key strategy in the fight against emerging pathogen phenotypes in infectious diseases is surveillance, and early personalized therapy to prevent transmission and propagation of these strains. The timely initiation of antibiotic therapy to which the pathogen is sensitive has been shown to be the key factor influencing treatment outcome for a diverse array of infections. Molecular tests that rely on the detection of microbial genetic mutations are particularly promising for surveillance and diagnosis of these pathogen phenotypes but rely on a comprehensive understanding of how mutations associate with these pathogen phenotypes. Currently there is an explosion of data on pathogen whole genome sequences (WGS) that is increasingly generated from clinical laboratories. Data on disease phenotype may also be available, but methods for the analysis and interpretation of these Big Data are lagging. Here I propose tools to aid in this analysis leveraging Big Data sets from Mycobacterium tuberculosis (MTB) and my prior work. 
Specifically, I propose (1) to develop a web-based public interface to several analysis tools, including a statistical learning model that can predict the MTB DR phenotype from its genomic sequence, (2) to develop and study an MTB gene-gene network, based on WGS data, to improve our understanding of the effect of mutation-mutation interactions on the DR phenotype, and (3) to study the performance of methods in current use for the association of genotype and phenotype in pathogens, and to develop a generalizable power calculator for the best-performing method. Publications: doi: 10.1164/rccm.201510-2091OC; doi: 10.1186/s13073-014-0101-7; doi: 10.1038/ng.2747; doi: 10.1128/JCM.02775-15. Keywords: Machine/statistical learning, Population Genetics, Computational Genomics, Data analysis, Genetics and Genomics, Microbiology and Infectious Diseases, Epidemiology
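The genotype-phenotype association task at the heart of aims (1) and (3), linking the presence of specific mutations to a drug-resistance (DR) phenotype, can be sketched as a 2x2 contingency analysis per mutation. The isolate data below are toy values and the mutation names merely illustrative; this is not the statistical learning model from the proposal.

```python
# Toy isolates: (set of mutations observed, resistant phenotype?).
# Data are fabricated for illustration.
isolates = [
    ({"rpoB_S450L", "katG_S315T"}, True),
    ({"rpoB_S450L"}, True),
    ({"katG_S315T"}, True),
    ({"gyrA_A90V"}, False),
    (set(), False),
    ({"gyrA_A90V", "rpoB_S450L"}, True),
]

def association(mutation):
    """Smoothed odds ratio relating mutation presence to resistance."""
    a = sum(1 for m, r in isolates if mutation in m and r)      # mut & resistant
    b = sum(1 for m, r in isolates if mutation in m and not r)  # mut & susceptible
    c = sum(1 for m, r in isolates if mutation not in m and r)
    d = sum(1 for m, r in isolates if mutation not in m and not r)
    # Haldane-Anscombe correction (+0.5) avoids division by zero on sparse counts.
    return ((a + 0.5) * (d + 0.5)) / ((b + 0.5) * (c + 0.5))

print(association("rpoB_S450L"))  # > 1 suggests association with resistance
```

A predictive model would combine many such variant features (and correct for population structure); the odds ratio here just shows the elemental genotype-phenotype association step.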



1 K01 ES025434-03

An Integrative Bioinformatics Approach to Study Single Cancer Cell Heterogeneity Garmire, Lana X lgarmire@cc.hawaii.edu University Of Hawaii At Manoa Abstract: My long-term career goal is to become a leading expert in translational bioinformatics who creates, develops, and applies computational and statistical methods to reveal landscapes of cancers and to identify strategies to cure cancers. Human cancers are highly heterogeneous. Such heterogeneity is the major source of the ultimate failure of most cancer agents. However, due to the limits of technology, intercellular heterogeneity had not been investigated genome-wide, at the single-cell level, until recently. New technologies such as single-cell transcriptome sequencing (RNA-Seq) and exome sequencing have revealed new insights and more profound complexity than was previously thought. However, so far these technologies are limited to one assay per cell. It remains a grand challenge to perform multiple, integrative assays from the same single tumor cell, in particular from those derived from small tumor biopsies. Given the stochasticity at single-cell resolution, reproducibility and sensitivity are daunting tasks. To overcome this challenge, I have started the single cancer cell sequencing analysis project, in collaboration with Dr. Sherman Weissman at Yale University, who is also my co-mentor on this K01 proposal. My immediate career goal is to identify genome-wide heterogeneity among single cancer cells, using the erythroleukemia K562 cell line. Towards this, I am proposing a research project on an integrative bioinformatics approach to analyze multiple types of genomics data generated from the same single leukemia cells, a timely and critical topic.
Specifically, I am interested in pursuing the following specific aims: (1) building a bioinformatics pipeline to study heterogeneity of single-cell RNA-Seq, (2) building a bioinformatics pipeline to study the CpG methylome of single cells, (3) building a bioinformatics pipeline to study single-cell Exome-Seq, and (4) integrating the RNA-Seq, methylome, and Exome-Seq data generated from the same single cells. These single-cell genomic data are provided by Dr. Sherman Weissman's lab from 30 single K562 erythroleukemia cells. I will first construct and validate, in parallel, the RNA-Seq, methylome, and Exome-Seq bioinformatics pipelines optimized for single-cell analysis, and then develop and validate an integrative platform to analyze these multiple types of high-throughput data. To accomplish the research project, and to successfully transition from junior faculty to expert in the field, I have developed a career plan with my mentoring committee, composed of four world-class experts in different fields relevant to Big Data Science: Primary Mentor Dr. Jason Moore in Bioinformatics from Dartmouth
College, Co-mentor Dr. Sherman Weissman in Single-cell Genomics and Genetics from Yale University, Co-mentor Dr. Herbert Yu in Cancer Epidemiology from the University of Hawaii Cancer Center, and Co-mentor Dr. Jason Leigh in Big Data Visualization from the Information and Computer Science Department of the University of Hawaii Manoa. I will work primarily with my four co-mentors on planning the development of my career during this award. Public Health Relevance: The goal of this K01 proposal is to integrate multiple types of high-throughput data, in particular the transcriptome, exome-sequencing, and CpG methylome data generated from single cancer cells. The proposed project is designed to address the urgent need for an integrative bioinformatics platform for the mega-data generated by next-generation sequencing applications. It is also aimed at studying the fundamental sources of tumor heterogeneity. Publications: doi: http://dx.doi.org/10.1142/9789813207813_0059; doi: 10.3389/fgene.2016.00163; doi: 10.1186/1471-2105-16-S5-S10; doi: 10.1038/srep37446; doi: 10.1158/1055-9965.EPI-16-0260; doi: 10.1186/s13073-016-0289-9; doi: 10.1016/j.ebiom.2016.03.023; doi: 10.1186/s13148-015-0052-x; doi: 10.1186/s13040-015-0075-z; doi: 10.1038/nature14452; doi: 10.18632/oncotarget.5409; doi: 10.1186/s13059-014-0500-5.
Keywords: Acute Erythroblastic Leukemia; Address; Alternative Splicing; Award; base; Big Data; Bioinformatics; Biological; Biological Assay; Biopsy; cancer cell; Cancer Center; cancer epidemiology; career; Cell Line; Cells; Clinical; Collaborations; college; computer science; Computer software; computerized data processing; Computing Methodologies; Copy Number Polymorphism; Data; Data Analyses; data integration; Data Set; design; Development; Development Plans; Diagnosis; Event; Evolution; Excision; exome; exome sequencing; Faculty; Failure (biologic function); Galaxy; Gene Expression Profile; Generic Drugs; Genes; Genetic; Genetic Polymorphism; Genome; genome-wide; Genomics; Goals; Hawaii; Heterogeneity; Human; Imagery; improved; Individual; Information Sciences; innovation; insertion/deletion mutation; insight; interest; K-562; K562 Cells; Laboratories; leukemia; Machine Learning; Malignant Neoplasms; Mentors; Messenger RNA; Methodology; Methods; Methylation; methylome; Microfluidic Microchips; Modeling; Molecular; neoplastic cell; new technology; next generation sequencing; Noise; novel; Nucleotides; open source; outcome forecast; Pattern; Play; Population; Process; public health relevance; Quality Control; Reproducibility; Research Personnel; Research Project Grants; research study; Resolution; restriction enzyme; Role; Sample Size; Science; Sequence Analysis; single cell analysis; Source; Statistical Methods; success; Technology; Testing; Therapeutic; Tissues; tool; transcriptome sequencing; tumor; Universities; user-friendly; Variant; Work
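Aim (4), integrating RNA-Seq, methylome, and Exome-Seq data measured on the same single cells, is structurally a join of per-assay tables on cell identifiers. The following is a minimal sketch with invented cell IDs and features, not the proposal's actual pipeline.

```python
# Toy per-assay measurements for the same single cells, keyed by cell ID.
# All identifiers and values are fabricated for illustration.
rna = {"cell_01": {"GATA1_expr": 8.2}, "cell_02": {"GATA1_expr": 1.1}}
methylome = {"cell_01": {"GATA1_promoter_meth": 0.1},
             "cell_02": {"GATA1_promoter_meth": 0.8}}
exome = {"cell_01": {"TP53_mutant": False}, "cell_02": {"TP53_mutant": True}}

def integrate(*assays):
    """Inner-join assay dicts on cell ID, merging their feature dicts."""
    shared = set(assays[0])
    for a in assays[1:]:
        shared &= set(a)  # keep only cells present in every assay
    merged = {}
    for cell in sorted(shared):
        row = {}
        for a in assays:
            row.update(a[cell])
        merged[cell] = row
    return merged

cells = integrate(rna, methylome, exome)
print(cells["cell_01"])
```

Real pipelines face the harder problems the abstract names (noise, dropout, quality control), but the per-cell join is the integration step that makes cross-assay analysis possible.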



1 K01 ES026839-02

Epileptic biomarkers and big data: identifying brain regions to resect in patients with refractory epilepsy Gliske, Stephen V sgliske@umich.edu University Of Michigan Abstract: A new electrical biomarker has been identified in high resolution, intracranial electroencephalogram (iEEG) recordings, called a high frequency oscillation (HFO). Studies have suggested this biomarker has great promise to identify seizure networks and improve surgical outcomes for patients with refractory epilepsy. However, translation of HFOs to clinical practice is hampered by many factors, such as spatial, temporal, and inter-patient variation in HFO detection rates, false positive and false negative detections, and significant background noise. Big data approaches using large numbers of HFOs acquired from many patients are needed to quantify these effects and allow clinical usage of HFOs. This project details a plan in which the candidate's experience quantifying measurement and detection bias in massive high energy nuclear physics datasets will be combined with a multidisciplinary mentor team to address this problem. The combination of training in computational neuroscience, big data network analysis, and translational neural engineering research will be critical to approaching this problem and will provide a career trajectory for the candidate. The specific aims of this proposal address three specific confounding factors: 1) the false negative HFO detection rate, 2) variations in HFO features not due to epilepsy, and 3) effects of the state of vigilance on HFOs. Each of these aims involves novel big data methods and/or applications generalizable to other situations: 1) estimating false positive detection rates using a combined experimental/simulated data approach, 2) clustering and classification of distributions of data points, rather than of the data points directly, and 3) a general disambiguation statistic to assess meaningful (rather than merely statistical) differences between distributions.
The applicant's career goal is to become an academic researcher in the analysis and modeling of intracranial EEG data with a focus on translational epilepsy and sleep physiology research. With the rapid advancement in the resolution of clinical EEG, there is already a strong need for this type of research expertise. This grant will provide didactic coursework, formal research and methods training, and career guidance from an expert mentor team. The three mentors have appointments spanning Neurology, Anesthesiology, Mathematics, Statistics, Biomedical Engineering, and Electrical Engineering and Computer Science. The candidate will also build and mentor a research team and establish external collaborations. The University of Michigan is a premier research university with strong programs and training opportunities


in biomedical and physical sciences, engineering, translational and academic research, and advanced research computing. This proposal makes extensive use of the University's large computer cluster. The mentor team and an external collaborator will provide the candidate with access to prerecorded, deidentified data from over 150 patients, estimated to have over 40 million HFOs. The environment and mentor team will provide the training, facilities, and data for the candidate to successfully complete the proposed goals. Public Health Relevance: Recent advances in epilepsy research are generating very large datasets in the search for better ways to identify the region of brain responsible for generating seizures. A particular signal known as the High Frequency Oscillation, found in high-resolution intracranial EEG, shows great promise for identifying these regions, but clinicians are unable to use the signal due to confounding biases. This project combines big data processing expertise from particle physics, computer science and machine learning to address these confounds, providing a more accurate process to enable clinical translation of this new biomarker and potentially improve clinical outcomes. Publications: doi: 10.1142/S012906571650049; doi: 10.1016/j.clinph.2016.06.029; doi: 10.1109/ICASSP.2016.7472887.
Accounting; Address; Algorithms; Anesthesiology; Appointment; Area; base; Big Data; Biological Markers; Biomedical Engineering; Brain; Brain region; career; career development; Chronic; Classification; Clinical; clinical practice; clinically relevant; Collaborations; computational neuroscience; computer cluster; computer science; computerized data processing; Custom; Data; Data Set; Detection; detector; Development; Electrical Engineering; Electroencephalogram; Engineering; Environment; Epilepsy; Event; expectation; experience; Funding; Future; Goals; Grant; Graph; High Frequency Oscillation; Hybrids; improved; Individual; innovation; Literature; Machine Learning; Manpower and Training; Mathematics; Measurement; Mentors; Mentorship; Methods; Michigan; Modeling; Motivation; multidisciplinary; nervous system disorder; Neurology; Neurosciences; Noise; novel; Nuclear Physics; Operative Surgical Procedures; Outcome; particle physics; Pathway Analysis; patient population; Patients; Pharmaceutical Preparations; physical science; Physics; Physiological; Physiology; Population; Process; public health relevance; Refractory; relating to nervous system; Research; Research Methodology; Research Personnel; Resected; Resolution; Sampling; scientific computing; Seizures; signal processing; Signal Transduction; Sleep; sleep epilepsy; Solutions; spatial temporal variation; standard of care; Statistical Methods; statistics; Techniques; Technology; tool; Training; Training Programs; Translating; Translational Research; Translations; Universities; Variant; vigilance; Work



1 K01 ES026838-02

Molecular Analysis and Precision Medicine in Renal Cell Carcinoma Johnson, Michael Hiroshi michael.h.johnson@jhmi.edu Johns Hopkins University Abstract: My proposed Mentored Career Development Award in Biomedical Big Data Science (K01) will focus on an important but largely unmet need within the field of oncology. Renal cell carcinoma is the 8th most common cancer and the most lethal of the urologic cancers. Systemic medical therapy is required for the 25% of patients who initially present with metastatic disease and the 30% of patients who recur following surgery. The selection of which medical treatment to use is not based on biomolecular characteristics of the tumor, despite large variability in the efficacy of the treatments. Only recently, with improvements in sequencing technologies, have these molecular analyses been possible. The research goals proposed within this application combine bioinformatics, biostatistics, and molecular biology to establish an informatics toolkit for investigating the molecular markers of renal cell carcinoma as they relate to treatment success. We hypothesize that building a toolkit into an existing, well-annotated clinical database and tissue repository will allow us to (1) create the first multi-institutional registry of renal cell carcinoma patients with clinical, genomic, and outcomes data, (2) identify molecular predictors of treatment success for existing cancer therapies, and (3) investigate personalized subtypes of patients that correlate with treatment response to immunotherapy. An understanding of molecular predictors of treatment success can have a major impact on decision-making within renal cell carcinoma and establish hypotheses for future clinical trials.
Over the course of this Mentored Career Development Award in Biomedical Big Data Science, my goal is to acquire the expertise from my mentors that is required to succeed as an independent investigator, advancing Precision Medicine in renal cell carcinoma through the use of biomedical data science. The 5 years allotted for this project will provide ample time to truly establish the skills necessary for independent research. At least 50% of my time will be devoted solely to research, which will be further supplemented by clinical work involving care for patients with renal cell carcinoma. I have a multi-institutional and multi-disciplinary panel of experts who will guide me through this research project and use the resultant tools. The education that I will receive will be essential in cultivating my ability to use biomedical "big data" towards improving care in


renal cell carcinoma patients. My long-term career goals include pursuing additional NIH funding as an independent investigator and leading a multi-disciplinary team in advancing data-driven medicine. In doing so, I hope to inspire future scientists to pursue challenges in biomedical data science and guide them in the same tradition of mentorship that is being given to me. Public Health Relevance: There currently exists a critical, unmet need to improve methods of selecting effective therapy for advanced kidney cancer, the most lethal of urologic cancers. The proposed mentored career development award will focus on identifying molecular markers of treatment success by developing an informatics toolkit that incorporates molecular data into a large, well-annotated institutional database of patients with advanced kidney cancer. The scientific findings of this grant will be used to identify the most effective treatment based on individual tumor biology and accelerate discovery in the molecular biology of kidney cancer.
Address; Adoption; base; Big Data; Binding (Molecular Function); Bioinformatics; Biological Assay; Biology; Biometry; cancer care; cancer therapy; career; Caring; Characteristics; Clinical; Clinical Data; Clinical Sciences; Clinical Trials; clinically relevant; Collaborations; companion diagnostics; Computer software; Computers; Data; data exchange; data management; database query; Databases; Decision Making; design; Development; Disease; Education; effective therapy; exome; Exons; flexibility; Funding; Future; Gene Expression Profile; genome-wide; Genomics; Goals; Grant; Healthcare; Human; Imagery; Immune response; immunogenic; Immunotherapy; improved; Individual; Informatics; inhibitor/antagonist; K-Series Research Career Programs; Knowledge; Lead; Link; Malignant neoplasm of prostate; Malignant Neoplasms; Medical; Medicine; Mentors; Mentorship; Methods; methylome; Molecular; Molecular Analysis; Molecular Biology; molecular marker; mTOR Inhibitor; Oncologist; oncology; Online Systems; Operative Surgical Procedures; Outcome; Palate; Patient Care; Patients; Peptides; Pharmaceutical Preparations; precision medicine; protein aminoacid sequence; public health relevance; Registries; Renal carcinoma; Renal Cell Carcinoma; repository; Research; Research Infrastructure; Research Personnel; Research Project Grants; response; Risk; RNA; Sampling; Science; Scientist; skills; Specimen; success; Technology; The Cancer Genome Atlas; Time; Tissues; tool; tool development; Translating; Translational Research; Treatment outcome; treatment response; tumor; Tumor Biology; Tumor Pathology; Tumor Specific Peptide; Tyrosine Kinase Inhibitor; United States National Institutes of Health; Urologic Cancer; Urologic Oncology; Urologist; Work



1 K01 ES025431-04

The role of epigenetic heterogeneity in CLL evolution Landau, Dan dlandau@nygenome.org Weill Medical College of Cornell University Abstract: Chronic lymphocytic leukemia (CLL) is currently incurable. Despite effective treatments, the disease invariably recurs due, in part, to its ability to evolve. We have shown that pretreatment intra-leukemic genetic heterogeneity foreshadows clonal evolution leading to disease relapse. Nevertheless, the cellular phenotype and its fitness for selection result from both genetic and epigenetic alterations. Therefore, a major challenge in the study of cancer evolution is to integrate genetic and epigenetic heterogeneity. In preliminary studies, we found increased intra-sample epigenetic heterogeneity in CLL. To understand the basis of this heterogeneity, we studied the uniformity of the methylation status of neighboring CpGs contained within individual reads from massively parallel bisulfite sequencing of ~100 primary CLL samples. We demonstrated that most of the heterogeneity stems from disordered methylation, a form of stochastic epigenetic drift. Disordered methylation affected regions important to transcriptional regulation and was associated with a decoupling of the relationship between promoter methylation and transcriptional silencing. Finally, disordered methylation was subjected to selection and may facilitate clonal evolution. I hypothesize that disordered methylation impacts histone modification and transcription, thereby enhancing CLL evolution. To define the impact of disordered methylation, we propose the following independent yet interrelated Specific Aims: (1) To examine its relationship to histone modification and transcription, we will produce comprehensive histone ChIP-seq mapping and ChIP-bisulfite-seq directed at repressive histone marks. 
We will integrate the multidimensional data to infer epigenetic intra-sample heterogeneity and validate this with single-cell RNAseq to assess cell-to-cell variability as a function of methylation disorder. (2) We will develop a statistical inference tool to detect putative methylation "driver" events in cancer, taking into account background stochastic variation. (3) To study the impact of disordered methylation on clonal evolution and clinical outcome, we will integrate genetic and epigenetic heterogeneity analysis in pretreatment samples from 350 patients who received uniform treatment, and 80 relapse samples. There are no therapeutic strategies currently available to curb cancer evolution. Thus, these studies address an unmet therapeutic need. Finally,


in this application, I have outlined a 5-year career development plan to meet my goal of becoming an independent investigator in translational cancer biology, proficient in big data science methodology. I have assembled a Mentorship Committee of internationally recognized experts to provide scientific and career mentorship. I will pursue intensive didactic coursework and hands-on training with leading experts, to develop a strong computational and statistical foundation. Finally, Dana-Farber Cancer Institute is the ideal environment for attaining my scientific and career goals, given its outstanding research community, emphasis on big data science, and an excellent track record of training independent physician-scientists. Public Health Relevance: Chronic lymphocytic leukemia frequently undergoes evolution in response to therapy, resulting in a more aggressive and treatment-resistant form of the disease. I propose a new mechanism to explain this evolution, which contributes to greater diversity of subpopulations within the cancer. In this proposal, I will investigate how this mechanism, termed "disordered epigenetic patterning", facilitates leukemia evolution, thereby paving the way for the development of therapeutic approaches to curb the adaptive potential of cancer. Publications: doi: 10.1038/nature15395; doi: 10.1016/j.ccell.2014.10.012.
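The per-read concordance analysis described in this abstract (reads whose neighboring CpGs are neither all methylated nor all unmethylated) can be sketched as a proportion-of-discordant-reads computation. This is a simplified rendering; the study's exact metric, read-length filters, and CpG-count criteria may differ.

```python
def methylation_disorder(reads):
    """Sketch of a disordered-methylation score: each read is a list of
    0/1 methylation calls for the neighboring CpGs it covers. A read is
    'discordant' when its CpGs are neither all methylated nor all
    unmethylated; the score is the fraction of discordant reads."""
    discordant = sum(1 for r in reads if 0 < sum(r) < len(r))
    return discordant / len(reads)
```

A fully ordered locus (reads all-0 or all-1) scores 0.0, while stochastic, disordered methylation drives the score toward 1.0.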
Accounting; Acute; Address; Adopted; Affect; B-Lymphocytes; base; Big Data; bisulfite; Cancer Biology; cancer cell; Cancer Detection; career; career development; Cells; ChIP-seq; Chromatin; Chronic Lymphocytic Leukemia; Clinical; Clonal Evolution; Collaborations; Communities; DanaFarber Cancer Institute; Data; Development Plans; Disease; Disease remission; Disease Resistance; DNA Methylation; effective therapy; Enrollment; Environment; Epigenetic Process; Event; Evolution; Failure (biologic function); fitness; Foundations; Gene Expression Profile; Gene Expression Regulation; Gene Silencing; Genes; Genetic; Genetic Heterogeneity; Genetic Transcription; German population; Goals; Heterogeneity; high risk; Histone Code; histone modification; Histones; Individual; insight; interest; Lead; Lesion; leukemia; Link; Malignant Neoplasms; Maps; Measures; meetings; Mentorship; Methodology; Methods; Methylation; Multivariate Analysis; Normal Cell; Outcome; Patients; Pattern; Phenotype; Physicians; Plastics; Promotor (Genetics); public health relevance; Reading; Recurrent disease; Relapse; Research; Research Personnel; Resistance; response; Role; Sampling; Science; Scientist; stem; Stem cells; Therapeutic; therapeutic development; tool; Training; Transcriptional Regulation; transcriptome sequencing; tumor; Variant; Vision



1 K01 ES025445-02

Deep Learning and Streaming Analytics for Prediction of Adverse Events in the ICU Nemati, Shamim shamim.nemati@gmail.com Emory University Abstract: Critical care medicine in the United States costs over 80 billion dollars annually. Over the past decade the rate of intensive care unit (ICU) use has been increasing, with a recent study reporting almost one in three Medicare beneficiaries experiencing an ICU visit during the last month of their lives. Every year, sepsis, a medical condition characterized by whole-body inflammation, strikes between 800,000 and 3.1 million Americans, killing approximately one in four patients affected. There is currently no definitive treatment for sepsis in spite of many clinical trials. However, early detection of sepsis and timely initiation of interventions are widely considered important determinants of patient survival. Yet basic care tasks (such as microbiological sampling and antibiotic delivery within 1 h, fluid resuscitation, and risk stratification using serum lactate or alternatives), which are known to benefit most patients, are not performed in a timely manner. Previous literature suggests that high-resolution vital signs (such as heart rate, blood pressure, respiratory rate, etc.), and other sequential measurements within the electronic medical records (EMRs), can be dynamically integrated using Machine Learning techniques to help with early detection of sepsis. With the ubiquity of inexpensive high-capacity storage and high-bandwidth streaming technology, it is now possible to monitor patients' vital signs continuously (for instance, the research application developed by the Emory hospital ICU uses IBM's streaming analytics platform to transmit over 100,000 real-time data points per 100 beds, per second). Despite this continuous feed of data, commonly used acuity scores, such as APACHE and SAPS, are based on snapshot values of these vital signs (typically the worst values during a 24-hour period).
This limitation is partially due to the unavailability of computationally efficient and robust algorithms capable of finding predictive features in multivariate, nonlinear and nonstationary sequential data, which may reveal inter-organ communication and disintegration of causal couplings with critical illnesses such as sepsis. We have recently developed a novel Machine Learning algorithm to automatically discover a collection of predictive multivariate dynamical patterns in a database of patient time-series, which can be used to classify patients or to monitor progression of disease in a given patient. The primary goal of this proposal is to apply our method to assess the predictive power of high-resolution multivariate time-series of vital signs and sequentially recorded EMR data in the ICU for early detection of sepsis and risk stratification of septic patients. To accomplish this, we aim to benchmark our technique on a large


ICU cohort (the MIMIC II database with over 60,000 patients), as well as simulated data from a multiscale mathematical model of the influence of inflammatory mediators on dynamical patterns of vital signs. Next, the technique will be externally validated on two separate ICU sepsis cohorts (the Emory Sepsis dataset and the Mayo Clinic Metric dataset). Finally, we will provide a real-time implementation of the proposed algorithm in a streaming environment (such as the IBM streaming analytics platform), in order to address the Big Data challenge of harnessing real-time, streaming sensor data from bedside monitors within the ICU, while enabling advanced pattern recognition and real-time forecasting. Ultimately, we believe these methods can change the current standard of care through faster recognition and initiation of basic care, as well as by guiding interventional strategies based on severity of illness and the mechanisms underlying physiological deterioration. Public Health Relevance: The proposed project makes use of computers to analyze data from the sickest patients in the intensive care unit (ICU). We want to develop methods to identify patterns in the patient data that predict who is at risk for mortality and who might respond to various medications that could make them better. We have a very strong team of doctors and researchers who work closely together, covering all aspects of the proposed research, which we hope will help us improve the lives of the sickest patients in the ICU.
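The abstract's core contrast is between snapshot acuity scores and continuous, window-based scoring of streaming vitals. As a purely illustrative sketch of the streaming idea (not the investigator's actual algorithm, and with invented names and thresholds), a sliding-window scorer can rate each new vital-sign sample against recent dynamics rather than a 24-hour worst-value snapshot:

```python
from collections import deque
import statistics

class StreamingVitalsScorer:
    """Hypothetical sketch: keep a bounded sliding window over one
    vital-sign stream and score each new sample by how far it deviates
    from the window's recent mean, in units of recent variability."""
    def __init__(self, window=60):
        self.window = deque(maxlen=window)   # drops oldest sample automatically

    def update(self, sample):
        """Return a z-score of the new sample vs. the sliding window."""
        if len(self.window) >= 2:
            mu = statistics.mean(self.window)
            sd = statistics.stdev(self.window) or 1e-9  # guard a flat window
            score = abs(sample - mu) / sd
        else:
            score = 0.0                      # not enough history yet
        self.window.append(sample)
        return score
```

A real implementation would run many such multivariate, coupled features per patient inside a streaming platform; this single-channel z-score only shows the window-versus-snapshot design choice.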
Address; Adverse event; Affect; Algorithms; American; Antibiotics; Antihypertensive Agents; base; Beds; Benchmarking; beneficiary; Big Data; Blood Pressure; Caring; Clinic; Clinical; Clinical Data; cohort; Collection; Communication; Computerized Medical Record; Computers; cost; Critical Care; Critical Illness; Data; data acquisition; Data Analyses; Data Set; Databases; Decision Trees; Deterioration; Differential Equation; Disease Progression; Early Diagnosis; Employee Strikes; Ensure; Environment; experience; feeding; Goals; Heart Rate; Hospital Mortality; Hospitalization; Hospitals; Hour; improved; Inflammation; Inflammation Mediators; Inflammatory Response; Intensive Care Units; Intervention; Killings; Laboratories; Learning; Link; Liquid substance; Literature; Machine Learning; mathematical model; Measurement; Medical; Medicare; Medicine; Methods; Modeling; Monitor; Morphologic artifacts; Mortality Vital Statistics; multi-scale modeling; Nature; novel; Organ; outcome forecast; parallel computer; Pathologic; Patient Monitoring; patient safety; Patients; Pattern; Pattern Recognition; Performance; Pharmaceutical Preparations; Physiological; portability; public health relevance; Reporting; Reproducibility; Research; Research Personnel; Resolution; respiratory; Resuscitation; Risk; Risk Assessment; safety practice; Sampling; SCAP2 gene; sensor; Sepsis; septic; Series; Serum; Severity of illness; simulation; standard of care; Stratification; Stream; System; Techniques; Technology; Telemedicine; Testing; Therapeutic Intervention; Time; time use; United States; Validation; Visit; Work



1 K01 ES025438-04

A Framework for Integrating Multiple Data Sources for Modeling and Forecasting of Infectious Diseases Nsoesie, Elaine O en22@uw.edu University of Washington Abstract: I am trained as a computational biologist and statistician, and I am currently a postdoctoral fellow at Boston Children's Hospital, Harvard Medical School. My main career goal is to become an independent researcher at a major research institution. I plan to continue my current research pursuits in global health and infectious diseases. Specifically, I aim to continue developing mathematical and computational approaches for modeling to understand disease transmission, forecasting future dynamics and evaluating interventions for public policy decisions. As a postdoctoral research fellow, I have had the wonderful opportunity of working with data from multiple sources. Although several of these data streams could be labeled as "Big Data", I typically work with the data after it is already processed, filtered and aggregated to a daily or weekly resolution. While I have developed the necessary skills for modeling these already processed data, there are three important areas where I require additional training, mentoring, and experience: (1) advanced computational skills, especially in the use of high performance computing and informatics tools, (2) techniques in computational machine learning and data mining necessary for data acquisition and processing, and (3) biostatistical methodology needed for the statistical design of studies involving big data. These three training and mentoring aims would enable me to develop the skills necessary to become an independent investigator in Big Data Science for biomedical research. Boston Children's Hospital and Harvard Medical School are leading institutions in translational biomedical research, thereby making them the ideal environment to pursue the training and research aims in this proposal.
The recent emergence of infectious diseases such as the avian influenza H7N9 in China, and the re-emergence of diseases such as polio in Syria, underscore the importance of strengthening immunization and emergency response programs for the prevention and control of infectious diseases. Researchers have developed computational and mathematical models to capture determinants of infectious disease dynamics and identify factors that support prediction of these dynamics, provide estimates of disease risk, and evaluate various intervention scenarios. While these studies have been extremely useful for the understanding of infectious disease transmission and control, most have been disease specific and solely used data from traditional disease surveillance systems. In contrast, there is a huge amount of internet-based data that has been extensively assessed and validated for public health surveillance in the last decade, but it has been scarcely used in conjunction with other data sources for modeling to predict disease spread. Using these novel digital event-based data sources in combination with climate and case data from traditional disease surveillance systems, we will establish a much-needed framework for integrating these disparate data sources for modeling to estimate disease risk and forecasting temporal


dynamics of infectious diseases. Our approach will be achieved through three aims. The first objective is to develop an automated process for acquiring, processing and filtering data for modeling (Aim 1). Once we gather these data, we will develop temporal models for the dynamical assessment of the relationship between the various data variables and infectious disease incidence (Aim 2). Finally, we will assess the utility of the modeling approaches developed under Aim 2 for forecasting temporal trends of infectious diseases (Aim 3). Through data acquisition, thorough processing, statistical and epidemiological modeling, and guided by advisers with expertise in biomedical informatics, computer science and statistics, we plan to achieve a comprehensive approach to integrating multiple data streams for modeling to forecast infectious diseases. Public Health Relevance: Although there have been significant medical and technological advances towards infectious disease prevention, surveillance and control, infectious diseases still account for an estimated 15 million deaths each year worldwide. Reliable forecasts of infectious disease dynamics can influence decisions regarding prioritization of limited resources during outbreaks, optimization of disease interventions and implementation of rigorous surveillance processes for quicker case identification and control of emerging disease outbreaks. Our goal is therefore to develop a data mining/informatics framework that leverages the huge amount of digital event-based data sources in combination with climate data, and data from traditional disease surveillance systems for modeling and forecasting infectious diseases. Publications: doi: 10.1038/srep40841; doi: 10.1016/S1473-3099(16)30513-8; doi: 10.3201/eid2301.161274; doi: 10.2807/1560-7917; doi: 10.1016/j.chom.2015.02.004; doi: 10.1371/journal.pntd.0003977; doi: 10.1038/srep09112.
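The temporal models of Aim 2 combine traditional surveillance counts with exogenous data streams. The abstract does not specify a model form; one minimal, hedged illustration is a one-step-ahead regression y[t] ≈ a·y[t-1] + b·x[t] that blends last week's case count with a same-week digital signal (e.g., a search-query volume proxy), fit by ordinary least squares via the 2x2 normal equations. All names here are invented for the sketch.

```python
def fit_forecaster(cases, signal):
    """Fit y[t] ≈ a*y[t-1] + b*x[t] by ordinary least squares,
    where y is a weekly case-count series and x an aligned exogenous
    data stream. Solves the 2x2 normal equations in closed form."""
    u = cases[:-1]            # lagged case counts, y[t-1]
    v = signal[1:]            # exogenous signal aligned to y[t]
    y = cases[1:]             # targets
    suu = sum(a * a for a in u)
    svv = sum(b * b for b in v)
    suv = sum(a * b for a, b in zip(u, v))
    suy = sum(a * c for a, c in zip(u, y))
    svy = sum(b * c for b, c in zip(v, y))
    det = suu * svv - suv * suv          # assumes predictors not collinear
    a = (suy * svv - svy * suv) / det
    b = (svy * suu - suy * suv) / det
    return a, b

def forecast(a, b, last_cases, next_signal):
    """One-step-ahead forecast from the fitted coefficients."""
    return a * last_cases + b * next_signal
```

A production framework would add more streams, seasonality, and uncertainty estimates; the point of the sketch is only how disparate sources enter one forecasting equation.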
Accounting; Address; Area; Avian Influenza; base; Big Data; Biological Models; biomedical informatics; Biomedical Research; Boston; career; Centers for Disease Control and Prevention (U.S.); Cessation of life; Child; China; Climate; Communicable Diseases; computer science; Computer Simulation; computerized data processing; Coronavirus (genus); Data; data acquisition; data integration; data mining; data modeling; Data Set; Data Sources; Databases; Dengue; Detection; digital; Disease; Disease model; Disease Outbreaks; disease transmission; disorder control; disorder prevention; disorder risk; Emergency Situation; Emerging Communicable Diseases; Environment; Epidemic; epidemiological model; Epidemiology; Event; experience; Future; global health; Goals; Health; High Performance Computing; Human; Humidity; Immunization; improved; Incidence; Individual; infectious disease model; Influenza; Influenza A Virus, H1N1 Subtype; Influenza A Virus, H7N9 Subtype; Informatics; Institution; International; Internet; Intervention; Label; Linear Models; Machine Learning; mathematical model; Medical; medical schools; Mentors; Methodology; Middle East; Modeling; Monitor; news; novel; Outcome; pandemic influenza; Pattern; Pediatric Hospitals; Poliomyelitis; Population Surveillance; Postdoctoral Fellow; Prevention program; Process; public health medicine (field); public health relevance; Public Policy; Report (account); Reporting; Research; Research Design; Research Personnel; Research Proposals; Research Training; Resolution; Resources; respiratory; response; Review Literature; Schools; Science; Series; skills; social; Source; Statistical Methods; Statistical Models; statistics; Stream; Syndrome; Syria; System; Techniques; Temperature; Time; tool; Training; trend; web based interface; Weight; Work; World Health Organization



1 K01 ES026842-02

Connecting Single Cell Heterogeneity to Clinical Descriptors of Clonal Evolution in Acute Myeloid Leukemia Paguirigan, Amy apaguiri@fredhutch.org Fred Hutchinson Cancer Research Center Abstract: Clonal evolution in cancer - the selection for and emergence of increasingly malignant clones during progression and therapy, resulting in cancer metastasis and relapse - has been highlighted as an important phenomenon in the biology of leukemia and other cancers. While the role of clonal evolution during leukemia development and therapy has been a focus for a number of avenues of research, our ability to deduce the clonal composition of individual samples has been limited by the use of bulk leukemia samples and mainly mutation data (limiting our analyses to only a fraction of all leukemias). In part because of this caveat, deconvolution of the clonal structure from bulk sequencing data requires a model of the cancer, specifically regarding the heterozygosity of mutations in single cells, the order of acquisition of mutations, and the requirement for unique mutational events. Our work at the single cell level in acute myeloid leukemia (AML) suggests an underlying tumor heterogeneity beyond what is currently understood in the disease, which precludes us from using bulk techniques to assess diversity accurately. The ability to describe and track clonal diversity over the course of therapy would allow us to determine what impact diversity has on outcomes and would be a useful tool in designing therapies informed by the possibility of cancer cell evolution. This proposal seeks to expand our preliminary work in single cell genetics to create a defined approach for generating single cell and population sampling data that can refine our model of tumor heterogeneity, allowing for accurate reconstruction of clonal diversity for all types of AML.
The work will address a number of challenges in the detection of clonal diversity and the possibility of evolution impacting therapy, as well as provide additional experience in bioinformatics, evolutionary biology, high-performance computing, statistics and clinical research. An excellent mentoring and collaborative team has been assembled at Fred Hutch to support both the training and scientific efforts proposed, including Dr. Jerald Radich (clinical research), Dr. Justin Guinney (bioinformatics), Dr. Brent Wood (clinical flow cytometry), and Dr. Ted Gooley (clinical statistics). With this team, along with the extensive resources available at Fred Hutch such as clinical access/repositories, genomics facilities, and high-performance computing infrastructure, we are uniquely poised to


move the lessons learned at the single cell level from select research settings to the clinic. The resulting approaches for describing clonal diversity and its role in therapeutic responses will be informative not only in the setting of leukemia, but for other heterogeneous cancer types as well. With a better understanding of how clonal evolution can be described and monitored, the potential to design therapeutic strategies meant to manage it provides a valuable opportunity to refine cancer treatments. Public Health Relevance: Emerging evidence shows that tumors often consist of multiple cancer cell clones and suggests that variability in outcomes for patients may be due in part to intra-tumor heterogeneity and the resulting clonal evolution during therapy. This proposal seeks to develop single cell and population sampling methods that serve to define patient-specific models of tumor heterogeneity beyond bulk Big Data analyses. An approach to describe cancer heterogeneity that is feasible to use clinically over the course of therapy would allow for a better understanding of its impact on outcomes and be valuable for developing effective and long-lasting therapeutic strategies, regardless of cancer type.
Acute Myelocytic Leukemia; Address; base; Big Data; Bioinformatics; Biological; Biological Assay; Biology; Blast Cell; cancer cell; Cancer Model; Cancer Relapse; cancer therapy; cancer type; Cell Count; Cell Differentiation process; Cell physiology; Cell Proliferation; Cells; Characteristics; Clinic; Clinical; clinical material; Clinical Research; Clonal Evolution; Clonality; Complex; computerized tools; Cytogenetics; Data; Data Analyses; Data Set; density; Descriptor; design; Detection; Diagnosis; Disease; DNA; DNA Methylation; DNA Sequence; Event; Evolution; experience; fitness; Flow Cytometry; Genetic; genetic analysis; Genetic Heterogeneity; Genome; Genomic approach; Genomics; Heterogeneity; Heterozygote; High Performance Computing; in vitro Model; Individual; insight; Karyotype; Learning; leukemia; Lymphocyte; Malignant - descriptor; Malignant Neoplasms; Measurement; Mentors; Methods; Modeling; Molecular; Monitor; mutant; Mutation; Myeloid Cells; Neoplasm Metastasis; Noise; Normal Cell; novel; Outcome; Patients; Pattern; Population; prevent; Protocols documentation; public health relevance; Race; reconstruction; Relapse; Relative (related person); repository; Research; Research Infrastructure; Residual state; Residual Tumors; Resistance; Resources; response; Risk; Role; Sampling; scale up; Scheme; single cell analysis; Sorting - Cell Movement; Staging; statistics; Stem cells; Stratification; Structure; Techniques; Therapeutic; therapy design; therapy development; Time; tool; Training; Transplantation; tumor; Variant; Variation (Genetics); Wood material; Work

82


1 K01 ES026833-02

Multiparametric Prediction of Vasospasm after Subarachnoid Hemorrhage Park, Soojin sp3291@cumc.columbia.edu Columbia University Health Sciences Subarachnoid Hemorrhage (SAH) affects an estimated 14.5 per 100,000 persons in the United States, and is a substantial burden on health care resources because it can cause long-term functional and cognitive disability. Much of this is due to delayed cerebral ischemia (DCI) from vasospasm (VSP). VSP refers to the reactive narrowing of cerebral blood vessels due to the unusual presence of blood surrounding the vessel. In its extreme, severe VSP precludes blood flow to brain tissue, resulting in stroke. SAH is one of the most common disease entities treated in the Neurointensive Care Unit (NICU). Currently, resource planning is scripted around the Modified Fisher Scale, which predicts the odds ratio of developing DCI based on the volume and pattern of blood on the initial brain computed tomography (CT). It does not, however, allow for further individualized risk assessment. The first 14 days are occupied by efforts to detect preclinical or early VSP and arrange timely interventions to prevent permanent injury. The only noninvasive tool supported by guidelines to potentially identify preclinical VSP is the transcranial Doppler (TCD), which has an unreliable range of sensitivity and negative predictive value, and is at the mercy of technician availability. If not identified preclinically, VSP must be detected once it is symptomatic, and detection is then dependent on the quality and availability of expertise in the complex and diurnal environment of the ICU. Promisingly, electronic medical record (EMR) data and continuous physiology monitors offer abundant opportunities to risk-stratify for future events as well as reveal events in real time in the acutely brain-injured patient. A methodical approach to feature engineering is being performed over a large set of potentially discriminatory data-driven and knowledge-based features. 
Metafeatures representing variations and trends in time series variables are extracted using a variety of quantitative and symbolic abstraction techniques. To date, we have experimented with feature selection using the Random Kitchen Sink algorithm and classification using Support Vector Machine and Logistic Regression. Future work will compare data-driven techniques (deep learning) with knowledge-based feature extraction techniques. This project will result in a prediction tool that improves timeliness and precision in VSP classification. It will fill an important gap in the understanding of the potential of underutilized EMR and physiological data to predict neurological decline. Generating accurate and timely prediction rules from already collected clinical data would be cost effective and have implications not only for SAH patients, but also for almost any monitored patient in any ICU. Publications: doi: 10.1007/s11910-016-0659-0; doi: 10.1001/jamaneurol.2016.5325; doi: 10.1016/j.resuscitation.2016.12.018; doi: 10.1007/s12028-016-0331-1; doi: 10.1212/WNL.0000000000002584 Keywords: Data engineering, Data mining, Bayesian Methods, Machine/statistical learning, Predictive analytics, Signal processing, Statistical Analysis, Data analysis, Neuroscience, Clinical research
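The random-features approach named above can be sketched in miniature. The snippet below is an illustrative stand-in only: it uses synthetic two-dimensional data and a generic random Fourier feature map in the spirit of the Random Kitchen Sink algorithm, followed by plain logistic regression; none of it reflects the project's actual EMR features or code.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy stand-in for engineered ICU features: two classes whose separation
# is nonlinear in the raw inputs (synthetic data, not real SAH records).
n = 400
X = rng.normal(size=(n, 2))
y = (X[:, 0] ** 2 + X[:, 1] ** 2 > 1.0).astype(float)  # nonlinear label rule

# Random kitchen sinks: map inputs through random Fourier features that
# approximate an RBF kernel, then fit a *linear* classifier on top.
D = 200                                   # number of random features
W = rng.normal(size=(2, D))               # random projection directions
b = rng.uniform(0, 2 * np.pi, size=D)     # random phases
Z = np.sqrt(2.0 / D) * np.cos(X @ W + b)  # feature map

# Logistic regression on the random features via gradient descent.
w = np.zeros(D)
for _ in range(2000):
    p = 1.0 / (1.0 + np.exp(-(Z @ w)))    # predicted probabilities
    w -= 5.0 * Z.T @ (p - y) / n          # gradient step

pred = (1.0 / (1.0 + np.exp(-(Z @ w))) > 0.5).astype(float)
accuracy = (pred == y).mean()
print(f"training accuracy: {accuracy:.2f}")
```

The appeal of this construction for large clinical datasets is that the nonlinear kernel computation is replaced by a fixed random feature map, so training reduces to a linear model that scales to many records.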

83


K01 Grants

1 K01 ES025442-03

Nonparametric Bayes Methods for Big Data in Neuroscience Pearson, John john.pearson@duke.edu Duke University pearsonlab.github.io Each year, one in four adults suffers from a diagnosable mental disorder, with one in 25 suffering from a serious mental illness. Yet our ability to anticipate the onset of mental illness - even our ability to understand its effects within the brain - has been limited by the fact that these diseases are not primarily disorders of independent units, but of patterns of pathological brain activation. However, we currently lack a meaningful characterization of patterns of activity within neural networks, and thus the ability to discuss, discover, and treat them effectively. Yet an improvement in our ability to characterize and detect these patterns would result in major clinical impact. This project aims to develop and implement new data analysis methods to characterize the increasingly large datasets generated by neuroscience experiments. Using tools borrowed from machine learning, I focus on three key questions for pattern detection: 1) How does the brain encode complex, unstructured stimuli? 2) What are the basic building blocks of healthy and diseased patterns of intrinsic brain activity? 3) How do patterns of brain activity change in response to changes in behavioral state? My approach makes use of recent advances in Bayesian nonparametric methods, as well as fast variational inference approaches that scale well to large datasets. In addition, because the datasets I use are part of the much larger class of multichannel time series data, the results will apply more broadly to other types of data, in neuroscience and beyond. Work thus far has focused on new approximate methods for inferring features of a complex stimulus from a “neuron’s eye view.” That is, rather than asking which features of a natural scene are important for image recognition by computers, we ask which features of a scene most evoke activity in populations of neurons. 
Our algorithm is capable of “tagging” visual scenes and movies with multiple labels that change in time, scales well to large datasets, and suggests new hypotheses for future experiments. More recent work has focused on the problem of analyzing neural data recorded as animals compete against one another in real-time strategic decision-making. In contrast with repeated games focused on fixed alternatives, we model the behavior of each individual as a responsive control system involving both reactive and self-generated (goal-directed) components. Here, machine learning techniques provide powerful tools for decomposing dynamic behavior in ways that facilitate answering questions about strategic planning, social behavior, and decision-making within the brain. Public Health Relevance: Recent evidence suggests that most mental illnesses result from disruptions in normal patterns of brain activity. Yet discovering, detecting, and describing these patterns of activity has proven difficult to do in conventional experiments. This project will develop computer algorithms for pattern detection that can be applied to the scientific study and early diagnosis of mental illness. Publications: doi: 10.1073/pnas.1514761112; arXiv:1512.01408; doi: 10.1073/pnas.1522315113; doi: 10.1037/bne0000082; doi: 10.1080/13506285.2015.1093244

Keywords: Bayesian Methods, Machine/statistical learning, Multivariate Methods, Pattern recognition and learning, Signal processing, Spatio-temporal Modeling, Statistical Analysis, Computer programming, Computer science, Data analysis, Neuroscience
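A core ingredient of the Bayesian nonparametric methods mentioned above is the Dirichlet process, whose mixture weights can be generated by the stick-breaking construction: a model that does not fix the number of activity patterns in advance. The sketch below is purely illustrative; the concentration parameter and truncation level are arbitrary choices, not values from the project.

```python
import numpy as np

rng = np.random.default_rng(42)

def stick_breaking(alpha, n_components, rng):
    """Truncated stick-breaking construction of Dirichlet process weights.

    Repeatedly break off a Beta(1, alpha)-distributed fraction of the
    remaining 'stick'; the broken-off pieces are the mixture weights.
    """
    betas = rng.beta(1.0, alpha, size=n_components)
    remaining = np.concatenate([[1.0], np.cumprod(1.0 - betas[:-1])])
    return betas * remaining

# Small alpha -> a few patterns dominate; large alpha -> many small ones.
weights = stick_breaking(alpha=1.0, n_components=50, rng=rng)
print("mass captured by truncation:", round(weights.sum(), 4))
print("largest component weight:", round(weights.max(), 3))
```

In a full nonparametric mixture model, each weight would be paired with a pattern (component) of multichannel activity, and inference would adapt how many components carry appreciable weight to the data.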

84


1 K01 ES025435-01

Using a Sequence-to-Structure-to-Function Approach to Functionally Characterize Protein Coding Missense Mutations in the Human and Rat Genomes Prokop, Jeremy William jprokop@hudsonalpha.org Hudson-Alpha Institute for Biotechnology Abstract: The goal of this BD2K K01 application is to provide Dr. Jeremy W. Prokop with the mentoring and training to be an independent investigator in big data to knowledge. Mentoring and training will be provided to Dr. Prokop in genomics, proteomics, computer science, and statistics to advance his skills toward independence. This will allow for further development of his sequence-to-structure-to-function analysis for the interpretation of protein-coding genetic variants into a usable workflow available to other scientists. The mentoring throughout this award will be from Dr. Howard Jacob, a leader in rat/human genetics and tool development to understand variants in whole genomes. Additionally, Dr. Andrew Greene serves as a co-mentor to help advance skills in proteomics and Dr. Christina Kendziorski as a co-mentor to advance statistical tools/approaches. The Medical College of Wisconsin (MCW) is a leader in the use of whole genome sequencing in understanding human health, with a CAP/CLIA certified clinical sequencing facility, tools for identifying genetic variants from whole genomes (such as Carpe Novo), and operation of databases like the Rat Genome Database. These tools and knowledge at MCW will serve as an asset for the training and completion of the Aims in this award. Dr. Prokop's research focus in this grant is to further expand the sequence-to-structure-to-function approaches he developed during his Ph.D. and initial postdoc into a workflow and web submission server for other users. Aim 1 of the grant organizes these steps into a workflow allowing for the development of the web based submission server. To test the workflow, 75 candidate genes for cardiovascular disease will be screened. 
Initial use of the approach has revealed hypotheses for how genetic variants in Havcr1 and Shroom3 result in altered cardiovascular drug response or disease. Variants in these two proteins will be biochemically characterized using a novel decision tree to standardize experiments (Aim 2) and to serve as a quality control for the approaches of Aim 1. From the results of the 75 screened cardiovascular genes, the four genes with the highest confidence score will be validated using the biochemical decision tree in years four and five of this award, serving as an additional quality control for the approaches in Aim 1. This grant will provide the training and support for Dr. Prokop to be integrated into the Medical College of Wisconsin's Clinical Sequencing program, allowing for additional R01 proposals for validation of genetic causes of disease, and facilitating the development of Dr. Prokop into an independent researcher in the use of big data. Public Health Relevance: Development of a new tool package that is able to interpret genetic variants of disease genomes into a testable hypothesis of how protein function is perturbed. This tool will allow for improved diagnostic and treatment methods for many human diseases that employ whole genome/exome sequencing such as cardiovascular, cancer, and rare diseases. Publications: doi: 10.1016/j.jmb.2016.10.016; doi: 10.1074/jbc.M115.696831; doi: 10.1152/physiolgenomics.00074.2015; doi: 10.1186/s13293-016-0064-z; doi: 10.1038/srep40674; doi: 10.1007/s12265-015-9626-4; doi: 10.1152/physiolgenomics.00138.2014. 
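As a toy illustration of the entry point to any sequence-to-structure-to-function workflow, the snippet below parses a protein-level HGVS-style missense string into a (reference residue, position, alternate residue) triple. The helper name and the deliberately truncated amino-acid table are hypothetical and are not taken from the workflow or web server described above.

```python
import re

# Truncated three-letter -> one-letter amino acid table (illustrative only).
THREE_TO_ONE = {"Arg": "R", "Cys": "C", "Gly": "G", "Asp": "D", "His": "H"}

def parse_missense(hgvs):
    """Parse a simple protein-level HGVS missense string like 'p.Arg123Cys'.

    Returns (ref_residue, position, alt_residue) in one-letter code.
    Raises ValueError for anything that is not a plain missense change.
    """
    m = re.fullmatch(r"p\.([A-Z][a-z]{2})(\d+)([A-Z][a-z]{2})", hgvs)
    if not m:
        raise ValueError(f"not a simple missense variant: {hgvs}")
    ref, pos, alt = m.groups()
    return THREE_TO_ONE[ref], int(pos), THREE_TO_ONE[alt]

print(parse_missense("p.Arg123Cys"))  # -> ('R', 123, 'C')
```

A real pipeline would then map the parsed position onto a structural model of the protein to ask whether the substitution perturbs folding, binding, or post-translational modification.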
Actins; Affinity; Age; Area; Award; Big Data; Binding (Molecular Function); Biochemical; Biological Assay; Candidate Disease Gene; Cardiovascular Agents; Cardiovascular Diseases; Cardiovascular system; Case Study; Cellular Morphology; Clinical; Code; cohort; computer science; Computer Simulation; Computers; Data; Databases; Decision Trees; Development; Diagnostic; Dimerization; Disease; disease diagnosis; Disease model; DNA; Doctor of Philosophy; End stage renal failure; Evaluation; exome; exome sequencing; Filtration; Frequencies (time pattern); functional outcomes; Genes; Genetic; genetic variant; Genome; genome database; genome sequencing; Genomics; Goals; Grant; Health; High-Throughput Nucleotide Sequencing; Human; human disease; Human Genetics; Human Genome; human SFN protein; improved; Informatics; interest; Internet; Kidney; Knowledge; Lisinopril; Macromolecular Complexes; Malignant Neoplasms; medical schools; Mentors; Methods; Missense Mutation; Molecular; mutant; National Heart, Lung, and Blood Institute; novel; Online Systems; operation; Output; Pathway interactions; Patients; Pharmaceutical Preparations; Phenotype; Phosphorylation; Post-Translational Protein Processing; Postdoctoral Fellow;

85



Precipitation; Process; programs; protein function; protein structure; Proteins; Proteomics; Protocols documentation; public health relevance; Quality Control; Rare Diseases; rat genome; Rat Strains; Rattus; Reporting; Research; Research Personnel; research study; response; RNA; Science of genetics; Scientist; screening; Serine; skills; Source; statistics; Structural Protein; Structure; Techniques; Testing; tool; tool development; Training; Training Support; Translating; Validation; Variant; Wisconsin

86


1 K01 ES025433-01

HashtagHealth: A Social Media Big Data Resource for Neighborhood Effects Research Nguyen, Quynh quynh.nguyen@health.utah.edu University Of Utah https://hashtaghealth.github.io/ HashtagHealth is a project funded by the National Institutes of Health (NIH) as a Mentored Research Career Development Award for Dr. Quynh Nguyen in the College of Health at the University of Utah. This project proposes to design and develop a new resource, HashtagHealth, that addresses both the dearth of neighborhood data and offers novel characterizations of neighborhoods. We will build the data algorithms and infrastructure to harness relatively untapped, cost efficient, and pervasive social media data to develop neighborhood indicators such as food themes, healthiness of food mentions, frequency of exercise/recreation mentions, metabolic intensity of physical activities, and happiness levels. The specific research aims are as follows: Aim 1. Develop a neighborhood data resource, HashtagHealth, for public health researchers. Aim 2. Develop Big Data techniques to produce novel neighborhood indicators. Aim 3. Utilize HashtagHealth and individual-level data from the Utah Population Database to investigate neighborhood influences on obesity among young adults. Publications: doi: 10.1016/j.apgeog.2016.06.003; doi: 10.2196/publichealth.5869; doi: 10.2105/AJPH.2015.303006; doi: 10.1142/9789813207813_0059 Keywords: Data mining, Databases/data warehousing, Machine/statistical learning, Multivariate Methods, Causal Analysis, Statistical Analysis, Visualization, Computer science, Data analysis, Health Disparities, Social and Behavioral Sciences, Computational Biology, Epidemiology
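The kind of neighborhood indicator described in Aim 2 can be illustrated with a toy computation: the fraction of food mentions that are "healthy," aggregated per neighborhood. The vocabularies, function name, and posts below are invented for illustration; the actual project would use validated lexicons and geocoded social media data at scale.

```python
from collections import Counter, defaultdict

# Hypothetical labeled vocabularies (illustrative only, not project lexicons).
HEALTHY = {"salad", "kale", "quinoa", "smoothie"}
UNHEALTHY = {"burger", "fries", "soda", "donut"}

def neighborhood_food_indicator(posts):
    """Fraction of food mentions that are 'healthy', per neighborhood.

    posts: iterable of (neighborhood, text) pairs.
    """
    counts = defaultdict(Counter)
    for hood, text in posts:
        for word in text.lower().split():
            if word in HEALTHY:
                counts[hood]["healthy"] += 1
            elif word in UNHEALTHY:
                counts[hood]["unhealthy"] += 1
    return {
        hood: c["healthy"] / (c["healthy"] + c["unhealthy"])
        for hood, c in counts.items()
        if c["healthy"] + c["unhealthy"] > 0
    }

posts = [
    ("downtown", "great kale salad today"),
    ("downtown", "burger and fries night"),
    ("westside", "smoothie after my run"),
]
print(neighborhood_food_indicator(posts))  # -> {'downtown': 0.5, 'westside': 1.0}
```

Linking such area-level indicators to individual-level outcomes (as in Aim 3) then becomes a conventional multilevel analysis problem.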

87



1 K01 ES026840-02

INGOT: a family of statistical computing algorithms for hypothesis-driven imaging genomic and longitudinal neuroimaging analysis Schmitt, James E james.e.schmitt@uphs.upenn.edu University Of Pennsylvania Abstract: This application for a K01 Mentored Career Development Award in Big Data Science describes a proposal from Dr. J. Eric Schmitt, MD, PhD, a neuroradiologist at the University of Pennsylvania. It proposes the development of the Integrated Neuroimaging Genomic Ontology Toolbox (INGOT), a free, open source collection of analytic tools for the combined analysis of imaging and genomic data. These tools will include an automated pipeline for combined imaging genomic analysis, the integration of gene and neuroanatomic ontology databases into the analytic framework, and the incorporation of structural equation models to facilitate analysis of genetically informative longitudinal datasets. These tools will then be tested via simulation and subsequently applied to two large genetically informative neuroimaging datasets. The proposal also details the continued career development of Dr. Schmitt as he transitions from Neuroradiology Fellow to independent imaging genomics researcher and neuroradiology faculty. A rigorous interdisciplinary curriculum is described, intended to reinforce and update prior training in quantitative neuroimaging and genomics, maintain existing skills in clinical neuroradiology, and develop a strong technical skill set in bioinformatics and statistics via a combination of workshops, coursework, and seminars. Public Health Relevance: Neuroimaging and genomic data are both becoming increasingly complex, hindering the ability of individual researchers to perform statistically rigorous hypothesis-driven analysis. The development of standardized, integrated, intelligent, and open-source analysis tools such as INGOT is critical to the successful fusion of these modalities. 
INGOT will simplify imaging genomic and longitudinal imaging analysis as well as incorporate information from existing genetic and neuroanatomic databases into the analysis pipeline.

Algorithms; Architecture; Behavior; Behavioral Genetics; Big Data; Bioinformatics; Biological; Brain; Brain imaging; career development; Child; Childhood; Clinical; Cognition; cohort; Collection; Complex; cost; Data; data mining; Data Set; Databases; Development; Doctor of Philosophy; Educational Curriculum; Educational workshop; Environment; Equation; Explosion; Faculty; Family; Future; Genes; Genetic; genetic variant; Genomics; Goals; Gur; Image; Image Analysis; Individual; Individual Differences; Informatics; K-Series Research Career Programs; Knowledge; Literature; Mediating; Mentors; Methods; Modality; Modeling; Moods; National Institute of Mental Health (U.S.); Network-based; neuroimaging; Neurosciences Research; novel; Ontology; open source; Pennsylvania; performance tests; Philadelphia; population based; Psychometrics; public health relevance; relating to nervous system; Research; Research Personnel; Resolution; Sample Size; Science; simulation; skills; Solutions; Statistical Computing; Statistical Methods; statistics; Structure; success; Testing; Time; To specify; tool; Training; Universities; Update
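As background to the twin-based structural equation models INGOT will incorporate: the classic first-pass estimate of heritability from twin correlations is Falconer's formula, h² = 2(r_MZ − r_DZ). The correlation values below are hypothetical, and the full SEM (ACE) decomposition the proposal describes is more general than this back-of-envelope version.

```python
def falconer_h2(r_mz, r_dz):
    """Falconer's approximation to heritability from twin correlations.

    h^2 = 2 * (r_MZ - r_DZ): monozygotic twins share ~100% of segregating
    genes, dizygotic twins ~50%, so doubling the correlation difference
    attributes it to additive genetic variance. Clamped to [0, 1].
    """
    h2 = 2.0 * (r_mz - r_dz)
    return max(0.0, min(1.0, h2))

# Illustrative (made-up) correlations for a cortical-volume phenotype:
print(round(falconer_h2(r_mz=0.85, r_dz=0.50), 3))  # -> 0.7
```

Structural equation modeling generalizes this by fitting additive genetic (A), shared environment (C), and unique environment (E) variance components jointly, which is what makes longitudinal, genetically informative designs tractable.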

88


1 K01 ES026836-02

Data integration for global population health through dynamic models Van Panhuis, Willem Gijsbert wav10@pitt.edu University Of Pittsburgh At Pittsburgh www.tycho.pitt.edu My research is focused on using population health informatics and Big Data methods to improve pandemic preparedness and response strategies. In particular, I am funded by BD2K to improve the use of real-world data for epidemic simulation models. A lack of available, quality data on epidemic diseases has limited the public health response against recent epidemic threats such as Ebola, Chikungunya, and Zika virus: there is no global information system that gives access to integrated epidemiological information on epidemic diseases and most countries disseminate data through siloed, heterogeneous systems. Funded by the Gates Foundation and the NIH Models of Infectious Disease Agent Study (MIDAS), my team has created a global resource for infectious disease data: Project Tycho (NEJM 2013, New York Times, Washington Post, Wall Street Journal). Through Project Tycho (www.tycho.pitt.edu), we have released a redesigned, integrated, machine-readable version of the entire 125-year history of weekly US Notifiable Disease Surveillance System (NNDSS) data and all dengue data available from the World Health Organization (WHO). We also integrated detailed dengue data from 8 countries in SE Asia and found a strong association between large, synchronous dengue epidemics and elevated temperatures caused by El Niño (PNAS 2015, NIH Director's blog). With funding from BD2K, I used a formal, highly expressive data representation, developed by the University of Pittsburgh Apollo project, to redesign and standardize into machine-readable format Chikungunya data that was released by the WHO and countries in Latin America. Most data released by countries or international agencies cannot immediately be used for analysis due to format constraints (e.g. 
PDF files) or missing content, requiring major investments of time and resources to prepare these data for analysis. Using BD2K funding, we created an algorithm that automates the preparation of epidemic data into a common data format that can be used by simulation models. Together with my mentor and programmer team, we are turning this algorithm into a software product that can be used by health agencies and researchers to speed up data processing. The goal of my BD2K project is to create a prototype of an ecosystem for epidemic data and simulators that will automatically determine what epidemics can be simulated based on all available data and simulation models. We have created a large-scale agent-based simulation model of Chikungunya virus among 45M people in 10M households, schools, and workplaces to help policy makers optimize epidemic control strategies based on integrated epidemic and population data (in preparation). To inform policy makers of these new resources, I have presented this work at multiple venues including the US White House Pandemic Prediction and Forecasting Science and Technology Working Group in Washington, DC, the China Centers for Disease Control in Beijing, and the Asia Pacific Dengue Prevention Board in Kuala Lumpur, Malaysia. Many students in public health are excited about this research and I am currently mentoring three doctoral students in Epidemiology to use Big Data methods to improve global population health and am always looking for new talent and collaborations. The specific aims are to: 1) standardize and integrate epidemic data from a variety of countries; 2) develop computer algorithms that will search across all available datasets and all available epidemic simulators to identify those

89


epidemics that can be studied by simulation. These algorithms will also identify data gaps; that is, epidemics that could be studied by simulation if a particular datum or dataset were to become available; and 3) quantify the importance of different datasets for simulation of specific epidemics. This new technology will replace laborious manual processes with fast computer algorithms that can be scaled up to search across millions of datasets and simulators. Impact: Easier and faster discovery of appropriate datasets or data gaps for simulation will expand the use of epidemic simulation for public health research and practice, leading to more efficient integration of available data. Using data more efficiently for innovative analyses will lead to new knowledge and discoveries that can improve global population health. Efficient use of data will also lead to cost savings by avoiding redundant data investments. Finally, wider use of epidemic simulators will improve preparedness against new epidemic threats. Outcomes of this project can be used across the biomedical sciences and will prepare me to become an independent investigator at the interface between public health and Big Data. Publications: doi: 10.1056/NEJMms1215400; doi: 10.1073/pnas.1501375112; doi: 10.1186/1471-2458-14-1144 Keywords: Data engineering, Databases/data warehousing, Differential Equation Modeling, High performance computing, Ontology Design, Operations research, Spatio-temporal Modeling, Visualization, Data analysis, Microbiology and Infectious Diseases, Social and Behavioral Sciences, Virology, Epidemiology
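The data-standardization step described above, converting heterogeneous country reports into one machine-readable format, can be illustrated with a toy example. The schemas and case counts below are invented, and the real project uses the far richer Apollo representation; this only shows the shape of the transformation.

```python
import csv
import io

# Hypothetical heterogeneous inputs: one agency reports weekly counts in
# wide columns, another in long rows (illustrative strings, not real data).
WIDE = "country,w1,w2\nBR,10,12\nCO,3,5\n"
LONG = "country,week,cases\nMX,w1,7\nMX,w2,9\n"

def from_wide(text):
    """Normalize wide weekly columns into tidy (location, week, cases) rows."""
    for row in csv.DictReader(io.StringIO(text)):
        country = row.pop("country")
        for week, cases in row.items():
            yield {"location": country, "week": week, "cases": int(cases)}

def from_long(text):
    """Normalize already-long rows into the same tidy schema."""
    for row in csv.DictReader(io.StringIO(text)):
        yield {"location": row["country"], "week": row["week"],
               "cases": int(row["cases"])}

tidy = sorted([*from_wide(WIDE), *from_long(LONG)],
              key=lambda r: (r["location"], r["week"]))
print(len(tidy), "records;", sum(r["cases"] for r in tidy), "total cases")
```

Once every source arrives in one tidy schema, a discovery algorithm can mechanically check whether a given simulator's required inputs (locations, time resolution, disease) are satisfied by the available datasets.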

90


BD2K Training Effort Grant Information

Institutional Training (T32 and T15) Grants

91


T32 Grants

1 T32 LM012409-01

Predoctoral Training in Biomedical Big Data Science Altman, Russ Biagio russ.altman@stanford.edu Stanford University http://bmi.stanford.edu Stanford’s Biomedical Big Data Science Graduate Training Program’s mission is to provide the highest level of training in the development of novel quantitative and computational methods for the solution of pressing problems in biology and medicine. The BD2K Training Grant for the Biomedical Big Data Science Graduate Training Program started in 2016. Using the BD2K funds, six new students were admitted to the PhD program with an anticipated start date of September 2016. We are pleased that this cohort includes two underrepresented minorities. Our PhD curriculum has been updated first to provide greater flexibility of course choices in the area of biomedical informatics, reflecting the greater exposure at the undergraduate level that our applicants are now receiving, and second to provide the opportunity for both more depth and breadth in the core disciplines of data science: computer science, statistics, and related areas of engineering and mathematics, taking advantage of Stanford’s rich, world-class resources in these domains. We are also redesigning the qualifying examination for advancing to PhD candidacy to ensure uniform coverage and greater rigor across the areas of biomedical informatics and biomedical data science. In the coming year, we anticipate that Stanford’s Biomedical Informatics Program will be moved both organizationally and physically into the recently formed Department of Biomedical Data Science. This will strengthen the already productive interactions between our PhD students and faculty with training in biostatistics, and we expect to establish formal roles for those faculty in our training program. 
Keywords : Data mining, Machine/statistical learning, Medical informatics, Multivariate Methods, Ontology Design, Statistical Analysis, Computational Genomics, Computer programming, Computer science, Data analysis, Genetics and Genomics, Molecular Pharmacology, Systems Biology, Computational Biology

92


1 T32 LM012204-01A1

Quantitative Biomedical Sciences At Dartmouth Amos, Christopher I christopher.i.amos@dartmouth.edu Geisel School of Medicine at Dartmouth College https://bmds.dartmouth.edu/ This T32 training program selects two highly qualified predoctoral students each year for two years of support, based on academic qualifications and a dissertation project that will focus on big data. The predoctoral trainees will complete an eight-course core curriculum that includes two courses in computer science or data science, two courses in bioinformatics, two courses in biostatistics, two courses in epidemiology, and a two-course sequence in integrative biomedical sciences that exposes students to interdisciplinary research and culminates in a collaborative big data class project. Students in this program also take two elective courses and complete one term of teaching, journal clubs and seminars, training in the responsible conduct of research, a written and oral qualifying exam, a yearly research-in-progress seminar, and a significant research project that forms the foundation of the written dissertation, which is orally defended. Faculty trainers include QBS and computer science faculty who have extramural funding, a strong track record of graduate training experience, a strong track record of big data research, and a commitment to participating in the activities necessary for successful predoctoral training. Research areas of the faculty include bioinformatics, biostatistics, computer science, computational biology, genomics, and epidemiology. It is our vision that the next generation of big data researchers will need multiple skills and expertise from areas such as bioinformatics, biostatistics, computer science, and epidemiology to bring together teams of researchers. 
Publications: 10.1158/1055-9965.EPI-15-1318, 10.1093/bioinformatics/btw180, 10.1158/1078-0432.CCR-16-0623 Keywords: Artificial intelligence, Databases/data warehousing, High performance computing, Bayesian Methods, Machine/statistical learning, Medical informatics, Multivariate Methods, Population Genetics, Computational Genomics, Computer science, Data analysis, Genetics and Genomics, Cancer, Systems Biology, Clinical Research, Computational Biology

93



1 T32 LM012414-01A1

Big Data to Knowledge Training Program Kosorok, Michael R; Forest, Mark Gregory forest@unc.edu, kosorok@bios.unc.edu University Of North Carolina Chapel Hill http://bd2k.web.unc.edu/ The UNC Chapel Hill “BD2K in Biomedicine” program supported 6 trainees during our first academic year, an additional 13 trainees this summer, and leveraged additional federal funding for ~3 dozen other graduate students. Our program now includes mentors from 23 Departments across 4 Schools (Public Health, Pharmacy, Medicine, Information and Library Science) and the College of Arts and Sciences, and we share curricula with the Bioinformatics and Computational Biology training program. We have acceptances for 7 funded trainees for Year 2, using funds remaining from the summer supplement for an additional trainee. All trainees, funded or not, work on research projects that integrate a biomedical data-intensive domain, a statistics / data analytics domain, a computer science / information science domain, and a predictive computational modeling domain. During Year 1, five modules (counting for 1 credit hour in each participating PhD program) were developed and offered:

• Cancer Genomics and Class Discovery by the Perou lab in the Lineberger Cancer Center and the Marron-Nobel lab in Statistics;

• Microrheology and Transport – Particle Tracking in Biological Fluids by the Hill lab in the Marsico Lung Institute, the Lai lab in the School of Pharmacy, the Superfine lab in Physics, and the Forest lab in Mathematics;

• Chromosome Organization and Dynamics in Living Cells by the Bloom lab in Biology and the Forest lab in Mathematics;

• Predictive Analysis for High-Dimensional Data Sets by the Prins lab in Computer Science and the Dittmer lab in Microbiology and Immunology; and

• Neuromechanics by the Miller lab in Biology and Mathematics and the Newhall lab in Mathematics.

Throughout Year 1, group meetings (of grad trainees and mentors) were held for two hours on the first Friday of each month, with activities including: trainee presentations of their research projects, soliciting (and receiving) feedback from other attendees; brainstorming sessions about what skills and tools trainees were interested in learning, followed by small working groups to present and learn; and at least one visitor at each meeting who presented the focus of their lab, with the goal of developing a multi-domain collaboration similar to the theme of our BD2K program. The summer supplemental program solicited proposals from faculty and grad trainees for team-oriented projects spanning two months, with a data-intensive biomedical focus and a desire to integrate multiple disciplinary domains. Over the first two weeks, we played matchmaker and aligned interested students with faculty, settling on projects involving 13 BD2K funded trainees and about the same number of alternatively funded trainees, as well as numerous postdocs and faculty from across campus. All participants met for 2 hours each Friday, reporting progress and soliciting input and suggestions from the group. The last meeting was devoted to a report from each working group about the project, progress, and future plans. The final reports of the Year 1 funded trainees and of the Summer Supplement Program will soon be available on our website http://bd2k.web.unc.edu/ Publications: doi: 10.1103/PhysRevLett.116.228301; doi: 10.1186/s13059-016-0975-3; doi: 10.1093/nar/gkw510 Keywords: High performance computing, Image Analysis, Information science, Bayesian Methods, Machine/statistical learning, Mathematical statistics, Medical informatics, Spatio-temporal Modeling, Computational Genomics, Computer science, Genetics and Genomics, Cancer, Virology, Chromosome Biology, Epidemiology

94


1 T32 LM012414-01A1

Predoctoral Training In Biomedical Big Data Science Daniels, Michael J; Dhillon, Inderjit; Meyers, Lauren Ancel mjdaniels@austin.utexas.edu University Of Texas, Austin https://stat.utexas.edu/biomedical-big-data-training-grant The purpose of this pre-doctoral training program at The University of Texas at Austin is for the trainee to become an expert in one of the following areas: 1. Statistics (STAT); 2. Computer Science (CS); 3. Computational science, engineering, and mathematics (CSEM); or 4. Biology (via a PhD in one of a. neuroscience [NS]; b. ecology, evolution, and behavior [EEB]; c. cell and molecular biology [CMB]; or d. Biomedical Engineering [BME]) while also obtaining essential training in all three core areas (statistics, computer science, and biology). TRAINING Training for the program involves three formal components: core courses (3), research lab rotations (2), and a seminar/workshop course. CORE COURSES Trainees with sufficient skills in computer science, statistics, and biology take the following three courses in the second year of their PhD programs. BIO 38X: Introduction to Biology for Data Science: An introduction to modern biology for students with quantitative backgrounds. The course includes a survey of modern biology and also introduces modern statistical approaches, bioinformatics analyses, and computational approaches fundamental to “big data” and other life sciences in contexts students are likely to encounter in their own research. CSE 380: Tools and Techniques of Computational Sciences: Graduate level introduction to the practical use of high performance computing hardware and software engineering principles for scientific technical computing. Topics include computer architectures, operating systems, programming languages, data structures, interoperability, and software development, management, and performance.

95


T32 Grants

SDS 385: Statistical Models for Big Data: This course will cover big data modeling approaches including linear models, graphical models, matrix and tensor factorizations, clustering, and latent factor models. Algorithms explored will include sketching, fast n-body problems, random projections and hashing, large-scale online learning, and parallel learning. RESEARCH ROTATIONS Each student will participate in (at least) two lab rotations designed to give students direct mentoring, the experience of working on a research team, and experience working on real problems in big data. Trainees will do a “quantitative” and a “biomedical” rotation. Research group rotations will last a semester, and each student will be expected to complete two rotations (typically both during the fall semester of year 3 of their PhD). For each rotation, the students will register for a 3-credit course. WEEKLY SEMINAR COURSE Through the weekly seminar/workshop course, students will be introduced to research areas, develop skills in critical literature evaluation, strengthen their oral and written communication, and work on team rotation projects. PROGRAM TIMELINE First Year: During the first year of the students’ PhD program, the students will take the standard curriculum for their respective programs. Second Year: Trainees will take three program core courses and enroll in the seminar/workshop course in the second year of their PhD programs. Third Year: The first semester of the third year will comprise two rotations and any additional coursework. Trainees will also enroll in the seminar/workshop course for both semesters. The second semester will be spent developing and/or finalizing a dissertation topic. All third-years will submit F31 applications, assuming their dissertation work is fundable by an F31 institute.
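As a hedged illustration of the sketching techniques the SDS 385 description lists (not code from the course itself), the following stdlib-only Python sketch applies a Johnson-Lindenstrauss-style Gaussian random projection; the dimensions and seeds are invented for the example:

```python
import math
import random

def random_projection_matrix(d_in, d_out, seed=0):
    """Gaussian projection matrix, scaled by 1/sqrt(d_out) so that
    pairwise distances are approximately preserved in expectation."""
    rng = random.Random(seed)
    scale = 1.0 / math.sqrt(d_out)
    return [[rng.gauss(0, 1) * scale for _ in range(d_in)]
            for _ in range(d_out)]

def project(R, x):
    """Map a d_in-dimensional point to d_out dimensions via R."""
    return [sum(r_i * x_i for r_i, x_i in zip(row, x)) for row in R]

def dist(a, b):
    """Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((u - v) ** 2 for u, v in zip(a, b)))

if __name__ == "__main__":
    rng = random.Random(42)
    d_in, d_out = 1000, 200          # made-up dimensions for illustration
    x = [rng.gauss(0, 1) for _ in range(d_in)]
    y = [rng.gauss(0, 1) for _ in range(d_in)]
    R = random_projection_matrix(d_in, d_out)
    # The distance ratio after projection should be close to 1.
    print(dist(project(R, x), project(R, y)) / dist(x, y))
```

The design point is the one the course emphasizes: a cheap, data-independent linear map shrinks dimensionality while approximately preserving geometry, which makes downstream clustering or nearest-neighbor methods tractable at scale.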
Keywords: High performance computing, Image Analysis, Bayesian Methods, Machine/statistical learning, Pattern recognition and learning, Causal Analysis, Statistical Analysis, Computational Genomics, Computer science, Genetics and Genomics, Molecular Biology and Biochemistry, Neuroscience, Systems Biology, Epidemiology

96


1 T32 LM012410-01

Massive and Complex Data Analytics Pre-Doctoral Training in One Health Shyu, Chi-Ren shyuc@missouri.edu University Of Missouri-Columbia http://engineering.missouri.edu/2016/05/collaborative-effort-leads-unique-informatics-degree-program/ The University of Missouri-Columbia (MU) is hosting a unique predoctoral training program to nurture a new breed of data scientists and informaticians for Massive and Complex Data Analytics in One Health. Our goal is to bring new expertise in Big Data analytics to the biomedical science community, helping researchers to process, analyze, and visualize meaningful heterogeneous biomedical data derived from animals and humans, and to draw key insights to accelerate scientific discoveries. People and animals share common diseases (e.g., prostate cancer and hip dysplasia), and their study is the focus of comparative medicine. Semantically integrating human and animal data, including molecular data when available, will enable novel in silico studies such as candidate gene prioritization and target discovery, and will aid in the development of analytical techniques and tools to help with the timely detection and treatment of human and animal ailments and illnesses. The training faculty at the University of Missouri are well positioned, through strong existing collaborations, to meet the challenges in graduate education and to effectively train the next generation of data scientists in One Health. The MU Informatics Institute (MUII), in conjunction with the School of Medicine, the College of Veterinary Medicine, and the College of Engineering, collaboratively hosts a T32 training program addressing the needs of researchers and practitioners under the One Health theme.
Moreover, the project calls for bringing advanced analytic capabilities to the basic science and clinical communities by providing responsive new Big Data training opportunities at multiple levels, to help trainees and biomedical scientists transition into the Big Data environment through an array of training and outreach activities. MU’s T32 program will train 14-18 doctoral students over the 5-year project duration (2016-2021). Students will participate in a training program that has both department-specific and program-wide components. Our unique departmental components include: (1) required Data Science and Analytics classes that are highly personalized to ensure core competencies for trainees from diverse technical backgrounds; (2) a specialized, tailored informatics curriculum appropriate for the unique research interests of each trainee; (3) a group of outstanding scientists to serve as academic mentors and research role models for intra- and interdisciplinary research. Our interdisciplinary components include: (1) tri-lab rotations, allowing students to gain hands-on wet-lab experience in human and animal health, as well as informatics; (2) instruction in written and oral communication, so that our trainees will be able to efficiently disseminate their analytics outcomes for actionable plans; (3) a suite of creativity events and student-driven seminars, allowing students to work in teams with senior researchers to identify and address current and future research challenges; and (4) professional networking that enables our trainees to present their work at national and international meetings. Together, these departmental and program-wide components provide our trainees with a depth of disciplinary expertise and a breadth of exposure to other disciplines. Under the One Health theme, a new breed of data analyst will be trained: one who can quickly analyze animal and human data and infer discoveries for improved human health.
Keywords: Data mining, Databases/data warehousing, High performance computing, Machine/statistical learning, Predictive analytics, Search engines, Cloud Computing, Computational Genomics, Data analysis, Genetics and Genomics, Structural Biology, Clinical Research, Computational Biology

97



98


1 T32 LM012413-01A1

Bio-Data Science Training Program Newton, Michael A; Dewey, Colin Noel; Gould, Michael N newton@biostat.wisc.edu, cdewey@biostat.wisc.edu, and gould@oncology.wisc.edu University Of Wisconsin-Madison https://www.biostat.wisc.edu/content/predoctoral-training-program-biodata-science-bds Research to improve the analysis of big biomedical data is active at the interface of computer sciences, statistics, and various biomedical disciplines, including genomics, molecular biology, neuroscience, cancer research, and population health. The mission of the Bio-Data Science (BDS) training program is to provide predoctoral research training at this interface, preparing graduate students for key roles in academia, industry, or government.

Trainers in the UW BDS program are leading research efforts at the frontiers of biomedical data science. Broadly, we cover computer sciences, statistics, and various biomedical domains. We support trainees in one of three PhD programs (Computer Sciences, Statistics, Cellular and Molecular Biology), and we require trainees to complete advanced course work in these three areas, in conjunction with the requirements of their home PhD program. We organize lab rotations through which trainees contextualize elements of their formal course work and become prepared for PhD research. By virtue of the interdisciplinary efforts of the trainers, trainees have many opportunities to develop data science research projects in a wide variety of biomedical topic areas. For example, current projects include developing better computational tools for the analysis of RNA sequencing data, with applications to stem-cell biology, and developing better tools for neuroimaging data analysis, with applications to Alzheimer’s disease research. Publications: doi:10.1371/journal.pcbi.1003235, doi:10.1038/nmeth.3549, doi:10.1002/hbm.22472 Keywords: Image Analysis, Bayesian Methods, Machine/statistical learning, Mathematical statistics, Statistical Analysis, Computational Genomics, Computer science, Data analysis, Genetics and Genomics, Neuroscience, Stem Cell Biology, Cancer, Computational Biology

99



100


1 T32 LM012419-02

University of Washington PhD Training in Big Data for Genomics and Neuroscience Noble, William Stafford; Daniel, Thomas L; Fairhall, Adrienne L; Witten, Daniela william-noble@uw.edu University Of Washington http://www.gs.washington.edu/academics/bdgn/index.htm The University of Washington conducts world-class research in the development of big data analytics, as well as in many areas of biomedical research. However, most predoctoral students in biomedical science do not receive cutting-edge training in statistical and computational methods for big data. Furthermore, most predoctoral students in statistics and computing do not receive in-depth training in biomedical science. Given the growing importance of big data across many areas of biomedical research, such an integrated program is critically needed. In order to train a new generation of researchers with expertise in statistics, computing, and biomedical science, we have created the University of Washington PhD Training in Big Data for Genomics and Neuroscience (BDGN). This program focuses on two areas of biomedical science, both of which are characterized by huge amounts of data as well as extensive expertise at the University of Washington: genomics and neuroscience. The program draws six predoctoral students per year from the following seven PhD programs: Applied Mathematics, Biology, Biostatistics, Computer Science & Engineering, Genome Sciences, Neuroscience, and Statistics. Trainees are appointed to the training grant during their first or second year of PhD studies and continue on the training grant for two years. They take a rigorous curriculum that involves three courses in statistics, machine learning, and data science, and three courses in either genomics or neuroscience. Each trainee is paired with two world-class faculty mentors: one specializing in either genomics or neuroscience, and a second specializing in the development of either computational or statistical methods for big data.
Other key features of the training program include three one-quarter rotations, with at least one focusing on genomics or neuroscience and one focusing on statistical or computational methods; a summer internship program; opportunities to attend world-class summer courses run through UW programs; peer mentoring; seminars; journal clubs; and courses on reproducible research and on responsible conduct of research. All predoctoral trainees will leave the BDGN Training Program with a core set of skills and a common language required for generating, interpreting, and developing statistical and computational methods for big data from genomics or neuroscience. Keywords: Machine/statistical learning, Population Genetics, Computational Genomics, Computer science, Data analysis, Genetics and Genomics, Stem Cell Biology, Cancer, Chromosome Biology, Computational Biology

101


1 T32 LM012416-01

Transdisciplinary Big Data Science Training at UVa Papin, Jason; Brown, Donald E; Loughran, Thomas Patrick; Skadron, Kevin papin@virginia.edu University Of Virginia http://bme.virginia.edu/bds We aim to prepare the next generation of scientists and engineers to address the monumental challenge of multi-type biomedical big data manipulation, analysis, and interpretation. We propose a curriculum and a set of programmatic activities to create an interdisciplinary training ground wherein teams of students will work across key disciplines, benefit from a true co-mentoring and interdisciplinary environment, and develop the technical and “soft” skills necessary to succeed as independent scientists making groundbreaking new discoveries enabled by biomedical big data. Three key features of this proposed training program are (1) depth in Big Data technical training, (2) tangible “soft skill” training through collaborative, team science activities, and (3) cross-disciplinary co-mentors in close physical proximity. Our proposed program embraces the philosophy that “there can be no question about the productivity and effectiveness of research teams formed of partners with diverse expertise” (The National Academies, 2004). We propose courses, symposia, workshops, and collaborative activities to create a training environment that will support the development of the next generation of biomedical big data scientists and engineers. The proposed program will necessarily lie outside the existing traditional curricular structure, and it will provide the blueprint for a future in which collaborative biomedical big data science will play an ever-increasing role in biomedical science research. The program is led by faculty with a strong history of prior collaboration and activity in biomedical big data science.
The goals of our proposed training program are to (1) create innovative and effective approaches to teaching collaborative methods for interdisciplinary biomedical big data science; (2) address the demand at UVa and nationally for students, and ultimately scientific professionals, with data science expertise who can work on interdisciplinary teams to address complex challenges and problems; (3) produce a scalable, sustainable, and transferable program for education and training in collaborative big data science; and (4) create new pipelines for Ph.D. students from underrepresented groups. Recognizing the inextricable link between diversity and excellence, our program seeks to ensure that the next generation of leaders in biomedical big data science and engineering emerges from a variety of backgrounds. With an excellent infrastructure and a history of recruiting students from underrepresented groups to existing NIH and other federal agency-funded programs at UVa, this proposed training program will flourish in bringing diversity to biomedical big data science. Historically, diverse sets of expertise have been deeply embedded in the solutions to many important scientific problems. We seek to imbue this sense of dedication to collaboration in our training program on biomedical big data. Keywords: Data mining, Differential Equation Modeling, Image Analysis, Bayesian Methods, Machine/statistical learning, Multivariate Methods, Nonlinear Dynamics, Operations research, Predictive analytics, Spatio-temporal Modeling, Computer science, Data analysis, Genetics and Genomics, Microbiology and Infectious Diseases, Cancer, Systems Biology, Computational Biology

102


T32 LM012424-02

Biomedical Big Data Training Grant Pellegrini, Matteo matteope@gmail.com University Of California Los Angeles Over the past few years there has been increasing recognition that the biomedical sciences are undergoing a transformation, led by the development of new technologies that have enormously increased the capacity to generate data. These include technologies to sequence DNA and RNA; to measure protein and metabolite abundances using mass spectrometry; and multiple other high throughput platforms for screening and phenotyping. Coupled with advances in medical imaging and electronic health records, the amount of data is growing faster than ever before. The Biomedical Big Data training grant allows us to support graduate students in fundamental aspects of biomedical big data analysis. A critical component of this training grant is the co-mentorship of each student by a Big Data expert along with a clinical expert, who will advise the student on a project with translational components. The training also involves a tailored set of courses in big data analysis, clinical data records, and ontologies, along with specialized team building activities in big data challenges and extramural internships. Publications: doi:10.1007/s40484-016-0061-6, doi:10.1096/fj.201600259RR Keywords: Data mining, High performance computing, Image Analysis, Information science, Medical informatics, Population Genetics, Predictive analytics, Cloud Computing, Computational Genomics, Data analysis, Genetics and Genomics, Systems Biology, Clinical Research, Computational Biology, Epidemiology

103


1 T32 LM012411-01A1

Statistical and Quantitative Training in Big Data Health Science Quackenbush, John johnq@jimmy.harvard.edu Harvard School Of Public Health https://www.hsph.harvard.edu/sm-computational-biology/ Abstract: Unprecedented advances in digital technology during the second half of the 20th century have produced a revolution that is transforming science, including health and biomedical research, by providing data of unprecedented complexity, in volumes and at a rate that were previously unimaginable. Members of the National Research Council's (NRC's) Committee on Massive Data Analysis concluded in their 2013 "Frontiers of Massive Data Analysis" report that the challenges associated with "Big Data" go far beyond the technical aspects of data management, and emphasized that development of rigorous quantitative and statistical methods is crucial if we are to use these data to their full advantage. In this application we describe an integrated program designed to provide students with training in the quantitative and computational skills, and the communication and interdisciplinary research skills, together with their application, required for those students to become the next generation of leading Big Data scientists in health and biomedical research. At the Harvard TH Chan School of Public Health, we have made a substantial investment in addressing these challenges, including launching a new formal Master's Degree program in Computational Biology and Quantitative Genomics, revamping the curriculum in Biostatistics to place greater emphasis on computational methods and Big Data, advancing a proposal, currently under internal review, to include computation as an area of core competency for our students, and including Big Data analytics as a central focus of the School's ongoing capital campaign.
We are requesting support for six pre-doctoral students who will emerge from the program with expertise in cutting-edge statistical and computational methods development, a thorough understanding of fundamental basic science, public health, and clinical science, and demonstrated skills in the application of those methods in a wide range of areas in health and biomedical research. Our students will participate in a program designed to provide them with interdisciplinary research experience, to train them to collaborate and communicate effectively, and to understand the importance of data provenance and reproducible research. The training program involves active participation by accomplished and experienced multidisciplinary faculty members, including biostatisticians, bioinformatics scientists and computational biologists, computer scientists, molecular biologists, public health researchers, and clinicians. It combines elements of training in coursework, lab rotations in biostatistics, computational biology, computer science, molecular biology, population science and clinical science. Students will participate in directed and independent methodological research, will be involved in broad-based collaborative research projects, and will have rich career development opportunities in a stimulating and nurturing interdisciplinary environment that will prepare them to be leaders in quantitative Big Data health science research. Public Health Relevance: Unprecedented advances in digital technology during the second half of the 20th century have produced a revolution that is transforming science, including health and biomedical research, by providing data of unprecedented complexity in volumes and at a rate that was previously unimaginable. 
Members of the National Research Council's (NRC's) Committee on Massive Data Analysis concluded in their 2013 "Frontiers of Massive Data Analysis" report that the challenges associated with "Big Data" go far beyond the technical aspects of data management, and emphasized that development of rigorous quantitative and statistical methods is crucial if we are to use these data to their full advantage. In this application we describe an integrated program designed to provide students with training in the quantitative and computational skills, and their application, required for those students to become the next generation of leading Big Data scientists in health and biomedical research. Big Data; Health Sciences; Training

104


1 T32 LM012415-01

Penn State Biomedical Big Data to Knowledge (B2D2K) Training Program Ritchie, Marylyn D; Honavar, Vasant G; Li, Runze mdritchie@geisinger.edu The Pennsylvania State University, Geisinger Health System, Penn State Hershey Medical Center

The Biomedical Big Data to Knowledge (B2D2K) Training Program at The Pennsylvania State University will bring together Data Science researchers and educators from 5 colleges at Penn State: the Colleges of Science, Engineering, Health and Human Development, Information Sciences and Technology, and Medicine, along with Geisinger Health System, to create a truly transformative multi-disciplinary predoctoral training environment. The goal of the B2D2K program is to train a diverse cohort comprising the next generation of biomedical data scientists, with deep knowledge of Data Science, to develop novel algorithmic and statistical methods for building predictive, explanatory, and causal models through integrative analyses of disparate types of biomedical data (including Electronic Health Records, genomics, behavioral, socio-economic, and environmental data) to advance science and improve health. We believe that the investment in this generation of data scientists will be critical to seeing “Biomedical Big Data” fully utilized to its greatest potential. This proposed T32 B2D2K training program is designed to build upon the previous investments of the University to integrate the faculty in three critical areas, Computer Science/Information Science, Statistics, and Life/Social Sciences, into a comprehensive predoctoral training program for the development of the next generation of data scientists. Over the past five years, Penn State has increased the number of faculty in particular thematic areas. A cluster hire in the area of Genomics and Computational Biology recruited approximately 30 new faculty to the University in a wide range of focused research areas broadly considered genomics. Another cluster hiring initiative, focused on Data Sciences, is currently underway and has already resulted in 25 hires in Computational Sciences. Here, another 12 faculty were recruited into areas of Statistics, Computer Science, and Information Sciences and Technology.
The Strategic Plan of the Institute for CyberScience calls for participation over the next 5 years in the coordinated hiring of 100 faculty with a focus on Data Sciences research and applications. Thus, The Pennsylvania State University has assembled the requisite faculty needed to support a predoctoral training program in Biomedical Big Data. While these investments by the University have brought the faculty and thematic research areas together in one institution, they are not all integrated into one graduate program for students. Thus, there is an urgent need and desire to develop a predoctoral graduate training program that integrates these areas into a complete, well-rounded training program, enabling scientists to acquire the appropriate background knowledge in data science as well as in their thematic area of choice, and to engage in high quality biomedical big data research. The key graduate programs from which we expect to recruit students into B2D2K will supply a cadre of students who have the prerequisite knowledge and skills, along with complementary graduate program coursework, to become well-rounded biomedical data scientists. This program will provide students from these diverse disciplines with a concrete backbone of data science coursework on which they can build in their chosen subject-specific content areas. Keywords: Artificial intelligence, Data mining, Machine/statistical learning, Pattern recognition and learning, Predictive analytics, Causal Analysis, Statistical Analysis, Visualization, Computational Genomics, Computer science, Genetics and Genomics, Social and Behavioral Sciences, Computational Biology, Epidemiology

105


1 T32 LM012412-01

BIDS: Vanderbilt Training Program in Big Biomedical Data Science Malin, Bradley A; Blume, Jeffrey; Gadd, Cynthia S b.malin@vanderbilt.edu Vanderbilt University https://medschool.vanderbilt.edu/dbmi/vanderbilt-big-biomedical-data-science-bids-program Advances in information and high-throughput technologies have set the stage for the ‘big data age’ in biomedicine. However, there remain unresolved challenges that could limit the impact of big data exploration in the basic, clinical, and biomedical sciences. These challenges range from assuring privacy and security in cloud computing environments, to establishing the integrity and reproducibility of quantitative analysis tools, to proving the validity and generalizability of common probabilistic frameworks used to interpret big data in a biomedical context. We believe that our community can prepare for these challenges by developing a workforce that studies data as a science and engineers scalable technologies. Specifically, the Vanderbilt Training Program in Big Biomedical Data Science (BIDS) is designed to prepare the next generation of data scientists. This program is managed as a track within an existing biomedical informatics doctoral program and led by three PIs with complementary expertise in 1) computational infrastructure, 2) statistical methodologies, and 3) management of NIH-sponsored training programs. The program’s mentorship comes from a collection of well-established university faculty and ensures students have exposure to novel biomedical problems and interdisciplinary team-based investigations. The program provides students with the experiences necessary to innovate in big data analytics with high impact on the underlying scientific applications in real clinical environments.

Our program ensures data scientists become highly knowledgeable in 1) computational techniques, technologies, and infrastructure for collecting, processing, and analyzing data on a massive scale; 2) statistical methodologies that accommodate large-scale, complex, high-dimensional biomedical data (e.g., model building and validation, false discovery rates, missing data imputation, recalibration for measurement error, and assessing the strength of statistical evidence); and 3) the scientific method and the specific biomedical and clinical contexts that lead to data capture, downstream discovery, and next-generation decision support systems (which govern the generalizability of results and quantitative tools). Because this field is evolving quickly, we provide students with pragmatic training environments that emphasize and develop critical thinking skills and expose them to modern biomedical data analysis in real systems. Publications: doi:10.1016/j.jbi.2013.12.012, doi:10.1126/science.aad2149, doi:10.1007/s11682-013-9269-5 Keywords: Artificial intelligence, High performance computing, Bayesian Methods, Machine/statistical learning, Medical informatics, Predictive analytics, Statistical Analysis, Computational Genomics, Computer science, Data analysis, Biomedical Engineering and Biophysics, Genetics and Genomics, Clinical Research, Computational Biology
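As a hedged, stdlib-only illustration of one statistical methodology the curriculum names (not code from the BIDS program), the following Python sketch implements the Benjamini-Hochberg step-up procedure for controlling the false discovery rate; the p-values in the example are invented:

```python
def benjamini_hochberg(pvals, alpha=0.05):
    """Return the sorted indices of hypotheses rejected at FDR level alpha.

    Step-up rule: sort p-values ascending, find the largest rank k with
    p_(k) <= alpha * k / m, and reject the k smallest p-values.
    """
    m = len(pvals)
    order = sorted(range(m), key=lambda i: pvals[i])
    k = 0  # largest rank whose p-value clears its BH threshold
    for rank, i in enumerate(order, start=1):
        if pvals[i] <= alpha * rank / m:
            k = rank
    return sorted(order[:k])

if __name__ == "__main__":
    # Eight hypothetical p-values from a toy multiple-testing scenario.
    p = [0.001, 0.008, 0.039, 0.041, 0.042, 0.06, 0.074, 0.205]
    print(benjamini_hochberg(p, alpha=0.05))  # → [0, 1]
```

Unlike a Bonferroni correction, which controls the family-wise error rate, this procedure bounds the expected proportion of false discoveries among rejections, which is why it is a standard tool for high-dimensional biomedical screens.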

106


1 T32 LM012203-01

Predoctoral Training Program in Biomedical Data Driven Discovery (BD3) Starren, Justin B; Klabjan, Diego justin.starren@northwestern.edu Northwestern University At Chicago http://nucats.northwestern.edu/funding/pilot-funding/biomedical-data-driven-discovery-training-program Abstract: The Biomedical Data Driven Discovery (BD3) Training Program at Northwestern University (NU) is a collaborative proposal that brings together Big Data educators and researchers from the Feinberg School of Medicine (FSM), the McCormick School of Engineering and Applied Science (MEAS), the Weinberg College of Arts and Sciences (WCAS), and the School of Communication. The goal of BD3 is to train Big Data scientists, for both academic and industry research positions, who will develop the next generation of methodologies and tools. BD3 will create a truly multidisciplinary data science training environment. In doing so, BD3 will encompass multiple departments and degree programs, leveraging three existing data-intensive doctoral programs: the well-established and nationally recognized program in Data Analytics in MEAS, led by Diego Klabjan, PhD; and two innovative and growing programs led by Justin Starren, MD, PhD, namely the Informatics track of the Driskill Graduate Program, focusing on bioinformatics, and the Informatics track of the Health Sciences Integrated Program, focusing on clinical and population informatics. Together, Drs. Klabjan and Starren have expertise that spans three critical areas: computer science/informatics, statistics/mathematics, and biomedical domain knowledge. BD3 brings together the biomedical Big Data and domain expertise across multiple departments of FSM with methodological expertise in computation, informatics, statistics, and mathematics. The program will recruit three candidates per year and support each trainee for two years.
Success in data science requires mastery of three distinct skill sets: 1) an understanding of the target domain, 2) an understanding of the nature and structure of the data within that domain, and 3) a mastery of the computational and statistical techniques for manipulating and analyzing the data. This translates into a number of more specific competencies, including: deep Domain Knowledge in the target domain, Statistical Methods, Computer Programming, Ontologies, Databases, Text Analytics, Predictive Analytics, Data Mining, Analytics for Big Data, and Responsible Conduct of Research. BD3 will utilize a co-mentoring model, with each student having a domain mentor and a methodological mentor. Each student's program will be based on an Individual Development Plan (IDP). Students will have many opportunities for both laboratory and industrial rotations, leveraging well-established programs at MEAS. Additional educational activities include: an annual retreat, monthly trainee meetings, departmental seminars and speakers, journal clubs, training and experience in teaching, and writing and presentation training. Trainees benefit from extensive institutional support for this program, such as: tuition supplements, stipend supplements, administrative support, the Writing Workshop for Graduate Students, the Searle Center for Advancing Learning and Teaching, the Management for Scientists and Engineers program, nationally recognized mentor and mentee training programs, and formal training in Team Science. Public Health Relevance: The Biomedical Data Driven Discovery (BD3) Training Program at Northwestern University brings together Big Data educators and researchers from the Feinberg School of Medicine, the McCormick School of Engineering and Applied Science, the Weinberg College of Arts and Sciences, and the School of Communication to create a truly multi-disciplinary data science training environment.
The proposal leverages three existing data-intensive doctoral programs: the well-established and nationally recognized program in Data Analytics in McCormick, led by Diego Klabjan; and two innovative and growing programs at Feinberg led by Justin Starren, namely the Informatics track of the Driskill Graduate Program, focusing on bioinformatics, and the Informatics track of the Health Sciences Integrated Program (HSIP), focusing on clinical and population informatics. The aim of BD3 is to train future scientists for both academia and industry who will develop novel Big Data methods to advance science and improve health. Data; pre-doctoral; Training Programs

107


T32 Grants 1 T32 LM012417-01

Biomedical Big Data Training Program at UC Berkeley Vanderlaan, Mark J; Jordan, Michael; Nielsen, Rasmus laan@stat.berkeley.edu University Of California Berkeley Abstract: This proposal responds to the urgent need for advances in data science so that the next generation of scientists has the necessary skills for leveraging the unprecedented and ever-increasing quantity and speed of biomedical information. Big Data hold the promise for achieving new understandings of the mechanisms of health and disease, revolutionizing the biomedical sciences, making the grand challenge of Precision Medicine a reality, and paving the way for more effective policies and interventions at the community and population levels. These breakthroughs require highly trained researchers who are proficient in biomedical Big Data science and have advanced skills at collaborating effectively across traditional disciplinary boundaries. To meet these challenges, UC Berkeley proposes an innovative training program in Biomedical Big Data for advanced Ph.D. students. This training grant will support 6 trainees. We anticipate further extending the reach of our program by admitting up to 2 additional students on alternative support, thus benefitting 8 students per year. The 25 participating faculty have extensive experience with biomedical applications and expertise in biostatistics, causal inference, machine learning, the development of Big Data tools, and scalable computing. Together, they span 8 departments/programs: Biostatistics; Computational Biology; Computer Science; Epidemiology; Integrative Biology; Molecular & Cell Biology; Neuroscience; and Statistics. We will recruit participants from Ph.D. students in their second or third year of study in any/all of these departments. 
Those accepted into the program will participate in an intensive year of training courses, seminars, and workshops, beginning with introductory seminars in late summer and ending with a capstone project by each participant in the spring. Each trainee will be assigned a secondary thesis advisor with biomedical Big Data science expertise complementing that of the primary thesis advisor. Specialized training will focus on three pillars: (1) translation of biomedical and experimental knowledge and scientific questions of interest into formal, realistic problems of causal and statistical estimation; (2) scalable Big Data computing; and (3) targeted machine learning with causal and statistical inference. Activities will include courses in machine learning, targeted learning, statistical programming, and Big Data computing, as well as workshops led by the Berkeley Data Science Institute, Statistical Computing Facility, and Berkeley Research Computing. The capstone course will involve a collaborative project in biomedical science involving the integrated and combined application of skills acquired by the trainees in the three foundational areas. Trainees will also benefit from group seminars, retreats, and interdisciplinary meetings that build a core identity with the cadre and the program. This proposal dovetails with several data science and precision medicine initiatives at UC Berkeley and comes at an ideal time to influence how data science is taught to all graduate students, focusing on biomedical research across campus. Public Health Relevance: Big Data is revolutionizing research in human health and medicine from the design of observational and experimental studies to its analysis. 
Our Biomedical Big Data Training Program will train the next generation of data scientists in biomedicine with a rigorous education in translating real-world problems into realistic causal and statistical estimation problems, computer science, targeted machine learning, and statistical inference. Big Data; Training Programs
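As an illustrative aside (this is our sketch, not the program's curriculum code), the causal-estimation pillar described above can be shown with a toy calculation: estimating an average treatment effect by inverse probability weighting, one basic building block behind targeted learning. All data below are simulated, with a known true effect of 2.0.

```python
import random

# Toy illustration of causal estimation: inverse probability weighting (IPW)
# on simulated, confounded data. The true effect of treatment A is 2.0.
random.seed(0)

data = []
for _ in range(5000):
    x = random.random()                              # confounder
    a = 1 if random.random() < 0.2 + 0.6 * x else 0  # treatment depends on x
    y = 2.0 * a + 3.0 * x + random.gauss(0, 0.1)     # outcome
    data.append((x, a, y))

def propensity(x):
    # Here we use the known treatment model; in practice it is estimated.
    return 0.2 + 0.6 * x

# IPW estimates of E[Y(1)] and E[Y(0)], and their difference (the ATE).
n = len(data)
ey1 = sum(a * y / propensity(x) for x, a, y in data) / n
ey0 = sum((1 - a) * y / (1 - propensity(x)) for x, a, y in data) / n
ate = ey1 - ey0
print(round(ate, 2))  # close to the true effect, 2.0
```

A naive comparison of treated versus untreated means would be biased upward here, because treated individuals tend to have larger x; reweighting by the propensity removes that confounding.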

108


3 T15 LM007079-24S1

Training in Biomedical Informatics at Columbia University Hripcsak, George M hripcsak@columbia.edu Columbia University Health Sciences www.dbmi.columbia.edu The Columbia Biomedical and Health Data Science Track is part of a long-standing, NLM-funded T32 training program at the Columbia Biomedical Informatics Department. The vision for the Track is to shape and advance the evolving discipline at the intersection of data science and biomedicine & health. As such, its primary goals are: (1) Develop and maintain a flexible, in-depth curriculum to train pre-doctoral students in biomedical and health data science. With the explosion of biomedical knowledge and health-related data coming from the literature, the Internet, and the electronic health record, there is a vital need and an exciting opportunity to develop new methods that incorporate massive amounts of data and existing biomedical knowledge to derive new insights, and then to use these insights to improve human health. (2) Instill principles of rigor and reproducibility in the context of data-driven methodologies. (3) Instill in students the methodological principles of “doing” biomedical and health data science as part of the biomedical and health ecosystems through “flipped” classrooms and project-based courses. And (4) Create and foster an interdisciplinary community of biomedical and health data scientists with meaningful ties within the Track, as well as collaboration ties within and outside the university. The Track was established in Fall 2015. The BD2K T32 supplement supports the research of four doctoral students within the Track. Beyond their training activities, the doctoral students interact with researchers in biomedical informatics, computer science, statistics, and biostatistics, as well as a network of clinical researchers. 
A unique aspect of the Columbia training program is its close partnership with academic entities, such as the Columbia University Data Science Institute, for students to gain in-depth skills in data-driven work, and entities with real-world impact, such as NewYork-Presbyterian Hospital and the Observational Health Data Sciences and Informatics (OHDSI) international collaborative. Keywords: Artificial intelligence, Data mining, Databases/data warehousing, Information science, Bayesian Methods, Machine/statistical learning, Medical informatics, Nonlinear Dynamics, Predictive analytics, Visualization, Social and Behavioral Sciences, Systems Biology, Clinical Research, Computational Biology

109


T15 Grants

110


BD2K Training Effort Grant Information

BD2K Centers (U54) Grants

111


U54 Grants

1 U54 HG008540-01

Center for Causal Modeling and Discovery of Biomedical Knowledge from Big Data Cooper, Gregory F; Bahar, Ivet; Berg, Jeremy gfc@pitt.edu University of Pittsburgh at Pittsburgh http://www.ccd.pitt.edu/ Abstract: Much of science consists of discovering and modeling causal relationships that occur in nature. Increasingly big data are being used to drive such discoveries. There is a pressing need for methods that can efficiently infer causal networks from large and diverse types of biomedical data and background knowledge. This center of excellence will develop, implement, and evaluate an integrated set of tools that support causal modeling and discovery (CMD) of biomedical knowledge from very large and complex biomedical data. We also plan to actively share our knowledge, methods, and tools with others, through an innovative set of training and consortium activities. In the past 25 years, there has been tremendous progress in developing general computational methods for representing and discovering causal knowledge from data, based on a representation called causal Bayesian networks (CBNs). These methods have been applied successfully in a wide range of fields, including medicine and biology. While much progress has been made in the development of these computational methods, they are not readily available, sufficiently efficient, nor easy to use by biomedical scientists, and they have not been reconfigured to exploit the increasingly Big Data available for analysis. This Center will make these methods widely available, highly efficient when applied to big datasets, and easy to use. The proposed Center will provide a powerful set of concepts and tools that accelerate the discovery and sharing of causal knowledge derived from very large and complex biomedical datasets. 
The approaches and products emanating from this center of excellence are likely to have a significant positive impact on our understanding of health and disease, and thereby on the improvement of human health. Public Health Relevance: This center of excellence will develop, implement, and evaluate an integrated set of tools that support causal modeling and discovery (CMD) of biomedical knowledge from very large and complex biomedical data. The approaches and products emanating from this center of excellence are likely to have a significant positive impact on our understanding of health and disease, and thereby on the improvement of human health. Academia; Address; Advanced Development; Algorithmic Software; Algorithms; Automobile Driving; Big Data; Biology; Biomedical Research; biomedical scientist; career; Code; Collaborations; Complex; computer based statistical methods; Computer software; Computing Methodologies; Data; Data Set; Databases; design; Development; Disease; Education and Outreach; Educational process of instructing; Feedback; Goals; Government; graduate student; Health; Human; improved; Industry; innovation; Knowledge; Lead; Location; Logic; Medicine; meetings; method development; Methods; Mission; Modeling; Monitor; Nature; Policies; programs; public health relevance; Qualitative Evaluations; Quantitative Evaluations; Research; Research Infrastructure; Research Personnel; Science; Scientist; Site; Software Tools; Staging; symposium; TimeLine; tool; Training; Work
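To give a flavor of the causal Bayesian network methods described above, the hypothetical sketch below (ours, not the Center's software) shows the conditional-independence signal that such discovery algorithms exploit: in a simulated chain X → Y → Z, X and Z are correlated marginally but nearly uncorrelated once Y is conditioned on, which partial correlation makes visible. All coefficients and data are invented for illustration.

```python
import random, math

# Simulate the causal chain X -> Y -> Z.
random.seed(1)
xs, ys, zs = [], [], []
for _ in range(20000):
    x = random.gauss(0, 1)
    y = 0.8 * x + random.gauss(0, 1)   # Y caused by X
    z = 0.8 * y + random.gauss(0, 1)   # Z caused by Y
    xs.append(x); ys.append(y); zs.append(z)

def corr(u, v):
    # Pearson correlation from first principles.
    n = len(u)
    mu, mv = sum(u) / n, sum(v) / n
    cov = sum((a - mu) * (b - mv) for a, b in zip(u, v)) / n
    su = math.sqrt(sum((a - mu) ** 2 for a in u) / n)
    sv = math.sqrt(sum((b - mv) ** 2 for b in v) / n)
    return cov / (su * sv)

r_xz, r_xy, r_yz = corr(xs, zs), corr(xs, ys), corr(ys, zs)
# Partial correlation of X and Z given Y:
r_xz_given_y = (r_xz - r_xy * r_yz) / math.sqrt((1 - r_xy**2) * (1 - r_yz**2))
print(round(r_xz, 2), round(r_xz_given_y, 2))  # dependence vanishes given Y
```

Constraint-based discovery algorithms run many such tests over candidate conditioning sets to recover which network structures are consistent with the data.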

112


1 U54 AI117924-01

The Center for Predictive Computational Phenotyping Craven, Mark craven@biostat.wisc.edu University of Wisconsin - Madison http://cpcp.wisc.edu/ The Center for Predictive Computational Phenotyping is developing innovative computational and statistical methods and software for a broad range of problems that can be cast as computational phenotyping. The term phenotype, which is derived from the Greek word phainein meaning “to show,” refers to the observable properties of an organism that result from the interaction of its genotype and its environment. Some phenotypes are easily measured and interpreted, and are available in an accessible format. However, a wide range of scientifically and clinically important phenotypes do not satisfy these criteria. In such cases, computational phenotyping methods are required either to extract a relevant phenotype from a complex data source or collection of heterogeneous data sources, or to predict clinically important phenotypes before they are exhibited. The Center is investigating how to exploit a wide array of data types for these tasks, including molecular profiles, medical images, electronic health records, and population-level data. The Center is also providing training in biomedical Big Data analysis to scientists and clinicians, and it is investigating the bioethical issues surrounding the technology being developed. Publications: http://onlinelibrary.wiley.com/doi/10.1111/rssb.12131/abstract, http://dl.acm.org/citation.cfm?doid=2939672.2939715, http://www.cv-foundation.org/openaccess/content_cvpr_2016/html/Hwang_Coupled_Harmonic_Bases_CVPR_2016_paper.html Artificial intelligence, Databases/data warehousing, Image Analysis, Bayesian Methods, Machine/statistical learning, Mathematical statistics, Medical informatics, Statistical Analysis, Visualization, Data analysis, Genetics and Genomics, Neuroscience, Cancer, Clinical Research
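As a minimal, purely illustrative sketch of the phenotype-extraction task described above (the codes, criteria, and threshold are hypothetical, not a validated algorithm or the Center's software), a rule-based computational phenotype combines heterogeneous record fields into a single case/non-case label:

```python
# Hypothetical rule-based phenotype: a patient is a case on an explicit
# diagnosis code, or on combined medication + laboratory evidence.
def has_diabetes_phenotype(record):
    coded = "E11" in record.get("icd10", [])            # diagnosis code
    treated = "metformin" in record.get("medications", [])
    lab_high = record.get("hba1c", 0.0) >= 6.5          # percent
    return coded or (treated and lab_high)

patients = [
    {"id": 1, "icd10": ["E11"], "medications": [], "hba1c": 5.4},
    {"id": 2, "icd10": [], "medications": ["metformin"], "hba1c": 7.1},
    {"id": 3, "icd10": [], "medications": ["metformin"], "hba1c": 5.6},
]
cases = [p["id"] for p in patients if has_diabetes_phenotype(p)]
print(cases)  # [1, 2]
```

The predictive side of computational phenotyping replaces such hand-written rules with statistical models learned from many such records.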

113



114


1 U54 EB020405-01

Mobility Data Integration to Insight Delp, Scott delp@stanford.edu Stanford University http://mobilize.stanford.edu The mission of the Mobilize Center is to overcome the data science challenges facing mobility big data and biomedical big data in general to improve human movement across the wide range of conditions that limit mobility, including cerebral palsy, osteoarthritis, obesity, limb loss, running injuries, and stroke. Our training program leverages the world-class data science training resources that are already available at Stanford University and seeks to develop a group of students and researchers with the skills to develop and apply data science tools to extract meaningful clinical and biological insights from the growing number of biomedical datasets being generated today. To achieve our training mission, we offer a range of programs and resources targeted for different audiences. Below we highlight some of these. In-depth, in-person training opportunities: In addition to supporting graduate students, the Mobilize Center offers a Distinguished Postdoctoral Fellowship program and will launch a visiting scholars program later this year. These programs aim to transform individuals into interdisciplinary researchers, providing mentorship by both clinical/biomechanical and data science advisors and fostering cross-disciplinary discussions and mindsets through research meetings and reading groups. Our current cohort of trainees consists of ten individuals, split evenly between those with a data science background and those with a biological/biomechanical background. This diverse mix of backgrounds creates a rich environment for interdisciplinary learning. The Mobilize Center has also provided internships to three computer science students from San Francisco State University, an underrepresented minority-serving institution. Massive open online courses (MOOCs): Via MOOCs, we have been able to reach tens of thousands of students with a single course. 
During this past year, we offered MOOCs on Mining Massive Data (https://www.coursera.org/course/mmds, co-taught by Mobilize faculty member Jure Leskovec) and Statistical Learning (https://www.coursetalk.com/providers/stanford-online/courses/statistical-learning, co-taught by Mobilize faculty member Trevor Hastie). Together, these two courses enrolled more than 50,000 individuals. Conferences and seminars: We host conferences, workshops, symposia, and a weekly seminar to build knowledge and encourage cross-fertilization. The symposium Research at the Intersection of Data Science and Biomechanics, which we organized at the International Society of Electrophysiology and Kinesiology 2016 Annual Meeting, introduced human movement researchers to the power of data analytics. In December 2016, we will run a workshop on Machine Learning for Health at the Annual Conference on Neural Information Processing Systems to reach out to the machine learning community. We also co-sponsored an extremely successful first Women in Data Science Conference (http://www.widsconference.org/), with nearly 400 attendees and thousands viewing the live-stream. Recordings from this and other events are available online.

115


A full list of conferences, workshops, and seminars is available at http://mobilize.stanford.edu/events/. Biomedical Computation Review (BCR) magazine: We publish BCR (http://bcr.org), which highlights research at the intersection of biology, medicine, and computation for a scientifically literate but non-specialized audience. Each issue also includes a tutorial on topics such as low rank models and privacy-protecting algorithms. The magazine currently has 3700+ subscribers to the print edition, and an average of 18,000 unique individuals visit the magazine’s website each month. Data mining, Graph Theory, Machine/statistical learning, Predictive analytics, Scientific Workflow Environments, Spatio-temporal Modeling, Statistical Analysis, Computer science, Data analysis, Biomedical Engineering and Biophysics, Social and Behavioral Sciences

The inaugural Women in Data Science conference featured a career panel (above) and talks at the intersection of data science and biomedicine from speakers such as Jennifer Tour Chayes of Microsoft Research, Bin Yu of U.C. Berkeley, and Diane Bryant of Intel. Recordings from the event are available at https://www.youtube.com/watch?v=xxvPS1XvxcE&list=PL3FW7Lu3i5Jsss0NcQLUrHfeOScwhbrgT

116


1 U54 HL127366-01

Broad Institute LINCS Center for Transcriptomics Golub, Todd R; Subramanian, Aravind golub@broadinstitute.org Broad Institute, Inc. https://www.broadinstitute.org/ Abstract: The overall goal of this LINCS Center is to generate large-scale pharmacological and genetic perturbations of a large panel of cell types, to develop analytical tools that enable researchers to make biological discoveries from the data, and to make those data and tools broadly available to the research community. We will accomplish this using the L1000 transcriptional profiling method developed by our Center, and by creating a cloud-based computing infrastructure that supports data APIs and web apps designed to make it possible for biologists lacking computational expertise to make LINCS-based discoveries. Public Health Relevance: The proposed project is expected to have significant impact across a broad range of the biomedical research community. It has the potential to yield new approaches to genome functional annotation, to provide a path toward the elucidation of mechanism-of-action of small-molecule compounds, and to facilitate the discovery of drugs with unanticipated therapeutic effects on disease biology. By including a significant outreach and training component it will make researchers from across many biomedical institutions conversant with applying the resources generated to these important problems in health science. 
analytical tool; base; Biological; Biology; Biomedical Research; cell type; cloud based; Clustered Regularly Interspaced Short Palindromic Repeats; Communities; cost; Data; design; Development; Disease; Education and Outreach; Gene Expression; Genes; Genetic; Genome; Goals; Health Sciences; Institutes; Institution; Internet; knock-down; Knock-out; Link; Maps; Methods; mRNA Expression; novel strategies; Open Reading Frames; overexpression; Pathway interactions; Pharmaceutical Preparations; Phase; Physiological; Physiology; programs; protein function; public health relevance; Reading; repository; Research; Research Infrastructure; Research Personnel; Resources; scale up; Scientist; small hairpin RNA; small molecule; Source; Therapeutic Effect; tool; transcriptomics; United States National Institutes of Health; user-friendly; Work
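As a hedged sketch of the kind of analysis such signature resources enable (this is not the Center's actual L1000 pipeline; the gene names, values, and scoring choice are invented for illustration), a query expression signature can be scored against stored perturbation signatures with a simple similarity measure:

```python
import math

# Hypothetical perturbation signatures over a shared gene space
# (log fold-changes for four made-up genes).
perturbations = {
    "compound_X": [2.1, -1.5, 0.3, -0.2],
    "compound_Y": [-2.0, 1.4, -0.1, 0.5],
}

def cosine(u, v):
    # Cosine similarity between two signature vectors.
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv)

query = [1.8, -1.2, 0.4, 0.0]  # hypothetical disease signature
scores = {name: round(cosine(query, sig), 2)
          for name, sig in perturbations.items()}
best = max(scores, key=scores.get)
print(best, scores[best])  # compound_X mimics the query signature
```

A strongly negative score, as compound_Y gets here, is often the more interesting finding in practice: a perturbation that reverses a disease signature is a candidate therapeutic lead.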

117


1 U54 HG007990-01

Center for Big Data in Translational Genomics Haussler, David H haussler@soe.ucsc.edu University of California Santa Cruz https://genomics-old.soe.ucsc.edu/bd2k Abstract: The Center for Big Data in Translational Genomics is a multi-institution partnership coordinated by the University of California at Santa Cruz to create scalable infrastructure for the broad application of genomics in biomedicine. Our U.S. partners include UC San Francisco, UC Berkeley, Oregon Health Science University, Caltech, and several major big data companies. International partners include the European Bioinformatics Institute, the Sanger Centre, the Ontario Institute for Cancer Research and a computer systems provider. The Center will make software solutions interoperable through the development of standard Application Programming Interfaces (APIs) and tools at multiple levels, from raw sequence data to genetic variation and functional data, through to systems, pathways and phenotypes. The overriding goal is to create implementations capable of handling genomics datasets that are orders of magnitude larger than those that can now be handled. The APIs and all academic reference implementations will be open source, while several major corporate partners not funded by the project will provide proprietary implementations, creating a competitive ecosystem of interoperable big data genomics software. Extensive, all-comers benchmarking will be performed on all implementations within and external to our center to identify best-of-breed solutions, and results will be made broadly available. Design will be in part driven by the needs of a diverse set of separately funded specific biomedical projects that will serve as pilots. 
These include the Pan-Cancer whole genome analysis project of the International Cancer Genomics Consortium to analyze 2,000 cancer genomes, the UK10K project to analyze 10,000 personal genomes, the UCSF-led I-SPY2 adaptive breast cancer trial, and the omics-guided leukemia therapy project BeatAML at Oregon Health Sciences University. Public Health Relevance: At least half of all diseases have a substantial genomic component, often including contributions from the millions of individually rare but collectively common genetic variations. Only by studying the genomes and transcriptomes of very large numbers of individuals will scientists have the statistical power to discover and understand this vital aspect of the genomic contribution to disease. For this it is essential that genomics is brought into the big data era, so that analyses of huge datasets are possible and precision diagnosis and treatment based on genomic information is widely deployed. anticancer research; base; Benchmarking; Big Data; Bioinformatics; Breeding; California; cancer genome; cancer genomics; clinical practice; Communities; Complex; Computer software; Computer Systems; Data; data modeling; Data Set; design; Development; Diagnosis; Disease; Ecosystem; European; Funding; Gene Expression Profile; Genome; genome analysis; Genomics; Genotype; Goals; Health; Health Sciences; Human; Individual; Institutes; Institution; International; leukemia; malignant breast neoplasm; Malignant Neoplasms; Ontario; open source; Oregon; Pathway interactions; Phenotype; programs; Provider; public health relevance; Research; Research Infrastructure; Research Project Grants; San Francisco; Scientist; Solutions; System; tool; Translational Research; Universities; Variation (Genetics)

118


1 U54 EB020404-01

Center of Excellence for Mobile Sensor Data-to-Knowledge (MD2K) Kumar, Santosh skumar4@memphis.edu University of Memphis https://md2k.org; https://mhealth.md2k.org The Center of Excellence for Mobile Sensor Data-to-Knowledge (MD2K) brings together thought leaders in Computer Science, Engineering, Medicine, Behavioral Science, and Statistics from 12 universities and 1 non-profit. MD2K focuses on developing innovative tools to enable health researchers to readily gather, analyze and interpret health data captured by mobile and wearable sensors. MD2K’s main goal is to develop big data solutions that reliably quantify, in real time, changes in an individual’s health state, and identify key physical, biological, behavioral, social, and environmental factors that contribute to health and disease risk. Such monitoring will help optimize care delivery via just-in-time mobile health (mHealth) interventions. In its efforts to improve health through prevention and early detection of adverse health events, MD2K directly targets two complex and costly health conditions - reducing hospital readmission in congestive heart failure (CHF) patients and preventing relapse in abstinent smokers. Approaches developed for these two conditions will also be applicable to other diseases, such as asthma, substance abuse and obesity. The MD2K tools, software, and training materials are broadly distributed through its website and supplemented by workshops and webinars. The team’s 27 faculty members, geographically dispersed from coast to coast, have fostered a collaborative Center culture based on multidisciplinary team science. Nine staff members at the centralized MD2K Center in Memphis, 2 post-doctoral candidates, 17 funded graduate students, 7 NIH program officers, and numerous other contributors are involved with the progress of MD2K. 
In its first two years, MD2K has developed new biomarkers of smoking lapse, eating, pupil dilation and stress from smartwatch sensor data and smart eyeglasses, as well as electrocardiogram and respiration sensors. These will be further developed in field tests this year, and new biomarkers will be developed through the EasySense sensor (lung fluid and heart motion), smart eyeglasses (fatigue, context and cue exposure), smartwatch (drinking, brushing, flossing), and geoexposure from GPS data (tobacco stores, fast food restaurants).

119



Other areas of research with documented progress include time series analytics of biomarkers to identify and explain between- and within-person variability in stress patterns. MD2K has also designed, from the ground up, an end-to-end and fully open source software platform for collecting, storing, analyzing, and interpreting high-frequency mobile sensor data for health. The MD2K platform can collect and analyze data from numerous wearable sensors via a wide array of wireless radios. It also provides native support for triggering notifications, self-report prompts, and interventions based on real-time values of digital biomarkers derived from sensor data. The mobile phone component of the MD2K software platform is called mCerebrum and its cloud counterpart is called Cerebral Cortex. Studies are now underway at The Ohio State University Medical Center (congestive heart failure) and Northwestern (smoking cessation). Eventually, datasets from all MD2K studies will be available to mHealth researchers for use in their own studies. As part of its focus on training, MD2K sponsors webinars featuring mHealth researchers that are curated at MD2K’s mHealthHUB along with lectures from MD2K’s co-sponsored mHealth Training Institute. Publications: 10.1145/2858036.2858218, 10.1145/2750858.2807526, 10.1145/2750858.2806897 Data engineering, Data mining, Machine/statistical learning, Medical informatics, Pattern recognition and learning, Predictive analytics, Statistical Analysis, Cloud Computing, Computer science, Data analysis, Health Disparities, Social and Behavioral Sciences
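To make the notion of a digital biomarker concrete, the sketch below (ours, not MD2K's mCerebrum code; the readings are invented) computes RMSSD, a standard heart-rate variability measure often used as a stress proxy, from beat-to-beat (RR) intervals such as a wearable sensor might report:

```python
import math

def rmssd(rr_intervals):
    # Root mean square of successive differences of RR intervals (ms).
    diffs = [b - a for a, b in zip(rr_intervals, rr_intervals[1:])]
    return math.sqrt(sum(d * d for d in diffs) / len(diffs))

calm = [810, 790, 820, 800, 830, 795]      # variable beat-to-beat timing
stressed = [650, 652, 649, 651, 650, 652]  # fast, rigid rhythm

calm_score, stress_score = rmssd(calm), rmssd(stressed)
print(round(calm_score, 1), round(stress_score, 1))
# A drop in RMSSD could trigger a just-in-time intervention prompt.
```

In a real deployment such a score would be computed over sliding windows of streaming sensor data and compared against a per-person baseline before triggering anything.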

120


1 U54 HL127624-03

BD2K-LINCS Data Coordination and Integration Center Ma’ayan, Avi avi.maayan@mssm.edu Icahn School of Medicine at Mount Sinai http://lincs-dcic.org/#/training The Data Coordination and Integration Center (DCIC) for the Library of Integrated Network-based Cellular Signatures (LINCS) program is one of the U54 Big Data to Knowledge (BD2K) Centers of Excellence. The Center has four major components: Integrated Knowledge Environment (IKE), Data Science Research (DSR), Consortium Coordination and Administration (CCA), and Community Training and Outreach (CTO). The CTO component of our center engages, informs and educates key biomedical research communities about LINCS data and resources. The CTO efforts have established several educational programs including a LINCS massive online open course (MOOC), a summer undergraduate research program, the initiation and support of diverse collaborative projects leveraging LINCS resources, and the systematic dissemination of LINCS data and tools via a variety of mechanisms. We use innovative crowdsourcing and outreach mechanisms to harness expertise of the wider Data Science and software development communities for benefiting the LINCS community. Recent activities of our CTO component include: 1. A LINCS-related MOOC: Big Data Science with the BD2K-LINCS Data Coordination and Integration Center https://www.coursera.org/learn/bd2k-lincs designed to train biomedical researchers with LINCS-related experimental methods, datasets, and computational tools. 2. Our second cohort of students for our BD2K-LINCS DCIC Summer Research Training Program in Biomedical Big Data Science. This is a research-intensive ten-week training program for undergraduate and graduate students. 3. Establishment of the webinar series LINCS Data Science Research Webinars which provides a forum for data scientists within and outside of the LINCS program to present their work on problems related to LINCS data analysis and integration. 
These webinars are posted on our YouTube channel at: https://www.youtube.com/playlist?list=PL0Bwuj8819U-G9Ob0jIGHp5AtwpCghLV5 4. The development of the BD2K-LINCS DCIC Crowdsourcing Portal to bring awareness to LINCS data, extract signatures from external public repositories, and explain the efforts of LINCS to the general public. Our crowdsourcing portal engages the research community in various micro- and mega-tasks. 5. Engagement in collaborative external data science research projects which focus on mining and integrating data generated by the LINCS program for new scientific discovery. 6. Hosting of symposia, seminars and workshops to bring together the DCIC and researchers who utilize LINCS resources. 7. The development and maintenance of the lincs-dcic.org and lincsproject.org websites as well as an active presence on various social media platforms including YouTube, Google+, and Twitter. In achieving our milestones, we established comprehensive education, outreach, and training programs aimed at scientific communities that can benefit from LINCS data and tools. We expect that the LINCS community will continue to grow into a resourceful network that brings together researchers across

121



disciplines and organizations. The innovative data management solutions that we bring to LINCS can be leveraged for many other resources and become widespread. Publications: 10.1038/npjsba.2016.15, 10.1093/bioinformatics/btw168, 10.1093/nar/gkw377 Data mining, Graph Theory, Bayesian Methods, Machine/statistical learning, Multivariate Methods, Probability, Statistical Analysis, Visualization, Computer science, Data analysis, Biomedical Engineering and Biophysics, Genetics and Genomics, Systems Biology, Cell Biology, Computational Biology

122


1 U54 AI117925-01

Center for Expanded Data Annotation and Retrieval Musen, Mark A musen@stanford.edu Stanford University http://metadatacenter.org The Center for Expanded Data Annotation and Retrieval is studying the creation of comprehensive and expressive metadata for biomedical datasets to make data more findable, accessible, interoperable, and reusable. We take advantage of emerging community-based standard templates for describing different kinds of biomedical datasets, and we investigate the use of computational techniques to help investigators to assemble templates and to fill in their values. We are creating a repository of metadata from which we plan to identify metadata patterns that will drive predictive data entry when filling in metadata templates. The metadata repository not only will capture annotations specified when experimental datasets are initially created, but also will incorporate links to the published literature, including secondary analyses and possible refinements or retractions of experimental interpretations. By working initially with the Human Immunology Project Consortium and the developers of the ImmPort data repository, we are developing and evaluating an end-to-end solution to the problems of metadata authoring and management that will generalize to other data-management environments. Publications: 10.1093/jamia/ocv048, 10.1093/database/baw080, 10.1038/sdata.2016.18 Artificial intelligence, Data engineering, Data mining, Databases / Data warehousing, Ontology Design

123



124


1 U54 HG007963-01

Patient-Centered Information Commons: Standardized Unification of Research Elements (PIC-SURE) Kohane, Isaac S isaac_kohane@harvard.edu Harvard Medical School http://www.pic-sure.org/ Abstract: We propose to create a massively scalable toolkit to enable large, multi-center Patient-centered Information Commons (PIC) at local, regional, and national scale, where the focus is the alignment of all available biomedical data per individual. Such a Commons is a prerequisite for conducting the large-N, Big Data, longitudinal studies essential for understanding causation in the Precision Medicine framework while simultaneously addressing key complexities of Patient-Centered Outcomes Research studies required under the ACA (Affordable Care Act). This agenda entails the four following aims: Aim 1: Create an individual patient data identification and retrieval toolkit that is robust across distributed data that are widely varied and geographically scattered. Robustness with regard to a variety of organizational structures and national scalability is emphasized. Aim 2: Generate a complete diagnostic and prognostic ‘data’ picture of a patient across multiple sources of data, some of which are noisy and sparse. Aim 3: Enable robust decentralized computation on large-scale data with the Patient-centered Information Commons Big Data Science Platform (PIC-DSP), particularly in configurations where data are generated in locations other than where computational resources are most available. Aim 4: Create three patient-centered information commons instances (PICIs) to test all aspects of the toolkit developed. We have selected neurodevelopmental disorders as our first PICI, as it fulfills several criteria (wide variety of data types and scales, collaborator engagement, multiple healthcare institutions, and opportunity to rigorously test and refine features of the tool). 
Public Health Relevance: The proposed Patient-centered Information Commons will allow investigators to link and analyze patient-level data on a large scale, in both population size and data variety: from clinical health records to prospectively gathered research data, survey and administrative data, and genomic, imaging, socio-behavioral, and environmental data. This will allow these investigators to achieve new levels of precision in diagnosis and prognosis, as well as in measuring the conduct and quality of medical practice. Project Terms: Address; Authorization documentation; base; Behavioral; Benchmarking; Big Data; Caring; Clinical; Clinical Data; Collection; Comorbidity; Complex; computing resources; Consent; Data; Data Analyses; Data Set; Data Sources; Databases; Diagnosis; Diagnostic; Disease; Disease Progression; distributed data; Environment; Environmental Risk Factor; Epidemiology; Equipment and supply inventories; Etiology; flexibility; Genomics; health record; Healthcare; high risk; Image; improved; Individual; Institution; Knowledge; Link; Location; Longitudinal Studies; Manuals; Measures; Medial; Medical; Medicine; Metadata; Methods; Metric; Modeling; Neurodevelopmental Disorder; novel; Ontology; organizational structure; outcome forecast; Outcomes Research; patient oriented; Patients; Pattern; Phenotype; Population Sizes; predictive modeling; prognostic; Registries; Research; Research Infrastructure; Research Personnel; research study; Retrieval; Risk; Risk Assessment; Science; Site; social; Source; statistics; Stream; Structure; Subgroup; Surveys; System; Techniques; Testing; tool; Translating; Uncertainty; web interface
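PIC-SURE's central idea — aligning all available data per individual across heterogeneous, distributed sources — can be illustrated with a toy merge. This is a sketch of the concept only, not the project's toolkit; the source names, patient IDs, and records are invented:

```python
def align_by_patient(sources):
    # sources maps a source name (e.g., an EHR system, an imaging archive)
    # to {patient_id: record}. Returns one merged, per-patient view across
    # all sources; data missing from a source simply stays absent.
    commons = {}
    for source_name, records in sources.items():
        for patient_id, record in records.items():
            commons.setdefault(patient_id, {})[source_name] = record
    return commons

# Hypothetical inputs: two sources, overlapping but incomplete coverage.
sources = {
    "ehr":     {"p1": {"dx": "asthma"}, "p2": {"dx": "T2D"}},
    "imaging": {"p1": {"mri": "scan-001"}},
}
view = align_by_patient(sources)
```

Here `view["p1"]` unifies the EHR and imaging records for one individual, while `view["p2"]` carries only EHR data — mirroring the noisy, sparse coverage Aim 2 anticipates. The real system must additionally handle identity resolution, consent, and decentralized computation.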

125


1 U54 GM114833-01

A Community Effort to Translate Protein Data to Knowledge: An Integrated Platform Ping, Peipei pping@mednet.ucla.edu University of California Los Angeles http://www.heartbd2k.org/ Abstract: A critical challenge in Big Data science is the overall lack of data analysis platforms available for transforming Big Data into biological knowledge. To address this challenge, we propose a set of interconnected computational tools capable of organizing and analyzing heterogeneous data to support combined inquiries and to de-convolute complex relationships embedded within large-scale data. We demonstrate its utility with a cardiovascular-centric platform that is easily generalizable to similar efforts in other disciplines. Our Center has designed a federated data architecture of existing resources substantiated by a solid and growing user base, and innovations to elevate functionality. Novel crowdsourcing and text-mining methods will extract the wealth of untapped knowledge embedded in biomedical literature, and novel in-depth proteomics analytical tools will elucidate dynamic protein features in unprecedented depth. A key strength of our platform will be the rigorous validation using clinical data from the Jackson Heart Study and the Healthy Elderly Active Longevity (HEAL; Wellderly) cohorts. Our proposal includes nine scientific aims that address three main focus areas: (i) we will build a new model platform that amalgamates community-supported Big Data resources, enabling data annotations and collaborative analyses; (ii) we will integrate molecular data with drug and disease information, both structured and unstructured, for knowledge aggregation; and (iii) we will create on-the-cloud analytical and modeling tools to power in-depth protein discoveries.
Specifically, we will create a novel distributed query system and cloud-based infrastructure that is capable of providing unified access to multi-omics datasets; we will develop computational and crowdsourcing methods to systematically define relationships between genes, proteins, diseases, and drugs from the literature, emphasizing cardiovascular medicine; we will rally community participation and promote awareness of collaborative research through outreach and educational games; we will create a platform to analyze and visualize multi-scale pathway models of genes, proteins, and metabolites; we will develop tools and algorithms to mechanistically model spatiotemporal protein networks in organelles and to predict higher physiological phenotypes; and we will correlate individual phenotypes, health histories, and multi-scale molecular profiles to examine cardiovascular disease mechanisms. These tools will be implemented, delivered, and executed on the cloud infrastructure to minimize the computational power required of users. Public Health Relevance: The challenges of biomedical Big Data are multifaceted. Every day, biomedical researchers face the daunting task of storing, analyzing, and distributing large-scale genomics and proteomics data, and aggregating all information to discern deeper meanings. Only through a coherent effort can we harness copious amounts of unruly genomics and proteomics data for transformation into testable hypotheses that can dovetail with all of scientific research. This Data Science Research Component is designed to address these challenges.
Project Terms: Address; Algorithms; analytical tool; Architecture; Area; Awareness; base; Big Data; Biological; Cardiovascular Diseases; Cardiovascular system; Clinical Data; cloud based; Cloud Computing; cohort; Communities; Community Participation; Complex; computerized tools; Data; Data Aggregation; Data Set; design; Discipline; Disease; Elderly; Face; federated computing; Gene Proteins; Genomics; Health; Individual; innovation; Jackson Heart Study; Knowledge; Literature; Longevity; Medicine; Methods; Modeling; Molecular; Molecular Profiling; novel; Organelles; outreach; Pathway interactions; Pharmaceutical Preparations; Phenotype; Physiological; Protein Dynamics; protein metabolite; Proteins; Proteomics; Recording of previous events; Research; Research Infrastructure; Research Personnel; Resources; Science; Solid; spatiotemporal; Structure; System; text searching; tool; Translating; Validation
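The literature-mining aim above starts from detecting relationships (e.g., gene–disease) in text. A minimal co-occurrence counter — far simpler than the center's crowdsourcing and text-mining methods, with invented example abstracts — illustrates the starting point:

```python
from itertools import product

def comention_counts(abstracts, genes, diseases):
    # Count, for each (gene, disease) pair, the number of abstracts in which
    # both terms appear. Naive substring matching stands in for real
    # named-entity recognition and normalization.
    counts = {}
    for text in abstracts:
        lowered = text.lower()
        for gene, disease in product(genes, diseases):
            if gene.lower() in lowered and disease.lower() in lowered:
                counts[(gene, disease)] = counts.get((gene, disease), 0) + 1
    return counts

# Invented example abstracts, not real citations.
abstracts = [
    "MYH7 mutations are a common cause of hypertrophic cardiomyopathy.",
    "We studied TTN truncating variants in dilated cardiomyopathy.",
    "MYH7 expression in healthy myocardium.",
]
counts = comention_counts(abstracts, ["MYH7", "TTN"], ["cardiomyopathy"])
```

Each gene–disease pair co-mentioned in an abstract increments a count; pairs that frequently co-occur become candidate relationships for curation. Production systems replace substring matching with entity recognition, synonym resolution, and statistical scoring.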

126


1 U54 GM114838-01

KnowEnG, a Scalable Knowledge Engine for Large-Scale Genomic Data Song, Jun songj@illinois.edu University of Illinois, Urbana-Champaign http://knoweng.cse.illinois.edu/training/ One of the most effective ways of understanding hidden structure in Big Data is to cluster the samples and visualize the underlying structure in real time. Motivated by this idea, the KnowEnG center has developed Clustering Engine (ClusterEnG), an interactive educational and research resource that provides tutorials on the basic ideas behind, and potential pitfalls of, commonly used clustering algorithms. In addition, the tutorial provides two dynamic clustering modules that teach the user how clustering algorithms behave as new data points are added by hand. It also allows the user to upload and cluster custom data sets using six different clustering methods; parallel versions of k-means and spectral clustering are currently incorporated. The user can view and compare the results of clustering in the first three principal components, and can interact with the data in those components via embedded JavaScript. ClusterEnG can also retrieve and cluster data sets from Gene Expression Omnibus. The output is implemented in D3.js and allows the user to interact dynamically with the data, e.g., retrieving gene names and expression values with a mouse-over. To make the resource friendly to people with colorblindness, ClusterEnG has a customizable color scheme. Clustering results can also be exported with a simple mouse click. http://knowdev.cse.illinois.edu/education/algorithms.php The second educational tool developed by the KnowEnG Center is Sequencing Engine (SequencEnG). High-throughput sequencing is currently one of the most influential methods for producing biological Big Data.
Graduate students spend much of their time searching through the literature and web resources to learn how to perform specific experiments, compare different approaches, find optimized protocols, troubleshoot failures, and analyze the data. This ad hoc approach, particularly the time-consuming process of collecting relevant information, significantly slows training in high-throughput technologies. We addressed these

127



challenges by developing a cyber platform called SequencEnG, a method-indexing machine and an automated recommendation system for high-throughput sequencing methods (both experimental and computational components). This web resource will: (1) compile and index a comprehensive list of state-of-the-art sequencing methods; (2) collate and mine associated scientific papers that describe the techniques; (3) make recommendations of established pipelines based on text-mining results and user-chosen criteria; and (4) train students in using analysis tools. We are developing interactive mini-modules for teaching students how to use common software packages in a Linux environment. Students can enter simple commands and view the outputs on our website. We have implemented a training module for running MACS, a popular software package for analyzing ChIP-seq data. http://knowdev.cse.illinois.edu/education/sequenceng/ Publications: 10.1093/bioinformatics/btw151, 10.1038/tpj.2015.74, 10.1093/jamia/ocv090 Keywords: Data mining, Machine/statistical learning, Scientific Workflow Environments, Cloud Computing, Data analysis, Genetics and Genomics, Cancer, Computational Biology
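ClusterEnG's core operation — partitioning samples with k-means before inspecting the clusters — can be sketched in plain Python. This is an illustrative reimplementation of the textbook algorithm, not the center's code, and the toy points are invented:

```python
import random

def kmeans(points, k, iters=50, seed=0):
    # Plain k-means: repeatedly assign each point to its nearest centroid,
    # then recompute each centroid as the mean of its assigned points.
    rng = random.Random(seed)
    centroids = rng.sample(points, k)

    def nearest(p, cents):
        return min(range(len(cents)),
                   key=lambda c: sum((a - b) ** 2 for a, b in zip(p, cents[c])))

    for _ in range(iters):
        clusters = [[] for _ in range(k)]
        for p in points:
            clusters[nearest(p, centroids)].append(p)
        new_centroids = []
        for j, members in enumerate(clusters):
            if members:
                dim = len(members[0])
                new_centroids.append(tuple(
                    sum(m[d] for m in members) / len(members) for d in range(dim)))
            else:
                new_centroids.append(centroids[j])  # keep an empty cluster's centroid
        if new_centroids == centroids:
            break  # converged
        centroids = new_centroids

    labels = [nearest(p, centroids) for p in points]
    return centroids, labels

# Two well-separated toy "expression profiles" per cluster.
points = [(0.0, 0.0), (0.1, 0.0), (10.0, 10.0), (10.0, 10.1)]
centroids, labels = kmeans(points, k=2)
```

The two tight blobs end up in separate clusters regardless of the seeded initialization. ClusterEnG additionally projects results onto the first three principal components for interactive viewing, which this sketch omits.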

128


1 U54 EB020406-01

Big Data for Discovery Science Toga, Arthur W toga@ini.usc.edu University of Southern California http://bd2k.ini.usc.edu/ Researchers at the Big Data for Discovery Science Center will focus on proteomics, genomics, and images of cells and brain collected from patients and subjects across the globe. They will enable detection of patterns, trends, and relationships among these data with user-focused data management, sophisticated computational methodologies, and leading-edge software tools for the efficient large-scale analysis of biomedical data. Interactive visualization tools created at this center will stimulate fresh insights and encourage the development of modern treatments and new cures for disease. Modern biomedical data acquisition, from genes to cells to systems, is producing exponentially more data due to increases in the speed and resolution of data acquisition methods. Yet “big data” is a moving target: what is considered big data today will be relatively “small data” tomorrow. Moreover, singularly large data sets arise from the efforts of single laboratories or are accumulated from a collection of more modest studies across common or heterogeneous study protocols. Simply having large-scale biomedical data and making it available online, however, is not an end in itself but only the next step in turning data into actionable knowledge. Our Big Data for Discovery Science (BDDS) Center has the following aims: 1) create a user-focused graphical system to dynamically create, modify, manage, and manipulate multiple collections of big datasets; 2) enrich next-generation “Big Data” workflow technologies coupled to modern computation and communication strategies specifically designed for large-scale biomedical datasets; 3) develop a knowledge discovery interface to enable modeling, visualizing, and interactive exploration of Big Data. In addition to these overarching aims, the goals of this BDDS Center include training and consortium activities.
Here we will create university-level degree programs in big data informatics, develop annual workshops on strategies for big data best practices, and contribute to national BD2K consortium efforts. The innovations of our BDDS Center include: 1) providing a novel data science framework for characterizing big data as a shared resource, either singularly or collectively; 2) deriving novel computer algorithms for the joint processing of multi-modal data, with an emphasis on the challenges that big data present for computation; 3) designing and deploying a unique, ontology-agnostic data management system focused on the user experience that is easy to use and puts the data first; 4) providing enhanced technologies for remote data access, scientific workflow construction, and cloud-based computation on big data sets; and 5) providing compelling means for big data set visualization, interaction, and hypothesis generation. Building on these technologies, we will construct and validate tools so that they may be translated to any biological system or biomedical research domain. Our team comprises leading neuroscience, biology, and computer science researchers, with expertise in large-scale biomedical data, experience with the present challenges and promise of big data, and a demonstrable history of delivering unique computational resources, thereby ensuring big data solutions which promote a “science of discovery”. PUBLIC HEALTH RELEVANCE: The overarching goal of our BDDS Center is to ease the management and organization of biomedical big data and accelerate data-driven discovery by eliminating or reducing three distinct barriers to effective discovery science: complexity with respect to physical distribution and heterogeneity, scalability of analysis, and ease of access and interaction with big data and associated analytic methods.
These issues are fundamental to discovery science and transcend the specifics of the research question as we span levels of scale from cells to organs to systems, and integrate data from imaging, genetics, “omics,” and phenotypes. Publications: McEligot, A. J., Behseta, S., Cuajungco, M. P., Van Horn, J. D., & Toga, A. W. (2015). Wrangling Big Data Through Diversity, Research Education and Partnerships. Californian Journal of Health Promotion, 13(3)

129

Keywords: Data mining, Differential Equation Modeling, Image Analysis, Bayesian Methods, Mathematical statistics, Population Genetics, Predictive analytics, Statistical Analysis, Visualization, Data analysis, Health Disparities, Neuroscience, Epidemiology


1 U54 EB020403-01

ENIGMA Center for Worldwide Medicine, Imaging, and Genomics Thompson, Paul M thompson@ini.usc.edu University of Southern California http://enigma.ini.usc.edu/ The ENIGMA Center for Worldwide Medicine, Imaging and Genomics will incorporate the scientific acumen of more than 300 scientists worldwide, and their biomedical datasets, in a global effort to combat human brain diseases. This center will develop computational methods for integration, clustering, and learning from complex biodata types. This center’s projects will help identify factors that either resist or promote brain disease, and those that help diagnosis and prognosis, and will also help identify new mechanisms and drug targets for mental health care. The ENIGMA Center for Worldwide Medicine, Imaging and Genomics is an unprecedented global effort bringing together 287 scientists and all their vast biomedical datasets, to work on 9 major human brain diseases: schizophrenia, bipolar disorder, major depression, ADHD, OCD, autism, 22q deletion syndrome, HIV/AIDS and addictions. ENIGMA integrates images, genomes, connectomes and biomarkers on an unprecedented scale, with new kinds of computation for integration, clustering, and learning from complex biodata types. ENIGMA, founded in 2009, performed the largest brain imaging studies in history (N>26,000 subjects; Stein +207 authors, Nature Genetics, 2012), screening genomes and images at 125 institutions in 20 countries. Responding to the BD2K RFA, ENIGMA’s Working Groups target key programmatic goals of BD2K funders across the NIH, including NIMH, NIBIB, NICHD, NIA, NINDS, NIDA, NIAAA, NHGRI and FIC. ENIGMA creates novel computational algorithms and a new model for Consortium Science to revolutionize the way Big Data is handled, shared and optimized.
We unleash the power of sparse machine learning, and high dimensional combinatorics, to cluster and inter-relate genomes, connectomes, and multimodal brain images to discover diagnostic and prognostic markers. The sheer computational power and unprecedented collaboration advances distributed computation on Big Data, leveraging US and non-US infrastructure, talents and data. Our projects will better identify factors that resist and promote brain disease, that help diagnosis and prognosis, and identify new mechanisms and drug targets. Our Data Science Research Cores create new algorithms to handle Big Data from (1) Imaging Genomics, (2) Connectomics, and (3) Machine Learning & Clinical Prediction. Led by world leaders in the field who developed major software packages (e.g., Jieping Ye/SLEP), we prioritize trillions of computations for gene-image clustering, distributed multi-task machine learning, and new approaches to screen brain connections based on the Partition Problem in mathematics. Our ENIGMA Training Program offers a world-class Summer School coordinated with other BD2K Centers, worldwide scientific exchanges, challenge-based workshops and hackathons to stimulate innovation, and web portals to disseminate tools and engage scientists in Big Data science. PUBLIC HEALTH RELEVANCE: The ENIGMA Center for Worldwide Medicine, Imaging and Genomics is an unprecedented global effort uniting 287 scientists from 125 institutions and all their vast biomedical data, to work on 9 major human brain diseases: schizophrenia, bipolar disorder, major depression, ADHD, OCD, autism, 22q deletion syndrome, HIV/AIDS and addictions. ENIGMA integrates images from multiple modalities, genomes, connectomes and biomarkers on an unimaginable scale, with new computations to integrate, cluster, and learn from complex biodata types. Publications: McEligot, A. J., Behseta, S., Cuajungco, M. P., Van Horn, J. D., & Toga, A. W. (2015).
Wrangling Big Data Through Diversity, Research Education and Partnerships. Californian Journal of Health Promotion, 13(3) Keywords: Data mining, Differential Equation Modeling, Image Analysis, Bayesian Methods, Mathematical statistics, Population Genetics, Predictive analytics, Statistical Analysis, Visualization, Data analysis, Health Disparities, Neuroscience, Epidemiology
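Consortium analyses of the kind ENIGMA pioneered typically pool per-site summary statistics rather than raw data. A standard fixed-effects, inverse-variance-weighted combination — a textbook meta-analysis method, not necessarily the center's exact pipeline, with invented site numbers — looks like this:

```python
def fixed_effects_meta(effects, variances):
    # Pool per-site effect estimates by weighting each with the inverse of
    # its variance; more precise sites contribute more to the pooled estimate.
    weights = [1.0 / v for v in variances]
    pooled = sum(w * b for w, b in zip(weights, effects)) / sum(weights)
    pooled_variance = 1.0 / sum(weights)
    return pooled, pooled_variance

# Hypothetical estimates of the same association from three sites.
pooled_b, pooled_v = fixed_effects_meta([0.10, 0.14, 0.12], [0.04, 0.02, 0.04])
# pooled_b == 0.125, pooled_v == 0.01
```

The pooled variance (0.01) is smaller than any single site's variance, which is why distributed consortia can detect effects no individual cohort could. Random-effects models add a between-site heterogeneity term on top of this.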

130


Index

analyze: 12, 18, 20, 23, 32-33, 47-48, 52, 69, 78, 97, 118-120, 125-127, 131
BD2K: 1-3, 5-6, 8-12, 14, 20-21, 25, 29, 36, 47, 53-55, 57, 59-60, 63, 65, 85, 89, 91-92, 94, 109, 111, 118, 121, 129-131
BD2KCCC: 2, 131
BD2K Centers of Excellence: 3, 131
Behavioral: 7-8, 38, 54, 64-65, 84, 87-88, 90, 105, 109, 116, 119-120, 125, 131
Big Data: 2, 9, 12, 14-21, 23, 25-29, 31, 33, 36-39, 41-43, 47-48, 50-52, 54-55, 57, 59, 61-67, 69-74, 76, 78-80, 82, 84-85, 87-90, 92-97, 101-108, 112-113, 115, 118-119, 121, 125-127, 129-131
bigdatau: 9, 12, 131, 135
bioCADDIE: 2, 131
biomedical: 2, 6-9, 12, 14-16, 20, 23, 27-28, 31-33, 36-37, 39, 41-43, 47-52, 54-55, 57, 59, 61-62, 71-74, 79-80, 90, 92-97, 99, 101-109, 112-113, 115-118, 121-123, 125-126, 129-131
biomedicine: 1-2, 12, 15, 51, 94, 106, 108-109, 116, 118, 131
cancer: 3, 7, 16, 32, 37, 43, 52, 59, 65, 69-70, 73-76, 81-82, 85, 93-94, 97, 99, 101-102, 113, 118, 128, 131
career path: 1, 6, 131
clinical: 7, 14, 21, 28-29, 31, 39, 43, 50-52, 61-65, 67, 70-76, 78, 81-85, 88, 93, 97, 103-104, 106-107, 109, 113, 115, 118, 125-126, 130-131
computation: 37, 50-51, 104, 107, 116, 125, 129-131
computational: 6-9, 15-17, 19-20, 23, 25-27, 31-33, 35-37, 39, 42, 47-48, 50-52, 54, 57, 59, 67, 69, 71-72, 76, 79, 87, 92-97, 99, 101-109, 112-113, 117, 121-123, 125-126, 128-131
computer science: 2, 8, 12, 15, 19-20, 32-33, 37-39, 42, 52, 54, 57, 59, 63-65, 70-72, 80, 84-85, 87, 92-96, 99, 101-102, 104-109, 115-116, 119-120, 122, 129, 131
core: 25, 36, 92-93, 95-97, 101, 104, 108, 131
curriculum: 20, 23, 27, 29, 31-32, 35, 43, 49, 51-52, 54-55, 88, 92-93, 96-97, 101-102, 104, 109, 131
data: 1-2, 6-10, 12, 14-21, 23, 25-29, 31-33, 35-39, 41-43, 45-52, 54-55, 57, 59, 61-67, 69-85, 87-90, 92-97, 99, 101-109, 112-113, 115-123, 125-131
databases: 1, 23, 35, 39, 43, 45, 74, 78, 80, 85, 87-88, 90, 93, 97, 107, 109, 112-113, 123, 125, 131
data-mining: 8, 36, 63, 131
diversity: 2, 5, 7, 33, 53, 55, 59, 76, 81-82, 102, 129-131
dR25: 1-2, 5, 7-8, 53, 56, 58, 131
drug: 38, 67, 85, 126, 130-131
education: 2, 7, 10, 12-16, 19-21, 23, 28, 31, 33, 36, 45, 52, 54-55, 57, 73-74, 97, 102, 108, 112, 117, 121, 127-130, 132
educational: 2, 7, 9-10, 12-13, 15, 18-20, 23, 27, 29, 31, 33, 35-36, 38, 41-43, 47, 49-52, 88, 107, 112, 121, 126-127, 132-133
electronic health record: 21, 109, 132
Elixir: 132
environmental: 8, 12, 47, 66, 105, 119, 125, 132
ERuDIte: 12, 132
evaluation: 13, 18, 31-32, 36, 49, 51, 85, 96, 132
events: 6, 9, 13, 75, 77, 81, 83, 97, 115-116, 119, 126, 132
funding: 7, 32, 37-38, 50, 62, 64-65, 72, 74, 89, 93-94, 107, 118, 132
genes: 1, 23, 70, 76, 85, 88, 117, 126, 129, 132
genomics: 1, 7-9, 12-14, 19-21, 23, 27-28, 31-33, 37-39, 41-42, 46-48, 50-51, 54, 57, 59, 66-67, 69-70, 74, 81-82, 85, 88, 92-94, 96-97, 99, 101-106, 113, 118, 122, 125-126, 128-130, 132
graduate: 6, 8, 21, 23, 25, 27, 29, 31, 33, 37-38, 41, 50, 52, 54, 59, 92-95, 97, 99, 103, 105, 107-108, 112, 115, 119, 121, 127, 132
grant: 2, 5-6, 11-12, 14, 19, 21, 37-38, 53, 60, 65, 71-72, 74, 85, 91-92, 101, 103, 108, 111, 132
H3Africa: 132
health: 1, 3-4, 12-14, 20-21, 27-29, 31-32, 36-39, 41, 43, 47, 49-52, 54-55, 59, 62-63, 65-66, 70, 72, 74, 76, 78-80, 82-90, 94, 97, 99, 103-105, 107-109, 112-113, 115, 117-120, 125-126, 129-130, 132-133, 135
health informatics: 12-13, 37-38, 89, 132
imaging: 7-8, 37, 43, 52, 66, 88, 103, 125, 129-130, 132, 135
indexing: 2, 13, 132
infectious diseases: 7, 59, 67, 79-80, 90, 102, 132
inference: 27, 32, 37, 63, 84, 108, 132
informatics: 1, 12-13, 17, 19, 21, 23, 29, 33, 36-39, 41, 43, 46-47, 51, 57, 59, 62, 64-65, 73-74, 79-80, 85, 88-89, 92-94, 97, 103, 105-107, 109, 113, 120, 129, 132, 135
innovation: 9, 16, 20, 27, 31, 51, 66, 70, 72, 112, 126, 130, 132
Institute: 3, 9, 27, 37-38, 43, 48, 76, 85, 87-88, 94, 96-97, 105, 108-109, 117-118, 120, 132, 135
instructor: 9, 27, 36, 38, 41, 45-46, 50, 132
K01: 1-3, 7, 60-90, 132
knowledge map: 12, 132
machine learning: 9, 32, 37-38, 48, 62, 70, 72, 77-80, 84, 101, 108, 115, 130, 132
medicine: 20, 27, 31, 33, 37-39, 43, 49, 51-52, 57, 73-74, 77-78, 80, 92-94, 97, 105, 107-108, 112, 116, 119, 121, 125-126, 130, 132, 135
metadata: 13, 21, 61-62, 123, 125, 132
mHealth: 9, 39, 119-120, 132
mobile health: 119, 132
modeling: 1, 13, 18, 26-27, 32, 36, 39, 41, 47, 55, 62, 66, 70-72, 78-80, 82, 84, 88, 90, 94, 96, 102, 112, 116, 118, 125-126, 129-130, 132
MOOCs: 2, 19, 32, 115, 132
movie: 133
network: 12, 27, 36, 47, 67, 71, 109, 121, 133
neuroimaging: 1, 19, 66, 88, 99, 133, 135
neurological: 17, 37, 83, 133
next generation: 2, 20, 31, 33, 37-38, 41, 43, 45, 51, 70, 93, 97, 102, 104-108, 129, 133
NIH: 1, 6, 9-10, 12, 20-21, 29, 47, 55, 59, 74, 87, 89, 102, 119, 130, 133
NSF: 133
nursing: 133
online: 2, 12, 14-18, 20, 28, 32-33, 35-36, 42-43, 45, 48-49, 59, 74, 85, 96, 115, 121, 129, 133
ontology: 13, 88, 90, 92, 123, 125, 129, 133
Open Educational Resources: 2, 29, 33, 42, 133
outreach: 13, 97, 112, 117, 121, 126, 133
partners: 57, 67, 102, 118, 133
personalized learning: 133
physicians: 14, 28, 52, 76, 133
postdoctoral: 8, 20, 27, 31, 41, 52, 79-80, 85, 115, 133
proteomics: 1, 50, 57, 85-86, 126, 129, 133
quantitative: 2, 17-18, 27, 31, 37, 39, 41, 51, 66, 83, 88, 92-93, 95-96, 104, 106, 112, 133
R25: 1-2, 4, 7-8, 14-20, 22-52, 54-55, 57, 59, 133
representation: 1, 39, 51, 89, 112, 133
research: 2, 6-7, 9-10, 12-14, 17-18, 20-21, 23, 25, 27-29, 31-33, 35-39, 41-43, 45, 47-52, 54-55, 57, 59, 62-66, 69-74, 76-83, 85-90, 93-97, 99, 101-109, 112-113, 115-118, 120-121, 125-127, 129-130, 133
researchers: 2, 8, 10, 12, 14-15, 18, 25, 27-28, 31, 33, 36, 38-39, 41, 45, 47, 49, 51-52, 54, 59, 65, 78-79, 87-89, 93, 97, 101, 104-105, 107-109, 115, 117, 119-121, 126, 129, 133
resource discovery: 12, 133
resources: 2, 12-14, 17, 20, 27-29, 31, 33, 35, 38, 41-42, 45, 52, 80-83, 89, 92, 115, 117, 121-122, 125-127, 129, 133
RoAD-Trip: 133
science rotations: 133
scientists: 2, 6, 12, 14-15, 19-21, 23, 25, 27-29, 33, 35, 37-39, 45, 47-48, 50-52, 54, 57, 74, 85, 97, 102, 104-109, 112-113, 118, 121, 130, 133
seminar: 57, 93, 95-96, 115, 133
sequencing: 19, 27, 31, 41-42, 45, 47, 69-70, 73, 75-76, 81, 85, 99, 127-128, 133
short courses: 2, 50, 134
skills: 1-2, 15, 18, 20-21, 25, 27, 31-33, 36-38, 41, 43, 45, 47-48, 51-52, 54, 57, 74, 79-80, 85-86, 88, 93-96, 101-102, 104-106, 108-109, 115, 134
social sciences: 105, 134
statistical: 13, 18-19, 26-27, 31-32, 37-39, 42-43, 48, 50, 52, 54-55, 57, 59, 62, 64-67, 69-72, 75-76, 79-80, 83-85, 87-88, 92-97, 99, 101-102, 104-109, 112-113, 115-116, 118, 120, 122, 128-130, 134
T15: 1-2, 4, 8, 91, 109-110, 134
T32: 1-2, 4, 8, 91-109, 134
TCC: 2, 6, 9, 12, 134
training: 1-9, 11-17, 19-20, 25, 27-28, 31, 33, 35-36, 38-39, 41, 43, 45, 47-55, 57, 59-63, 65-66, 71-72, 76, 79-82, 85-86, 88, 91-95, 97, 99, 101-109, 111-113, 115, 117, 119-121, 127-130, 134
U24: 1, 11-12, 134
U54: 1-2, 8, 111-130, 134
undergraduate: 2, 6, 8-9, 20-21, 23, 31, 37-38, 55, 57, 59, 92, 121, 134
visualization: 1, 18, 25-26, 32, 39, 48, 51, 55, 59, 70, 87, 90, 105, 109, 113, 122, 129-130, 134
webinar: 121, 134



USC Mark and Mary Stevens Neuroimaging and Informatics Institute
Keck School of Medicine of USC
University of Southern California
2025 Zonal Avenue
Los Angeles, CA 90033
Phone: (323) 44-BRAIN (442-7246)

For more information visit our website http://bigdatau.org

Funded by the National Institutes of Health

