Systems Biology Data Science Symposium programme

1st Annual Systems Biology Data Science Symposium | January 19-21, 2016 | MIAMI

1st Annual BD2K-LINCS DCIC

SYSTEMS BIOLOGY DATA SCIENCE SYMPOSIUM January 19 & 20, 2016 | Miami, Florida

Table of Contents

Agenda
Hackathon Topics
Dining Options
Resources
Poster Session
Posters
Logistics
Credits

Venues pictured: Mutiny Hotel, Lois Pope Life Center, Toppel Career Center, Gables One Tower


1st Annual Systems Biology Data Science Symposium

Tuesday, January 19, 2016
UM Miller School of Medicine Campus
Lois Pope Life Center, 7th Floor Auditorium
1095 NW 14th Terrace, Miami, FL 33136

7:30 AM

Shuttle departs lobby of Mutiny Hotel to Lois Pope Life Center (address above)

8:00–8:30 AM

Registration with Breakfast

8:30–8:40 AM

Welcome remarks and introduction to BD2K-LINCS DCIC
Stephan Schürer, PhD, University of Miami | BD2K-LINCS DCIC
Associate Professor, UM Department of Molecular and Cellular Pharmacology
Interim Program Director, Drug Discovery, UM Center for Computational Science

8:40–9:20 AM

Keynote Speech: Overcoming Cancer Cell Heterogeneity through Epigenetic Therapies
Stephen D. Nimer, MD, University of Miami
Director, Sylvester Comprehensive Cancer Center
Professor of Medicine, Biochemistry & Molecular Biology

9:20–9:40 AM

The NIH BD2K Initiative: How it hopes to impact biomedical research
Ajay Pillai, PhD, Program Director - National Human Genome Research Institute (NHGRI), NIH
Program Director, Molecular Libraries Program
Co–Chief, Library of Integrated Network-based Cellular Signatures (LINCS)
Division of Genome Sciences

9:40–10:10 AM

Coffee Break

10:10–10:40 AM

From big data to knowledge - experiences in rare disease genomics
Stephan Zuchner, MD, PhD, University of Miami
Chair and Professor, Dr. John T. Macdonald Foundation Department of Human Genetics
Professor, Neurology, co–Director, John P. Hussman Institute for Human Genomics

10:40–11:00 AM

LINCS Computational Pipelines for Drug Discovery
Avi Ma’ayan, PhD, Icahn School of Medicine at Mount Sinai | BD2K-LINCS DCIC
Professor, Pharmacology and Systems Therapeutics

11:00–11:20 AM

Uncovering perturbation mode of action with LINCS data and tools
Mario Medvedovic, PhD, University of Cincinnati | BD2K-LINCS DCIC
Professor, Division of Biostatistics and Bioinformatics

11:20–11:35 AM

L1K++: Fast and accurate pipeline for processing L1000 gene expression data
Ka Yee Yeung, PhD, University of Washington | BD2K-LINCS DCIC external Data Science program
Associate Professor, Institute of Technology

11:35–11:50 AM

Target Predictions using LINCS Perturbation Data
Ziv Bar-Joseph, PhD, Carnegie Mellon University | BD2K-LINCS DCIC external Data Science program
Professor, Machine Learning Department, Computational Biology Department, School of Computer Science

11:50 AM–12:05 PM

Disease and perturbagen posttranslational signatures across multiple signaling pathways in lung cancer cell lines: analysis of TMT data published in PhosphoSitePlus (PSP)
Peter Hornbeck, PhD, Cell Signaling Technologies | BD2K-LINCS DCIC external Data Science program
Director and Principal Investigator, Cell Signaling Technologies

12:05–1:05 PM

Lunch (Lunch is on your own. Refer to the Dining Options section.)

1:05–1:30 PM

Building a Culture of Model-driven Drug Discovery at Merck
Chris Waller, PhD, Merck & Co.
Executive Director, MRLIT Modeling Platforms and CORE at Merck

1:30–1:55 PM

Modeling Spinal Cord Injury using knowledge-based and data driven approaches
Vance Lemmon, PhD, University of Miami
Walter G. Ross Distinguished Chair in Developmental Neuroscience
Professor of Neurological Surgery
Program Director, Computational Biology, UM Center for Computational Science

1:55–2:20 PM

A Next Generation Connectivity Map: L1000 platform and the first 1,000,000 profiles
Aravind Subramanian, PhD, Broad Institute of MIT & Harvard
Principal Investigator, Broad Institute LINCS Center for Transcriptomics

2:20–2:45 PM

Systematic Discovery of Drug Targets and Disease Indications using Genetics and Connectivity Map
Pankaj Agarwal, PhD, GlaxoSmithKline
Director, Computational Biology, Target Sciences at GlaxoSmithKline

2:45–2:50 PM

Concluding Remarks
Stephan Schürer, PhD, University of Miami | BD2K-LINCS DCIC

2:50–3:00 PM

Poster set-up

3:00–5:00 PM

Poster reception | LPLC Breezeway 1095 NW 14th Terrace, Miami, FL 33136

Wednesday, January 20, 2016
UM Coral Gables Campus
Toppel Career Center, Loft (upstairs) Room #250
5225 Ponce De Leon Boulevard, Coral Gables, FL 33146

7:30 AM

Shuttle departs lobby of Mutiny Hotel to Toppel Career Center (address above)

8:00–8:30 AM

Group breakout registration at Breakfast

8:30–8:40 AM

Welcome and introduction to day 2 sessions Stephan Schürer, PhD, University of Miami | BD2K-LINCS DCIC

8:45 AM–12:00 PM

Morning Session: BD2K LINCS short poster and tools presentations

8:45–8:55 AM

LINCS Metadata standards as a foundation to the LINCS Integrated Knowledge Environment Vasileios Stathias, University of Miami

8:55–9:05 AM

Democratizing gene set enrichment analysis via novel tools: GEO2Enrichr and GEN3VA Gregory Gundersen, Icahn School of Medicine at Mount Sinai, New York

9:05–9:15 AM

Small molecule standardization and annotation to enrich LINCS data and support integration Bryce Allen and Dušica Vidović, University of Miami

9:15–9:25 AM

LINCS Data Portal: the global hub to all LINCS data Amar Koleti, University of Miami

9:25–9:35 AM

Clustergrammer: Interactive D3.js Heatmap Viewer for Unsupervised Clustering of Big Data Nicholas Fernandez, Icahn School of Medicine at Mount Sinai, New York

9:35–9:45 AM

Cell-Type-Specific KEGG Pathway Analysis and Visualization Shana White, University of Cincinnati

9:45–9:55 AM

Collection, integration, access and analysis of LINCS Proteomics Data via piLINCS Szymon Chojnacki, University of Cincinnati

9:55–10:05 AM

Integrated access and computation on over a hundred public datasets via the Harmonizome Andrew Rouillard, Icahn School of Medicine at Mount Sinai, New York

10:05–10:15 AM

L1000CDS2 signature search engine: prioritizing small molecules for mimicking or reversing expression signatures in disease. Qiaonan Duan, Icahn School of Medicine at Mount Sinai, New York

10:15–10:30 AM

Coffee Break

10:30–10:40 AM

LINCS DCIC BD2K interactions towards an integrated ecosystem of data and tools Caty Chung, University of Miami

10:40–10:50 AM

Exploratory Analysis of LINCS Genomics and Proteomics Data Naim Al Mahi, University of Cincinnati

10:50–11:00 AM

The LINCS Breast Cancer Browser: Interactive Network Visualization of LINCS Data Zichen Wang, Icahn School of Medicine at Mount Sinai, New York

11:00–11:10 AM

Internal benchmarking of connectivity between LINCS L1000 Level 5 signatures Nicholas Clark, University of Cincinnati

11:10–11:20 AM

Integrated analysis of gene expression signatures using the iLINCS platform Marcin Pilarczyk, University of Cincinnati

11:20–11:30 AM

ilincsR: The backend analysis engine of the iLINCS portal Mehdi Fazel-Najafabadi, University of Cincinnati

11:30–11:40 AM

Similarities in the human KINOME to interpret LINCS data Qiong Cheng, University of Miami

11:40–11:50 AM

Targeted polypharmacology: Combining kinase and BET activity in a single compound Bryce Allen, University of Miami

11:50 AM–12:00 PM

The DCIC Tools Docent, LDR, and the Harmonizome Phone App Michael McDermott, Icahn School of Medicine at Mount Sinai, New York

12:00–1:00 PM

Lunch (Lunch is on your own. Refer to the Dining Options section.)

12:00–1:30 PM

DCIC PI, SAB, and NIH Closed session with catered lunch. Toppel Center, Conference and Seminar Room #114 (max. 15)

1:00–3:00 PM

DCIC Tools Demo and Hands-on SESSION 1
Toppel Center, Loft (upstairs) Room #250
DCIC tools overview and live demo

This session is intended for individuals who want a demonstrated overview of LINCS tools applied to specific scientific questions. LINCS tools and how they work together will be demonstrated, with the opportunity to ask questions. You can move back and forth between sessions 1 and 2.

1:20–1:30 PM

Introduction to LINCS and DCIC tools: lincsproject.org and bd2k-lincs.org
Avi Ma’ayan, Icahn School of Medicine at Mount Sinai, New York

1:30–1:50 PM

Access of all LINCS data via the LINCS Data Portal
Chris Mader, University of Miami

1:50–2:10 PM

iLINCS platform: workflows for integrated analysis of LINCS and Omics data sets
Mario Medvedovic, University of Cincinnati

2:10–2:40 PM

GEO2Enrichr and GEN3VA: User friendly gene set enrichment analysis
L1000CDS2 signature search to prioritize small molecules for drug discovery
Harmonizome to access and compute on over a hundred public datasets
Avi Ma’ayan, Icahn School of Medicine at Mount Sinai, New York

2:40–3:00 PM

Q&A

1:00–3:00 PM

DCIC Tools Demo and Hands-on SESSION 2
Toppel Center, Tech Lab Room #133 (max. 20)
Hands-on sessions of using LINCS tools with the developers. You can bring your own data and laptop.

This session provides the opportunity to use specific LINCS tools hands-on and to interact directly with the experts at the DCIC, using one of the several computer systems set up in the Career Tech Lab or your own laptop. The lab seats 20 people and each computer will run one or two specific software tools. For some of the tools you can bring your own data, such as up- and down-regulated genes or GEO datasets, to perform various analyses. It is recommended to review the available tools at: http://bd2k-lincs.org/#/resources

Tools will be available for hands-on use and custom demos at 6 work stations: Harmonizome, Enrichr, GEO2Enrichr, LINCS Data Portal, piLINCS, Slicr, L1000CDS2, PAEA, LINCS InFormation FramEwork (LIFE), iLINCS, LINCS Canvas Browser, Drug/Cell-line Browser, Network2Canvas, Docent apps (various flavors)

3:15–3:30 PM

Closing of the SBDSS public meeting Stephan Schürer, University of Miami, Center for Computational Science

3:30–5:00 PM

Start of the TechTap / Hackathon; parallel sessions; DCIC only

• Center for Computational Science, Viz Lab, Ungar Building, Room #330D (max. 14)
• Toppel Center, Career Technology Lab, Room #133 (max. 20)
• Toppel Center, Executive Conference Room, Room #114 (max. 15)
• Center for Computational Science, Main Conference Room #600A (max. 10)

Thursday, January 21, 2016
UM Coral Gables Campus, Center for Computational Science (Suite 600)
Gables One Tower, Training Room #639
1320 S Dixie Highway, Coral Gables, FL 33146

8:30 AM

Shuttle departs lobby of Mutiny Hotel to Gables One Tower (address above)

9:00 AM–2:00 PM

DCIC Developers TechTap / Hackathon; parallel sessions; DCIC only

• Center for Computational Science, Viz Lab, Ungar Building, Room #330D (max. 14)
• Center for Computational Science, Training Room #639 (max. 22)
• Center for Computational Science, Main Conference Room #600A (max. 10)
• Center for Computational Science, Small Conference Room #600E (max. 5)

10:00 AM–12:00 PM

DCIC PI and NIH wrap up and priorities

Hackathon Topics

Session 1 (Wednesday)
LINCS apps and resources harmonization between sites
Session Goals: To harmonize CSS and other visual elements of the web pages created by the different DCIC centers, so that they have a common appearance.
Session Outcomes: Common CSS and implementations of a new look and feel for all LINCS DCIC apps and pages.
Participants: [Amar, Nooshin, Greg, Mike, Marcin, Joe, Mehdi, Szymon]

Session 2 (Wednesday & Thursday)
LINCS API harmonization and simple workflows
Session Goals: To identify, document and implement existing or new workflows that use LINCS APIs for analytics.
Session Outcomes: Documentation of specific workflows facilitated via API, possibly on a website; implementation of a workflow via the Portal.
Participants: [Amar, Raymond, Greg, Mike, Marcin, Mehdi, Michal, Szymon]

Session 3 (Wednesday & Thursday)
Novel LINCS data visualizations
Session Goals: Explore new visualization options, in particular for more complex LINCS data.
Session Outcomes: Exchange of ideas. Documented operational concept description. Potentially a specific prototype implementation.
Participants: [Joe, Ray, Jianbin, Marcin, Szymon]

Session 4 (Wednesday & Thursday)
Data packaging, processing, representation and data analytics
Session Goals: Workshop to discuss how each DCIC group is organizing data, with the goal of making access to data for analytical processing as simple as possible and moving towards integrated real-time computational analytics. Specific topics to be discussed will include:
1. processes for sharing data between sites and infrastructure
2. representation / integration to computation and efficient querying
3. analytical packages
Session Outcomes: Documented operational concept description. Possibly SOPs for data exchange and packaging.
Participants: [Mehdi, Wen, Naim, Lixia, Vas, Dusica, Bryce, Qiong, Amar]

Session 5 (Wednesday & Thursday)
LINCS Joint projects
Session Goals: Organize external data for MCF10-A to be integrated with dense cube data. Also, identify and organize data to be used by the “Breast Cancer Browser”.
Session Outcomes: Data exchange processes (probably API definitions).
Participants: [Ray, Vas, Dusica, Wen, and some people from Avi’s group]

Dining Options

Medical Campus
http://admissions.med.miami.edu/student-life/around-campus/campus-eats
• Chicken Kitchen, Subway, Manger Creole (Across 14th Terrace)
• Jimmy John’s, Salsa Fiesta, Dunkin Donuts (Across 14th Street behind Wells Fargo Bank)

Coral Gables Campus
http://www.dineoncampus.com/miami/
• Jamba Juice, Fresh Fusion (Shalala Student Center)
• Lime Fresh Mexican Grill (Whitten University Center across from Bookstore)
• Subway, Tossed, Pollo Tropical, Built Custom Burgers, Innovation Kitchen, Sushi Maki, Corner Deli (Hurricane Food Court)

Gables One Tower
• G.O.T. Spot (Gables One Tower Lobby)
• Bagel Emporium & Grille, TGI Fridays, Denny’s, McDonald’s, Starbucks, Moon Thai (University Centre)
• Shake Shack, Five Guys, Publix, Subway, Gables Pizza & Salad (South on US1)

Resources

Tools and Apps
The DCIC develops web-based tools for integrative data access and visualization across the distributed LINCS and BD2K sites and other relevant data sources. Our next generation integrated web-based platform for the LINCS project serves as the foundation for all LINCS activities and federates all LINCS data, signatures, analysis algorithms, pipelines, APIs and web tools.

Harmonizome
Biological Knowledge Engine
Built on top of information about genes and proteins from 114 datasets, the Harmonizome is a knowledge engine for a diverse set of integrated resources.

Enrichr
Search Engine for Gene Lists and Signatures
An easy-to-use, intuitive web-based enrichment analysis tool providing various types of visualization summaries of collective functions of gene lists.

GEO2Enrichr
Differential Expression Analysis Tool
A browser extension and web application to extract gene sets from GEO and analyze these lists for common biological functions.

LINCS Data Portal (alpha version)
Access to LINCS Data and Signatures
Features for searching and exploring LINCS dataset packages and reagents.

piLINCS
Interface to panoramaweb.org
A seamless user interface and intermediate API for accessing LINCS proteomics datasets (P100, GCP, etc.) on Panorama.

Slicr
LINCS L1000 Slicr [GSE70138 data only]
Slicr is a metadata search engine that searches for LINCS L1000 gene expression profiles and signatures matching a user's input parameters.

L1000CDS2
L1000 Characteristic Direction Signature Search Engine
L1000CDS2 queries gene expression signatures against the LINCS L1000 to identify and prioritize small molecules that can reverse or mimic the observed input expression pattern.

PAEA
Principal Angle Enrichment Analysis
PAEA is a new R/Shiny gene set enrichment web application with over 70 gene set libraries available for enrichment analysis.

LINCS Information Framework (LIFE)
LINCS Information System
Integrates all LINCS content leveraging a semantic knowledge model and common LINCS metadata standards.

iLINCS
LINCS Web Portal
A computational biology project aimed at developing statistical methods and computational tools for integrative analysis of the data produced by the Library of Integrated Network-based Cellular Signatures (LINCS) program.

LINCS Canvas Browser
LINCS L1000 Clustering, Visualization and Enrichment Analysis Tool
The LINCS Canvas Browser is an interactive web app to query, browse and interrogate LINCS L1000 gene expression signatures.

Drug/Cell-line Browser
Data Visualization Tool
An online interactive HTML5 data visualization tool for interacting with three of the recently published datasets of cancer cell lines/drug-viability studies.

Network2Canvas
Network Visualization on a Canvas with Enrichment Analysis
A web application that provides an alternative way to view networks and visualizes them by placing nodes on a square toroidal canvas.

Global Visualization of LINCS Data

Docent - Grid view
Searchable overview of the LINCS Consortium's datasets. Docent's grid view provides two interfaces for searching LINCS data by assay, perturbagen, cell, and readout.

Docent - List view
Overview of the LINCS Consortium's datasets by assay and cell type. Docent's list view provides an interactive matrix of the most studied cell lines by assay.

Docent - Card view
Overview of the LINCS Consortium's datasets by assay and cell type. Docent's card view provides an interactive matrix of the most studied cell lines by assay.

Poster Session

Tuesday, January 19, 2016 | 3:00-5:00 PM
UM Miller School of Medicine Campus
Lois Pope Life Center, 7th Floor Auditorium
1095 NW 14th Terrace, Miami, FL 33136

The poster session reception will follow from 3:00 to 5:00 PM on the Lois Pope Life Center breezeway, located between the Schoninger Research Quadrangle and the Lois Pope Life Center.

• Printed posters should be a maximum of 4 feet high and 6 feet wide (48” x 72”). Push pins will be provided.
• Posters will be collected at registration and mounted by the organizers.
• Presenters must be by their boards from 3:00 to 5:00 PM. Please remove your poster immediately at the close of the poster session.

Posters

The Community Training and Outreach (CTO) Component of the BD2K-LINCS Data Coordination and Integration Center (DCIC)
Sherry Jenkins, MS1,2, Stephan Schürer, PhD2,3, Mario Medvedovic, PhD2,4, Avi Ma’ayan, PhD1,2

1 Department of Pharmacology and Systems Therapeutics, Icahn School of Medicine at Mount Sinai, New York, NY
2 BD2K-LINCS Data Coordination and Integration Center
3 Center for Computational Science, University of Miami, Miami, FL
4 Department of Environmental Health, University of Cincinnati College of Medicine, Cincinnati, OH

The BD2K-LINCS Data Coordination and Integration Center is part of the Big Data to Knowledge (BD2K) NIH initiative. It is also the data coordination center for the NIH Common Fund’s Library of Integrated Network-based Cellular Signatures (LINCS) program, which aims to characterize how a variety of human cells, tissues and the entire organism respond to perturbations by drugs and other molecular factors. The BD2K and LINCS programs are at the cutting edge of biomedical research, which is transitioning from a reductionist approach to a more holistic Big Data view made possible only recently by breakthroughs in genome-wide omics and imaging technologies. However, since these new experimental methodologies produce much more data than we can currently digest, it is important to provide educational and outreach opportunities to realize the transformative potential of LINCS and BD2K. The Community Training and Outreach (CTO) component of the BD2K-LINCS DCIC engages, informs and educates key biomedical research communities about LINCS data and resources. We use innovative crowdsourcing and outreach mechanisms to harness the expertise of the wider Data Science and software development communities for the benefit of the LINCS community. In our first year, the CTO’s activities included:

1) The development of LINCS-related courses (both MOOCs and in-classroom) to train biomedical researchers in LINCS-related experimental methods, datasets, and computational tools.
2) The first cohort of students in our BD2K-LINCS DCIC Summer Research Training Program in Biomedical Big Data Science, a research-intensive ten-week training program for undergraduate and graduate students.
3) Establishment of the series ‘LINCS Data Science Research Webinars’, which provides a forum for data scientists within and outside of the LINCS program to present their work on problems related to LINCS data analysis and integration.
4) The development of the BD2K-LINCS DCIC Crowdsourcing Portal to bring awareness to LINCS data, extract signatures from external public repositories, and explain the efforts of LINCS to the general public. Our crowdsourcing portal engages the research community in various micro- and mega-tasks.
5) Engagement in collaborative external data science research projects which focus on mining and integrating data generated by the LINCS program for new scientific discovery.
6) Hosting of symposia, seminars and workshops to bring together the DCIC and researchers who utilize LINCS resources.
7) The development and maintenance of the lincs-dcic.org and lincsproject.org websites as well as an active presence on various social media platforms including YouTube, Google+, and Twitter.

In achieving our first year milestones, we established comprehensive education, outreach, and training programs aimed at scientific communities that can benefit from LINCS data and tools. We expect that the LINCS community will continue to grow into a resourceful network that brings together researchers across disciplines and organizations.

The LINCS Data Portal as a Component of the Integrated Knowledge Environment (IKE)
Amar Koleti1,2, Vasileios Stathias1,2, Dušica Vidović1,2,3, Christopher Mader1,2, Stephan Schürer1,2,3

1 Center for Computational Science, University of Miami, Miami, FL
2 BD2K LINCS Data Coordination and Integration Center
3 Department of Molecular and Cellular Pharmacology, University of Miami, Miami, FL

The BD2K LINCS DCIC portal (LINCSPort) provides a unified interface for access to all LINCS data, signatures, analysis tools and other resources. The current version of LINCSPort provides features for searching and exploring LINCS dataset packages and reagents that have been described using the LINCS metadata standards. LINCSPort also enables download of complete LINCS dataset packages and associated metadata (for authenticated users). Other features include direct linking to LINCS analysis tools (e.g., iLINCS) and access to LINCS APIs. The portal is web-based and responsively designed, and therefore can be accessed and used across a wide range of devices. LINCSPort is integrated with the MetaData Registry and interfaces with other components of the Integrated Knowledge Environment (IKE) developed in our Center.

Internal benchmarking of connectivity between LINCS L1000 Level 5 signatures
Nicholas Clark, Wen Niu, Mario Medvedovic
University of Cincinnati

LINCS L1000 Level 5 differential gene expression signatures are formed by aggregating Level 4 signatures (z-scores) from experimental replicates in the same batch. We used two sets of these signatures: one created simply by averaging all intra-batch replicates (AvgZ) and one created by a more sophisticated weighted average of the intra-batch replicates (ModZ). We used a number of concordance methods to assess the connectivity between pairs of Level 5 signatures from batch to batch and between cell lines. We focused on small molecule perturbagen signatures for 5 cell lines (A375, A549, MCF7, PC3, and VCAP) and looked at the connectivity among pairs of Level 5 signatures within and between those cell lines. For each concordance method, we constructed ROC curves to measure how predictive the method was, i.e., how well it distinguished pairs of signatures sharing the same experimental conditions (same small molecule, same dosage, same duration) from pairs of signatures with different small molecule perturbagens.
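The benchmarking idea in this abstract can be illustrated with a minimal sketch. It is not the authors' pipeline: the z-score values are invented toy data, Pearson correlation stands in for one of the unnamed concordance methods, and only the plain-average (AvgZ) aggregation is shown because the abstract does not specify the ModZ weights. The area under the ROC curve is computed as the fraction of (same-condition, different-condition) pair comparisons that the concordance score ranks correctly.

```python
from math import sqrt


def avg_z(replicates):
    """AvgZ-style signature: plain per-gene mean of replicate z-score vectors."""
    n = len(replicates)
    return [sum(values) / n for values in zip(*replicates)]


def pearson(x, y):
    """Pearson correlation, used here as one possible concordance method."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = sqrt(sum((a - mx) ** 2 for a in x))
    sy = sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)


def roc_auc(scores, labels):
    """Area under the ROC curve via pairwise comparison.

    labels: 1 for a pair sharing the same condition, 0 otherwise.
    A tie between a positive and a negative score counts as one half.
    """
    pos = [s for s, lab in zip(scores, labels) if lab == 1]
    neg = [s for s, lab in zip(scores, labels) if lab == 0]
    wins = sum((p > q) + 0.5 * (p == q) for p in pos for q in neg)
    return wins / (len(pos) * len(neg))


# Toy Level 4 replicates (hypothetical z-scores for 4 genes).
drug_a_batch1 = avg_z([[2.0, -1.0, 0.5, 0.0], [1.8, -1.2, 0.7, 0.2]])
drug_a_batch2 = avg_z([[2.1, -0.9, 0.4, 0.1], [1.9, -1.1, 0.6, -0.1]])
drug_b_batch1 = avg_z([[-1.0, 2.0, 0.0, 1.0], [-1.2, 1.8, 0.2, 0.8]])

pairs = [
    (drug_a_batch1, drug_a_batch2, 1),  # same small molecule, different batch
    (drug_a_batch1, drug_b_batch1, 0),  # different small molecules
    (drug_a_batch2, drug_b_batch1, 0),
]
scores = [pearson(x, y) for x, y, _ in pairs]
labels = [label for _, _, label in pairs]
auc = roc_auc(scores, labels)
```

An AUC near 1 means the concordance method ranks same-condition pairs above different-condition pairs; an AUC near 0.5 means it does no better than chance.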

Interactive Visualization and Analysis of LINCS Center-Generated Phenotype and Genotype Signatures
Anders Dohlman1,2, Avi Ma’ayan1,2

1 Department of Pharmacology and Systems Therapeutics, Icahn School of Medicine at Mount Sinai, One Gustave L. Levy Place Box 1215, New York, NY 10029
2 BD2K-LINCS Data Coordination and Integration Center

The NIH LINCS program’s ultimate goal is to advance our understanding of how human cells respond when exposed to a variety of chemical, natural, environmental and genomic perturbations. Here we present two examples of how the BD2K-LINCS DCIC is analyzing, integrating, and visualizing different data types generated by the LINCS DSGCs. The first example is from the Micro-Environment Perturbagen (MEP) LINCS data generation center. This center aims to understand the ways in which microenvironment signals interact with a variety of endogenous perturbagens to produce cellular phenotypes. So far, microenvironment microarray imaging data have been produced in three cancer cell lines. In these experiments the cancer cell lines were placed in wells coated with 46 ECM proteins and then treated with 56 endogenous ligands in a pairwise fashion, for a total of 2,576 unique experimental conditions. Changes in morphology, metabolism, lineage, nuclear activity, and cell cycle processes were assayed using quantified high-content imaging. We visualized this data using web-based interactive canvas mosaics that cluster ECM-ligand combinations. Such visualization illuminates correlations across cell lines for a variety of imaging endpoints. The second example provides in-depth analysis of the RNA-seq data produced by the Drug Toxicity Signature (DToxS) LINCS data generation center. DToxS LINCS seeks to produce cellular signatures of drug combinations that are known to mitigate cardiotoxicity, hepatotoxicity, and neurotoxicity for a variety of FDA-approved drugs. So far, Promocell cardiomyocytes have been perturbed with 17 offending drugs and 20 mitigating drugs, alone and in combination. RNA counts were obtained using Illumina high-throughput sequencing. To analyze this data we first identified differentially expressed genes using the Characteristic Direction method. Then we analyzed these signatures with the Enrichr API to convert gene expression signatures to enrichment term vectors. With this analysis we identified unique regulatory patterns for the offending drugs alone, and when applied in combination with mitigating drugs. To facilitate interactive analysis and ad-hoc discovery, both datasets were visualized using the Clustergrammer tool, and the DToxS data analysis is also available on GEN3VA.
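The pairwise design described for the microenvironment microarrays can be checked with a short sketch. The protein and ligand names below are placeholders, since the abstract gives only the counts (46 ECM proteins and 56 ligands).

```python
from itertools import product

# Placeholder names; the abstract does not list the actual proteins or ligands.
ecm_proteins = [f"ECM_{i}" for i in range(1, 47)]   # 46 ECM proteins
ligands = [f"ligand_{j}" for j in range(1, 57)]     # 56 endogenous ligands

# Every ECM protein is paired with every ligand.
conditions = list(product(ecm_proteins, ligands))
n_conditions = len(conditions)  # 46 * 56
```

The count agrees with the 2,576 unique experimental conditions quoted in the abstract.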

A Genomic Signature Approach to Identify Small Molecules Effecting ΔF508-CFTR Rescue
Matthew D. Strub, Shyam Ramachandran, Samantha R. Osterhaus, Arthur Liberzon, Todd R. Golub, Robert J. Bridges, Paul B. McCray, Jr.
University of Iowa

Background: Cystic fibrosis (CF) is a lethal autosomal recessive disease caused by mutations in the CF transmembrane conductance regulator (CFTR) gene. The most common CFTR mutation, termed ΔF508, causes protein misfolding, resulting in proteasomal degradation. However, if ΔF508-CFTR is allowed to traffic to the cell membrane, anion channel function may be partially restored. The McCray Lab previously reported that transfection with a miR-138 mimic or knockdown of SIN3A in primary cultures of CF airway epithelia increases ΔF508-CFTR mRNA and protein levels, and partially restores cAMP-stimulated Cl- conductance (Ramachandran et al., 2012 PNAS).

Objective: We hypothesized that a genomic signature approach can be used to identify new bioactive small molecules effecting ΔF508-CFTR rescue.

The Connectivity Map (CMAP): CMAP is a catalog of gene expression profiles from cultured human cells treated with a variety of bioactive chemical compounds and has pattern-matching software to mine the data. A CMAP query, using gene expression signatures generated in Calu-3 epithelia treated with the miR-138 mimic or SIN3A DsiRNA, identified 27 small molecules that mimicked the miR-138 and SIN3A DsiRNA treatments. The molecules were screened in vitro for efficacy in improving ΔF508-CFTR trafficking, maturation, and Cl- current. The McCray Lab reported the identification of 4 small molecules that partially restore ΔF508-CFTR function in primary CF airway epithelia (Ramachandran et al., 2014 AJRCMB). Of these, pyridostigmine showed cooperativity with corrector compound 18 (C18) in improving ΔF508-CFTR function, highlighting the utility of a genomic signature approach in drug discovery.

LINCS: Currently, the NIH is greatly expanding the CMAP dataset into the Library of Integrated Network-based Cellular Signatures (LINCS). In collaboration with the Broad Institute, previously generated gene sets were used to iteratively query LINCS. 125 candidate small molecules were selected for further testing. Functional screens performed in CFBE (ΔF508/ΔF508) cells identified 7/125 compounds that partially rescued ΔF508 function, as assessed by cAMP-activated Cl- conductance. Additional experiments performed to assess their activity in primary human CF epithelial cells confirmed the ability of these seven compounds to partially rescue ΔF508 function. Interestingly, some of these compounds show significant cooperativity when administered with C18.

Conclusion: There are few CF therapies based on new molecular insights. Querying LINCS with relevant genomic signatures offers a method to identify new candidates for rescuing ΔF508-CFTR function. Further analysis of these molecules and their derivatives is ongoing. We are also generating additional genomic signatures representing ΔF508 rescue and will use these in additional LINCS pipeline queries. These results represent an important step forward from our proof-of-concept CMAP studies and highlight the utility of LINCS in drug discovery for CF.
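The idea of querying a signature library for compounds that mimic a treatment can be sketched as follows. This is a deliberately simplified stand-in and not the CMAP or LINCS query statistic: the score is just the mean z-score of the query's up-regulated genes minus the mean z-score of its down-regulated genes, and every gene name, compound name, and value is hypothetical.

```python
def connectivity_score(profile, up_genes, down_genes):
    """Mean z-score over query up-genes minus mean z-score over query down-genes.

    Positive: the compound mimics the query. Negative: it reverses it.
    Genes missing from the profile are skipped.
    """
    up = [profile[g] for g in up_genes if g in profile]
    down = [profile[g] for g in down_genes if g in profile]
    if not up or not down:
        return 0.0
    return sum(up) / len(up) - sum(down) / len(down)


def rank_mimics(library, up_genes, down_genes):
    """Return (compound, score) pairs sorted from strongest mimic to strongest reverser."""
    scored = [
        (name, connectivity_score(profile, up_genes, down_genes))
        for name, profile in library.items()
    ]
    return sorted(scored, key=lambda item: item[1], reverse=True)


# Hypothetical query signature and compound profiles (gene -> z-score).
query_up = ["GENE_A", "GENE_B"]
query_down = ["GENE_C", "GENE_D"]
library = {
    "compound_1": {"GENE_A": 2.0, "GENE_B": 1.0, "GENE_C": -1.5, "GENE_D": -0.5},
    "compound_2": {"GENE_A": -1.0, "GENE_B": -2.0, "GENE_C": 1.0, "GENE_D": 2.0},
    "compound_3": {"GENE_A": 0.2, "GENE_B": -0.2, "GENE_C": 0.1, "GENE_D": -0.1},
}
ranking = rank_mimics(library, query_up, query_down)
```

In this toy library the top of the ranking holds the compound whose profile moves in the same direction as the query, which corresponds to the candidates that would go on to functional screening.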

Clustergrammer: A Web-based Visualization Tool for Making and Sharing Interactive Clustered Heatmaps
Nicolas F. Fernandez1,2, Gregory W. Gundersen1,2, Qiaonan Duan1,2, Matthew R. Jones1,2, Andrew D. Rouillard1,2, Maxim Kuleshov1,2, Michael G. McDermott1,2, Avi Ma’ayan1,2

1 Department of Pharmacology and Systems Therapeutics, Icahn School of Medicine at Mount Sinai, One Gustave L. Levy Place, Box 1215, New York, NY 10029
2 BD2K-LINCS Data Coordination and Integration Center (DCIC)

A hierarchically clustered heatmap, or clustergram, is a popular visualization method used in biomedical research to display results from genome-wide expression profiling, proteomics, or clinical data from cohorts of patients. While there are many software tools already available to make clustergrams, producing customized interactive clustergrams can often be difficult for those with no programming background. Furthermore, static clustergram images, which can be exported by many currently available software tools, do not provide the ability to dynamically interact with the data, for example through controls for zooming, panning, sorting, filtering or finding a specific row or column. With the introduction of HTML5, interactive web-based data visualization has rapidly advanced, and one of the leading technologies in the field is the JavaScript library Data-Driven Documents (D3). Using D3 we developed a tool called Clustergrammer, which allows users to create interactive and shareable web-based clustergrams by uploading a tab-separated matrix of their own data. Once a user uploads their data matrix, Clustergrammer calculates a series of views of the user’s data by hierarchical clustering, ranking of rows and columns, and filtering the data in the matrix to compute more-compact views. The user is immediately provided with an interactive web-based clustergram of their input matrix. The clustergram supports zooming/panning, animated reordering/ranking, searching, filtering, and identifying and exporting clusters for analysis with other tools. Clustergrammer provides a permanent URL; this URL allows users to easily share their interactive visualizations with collaborators without the need for any specialized software other than a web browser. Clustergrammer is capable of visualizing large data sets, on the order of 100,000 data points. So far Clustergrammer has been applied to visualize data from the Library of Integrated Network-based Cellular Signatures (LINCS), the Cancer Cell Line Encyclopedia (CCLE), and The Cancer Genome Atlas (TCGA) projects. Clustergrammer is an integral part of the Big Data to Knowledge (BD2K) LINCS Data Coordination and Integration Center (DCIC) tool set and has been applied in the Harmonizome, L1000CDS2, Enrichr, and GEO2Enrichr projects. For the Harmonizome project, Clustergrammer visualizes data from over 100 omics resources available for different classes of genes. For L1000CDS2, Clustergrammer visualizes small molecule signatures from the LINCS L1000 dataset. For Enrichr, Clustergrammer visualizes a matrix of a user’s input gene list and their enriched terms. For GEO2Enrichr, Clustergrammer visualizes gene expression signatures extracted by users from the Gene Expression Omnibus (GEO). These applications demonstrate that Clustergrammer provides a flexible data visualization component for creating dynamic web-based figures for a variety of data. Developers can also use the Clustergrammer API to dynamically produce visualizations for their own projects, or view/contribute to the open source project on GitHub: https://github.com/MaayanLab/clustergrammer. Clustergrammer provides a useful visualization tool for the analysis, exploration, and sharing of big biological data.
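As an illustration of the "ranking and filtering" step described above, the following minimal sketch parses a tab-separated matrix and keeps the rows with the largest summed absolute values. It is not Clustergrammer's code: the function names, the ranking criterion (sum of absolute values) and the toy matrix are assumptions chosen only to show how a more compact view of a matrix could be computed.

```python
import csv
import io


def load_matrix(tsv_text):
    """Parse a tab-separated matrix: first row = column labels, first column = row labels."""
    rows = list(csv.reader(io.StringIO(tsv_text), delimiter="\t"))
    col_labels = rows[0][1:]
    row_labels = [r[0] for r in rows[1:]]
    values = [[float(x) for x in r[1:]] for r in rows[1:]]
    return row_labels, col_labels, values


def top_rows_by_abs_sum(row_labels, values, n):
    """Rank rows by the sum of absolute values (an assumed criterion) and keep the n highest."""
    scored = sorted(
        zip(row_labels, values),
        key=lambda item: sum(abs(x) for x in item[1]),
        reverse=True,
    )
    kept = scored[:n]
    return [label for label, _ in kept], [vals for _, vals in kept]


# Hypothetical toy matrix: three genes measured in three samples.
example = (
    "\tA\tB\tC\n"
    "gene1\t0.1\t-0.2\t0.0\n"
    "gene2\t2.0\t-1.5\t0.5\n"
    "gene3\t0.9\t0.8\t-1.0\n"
)
row_labels, col_labels, values = load_matrix(example)
kept_labels, kept_values = top_rows_by_abs_sum(row_labels, values, 2)
print(kept_labels)
```

A filtered matrix of this kind would then be clustered and rendered; those steps are outside the scope of this sketch.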

LINCS compound standardization, annotation, and integration
Bryce Allen1,2,3, Dušica Vidović1,2,3, Tanya Kelley2,3, Caty Chung1,2, Saurabh Mehta1,2, Stephan Schürer1,2,3
1 Center for Computational Science and University of Miami, Miami, FL
2 BD2K-LINCS Data Coordination and Integration Center
3 Department of Molecular and Cellular Pharmacology, University of Miami, Miami, FL

Structural representations of compounds can appear in various forms depending on the source. This makes compound identification problematic. To facilitate LINCS data integration, which requires unique identification of small molecules, we developed a robust standardization pipeline that employs several in-house components and the PubChem standardization service. The resulting standardized structures are registered and linked to the corresponding experimental LINCS data. Compound annotations are also important to ensure standardized chemical entities provide researchers with high-quality information regarding structural attributes, physicochemical properties, bioactivity data and drug information from a variety of peer-reviewed sources. Integration of chemical and bioactivity information extracted from ChEMBL, PubChem, DrugBank and BindingDB through an API pipeline facilitated systematic information mapping, retrieval and integration into LINCS data. Our standardization and annotation pipelines for LINCS compounds also include quality control and manual review and can easily be expanded to accommodate future data sources and types of information. Standardized compounds along with other standardized LINCS metadata entities can be searched and downloaded via the Metadata Registry and the LINCS Data Portal.

Cluster-filtered network analyses of post-translational modification signaling pathways in lung cancer cell lines
Mark Grimes3, Nicolas Fernandez1, Neil Clark1, Avi Ma’ayan1, Klarisa Rikova2, Peter Hornbeck2
1 Department of Pharmacology and Systems Therapeutics, Icahn School of Medicine at Mount Sinai, Systems Biology Center, New York, NY
2 Cell Signaling Technology, Danvers, MA
3 Division of Biological Sciences, University of Montana, Missoula, MT

Signaling pathways involving post-translational protein modification can go awry to cause cancer, and the study of protein modifications provides both characteristic signatures and clues to the driving signaling pathways in different cancers. Recent advances in the generation of modification-specific antibodies have allowed acquisition of large-scale mass spectrometry data for different post-translational modifications, including phosphorylation, methylation, and acetylation. We used immunoprecipitation with modification-specific antibodies and Tandem Mass Tag (TMT) mass spectrometry to compare lung cancer cell lines to normal lung tissue, and cell lines treated with kinase inhibitor drugs. Analysis of the resulting data required special considerations because mass spectrometry produces data with a large number of missing values. We evaluated different methods for calculating statistical relationships within these data, which can be grouped into three approaches that we call imputing zeros, pairwise-complete, and penalized matrix decomposition. Statistical relationships were embedded into a reduced-dimension model of data structure using the machine learning algorithm t-distributed stochastic neighbor embedding (t-SNE). Pairwise-complete methods were the most effective statistical treatment, producing well-resolved t-SNE embeddings and clusters that made sense based on internal and external evaluations. A second penalized matrix decomposition and t-SNE step further resolved large clusters to produce a highly pruned co-cluster correlation network (CCCN) for strongly associated modifications. We combined the modification CCCN with protein-protein interaction (PPI) data to elucidate a cluster-filtered network that suggests a molecular signaling pathway between several receptor tyrosine kinases, transcription factors, and enzymes that modify chromatin proteins.
We elucidated a pathway linking EGFR to the transcription factor, SMARCA4, and another pathway linking MET to the methyltransferase, ASH1L. These results support the hypothesis that clusters identified by statistical relationships that contain proteins known to interact with one another are likely to represent functional signaling pathways. The results suggest that extending analyses to include different post-translational modifications such as acetylation and methylation provides a link from kinases to epigenetic chromatin modifications.
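To make the "pairwise-complete" idea concrete, here is a minimal sketch of a Pearson correlation that uses only the samples in which both modification sites were observed. The function name, the `min_pairs` cutoff and the toy values are illustrative assumptions, not the authors' implementation.

```python
import math


def pairwise_complete_pearson(x, y, min_pairs=3):
    """Pearson correlation over positions where both values are observed (not None).

    Returns None when fewer than min_pairs complete pairs exist or when either
    variable is constant over the complete pairs.
    """
    pairs = [(a, b) for a, b in zip(x, y) if a is not None and b is not None]
    if len(pairs) < min_pairs:
        return None
    n = len(pairs)
    mean_a = sum(a for a, _ in pairs) / n
    mean_b = sum(b for _, b in pairs) / n
    cov = sum((a - mean_a) * (b - mean_b) for a, b in pairs)
    var_a = sum((a - mean_a) ** 2 for a, _ in pairs)
    var_b = sum((b - mean_b) ** 2 for _, b in pairs)
    if var_a == 0 or var_b == 0:
        return None
    return cov / math.sqrt(var_a * var_b)


# Hypothetical abundances for two modification sites across six samples;
# None marks a missing mass spectrometry measurement.
site_a = [1.0, None, 2.0, 3.0, None, 4.0]
site_b = [2.1, 5.0, 3.9, 6.2, None, 8.0]
r = pairwise_complete_pearson(site_a, site_b)
print(r)
```

Unlike imputing zeros, this treatment never lets a missing value contribute to the statistic, at the cost of each pair of sites being compared on a different subset of samples.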


Cell-Type-Specific KEGG Pathway Analysis and Visualization
Shana White, Mario Medvedovic, PhD
University of Cincinnati

KEGG pathways offer a convenient platform for data analysis and visualization. These manually-curated biological pathways are represented as a network of nodes (genes) and directed edges where [it is assumed that] experimental evidence determines the nature and direction of the edge (relationship) between genes. The network structure for pathways that KEGG provides is a promising tool for bioinformatics research, and indeed there are existing methods for quantifying the level of pathway perturbation based on experimental gene expression data. However, the existing approaches consider changes of gene expression only of the genes in a particular pathway and not changes in expression of downstream targets. This restricts the definition of perturbation to mean change in gene expression rather than a much broader, and perhaps more meaningful, change in gene function. Furthermore, KEGG pathways are not currently cell-type-specific. The active pathway genes in any particular cell-type may be a subset of the entire pathway, and restricting attention to these genes may improve analytical power. The goal of this project is to use LINCS data to quantify relationships between genes in a given pathway in a cell-type-specific manner via analysis of overlapping de-regulated genes for pairs of experimental knockouts. This approach will yield quantitative measures and a novel method for visualizing relationships (edges) between genes using Cytoscape software.
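One simple way to quantify "overlapping de-regulated genes for pairs of experimental knockouts" is a set-overlap score. The sketch below uses the Jaccard index as an assumed measure; the knockout names and gene sets are hypothetical, and the abstract does not state which statistic the project uses.

```python
from itertools import combinations


def jaccard(a, b):
    """Jaccard index of two sets: |intersection| / |union| (0.0 if both are empty)."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


# Hypothetical de-regulated gene sets, one per knockout, in a single cell type.
deregulated = {
    "KO_GENE_A": {"G1", "G2", "G3", "G4"},
    "KO_GENE_B": {"G2", "G3", "G4", "G5"},
    "KO_GENE_C": {"G6", "G7"},
}

# One weight per pair of knocked-out pathway genes; these could serve as edge weights.
edge_weights = {
    (g1, g2): jaccard(deregulated[g1], deregulated[g2])
    for g1, g2 in combinations(sorted(deregulated), 2)
}
for pair, weight in edge_weights.items():
    print(pair, weight)
```

Repeating the calculation per cell type would give a cell-type-specific weight for each edge, which could then be exported for display in a network viewer.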

The Harmonizome: A Web-based System that Integrates Knowledge about Genes and Proteins from over 70 Omics Resources
Andrew D. Rouillard1, Gregory W. Gundersen1, Nicolas F. Fernandez1, Zichen Wang1, Michael G. McDermott1, Avi Ma’ayan1
1 Department of Pharmacology and Systems Therapeutics, Department of Genetics and Genomic Sciences, BD2K-LINCS Data Coordination and Integration Center (DCIC), Mount Sinai’s Knowledge Management Center for Illuminating the Druggable Genome (KMC-IDG), Icahn School of Medicine at Mount Sinai, New York, NY

Thanks to advances in genomics, epigenomics, transcriptomics, proteomics, and metabolomics, many research projects are profiling, at high throughput, the structure, level and activity of molecular species within mammalian cells, and linking those molecular observations to phenotypic properties of cells, tissues, and organisms. In addition, curation projects that organize knowledge from the biomedical literature into online databases are continually expanding. Such projects are generating a wealth of information that potentially can guide research toward novel biomedical discoveries and advances in healthcare. However, this information is fragmented into domain-specific databases, which generally include useful interfaces for browsing and querying the hosted data, but do not provide data in a manner well-suited for integrative analysis across databases. To address this, we developed the Harmonizome: a collection of knowledge about genes and proteins extracted, abstracted and organized from over 70 online resources. We processed the data from these resources to extract 72,000,000 associations between genes/proteins and their attributes, where attributes could be other genes/proteins, cell lines, tissues, perturbations, diseases, human or mouse phenotypes, or drugs. We stored these associations in a relational database along with rich metadata for the genes, attributes, and data sources. The freely available web resource at http://amp.pharm.mssm.edu/Harmonizome provides a user interface and an API for querying, browsing and downloading the data collection. To demonstrate the utility of the Harmonizome data collection, we computed and visualized gene-gene and attribute-attribute similarity networks, and through unsupervised clustering, identified many novel relationships. We also applied supervised machine learning methods to predict novel substrates of kinases, ligands of GPCRs, mouse phenotypes for uncharacterized gene knockouts, and whether unannotated transmembrane proteins are likely to be ion channels. The Harmonizome is a comprehensive resource of knowledge about genes and proteins that enables researchers to discover relationships between biological entities and form data-driven hypotheses.
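A gene-gene similarity network of the kind mentioned above can be derived from gene-attribute associations. The sketch below is only an assumed illustration: it treats each gene as a set of attributes, scores gene pairs by cosine similarity of those sets, and keeps pairs above a threshold. The association tuples, function name and threshold are hypothetical and do not reflect the Harmonizome schema or API.

```python
import math
from collections import defaultdict
from itertools import combinations


def build_similarity_edges(associations, threshold):
    """Return (gene1, gene2, similarity) for gene pairs whose attribute sets are similar.

    Similarity is the cosine of binary attribute vectors:
    |shared attributes| / sqrt(|attributes of gene1| * |attributes of gene2|).
    """
    attrs = defaultdict(set)
    for gene, attribute in associations:
        attrs[gene].add(attribute)
    edges = []
    for g1, g2 in combinations(sorted(attrs), 2):
        shared = len(attrs[g1] & attrs[g2])
        sim = shared / math.sqrt(len(attrs[g1]) * len(attrs[g2]))
        if sim >= threshold:
            edges.append((g1, g2, sim))
    return edges


# Hypothetical gene-attribute associations.
associations = [
    ("GENE1", "tissue:liver"),
    ("GENE1", "disease:X"),
    ("GENE1", "drug:Y"),
    ("GENE2", "tissue:liver"),
    ("GENE2", "disease:X"),
    ("GENE3", "tissue:brain"),
]
edges = build_similarity_edges(associations, threshold=0.5)
print(edges)
```

The same function applied to (attribute, gene) tuples would yield an attribute-attribute network instead.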


The Harmonizome Mobile Application
Michael G. McDermott1,2, Gregory W. Gundersen1,2, Maxim V. Kuleshov1,2, Avi Ma’ayan1,2
1 Department of Pharmacology and Systems Therapeutics, Icahn School of Medicine at Mount Sinai, One Gustave L. Levy Place, Box 1215, New York, NY 10029
2 BD2K-LINCS Data Coordination and Integration Center (DCIC)

Most online databases that list properties of human genes and proteins only include information from a handful of resources. Genomics, transcriptomics and proteomics resources can provide additional information about single genes or proteins, but these are not readily organized and abstracted for such a purpose. To create the Harmonizome mobile app, we assembled, extracted, and organized knowledge from over 60 online resources, including novel databases that we created such as ChEA, KEA, SILAC phosphoproteomics, ESCAPE, PPI Hubs, and collections of signatures extracted from GEO. The Harmonizome mobile app serves this accumulated knowledge in an easy-to-access interface where users can enter their gene/protein of interest to discover its properties and functions. The knowledge spans many bioinformatics omics resources: expression in cells, tissues and diseases; regulation by transcription factors, chromatin marks and microRNAs; functional membership in protein complexes, pathways and ontologies; genomic associations with disease and differential expression upon treatment of human cells with drugs; as well as structural and other genomic features. The Harmonizome app serves the collected knowledge in defined categories for ease of navigation, with links out for further exploration of associated functions of genes and proteins. The Harmonizome mobile application is available at the Google Play Store: http://goo.gl/JWlI8H for Android devices, and the App Store http://appstore.com/harmonizome for iOS devices.

Learning a hierarchical representation of the yeast transcriptomic machinery using an autoencoder model
Lujia Chen, Chunhui Cai, Vicky Chen, and Xinghua Lu
Dept. Biomedical Informatics, University of Pittsburgh

Background: A living cell has a complex, hierarchically organized signaling system that encodes and assimilates diverse environmental and intracellular signals, and it further transmits signals that control cellular responses, including a tightly controlled transcriptional program. An important and yet challenging task in systems biology is to reconstruct the cellular signaling system in a data-driven manner. In this study, we investigate the utility of deep hierarchical neural networks in learning and representing the hierarchical organization of yeast transcriptomic machinery.

Results: We have designed a sparse autoencoder model consisting of a layer of observed variables and 4 layers of hidden variables. We applied the model to over a thousand yeast microarrays to learn the encoding system of yeast transcriptomic machinery. After model selection, we evaluated whether the trained models captured biologically sensible information. We show that the latent variables in the first hidden layer correctly captured the signals of yeast transcription factors (TFs), obtaining a close to one-to-one mapping between latent variables and TFs. We further show that genes regulated by latent variables at higher hidden layers are often involved in a common biological process, and the hierarchical relationships between latent variables conform to existing knowledge. Finally, we show that information captured by the latent variables provides more abstract and concise representations of each microarray, enabling the identification of better separated clusters in comparison to gene-based representation.

Conclusions: Contemporary deep hierarchical latent variable models, such as the autoencoder, can be used to partially recover the organization of transcriptomic machinery.


Combining phenotypic and biochemical screening to identify drug targets, exploit polypharmacology, and personalize treatment
Hassan Al-Ali1, Do-Hun Lee1, Matt C. Danzi1, Diana Azzam5, Houssam Nassif6, Prson Gautam7, Krister Wennerberg7, Matt Soellner8, Jae K. Lee1,3, Vance P. Lemmon1,2,3, and John L. Bixby1,2,3,4,*
1 Miami Project to Cure Paralysis, University of Miami Miller School of Medicine, Miami FL 33136
2 Center for Computational Sciences, University of Miami Miller School of Medicine, Miami FL 33136
3 Neurological Surgery, University of Miami Miller School of Medicine, Miami FL 33136
4 Molecular & Cellular Pharmacology, University of Miami Miller School of Medicine, Miami FL 33136
5 Center for Therapeutic Innovation, University of Miami Miller School of Medicine, Miami FL 33136
6 Core Machine Learning Science Team, Amazon, Seattle WA 98109
7 Institute for Molecular Medicine Finland, University of Helsinki, Helsinki, Finland
8 Medicinal Chemistry, University of Michigan, Ann Arbor MI 48109

Target-based screening is an efficient technique for identifying potent modulators of individual drug targets. In contrast, phenotypic screening can identify biologically effective drugs with multiple targets; however, these targets remain unknown. Although drugs with multiple therapeutic targets are generally more effective than highly selective alternatives, we lack systematic methods for discovering such drugs. To address this gap, we combined the two screening approaches through the use of machine learning and information theory. We screened compounds in phenotypic assays and also in a panel of kinase enzyme assays. We used learning algorithms to relate the compounds’ kinase inhibition profiles to their influence on cellular phenotype. This allowed us to identify kinases that may serve as targets for promoting the desired phenotype, as well as others whose inhibition should be avoided (anti-targets). We observed that compounds that interact with multiple identified targets (beneficial polypharmacology) tend to be more effective than highly selective compounds, both in vitro and in vivo. This approach can be used to deconvolve drug targets from any disease model for which a phenotypic screening assay is available, and to discover effective multi-target drugs independent of molecular scaffold structure. Importantly, it can be used to personalize drug choice and combination decisions for individual patients (e.g. refractory cancer patients).
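The abstract does not say which information-theoretic quantity was used. As one plausible illustration, the sketch below computes the mutual information between a binary "compound inhibits kinase" indicator and a binary "compound promotes the phenotype" indicator; kinases with high mutual information would be candidates for closer inspection. All names and data are hypothetical.

```python
import math
from collections import Counter


def mutual_information(xs, ys):
    """Mutual information, in bits, between two equally long discrete sequences."""
    n = len(xs)
    count_x = Counter(xs)
    count_y = Counter(ys)
    count_xy = Counter(zip(xs, ys))
    mi = 0.0
    for (x, y), c in count_xy.items():
        # p(x, y) * log2( p(x, y) / (p(x) * p(y)) ), written with raw counts.
        mi += (c / n) * math.log2(c * n / (count_x[x] * count_y[y]))
    return mi


# Hypothetical screen of eight compounds: 1 = inhibits kinase / promotes phenotype.
inhibits_kinase_1 = [1, 1, 1, 1, 0, 0, 0, 0]
inhibits_kinase_2 = [1, 0, 1, 0, 1, 0, 1, 0]
promotes_phenotype = [1, 1, 1, 1, 0, 0, 0, 0]

mi_1 = mutual_information(inhibits_kinase_1, promotes_phenotype)
mi_2 = mutual_information(inhibits_kinase_2, promotes_phenotype)
print(mi_1, mi_2)
```

Mutual information alone does not give the direction of the effect; distinguishing targets from anti-targets would additionally require checking whether inhibition co-occurs with the desired or the undesired phenotype.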

Kinase similarities for interpreting LINCS data
Qiong Cheng1,2, Bryce K. Allen1,2,3, Stephan C. Schürer1,2,3
1 Center for Computational Science and University of Miami, Miami, FL
2 BD2K-LINCS Data Coordination and Integration Center
3 Department of Molecular and Cellular Pharmacology, University of Miami, Miami, FL

Kinases play critical roles in the regulation of dynamic biological systems, including cancer cell growth, proliferation and survival. With emerging high-throughput screening technologies, large small-molecule libraries have been profiled against a panel of kinases. However, kinase inhibitors may not selectively differentiate kinases, since a large number of protein kinase enzymes share a common cofactor and a similar three-dimensional structure of the catalytic site. We are interested in investigating the relationship of kinases from diverse spaces with a goal of identifying the potential linear or non-linear intersection of ligand-based chemical, pharmacogenomic, functional, and disease space by integrating large-scale data from multiple sources. Our analysis started from LINCS KinomeSCAN data, which is the “benchmark” kinase target competitive binding bioassay. We assessed kinase-to-kinase similarity through different measurements and quantified pairwise associations and predictability among those measurements. Further, we integrate this analysis with LINCS L1000 data to study the determinants underlying mechanism of action.


BDMERGE: a paired collection of omics datasets for benchmarking computational pipelines
Simon Koplev1,2, Avi Ma’ayan1,2
1 Department of Pharmacology and Systems Therapeutics, Icahn School of Medicine at Mount Sinai, New York City, NY
2 BD2K-LINCS Data Coordination and Integration Center (DCIC)

Large-scale NIH projects such as ENCODE, TCGA, CCLE and LINCS generate data using multiple experimental platforms applied under the same conditions to the same cell lines. Combining multiple assay types promises to reveal new biological insights by illuminating complementary biological features of the regulatory layers of human cells. Such integrative analysis could provide additional understanding about how enzymes, metabolites, transcription factors, and epigenetic marks jointly interact to establish the cellular phenotype. Comparing the consistency of paired data types such as transcriptomics, epigenomics and proteomics applied under the same conditions across regulatory layers could also assist in benchmarking computational methods by evaluating the consistency between the processed datasets. For example, the first dataset can be fixed as the silver standard whereas the second dataset can be processed by different computational pipelines and these pipelines evaluated for consistency with the first dataset. The computational pipelines that produce the most consistency between the first and the second paired datasets can be considered better at extracting knowledge from the data. To test this idea, we collected matching transcriptomics, epigenomics and proteomics datasets from multiple large-scale projects including ENCODE, TCGA, CCLE, and LINCS. These data were first converted into a standardized format, which could also be used as a model for multi-type data packages, and then processed by alternate computational pipelines. Making minimal assumptions about the relationships between data types, we developed correlation-based benchmarks to evaluate the quality of pipelines. We find that the benchmarks are able to distinguish different computational pipelines and could therefore inform us about which analysis methods are more appropriate for each dataset.
The different pipelines are dynamically and interactively visualized on the web, and served through the BDMERGE website at: http://amp.pharm.mssm.edu/bdmerge
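The benchmarking idea described above (fix one dataset as the silver standard, process the paired dataset with several pipelines, and prefer the pipeline whose output agrees best with the standard) can be sketched as follows. Pearson correlation is used here as an assumed consistency score, and the pipeline names and numbers are hypothetical; this is not the BDMERGE code.

```python
import math


def pearson(x, y):
    """Pearson correlation coefficient of two equally long numeric sequences."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / math.sqrt(var_x * var_y)


# Silver standard: values from the first data type for five matched features.
reference = [1.0, 2.0, 3.0, 4.0, 5.0]

# The paired data type after processing by two hypothetical pipelines.
pipeline_outputs = {
    "pipeline_a": [1.1, 1.9, 3.2, 3.9, 5.1],
    "pipeline_b": [3.0, 1.0, 4.0, 2.0, 5.0],
}

scores = {name: pearson(reference, out) for name, out in pipeline_outputs.items()}
best = max(scores, key=scores.get)
print(scores, best)
```

In this toy case `pipeline_a` tracks the reference more closely and would be ranked higher; a real benchmark would aggregate such scores over many features and conditions.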

LINDO: Ontology for LINCS Data Integration
Hande Küçük-McGinty1,2, Asiyah Yu Lin2, Nooshin Nabizadeh1,2, Jianbin Duan1,2, Vasileios Stathias1,2,3, Dušica Vidović1,2,3, Stephan Schürer1,2,3
1 Center for Computational Science and University of Miami, Miami, FL
2 BD2K-LINCS Data Coordination and Integration Center
3 Department of Molecular and Cellular Pharmacology, University of Miami, Miami, FL

The Library of Integrated Network-based Cellular Signatures (LINCS) program generates diverse, multidimensional datasets, which are characterized by the model system (typically a cell), perturbation reagent (e.g. small molecule, genetic, environmental), and the assay with a specific cellular readout profile (e.g. transcription, proteomics, image-derived phenotype, protein binding, etc). In addition, many external datasets are being integrated with LINCS. In order to make sense of LINCS data as a whole, an ontology that facilitates contextual integration of data and querying of LINCS results is necessary. The LINCS consortium has developed LINCS metadata standards for the material entities that describe their data. Those material entities include LINCS cells and cell lines, small compounds, short nucleotides (shRNAs, siRNAs), antibodies, proteins, etc. The LINCS metaData Ontology (LINDO) classifies the above material entities into class/facet hierarchies. More importantly, LINDO logically defines the hierarchical classes/facets and the relations between entities, and links them with existing ontologies, such as BioAssay Ontology, Drug Target Ontology, Disease Ontology, Protein Ontology, Cell Line Ontology, CHEBI, and many others. The development of LINDO starts with ontologically modeling the metadata of the LINCS metadata registry (MDR). A modularization approach is applied to ensure the reusability and scalability of LINCS internal data, external data, and logical axioms. The first version of LINDO defines and represents LINCS cells and cell lines, diseases, small molecules, proteins, functions, roles, and assays in OWL. Future work on LINDO includes completely transforming the MDR data into a LINDO-based OWL representation, as well as developing LINDO-based applications such as semantic queries against the whole of the LINCS data, and contextual data integration.

iLINCS: Web-Platform for Analysis of LINCS Data and Signatures (http://ilincs.org)
Marcin Pilarczyk, Naim Mahi, Mehdi Fazel Najafabadi, Prudhvi Shedimbi, Michal Kouril, Nicholas Clark, Shana White, Mark Bennett, Wen Niu, John Reichard, Juozas Vasiliauskas, Jarek Meller, Mario Medvedovic
University of Cincinnati

iLINCS (Integrative LINCS) is an integrative web platform for analysis of LINCS data and signatures. The portal provides biologist-friendly user interfaces for analyzing transcriptomics and proteomics LINCS datasets. The portal integrates the R analytical engine, via several R tools for web-computing (rserve, opencpu, Shiny, rgl), and DCIC-developed web tools and applications (FTreeView, Enrichr) into a coherent web platform for LINCS data analysis. Users can follow several workflows which allow them to identify differentially expressed genes, proteins and phosphoproteins in LINCS datasets and use them in the analysis of other LINCS and non-LINCS datasets (e.g. TCGA and GEO transcriptomic datasets), and in the analysis of LINCS L1000 signatures. In this way, the platform facilitates integrative analysis of LINCS data and signatures. The mechanistic interpretation of LINCS transcriptomic and proteomics signatures is facilitated by enrichment analysis via Enrichr and DAVID, and by pathway analysis using the R implementation of the SPIA algorithm. The portal can be accessed freely and does not require user registration (http://ilincs.org).

Gene regulatory network inference using L1000 knockdown gene expression data
William Chad Young, Adrian E. Raftery, Ka Yee Yeung
University of Washington

We will present a method for inferring edges among LINCS L1000 landmark genes using knockdown experiments. Our approach uses the correlation between the knockdown gene and potential targets to compute posterior odds, which can be translated into a probability of regulation. We generate an edgelist by applying this method to all potential pairs and keeping those which exceed a predefined probability threshold. We compare this edgelist with TRANSFAC and JASPAR edges from Enrichr and show that it does indeed recover known edges.
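The last two steps of the approach, translating posterior odds into a probability and thresholding to obtain an edge list, can be sketched as below. How the posterior odds are computed from the knockdown-target correlations is not given in the abstract, so the odds values, gene names and threshold here are purely hypothetical inputs.

```python
def odds_to_probability(odds):
    """Convert odds O = p / (1 - p) into the probability p = O / (1 + O)."""
    return odds / (1.0 + odds)


# Hypothetical posterior odds that a knocked-down gene regulates a target gene.
posterior_odds = {
    ("KD_A", "T1"): 19.0,
    ("KD_A", "T2"): 0.25,
    ("KD_B", "T1"): 4.0,
}

threshold = 0.75  # assumed probability cutoff for keeping an edge

edge_list = sorted(
    pair
    for pair, odds in posterior_odds.items()
    if odds_to_probability(odds) > threshold
)
print(edge_list)
```

With these toy numbers the probabilities are 0.95, 0.2 and 0.8, so two of the three candidate edges pass the cutoff.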

Enrichr - A Search Engine for Gene Signatures
Maxim V. Kuleshov, MS1,2; Avi Ma’ayan, PhD1,2
1 Department of Pharmacology and Systems Therapeutics, Icahn School of Medicine at Mount Sinai, One Gustave L. Levy Place Box 1215, New York, NY 10029
2 BD2K-LINCS Data Coordination and Integration Center

Genomics, transcriptomics and proteomics studies produce lists of genes and proteins that are hard to interpret. In order to develop further understanding of the potential functions of these gene and protein lists, prior knowledge about known gene lists can be utilized. Since 2007 we have systematically processed online resources into annotated gene lists/sets. Collectively we assembled over 80,000 annotated gene sets from over 70 online resources. To serve these annotated gene sets for search against user lists we developed Enrichr, a web-based gene signature search engine and enrichment analysis tool. The annotated gene set libraries are organized and presented in eight categories: transcription, pathways, ontologies, diseases and drugs, cell types, miscellaneous, legacy and crowdsourcing. Enrichr computes enrichment using a method that improves upon the commonly used Fisher exact test, and enrichment results are visualized using several options implemented with the JavaScript library D3. Since its publication, Enrichr has been accessed by over 22,500 users who uploaded over 917,000 lists for analysis. In 2015 the submission rate increased substantially and it is currently at a median of more than 450 lists per day. Besides offering a useful tool to the community, the large collection of gene sets on its own can provide an invaluable resource to further understand the universe of mammalian gene sets, and can potentially be used to improve the enrichment analysis results. Enrichr is freely available at: http://amp.pharm.mssm.edu/Enrichr
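For reference, the baseline that the abstract says Enrichr improves upon, the one-sided Fisher exact test for over-representation, can be written in a few lines. The sketch below shows only that baseline, not Enrichr's own scoring; the function name and the example numbers are assumptions.

```python
from math import comb


def fisher_enrichment_p(overlap, list_size, set_size, universe):
    """One-sided Fisher exact test p-value for over-representation.

    Probability of observing at least `overlap` genes in common when a list of
    `list_size` genes is drawn at random from `universe` genes, of which
    `set_size` belong to the annotated gene set (hypergeometric upper tail).
    """
    total = comb(universe, list_size)
    upper = min(list_size, set_size)
    tail = sum(
        comb(set_size, k) * comb(universe - set_size, list_size - k)
        for k in range(overlap, upper + 1)
    )
    return tail / total


# Hypothetical query: 8 of 50 input genes fall in a 100-gene set, 20,000-gene background.
p_value = fisher_enrichment_p(overlap=8, list_size=50, set_size=100, universe=20000)
print(p_value)
```

Because the counts are combined as exact integers before the final division, the function stays accurate even for large backgrounds.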

Large-Scale Computational Screening Identifies First in Class Multitarget Inhibitor of EGFR Kinase and BRD4
Bryce K. Allen1,2,3, Saurabh Mehta1,2,3, Stuart W. J. Ember4, Ernst Schonbrunn4, Nagi Ayad5, Stephan C. Schürer1,2,3
1 Center for Computational Science and University of Miami, Miami, FL
2 BD2K-LINCS Data Coordination and Integration Center
3 Department of Molecular and Cellular Pharmacology, University of Miami, Miami, FL
4 Drug Discovery Department, H. Lee Moffitt Cancer Center and Research Institute, Tampa, FL
5 Miami Project to Cure Paralysis, Department of Psychiatry and Behavioral Sciences, University of Miami, Miami, FL

Inhibition of cancer-promoting kinases is an established therapeutic strategy for the treatment of many cancers, although resistance to kinase inhibitors is common. One way to overcome resistance is to target orthogonal cancer-promoting pathways. Bromo and Extra-Terminal (BET) domain proteins, which belong to the family of epigenetic readers, have recently emerged as promising therapeutic targets in multiple cancers. The development of multitarget drugs that inhibit kinase and BET proteins therefore may be a promising strategy to overcome tumor resistance and prolong therapeutic efficacy in the clinic. We developed a general computational screening approach to identify novel dual kinase/bromodomain inhibitors from millions of available small molecules. Our method integrated machine learning using big datasets of kinase inhibitors and structure-based drug design. Here we describe the computational methodology, including validation and characterization of our models and their application and integration into a scalable virtual screening pipeline. We screened over 6 million commercially available compounds and selected 24 for testing in BRD4 and EGFR biochemical assays. We identified several novel BRD4 inhibitors, among them a first in class dual EGFR-BRD4 inhibitor. Our studies suggest that this computational screening approach may be broadly applicable for identifying dual kinase/BET inhibitors with potential for treating various cancers.


Exploratory Analysis of LINCS Genomics and Proteomics Data
Naim Al Mahi, Prudhvi Shedimbi, Marcin Pilarczyk, Wen Niu, Michal Kouril, Mario Medvedovic
Laboratory for Statistical Genomics, Department of Environmental Health, Division of Biostatistics and Bioinformatics, University of Cincinnati College of Medicine, 3223 Eden Ave. ML 56, Cincinnati OH 45267-0056

iLINCS (Integrative LINCS) is an integrative web platform for analysis of LINCS data and signatures. The portal offers user-friendly interfaces to scientists for analyzing transcriptomics and proteomics LINCS datasets. To provide users with a high-level description of patterns in LINCS datasets, this platform presents exploratory analysis results. These results are visualized via interactive web applications based on R tools for web-computing (opencpu, rgl, and pairsD3) and DCIC-developed web tools (FTreeView). This exploratory analysis pipeline generates a Spearman correlation coefficient plot, a heatmap of the top 1000 noisy genes based on the median absolute deviation (MAD) values, interactive 2D and 3D principal component analysis (PCA) plots, an FTreeView analysis, and a table of experimental groups. opencpu generates heatmaps and PCA plots of any of the treatment groups available for the corresponding data.
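Selecting the most variable genes by median absolute deviation, as done for the heatmap described above, can be sketched as follows. The platform itself is built on R; this Python version, with hypothetical gene names and values, only illustrates the calculation.

```python
from statistics import median


def mad(values):
    """Median absolute deviation: median of |x - median(x)|."""
    m = median(values)
    return median([abs(v - m) for v in values])


def top_genes_by_mad(expression, n):
    """Return the n gene names with the largest MAD across samples."""
    ranked = sorted(expression, key=lambda gene: mad(expression[gene]), reverse=True)
    return ranked[:n]


# Hypothetical expression values for three genes across four samples.
expression = {
    "GENE_A": [5.0, 5.1, 4.9, 5.0],
    "GENE_B": [1.0, 9.0, 2.0, 8.0],
    "GENE_C": [3.0, 3.5, 2.5, 4.0],
}
top = top_genes_by_mad(expression, 2)
print(top)
```

MAD is less sensitive to single outlying samples than the standard deviation, which is one common reason for preferring it when ranking genes by variability.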

L1000CDS2: LINCS L1000 Characteristic Direction Signature Search Engine Qiaonan Duan, BS1,2; St. Patrick Reid, PhD3; Neil R. Clark, PhD1,2; Zichen Wang, BS1,2; Nicolas F. Fernandez, PhD1,2; Andrew D. Rouillard, PhD1,2; Ben Readhead, MBBS2; Sarah R. Tritsch, PhD3; Rachel Hodos, BS2; Marc Hafner, PhD4; Mario Niepel, PhD4; Peter K. Sorger, PhD4; Joel T. Dudley, PhD2; Sina Bavari, PhD3; Rekha G. Panchal, PhD3; Avi Ma’ayan, PhD1,2

1 Department of Pharmacology and Systems Therapeutics and 2 Department of Genetics and Genomics Science, Icahn School of Medicine at Mount Sinai, One Gustave L. Levy Place, New York, NY 10029; 3 US Army Medical Research Institute of Infectious Diseases, 1425 Porter Street, Frederick, MD 21702-5011; 4 Department of Systems Biology, Harvard Medical School, 200 Longwood Avenue, WAB438, Boston, MA 02115

The LINCS L1000 dataset currently comprises over a million gene expression profiles of chemically perturbed human cell lines. Here we demonstrate that processing the L1000 data with the Characteristic Direction (CD) method significantly improves the signal-to-noise ratio, as assessed by several intrinsic and extrinsic benchmarking schemes. The processed L1000 signatures are served through a state-of-the-art web-based search engine application called L1000CDS². The L1000CDS² search engine prioritizes thousands of small molecule signatures, and their pairwise combinations, that are predicted to either mimic or reverse an input gene expression signature using two principal methods. We applied L1000CDS² to prioritize small molecules that are predicted to reverse expression in 670 disease signatures extracted from the Gene Expression Omnibus (GEO). With this tool we also prioritized small molecules that can mimic the expression of 22 endogenous ligand signatures. As a case study demonstrating the utility of L1000CDS², we collected expression signatures from human cells infected with Ebola virus at 30, 60 and 120 minutes. Querying these signatures with L1000CDS², we identified kenpaullone, a GSK3B/CDK2 inhibitor that we show, in subsequent experiments, has dose-dependent efficacy in inhibiting Ebola infection in vitro without causing cellular toxicity. In summary, the L1000CDS² application can be applied in many biological and biomedical settings, while improving the extraction of knowledge from the LINCS L1000 resource. L1000CDS² is freely available at: http://amp.pharm.mssm.edu/L1000CDS2
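The idea of ranking signatures that mimic or reverse a query can be sketched as follows. The abstract does not spell out the scoring used by L1000CDS², so this example assumes cosine similarity between signature vectors. The function names, drug labels and values are hypothetical.

```python
import math


def cosine_similarity(sig_a, sig_b):
    """Cosine similarity between two signatures given as gene -> value dicts."""
    shared = set(sig_a) & set(sig_b)
    dot = sum(sig_a[g] * sig_b[g] for g in shared)
    norm_a = math.sqrt(sum(v * v for v in sig_a.values()))
    norm_b = math.sqrt(sum(v * v for v in sig_b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def rank_signatures(query, library, mode="reverse"):
    """Sort library signatures by similarity to the query.

    mode="mimic" puts the most similar signatures first;
    mode="reverse" puts the most anti-correlated signatures first.
    """
    scored = [(name, cosine_similarity(query, sig)) for name, sig in library.items()]
    return sorted(scored, key=lambda item: item[1], reverse=(mode == "mimic"))


# Toy query and library (made-up values, hypothetical drug labels).
query = {"G1": 1.0, "G2": -1.0, "G3": 0.5}
library = {
    "drug_X": {"G1": -1.0, "G2": 1.0, "G3": -0.5},
    "drug_Y": {"G1": 1.0, "G2": -1.0, "G3": 0.5},
    "drug_Z": {"G1": 0.0, "G2": 0.0, "G3": 1.0},
}
reversers = rank_signatures(query, library, mode="reverse")
mimickers = rank_signatures(query, library, mode="mimic")
print([name for name, _ in reversers])
print([name for name, _ in mimickers])
```

Here `drug_X` is the exact negative of the query, so it ranks first in reverse mode and last in mimic mode.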

Mutations and Drugs Portal (MDP): a database linking drug response data to genomic information Jimmy Caroli1, Cristian Taccioli2, Silvio Bicciato1

1 Dept. of Life Sciences, University of Modena and Reggio Emilia, Modena 41125, Italy; 2 Dept. of Animal Medicine, Production, and Health, University of Padova, Padova 35020, Italy

Genetic alterations in cancer cells generate cancer-specific dependencies that represent optimal predictors of response and can be exploited to develop targeted therapies. The integration of large-scale genomic and pharmacological data from cancer cell lines promises to be effective in the discovery of new genetic markers of drug sensitivity and of clinically relevant anticancer compounds. The Mutations and Drugs Portal (MDP, http://mdp.unimore.it) is a web-accessible database that combines the cell-based NCI60 pharmacological screening with genomic data extracted from the Cancer Cell Line Encyclopedia and the NCI60 DTP projects. MDP currently contains drug sensitivity data for more than 50,800 compounds, describing response to drugs across 115 cancer cell lines. To identify genomic features associated with drug response, cell line drug sensitivity data are integrated with large genomic datasets, including information on somatic mutations and transcriptional data. MDP can be queried for drugs active in cancer cell lines carrying mutations or transcriptional alterations in specific cancer genes and signaling pathways, or for genetic and transcriptional profiles associated with sensitivity or resistance to a given compound. Results are presented through graphical representations with links to related data and are fully downloadable. All data and tools are freely available without restriction. MDP provides a user-friendly web resource for performing in silico high-throughput screening of thousands of compounds and facilitates the discovery of associations between genomic portraits and drug responses.

ilincsR: The backend analysis engine for integrative LINCS (iLINCS) portal (http://ilincs.org) Mehdi Fazel-Najafabadi, Mario Medvedovic Division of Biostatistics and Bioinformatics, Environmental Health Department, University of Cincinnati

iLINCS (Integrative LINCS) is an integrative web platform for analysis of LINCS data and signatures. The portal provides biologist-friendly user interfaces for analyzing transcriptomic and proteomic LINCS datasets. Back-end data analysis in iLINCS is performed using the R analytical engine via several R tools for web computing (rserve, opencpu, Shiny, rgl). The main analytical workflows use functions implemented in the ilincsR R package. The ilincsR functions query the backend databases to acquire the data and metadata for the analysis, perform statistical analyses of the data and return the analysis results. Results are returned in the form of tables, web pages, various graphical representations and via interactive browsers. For example, one of the workflows starts with an “omic” LINCS dataset (transcriptomic, proteomic, phosphoproteomic) stored in the backend databases; constructs the “global” perturbation signature by assessing the differential expression of all genes/proteins after a perturbation; visualizes data for differentially expressed genes/proteins in the form of heatmaps; submits them to the Enrichr tool for enrichment analysis; performs pathway analysis of differentially expressed genes using the R implementation of the SPIA algorithm; correlates the newly constructed signature with the libraries of precomputed LINCS and non-LINCS signatures; and, if the user wants to proceed, uses the list of differentially expressed genes to query another dataset.

GEN3VA: Discovering and Analyzing Consensus Signatures from the Gene Expression Omnibus Gregory W. Gundersen1,2, Nicholas F. Fernandez1,2, Caroline D. Monteiro1,2, Avi Ma’ayan1,2

1 Department of Pharmacology and Systems Therapeutics, Icahn School of Medicine at Mount Sinai, One Gustave L. Levy Place, Box 1215, New York, NY 10029; 2 BD2K-LINCS Data Coordination and Integration Center (DCIC)

In 2015 we published the tool GEO2Enrichr [1], a browser extension that enables differential gene expression analysis of datasets from the Gene Expression Omnibus (GEO). Since January 2015, users of GEO2Enrichr have extracted more than 11,000 tagged gene signatures from GEO. Many of these signatures were extracted as part of a crowdsourcing microtask project that challenged participants to find, process, and tag GEO studies centered on the following common biological themes: single gene perturbations, drug perturbations, disease vs. normal signatures, MCF7 cell perturbations, old vs. young tissue signatures, microbial perturbations, and endogenous ligand perturbations. To analyze these and other collections of themed gene expression signatures, sometimes profiled under different conditions with different experimental platforms by independent studies, we developed the GenE Expression and Enrichment Vector Analyzer (GEN3VA). GEN3VA is a web-based server software system for analyzing collections of themed gene expression signatures. GEN3VA performs bulk enrichment analyses to produce an interactive web-based report with a variety of visualizations. For example, using GEN3VA we automatically created a report of 49 gene expression signatures collected from studies that compared normal tissues to tissues from patients or mouse models of Amyotrophic Lateral Sclerosis (ALS). This report can be viewed at http://amp.pharm.mssm.edu/gen3va/report/8/ALS and provides a tabular view of all 49 gene expression signatures with their associated metadata; principal component analysis (PCA); hierarchical clustering; and enrichment analyses. For the enrichment analyses component of the report, GEN3VA submits every gene signature from the collection to Enrichr and L1000CDS2. Enrichr performs enrichment analysis against many gene set libraries that include pathway databases, gene ontology, and regulation of gene sets by transcription factors. L1000CDS2 queries the gene signatures against the LINCS L1000 dataset to identify small molecules that can reverse or mimic the input expression signatures. The results from these analyses are visualized in interactive heat maps that can potentially facilitate the discovery of unique and common regulatory mechanisms and of small molecules that can be further tested experimentally. GEN3VA has been running since November 2015, and nearly 150 users have contributed to the creation of the underlying database of tagged collections of signatures.

1. Gregory W. Gundersen, Matthew R. Jones, Andrew D. Rouillard, Yan Kou, Caroline D. Monteiro, Axel S. Feldmann, Kevin S. Hu, Avi Ma’ayan. GEO2Enrichr: browser extension and server app to extract gene sets from GEO and analyze them for biological functions. Bioinformatics. 31, 3060-3062 (2015)

BD2K-LINCS DCIC Interactions towards an Integrated Ecosystem of Data and Tools Caty Chung1,2, Dušica Vidović1,2,3, Amar Koleti1,2, Vasileios Stathias1,2,3, Christopher Mader1,2, Mario Medvedovic2,5, Avi Ma’ayan2,4, Stephan Schürer1,2,3

1 Center for Computational Science, University of Miami, Miami, FL, USA; 2 BD2K-LINCS Data Coordination and Integration Center; 3 Department of Molecular and Cellular Pharmacology, University of Miami, Miami, FL, USA; 4 Department of Pharmacology and Systems Therapeutics, Icahn School of Medicine at Mount Sinai, One Gustave L. Levy Place, Box 1215, New York, NY 10029, USA; 5 Division of Biostatistics and Bioinformatics, Environmental Health Department, University of Cincinnati

The Big Data to Knowledge (BD2K) program is a trans-NIH initiative that aims to help the biomedical research community realize the potential of Big Data. The focus of the BD2K program is to support the research and development of innovative and transforming approaches and tools to maximize and accelerate the integration of Big Data and data science into biomedical research. The BD2K Centers program has established 11 Centers of Excellence for Big Data Computing and two Centers that are collaborative projects with the NIH Common Fund LINCS program: the LINCS-BD2K Data Coordination and Integration Center and the Broad Institute LINCS Center for Transcriptomics. Our BD2K-LINCS DCIC is constructing a high-capacity, scalable Integrated Knowledge Environment (IKE) enabling federated access, intuitive querying, and integrative analysis and visualization across all LINCS resources and many additional external data types from other relevant resources. Our Center’s data science research projects are aimed at addressing various data integration and data science challenges as well as developing new approaches for analyzing and visualizing complex LINCS datasets. One goal of our BD2K-LINCS DCIC is to provide an interface between the LINCS project, which produces complex systems biology datasets, and the BD2K consortium, which is addressing many data science challenges. This way, BD2K tools may become applicable to the LINCS project, and LINCS data may serve to evaluate or test many of the tools and data science solutions developed in the BD2K Centers of Excellence. To work towards this goal, our Center has engaged in several collaborations in the BD2K consortium to develop tools that automatically capture metadata and to define data exchange formats, API alignment, dataset citations, data provenance, and minimum information standards.

Adverse Events Prediction with the LINCS L1000 Data Zichen Wang1,2, Neil R. Clark1,2, Avi Ma’ayan1,2

1 Department of Pharmacology and Systems Therapeutics, Icahn School of Medicine at Mount Sinai, One Gustave L. Levy Place, Box 1215, New York, NY 10029; 2 BD2K-LINCS Data Coordination and Integration Center

Adverse Drug Reactions (ADRs) are central considerations during drug development. Here we present a machine learning classifier to prioritize ADRs for approved drugs and pre-clinical small molecule compounds by combining chemical structure (CS) and gene expression (GE) features. The GE data are from the Library of Integrated Network-based Cellular Signatures (LINCS) L1000 dataset, which measured changes in GE before and after treatment of human cells with over 20,000 small molecule compounds. Using various benchmarking methods, we show that integrating GE data with the CS of the drugs can significantly improve the predictability of ADRs. Moreover, transforming GE features into enriched biological terms further improves the predictive capability of the classifiers. The most predictive biological-term features can assist in understanding the drugs’ mechanisms of action. Finally, we applied the classifier to all >20,000 small molecules profiled, and developed a web portal for browsing and searching predicted small-molecule/ADR connections.

Two novel algorithms for inferring the transcription factors regulating expression changes measured in RNA-Seq data Matt C Danzi, John L Bixby, Vance P Lemmon, and Stefan Wuchty The Miami Project to Cure Paralysis and the Department of Computer Science, University of Miami

Data science approaches to analyzing gene expression datasets often revolve around derivation of a “signature”, comprising a set of genes consistently regulated in the experimental or diseased condition. These analyses generally find patterns that fit the data at hand, but do not generalize to new datasets. This non-generalizability may stem from the fact that gene expression profiling measures the “passengers” rather than the “drivers” of the biological event being studied. We hypothesize that we will find greater consistency, and thus generalizability, in the patterns discovered in biological datasets if we can measure the “drivers” specifically. One major, well-studied class of “drivers” affecting gene expression consists of transcription factors. Therefore, investigators have proposed methods to glean information about the activities of transcription factors from RNA-Seq data and thereby derive the “driver” transcription factors responsible for observed changes in gene expression. Here, we propose two new algorithms for this purpose. The first method uses a decision tree structure to group genes into similarly regulated cohorts based on the degree to which their expression changes across experimental conditions and the combinations of transcription factors that regulate them. The second method takes advantage of the fact that transcription factors regulate other transcription factors to form complex webs and cycles of regulation. This approach applies a greedy algorithm to track through these complex regulatory relationships and identifies the minimum set of transcription factors which, through many levels of regulation, are most likely responsible for the observed gene expression changes. Since the two algorithms approach the issue of transcriptional inference very differently, we plan to use them in conjunction to identify the most relevant transcription factors that drive the biological process being measured.

Acknowledgments: This work was supported by The Miami Project to Cure Paralysis, The Buoniconti Fund, The Walter G. Ross Foundation and NICHD R01 HD057632 (VPL and JLB)
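The abstract above does not give the details of its greedy method. As a rough illustration of the general idea only, the sketch below uses the standard greedy set-cover heuristic: it repeatedly picks the transcription factor that explains the most still-unexplained genes. It assumes that each factor's target set already includes its direct and indirect targets, and every name and value is hypothetical.

```python
def greedy_regulator_cover(tf_targets, changed_genes):
    """Greedy set-cover heuristic (not the authors' algorithm).

    tf_targets: dict mapping a transcription factor to the set of genes it
                is assumed to regulate, directly or indirectly.
    changed_genes: iterable of genes whose expression changed.
    Returns the selected factors and the genes no factor could explain.
    """
    uncovered = set(changed_genes)
    selected = []
    while uncovered:
        # Iterate in sorted order so ties are broken alphabetically.
        best_tf = max(sorted(tf_targets), key=lambda tf: len(tf_targets[tf] & uncovered))
        gained = tf_targets[best_tf] & uncovered
        if not gained:
            break  # remaining genes are not targets of any known factor
        selected.append(best_tf)
        uncovered -= gained
    return selected, uncovered


# Toy regulatory map (made-up factors and genes).
tf_targets = {
    "TF1": {"a", "b", "c"},
    "TF2": {"c", "d"},
    "TF3": {"d", "e"},
    "TF4": {"a"},
}
selected, unexplained = greedy_regulator_cover(tf_targets, {"a", "b", "c", "d", "e", "f"})
print(selected, unexplained)
```

A greedy cover is not guaranteed to be the true minimum set, which is one reason to combine it with a second, independent method as the authors propose.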

Extraction and Analysis of Mammalian Gene Expression Signatures from GEO by the Crowd Zichen Wang1,2, Caroline D Monteiro1,2, Nicolas F Fernandez1,2, Gregory W Gundersen1,2, Andrew D Rouillard1,2, Axel S Feldmann1,2, Kevin S Hu1,2, Michael G McDermott1,2, Qiaonan Duan1,2, Neil R Clark1,2, Matthew R Jones1,2, Yan Kou1,2, Sherry L Jenkins1,2, Coursera NASB15 students, Avi Ma’ayan1,2

1 Department of Pharmacology and Systems Therapeutics, Icahn School of Medicine at Mount Sinai, One Gustave L. Levy Place, Box 1215, New York, NY 10029; 2 BD2K-LINCS Data Coordination and Integration Center

The volume of gene expression data accumulating in public repositories such as GEO or ArrayExpress is growing exponentially. Reanalysis and integration of those datasets has the potential to produce new insights about data reproducibility, assist in better understanding disease mechanisms, help in the discovery of novel drug targets, and identify opportunities for drug repurposing. However, achieving this goal requires human curation to identify relevant studies for a specific theme and to label the control vs. experimental samples. We have developed a crowdsourcing project for annotating and analyzing a large number of gene expression profiles from GEO for the following themes: single gene perturbations, disease vs. normal signatures, and drug perturbation signatures. Through the Network Analysis in Systems Biology 2015 (NASB15) Coursera course we recruited 78 participants who collected 2460 single gene perturbation signatures, 839 disease vs. normal signatures, and 906 drug perturbation signatures. All these signatures are unique and were manually validated for quality. Global analysis of this collection of signatures confirmed known associations, and identified novel ones, between genes, diseases and drugs. A web portal that enables browsing these relationships is being developed. The web portal provides different modes of visualization and a signature search capability, where users can submit their own signatures to find matches with signatures in the database. The web portal prototype is available at: http://amp.pharm.mssm.edu/creeds/.

Metadata Verification and Registration Tools: Towards the enablement of FAIR (Find, Access, Interoperate, Re-use) Data Principles for LINCS Metadata Standards Amar Koleti1,2, Raymond Terryn1,2,3, Christopher Mader1,2

1 Center for Computational Science, University of Miami, Miami, FL; 2 BD2K-LINCS Data Coordination and Integration Center; 3 Department of Molecular and Cellular Pharmacology, University of Miami, Miami, FL

Integration of diverse LINCS data resources by standardized metadata requires i) a clear definition of the data space in terms of what types of metadata and results will be captured, ii) a specification of how the metadata entities are identified or referenced, such as established controlled vocabularies, and iii) standardized data formats to enable data exchange (e.g. queries) via computational information systems. The DCIC has worked in close coordination with the DSGCs to establish metadata standards. Metadata conforming to these standards are incorporated into LINCS data packages in our data release process and have been shown to be fundamentally trustworthy. However, the long-term quality and usability of LINCS data require tools that ensure the consistency of the data collection and that facilitate increasing automation of metadata collection, validation, and quality assurance as key components of a controlled data release process. The aim of this work has been to accomplish this by providing intuitive data entry tools for the consistent registration, revision, and validation of LINCS metadata. Several features have been incorporated, including a unified interface, field mapping, quality control flags, and synonyms. Consistent and high-quality metadata are a critical requirement of the FAIR data principles, and our metadata verification and registration tools provide one step towards this goal in the LINCS consortium.
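To make the roles of field mapping, synonyms, and quality control flags concrete, here is a minimal sketch of a metadata validation step. It is not the DCIC tool: the field names, vocabulary, synonym table, and flag strings are all invented for this example.

```python
# All names below are hypothetical and only illustrate the concepts.
REQUIRED_FIELDS = ("cell_line", "perturbagen", "assay")
FIELD_MAP = {"Cell Line": "cell_line", "cellline": "cell_line", "compound": "perturbagen"}
VOCABULARY = {"cell_line": {"MCF7", "A549"}}
SYNONYMS = {"cell_line": {"MCF-7": "MCF7"}}


def validate_record(raw):
    """Normalize one submitted metadata record and collect QC flags.

    Returns (normalized_record, flags); an empty flag list means the
    record passed every check in this sketch.
    """
    record = {}
    for key, value in raw.items():
        field = FIELD_MAP.get(key, key)                   # field mapping
        value = SYNONYMS.get(field, {}).get(value, value)  # synonym resolution
        record[field] = value

    flags = []
    for field in REQUIRED_FIELDS:
        if not record.get(field):
            flags.append(f"missing:{field}")
    for field, allowed in VOCABULARY.items():
        if record.get(field) and record[field] not in allowed:
            flags.append(f"unknown_term:{field}={record[field]}")
    return record, flags


good_record, good_flags = validate_record(
    {"Cell Line": "MCF-7", "compound": "kenpaullone", "assay": "L1000"}
)
bad_record, bad_flags = validate_record({"cellline": "HeLa-ish", "assay": "L1000"})
print(good_record, good_flags)
print(bad_record, bad_flags)
```

Records with a non-empty flag list could be routed back to the submitting center for revision before release.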

LINCS Dataset Registry (LDR): A Web-Based System to Capture and Manage LINCS Data Releases with Autocomplete Web Forms Michael G. McDermott1,2, Qiaonan Duan1,2, Amar Koleti2,3, Dušica Vidović2,3,4, Stephan Schürer2,3,4, Chris Mader2,3, Avi Ma’ayan1,2

1 Dept. of Pharmacology & Systems Therapeutics, Icahn School of Medicine at Mount Sinai, New York, NY; 2 BD2K-LINCS Data Coordination and Integration Center (DCIC); 3 Center for Computational Science, University of Miami, Miami, FL; 4 Department of Molecular and Cellular Pharmacology, University of Miami, Miami, FL

One of the challenges of BD2K is to capture metadata about dataset instances and link such metadata to controlled dictionaries and ontologies. This metadata capture is expected to improve dataset search and facilitate data integration. This challenge is central to the LINCS program and the BD2K-LINCS DCIC because different LINCS data generation centers use different but overlapping assays, perturbations (genes/proteins, small molecules), cell lines, disease models, readouts and other common entities. Most major data repositories, such as GEO or Chorus, currently do not have advanced web-based forms to capture metadata about dataset instances from their data submitters. In year 1, the BD2K-LINCS DCIC developed the LINCS Dataset Registry (LDR) system to capture, visualize and manage all LINCS released datasets. LDR is a modern, mobile-friendly web application designed to streamline the process of submitting, approving, and releasing datasets. LDR consists of a client-side application created with the JavaScript library AngularJS and a web server application written in NodeJS. The server's extensive APIs communicate with a MongoDB database responsible for storing and querying each center's data. LDR has login authentication functionality that secures unreleased datasets, and its advanced input forms allow for fast, hassle-free data entry. Form entities have autocomplete functionality drawing from ontologies and dictionaries managed by live remote servers. LDR also contains a dataset-specific message board that enables communication between the LINCS data generation centers and the NIH staff to facilitate an approval process. While designed for LINCS, LDR will be generalized to facilitate data capture for other projects.
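The autocomplete behavior described above can be illustrated with a small prefix search over a sorted term list. LDR itself is built with AngularJS and NodeJS and draws its terms from remote servers. The Python sketch below instead assumes a local in-memory list, and its function names and cell line terms are chosen purely for illustration.

```python
from bisect import bisect_left


def build_index(terms):
    """Return a sorted list of (lowercased term, original term) pairs."""
    return sorted((term.lower(), term) for term in set(terms))


def autocomplete(index, prefix, limit=5):
    """Return up to `limit` terms that start with `prefix`, ignoring case."""
    key = prefix.lower()
    # (key,) sorts before any (key..., original) pair, so this finds the
    # first entry whose lowercased term is >= the prefix.
    start = bisect_left(index, (key,))
    matches = []
    for lowered, original in index[start:]:
        if not lowered.startswith(key):
            break
        matches.append(original)
        if len(matches) == limit:
            break
    return matches


# Hypothetical dictionary of cell line names.
index = build_index(["MCF7", "MCF10A", "A549", "HeLa", "HepG2", "HT29"])
print(autocomplete(index, "mcf"))
print(autocomplete(index, "he"))
```

Because the index is sorted, each lookup costs a binary search plus the number of matches returned, which keeps form fields responsive even for large dictionaries.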

LINCS Data standards, Metadata and the MetaData Registry (MDR) Vasileios Stathias1,2,3, Dušica Vidović1,2,3, Amar Koleti1,2, Christopher Mader1,2, Stephan Schürer1,2,3

1 Center for Computational Science, University of Miami, Miami, FL; 2 BD2K-LINCS Data Coordination and Integration Center; 3 Department of Molecular and Cellular Pharmacology, University of Miami, Miami, FL

The NIH Library of Integrated Network-Based Cellular Signatures (LINCS) program generates extensive multidimensional datasets using a variety of assay formats and technologies. Integration and analysis of diverse LINCS datasets depend on the availability of sufficient metadata to describe the assays and screening results. LINCS metadata standards were developed in the LINCS Data Working Group (DWG) as a LINCS Consortium effort. A private DWG Google website and Google spreadsheets were set up for sharing information within the LINCS Consortium. The metadata standards specifications have been published on the LINCS website: http://lincsproject.org. A dedicated information system, the MetaData Registry (MDR), was developed and implemented for registration and storage of standardized metadata entities with their roles and connections to LINCS assays, projects, organizations, and datasets.

Construction and Validation of Consensus Gene Signatures Lixia Zhang, Mario Medvedovic University of Cincinnati

The objective of our project was to construct and validate Consensus Gene Signatures (CGSs) based on the LINCS L1000 shRNA knockdown data. The CGSs were constructed by first building moderated Z (ModZ) signatures of individual shRNA perturbations from the Level 4 data released by the Broad Institute (http://lincscloud.org), and then aggregating the individual shRNA signatures into a CGS, again using the moderated Z methodology. The CGSs were validated against known protein-protein and protein-gene interactions gathered from the STRING database and KEGG pathways.
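The moderated Z aggregation mentioned above is commonly described as a weighted average of replicate z-score profiles, where each profile is weighted by its mean Spearman correlation with the others. The abstract does not give the exact procedure used here, so the following is a sketch of that general scheme only. The minimum weight of 0.01 and all function names are assumptions.

```python
import math


def ranks(values):
    """Return 1-based ranks, assigning tied values their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        average_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            result[order[k]] = average_rank
        i = j + 1
    return result


def pearson(x, y):
    """Pearson correlation; returns 0.0 if either vector is constant."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sd_x = math.sqrt(sum((a - mean_x) ** 2 for a in x))
    sd_y = math.sqrt(sum((b - mean_y) ** 2 for b in y))
    return cov / (sd_x * sd_y) if sd_x and sd_y else 0.0


def spearman(x, y):
    """Spearman correlation computed as Pearson correlation of the ranks."""
    return pearson(ranks(x), ranks(y))


def moderated_z(profiles, min_weight=0.01):
    """Combine z-score profiles into one signature, down-weighting outliers."""
    n = len(profiles)
    if n == 1:
        return list(profiles[0])
    raw = []
    for i in range(n):
        corrs = [spearman(profiles[i], profiles[j]) for j in range(n) if j != i]
        raw.append(max(sum(corrs) / len(corrs), min_weight))
    total = sum(raw)
    weights = [w / total for w in raw]
    return [
        sum(w * profile[g] for w, profile in zip(weights, profiles))
        for g in range(len(profiles[0]))
    ]


# Three concordant profiles and one discordant profile (made-up z-scores).
profiles = [[2.0, -1.0, 0.0]] * 3 + [[-2.0, 1.0, 0.0]]
consensus = moderated_z(profiles)
print(consensus)
```

In this toy case a plain mean would give 1.0 for the first gene, while the weighted consensus stays close to 2.0 because the discordant profile receives almost no weight.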


Logistics

Hotel Mutiny
2951 S Bayshore Drive (“Coconut Grove”) Miami, FL 33133
305.441.2100
http://www.providentresorts.com/mutiny-hotel/
Shuttle departs Hotel Mutiny Lobby at 7:30 AM EST on 1/19 & 20, and at 8:30 AM on 1/21.

Lois Pope Life Center
UM Miller School of Medicine Campus
1095 NW 14th Terrace Miami, FL 33136
305.243.6001
http://uhealthsystem.com/locations/lois-pope-life-center
The entrance to the building is located on the right-hand side of NW 14th Terrace. At the lobby, attendees will be escorted to the elevator to the 7th floor. On exiting the elevator, the auditorium entrance can be accessed from either side.

Parking
The entrance to the Lois Pope Life Center is on NW 14th Terrace. Public parking is available in the Dominion Towers Parking Garage (pink garage on the corner of NW 11th Avenue and NW 14th Street) or at metered parking along NW 11th Avenue.

Public Transportation
Public transportation is available via Metrobus routes 12, 22, 32, 95 and M, and via the Metrorail Civic Center Station adjacent to the University of Miami Medical Campus.

Lois Pope Life Center Breezeway
The LPLC Breezeway lies between the LPLC and the Schoninger Research Quadrangle.

Toppel Career Center
2951 S Bayshore Drive (“Coconut Grove”) Miami, FL 33133
305-441-2100
http://www.sa.miami.edu/toppel/mainSite/
The shuttle departs the Mutiny Hotel lobby at 7:30 am EST to the Toppel Career Center Reflections Courtyard. Entrances to the building are on Ponce De Leon Blvd. and at the back.

• The Loft (Multipurpose Room #250): At the lobby, attendees will be directed to the elevator to the 2nd floor. The entrance is in front of the elevator.
• The Stone-Rodriguez Technology Lab (Room #133): At the lobby, head towards the 3x2 flat panels. In front of the single display on the right-hand side, you will find the “Tech Lab”.
• Conference & Seminar Room (Room #114): At the lobby, head towards the 3x2 flat panels. After the single flat panel display, turn left and walk to the end of the hallway, where you will find the Conference and Seminar Room.

Parking
Entrances to the Toppel Career Center are on Ponce De Leon Blvd and at the back. Public parking is available in the Pavia Garage: http://www.miami.edu/index.php/about_us/visit_um/where_to_park/

Public Transportation
Public transportation is available via the Metrorail (blue M at bottom of map). Get off at “University Station” and cross Ponce de Leon Boulevard to reach the Toppel Center.


Gables One Tower
Center for Computational Science
1320 S Dixie Highway, Suite 600 Coral Gables, FL 33146
305-243-4962
http://www.ccs.miami.edu/
The shuttle departs the Mutiny Hotel lobby at 8:30 am EST to the Gables One Tower. Entrances to the building are on S Dixie Highway and at the back. The Center for Computational Science is located on the sixth floor, Suite 600, but the meeting will take place in Training Room 639 (from the elevators, turn left in the hallway and it is on your right).

Credits
This event is supported by the University of Miami School of Medicine Department of Molecular and Cellular Pharmacology, the University of Miami Center for Computational Science, and by grant U54HL127624 awarded by the National Heart, Lung, and Blood Institute through funds provided by the trans-NIH Library of Integrated Network-based Cellular Signatures (LINCS) Program (http://www.lincsproject.org/) and the trans-NIH Big Data to Knowledge (BD2K) initiative (http://www.bd2k.nih.gov).
