Cambridge Bioinformatics Training
Bioinformatics for Biologists
An introduction to programming, analysis, and reproducibility
3-7 December 2018
Programme
Monday 3rd December 2018
Getting you set up for Bioinformatics analyses
09:30
Welcome
09:45
Bioinformatics opportunities and challenges in handling biological data
This session will highlight examples of opportunities and challenges that are associated with bioinformatics, touching on topics that will be covered in this course.
10:45
Coffee break
11:00
Introduction to Research Data Management
This session will introduce the basic principles of Research Data Management (RDM) and how they are relevant throughout the research life cycle. Intended for those who are new to RDM, it will first explain what RDM is, and then go on to cover basic data back-up and storage options, file sharing tools, and strategies for organising your data.
13:00
Lunch
14:00
Data organisation in spreadsheets
This session will discuss ways in which data should (and should not) be formatted and stored for adequate analysis. This includes a discussion about specific types of variables, such as dates, and missing values.
15:30
Coffee break
15:45
Introducing R
In this session we will learn how to use R and RStudio software and cover the basic syntax of the R language. This includes being able to create different types of objects, use functions to carry out particular tasks, and read tabular data from a file.
17:30
End of day 1
Tuesday 4th December 2018
Introduction to programming in R
09:30
R recap
09:50
Data structures
In this session we will learn how R represents tabular data and how to manipulate these objects, for example to subset rows and columns from a table.
10:45
Coffee break
To register: https://training.csx.cam.ac.uk/bioinformatics/event/2731538
11:00
Data manipulation in R
In this session we will learn advanced ways to manipulate and query tabular data in R. This includes modifying variables, conditional filtering of rows, and creating grouped summaries from the data.
13:00
Lunch
14:00
Data visualisation in R
In this session we will learn how to use the ggplot2 package to produce a range of graphs from tabular data. This includes using colour or shape elements to highlight different groups in the dataset, as well as splitting plots into facets to visualise high-dimensional data.
15:30
Coffee break
15:45
Data exploration in R
This session applies the concepts learned in the course to a worked example of an exploratory data analysis from a transcriptome study. This includes exploring the data using graphs, and an application of a commonly used method to help interpret multi-dimensional datasets (Principal Components Analysis).
17:30
End of day 2
Wednesday 5th December 2018
Introduction to Statistics for data analysis
09:30
Introduction and R revision
In this session we will set out the goals for the day’s sessions and ensure that all participants are comfortable using R/RStudio statistical software.
10:30
Coffee break
10:45
Simple hypothesis testing
In this session we will explore the fundamental principles of traditional frequentist statistics and ensure that participants are able to analyse simple datasets confidently, selecting appropriate tests and interpreting the results.
13:00
Lunch
14:00
Statistics for small data
In this session we will build upon the earlier session and look at additional techniques for dealing with traditional small datasets. This will mainly revolve around linear models with small numbers of predictor variables.
15:30
Coffee break
15:45
Statistics for big data
In this session we will consider what is meant by “big” data and explore how analysing big datasets differs from traditional “small” data. We will highlight this using unsupervised learning as an example.
17:30
End of day 3
Thursday 6th December 2018
Application of programming and analyses to biological data
09:30
RNA-seq analyses: case study
From FASTQ files to differential expression, we will cover each step of an RNA-seq analysis. We will explain the main steps in the pipeline and some of the theory behind it.
10:30
Coffee break
10:45
RNA-seq analyses: case study (continued)
13:00
Lunch
14:00
Biological imaging: case study
This session will include: an introduction to data management in OMERO; how to integrate a variety of third-party processing tools (e.g. ImageJ, Orbit, R) with OMERO for manual analysis, and how to mine the results using OMERO tools; how to transition from manual data processing to automated processing workflows using applications against the OMERO API; and how to generate output ready for publication.
15:30
Coffee break
15:45
Biological imaging: case study (continued)
17:30
End of day 4
Friday 7th December 2018
Reusability and reproducibility for Bioinformatics analyses
09:30
Advanced Data Management
This session covers managing personal and sensitive data in the context of the new GDPR legislation, why it is a Good Thing to share your data, and how to do this most effectively in terms of describing your data, deciding where to share it, and using licences to control how your data is used by others. You will even get to write your own Data Management Plan (DMP): these help you manage your data throughout a project and after it has ended and are increasingly required as part of a grant or fellowship application.
11:30
Coffee break
11:45
Reproducible research with Rmarkdown
The session will introduce the notion of markdown and of literate programming, which combines human-readable text and computer code. We will then examine how to write reproducible reports containing tables and figures, and the challenges faced when writing large, complex reports.
13:00
Lunch
14:00
Introduction to version control using GitHub
Have you ever wondered how to keep automatic control of the different versions of your research documents, computer scripts and data sets? In this session we will learn the basics of git and GitHub so that you can then develop version control for your own research projects.
15:30
Coffee break
15:45
Wrap-up and next steps
This session will provide an opportunity to reflect on what was covered during this course.
17:30
End of day 5
Booking for this week-long course can be made at: https://training.csx.cam.ac.uk/bioinformatics/event/2731538
While lunch is not provided during the course, tea/coffee are provided throughout. The deadline for booking is 1 November 2018.
Venue
The course will be held at the Bioinformatics Training Facility, located in the Craik-Marshall Building, Downing Site, University of Cambridge, Cambridge CB2 3EB, United Kingdom (http://map.cam.ac.uk/Craik-Marshall+Building).
Contact us
For more information, contact us at grad.bioinfo@lifesci.cam.ac.uk
Course Syllabus
Aims
The aim of this 1-week course is to:
• Encourage the development of the bioinformatics skills needed to process biological data effectively
• Provide practical experience and guidance on how to manage and analyse examples of biological data
• Introduce best practices with regard to working with data reproducibly
Content
This course provides an introduction to the best practices and tools needed to perform research effectively and reproducibly. The course will begin by discussing the opportunities and challenges associated with bioinformatics analyses. While some of these will only be touched on conceptually, we will address a subset of them in greater detail in the central part of the course and provide time for students to practise using some of the associated bioinformatics tools. Focusing on solutions for handling biological data, we will cover introductory lessons in R, version control, statistical analyses, and data management. The R component of the course will range from basic steps in R to using some of the most popular R packages (dplyr and ggplot2) for data manipulation and visualisation. No prior R experience or previous knowledge of programming/coding is required.
At the end of the course we will address issues relating to reusability and reproducibility.
Target audience
This course is aimed at individuals working across biological and biomedical sciences who have little or no experience in bioinformatics. Applicants are expected to have an interest in learning about bioinformatics and/or to be in the early stages of using bioinformatics in their research, with a need to develop their skills and knowledge further. No previous knowledge of programming/coding is required for this course.
Presentation of the course
The course will consist of a mixture of trainer-led lectures and practical work which will consist of computer exercises that will introduce the participants to software tools, including R, to analyse biomedical data under the guidance of the trainers and teaching assistants.
As a result of attending the course participants should be able to:
• Define opportunities and challenges of bioinformatics use in research
• Format and clean, visualise, and explore datasets in R
• Evaluate which statistical tests are appropriate for a dataset
• Develop an appropriate strategy for research data management
• Implement reusable and reproducible methods for their research
Reading and resources list
Listed below are a number of texts that might be of interest for future reference, but do not need to be bought (or consulted) for the course.
Books to read following the R session:
• Garrett Grolemund and Hadley Wickham. R for Data Science. O'Reilly 2017 (available at: http://r4ds.had.co.nz/)
• Gareth James, Daniela Witten, Trevor Hastie and Robert Tibshirani. An Introduction to Statistical Learning. Springer 2017 (available at: http://www-bcf.usc.edu/~gareth/ISL/)
Reading list for Introduction to version control with GitHub session:
• Ten simple rules for taking advantage of Git and GitHub (http://journals.plos.org/ploscompbiol/article?id=10.1371/journal.pcbi.1004947)
URLs
• https://www.rstudio.com/resources/cheatsheets/
• https://dockflow.org/
• https://codeocean.com/
Programme trainers
Lauren Cadwallader
Dr Lauren Cadwallader is Acting Joint Deputy Head of Scholarly Communication with responsibility for Research Data Management. She completed her PhD in Archaeology at the University of Cambridge in 2013, carried out a one-year postdoc and then moved into the area of scholarly communication, advising academics on open access publishing, and providing training to the research community around open research. In her current role, she is responsible for the Research Data services run by the Office of Scholarly Communication, dealing with questions about research data management and sharing, and helping the University’s researchers comply with the open data sharing requirements of their funders.
Matt Castle
Dr Matt Castle is Head of the GSLS Biostatistics Initiative at the University of Cambridge. He teaches on, and coordinates, a wide range of practical statistics training courses for graduate students in the life sciences. His previous work in epidemiological modelling, analysing and providing advice to governments and NGOs on disease control strategies, and his experience as a secondary mathematics teacher, mean that he is very aware of the importance of clear communication when dealing with challenging topics like statistics, and his teaching emphasises the practical nature of these skills. Matt is heavily involved with teaching across the university at all levels from undergraduates through to new lecturers. He provides lectures, supervisions and practical classes for various courses in the Natural Sciences and Mathematics Triposes as well as working with the Cambridge Centre for Teaching and Learning to support early career academics and researchers in developing their pedagogical understanding and teaching skills as part of the Teaching Associates’ Programme (TAP) and the Pathways to Higher Education Practice (PHEP).
Sergio Martinez Cuesta
Sergio is a research associate in the CRUK-CI and the Department of Chemistry at the University of Cambridge (UK). He studied Chemistry at the University of Granada (Spain) followed by a PhD in bioinformatics and cheminformatics at EMBL-EBI (UK). He develops computational methods to map and characterise chemical changes and damage in DNA and RNA with an aim to understand basic biological processes, aging and disease. He very much enjoys developing diverse training materials for computational courses in life and physical sciences. https://orcid.org/0000-0001-9806-2805
Matthew Eldridge
Matt joined the CRUK CI in December 2007 as Head of the Bioinformatics Core. He originally studied chemistry, completing both a first degree and a D.Phil. in computational chemistry at the University of Oxford. He began his working life at Proteus Molecular Design, a company specializing in computer-aided drug design, before moving into bioinformatics at the European Bioinformatics Institute as part of the Industry Support Programme. En route to CRUK CI, he gained over 10 years of commercial software development experience at Synomics, LSIS and Biowisdom, consulting on and delivering enterprise data management systems to several major pharmaceutical and biotechnology companies. Matt particularly enjoys coding in Java, Ruby and R. He has developed MGA, a contaminant screening QC tool for sequencing data, and various analysis pipelines for calling variants in cancer samples and circulating DNA. He has also had considerable fun creating web applications in R with Shiny, to support training courses and for providing access to data resources created by CRUK CI labs (BCaPE), as well as deploying these using Docker and Singularity. He contributes to and helps run training courses in cancer genome analysis, R, Docker and Shiny.
Stephen Eglen
Stephen Eglen is a University Reader in the Department of Applied Mathematics and Theoretical Physics and Fellow of Magdalene College Cambridge. He is a computational neuroscientist who studies mechanisms of neural development. He is also a strong advocate for open science and reproducible research.
Kim Gurwitz
Kim is the ELIXIR EXCELERATE Training Impact Coordinator. Her responsibilities include assessing the impact of training courses run at the Cambridge Bioinformatics Training Programme as well as coordinating the impact assessment effort for training across ELIXIR, a Pan European Bioinformatics Network. She was previously Training and Outreach officer at H3ABioNet, a Pan African Bioinformatics Network. Kim completed her MSc in Medical Biochemistry at the University of Cape Town, South Africa in 2016. Her research experience includes: mass spectrometry-based proteomics, bioinformatics, HIV-associated neurocognitive disorders, stem cell culture.
Danny Kingsley
Danny Kingsley is the Deputy Director for Scholarly Communication and Research Services at Cambridge University Library. She is responsible for the scholarly communication services of the University – including Open Access and Data Sharing. She writes extensively about scholarly communication issues on the Unlocking Research blog.
Dominik Lindner
Dominik Lindner joined the team in Dundee as a Software Developer in February 2014. After studying Bioinformatics at the University of Applied Sciences Weihenstephan, Freising, he worked at different places as Java Software Developer with a bit of Linux system administration. His projects mostly had a bioinformatics background (analysis of multiple sequence alignments, DNA sequence optimization) with a short side trip into the world of logistics/warehousing. In his spare time he enjoys exploring Scotland's countryside on foot, on the bicycle or on the motorbike, or, when the days are short and the weather's bad, trying to set up the ultimate Linux system.
Oscar M. Rueda
Oscar M. Rueda completed a BSc in Statistics and an MSc in Statistics at Universidad de Valladolid (Spain). He then moved to the Spanish National Cancer Centre (CNIO) in Madrid and finished his PhD in Mathematics (Statistics) in 2008. He is currently a Senior Research Associate in the Caldas Lab, where he has been for the last 9 years. His work involves developing statistical methods for stratifying breast cancer patients based on molecular profiles and finding biomarkers for risk of relapse.
Hugo Tavarez
Hugo studied Biology at the University of Lisbon (Portugal) and did a PhD in evolutionary genetics at the John Innes Centre (Norwich). In his PhD project, he applied a combination of classic genetics and population genomics to study a natural hybrid zone, which helped understand the forces shaping divergence patterns in genomes of closely related subspecies. For the past three years Hugo has been working as a postdoc at the Sainsbury Laboratory Cambridge University, where his research work focuses on the quantitative genetics of complex traits, and how they interact with the environment. Since moving to Cambridge, he has become a regular R trainer for the University's Bioinformatics Training team and in his institute, where he also gives scientific computing and data analysis support.
Petr Walczysko
Petr Walczysko joined the OME project in October 2012 as a software specialist for testing and quality assurance. He studied at Charles University of Prague, where he received a Master of Science degree in Physics, and at the University of Freiburg in Germany, where he received a PhD in Biology. Throughout his PhD studies and his further career as a researcher he made intensive use of conventional, confocal and multiphoton fluorescence microscopy applications on biological systems. He adapted these optical microscopy techniques for particular biological problems, and also worked on the subsequent image analysis of microscopic images in a range of image analysis programmes. He enjoys yoga, reading and chess.
Contact us:
Cambridge Bioinformatics Training
Craik-Marshall Building
Downing Site
University of Cambridge
Cambridge CB2 3EB
United Kingdom
Email: grad.bioinfo@lifesci.cam.ac.uk
Telephone: +44 (0)1223 333614
Website: https://bioinfotraining.bio.cam.ac.uk/
Follow us on Twitter: @BioInfoCambs