Data Exploration for Biologists - Winter School


Cambridge Bioinformatics Training - Winter School

Data Exploration for Biologists
An introduction to Data Exploration, Statistics, and Reproducibility
7 - 11 December 2020



The Data Exploration for Biologists School is a week-long course that introduces participants to the foundational methods for exploring biological data. It does so by introducing programming, statistical and reproducibility concepts that enable the creation of automated ways to analyse data. The course will go over popular software and packages used in the field, such as R, dplyr, ggplot2 and GitHub. All of these will be used to process, visualise and analyse data in a reproducible way.

This is the third time we will run the Data Exploration for Biologists School, which has been a successful event each year. This year, due to the COVID-19 pandemic, the event will be taught online via our online training environment. I hope that you and your loved ones stay safe, and we look forward to welcoming you to the Winter School in our virtual classroom.

Dr Alexia Cardona
Course Organiser for the Data Exploration for Biologists School

To register: https://www.training.cam.ac.uk/bioinformatics/event/3353700


Programme

Monday 7th December 2020
Getting you set up for Bioinformatics analyses

09:30

Welcome Dr Alexia Cardona

09:50

Keynote lecture: Bioinformatics opportunities and challenges in handling biological data Dr Matthew Eldridge

This session will highlight examples of opportunities and challenges that are associated with bioinformatics, touching on topics that will be covered in this course.

10:50

Coffee break

11:00

Introduction to programming in R Dr Alexia Cardona, Dr Martin van Rongen

Learn how to use R and the RStudio IDE, and cover the basic syntax of the R programming language. You will be able to create scripts, use functions to perform specific operations on your data, and read and write tabular data from and to files.

13:00

Lunch break

14:00

Introduction to programming in R (cont.)

15:30

Coffee break

15:45

Introduction to programming in R (cont.)

17:30

End of day 1

Tuesday 8th December 2020
Data Manipulation and Visualisation in R

09:30

R recap

09:50

Data manipulation in R Dr Alexia Cardona, Dr Martin van Rongen

In this session we will learn how to use the dplyr package. We will learn advanced ways to manipulate and query tabular data in R. This includes filtering datasets and creating grouped summaries from the data.

10:45

Coffee break



11:00

Data visualisation in R Dr Alexia Cardona, Dr Martin van Rongen

In this session we will learn how to use the ggplot2 package to produce a range of plots from tabular data. This includes the use of colour or shape elements to highlight different groups in the dataset, as well as splitting plots into facets to visualise high-dimensional data.

13:00

Lunch break

14:00

Data manipulation and visualisation in R (cont.)

15:30

Coffee break

15:45

Data manipulation and visualisation in R (cont.)

17:30

End of day 2

Wednesday 9th December 2020
Introduction to Statistics for Data Analysis

09:30

Introduction and R recap Dr Matt Castle

In this section we will set out the goals for the day’s sessions. We will ensure that all participants are comfortable using R/RStudio and introduce new R programming concepts that will be used to perform the statistical tests introduced.

10:30

Coffee break

10:45

Simple hypothesis testing Dr Matt Castle

In this section we will explore the fundamental principles of traditional frequentist statistics and ensure that participants are able to analyse simple datasets in a confident manner, selecting appropriate tests and interpreting the results.

13:00

Lunch break

14:00

Statistics for small data Dr Matt Castle

In this section we will build upon the earlier session and look at additional techniques for dealing with traditional small datasets. This will mainly revolve around linear models with small numbers of predictor variables.

15:30

Coffee break

15:45

Statistics for big data Dr Matt Castle

In this section we will consider what is meant by “big” data and explore how analysing big datasets differs from traditional “small” data. We will highlight this using unsupervised learning as an example.

17:30

End of day 3



Thursday 10th December 2020
Introduction to Exploratory Analysis of RNA-seq data in R

09:30

Introduction Dr Hugo Tavares, Dr Martin van Rongen

10:30

Coffee break

10:45

RNA-seq exploratory analysis Dr Hugo Tavares, Dr Martin van Rongen

From normalised expression data we will perform exploratory data analysis, unsupervised clustering, and unsupervised learning to identify expression signatures.

13:00

Lunch break

14:00

RNA-seq exploratory analysis (cont.)

15:30

Coffee break

15:45

RNA-seq exploratory analysis - advanced topics Dr Hugo Tavares, Dr Martin van Rongen

17:30

End of day 4

Friday 11th December 2020
Introduction to Reproducible Research

09:30

Reproducible Research with RMarkdown Dr Alexia Cardona, Dr Martin van Rongen

The session will introduce markdown and literate programming - combining human-readable text and computer code. We will then learn how to write reproducible reports.

10:45

Coffee break

11:00

Reproducible Research with RMarkdown (cont.)

13:00

Lunch break

14:00

Introduction to version control using GitHub Dr Alexia Cardona, Dr Martin van Rongen

Have you ever wondered how to keep automatic control of the different versions of your research documents, computer scripts and data sets? In this session we will learn the basics of git and GitHub. This will enable you to develop version control for your own research projects. We will also learn how to use GitHub to publish the R Markdown report created in the morning session on the web.

15:30

Coffee break

15:45

Introduction to version control using GitHub (cont.)

16:30

Wrap-up

This session will provide an opportunity to reflect on what was covered during this course.

17:30

End of day 5



Course Syllabus

Aims
The aim of this 1-week course is to:
• Encourage the development of the bioinformatics skills needed to process biological data effectively
• Provide practical experience with, and guidance on, how to manage and analyse examples of biological data
• Introduce best practices with regard to working with data reproducibly

Content
This course provides an introduction to the exploration of biological data. It starts with how we can automate reproducible processes to analyse our biological data. The course will begin by discussing the opportunities and challenges associated with bioinformatics analyses. While only conceptually touching on some of these, we will address a subset of them in greater detail in the central part of the course and provide time for participants to practise using some of the associated bioinformatics tools. Focusing on solutions for handling biological data, we will cover programming in R, version control, statistical analyses, and data exploration. The R component of the course will range from basic steps in R to using some of the most popular R packages (dplyr and ggplot2) for data manipulation and visualisation. No prior R experience or previous knowledge of programming/coding is required. At the end of the course we will address issues relating to reusability and reproducibility.

Target audience
This course is aimed at individuals working across the biological and biomedical sciences who have little or no experience in bioinformatics. Applicants are expected to have an interest in learning about bioinformatics and/or to be in the early stages of using bioinformatics in their research, with a need to develop their skills and knowledge further. No previous knowledge of programming is required for this course.

Presentation of the course
The course will consist of a mixture of trainer-led lectures and practical work. The practical work comprises computer exercises that introduce participants to software tools, including R, for analysing biomedical data under the guidance of the trainers and teaching assistants.



As a result of attending the course, participants should be able to:
• Define opportunities and challenges of bioinformatics use in research
• Format, query, visualise, and explore datasets in R
• Evaluate which statistical tests are appropriate for a dataset
• Implement reproducible methods for their research

Live online training
This year the course will be delivered online via our online training environment, to which you will be given access before the course. As such, there will be no need to install software in advance. We will provide software installation instructions and support in case you would like to install the software on your own computer. We aim to replicate the classroom experience as closely as possible, with opportunities for one-to-one discussion with tutors.

Course enrolment https://www.training.cam.ac.uk/bioinformatics/event/3353700

Deadline for booking is 20 November 2020

Contact us
For more information, contact us at grad.bioinfo@lifesci.cam.ac.uk
Subscribe to our mailing list: https://lists.cam.ac.uk/mailman/listinfo/ucam-bioinfo-training



Reading and resources list

Listed below are a number of texts that might be of interest for future reference, but do not need to be bought (or consulted) for the course.

Books to read following the R session:
• Garrett Grolemund and Hadley Wickham. R for Data Science. O'Reilly 2017 (available at: http://r4ds.had.co.nz/)
• Gareth James, Daniela Witten, Trevor Hastie and Robert Tibshirani. An Introduction to Statistical Learning. Springer 2017 (available at: http://www-bcf.usc.edu/~gareth/ISL/)

Reading list for Introduction to version control with GitHub session:
• Ten simple rules for taking advantage of Git and GitHub (http://journals.plos.org/ploscompbiol/article?id=10.1371/journal.pcbi.1004947)

URLs
• https://www.rstudio.com/resources/cheatsheets/



Programme Instructors

Alexia Cardona
Dr Cardona leads training development of the University of Cambridge’s Bioinformatics Training Programme. Her role involves the management of the different aspects of training including design, development, coordination and teaching of undergraduate and postgraduate training in Bioinformatics and Data Science. She is a leader in the ELIXIR international community, where together with the other leaders and partners she drives the establishment of high-quality training in Data Management for the Life Sciences. Dr Cardona is an advocate of participation in Communities of Practice and of women in leadership and computational sectors, where they are currently underrepresented.

Matthew Eldridge
Matt is Head of the Bioinformatics Core facility at the Cancer Research UK Cambridge Institute and leads a team of analysts, statisticians and software developers. The hands-on aspects of his role involve analysis of large and complex cancer datasets and development of interactive data exploration and visualization tools to support the research groups at the institute. He particularly enjoys coding in R and helping to train biologists in cancer genome analysis. En route to CRUK CI, he gained over 10 years of commercial software development experience, developing and consulting on enterprise data management systems for several major pharmaceutical and biotechnology companies. He originally studied chemistry, completing a first degree and then a doctorate in computational chemistry at the University of Oxford.

Matt Castle
Dr Matt Castle is Head of the GSLS Biostatistics Initiative at the University of Cambridge. He teaches on, and coordinates, a wide range of practical statistics training courses for graduate students in the life sciences. His previous work in epidemiological modelling (analysing and providing advice to governments and NGOs on disease control strategies) and his experience as a secondary mathematics teacher mean that he is very aware of the importance of clear communication when dealing with challenging topics like statistics, and his teaching emphasises the practical nature of these skills. Matt is heavily involved with teaching across the university at all levels, from undergraduates through to new lecturers. He provides lectures, supervisions and practical classes for various courses in the Natural Sciences and Mathematics Triposes. He also works with the Cambridge Centre for Teaching and Learning to help early career academics and researchers develop their pedagogical understanding and teaching skills as part of the Teaching Associates’ Programme (TAP) and the Pathways to Higher Education Practice (PHEP).

Hugo Tavares
Hugo studied Biology at the University of Lisbon (Portugal) and did a PhD in evolutionary genetics at the John Innes Centre (Norwich). In his PhD project, he applied a combination of classic genetics and population genomics to study a natural hybrid zone, which helped understand the forces shaping divergence patterns in genomes of closely related subspecies. For the past three years Hugo has been working as a postdoc at the Sainsbury Laboratory Cambridge University, where his research work focuses on the quantitative genetics of complex traits, and how they interact with the environment. Since moving to Cambridge, he has become a regular R trainer for the University's Bioinformatics Training team and in his institute, where he also provides scientific computing and data analysis support.

Martin van Rongen
Martin is a post-doctoral researcher at the Sainsbury Laboratory in Cambridge. He studied Biology at the University of Leiden (The Netherlands) before doing his PhD in Plant Sciences in Cambridge. During his PhD project he quickly realised the importance of accurate data collection and analysis and started teaching himself R to better understand his data. He is now a regular R trainer for the University’s Bioinformatics Training team and provides local support to students at the Sainsbury Laboratory, focussing on exploratory data analysis and research reproducibility.



Contact us:
Cambridge Bioinformatics Training
Craik-Marshall Building
Downing Site
University of Cambridge
Cambridge CB2 3EB
United Kingdom
Email: grad.bioinfo@lifesci.cam.ac.uk
Telephone: +44 (0)1223 333614
Website: https://bioinfotraining.bio.cam.ac.uk/
Mailing list: https://lists.cam.ac.uk/mailman/listinfo/ucam-bioinfo-training
Follow us on Twitter: @BioInfoCambs

