

CHAPTER 14

Profiling and Optimizing


In this last chapter, we briefly consider what to do when you find that your code is running too slow, and, in particular, how to figure out why it is running too slow.

Before you start worrying about your code’s performance, though, it is important to consider whether it is worth speeding up. Improving performance takes time, and it is only worth it if the improved performance saves you more time than the extra programming costs. For an analysis you can run in a day, there is no point in spending a day making it faster, even much faster, because you still end up spending the same time, or more, to finally get the analysis done.

Any code you just need to run a few times during an analysis is usually not worth optimizing. We rarely need to run an analysis just once—optimistically we might hope to, but in reality, we usually have to run it again and again when data or ideas change—but we don’t expect to run it hundreds or thousands of times. So even if it will take a few hours to rerun an analysis, your time is probably better spent working on something else while it runs. It is rarely worth it to spend a lot of time making it faster. The CPU time is cheap compared to your own.

If you are developing a package, though, you often do have to consider performance to some extent. A package, if it is worth developing, will have many users, and the total time spent running your code makes it worthwhile, up to a point, to make that code fast.

Profiling

Before you can make your code faster, you need to figure out why it is slow in the first place. You might have a few ideas about where the code is slow, but it is actually surprisingly hard to guess at this. Quite often, I have found, most of the time is actually spent nowhere near where I thought it would be. On two separate occasions, I have worked really hard on speeding up an algorithm only to find out later that the reason my program was slow was the code used for reading the program’s input. The parser was slow; the algorithm was lightning fast in comparison. That was in C, where the abstractions are pretty low-level and where it is usually easy to see from the code roughly how much time it will take to run. In R, where the abstractions are very high-level, it can be very hard to guess how much time a single line of code will take to run.

The point is, if you find that your code is slow, you shouldn’t guess at where it is slow. You should measure the running time and know for sure. You need to profile your code to learn which parts of it are taking up most of the running time. Otherwise, you might end up optimizing code that accounts for only a few percent of the total running time while leaving the real time-wasters alone.

In most code, there are only a few real bottlenecks. If you can identify these and improve their performance, your work is done; the rest will run fast enough. Figuring out where those bottlenecks are requires profiling.


We are going to use the profvis package for profiling. The most recent versions of RStudio have built-in support for this; if your version has it, you should see a Profile item in the main menu. Here, though, we will just use the package directly in our R code.
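As a minimal sketch of how this looks in code, you can wrap an expression in a call to profvis() and get back an interactive profile of where the time went. The workload below is just an assumed stand-in, not code from this chapter: it grows a data frame row by row, a classic R performance trap.

    library(profvis)  # assumed installed from CRAN

    # Profile a deliberately slow workload: growing a data frame row by
    # row forces R to copy the whole data frame on every iteration.
    p <- profvis({
      df <- data.frame()
      for (i in 1:1000) {
        df <- rbind(df, data.frame(x = i, y = i^2))
      }
    })

    # Printing p (or running profvis() at the RStudio console) opens an
    # interactive flame graph showing which lines took the most time.

A profile of code like this typically points straight at the rbind() line, which is exactly the kind of answer we want profiling to give us.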

A Graph-Flow Algorithm

As an example to profile, imagine a small graph algorithm: an algorithm for smoothing out weights put on the nodes of a graph. It is part of a method for propagating weights of evidence between nodes in a graph and has been used to boost searches for disease-gene associations using gene-gene interaction networks. The idea is that if a gene is a neighbor of another gene in such an interaction network, it is likely to have a disease association similar to that of the other gene. So genes with a known association are given an initial weight, and other genes get a higher weight if they are connected to such genes than if they are not.

The details of what the algorithm is used for are not so important, though. All it does is smooth out weights between nodes. Initially, each node n is assigned a weight w(n). In one iteration of smoothing, this weight is updated as

    w′(n) = α · w(n) + (1 − α)/|N(n)| · ∑_{v ∈ N(n)} w(v)

where α is a number between zero and one, N(n) denotes the set of neighbors of node n, and |N(n)| is the number of neighbors. If this is iterated enough times, the weights become equal for all connected nodes in the graph; if it is stopped earlier, the result is just a slight smoothing, depending on the value of α.
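To make the update rule concrete, here is one smoothing step for a single node, written out directly in R. The values of α, w(n), and the neighbor weights are made-up example numbers, not anything from the text.

    alpha <- 0.5              # smoothing parameter, between zero and one
    w_n <- 1.0                # w(n), the node's current weight
    w_neighbors <- c(0, 0)    # weights w(v) for the neighbors v in N(n)

    # w'(n) = alpha * w(n) + (1 - alpha) / |N(n)| * sum of neighbor weights
    w_n_new <- alpha * w_n + (1 - alpha) * sum(w_neighbors) / length(w_neighbors)
    w_n_new  # 0.5 * 1 + 0.5 * 0 = 0.5

With α = 0.5, the node keeps half of its own weight and takes the other half from the average weight of its neighbors.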

To implement this, we need both a representation of graphs and the smoothing algorithm. We start with representing the graph. There are many ways to do this, but a simple format is the adjacency matrix: a matrix with entry M[i, j] = 0 if nodes i and j are not directly connected and M[i, j] = 1 if they are. Since we want to work on an undirected graph in this algorithm, we will have M[i, j] = M[j, i].

We can implement this representation using a constructor function that looks like this:

graph <- function(n, edges) {
  m <- matrix(0, nrow = n, ncol = n)
  no_edges <- length(edges)
  if (no_edges >= 1) {
    for (i in seq(1, no_edges, by = 2)) {
      m[edges[i], edges[i+1]] <- m[edges[i+1], edges[i]] <- 1
    }
  }
  structure(m, class = "graph")
}

Here I require that the number of nodes be given as an argument n and that the edges be specified as a vector in which each consecutive pair of elements corresponds to an edge. This is not an optimal way of representing edges if graphs are to be written by hand, but since this algorithm is supposed to be used on very large graphs, I assume we can write code elsewhere for reading in a graph representation and creating such an edge vector.

There is not much to the function. It just creates the matrix and then iterates through the edges to set its entries. There is a special case to handle when the edges vector is empty: in that case, the seq() call would not produce an empty sequence, so we avoid calling it. We might also want to check that the length of the edge vector is a multiple of two, but I haven’t bothered; I am going to assume that the code that generates the vector will take care of that.

Even though the graph representation is just a matrix, I give it a class in case I want to write generic functions for it later.
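To see the constructor in action, here is a small, hypothetical usage example; the constructor is reproduced, with the braces made explicit, so that the snippet is self-contained.

    graph <- function(n, edges) {
      m <- matrix(0, nrow = n, ncol = n)
      no_edges <- length(edges)
      if (no_edges >= 1) {
        for (i in seq(1, no_edges, by = 2)) {
          m[edges[i], edges[i+1]] <- m[edges[i+1], edges[i]] <- 1
        }
      }
      structure(m, class = "graph")
    }

    # A path graph on four nodes: edges 1-2, 2-3, and 3-4.
    g <- graph(4, c(1, 2, 2, 3, 3, 4))
    class(g)    # "graph"
    unclass(g)  # the underlying 4x4 symmetric 0/1 matrix

    # The empty-edges special case also works: no edges, all zeros.
    all(graph(3, c()) == 0)  # TRUE

Notice that the matrix is set symmetrically, so we only need to list each undirected edge once in the vector.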

