Chapter 8 ■ More R Programming

In the case of the merge function, we already know how long the result should be, so you can preallocate a result vector and copy single elements into it. You can create a vector of length n like this:

n <- 5
v <- vector(length = n)

Should you ever need it, you can make a list of length n like this:

vector("list", length = n)
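As a rough illustration of preallocation, here is one way a merge of two sorted vectors might fill a result vector of known length. The name merge_sorted is hypothetical and this is only a sketch under the assumption that both inputs are sorted numeric vectors; it is not necessarily the merge function used earlier in the book.

```r
# Sketch: merge two sorted numeric vectors into a preallocated result.
# `merge_sorted` is a hypothetical name chosen for this example.
merge_sorted <- function(x, y) {
  n <- length(x) + length(y)
  result <- vector("numeric", length = n)  # preallocate; no growing in the loop
  i <- 1
  j <- 1
  for (k in seq_len(n)) {
    # Take from x when y is used up, or when the head of x is not larger.
    if (j > length(y) || (i <= length(x) && x[i] <= y[j])) {
      result[k] <- x[i]
      i <- i + 1
    } else {
      result[k] <- y[j]
      j <- j + 1
    }
  }
  result
}

merge_sorted(c(1, 4, 6), c(2, 3, 7))
## [1] 1 2 3 4 6 7
```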

Binary Search

Binary search is a classical algorithm for finding out whether an element is contained in a sorted sequence. It is a simple recursive function. The base case handles a sequence of one element: there you can directly compare the element you are searching for with the element in the sequence to determine if they are the same. If you have more than one element in the sequence, pick the middle one. If it is the element you are searching for, you are done and can return that the element is contained in the sequence. If it is smaller than the element you are searching for, then you know that if the element is in the sequence, it has to be in the last half, and you can search recursively there. If it is larger than the element you are searching for, then you know that if the element is in the sequence, it must be in the first half, and you search recursively there.

If you implement this exactly as described, you have to call recursively with a subsequence. This involves copying that subsequence for the function call, which makes the implementation much less efficient than it needs to be. Try to implement binary search without this.
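One possible approach to the exercise, offered as a sketch rather than the intended solution, is to pass the bounds of the current search range instead of a subsequence. The function name binary_search and the parameters lo and hi are assumptions made for this example; the vector x is assumed to be sorted in increasing order.

```r
# Sketch: recursive binary search that narrows an index range [lo, hi]
# instead of extracting subsequences. Names are hypothetical.
binary_search <- function(x, target, lo = 1, hi = length(x)) {
  if (lo > hi) return(FALSE)      # empty range: the element is not there
  mid <- (lo + hi) %/% 2          # index of the middle element
  if (x[mid] == target) {
    TRUE
  } else if (x[mid] < target) {
    binary_search(x, target, mid + 1, hi)  # search the last half
  } else {
    binary_search(x, target, lo, mid - 1)  # search the first half
  }
}

binary_search(c(1, 3, 5, 7, 9), 7)
## [1] TRUE
binary_search(c(1, 3, 5, 7, 9), 4)
## [1] FALSE
```

Because the function never modifies x, R does not copy the vector when it is passed along in the recursive calls.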

More Sorting

In the merge sort we implemented, we solve the sorting problem by splitting a sequence in two, sorting each subsequence, and then merging them. If implemented correctly, this algorithm will run in time O(n log n), which is optimal for sorting algorithms if we assume that the only operation we can do on the elements we sort is comparing them.

If the elements we have are all integers between 1 and n and we have m of them, we can sort them in time O(n + m) using bucket sort instead. This algorithm first creates a vector of counts for each number between 1 and n. This takes time O(n). It then runs through the m elements in our sequence, updating the counter for number i each time it sees i. This runs in time O(m). Finally, it runs through the numbers from 1 up to n and outputs each number the number of times indicated by its counter, in time O(n + m).

Implement bucket sort.
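A minimal sketch of the three steps described above might look as follows. The name bucket_sort is hypothetical, and the function assumes that every element of x is a whole number between 1 and n. The final step uses rep instead of an explicit output loop, which is a choice made here for brevity.

```r
# Sketch: bucket sort for whole numbers in 1..n. Names are hypothetical.
bucket_sort <- function(x, n) {
  counts <- vector("integer", length = n)  # one counter per possible value
  for (i in x) {
    counts[i] <- counts[i] + 1L            # count each occurrence of i
  }
  rep(seq_len(n), times = counts)          # emit value i exactly counts[i] times
}

bucket_sort(c(3, 1, 2, 3, 1), n = 3)
## [1] 1 1 2 3 3
```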

Another algorithm that works by recursion, and that runs in expected time O(n log n), is quick sort. Its worst-case complexity is actually O(n^2), but on average it runs in time O(n log n) and with a smaller overhead than merge sort (if you implement it correctly).

It works as follows: the base case (a single element) is the same as in merge sort. When you have more than one element, you pick one of the elements in the sequence at random; call it the pivot. Now split the sequence into those elements that are smaller than the pivot, those that are equal to the pivot, and those that are larger. Sort the sequences of smaller and larger elements recursively. Then output all the sorted smaller elements, then the elements equal to the pivot, and then the sorted larger elements.

Implement quick sort.
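The description above translates fairly directly into a short recursive function. The following is only a sketch with a hypothetical name, quick_sort; it favors clarity over speed, since it copies subsequences in each call instead of rearranging elements in place.

```r
# Sketch: quick sort with a random pivot and a three-way split.
# `quick_sort` is a hypothetical name chosen for this example.
quick_sort <- function(x) {
  if (length(x) <= 1) return(x)          # base case: nothing to sort
  pivot <- x[sample.int(length(x), 1)]   # pick a random element as the pivot
  c(quick_sort(x[x < pivot]),            # sorted smaller elements
    x[x == pivot],                       # elements equal to the pivot
    quick_sort(x[x > pivot]))            # sorted larger elements
}

quick_sort(c(5, 3, 8, 1, 9, 2, 3))
## [1] 1 2 3 3 5 8 9
```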


Selecting the k Smallest Element

If you have n elements and you want the k’th smallest, an easy solution is to sort the elements and then pick number k. This works well and in most cases is easily fast enough, but it is possible to do it faster. We don’t need to sort the elements completely; we just need to have the k’th smallest element moved to position k in the sequence.

The quick sort algorithm from the previous exercise can be modified to solve this problem. Whenever we split a sequence into the elements smaller than, equal to, and larger than the pivot, quick sort sorts the smaller and larger elements recursively. If we are only interested in finding the element that would eventually end up at position k in the sorted sequence, we don’t need to sort the part that doesn’t overlap this index. If we have m < k elements smaller than the pivot, we can just put them at the front of the sequence without sorting them. We need them there to make sure that the k’th smallest element ends up at the right index, but we don’t need them sorted. Similarly, if k ≤ m, we don’t need to sort the larger elements. If we sorted them, they would all end up at indices larger than k, and we don’t really care about those. Of course, if there are m < k elements smaller than the pivot and l equal to the pivot, with m + l ≥ k, then the k’th smallest element is equal to the pivot, and we can return that.

Implement this algorithm.
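One way to sketch the idea is shown here. The name select_kth is hypothetical, and the function simplifies the description above by returning the k’th smallest value directly instead of moving it to position k within the sequence; m and l have the same meaning as in the text.

```r
# Sketch: find the k'th smallest element by recursing into only one part.
# `select_kth` is a hypothetical name chosen for this example.
select_kth <- function(x, k) {
  stopifnot(k >= 1, k <= length(x))
  pivot <- x[sample.int(length(x), 1)]   # random pivot, as in quick sort
  smaller <- x[x < pivot]
  m <- length(smaller)                   # number of elements below the pivot
  l <- sum(x == pivot)                   # number of elements equal to the pivot
  if (k <= m) {
    select_kth(smaller, k)               # answer is among the smaller elements
  } else if (k <= m + l) {
    pivot                                # answer is the pivot itself
  } else {
    select_kth(x[x > pivot], k - m - l)  # answer is among the larger elements
  }
}

select_kth(c(7, 2, 9, 4, 4, 1), 3)
## [1] 4
```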


CHAPTER 9 Advanced R Programming

This chapter goes into more detail on some aspects of R. It is called “Advanced R Programming” only because it adds elements on top of the quick introduction you got in the last chapter. Except, perhaps, for the functional programming toward the end, we will not cover anything that is conceptually more complex than what we did in the previous chapter. We will just dig into a few more technical details.

I stole the title from Hadley Wickham’s excellent book of the same name (see http://adv-r.had.co.nz), and most of what I cover here, he covers in his book as well. He does cover a lot more, though, so this is a book you should get if you really want to drill into the advanced aspects of R programming.

Working with Vectors and Vectorizing Functions

We start out by returning to expressions. In the previous chapter, you saw expressions on single (scalar) values, but you also saw that R doesn’t really have scalar values; all the primitive data you have is actually vectors of data. What this means is that the expressions you use in R are actually operating on vectors, not on single values.

When you write this:

(x <- 2 / 3)
## [1] 0.6666667
(y <- x ** 2)
## [1] 0.4444444

The expressions you write are, of course, working on single values (the vectors x and y have length 1), but this is really just a special case of working on vectors:

(x <- 1:4 / 3)
## [1] 0.3333333 0.6666667 1.0000000 1.3333333
(y <- x ** 2)
## [1] 0.1111111 0.4444444 1.0000000 1.7777778

R works on vectors using two rules: operations are done element-wise, and vectors are repeated as needed.

When you write an expression such as x + y, you are really saying that you want to create a new vector that consists of the element-wise sum of the elements in vectors x and y. So for x and y like this:

x <- 1:5
y <- 6:10


Writing this:

(z <- x + y)
## [1]  7  9 11 13 15

Amounts to writing this:

z <- vector(length = length(x))
for (i in seq_along(x)) {
    z[i] <- x[i] + y[i]
}
z
## [1]  7  9 11 13 15

This is the case for all arithmetic expressions or for logical expressions involving | or & (but not || or &&; these do not operate on vectors element-wise). It is also the case for most functions you can call, such as sqrt or sin:

sqrt((1:5)**2)
## [1] 1 2 3 4 5
sin(sqrt((1:5)**2))
## [1]  0.8414710  0.9092974  0.1411200 -0.7568025
## [5] -0.9589243
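To make the difference between the element-wise operators & and | and the scalar operators && and || concrete, here is a small sketch. The vectors a and b are made up for this example. The scalar operators are applied to length-one operands only, since R 4.3.0 and later signal an error when they are given longer ones.

```r
a <- c(TRUE, FALSE, TRUE)
b <- c(TRUE, TRUE, FALSE)

a & b          # element-wise "and": one result per position
## [1]  TRUE FALSE FALSE
a | b          # element-wise "or"
## [1] TRUE TRUE TRUE

a[1] && b[1]   # scalar "and": a single TRUE or FALSE
## [1] TRUE
```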

When you have an expression that involves vectors of different lengths, you cannot directly evaluate it element-wise. When this is the case, R will try to repeat the shorter vector(s) to create vectors of the same length. For this to work, the length of the shorter vector(s) should divide the length of the longest vector, i.e., you should be able to repeat the shorter vector(s) an integer number of times to get the length of the longest vector. If this is possible, R repeats vectors as necessary to make all vectors the same length as the longest and then does operations element-wise:

x <- 1:10
y <- 1:2
x + y
## [1]  2  4  4  6  6  8  8 10 10 12

If the shorter vector(s) cannot be repeated an integer number of times to match up, R will still repeat as many times as needed to match the longest vector, but you will get a warning. Most of the time something like this happens, it is caused by buggy code.

z <- 1:3
x + z
## Warning in x + z: longer object length is not a
## multiple of shorter object length
## [1]  2  4  6  5  7  9  8 10 12 11
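The repetition rule can be made explicit with rep_len, which repeats a vector until it reaches a given length and cuts it off there. This is only an illustration of the rule, not a description of how R implements it internally; the names y_recycled and z_recycled are made up for this example.

```r
x <- 1:10
y <- 1:2
z <- 1:3

# Repeating y five times gives a vector as long as x.
y_recycled <- rep_len(y, length(x))
identical(x + y, x + y_recycled)
## [1] TRUE

# z does not fit an integer number of times, so the last repeat is cut short.
(z_recycled <- rep_len(z, length(x)))
## [1] 1 2 3 1 2 3 1 2 3 1
x + z_recycled   # same values as x + z, but without the warning
## [1]  2  4  6  5  7  9  8 10 12 11
```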


In the expression you saw a while back, different vectors are repeated:

(x <- 1:4 / 3)
## [1] 0.3333333 0.6666667 1.0000000 1.3333333
(y <- x ** 2)
## [1] 0.1111111 0.4444444 1.0000000 1.7777778

When we divide 1:4 by 3, we need to repeat the (length one) vector 3 four times to be able to divide the 1:4 vector by the 3 vector element-wise. When we compute x ** 2, we must repeat 2 four times as well.

Whenever you consider writing a loop over vectors to do some calculation for each element, you should consider using such vectorized expressions instead. It is typically much less error-prone, and since the implicit looping is handled by the R runtime system, it is almost guaranteed to be faster than an explicit loop.
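If you want to check the speed claim on your own machine, a comparison along the following lines is one option. The function names loop_version and vec_version are hypothetical, and the timings are deliberately not shown since they depend on the machine and the R version.

```r
# Sketch: the same calculation as an explicit loop and as a vector expression.
loop_version <- function(x) {
  y <- numeric(length(x))        # preallocate the result
  for (i in seq_along(x)) {
    y[i] <- x[i] ** 2 + 1
  }
  y
}
vec_version <- function(x) x ** 2 + 1

x <- runif(1e5)
all.equal(loop_version(x), vec_version(x))  # both give the same values
## [1] TRUE

system.time(loop_version(x))  # compare the elapsed times yourself
system.time(vec_version(x))
```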

ifelse

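Control structures are not vectorized. For example, if statements are not. If you want to compute a vector y from a vector x such that y[i] == 5 if x[i] is even and y[i] == 15 if x[i] is odd, you cannot write this as a vector expression. In the version of R used here, if warns and uses only the first element of the condition; since R 4.2.0, it stops with an error instead: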

x <- 1:10
if (x %% 2 == 0) 5 else 15
## Warning in if (x%%2 == 0) 5 else 15: the condition
## has length > 1 and only the first element will be
## used
## [1] 15

Instead, you can use the ifelse function, which works like a vectorized selection: where the condition in its first argument is true, it returns the value from its second argument; otherwise, it returns the value from its third argument. It does this element-wise as a vector operation:

x <- 1:10
ifelse(x %% 2 == 0, 5, 15)
## [1] 15  5 15  5 15  5 15  5 15  5
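The second and third arguments to ifelse do not have to be single values; they can be vectors as well, in which case the value is picked from the matching position. The calculation in this small sketch is made up for the example: even numbers are halved and odd numbers are mapped to 3 * x + 1.

```r
x <- 1:6
ifelse(x %% 2 == 0, x / 2, 3 * x + 1)
## [1]  4  1 10  2 16  3
```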

Vectorizing Functions

When you write your own functions, you can write them so that they can also be used to work on vectors, that is, you can write them so that they can take vectors as input and return vectors as output. If you write them this way, then they can be used in vectorized expressions the same way as built-in functions such as sqrt and sin.

The easiest way to make your function work on vectors is to write the body of the function using expressions that work on vectors.

f <- function(x, y) sqrt(x ** y)
f(1:6, 2)
## [1] 1 2 3 4 5 6
f(1:6, 1:2)
## [1] 1.000000 2.000000 1.732051 4.000000 2.236068
## [6] 6.000000
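Conversely, a function whose body relies on an if statement only works on single values. As a sketch of how such a function might be rewritten, the hypothetical functions below replace negative numbers with zero; the second version uses ifelse so that the body is a vector expression.

```r
# Scalar only: `if` needs a single TRUE/FALSE, so this is meant
# for length-one input.
clamp_scalar <- function(x) if (x < 0) 0 else x

# Vectorized: ifelse makes the selection element-wise.
clamp <- function(x) ifelse(x < 0, 0, x)

clamp_scalar(-2)
## [1] 0
clamp(c(-2, 0.5, 3))
## [1] 0.0 0.5 3.0
```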


This article is from:

Beginning of Data Science in R
