3.1. SUBSCRIPTING
CIRCLE 3. FAILING TO VECTORIZE
Table 3.1: Summary of subscripting with '['.

subscript                  effect
positive numeric vector    selects items with those indices
negative numeric vector    selects all but those indices
character vector           selects items with those names (or dimnames)
logical vector             selects the TRUE (and NA) items
missing                    selects all
clearly expresses the relation between the variables. This same clarity is present
whether there is one item or a million. Transparent code is an important form of
efficiency. Computer time is cheap, human time (and frustration) is expensive.
This fact is enshrined in the maxim of Uwe Ligges.
Uwe’s Maxim
Computers are cheap, and thinking hurts.
A fairly common question from new users is: “How do I assign names to a
group of similar objects?” Yes, you can do that, but you probably don’t want
to—better is to vectorize your thinking. Put all of the similar objects into one
list. Subsequent analysis and manipulation is then going to be much smoother.
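As a minimal sketch of this advice (the data and the names samples, sample_means and sample_sizes are made up for illustration), similar objects can live in one list rather than in separately named variables:

```r
# Hypothetical data: three numeric samples of different lengths.
set.seed(1)
samples <- list(a = rnorm(5), b = rnorm(10), c = rnorm(20))

# One call handles every element; no assign() or get() is needed.
sample_means <- sapply(samples, mean)
sample_sizes <- lengths(samples)
```

Adding a fourth sample then means adding one list element; the analysis lines do not change.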
3.1 Subscripting
Subscripting in R is extremely powerful, and is often a key part of effective
vectorization. Table 3.1 summarizes subscripting.
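A small sketch of each kind of subscript from the table, using a made-up named vector x:

```r
x <- c(a = 10, b = 20, c = 30, d = 40)   # hypothetical example vector

x[c(1, 3)]                    # positive numeric: items 1 and 3
x[-c(1, 3)]                   # negative numeric: all but items 1 and 3
x[c("b", "d")]                # character: items named b and d
x[c(TRUE, NA, FALSE, TRUE)]   # logical: the TRUE and NA positions
x[]                           # missing: everything
```

Note that the NA in the logical subscript produces an NA item in the result rather than being skipped.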
The dimensions of arrays and data frames are subscripted independently.
Arrays (including matrices) can be subscripted with a matrix of positive
numbers. The subscripting matrix has as many columns as there are dimensions
in the array—so two columns for a matrix. The result is a vector (not an array)
containing the selected items.
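A short illustration of subscripting with a matrix, assuming a made-up 3 by 4 matrix m; each row of the subscripting matrix idx is a (row, column) pair:

```r
m <- matrix(1:12, nrow = 3)            # hypothetical 3 x 4 matrix, filled by column
idx <- cbind(c(1, 3, 2), c(1, 2, 4))   # selects m[1,1], m[3,2] and m[2,4]
picked <- m[idx]                       # a plain vector, not a matrix
```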
Lists are subscripted just like (other) vectors. However, there are two forms
of subscripting that are particular to lists: '$' and '[['. These are almost the
same; the difference is that '$' expects a name rather than a character string.
> mylist <- list(aaa=1:5, bbb=letters)
> mylist$aaa
[1] 1 2 3 4 5
> mylist[['aaa']]
[1] 1 2 3 4 5
> subv <- 'aaa'; mylist[[subv]]
[1] 1 2 3 4 5
You shouldn’t be too surprised that I just lied to you. Subscripting with '[['
can be done on atomic vectors as well as lists. It can be the safer option when
a single item is demanded. If you are using '[[' and you want more than one
item, you are going to be disappointed.
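A brief sketch of '[[' on an atomic vector, using a made-up vector v. The try calls are only there so the example runs to completion:

```r
v <- c(a = 1.5, b = 2.5, c = 3.5)   # hypothetical named vector

one <- v[["b"]]    # a single item, with its name dropped
kept <- v["b"]     # a length-one vector that keeps its name

# An out-of-range index is an error with '[[' but a quiet NA with '['.
past_end <- try(v[[4]], silent = TRUE)
quiet_na <- v[4]

# Asking '[[' for more than one item from an atomic vector is also an error.
several <- try(v[[c("a", "b")]], silent = TRUE)
```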
We’ve already seen (in the lsum example) that subscripting can be a symptom
of not vectorizing.
As an example of how subscripting can be a vectorization tool, consider
the following problem: We have a matrix amat and we want to produce a new
matrix with half as many rows where each row of the new matrix is the product
of two consecutive rows of amat.
It is quite simple to create a loop to do this:
bmat <- matrix(NA, nrow(amat)/2, ncol(amat))
for(i in 1:nrow(bmat)) bmat[i,] <- amat[2*i-1,] * amat[2*i,]
Note that we have avoided Circle 2 by preallocating bmat.
Later iterations do not depend on earlier ones, so there is hope that we can
eliminate the loop. Subscripting is the key to the elimination:
> bmat2 <- amat[seq(1, nrow(amat), by=2),] *
+     amat[seq(2, nrow(amat), by=2),]
> all.equal(bmat, bmat2)
[1] TRUE
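To try both versions end to end, here is a self-contained sketch with a made-up amat that has an even number of rows. The drop = FALSE arguments are an addition not in the text; they keep the pieces as matrices even when amat has only two rows:

```r
amat <- matrix(1:24, nrow = 6)   # hypothetical data, 6 x 4

# Loop version, with bmat preallocated.
bmat <- matrix(NA, nrow(amat)/2, ncol(amat))
for(i in 1:nrow(bmat)) bmat[i,] <- amat[2*i-1,] * amat[2*i,]

# Subscripting version: odd rows times even rows, all at once.
odd <- seq(1, nrow(amat), by=2)
even <- seq(2, nrow(amat), by=2)
bmat2 <- amat[odd, , drop=FALSE] * amat[even, , drop=FALSE]

same <- isTRUE(all.equal(bmat, bmat2))
```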
3.2 Vectorized if
Here is some code:
if(x < 1) y <- -1 else y <- 1
This looks perfectly logical. And if x has length one, then it does as expected.
However, if x has length greater than one, then a warning is issued (often ignored
by the user) and only the first element of x is tested, so the result is not what
is most likely intended. (Since R 4.2.0 this is an error rather than a warning.)
Code that fulfills the common expectation is:
y <- ifelse(x < 1, -1, 1)
Another approach (assuming x is never exactly 1) is:
y <- sign(x - 1)
This provides a couple of lessons:
1. The condition in if is one of the few places in R where a vector (of length
greater than 1) is not welcome (the ':' operator is another).
2. ifelse is what you want in such a situation (though, as in this case, there
are often more direct approaches).
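A quick check of the two vectorized forms on a made-up x, including the caveat about x being exactly 1:

```r
x <- c(-2, 0.5, 3)            # hypothetical input, none equal to 1

y1 <- ifelse(x < 1, -1, 1)    # element-by-element choice
y2 <- sign(x - 1)             # the more direct approach

# The two forms differ when x is exactly 1.
at_one_ifelse <- ifelse(1 < 1, -1, 1)   # gives 1
at_one_sign <- sign(1 - 1)              # gives 0
```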
Recall that in Circle 2 we saw:
hit <- NA
for(i in 1:one.zillion)
{
if(runif(1) < 0.3) hit[i] <- TRUE
}
One alternative to make this operation efficient is:
ifelse(runif(one.zillion) < 0.3, TRUE, NA)
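A small runnable comparison, assuming n = 20 stands in for one.zillion and that the random seed is reset before each version so both see the same draws. One difference worth noticing: the loop only grows hit as far as the last success, while the ifelse version always has length n.

```r
n <- 20                        # hypothetical stand-in for one.zillion

set.seed(42)
hit_loop <- NA
for(i in 1:n)
{
    if(runif(1) < 0.3) hit_loop[i] <- TRUE
}

set.seed(42)
hit_vec <- ifelse(runif(n) < 0.3, TRUE, NA)

# The loop result matches the leading part of the vectorized result.
agree <- identical(hit_loop, hit_vec[seq_along(hit_loop)])
```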
If there is a mistake between if and ifelse, it is almost always trying to use
if when ifelse is appropriate. But ingenuity knows no bounds, so it is also
possible to try to use ifelse when if is appropriate. For example:
ifelse(x, character(0), '')
The result of ifelse is ALWAYS the length of its first (formal) argument.
Assuming that x is of length 1, the way to get the intended behavior is:
if(x) character(0) else ''
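A short sketch of the difference, with x set to TRUE for illustration:

```r
x <- TRUE

r1 <- ifelse(x, character(0), '')   # length 1, because x has length 1
r2 <- if(x) character(0) else ''    # length 0, as intended

length(r1)   # 1 (the single item is NA)
length(r2)   # 0
```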
Some more caution is warranted with ifelse: the result gets not only its length
from the first argument, but also its attributes. If you would like the answer
to have attributes of the other two arguments, you need to do more work. In
a later Circle we’ll see a particular instance of this with factors.
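Two small illustrations with made-up data: names are inherited from the test, and the class of the other two arguments (here Date, chosen as an example instead of factors) is lost unless it is restored by hand:

```r
# Names come from the first argument (the test).
x <- c(a = 0.5, b = 2)
y <- ifelse(x < 1, -1, 1)
y_names <- names(y)                 # "a" "b"

# The Date class of the second and third arguments does not survive.
d <- as.Date(c("2021-04-06", "2021-04-07"))
z_raw <- ifelse(c(TRUE, FALSE), d, d + 1)
raw_class <- class(z_raw)           # "numeric", not "Date"

# The "more work": put the class back afterwards.
z <- as.Date(z_raw, origin = "1970-01-01")
```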
3.3 Vectorization impossible
Some things are not possible to vectorize. For instance, if the present iteration
depends on results from the previous iteration, then vectorization is usually not
possible. (But some cases are covered by filter, cumsum, etc.)
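Two illustrative recurrences on a made-up vector x, each written as a loop and then with the function that covers it. The coefficient 0.5 is arbitrary:

```r
x <- c(3, 1, 4, 1, 5)   # hypothetical input

# Running total: each value depends on the previous one.
run <- numeric(length(x))
run[1] <- x[1]
for(i in 2:length(x)) run[i] <- run[i-1] + x[i]
run2 <- cumsum(x)

# First-order recursion: y[i] = x[i] + 0.5 * y[i-1].
ar <- numeric(length(x))
ar[1] <- x[1]
for(i in 2:length(x)) ar[i] <- x[i] + 0.5 * ar[i-1]
ar2 <- as.numeric(stats::filter(x, 0.5, method = "recursive"))
```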
If you need to use a loop, then make it lean:
• Put as much outside of loops as possible. One example: if the same or a
similar sequence is created in each iteration, then create the sequence first
and reuse it. Creating a sequence is quite fast, but appreciable time can
accumulate if it is done thousands or millions of times.
• Make the number of iterations as small as possible. If you have the choice
of iterating over the elements of a factor or iterating over the levels of the
factor, then iterating over the levels is going to be better (almost surely).
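A sketch of the second point with a made-up factor f and values v: the first loop runs once per level (three iterations), the second once per element (six iterations), and both produce the same per-level totals.

```r
f <- factor(c("lo", "hi", "lo", "mid", "hi", "lo"))   # hypothetical factor
v <- c(1, 2, 3, 4, 5, 6)                              # hypothetical values

# Iterating over the levels: one vectorized sum per level.
total <- numeric(nlevels(f))
names(total) <- levels(f)
for(lev in levels(f)) total[lev] <- sum(v[f == lev])

# Iterating over the elements: one addition per element.
total2 <- numeric(nlevels(f))
names(total2) <- levels(f)
for(i in seq_along(v)) {
    lev <- as.character(f[i])
    total2[lev] <- total2[lev] + v[i]
}
```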
The following bit of code gets the sum of each column of a matrix (assuming
the number of columns is positive):
sumxcol <- numeric(ncol(x))
for(i in 1:ncol(x)) sumxcol[i] <- sum(x[,i])
A more common approach to this would be:
sumxcol <- apply(x, 2, sum)
Since this is a quite common operation, there is a special function for doing this
that does not involve a loop in R code:
sumxcol <- colSums(x)
There are also rowSums, colMeans and rowMeans.
Another approach is:
sumxcol <- rep(1, nrow(x)) %*% x
That is, using matrix multiplication. With a little ingenuity a lot of problems
can be cast into a matrix multiplication form. This is generally quite efficient
relative to alternatives.
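The four approaches side by side on a made-up matrix x. One detail to be aware of: the matrix multiplication returns a one-row matrix rather than a plain vector, so drop is used here before comparing.

```r
x <- matrix(c(1, 2, 3, 4, 5, 6), nrow = 2)   # hypothetical 2 x 3 matrix

s_loop <- numeric(ncol(x))
for(i in 1:ncol(x)) s_loop[i] <- sum(x[,i])

s_apply <- apply(x, 2, sum)
s_cols <- colSums(x)
s_mm <- rep(1, nrow(x)) %*% x    # a 1 x 3 matrix

s_mm_vec <- drop(s_mm)           # strip the dimensions to get a vector
```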
Circle 4
Over-Vectorizing
We skirted past Plutus, the fierce wolf with a swollen face, down into the fourth
Circle. Here we found the lustful.
It is a good thing to want to vectorize when there is no effective way to do
so. It is a bad thing to attempt it anyway.
A common reflex is to use a function in the apply family. This is not
vectorization, it is loop-hiding. The apply function has a for loop in its
definition. The lapply function buries the loop, but execution times tend to be
roughly equal to an explicit for loop. (Confusion over this is understandable,
as there is a significant difference in execution speed with at least some
versions of S+.)
Table 4.1 summarizes the uses of the apply family of functions.
Base your decision of using an apply function on Uwe’s Maxim.
The issue is of human time rather than silicon chip time. Human time can be
wasted by taking longer to write the code, and (often much more importantly)
by taking more time to understand subsequently what it does.
A command applying a complicated function is unlikely to pass the test.
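To make the loop-hiding point concrete, here is a sketch with a made-up list called pieces. The explicit loop and the lapply call do the same work and give the same result, so the choice between them is about which one reads more clearly.

```r
pieces <- list(a = 1:3, b = 4:10, c = 11:12)   # hypothetical input

# Explicit loop, with the result preallocated.
res_for <- vector("list", length(pieces))
names(res_for) <- names(pieces)
for(nm in names(pieces)) res_for[[nm]] <- range(pieces[[nm]])

# The same computation with the loop hidden inside lapply.
res_lapply <- lapply(pieces, range)

same_result <- identical(res_for, res_lapply)
```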
Table 4.1: The apply family of functions.

function     input                  output                      comment
apply        matrix or array        vector or array or list
lapply       list or vector         list
sapply       list or vector         vector or matrix or list    simplify
vapply       list or vector         vector or matrix or list    safer simplify
tapply       data, categories       array or list               ragged
mapply       lists and/or vectors   vector or matrix or list    multiple
rapply       list                   vector or list              recursive
eapply       environment            list
dendrapply   dendrogram             dendrogram
rollapply    data                   similar to input            package zoo
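One illustration of why the table calls vapply the "safer simplify", using an empty list as a made-up edge case: sapply has nothing to simplify and returns a list, while vapply returns the type it was told to expect.

```r
empty <- list()                             # hypothetical edge-case input

s <- sapply(empty, length)                  # list(), not an integer vector
v <- vapply(empty, length, integer(1))      # integer(0)

# With non-empty input the two agree.
full <- list(a = 1:3, b = 1:5)
s_full <- sapply(full, length)
v_full <- vapply(full, length, integer(1))
```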