CIRCLE 4. OVER-VECTORIZING
Figure 4.1: The panderers and seducers and the flatterers by Sandro Botticelli.
Use an explicit for loop when each iteration is a non-trivial task. But a simple
loop can be more clearly and compactly expressed using an apply function.
There is at least one exception to this rule. We will see in a later Circle that
if the result will be a list and some of the components can be NULL, then a
for loop is trouble (big trouble) and lapply gives the expected answer.
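A minimal sketch of both points, using made-up data (the vector x and the helper maybe.null are hypothetical and exist only for illustration). The first part shows a simple loop collapsing into one sapply call. The second shows how assigning NULL into a list inside a for loop deletes the component, while lapply keeps it.

```r
# A simple loop and its apply equivalent.
x <- c(2, 5, 9)
sq.loop <- numeric(length(x))
for (i in seq_along(x)) sq.loop[i] <- x[i]^2
sq.apply <- sapply(x, function(v) v^2)

# Hypothetical helper that returns NULL for one of its inputs.
maybe.null <- function(i) if (i == 3) NULL else i * 10

# In a for loop, assigning NULL with [[<- removes the component
# instead of storing a NULL there.
ans.loop <- vector("list", 3)
for (i in 1:3) ans.loop[[i]] <- maybe.null(i)

# lapply keeps the NULL as the third component.
ans.lapply <- lapply(1:3, maybe.null)

length(ans.loop)    # 2
length(ans.lapply)  # 3
```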
The tapply function applies a function to each bit of a partition of the data.
Alternatives to tapply are by for data frames, and aggregate for time series
and data frames. If you have a substantial amount of data and speed is an issue,
then the data.table package may be a good solution.
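As an illustration, here is a small sketch that computes group means three ways. The data frame dat and its column names are invented for the example; only base R functions are used.

```r
# Made-up data: six scores in three groups.
dat <- data.frame(score = c(1, 2, 3, 4, 5, 6),
                  group = c("a", "a", "b", "b", "c", "c"))

# tapply: apply mean to each piece of score defined by group.
means.tapply <- tapply(dat$score, dat$group, mean)

# aggregate: the same computation, returned as a data frame.
means.agg <- aggregate(score ~ group, data = dat, FUN = mean)

# by: the function receives each sub-data-frame in turn.
means.by <- by(dat, dat$group, function(d) mean(d$score))
```

All three hold the same three group means; they differ mainly in the shape of the object returned.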
Another approach to over-vectorizing is to use too much memory in the process.
The outer function is a wonderful mechanism to vectorize some problems.
It is also subject to using a lot of memory in the process.
Suppose that we want to find all of the sets of three positive integers that
sum to 6, where the order matters. (This is related to partitions in number
theory.) We can use outer and which:
the.seq <- 1:4
which(outer(outer(the.seq, the.seq, '+'), the.seq, '+') == 6,
      arr.ind=TRUE)
This command is nicely vectorized, and a reasonable solution to this particular
problem. However, with larger problems this could easily eat all memory on a
machine.
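To make the memory concern concrete, the sketch below repeats the outer solution and sets it beside one possible lighter alternative. The alternative is an assumption of this example, not something the text prescribes: it enumerates only the first two values, since the third is then forced. The names target, pairs, full and lighter are made up.

```r
the.seq <- 1:4
target <- 6

# Full approach: an array with length(the.seq)^3 sums is built, then filtered.
# The indices returned equal the values only because the.seq is 1:4.
full <- which(outer(outer(the.seq, the.seq, '+'), the.seq, '+') == target,
              arr.ind = TRUE)

# Lighter sketch: only length(the.seq)^2 rows are created.
pairs <- expand.grid(first = the.seq, second = the.seq)
pairs$third <- target - pairs$first - pairs$second
lighter <- pairs[pairs$third %in% the.seq, ]

nrow(full)     # 10
nrow(lighter)  # 10
```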
Suppose we have a data frame and we want to change the missing values to
zero. Then we can do that in a perfectly vectorized manner:
x[is.na(x)] <- 0
But if x is large, then this may take a lot of memory. If—as is common—the
number of rows is much larger than the number of columns, then a more memory
efficient method is:
for(i in seq_len(ncol(x))) x[is.na(x[,i]), i] <- 0
Note that “large” is a relative term; it is usefully relative to the amount of
available memory on your machine. Memory efficiency can also be time
efficiency if the inefficient approach causes swapping.
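A small sketch, on an invented data frame, showing that the two approaches produce the same result. With data this small the memory difference is of course invisible; the point is only that the column-wise loop is a drop-in replacement.

```r
# Made-up data frame with missing values.
x <- data.frame(a = c(1, NA, 3), b = c(NA, 5, 6))

# Fully vectorized: is.na(x) builds a logical matrix as large as x.
x.vec <- x
x.vec[is.na(x.vec)] <- 0

# Column by column: only one column's worth of logicals exists at a time.
x.loop <- x
for (i in seq_len(ncol(x.loop))) x.loop[is.na(x.loop[, i]), i] <- 0

all.equal(x.vec, x.loop)
```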
One more comment: if you really want to change NAs to 0, perhaps you
should rethink what you are doing, because the new data are fictional.
It is not unusual for there to be a tradeoff between space and time.
Beware the dangers of premature optimization of your code. Your first duty
is to create clear, correct code. Only consider optimizing your code when:
• Your code is debugged and stable.
• Optimization is likely to make a significant impact. Spending an hour or
  two to save a millisecond a month is not best practice.
Circle 5
Not Writing Functions
We came upon the River Styx, more a swamp really. It took some convincing,
but Phlegyas eventually rowed us across in his boat. Here we found the
treasoners.
5.1 Abstraction
A key reason that R is a good thing is because it is a language. The power of
language is abstraction. The way to make abstractions in R is to write functions.
Suppose we want to repeat the integers 1 through 3 twice. That’s a simple
command:
c(1:3, 1:3)
Now suppose we want these numbers repeated six times, or maybe sixty times.
Writing a function that abstracts this operation begins to make sense. In fact,
that abstraction has already been done for us:
rep(1:3, 6)
The rep function performs our desired task and a number of similar tasks.
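A few of those similar tasks, as a quick sketch (the variable names are made up; the arguments times, each and length.out are part of base R's rep):

```r
six.times <- rep(1:3, 6)                    # 1 2 3 repeated six times
each.twice <- rep(1:3, each = 2)            # 1 1 2 2 3 3
to.length <- rep(1:3, length.out = 7)       # 1 2 3 1 2 3 1
per.element <- rep(1:3, times = c(3, 1, 2)) # 1 1 1 2 3 3
```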
Let’s do a new task. We have two vectors; we want to produce a single vector
consisting of the first vector repeated to the length of the second and then the
second vector repeated to the length of the first. A vector being repeated to a
shorter length means to just use the first part of the vector. This is quite easily
abstracted into a function that uses rep:
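repeat.xy <- function(x, y)
{
    # x recycled or truncated to the length of y, followed by
    # y recycled or truncated to the length of x
    c(rep(x, length.out=length(y)), rep(y, length.out=length(x)))
}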
The repeat.xy function can now be used in the same way as if it came with R.
repeat.xy(1:4, 6:16)
The ease of writing a function like this means that it is quite natural to move
gradually from just using R to programming in R.
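For reference, a self-contained sketch of that call and what it produces (the function is repeated here so that the block runs on its own; out is a made-up name):

```r
repeat.xy <- function(x, y)
{
    c(rep(x, length.out = length(y)), rep(y, length.out = length(x)))
}

# 1:4 is recycled to length 11; 6:16 is cut down to length 4.
out <- repeat.xy(1:4, 6:16)
out          # 1 2 3 4 1 2 3 4 1 2 3 6 7 8 9
length(out)  # 15
```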
In addition to abstraction, functions crystallize knowledge. That π is
approximately
3.1415926535897932384626433832795028841971693993751058209749445923078
is knowledge.
The function:
circle.area <- function(r) pi * r ^ 2
is both knowledge and abstraction—it gives you the (approximate) area for
whatever circles you like.
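A short usage sketch (the radii are arbitrary). Because the arithmetic operators are vectorized, the function handles several circles in one call:

```r
circle.area <- function(r) pi * r ^ 2

one.circle <- circle.area(1)             # equal to pi
many.circles <- circle.area(c(1, 2, 10)) # one area per radius
```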
This is not the place for a full discussion on the structure of the R language,
but a comment on a detail of the two functions that we’ve just created is in
order. The statement in the body of repeat.xy is surrounded by curly braces
while the statement in the body of circle.area is not. The body of a function
needs to be a single expression. Curly braces turn a number of expressions into
a single (combined) expression. When there is only a single command in the
body of a function, then the curly braces are optional. Curly braces are also
useful with loops, switch and if.
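A brief sketch of the brace rules described above. The function names half and range.width are invented for the example.

```r
# Single expression in the body: braces are optional.
half <- function(x) x / 2

# Several expressions: braces combine them into one expression,
# whose value is that of the last expression inside.
range.width <- function(x)
{
    lo <- min(x)
    hi <- max(x)
    hi - lo
}

# Braces grouping statements in a loop and an if.
total <- 0
for (i in 1:5) {
    if (i %% 2 == 1) {
        total <- total + i
    }
}
total  # 9, the sum of 1, 3 and 5
```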
Ideally each function performs a clearly specified task with easily understood
inputs and return value. Very common novice behavior is to write one function
that does everything. Almost always a better approach is to write a number of
smaller functions, and then a function that does everything by using the smaller
functions. Breaking the task into steps often has the benefit of making it more
clear what really should be done. It is also much easier to debug when things
go wrong.
The small functions are much more likely to be of general use.
A nice piece of abstraction in R functions is default values for arguments.
For example, the na.rm argument to sd has a default value of FALSE. If that
is okay in a particular instance, then you don’t have to specify na.rm in your
call. If you want to remove missing values, then you should include na.rm=TRUE
as an argument in your call. If you create your own copy of a function just to
change the default value of an argument, then you’re probably not appreciating
the abstraction that the function gives you.
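A minimal sketch with made-up data, showing the default and the override:

```r
x <- c(1, 2, 3, NA)

sd.default <- sd(x)                # NA: the default na.rm=FALSE keeps the NA
sd.removed <- sd(x, na.rm = TRUE)  # 1: computed from 1, 2, 3 only
```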
Functions return a value. The return value of a function is almost always
the reason for the function’s existence. The value of the last expression
evaluated in a function is returned. Most functions merely rely on this
mechanism, but the return function states explicitly what to return.
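Both mechanisms, sketched with two hypothetical functions (last.value and sign.word are names invented here):

```r
# Relies on the value of the last expression.
last.value <- function(x) {
    y <- x + 1
    y * 2
}

# Uses return for an early exit; otherwise falls through to the last expression.
sign.word <- function(x) {
    if (x < 0) return("negative")
    "non-negative"
}

last.value(3)   # 8
sign.word(-1)   # "negative"
sign.word(2)    # "non-negative"
```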
The other thing that a function can do is to have one or more side effects.
A side effect is some change to the system other than returning a value. The
philosophy of R is to concentrate side effects into a few functions (such as
plot and rm) where it is clear that a side effect is to be expected.
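A small sketch of the distinction, assuming nothing beyond base R (add.one, counter and scratch.obj are made-up names). An ordinary function leaves its inputs alone and only returns a value, whereas rm exists precisely for its side effect.

```r
# No side effect: the caller's object is unchanged.
counter <- 0
add.one <- function(n) n + 1
result <- add.one(counter)
counter   # still 0

# rm is called for its side effect: the object disappears.
scratch.obj <- 1:3
rm(scratch.obj)
exists("scratch.obj", inherits = FALSE)  # FALSE
```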
Footnote 1 (on “when things go wrong”): Notice “when” not “if”.
Table 5.1: Simple objects.
object     type       examples
logical    atomic     TRUE FALSE NA
numeric    atomic     0 2.2 pi NA Inf -Inf NaN
complex    atomic     3.2+4.5i NA Inf NaN
character  atomic     'hello world' '' NA
list       recursive  list(1:3, b='hello', C=list(3, c(TRUE, NA)))
NULL                  NULL
function              function(x, y) x + 2 * y
formula               y ~ x
Table 5.2: Some not so simple objects.
object      primary  attributes       comment
data frame  list     class row.names  a generalized matrix
matrix      vector   dim dimnames     special case of array
array       vector   dim dimnames     usually atomic, not always
factor      integer  levels class     tricky little devils
The things that R functions talk about are objects. R is rich in objects.
Table 5.1 shows some important types of objects.
You’ll notice that each of the atomic types has a possible value NA, as in
“Not Available”, also called a “missing value”. When some people first get to R,
they spend a lot of time trying to get rid of NAs. People probably did the same
sort of thing when zero was first invented. NA is a wonderful thing to have
available to you. It is seldom pleasant when your data have missing values, but
life is much better with NA than without.
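A short sketch of how NA behaves, on invented data. Missingness propagates through computations unless you ask for it to be dropped:

```r
x <- c(4, NA, 6)

which.missing <- is.na(x)          # FALSE TRUE FALSE
mean.all <- mean(x)                # NA: the answer is unknown
mean.known <- mean(x, na.rm = TRUE)  # 5
compare <- NA > 1                  # NA: comparison with unknown is unknown
```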
R was designed with the idea that nothing is important. Let’s try that again:
“nothing” is important. Vectors can have length zero. This is another stupid
thing that turns out to be incredibly useful (that is, not so stupid after all).
We’re not so used to dealing with things that aren’t there, so sometimes there
are problems; we’ll see examples in Circle 8, for instance.
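One illustration of both the usefulness and the kind of problem that can arise, as a sketch (the names empty, n.iter and the rest are made up). Zero-length vectors behave sensibly in sums and loops, but 1:length(x) does not do what one might hope when x is empty.

```r
empty <- numeric(0)

n.empty <- length(empty)       # 0
sum.empty <- sum(empty)        # 0

bad.index <- 1:length(empty)   # 1 0, because 1:0 counts down
good.index <- seq_along(empty) # integer(0)

# A loop over seq_along of an empty vector runs zero times.
n.iter <- 0
for (i in seq_along(empty)) n.iter <- n.iter + 1
n.iter  # 0
```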
A lot of the wealth of objects has to do with attributes. Many attributes
change how the object is thought about (both by R and by the user). An
attribute that is common to most objects is names. The attribute that drives
object orientation is class. Table 5.2 lists a few of the most important types
of objects that depend on attributes. Formulas, which were listed in the simple
table, have class "formula" and so might more properly be in the not-so-simple
list.
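A quick sketch of attributes at work, on small made-up objects:

```r
# names is an attribute of an otherwise plain vector.
v <- c(a = 1, b = 2)
v.names <- names(v)            # "a" "b"

# A matrix is a vector with a dim attribute.
m <- matrix(1:6, nrow = 2)
m.dim <- attributes(m)$dim     # 2 3

# A factor is integer codes plus levels and class attributes.
f <- factor(c("lo", "hi", "lo"))
f.class <- class(f)            # "factor"
f.levels <- levels(f)          # "hi" "lo"
f.codes <- as.integer(f)       # 2 1 2

# A formula carries class "formula".
fm.class <- class(y ~ x)       # "formula"
```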
A common novice problem is to think that a data frame is a matrix. They
look the same. They are not the same.
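A sketch of a few places where the difference shows, using a small invented data frame:

```r
df <- data.frame(a = 1:3, b = 4:6)
m <- as.matrix(df)

df.is.list <- is.list(df)      # TRUE: a data frame is a list of columns
df.is.matrix <- is.matrix(df)  # FALSE
df.length <- length(df)        # 2, the number of columns
m.length <- length(m)          # 6, the number of elements
```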
The word “vector” has a number of meanings in R:
1. an atomic object (as opposed to a list). This is perhaps the most common