5.1. ABSTRACTION

CIRCLE 5. NOT WRITING FUNCTIONS

usage.

2. an object with no attributes (except possibly names). This is the
definition implied by is.vector and as.vector.

3. an object that can have an arbitrary length (includes lists).

Clearly definitions 1 and 3 are contradictory, but which meaning is implied
should be clear from the context. When the discussion is of vectors as opposed
to matrices, it is definition 2 that is implied.

The word “list” has a technical meaning in R: this is an object of arbitrary
length that can have components of different types, including lists. Sometimes
the word is used in a non-technical sense, as in “search list” or “argument list”.

Not all functions are created equal. They can be conveniently put into three
types.

There are anonymous functions as in:

apply(x, 2, function(z) mean(z[z > 0]))

The function given as the third argument to apply is so transient that we don’t
even give it a name.
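
As a minimal sketch of the call above, assume a small made-up matrix x (the values and the name mean_positive are illustrative, not from the text). The anonymous function computes the mean of the positive entries in each column, and a named one-off function does the same job:

```r
# Illustrative data: each column mixes positive and negative values.
# matrix() fills column by column, so the columns are (1, -2, 3) and (4, -5, 6).
x <- matrix(c(1, -2, 3,
              4, -5, 6), nrow = 3)

# Anonymous function: mean of the positive entries of each column
pos_means <- apply(x, 2, function(z) mean(z[z > 0]))

# The same computation with a named one-off function (hypothetical name)
mean_positive <- function(z) mean(z[z > 0])
pos_means_named <- apply(x, 2, mean_positive)
```

Both calls give the same vector; the named version only becomes worthwhile once the function is needed in more than one place.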

There are functions that are useful only for one particular project. These
are your one-off functions.

Finally there are functions that are persistently valuable. Some of these
could well be one-off functions that you have rewritten to be more abstract.
You will most likely want a file or package containing your persistently useful
functions.

In the example of an anonymous function we saw that a function can be an
argument to another function. In R, functions are objects just as vectors or
matrices are objects. You are allowed to think of functions as data.

A whole new level of abstraction is a function that returns a function. The
empirical cumulative distribution function is an example:

> mycumfun <- ecdf(rnorm(10))
> mycumfun(0)
[1] 0.4

Once you write a function that returns a function, you will be forever immune
to this Circle.
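
A function that returns a function might look like the following sketch. The name make_power and its argument are hypothetical, chosen only for illustration:

```r
# Hypothetical function factory: returns a function that raises its
# argument to the power p.
make_power <- function(p) {
  force(p)              # evaluate p now, so each returned function keeps its own value
  function(x) x^p
}

square <- make_power(2)
cube <- make_power(3)

sq4 <- square(4)   # 4^2
cu2 <- cube(2)     # 2^3
```

Each returned function remembers the p it was created with, much as the function returned by ecdf remembers the data it was given.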

In Circle 2 we briefly met do.call. Some people are quite confused by do.call.
That is both unnecessary and unfortunate: it is actually quite simple and is
very powerful. Normally a function is called by following the name of the
function with an argument list:

sample(x=10, size=5)

The do.call function allows you to provide the arguments as an actual list:

do.call("sample", list(x=10, size=5))

Simple.
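
The payoff comes when the argument list is built while the program runs. The following sketch uses made-up data; the first argument of do.call can be either the function itself or its name as a string:

```r
# A list of pieces whose length might not be known in advance
pieces <- list(1:3, 4:6)

# Equivalent to rbind(1:3, 4:6), but works for a list of any length
stacked <- do.call(rbind, pieces)

# Named list components are matched to named arguments of the function
joined <- do.call("paste", list("a", "b", sep = "-"))
```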

At times it is useful to have an image of what happens when you call a
function. An environment is created by the function call, and an environment
is created for each function that is called by that function. So there is a stack
of environments that grows and shrinks as the computation proceeds.

Let’s define some functions:

ftop <- function(x)
{
    # time 1
    x1 <- f1(x)
    # time 5
    ans.top <- f2(x1)
    # time 9
    ans.top
}

f1 <- function(x)
{
    # time 2
    ans1 <- f1.1(x)
    # time 4
    ans1
}

f2 <- function(x)
{
    # time 6
    ans2 <- f2.1(x)
    # time 8
    ans2
}

And now let’s do a call:

# time 0
ftop(myx)
# time 10

Figure 5.1 shows how the stack of environments for this call changes through
time. Note that there is an x in the environments for ftop, f1 and f2. The x in
ftop is what we call myx (or possibly a copy of it), as is the x in f1. But the
x in f2 is something different: it is the value that ftop calls x1.

When we discuss debugging, we’ll be looking at this stack at a specific point
in time. For instance, if an error occurred in f2.1, then we would be looking at
the state of the stack somewhere near time 7.
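
The stack near time 7 can be inspected directly with sys.calls. The sketch below repeats the three functions defined above and adds stand-ins for f1.1 and f2.1, which the text never defines; their bodies, the input value 3 and the variable seen_at_time_7 are assumptions made only for illustration:

```r
seen_at_time_7 <- NULL

f1.1 <- function(x) x + 1      # hypothetical body: not defined in the text

f2.1 <- function(x) {          # hypothetical body: not defined in the text
  # time 7: record the calls that are on the stack right now
  calls <- sys.calls()
  stack <- vapply(calls, function(cl) deparse(cl[[1]])[1], character(1))
  # Keep the innermost three names, so the result does not depend on how
  # this code was launched (console, source(), Rscript, ...)
  seen_at_time_7 <<- tail(stack, 3)
  x * 2
}

f1 <- function(x) { ans1 <- f1.1(x); ans1 }
f2 <- function(x) { ans2 <- f2.1(x); ans2 }
ftop <- function(x) { x1 <- f1(x); ans.top <- f2(x1); ans.top }

result <- ftop(3)
```

At time 7 the calls to f1 and f1.1 have already returned, so only ftop, f2 and f2.1 remain on the stack.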

Figure 5.1: Stack of environments through time. [Plot of Environment against
Time; the labeled environments include ftop, f1.1 and f2.1.]

R is a language rich in objects. That is a part of its strength. Some of
those objects are elements of the language itself: calls, expressions and so on.
This allows a very powerful form of abstraction often called computing on the
language. While messing with language elements seems extraordinarily esoteric
to almost all new users, a lot of people moderate that view.
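
A small taste of computing on the language, as a sketch with made-up values: a call is an ordinary object that can be captured without being run, inspected, modified and only then evaluated.

```r
# quote() captures the call without evaluating it
cl <- quote(sum(1, 2))
is_a_call <- is.call(cl)

first <- eval(cl)            # evaluates sum(1, 2)

# A call can be edited like a list: element 1 is the function name,
# the remaining elements are the arguments
cl[[1]] <- as.name("prod")
cl[[3]] <- 10
second <- eval(cl)           # now evaluates prod(1, 10)

as_text <- deparse(cl)       # the modified call, turned back into text
```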

5.2 Simplicity

Make your functions as simple as possible. Simple has many advantages:

• Simple functions are likely to be human efficient: they will be easy to
  understand and to modify.

• Simple functions are likely to be computer efficient.

• Simple functions are less likely to be buggy, and bugs will be easier to fix.

• (Perhaps ironically) simple functions may be more general: thinking about
  the heart of the matter often broadens the application.

If your solution seems overly complex for the task, it probably is. There may
be simple problems for which R does not have a simple solution, but they are
rare.

Here are a few possibilities for simplifying:

• Don’t use a list when an atomic vector will do.

• Don’t use a data frame when a matrix will do.

• Don’t try to use an atomic vector when a list is needed.

• Don’t try to use a matrix when a data frame is needed.
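
The difference behind these four rules can be sketched with made-up data: atomic vectors and matrices hold a single type and silently coerce mixed input, while lists and data frames keep each component's own type.

```r
# Atomic vector: mixed input is coerced to one type (here character)
v <- c(1, "a")
# List: each component keeps its own type
l <- list(1, "a")

# Matrix: one type for the whole object, so the ids become character
m <- cbind(id = 1:2, name = c("a", "b"))
# Data frame: each column keeps its own type
df <- data.frame(id = 1:2, name = c("a", "b"))

types <- c(vector = typeof(v), list_first = typeof(l[[1]]),
           matrix = typeof(m), df_id = typeof(df$id))
```

When all the data are of one type, the simpler structure (atomic vector or matrix) is enough; when the types differ, forcing them into it quietly changes the data.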

Properly formatting your functions when you write them should be standard
practice. Here “proper” includes indenting based on the logical structure, and
putting spaces between operators. Circle 8.1.30 shows that there is a
particularly good reason to put spaces around logical operators.

A semicolon can be used to mark the separation of two R commands that
are placed on the same line. Some people like to put semicolons at the end of
all lines. This highly annoys many seasoned R users. Such a reaction seems to
be more visceral than logical, but there is some logic to it:

• The superfluous semicolons create some (imperceptible) inefficiency.

• The superfluous semicolons give the false impression that they are doing
  something.

One reason to seek simplicity is speed. The Rprof function is a very convenient
means of exploring which functions are using the most time in your function
calls. (The name Rprof refers to time profiling.)
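
A typical profiling session might look like the sketch below. The function slow_sum is a deliberately slow, hypothetical example, and the loop size is chosen only so that the profiler has time to collect samples; timings will differ between machines.

```r
# Hypothetical slow function: an explicit loop where sum(sqrt(...)) would do
slow_sum <- function(n) {
  total <- 0
  for (i in seq_len(n)) total <- total + sqrt(i)
  total
}

prof_file <- tempfile(fileext = ".out")

Rprof(prof_file, interval = 0.01)   # start profiling, sampling every 0.01 seconds
result <- slow_sum(1e7)
Rprof(NULL)                         # stop profiling

# Summarize the samples: time spent in each function
prof <- summaryRprof(prof_file)
head(prof$by.self)
```

The by.self component lists the time spent in each function itself, and by.total includes the time spent in the functions it calls.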

5.3 Consistency

Consistency is good. Consistency reduces the work that your users need to
expend. Consistency reduces bugs.

One form of consistency is the order and names of function arguments.
Surprising your users is not a good idea, even if the universe of your users is
of size 1.

A rather nice piece of consistency is always giving the correct answer. In
order for that to happen the inputs need to be suitable. To ensure that, the
function needs to check inputs, and possibly intermediate results. The tools for
this job include stop and stopifnot.
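
Input checking with these two tools might look like the following sketch. The function geo_mean and its error message are hypothetical, made up for illustration:

```r
# Hypothetical function: geometric mean with input checks
geo_mean <- function(x) {
  # stopifnot: terse checks, R generates the error message
  stopifnot(is.numeric(x), length(x) > 0)
  # stop: a check with a message written for the user
  if (any(x <= 0)) stop("all elements of 'x' must be positive")
  exp(mean(log(x)))
}

ok <- geo_mean(c(1, 4))

# try() lets the script continue so the failures can be examined
bad_type <- try(geo_mean("a"), silent = TRUE)
bad_value <- try(geo_mean(c(1, -1)), silent = TRUE)
```

stopifnot is convenient for quick checks, while stop allows a message that tells the user what to fix.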

Sometimes an occurrence is suspicious but not necessarily wrong. In this
case a warning is appropriate. A warning produces a message but does not
interrupt the computation.
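
As a sketch, a function can issue its own warning with the warning function. The function ratio and its message are hypothetical; the handler is used here only to show that the computation carries on and still returns a value:

```r
# Hypothetical function: division that flags a suspicious input
ratio <- function(num, den) {
  if (any(den == 0)) warning("zero denominator; result is not finite")
  num / den
}

msg <- NULL
out <- withCallingHandlers(
  ratio(1, 0),
  warning = function(w) {
    msg <<- conditionMessage(w)     # keep the text of the warning
    invokeRestart("muffleWarning")  # then let the computation continue
  }
)
```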

There is a problem with warnings. No one reads them. People have to read
error messages because no food pellet falls into the tray after they push the
button. With a warning the machine merely beeps at them but they still get
their food pellet. Never mind that it might be poison.

The appropriate reaction to a warning message is:

1. Figure out what the warning is saying.

2. Figure out why the warning is triggered.

3. Figure out the effect on the results of the computation (via deduction or
   experimentation).

4. Given the result of step 3, decide whether or not the results will be
   erroneous.

You want a minimal number of warning messages, in order to increase the
probability that the messages that do appear will be read. If you have a
complex function where a large number of suspicious situations are possible, you
might consider providing the ability to turn off some warning messages. Without
such a system the user may come to expect a number of warning messages and
hence miss messages that are unexpected and important.
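
One simple way to provide such a switch is a logical argument for each kind of warning. The sketch below is only one possible design; the function col_scale, the argument warn.constant and the data are hypothetical:

```r
# Hypothetical function: divide each column by its standard deviation.
# Constant columns are suspicious but not wrong, so they trigger a
# warning that the caller can switch off.
col_scale <- function(x, warn.constant = TRUE) {
  sds <- apply(x, 2, sd)
  if (warn.constant && any(sds == 0))
    warning("constant column(s) left unscaled")
  sds[sds == 0] <- 1
  sweep(x, 2, sds, "/")
}

m <- cbind(a = c(1, 2, 3), b = c(5, 5, 5))

# Default: the warning is raised (captured here so the script continues)
noisy <- tryCatch(col_scale(m), warning = function(w) conditionMessage(w))

# A user who knows about the constant column turns this one warning off
quiet <- col_scale(m, warn.constant = FALSE)
```

Unlike wrapping the whole call in suppressWarnings, a switch like this silences only the expected warning and leaves any others visible.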

The suppressWarnings function allows you to suppress warnings from specific
commands:

> log(c(3, -1))
[1] 1.098612 NaN
Warning message:
In log(c(3, -1)) : NaNs produced
> suppressWarnings(log(c(3, -1)))
[1] 1.098612 NaN

We want our functions to be correct. Not all functions are correct. The results
from specific calls can be put into four categories:

1. Correct.

2. An error occurs that is clearly identified.

3. An obscure error occurs.

4. An incorrect value is returned.

We like category 1. Category 2 is the right behavior if the inputs do not make
sense, but not if the inputs are sensible. Category 3 is an unpleasant place for
your users, and possibly for you if the users have access to you. Category 4 is
by far the worst place to be—the user has no reason to believe that anything is
wrong. Steer clear of category 4.
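
The gap between category 4 and category 2 can be sketched with a weighted mean. Both functions and the data are hypothetical; the unchecked version relies on R silently recycling a weight vector that is too short:

```r
# Category 4: recycles 'w' without complaint and returns a wrong value
wmean_unsafe <- function(x, w) sum(x * w) / sum(w)

# Category 2: the same computation, but bad input gives a clear error
wmean_checked <- function(x, w) {
  stopifnot(is.numeric(x), is.numeric(w), length(x) == length(w))
  sum(x * w) / sum(w)
}

x <- 1:4
bad <- wmean_unsafe(x, c(1, 2))                # larger than max(x), yet no message
caught <- try(wmean_checked(x, c(1, 2)), silent = TRUE)
good <- wmean_checked(x, c(1, 2, 1, 2))
```

The unchecked result is larger than every element of x, which no weighted mean with positive weights can be, yet nothing tells the user so.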

You should consistently write a help file for each of your persistent functions.
If you have a hard time explaining the inputs and/or outputs of the function,
then you should change the function. Writing a good help file is an excellent
way of debugging the function. The prompt function will produce a template
for your help file.
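
A sketch of generating such a template follows. The function geo_mean is hypothetical, and the template is written to a temporary file here only to keep the example tidy; by default prompt writes a file named after the function in the working directory.

```r
# Hypothetical function to document
geo_mean <- function(x) exp(mean(log(x)))

rd_file <- tempfile(fileext = ".Rd")

# Write a skeleton help file with sections to fill in
prompt(geo_mean, filename = rd_file)

rd_lines <- readLines(rd_file)
head(rd_lines)
```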

An example is worth a thousand words, so include examples in your help
files. Good examples are gold, but any example is much better than none. Using
data from the datasets package allows your users to run the examples easily.