8.2. CHIMERAS
CIRCLE 8. BELIEVING IT DOES AS INTENDED
When the call to data.frame uses a tag (name) for an item, it is expected that
the corresponding column of the output will have that name. However, column
names that are already there take precedence.
Notice also that the data frames contain a factor column rather than character.
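The naming rule can be seen in a minimal sketch, assuming the item being added is a one-column data frame that already has a column name. The object names inner and combined are made up for illustration.

```r
# 'inner' is a one-column data frame whose column is already named "x".
inner <- data.frame(x = 1:3)

# The tag 'renamed' is supplied, but the existing column name wins.
combined <- data.frame(id = 101:103, renamed = inner)
names_before <- names(combined)
print(names_before)

# Setting the names afterwards is one way to get the intended name.
names(combined)[2] <- "renamed"
print(names(combined))
```

Checking names() right after building a data frame is a cheap way to catch this.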
8.2.42 cbind favors matrices
If you cbind two matrices, you get a matrix. If you cbind two data frames, you
get a data frame. If you cbind two vectors, you get a matrix:
> is.matrix(cbind(x=1:10, y=rnorm(10)))
[1] TRUE
If you want a data frame, then use the data.frame function:
> dfxy <- data.frame(x=1:10, y=rnorm(10))
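One practical consequence, sketched here with made-up vectors x and y, is that a matrix holds a single type, so binding vectors of different types with cbind coerces them all to a common type. A data frame keeps each column's own type.

```r
# Two vectors of different types: integer and character.
x <- 1:3
y <- c("a", "b", "c")

m <- cbind(x = x, y = y)        # a matrix, so every element becomes character
d <- data.frame(x = x, y = y)   # a data frame, each column keeps its own type

print(is.matrix(m))
print(typeof(m))
print(sapply(d, class))
```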
8.2.43 data frame equal number of rows
A data frame is implemented as a list. But not just any list will do—each
component must represent the same number of rows. If you work hard enough,
you might be able to produce a data frame that breaks the rule. More likely
your frustration will be that R stops you from getting to such an absurd result.
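A short sketch of how R enforces the rule, assuming plain vectors as inputs. Items whose lengths divide evenly are recycled; otherwise data.frame stops with an error, which is captured here with tryCatch so the script can continue.

```r
# Lengths 4 and 2: the shorter item is recycled a whole number of times.
ok <- data.frame(x = 1:4, y = 1:2)
print(dim(ok))

# Lengths 3 and 2: no whole-number recycling is possible, so R refuses.
bad <- tryCatch(data.frame(x = 1:3, y = 1:2),
                error = function(e) e)
print(inherits(bad, "error"))
print(conditionMessage(bad))
```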
8.2.44 matrices in data frames
Let’s make two data frames:
> ymat <- array(1:6, c(3,2))
> xdf6 <- data.frame(X=101:103, Y=ymat)
> xdf7 <- data.frame(X=101:103)
> xdf7$Y <- ymat
> xdf6
    X Y.1 Y.2
1 101   1   4
2 102   2   5
3 103   3   6
> xdf7
    X Y.1 Y.2
1 101   1   4
2 102   2   5
3 103   3   6
> dim(xdf6)
[1] 3 3
> dim(xdf7)
[1] 3 2
They print exactly the same. But clearly they are not the same since they have
different dimensions.
> xdf6[, 'Y.1']
[1] 1 2 3
> xdf7[, 'Y.1']
Error in "[.data.frame"(xdf7, , "Y.1") :
  undefined columns selected
> xdf6[, 'Y']
Error in "[.data.frame"(xdf6, , "Y") :
  undefined columns selected
> xdf7[, 'Y']
     [,1] [,2]
[1,]    1    4
[2,]    2    5
[3,]    3    6
xdf6 includes components Y.1 and Y.2. xdf7 does not have such components (in
spite of how it is printed); it has a Y component that is a two-column matrix.
You will surely think that allowing a data frame to have components with
more than one column is an abomination. That will be your thinking unless,
of course, you’ve had occasion to see it being useful. The feature is worth the
possible confusion, but perhaps a change to printing could reduce confusion.
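Since printing does not reveal the difference, it can help to test the components directly. The sketch below rebuilds the second data frame from this section, checks which components are matrices, and shows one possible way (not the only one) to flatten a matrix component into separate columns; the name flat is made up for illustration.

```r
ymat <- array(1:6, c(3, 2))
xdf7 <- data.frame(X = 101:103)
xdf7$Y <- ymat

# Which components are matrices?
print(sapply(xdf7, is.matrix))

# Rebuild the data frame so that each matrix column becomes its own component.
flat <- do.call(data.frame, xdf7)
print(names(flat))
print(dim(flat))
```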
8.3 Devils
The most devilish problem is getting data from a file into R correctly.
8.3.1 read.table
The read.table function is the most common way of getting data into R.
Reading its help file three times is going to be very efficient time management
if you ever use read.table. In particular the header and row.names arguments
control what (if anything) in the file should be used as column and row names.
Another great time management tool is to inspect the result of the data you
have read before attempting to use it.
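A minimal sketch of both points, using invented data supplied through the text argument of read.table so that no file is needed. The first line provides column names and the first field of each remaining line is used as a row name.

```r
# A small whitespace-separated "file", held in a character vector.
lines <- c("id height weight",
           "a 1.5 10",
           "b 1.7 12")

dat <- read.table(text = lines, header = TRUE, row.names = 1)

# Inspect before use: structure, dimensions and the first few rows.
str(dat)
print(dim(dat))
print(head(dat))
```

The str call is often the quickest check, since it shows the type that each column ended up with.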
8.3.2 read a table
The read.table function does not create a table; it creates a data frame. You
don’t become a book just because you read a book. The table function returns
a table.
The idea of read.table and relatives is that they read data that are in a
rectangular format.
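The distinction is easy to confirm with class. This is a small sketch on invented data; the column names colour and size are made up.

```r
dat <- read.table(text = c("colour size",
                           "red 1",
                           "blue 2",
                           "red 3"),
                  header = TRUE)
print(class(dat))          # the result of read.table is a data frame

counts <- table(dat$colour)
print(class(counts))       # the result of table is a table
print(counts)
```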
8.3.3 the missing, the whole missing and nothing but the missing
Misreading missing values is an efficacious way of producing garbage. Missing
values can become non-missing, non-missing values can become missing, logically
numeric columns can become factors.
The na.strings argument to read.table needs to be set properly. An example
might be:
na.strings=c('.', '-999')
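A sketch of the effect on invented data in which both "." and "-999" stand for missing values. With the default settings the column is not read as numeric (a factor in older versions of R, character from R 4.0.0 on); with na.strings set it is numeric with proper NA values.

```r
lines <- c("x y",
           "1 4.5",
           "2 .",
           "3 -999")

# Default settings: "." is not recognised as missing, so y is not numeric.
bad <- read.table(text = lines, header = TRUE)
print(class(bad$y))

# Both codes declared as missing: y is numeric and the codes become NA.
good <- read.table(text = lines, header = TRUE, na.strings = c(".", "-999"))
print(class(good$y))
print(good$y)
```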
8.3.4 misquoting
A quite common file format is to have a column of names followed by some
number of columns of data. If there are any apostrophes in those names, then
you are likely to get an error reading the file unless you have set the quote
argument to read.table. A likely value for quote is:
quote=''
This sounds like easy advice, but almost surely it is not going to be apparent
that quotes are the problem. You may get an error that says there was the
wrong number of items in a line. When you get such an error, it is often a
good idea to use count.fields to get a sense of what R thinks about your file.
Something along the lines of:
foo.cf <- count.fields('foo.txt', sep='\t')
table(foo.cf)
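The diagnosis can be tried on a throwaway file. The sketch below writes a hypothetical tab-separated file whose names contain apostrophes, then compares the field counts with the default quote characters against those with quoting switched off. The file contents are invented for illustration.

```r
# Hypothetical file: a name column followed by two data columns.
f <- tempfile(fileext = ".txt")
writeLines(c("O'Brien\t1\t2",
             "Smith\t3\t4",
             "D'Arcy\t5\t6"), f)

# Default quote characters: the apostrophes are treated as quotes, so the
# field counts are typically irregular or missing.
print(table(count.fields(f, sep = "\t"), useNA = "ifany"))

# Quoting switched off: every line has the same number of fields.
cf <- count.fields(f, sep = "\t", quote = "")
print(table(cf))

dat <- read.table(f, sep = "\t", quote = "")
print(dat)
unlink(f)
```

A table of counts with more than one distinct value, or with missing values, is the signal that R is not seeing the rectangle you think is there.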
8.3.5 thymine is TRUE, female is FALSE
You are reading in DNA bases identified as A, T, G and C. The columns are
read as factors. Except for the column that is all T—that column is logical.
Similarly, a column for gender that is all F for female will be logical.
The solution is to use the read.table argument:
colClasses='character'
or
colClasses='factor'
as you like.
If there are columns of other sorts of data, then you need to give colClasses
a vector of appropriate types for the columns in the file.
Using colClasses can also make the call much more efficient.
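A sketch of the problem and both forms of the fix, on invented data. The column names base1, base2, base and count are made up; the last call gives colClasses one class per column.

```r
lines <- c("base1 base2",
           "T A",
           "T G",
           "T C")

# Default type conversion: the all-T column is read as logical.
default <- read.table(text = lines, header = TRUE)
print(sapply(default, class))

# One class recycled over all columns.
fixed <- read.table(text = lines, header = TRUE, colClasses = "character")
print(sapply(fixed, class))

# A vector of classes, one per column, when the columns differ in type.
mixed <- read.table(text = c("base count", "T 3", "T 5"), header = TRUE,
                    colClasses = c("character", "integer"))
print(sapply(mixed, class))
```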
Figure 8.3: The treacherous to country and the treacherous to guests and hosts
by Sandro Botticelli.
8.3.6 whitespace is white
Whitespace is invisible, and we have a predilection to believe that invisible
means non-existent.
> factor(c('A ', 'A', 'B'))
[1] A  A  B
Levels: A A  B
It is extraordinarily easy to get factors like this when reading in data. Setting
the strip.white argument of read.table to TRUE can prevent this.
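A sketch on invented comma-separated data with a stray trailing space in one field. Note that, according to the read.table help file, strip.white is used only when sep has been specified, so the example sets sep explicitly.

```r
lines <- c("group,value",
           "A ,1",
           "A,2",
           "B,3")

# Without stripping, "A " and "A" are kept as distinct values.
raw <- read.table(text = lines, header = TRUE, sep = ",")
print(unique(raw$group))

# With stripping, the trailing space is removed on input.
clean <- read.table(text = lines, header = TRUE, sep = ",", strip.white = TRUE)
print(unique(clean$group))
```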
8.3.7 extraneous fields
When a file has been created in a spreadsheet, there are sometimes extraneous
empty fields in some of the lines of the file. In such a case you might get an
error similar to:
> mydat <- read.table('myfile', header=TRUE, sep='\t')
Error in scan(file, what, nmax, sep, dec, quote, skip, :
  line 10 did not have 55 elements
This, of course, is a perfect instance to use count.fields to see what is going
on. If extraneous empty fields do seem to be the problem, then one solution is:
> mydat <- read.table('myfile', header=TRUE, sep='\t',
+     fill=TRUE)
> mydat <- mydat[, 1:53]
At this point, it is wiser than usual to carefully inspect the results to see that
the data are properly read and aligned.
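The same situation can be reproduced in miniature. The sketch below uses invented tab-separated lines in which most lines carry two trailing empty fields but one does not; the first read fails, and the second follows the fill=TRUE recipe and then drops the empty columns. Three real columns stand in for the 53 of the example above.

```r
# Hypothetical spreadsheet export: trailing empty fields on most lines.
lines <- c("a\tb\tc\t\t",
           "1\t2\t3\t\t",
           "4\t5\t6",
           "7\t8\t9\t\t")

# Without fill, the short line causes an error, captured here.
res <- tryCatch(read.table(text = lines, header = TRUE, sep = "\t"),
                error = function(e) e)
print(inherits(res, "error"))

# fill=TRUE pads the short line; the empty trailing columns are then dropped.
mydat <- read.table(text = lines, header = TRUE, sep = "\t", fill = TRUE)
print(dim(mydat))
mydat <- mydat[, 1:3]
print(mydat)
```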
8.3.8 fill and extraneous fields
When the fill argument is TRUE (which is the default for read.csv and
read.delim but not for read.table), there can be trouble if there is a variable
number of fields in the file.
> writeLines(c("A,B,C,D",
+     "1,a,b,c",
+     "2,d,e,f",
+     "3,a,i,j",
+     "4,a,b,c",
+     "5,d,e,f",
+     "6,g,h,i,j,k,l,m,n"),
+     con=file("test.csv"))
> read.csv("test.csv")
  A B C D