Lately I've been reading the book “R in Action” on my commute to and from work. To strengthen my data analysis skills and help with our work, I decided to start from the beginning. Reviewing the basics felt familiar, and I found them interesting all over again. Since this material is useful in data science, I’d like to share my notes as weekly blog posts. Today, let’s look at data structures.
A dataset is usually a rectangular array of data with rows representing observations and columns representing variables. R has a wide variety of objects for holding data, including vectors, matrices, arrays, data frames, factors and lists. Let’s look at each structure in turn, starting with vectors.
Vectors
Vectors are one-dimensional arrays that can hold numeric data, character data, or logical data. The combine function c() is used to form the vector. You can refer to elements of a vector using a numeric vector of positions within brackets. For example, a[c(2, 4)] refers to the second and fourth elements of vector a.
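As a minimal sketch of the points above, the following builds one vector of each mode with c() and then selects elements by position. The variable names and values are made up for illustration.

```r
# Illustrative vectors; the names and values are invented for this sketch.
a <- c(1, 2, 5, 3, 6, -2, 4)       # numeric vector
b <- c("one", "two", "three")      # character vector
flags <- c(TRUE, FALSE, TRUE)      # logical vector

third <- a[3]                      # a single element by position
second_and_fourth <- a[c(2, 4)]    # several elements via a vector of positions
two_to_four <- a[2:4]              # a range of positions

print(second_and_fourth)
```

Note that a vector holds a single mode: mixing modes in one c() call coerces everything to a common mode, so c(1, "a", TRUE) produces a character vector.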
Matrices
A matrix is a two-dimensional array in which each element has the same mode (numeric, character, or logical). Matrices are created with the matrix() function.
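Here is a small sketch of creating and indexing a matrix. The data and the row and column labels are made up; byrow and dimnames are optional arguments of matrix().

```r
# A 2 x 3 matrix filled row by row, with illustrative row and column names.
m <- matrix(1:6, nrow = 2, ncol = 3, byrow = TRUE,
            dimnames = list(c("r1", "r2"), c("c1", "c2", "c3")))

second_row <- m[2, ]     # all columns of row 2
third_col <- m[, 3]      # all rows of column 3
one_cell <- m[1, 2]      # row 1, column 2

print(m)
```

By default matrix() fills column by column; byrow = TRUE switches to row-wise filling.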
Arrays
Arrays are similar to matrices but can have more than two dimensions. They’re created with the array() function.
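A minimal sketch of a three-dimensional array follows. The dimensions and values are arbitrary and chosen only for illustration.

```r
# A 2 x 3 x 4 array holding the integers 1 to 24.
z <- array(1:24, dim = c(2, 3, 4))

dims <- dim(z)           # the length of each dimension
cell <- z[1, 2, 3]       # row 1, column 2, third "layer"
layer <- z[, , 1]        # the first 2 x 3 slice, returned as a matrix

print(dims)
```

As with matrices, every element of an array has the same mode, and indexing uses one position per dimension.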
Data frames
A data frame is more general than a matrix in that different columns can contain different modes of data (numeric, character, and so on). A data frame is created with the data.frame() function.
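To make this concrete, here is a sketch of a small data frame with columns of different modes. The patients data set, its column names, and its values are hypothetical.

```r
# A hypothetical data set: each row is an observation, each column a variable.
patients <- data.frame(
  id = 1:4,
  age = c(25, 34, 28, 52),
  diabetes = c("Type1", "Type2", "Type1", "Type1"),
  status = c("Poor", "Improved", "Excellent", "Poor")
)

ages <- patients$age                   # one column, selected by name with $
subset_cols <- patients[, c("id", "age")]  # several columns by name
first_two <- patients[1:2, ]           # rows by position

str(patients)                          # compact overview of the column types
```

The $ notation is the most common way to pick out a single column, which is what makes the attach() and with() shortcuts described next attractive.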
attach, detach
The attach() function adds the data frame to the R search path. When a variable name is encountered, data frames in the search path are checked for the variable in order. For example, after attach(mtcars), the column mtcars$mpg can be referred to simply as mpg.
The detach() function removes the data frame from the search path. Note that detach() does nothing to the data frame itself. The attach() and detach() functions are best used when you’re analyzing a single data frame and you’re unlikely to have multiple objects with the same name, because an object that already exists in your workspace takes precedence over an attached column of the same name.
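The attach() and detach() workflow described above can be sketched as follows, using the built-in mtcars data frame. The particular statistics computed here are just an illustration, not a reconstruction of any specific example.

```r
# Put the columns of mtcars on the search path.
attach(mtcars)

# Columns can now be referenced without the mtcars$ prefix.
mpg_summary <- summary(mpg)
mpg_wt_cor <- cor(mpg, wt)

# Remove mtcars from the search path; the data frame itself is untouched.
detach(mtcars)

print(mpg_summary)
```

Pairing every attach() with a detach() keeps the search path from accumulating data frames over a long session.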
with
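An alternative approach is to use the with() function. An analysis of the mtcars data frame can be written as a call of the form with(mtcars, { ... }), with the statements placed inside the {} brackets.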
In this case, the statements within the {} brackets are evaluated with reference to the mtcars data frame. You don’t have to worry about name conflicts. If there’s only one statement (for example, summary(mpg)), the {} brackets are optional.
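As a sketch of the with() form described above, the following evaluates a block of statements against mtcars and then shows the single-statement form without brackets. The chosen statistics are illustrative only.

```r
# Several statements: wrap them in {} and print intermediate results explicitly,
# since only the value of the last statement is returned by with().
with(mtcars, {
  print(summary(mpg))
  print(cor(mpg, wt))
})

# A single statement: the {} brackets can be dropped.
mpg_summary <- with(mtcars, summary(mpg))
avg_mpg <- with(mtcars, mean(mpg))
```

Because with() returns the value of its last expression, assigning the result of the call, as in the last two lines, is often the simplest way to keep a result.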
The limitation of the with() function is that assignments exist only within the function brackets. If you need to create objects that will exist outside of the with() construct, use the special assignment operator <<- instead of the standard one (<-). It assigns in an enclosing environment rather than the temporary one, which is the global environment when with() is called from the top level, so the object survives after the with() call returns.
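The difference between the two assignment operators can be sketched like this. The object names stats_local and stats_kept are hypothetical, and the sketch assumes the code runs at the top level of a script or session and that neither name is already defined.

```r
with(mtcars, {
  stats_local <- summary(mpg)   # exists only while with() is being evaluated
  stats_kept <<- summary(mpg)   # assigned outside the with() call
})

local_survived <- exists("stats_local")
kept_survived <- exists("stats_kept")

print(c(local = local_survived, kept = kept_survived))
```

In practice, assigning the return value of with(), for example stats <- with(mtcars, summary(mpg)), usually achieves the same thing without relying on <<-.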
Factors
Variables can be described as nominal, ordinal, or continuous. Nominal variables are categorical, without an implied order. Ordinal variables imply order but not amount. Continuous variables can take on any value within some range, and both order and amount are implied.
Categorical (nominal) and ordered categorical (ordinal) variables in R are called factors. Factors are crucial in R because they determine how data is analyzed and presented visually. You can use the function factor() to create a factor.
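Here is a brief sketch of a nominal and an ordinal factor. The diabetes and status values are invented for illustration; levels and ordered are optional arguments of factor().

```r
# Nominal factor: categories with no implied order.
diabetes <- factor(c("Type1", "Type2", "Type1", "Type1"))

# Ordinal factor: spell out the level order instead of relying on the
# default alphabetical ordering.
status <- factor(c("Poor", "Improved", "Excellent", "Poor"),
                 levels = c("Poor", "Improved", "Excellent"),
                 ordered = TRUE)

diabetes_levels <- levels(diabetes)
below_excellent <- status < "Excellent"   # comparisons work on ordered factors
counts <- table(status)                   # frequency of each level

print(counts)
```

Internally a factor stores integer codes plus a set of level labels, which is why setting the level order explicitly matters for ordinal data.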
Lists
Lists are the most complex of the R data types. Basically, a list is an ordered collection of objects (components). A list allows you to gather a variety of (possibly unrelated) objects under one name.
You can specify elements of the list by indicating a component number or a name within double brackets. For instance, mylist[[2]] refers to the second element of the list.
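A short sketch of building a list and pulling out a component by number and by name follows. The component names and contents are made up for illustration.

```r
# A list gathering objects of different types under one name.
mylist <- list(title = "My first list",
               ages = c(25, 26, 18, 39),
               m = matrix(1:10, nrow = 5),
               labels = c("one", "two", "three"))

by_number <- mylist[[2]]        # second component, by position
by_name <- mylist[["ages"]]     # the same component, by name
by_dollar <- mylist$ages        # shorthand for named components

print(by_number)
```

Single brackets, as in mylist[2], return a one-element list rather than the component itself, so double brackets are the ones to use when you want the underlying object.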
Summary
One of the most challenging tasks in data analysis is data preparation. We’ve made a good start by outlining the various structures R provides for holding data. In the next post, we’ll review general methods for working with graphs.