Data structures - R vs Python - Jingwen Zheng

I’ve been learning Python since the beginning of this year. In this post, I’ll briefly compare the data structures in R with those in Python.

Array

R

Atomic vectors

one-dimensional array
contain only one data type
scalars are one-element vectors, e.g. f <- 3, g <- "US"
function c()

v <- c("k", "j", "w", "d", "v")
> v[1]
[1] "k"
> v[c(1, 3)]
[1] "k" "w"
> v[3:4]
[1] "w" "d"

Matrices

two-dimensional array
each element has the same mode (numeric, character or logical)
function matrix()

m <- matrix(1:6, nrow = 2, ncol = 3)
> m
     [,1] [,2] [,3]
[1,]    1    3    5
[2,]    2    4    6

> m[2, ]
[1] 2 4 6
> m[1, 3]
[1] 5

Arrays

similar to matrices
more than two dimensions
function array()

dim1 <- c("A1", "A2")
dim2 <- c("B1", "B2", "B3")
dim3 <- c("C1", "C2")

arr <- array(1:12, c(2, 3, 2), dimnames = list(dim1, dim2, dim3))
> arr
, , C1

   B1 B2 B3
A1  1  3  5
A2  2  4  6

, , C2

   B1 B2 B3
A1  7  9 11
A2  8 10 12

> arr[1,2,2]
[1] 9

> arr[1,2,]
C1 C2 
 3  9 
 
> arr[1,,]
   C1 C2
B1  1  7
B2  3  9
B3  5 11

Python

package numpy
functions numpy.array(), numpy.arange()

In [1]: import numpy
In [2]: arr_1d = numpy.array([6, 5.2, 2, 7])
In [3]: arr_1d
Out[3]: array([6. , 5.2, 2. , 7. ])

In [4]: arr_2d = numpy.array([[1, 2, 3], [4, 5, 6]])
In [5]: arr_2d
Out[5]:
array([[1, 2, 3],
       [4, 5, 6]])

In [6]: arr_3d = numpy.array([[[1, 2, 3], [4, 5, 6]], [[7, 8, 9], [10, 11, 12]]])
In [7]: arr_3d
Out[7]:
array([[[ 1,  2,  3],
        [ 4,  5,  6]],

       [[ 7,  8,  9],
        [10, 11, 12]]])
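
To line the NumPy side up with the R examples above, here is a minimal sketch that rebuilds the same vector, matrix and array and indexes them. It assumes only that numpy is installed; the variable names are mine. Two differences matter: NumPy positions start at 0 and slices exclude the stop index, and NumPy fills row by row by default, so `order="F"` is passed to reproduce R's column-by-column fill.

```python
import numpy

# Vector: zero-based positions; a slice stops before its end index.
v = numpy.array(["k", "j", "w", "d", "v"])
first = v[0]          # R: v[1]
picked = v[[0, 2]]    # R: v[c(1, 3)]
sliced = v[2:4]       # R: v[3:4]

# numpy.arange(1, 7) yields 1..6; order="F" fills column by column,
# which is what R's matrix(1:6, nrow = 2, ncol = 3) does by default.
m = numpy.arange(1, 7).reshape((2, 3), order="F")
second_row = m[1, :]  # R: m[2, ]
cell = m[0, 2]        # R: m[1, 3]

# Three dimensions, matching R's array(1:12, c(2, 3, 2)).
arr = numpy.arange(1, 13).reshape((2, 3, 2), order="F")
element = arr[0, 1, 1]  # R: arr[1,2,2]
fibre = arr[0, 1, :]    # R: arr[1,2,]
sheet = arr[0, :, :]    # R: arr[1,,]

print(first, picked, sliced)
print(second_row, cell)
print(element, fibre)
print(sheet)
```

NumPy arrays carry no dimnames, so the A1/B1/C1 labels from the R example have no direct counterpart here.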

List

R

an ordered collection of objects
lets you gather a variety of objects under one name

str <- "My first list"
mtx <- matrix(1:6, nrow = 3)
intVtr <- c(5, 7, 32, 19)
strVtr <- c("one", "two")

mylist <- list(title = str, ages = intVtr, mtx, strVtr)
> mylist
$title
[1] "My first list"

$ages
[1]  5  7 32 19

[[3]]
     [,1] [,2]
[1,]    1    4
[2,]    2    5
[3,]    3    6

[[4]]
[1] "one" "two"

> mylist[[2]]
[1]  5  7 32 19
> mylist[["ages"]]
[1]  5  7 32 19

Python

variable-length
can be modified in-place
[], list()
methods: append(), insert(), pop(), remove(), extend(), sort()

In [1]: a_list = [2, 7, None]
In [2]: print(a_list)
Out[2]: [2, 7, None]

In [3]: b_list = list(('foo', 'bar'))
In [4]: b_list[1] = 'pee'
In [5]: print(b_list)
Out[5]: ['foo', 'pee']
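
The methods listed above all modify the list in place. A small sketch of each one in turn, with made-up string values:

```python
b_list = ['foo', 'bar']

b_list.append('dwarf')            # add one element at the end
b_list.insert(1, 'red')           # insert at position 1
# b_list is now ['foo', 'red', 'bar', 'dwarf']

popped = b_list.pop(2)            # remove and return the element at position 2
b_list.remove('foo')              # remove the first element equal to 'foo'
# b_list is now ['red', 'dwarf']

b_list.extend(['baz', 'apple'])   # append every element of another list
b_list.sort()                     # sort in place (returns None)

print(popped)
print(b_list)
```

Note that sort() only works when the elements can be compared with each other, so a list mixing strings and None, like a_list above, would raise a TypeError.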

Dataframe

R

data.frame()

patientId <- c(1, 2, 3)
age <- c(34, 23, 7)
diabetes <- c("Type1", "Type2", "Type1")
status <- c("Poor", "Excellent", "Improved")
patientDF <- data.frame(patientId, age, diabetes, status)

> patientDF
  patientId age diabetes    status
1         1  34    Type1      Poor
2         2  23    Type2 Excellent
3         3   7    Type1  Improved

> patientDF[1:2]
  patientId age
1         1  34
2         2  23
3         3   7

> patientDF[1, ]
  patientId age diabetes status
1         1  34    Type1   Poor

> patientDF[c("patientId", "age")]
  patientId age
1         1  34
2         2  23
3         3   7

> patientDF$age
[1] 34 23  7

Python

contain an ordered collection of columns
have both row and column index
package pandas
pandas.DataFrame()

import pandas as pd
data = {'state': ['Ohio', 'Ohio', 'Nevada', 'Nevada'],
        'year': [2000, 2001, 2002, 2003],
        'pop': [1.5, 1.7, 3.6, 2.7]}
frame = pd.DataFrame(data, columns=['year', 'state', 'pop'])
In [1]: print(frame)
Out[1]:
   year   state  pop
0  2000    Ohio  1.5
1  2001    Ohio  1.7
2  2002  Nevada  3.6
3  2003  Nevada  2.7

Besides, some data structures exist in only one of the two languages:

Factors (R)

store categorical variables: nominal (unordered) or ordinal (ordered); continuous variables are not factors
factor()

patientId <- c(1, 2, 3)
age <- c(34, 23, 7)
diabetes <- c("Type1", "Type2", "Type1")
status <- c("Poor", "Excellent", "Improved")
diabetes <- factor(diabetes)
status <- factor(status, ordered = TRUE)

patientDF <- data.frame(patientId, age, diabetes, status)

> str(patientDF)
'data.frame':	3 obs. of  4 variables:
 $ patientId: num  1 2 3
 $ age      : num  34 23 7
 $ diabetes : Factor w/ 2 levels "Type1","Type2": 1 2 1
 $ status   : Ord.factor w/ 3 levels "Excellent"<"Improved"<..: 3 1 2

> summary(patientDF)
   patientId        age         diabetes       status 
 Min.   :1.0   Min.   : 7.00   Type1:2   Excellent:1  
 1st Qu.:1.5   1st Qu.:15.00   Type2:1   Improved :1  
 Median :2.0   Median :23.00             Poor     :1  
 Mean   :2.0   Mean   :21.33                          
 3rd Qu.:2.5   3rd Qu.:28.50                          
 Max.   :3.0   Max.   :34.00

Tuple (Python)

fixed-length
immutable
tuple()

In [1]: tup_int = 4, 5, 6
In [2]: print(tup_int)
Out[2]: (4, 5, 6)

In [3]: tup_str = tuple('string')
In [4]: print(tup_str)
Out[4]: ('s', 't', 'r', 'i', 'n', 'g')
In [5]: tup_str[0]
Out[5]: 's'
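
To make "fixed-length" and "immutable" concrete, here is a short sketch (variable names are mine) showing that assigning to a slot fails, that a tuple can be unpacked into variables, and that a mutable object stored inside a tuple can still be changed in place.

```python
tup_int = (4, 5, 6)

# Assigning to a position raises TypeError because tuples are immutable.
assignment_failed = False
try:
    tup_int[0] = 10
except TypeError:
    assignment_failed = True

# Unpacking assigns each element to its own variable.
a, b, c = tup_int

# The tuple's slots are fixed, but a list stored in a slot is still mutable.
nested = ('foo', [1, 2], True)
nested[1].append(3)

print(assignment_failed)
print(a, b, c)
print(nested)
```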

Dict (Python)

hash map, associative array
key-value pairs
{}, dict()
delete with the del statement; methods: pop(), update()

empty_dict = {}
d1 = {'a': 'some value', 'b': [1, 2]}
In [1]: print(d1)
Out[1]: {'a': 'some value', 'b': [1, 2]}

d1[7] = 'an integer'
In [2]: print(d1)
Out[2]: {'a': 'some value', 'b': [1, 2], 7: 'an integer'}
In [3]: print(d1['b'])
Out[3]: [1, 2]
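
Deleting and merging entries could look like the following sketch, which starts from the same d1 and uses made-up replacement values:

```python
d1 = {'a': 'some value', 'b': [1, 2], 7: 'an integer'}

del d1[7]                  # the del statement removes a key and returns nothing
removed = d1.pop('b')      # pop() removes a key and returns its value

# update() merges another dict in place; existing keys are overwritten.
d1.update({'a': 'new value', 'c': 12})

print(removed)
print(d1)
```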

Set (Python)

unordered collection
unique elements
set(), {...} (an empty {} creates a dict, not a set)
set operations: union, intersection, difference, symmetric difference

In [1]: print(set([2, 2, 1, 3]))
Out[1]: {1, 2, 3}
In [2]: print({2, 2, 1, 3})
Out[2]: {1, 2, 3}
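
The four set operations listed above are available both as operators and as methods. A small sketch with two example sets of my own:

```python
a = {1, 2, 3, 4, 5}
b = {3, 4, 5, 6, 7, 8}

union = a | b                  # same as a.union(b)
intersection = a & b           # same as a.intersection(b)
difference = a - b             # same as a.difference(b)
symmetric = a ^ b              # same as a.symmetric_difference(b)

# Sets are unordered, so sort before printing for a stable display.
print(sorted(union))
print(sorted(intersection))
print(sorted(difference))
print(sorted(symmetric))
```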

List, Set and Dict comprehensions (Python)

form a new list by filtering the elements of a collection
transform the elements passing the filter in one concise expression
list comprehension list_comp = [expr for value in collection if condition]
dict comprehension dict_comp = {key-expr : value-expr for value in collection if condition}
set comprehension set_comp = {expr for value in collection if condition}
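
As an illustration of the three templates above, here is a sketch that filters and transforms one example list of strings (the data is made up):

```python
strings = ['a', 'as', 'bat', 'car', 'dove', 'python']

# List comprehension: keep strings longer than 2 characters, upper-cased.
list_comp = [s.upper() for s in strings if len(s) > 2]

# Dict comprehension: map each kept string to its length.
dict_comp = {s: len(s) for s in strings if len(s) > 2}

# Set comprehension: the distinct lengths that occur in the list.
set_comp = {len(s) for s in strings}

print(list_comp)
print(dict_comp)
print(sorted(set_comp))
```

The `if condition` part is optional in all three forms; leaving it out transforms every element.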
