02 Matrices, data frames and lists

question Questions
• How to work with tables and lists?
objectives Objectives
• learn how to create a matrix
• learn how to create a data frame
• learn how to extract elements from a data frame
• learn how to change elements in a data frame
• learn how to reorder columns in a data frame
• learn how to create a list
• learn how to extract elements from a list

time Time estimation: 90 minutes

Data structures in R

The power of R lies not in its ability to work with simple numbers but in its ability to work with large datasets. R has a wide variety of data structures including scalars, vectors, matrices, data frames, and lists.

Matrices

A matrix is a table, the columns are vectors of equal length. All columns in a matrix must contain the same type of data. The top row, called the header, contains column labels. Rows can also have labels. Data values are called elements. Indices are often used as column and row labels.

Creating a matrix

To create a matrix M use the matrix() function

M <- matrix(data,nrow=r,ncol=c,byrow=FALSE))


It takes a long list of arguments:

• data usually is a vector of elements to will fill the matrix
• nrow and ncol: dimensions (number of rows and columns). Only one dimension argument is needed. If there are 20 elements in the data vector and ncol=4 then R will automatically calculate that there should be 5 rows.
• byrow: how the matrix is filled, byrow=TRUE fills the matrix row by row whereas byrow=FALSE fills the matrix column by column

hands_on Hands-on: Demo

From the demo script run the Data creation: matrices section

hands_on Hands-on: Extra exercise 8a

1. Create a 2x2 matrix named mat containing numbers 2,3,1,5
2. Print the matrix
solution Solution
 mat<-matrix(c(2,3,1,5),nrow=2,ncol=2)
mat


hands_on Hands-on: Extra exercise 8b

1. Create a 2x3 matrix named onemat consisting of all ones
2. Print the matrix
solution Solution
 onemat<-matrix(1,nrow=2,ncol=3)
onemat


hands_on Hands-on: Extra exercise 8c

1. Create a 3x3 matrix containing numbers 1,2,3,4,5,6,7
2. Retrieve all elements that are larger than 3
solution Solution
 m <- matrix(c(1,2,3,4,5,6,7),ncol=3)
m[m > 3]


Data frames

Just like a matrix, a data frame is a table where each column is a vector. But a data frame is more general than a matrix: they are used when columns contain different data types, while matrices are used when all data is of the same type.

comment Comment

R has a number of built-in data frames like mtcars.

Creating a data frame

To create a data frame D use the function data.frame() with the vectors we want to use as columns:

D <- data.frame(column1,column2,column3)


comment Comment

The columns of a data frame are all of equal length

You can provide names (labels) for the columns:

D <- data.frame(label1=column1,label2=column2,label3=column3)


comment Comment

As an argument of data.frame() you use label=vector_to_add: the equals (and not the assignment) operator is used because you are naming columns not creating new variables. If you don’t define labels (as in the first example), the names of the vector names are used as column names.

hands_on Hands-on: Demo

From the demo script run the Data creation: data frames section

hands_on Hands-on: Exercise 9a

Create a data frame called Plant_study containing days and Plants_with_lesions. Name the columns Days and Plants.

solution Solution
 Plant_study <- data.frame(Days=days,Plants=Plants_with_lesions)
Plant_study


hands_on Hands-on: Exercise 9b

Create a data frame called Drug_study consisting of three columns: ID, treatment and smoking

solution Solution
 Drug_study <- data.frame(ID,treatment,smoking)
Drug_study


hands_on Hands-on: Exercise 9c

Create a data frame genomeSize containing genome sizes and print it.

• The first column is called organism and contains Human,Mouse,Fruit Fly, Roundworm,Yeast
• The second column size contains 3000000000,3000000000,135600000,97000000,12100000
• The third column geneCount contains 30000,30000,13061,19099,6034
solution Solution
 organism <- c("Human","Mouse","Fruit Fly", "Roundworm","Yeast")
size <- c(3000000000,3000000000,135600000,97000000,12100000)
geneCount <- c(30000,30000,13061,19099,6034)
genomeSize <- data.frame(organism,size,geneCount)


hands_on Hands-on: Extra exercise 9d

Create a data frame ab and print it.

• The first column is called a and contains 1,3,2,1
• The second column is called b and contains 2,3,4,1
solution Solution
 a <- c(1,3,2,1)
b <- c(2,3,4,1)
ab <- data.frame(a,b)

Referring to the elements of a data frame

Referring to elements of a data frame can be done in the same way as for matrices, using row and column indices in between square brackets. The only difference is that in data frames you can also use the labels of the columns to retrieve them.

To retrieve the element on the second row, first column:

D[2,1]


To select all values from one dimension leave the index blank, e.g. all elements of the first column:

D[,1]


comment Comment

If you want to retrieve all the rows you don?t write any index before the comma inside the square brackets.

You can also use column labels for retrieving elements. Column names have to be written between quotes:

D[,"label1"]


You can also use the range function to select elements:

D[2:4,1]


The $symbol can be used to retrieve a column based on its label e.g. to retrieve column label1 from D: D$label1


comment Comment

With $you do not have to put quotes around the column name Since the result of$ is a vector, you can address a specific element of a column using its index:

D$label1[2]  retrieves the second element of the column called label1 Specific for data frames is the subset() function that can be used to select columns that satisfy a logical operation: subset(D,select=columns to extract) subset(D,logical expression,columns to extract)  hands_on Hands-on: Demo From the demo script run the Data extraction: data frames section hands_on Hands-on: Exercise 10a 1. Retrieve the data for the Volvo 142E from mtcars 2. Retrieve the gas usage (mpg column) for the Volvo 142E solution Solution  mtcars["Volvo 142E",] mtcars["Volvo 142E","mpg"]  question Question What will happen when you run this code ?  mtcars["Volvo 142E"]  question Question What will happen when you run this code ?  mtcars[Volvo 142E,]  hands_on Hands-on: Exercise 10b 1. Retrieve the IDs of the smoking patients in Drug_study 2. Retrieve ID and treatment of the smoking patients 3. Retrieve the smoking behavior of all the patients 4. Change the treatment of the fourth patient to A 5. Add a column called activity with values: 4, NA, 12.1, 2.5 6. Use subset() to retrieve the full ID and treatment columns solution Solution  subset(Drug_study,smoking==TRUE,ID) subset(Drug_study,smoking==TRUE,c(ID,treament)) Drug_study$smoking
Drug_study$treatment[4] <- "A" Drug_study$activity <- c(4,NA,12.1,2.5)
subset(Drug_study,select=c(ID,treatment))


question Question

What will happen when you run this code ?

 Drug_study[Drug_study$smoking==TRUE,"ID"]  question Question What will happen when you run this code ?  Drug_study[Drug_study$smoking==TRUE,ID]



question Question

What will happen when you run this code ?

 Drug_study[,"smoking"]



question Question

What will happen when you run this code ?

 Drug_study[4,"treatment"] <- "B"


question Question

What will happen when you run this code ?

 Drug_study["activity"] <- c(4,NA,12.1,2.5)



question Question

What will happen when you run this code ?

 subset(Drug_study,c(ID,treatment))



question Question

What will happen when you run this code ?

 subset(Drug_study,,c(ID,treatment))



comment Comment

The order of the arguments is important except when you specify their names.

hands_on Hands-on: Extra exercise 10c

On which days did we observe more than 2 infected plants in the plant experiment? Answer this question with and without using the subset() function.

solution Solution
 > Plant_study[Plant_study$Plants > 2,"Days"] > subset(Plant_study,Plants > 2,Days)  question Question What will happen when you run this code ?  Plant_study[Plant_study["Plants"] > 2,"Days"]  hands_on Hands-on: Extra exercise 10d 1. Create vector q by extracting the a column of data frame ab (exercise 9) with and without subset(). 2. Retrieve the second element of column a of data frame ab 3. Add column c with elements 2,1,4,7 to data frame ab solution Solution  q <- ab$a
subset(q,select=a)
ab$a[2] ab$c <- c(2,1,4,7)

Removing elements from a data frame

To remove elements from a data frame use negative indices just as in a vector e.g. to remove the second row from data frame D use:

D <- D[-2,]


comment Comment

The minus sign only works with numbers not with column labels.

To remove columns based on labels assign them to NULL:

keypoints Key points

• We showed how to create a matrix
• We showed how to create a data frame
• We showed how to refer to the elements of a data frame
• We showed how to remove elements from a data frame
• We showed how to reorder columns in a data frame