# 03 Vectors and factors

question Questions

- How do you work with vectors and factors?

objectives Objectives

- learn how to create vectors
- learn how to extract elements from a vector
- learn how to change elements in a vector
- learn how to perform logical and arithmetic operations on vectors
- learn how to create factors

time Time estimation: 75 minutes

# Data structures in R

The power of R lies not in its ability to work with simple numbers but in its ability to work with large datasets. R has a wide variety of data structures including scalars, vectors, matrices, data frames, and lists.

### Vectors

The simplest data structure is the *vector*, a single row consisting of data values of the same type, e.g. all numbers, characters, Booleans…

#### Creating a vector

The function **c()** (short for “combine values” in a vector) is used to create vectors. The only arguments that need to be passed to c() are the values that you want to combine into a vector.
You can create a **numeric** (a), **character** (b) or **logical** (c) vector:

```
a <- c(1,2,5.3,6,-2,4)
b <- c("janick","jasper","niels")
c <- c(TRUE,TRUE,TRUE,FALSE,TRUE,FALSE)
```

You can also create a vector by **joining existing vectors with the c() function:**

```
x1 <- c(1,2,3)
x2 <- c(3,4)
c(x1,x2)
# [1] 1 2 3 3 4
```

## hands_on Hands-on: Demo

From the demo script, run the “Data Creation: vectors” section.

## hands_on Hands-on: Exercise 4a

You count every day how many plants of the initial set of 40 plants developed lesions as a result of a mold infection.

- Create a vector called Plants_with_lesions containing the results of your counts: 1,3,4,2,6
- Create a vector days containing the days of the week in the following format: Mon, Tues, Wednes, Thurs, Fri.

## solution Solution

`Plants_with_lesions <- c(1,3,4,2,6)`

`days <- c("Mon","Tues","Wednes","Thurs","Fri")`

## hands_on Hands-on: Extra exercise 4b

Create a vector newVector with the following elements: 2,5,5,3,3,6,2 and print its content.

## solution Solution

`newVector <- c(2,5,5,3,3,6,2)`

`newVector`

If you need a sequence of consecutive integers, you can create it with the **start:end** notation, e.g. a vector with values from 5 through 9:

```
5:9
# [1] 5 6 7 8 9
```

You can also define a decreasing sequence of integers:

```
9:5
# [1] 9 8 7 6 5
```

You can create the same vector with the seq() function:

```
seq(5,9)
# [1] 5 6 7 8 9
```

But seq() (short for sequence) can do a lot more: it allows increments other than 1. Its main arguments are:

- *from*: the first number in the sequence
- *to*: the last possible number in the sequence
- *by*: the increment. If from < to the increment must be positive; if from > to it must be negative, otherwise R reports an error.
- *length.out*: the number of elements in the vector; it can be given instead of *to* or *by*

As you can see, some arguments of a function have a name, e.g. the increment argument is called *by*.
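The arguments described above can be tried out in a short sketch. The variable names (up, down, evenly) are only illustrative and not part of the course material:

```r
# Count up from 5 to at most 9 in steps of 2
up <- seq(from = 5, to = 9, by = 2)
up
# [1] 5 7 9

# Counting down requires a negative increment
down <- seq(from = 9, to = 5, by = -2)
down
# [1] 9 7 5

# Ask for a number of elements instead of an increment
evenly <- seq(from = 0, to = 1, length.out = 5)
evenly
# [1] 0.00 0.25 0.50 0.75 1.00
```

Naming the arguments makes the call easier to read and lets you pass them in any order.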

The **rep()** function **repeats** a value a specified number of times.

```
rep("bla", 3)
# [1] "bla" "bla" "bla"
```

You can combine these functions with the c() function to make more complicated vectors:

```
c(rep(1,3), rep(2,3), rep(3,3))
# [1] 1 1 1 2 2 2 3 3 3
```
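As a possible shortcut, rep() also accepts a whole vector together with the *each* or *times* argument. The sketch below builds the same vector as above without c(); the variable names are only illustrative:

```r
# Repeat every element 3 times before moving on to the next one
blocks <- rep(1:3, each = 3)
blocks
# [1] 1 1 1 2 2 2 3 3 3

# Repeat the whole vector 2 times
cycles <- rep(1:3, times = 2)
cycles
# [1] 1 2 3 1 2 3
```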

To generate a **random** set of **numbers** drawn from a normal distribution with a given mean and spread use the **rnorm(n, mean = 0, sd = 1)** function where:

- *n*: how many random numbers you want
- *mean*: mean of the normal distribution
- *sd*: standard deviation of the normal distribution

`rnorm(1000, 3, 0.25)` generates 1000 numbers from a normal distribution with mean 3 and standard deviation 0.25.

The normal distribution implies that numbers close to the mean have a higher probability of occurring than numbers far from the mean.
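A minimal sketch to check this behaviour yourself. The call to set.seed() and the seed value 42 are assumptions added here so that the random numbers are reproducible; the exact values you get are not shown because they depend on the seed:

```r
set.seed(42)  # arbitrary seed, only to make the draws reproducible

draws <- rnorm(1000, mean = 3, sd = 0.25)

length(draws)
# [1] 1000

# The sample mean and standard deviation should be close to,
# but not exactly equal to, the values that were requested
mean(draws)
sd(draws)
```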

If you want a set of random numbers from a uniform distribution (every number in the specified range has the same probability of being drawn) you can use the **runif(n, min=0, max=1)** function where:

- *n*: how many random numbers you want
- *min*: lowest number of the range of numbers to choose from
- *max*: highest number of the range of numbers to choose from
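For instance, the following sketch draws 5 numbers between 10 and 20. The seed and the range are arbitrary choices made for illustration:

```r
set.seed(1)  # arbitrary seed, only to make the draws reproducible

u <- runif(5, min = 10, max = 20)

# Every value lies inside the requested range
all(u >= 10 & u <= 20)
# [1] TRUE
```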

The most freedom is given by the **sample(x, size, replace = FALSE)** function: it takes a random sample of a specified size from the elements of x either with or without replacement:

- *x*: a vector of elements from which to choose
- *size*: how many elements you want to draw
- *replace*: whether sampled elements are placed back in the set or not

`sample(c(0,1), 100, replace=TRUE)` generates a set of 100 random zeros or ones.

Suppose you want to simulate 10 rolls of a die. Because the outcome of a single roll is a number between 1 and 6, your code looks like this:

```
sample(1:6, 10, replace=TRUE)
# [1] 2 2 5 3 5 3 5 6 3 5
```

You tell sample() to return 10 values, each in the range 1:6. Because every roll of a die is independent, you sample with replacement. This means that you put the element you've drawn back into the list of values to choose from.
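The difference between sampling with and without replacement can be sketched as follows. The seed is an arbitrary assumption for reproducibility. Note that asking for 10 values from 1:6 without replacement would fail, because only 6 values are available:

```r
set.seed(123)  # arbitrary seed, only to make the draws reproducible

# With replacement: values may occur more than once
rolls <- sample(1:6, 10, replace = TRUE)
length(rolls)
# [1] 10
all(rolls %in% 1:6)
# [1] TRUE

# Without replacement (the default): a random shuffle of 1:6,
# every value occurs exactly once
shuffled <- sample(1:6)
sort(shuffled)
# [1] 1 2 3 4 5 6
```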

## hands_on Hands-on: Exercise 4c

For a study checking the effect of a drug on a disease, we want to store patient info.

- Create a vector named ID containing numerical values 1,2,3,4
- Create a vector named treatment containing values A, placebo, B, and a missing value.
- Use the rep() function to create a vector called smoking containing the Booleans TRUE, TRUE, TRUE and FALSE. Check the documentation and the examples of usage of rep().

## solution Solution

`ID <- 1:4`

`treatment <- c("A","placebo","B",NA)`

`smoking <- c(rep(TRUE,3),FALSE)`

## question Question

What will happen when you run this code ?

`smoking <- c(rep(true,3),false)`

## question Question

What will happen when you run this code ?

`smoking <- c(rep("true",3),"false")`

## hands_on Hands-on: Extra exercise 4d

Create vector threes consisting of 3,3,3,3,3,3,3 and print the content of threes

## solution Solution

`threes <- rep(3,7)`

`threes`

## hands_on Hands-on: Extra exercise 4e

Print ha ha ha ha

## solution Solution

`rep("ha",4)`

## comment Comment

Vectors cannot hold values of different types! R automatically converts all values to the same type so that the vector can hold them. If one of the values is a string, all values are converted to strings; in a mix of integers and Booleans, all values are converted to integers (TRUE becomes 1, FALSE becomes 0).
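You can check this conversion with class(). The vectors below are made up for illustration:

```r
# A string forces every value to become a string
mixed <- c(1, "a", TRUE)
mixed
# [1] "1"    "a"    "TRUE"
class(mixed)
# [1] "character"

# Integers and Booleans: the Booleans become 1 and 0
counts <- c(1L, TRUE, FALSE)
counts
# [1] 1 1 0
class(counts)
# [1] "integer"
```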

## comment Comment

Words used as values have to be written between quotes, words used as variable names do not! If R encounters a word without quotes it will try to find a variable with that name.
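A small sketch of the difference, using a hypothetical variable called niels that is created only for this illustration:

```r
niels <- "a stored value"  # hypothetical variable, for illustration only

quoted <- c("niels")       # with quotes: the literal text
quoted
# [1] "niels"

unquoted <- c(niels)       # without quotes: the content of the variable niels
unquoted
# [1] "a stored value"
```

If no variable with that name exists, R stops with an "object not found" error.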

#### Referring to elements of a vector

Every element in a vector is assigned an index (= its position in the vector) in the order in which elements were entered. This index starts with one, not zero.

You can extract elements from vectors in two ways:

- You directly identify specific elements using their indices
- You create a logical operation to select certain elements.

To refer to elements of a vector use indices or a logical operation inside square brackets [] e.g. to retrieve the 2nd element of vector a use:

```
a[2]
```

to retrieve the 2nd, 3rd and 4th element of vector a use:

```
a[2:4]
```

to retrieve the 2nd and 4th element of vector a use:

```
a[c(2,4)]
```
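Applied to the numeric vector a created earlier (repeated here so that the example runs on its own), these three ways of indexing give:

```r
a <- c(1, 2, 5.3, 6, -2, 4)

second <- a[2]               # one element
second
# [1] 2

second_to_fourth <- a[2:4]   # a range of consecutive elements
second_to_fourth
# [1] 2.0 5.3 6.0

second_and_fourth <- a[c(2, 4)]  # a selection of separate elements
second_and_fourth
# [1] 2 6
```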

You also see [] when you look at output in the console. The number in between the square brackets is the index of the first value on the line.

```
v <- c(rep(5,10),rep(10,5))
v
#[1]  5  5  5  5  5  5  5  5  5  5 10 10
#[13] 10 10 10
```

There are 12 values on the first line, so on the second line of data, the first value (10) is actually on the 13th position in the vector v. So [13] refers to the index of the first element on the line.

Retrieving elements using a logical operation is done as follows:

```
x <- c(1,3,11,1,7)
x
#[1]  1  3 11  1  7
x[x < 4]
#[1] 1 3 1
```

Retrieving data with logical operators is based on the following fact: every logical statement produces the outcome TRUE or FALSE.

```
x < 4
#[1] TRUE TRUE FALSE TRUE FALSE
```

Logical operators applied to vectors will result in a vector of the same length consisting of TRUE or FALSE values depending on whether the statement is true for the particular element. If you use the outcomes of a logical operation to retrieve elements of a vector, only the elements where the outcome is TRUE will be selected.
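The two steps can be made explicit by storing the logical vector first. The variable name keep is only illustrative; which() and sum() are base R functions that report the positions and the number of TRUE values:

```r
x <- c(1, 3, 11, 1, 7)

keep <- x < 4     # step 1: one TRUE or FALSE per element
keep
# [1]  TRUE  TRUE FALSE  TRUE FALSE

selected <- x[keep]   # step 2: keep only the elements where the outcome is TRUE
selected
# [1] 1 3 1

which(keep)       # positions of the selected elements
# [1] 1 2 4

sum(keep)         # number of selected elements
# [1] 3
```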

## hands_on Hands-on: Demo

From the demo script, run the “Data extraction: vectors” section.

## hands_on Hands-on: Exercise 5a

Create a vector named x containing the numbers 20 to 2. Retrieve elements that are larger than 5 and smaller than 15.

## solution Solution

`x <- 20:2`

`x[x > 5 & x < 15]`

## question Question

What will happen when you run this code ?

`x[15 > x > 5]`

## question Question

What will happen when you run this code ?

`x(x > 5 & x < 15)`

## question Question

What will happen when you run this code ?

`x[x > 5] & x[x < 15]`

## hands_on Hands-on: Exercise 5b

- Retrieve the 4th and 5th elements from the days vector.
- Retrieve elements from Plants_with_lesions that are larger than 2.

## solution Solution

`days[c(4,5)]`

`Plants_with_lesions[Plants_with_lesions > 2]`

## question Question

What will happen when you run this code ?

`days[4,5]`

## question Question

What will happen when you run this code ?

`days[4:5]`

## question Question

What will happen when you run this code ?

`days(4:5)`

## hands_on Hands-on: Extra exercise 5c

Create vector y with elements 9,2,4 and retrieve the second element of y.

## solution Solution

`y <- c(9,2,4)`

`y[2]`

## hands_on Hands-on: Extra exercise 5d

- Create vector z with elements 1, 2, 3, 4, 12, 31, 2, 51, 23, 1, 23, 2341, 23, 512, 32, 312, 123, 21, 3
- Retrieve the 3rd, 4th, 5th, 6th and 7th element
- Retrieve the 2nd and 4th element
- Retrieve elements from z that are larger than 100

## solution Solution

`z <- c(1,2,3,4,12,31,2,51,23,1,23,2341,23,512,32,312,123,21,3)`

`z[3:7]`

`z[c(2,4)]`

`z[z > 100]`

## hands_on Hands-on: Demo

From the demo script, run the “Logical and arithmetic operations on variables” section.

## hands_on Hands-on: Extra exercise 5h

Retrieve elements from newVector (exercise 4b) that are larger than the corresponding elements of vector threes (exercise 4d).

## solution Solution

`newVector[newVector > threes]`

#### Removing, changing or adding elements in a vector

To remove an element from a vector, use a negative index: “-” indicates “NOT”, followed by the index of the element you want to remove, e.g. to remove the second element of vector z use:

```
z <- z[-2]
```

Change an element by assigning a new value to its index; add an element by assigning a value to an index just past the end of the vector.
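A minimal sketch of changing, adding and removing elements, using a small made-up vector w so that the vectors from the exercises are left untouched:

```r
w <- c(9, 2, 4)

w[2] <- 10               # change the 2nd element
w
# [1]  9 10  4

w[length(w) + 1] <- 7    # add an element at the next free position
w
# [1]  9 10  4  7

w <- c(w, 1)             # adding with c() gives the same kind of result
w
# [1]  9 10  4  7  1

w <- w[-1]               # remove the 1st element
w
# [1] 10  4  7  1
```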

## hands_on Hands-on: Demo

From the demo script, run the “Data removal vectors” section.

## hands_on Hands-on: Exercise 6a

From vector x (exercise 5a) remove the first 8 elements and store the result in x2.

## solution Solution

`x2 <- x[-(1:8)]`

`x2`

## question Question

What will happen when you run this code ?

`x2 <- x[-1:8]`

## hands_on Hands-on: Extra exercise 6b

Retrieve the same elements from z as in the second step of exercise 5d, but first replace the 3rd element with 7.

## solution Solution

`z[3] <- 7`

`z[3:7]`

### Factors

You can tell R that a variable is categorical (= text labels representing categories although sometimes numbers are also used) by making it a factor.

The difference between a categorical variable and a continuous variable is that a categorical variable represents a limited number of categories. A continuous variable is the result of a measurement and can correspond to an infinite number of values.

In most cases categorical data is used to **describe** other data, e.g. which group a measurement belongs to; it is not used in calculations. Storing data as factors ensures that the graphing and statistical functions in R will treat such data correctly.

There are two types of categorical data:

- unranked categorical data do not have an implied order
- ranked categorical data do have a natural ordering

R will treat factors by default as unranked but you can create ordered (ranked) factors.

To create a factor, first create a vector and then convert it to a factor using the factor() function:

```
v <- c(1,4,4,4,3,5,4,4,5,3,2,5,4,3,1,3,1,5,3,4)
v
#[1] 1 4 4 4 3 5 4 4 5 3 2 5 4 3 1 3 1 5 3 4
f <- factor(v,ordered=TRUE)
f
#[1] 1 4 4 4 3 5 4 4 5 3 2 5 4 3 1 3 1 5 3 4
#Levels: 1 < 2 < 3 < 4 < 5
```

## comment Comment

The factor() function creates “Levels”: these are the labels of the categories.

The only **required argument** of the factor() function is a **vector** of values which will be factorized. Both numeric and character vectors can be made into factors but you will use factor() typically for numerical data that represents categories.
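When you factorize numbers, it can help to look at how R stores the result. The sketch below uses a short made-up vector. levels() shows the category labels, while as.integer() returns the internal position of each value among the levels, not the original number:

```r
scores <- c(1, 4, 4, 3, 5)
f_scores <- factor(scores)

levels(f_scores)        # the category labels, stored as text
# [1] "1" "3" "4" "5"

as.integer(f_scores)    # internal codes: positions within the levels
# [1] 1 3 3 2 4

# To get the original numbers back, go through the labels
as.numeric(as.character(f_scores))
# [1] 1 4 4 3 5
```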

Whether text is turned into a factor automatically depends on your R version. Before R 4.0.0, text stored as a column in a data frame was converted to a factor by default. This included data imported with the read.*() functions such as read.table() and read.csv(), which store the data in a data frame. Since R 4.0.0 the default is stringsAsFactors = FALSE, so text columns remain character vectors until you factorize them yourself with factor().

So in older R versions you use factor() mainly to factorize **numbers** that represent categories, while in R 4.0.0 and later you also use it for text.

By default, factor() transforms a vector into an unordered factor, as does the automatic factorization by the read.*() functions in older R versions. The levels of an unordered factor are sorted alphabetically: High will be plotted before Low since H comes first in the alphabet.
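The default ordering of the levels can be seen in a small sketch. The vector intensity is made up for illustration:

```r
intensity <- factor(c("Low", "High", "Low", "High"))

# Levels are sorted alphabetically, not in order of appearance
levels(intensity)
# [1] "High" "Low"

is.ordered(intensity)
# [1] FALSE
```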

If the categories are ranked, you have to create an ordered factor by adding two arguments:

- *ordered*: set to TRUE to indicate that the factor is ordered
- *levels*: a vector of category labels (as strings) in the correct order
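As an illustration, the following sketch creates an ordered factor from made-up category labels. Because the factor is ordered, its values can be compared with < and >:

```r
dose <- factor(c("Low", "High", "Medium", "Low"),
               levels = c("Low", "Medium", "High"),
               ordered = TRUE)

levels(dose)          # the order given in levels, not alphabetical
# [1] "Low"    "Medium" "High"

is.ordered(dose)
# [1] TRUE

below_high <- dose < "High"   # comparisons follow the order of the levels
below_high
# [1]  TRUE FALSE  TRUE  TRUE
```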

## hands_on Hands-on: Demo

From the demo script, run the “Data creation: factors” section.

## hands_on Hands-on: Extra exercise 7a

- Create a vector gender with the following elements: Male, Female, male.
- Convert gender into a factor with levels: Male and Female
- Print the content of the factor. What happens?

## solution Solution

`gender <- c("Male","Female","male")`

`gender <- factor(gender,levels=c("Male","Female"))`

`gender`

## keypoints Key points

- We showed how to create a vector
- We showed how to refer to the elements of a vector
- We showed how to remove elements from a vector
- We showed how to change or add elements to a vector
- We showed how to create a factor