Background

Programming is a structured way of telling the computer what to do. A key advantage of R relative to other statistics packages is that it is also a full-featured programming language. This means that by becoming proficient in R, you can do far more than just statistics. Some of the many benefits of working in R include:

  • It’s free.
  • It’s open (anyone can build on it and inspect every element).
  • It’s popular (which means that all main stream analyses are probably implemented, and there is a large, active support community).
  • You can carry out complex data manipulation in a way that is reproducible.
  • You can write your own statistical procedures.
  • You can build and explore dynamic models tailored to your specific study system instead of relying on a restricted pool methods.
  • You can easily retrace your steps and allow others to reproduce your work.
  • You can keep your data, and data analysis separate which means you are less likely to experience irreparable data alteration and/or loss.

However, working with R is not without its downsides, including:

  • There is a learning curve that must be overcome
  • Unlike point-and-click programs, R Does not guide you down a specific analytical path, so you need to know what tests to apply to each situation. In other words, it assumes you know what you want to do and gives you the tools to get there, but does not provide you with recipe book style guidance along the way.
  • Because of the popularity and open-source nature of R, package quality is highly variable. Anybody can make an R package and there is no quality control on whether it is doing anything meaningful under the hood. This means you need to be very careful with which packages you chose to use.
  • R and its growing collection of interdependent packages are constantly being updated, so a script may ‘break’ as updates occur.

In this practical you will be asked to:

  • Execute basic R commands
  • Apply built in functions
  • Write your own R functions
  • Import and inspect data sets
  • Subset elements of a data set

You are asked to complete the following exercises and submit to Canvas before the deadline. In addition to the points detailed below, 5 points are assigned to the cleanliness of the code and resulting pdf document. Only knit documents (.pdf, .doc, or .html) will be accepted. Unknit .Rmd files will not be graded.


Installing R

In order to complete the practicals for this course you will need to install R, R Studio, and rmarkdown. If these are already installed on your computer, feel free to skip this section.

You will first need to download and install the most recent version of R. The software is freely available from https://cran.r-project.org. Once complete, the installation process should provide you with a clickable icon that can be used to start the program. Run the program to ensure your installation ran through to completion without issue.

After installing R, you are encouraged to install RStudio. While not strictly necessary, RStudio provides an enhanced interface that makes working with R significantly more user friendly. RStudio Desktop is freely available from https://rstudio.com/products/rstudio/download/.

The final piece of software that is required for completing the practicals is rmarkdown. RMarkdown documents weave together narrative text and code to produce formatted, fully reproducible outputs. If you are unfamiliar with RMarkdown, a short tutorial is available from https://rmarkdown.rstudio.com/articles_intro.html

RMarkdown files are designed to be used with the rmarkdown package. rmarkdown comes installed with the RStudio, but you can also acquire rmarkdown from CRAN by entering the command (see below for more information on installing packages)

install.packages("rmarkdown")


Conceptual overview of R

Concept 1: Command line

R is a command line program that takes in written commands and passes them to the computer to run. When you start R, the first thing you see is a command prompt >. This tells you that the program is waiting for a command. After entering a command, and issuing it by hitting ENTER, one of three things can happen:

A correctly issued command will run through to completion

2+2
## [1] 4

An incomplete command will ask to be completed (indicated by a hanging + sign)

2+

+

Note: if this happens and you don’t know how to proceed, you can hit esc to cancel the incomplete command.

An incorrect command will return an error message

2+X
## Error in eval(expr, envir, enclos): object 'X' not found

When entering commands, it is important to know that R will ignore any text that follows after a hash symbol #. This is a useful way for adding comments in an R script, or blocking out certain unwanted commands without deleting them outright.

The following text returns an error so the command below does not run
2+2
## Error: <text>:1:5: unexpected symbol
## 1: The following
##         ^
#But the hash allows you to include readable text without generating errors
#Allowing for code to be `commented` without issue
2+2
## [1] 4

Note: Commenting code is a critically important aspect of command-line based data analysis. It allows you to remember steps long after a project was conducted, and also allows to more easily you share your code with others.

Concept 2: Procedural work flow

When working in R, you issue statements that the program evaluates sequentially.

#Statement 1
2+2
## [1] 4
#Statement 2
2-2
## [1] 0

Blocks of code denoted by {} define statements spanning multiple lines

#Statement 1
2+2
## [1] 4
#Statement 2
{2-2
2/2
  }
## [1] 1

Note how only the last result of the blocked statement has been printed to the screen. Everything within the block is run, but only a single, final output is returned. This is the intended behaviour of a blocked statement.

Concept 3: Variables

Variables are a core aspect of the way that R functions. Variables are named pieces of the computer’s memory (stored in RAM). A variable can be named almost anything, but names need to start with a letter. They can contain letters, numbers, ., or _, but can not be one of R’s reserved words/names/symbols. You can store values in variables (which stores them in the computer’s memory), and use those values in later calculations. Variables are signed using the assignment operator <-. Variables are also case sensitive, which means that x is not the same as X.

X <- 2+7

Y <- FALSE

pizza <- "tasty"

Note: Although variables can have nearly any name, informative names are ideal. You should try to develop a simple yet flexible naming structure instead of relying on interesting, yet difficult to remember naming structures.

Variables can also be deleted (i.e., removed from the computer’s memory) using the rm() function

X
## [1] 9
rm(X)
X
## Error in eval(expr, envir, enclos): object 'X' not found

Variables do not have to be a single value and can take on more complex structure. For example, vectors are fundamental to programming, and many programs you write will build up vectors. All elements of a vector must be of the same type: integers, real numbers, character strings, etc… Note: more details on vectors will show up below.

Z <- c(1,2,3)
  1. 3 points
    • Calculate 3 plus 8 – 0.5 point(s)
    • Calculate 4 times 3 and store the results in a variable called ‘x’ – 0.5 point(s)
    • Calculate x divided by 6 and store the results in a variable called ‘y’ – 0.5 point(s)
    • Calculate 3 times x minus y – 0.5 point(s)
    • Make a vector of length 6, call it anything you like – 0.5 point(s)
    • Provide comments on every step with # – 0.5 point(s)

Concept 4: Functions

Functions are the core workhorses of the R environment. They are pieces of packaged code that take some input (the arguments) and return some output. R has many built-in functions: mean(), sd(), cor(), anova(), t.test(), etc… If you want more information on a function you can use the ? operator to see the documentation (e.g., ?mean). Alternatively, you can use help('mean')

Z <- c(1,2,3,4,5)

mean(Z)
## [1] 3

You can also write your own functions using the function() function. Let’s write an R function that multiples any input number by 5.

times.five <- function(input){
  input*5
}

times.five(1)
## [1] 5
times.five(5)
## [1] 25
times.five(100)
## [1] 500

Note the use of the block code operator.

  1. 3 points
    • Write a function that takes any input and repeats it 10 times. – 2 point(s)
    • Apply that function to a number, and a piece of text. – 1 point(s)

    Hint: The function should be designed to handle both text (class character) and numbers (class numeric).

Concept 5: The working directory

When you run an R session, the program is always ‘pointing’ towards a location on your computer. This is called the working directory. It is the location where R will search for, and/or save any files. The first step in any project should be to set the working directory so you know where R will be pulling files from (so you can import the right data), and saving files (so you can find and inspect any results).

The working directory can be set using the setwd() function (see help(setwd)). You can also identify the current working directory using the getwd() function. You can also list all the files in the current working directory using list.files(). Note: The process of setting the working directory in RMarkdown (i.e., .Rmd files) is slightly different than when using R scripts directly (i.e., .R files). For help on this see: https://bookdown.org/yihui/rmarkdown-cookbook/working-directory.html


Working with variables

Data import

Data sets are essentially complex variables. They can be imported into R using the read.csv() function. R, and many R packages also have a number of built in data sets that can be imported using the data() function. For example the iris data set gives the measurements in centimeters of the variables sepal length, sepal width, petal length and petal width, for 50 flowers from each of 3 species of iris.

data("iris")

Data inspection

When you import a data set, R tries to automatically determine what class of information is in each column (e.g., numeric, factor, string of text, etc…). The first steps after importing a data set should always be to inspect the data to make sure the import was correct. This is done by applying the following functions to the data set variable: str(), summary(), class(), names(), head(), tail(), View().

  1. 2 points
    • Apply str(), summary(), class(), names(), head(), and tail(), to the mtcars dataset and briefly describe the outcome. – 2 point(s)

Sub setting elements

There are three operators that can be used to extract subsets of R objects.

  • The [ operator always returns an object of the same class as the original. It can be used to select multiple elements of an object

  • The [[ operator is used to extract elements of a list or a data frame. It can only be used to extract a single element and the class of the returned object will not necessarily be a list or data frame.

  • The $ operator is used to extract elements of a list or data frame by name.

In a standard data set with some number of rows and columns, the [ operator can be used to extract specific values from a data set: DATA[row,column]. Rows are typically indexed by i and columns indexed by j, such that DATA[i,j] denotes the ith row in the jth column.

The command [,1] will return the first column of a data set

mtcars[,1]
##  [1] 21.0 21.0 22.8 21.4 18.7 18.1 14.3 24.4 22.8 19.2 17.8 16.4 17.3 15.2 10.4
## [16] 10.4 14.7 32.4 30.4 33.9 21.5 15.5 15.2 13.3 19.2 27.3 26.0 30.4 15.8 19.7
## [31] 15.0 21.4

The first column of a data set can also be extracted by name

mtcars$mpg
##  [1] 21.0 21.0 22.8 21.4 18.7 18.1 14.3 24.4 22.8 19.2 17.8 16.4 17.3 15.2 10.4
## [16] 10.4 14.7 32.4 30.4 33.9 21.5 15.5 15.2 13.3 19.2 27.3 26.0 30.4 15.8 19.7
## [31] 15.0 21.4

The command [1,] will return the first row of a data set

mtcars[1,]
##           mpg cyl disp  hp drat   wt  qsec vs am gear carb
## Mazda RX4  21   6  160 110  3.9 2.62 16.46  0  1    4    4

Conditional sub setting

You will often want to select certain parts of a data set if some condition is TRUE, and/or remove parts if that condition is FALSE. The which() function allows you to identify elements that satisfy a condition (e.g., ==, >, <=, !=).

#Identify which cars have a miles/gallon greater than or equal to 25
KEEPERS <- which(mtcars$mpg >=25)

#Extract those rows
mtcars[KEEPERS,]
##                 mpg cyl  disp  hp drat    wt  qsec vs am gear carb
## Fiat 128       32.4   4  78.7  66 4.08 2.200 19.47  1  1    4    1
## Honda Civic    30.4   4  75.7  52 4.93 1.615 18.52  1  1    4    2
## Toyota Corolla 33.9   4  71.1  65 4.22 1.835 19.90  1  1    4    1
## Fiat X1-9      27.3   4  79.0  66 4.08 1.935 18.90  1  1    4    1
## Porsche 914-2  26.0   4 120.3  91 4.43 2.140 16.70  0  1    5    2
## Lotus Europa   30.4   4  95.1 113 3.77 1.513 16.90  1  1    5    2
  1. 6 points
      Using the iris data set:
    • Get the variance in petal length. – 1 point(s)
    • Select the second, third, and seventh rows of the second and third columns. – 1 point(s)
    • Make a new variable called ‘SUBSET’ from rows where sepal length is less than 7. – 1 point(s)
    • Find the median petal length of Iris versicolor. – 2 point(s)
    • Comment every step. – 1 point(s)

Vectors

A vector is the simplest type of data structure in R. Simply, a vector is a sequence of data elements, each of the same basic type. Vectors can be created by using the c() function, which is a generic function that combines its arguments.

help('c')

X <- c(2,3,5)

X
## [1] 2 3 5
X[1] # 2
## [1] 2
X[2] # 3
## [1] 3
X[3] # 5
## [1] 5
class(X)
## [1] "numeric"
  1. 1 point
    • Create a vector named Y that is the reverse of X. – 0.5 point(s)
    • Then check Y by printing it out. – 0.5 point(s)

Vector classes

As was just noted, a vector is a sequence of data elements, each of the SAME BASIC TYPE. This means that all arguments are coerced to a common type which is the type of the returned value, and all attributes except names are removed.

# numeric vector
2:4
## [1] 2 3 4
# character vector
c('cat','dog','bat')
## [1] "cat" "dog" "bat"
# logical vector
c(TRUE,FALSE,TRUE)
## [1]  TRUE FALSE  TRUE
# vector (and array) elements must all be the same 'class'
Z <- c(1,'cat','dog')
# vector (and array) elements can only be simple classes

Z[1]
## [1] "1"
Z[1] + 1
## Error in Z[1] + 1: non-numeric argument to binary operator
class(Z[1])
## [1] "character"
  1. 2 points
    • Create a vector of the numbers 1, 2, and 3. – 0.5 point(s)
    • Create a character vector “COL” with 3 color names from the output of colors(). – 0.5 point(s)
    • Plot the vector, colouring each point with your chosen colors. – 1 point(s)

Sequences

Sequences of numbers are used in many different tasks, from plotting the axes of graphs to generating simulated data. The simplest way to create a sequence of numbers in R is by using the : operator, or the seq() function

help('seq')

seq(from=2,to=4,by=1)
## [1] 2 3 4
seq(2,4,1)
## [1] 2 3 4
seq(2,4)
## [1] 2 3 4
help(':')

2:4
## [1] 2 3 4
  1. 1 point
    • Create a sequence of integers from 1 to 20 in steps of 1. – 1 point(s)

Lists

A list is a generic vector containing other objects. The elements of a list (also called slots) can be of different types like − numbers, strings, vectors, even other lists. A list can also contain a matrix or a function as its elements. List is created using list() function.

#Vectors cannot combine multiple input types
A <- vector(1,TRUE,'cat')
## Error in vector(1, TRUE, "cat"): unused argument ("cat")
#But lists are more flexible
A <- list(1,TRUE,'cat')
A
## [[1]]
## [1] 1
## 
## [[2]]
## [1] TRUE
## 
## [[3]]
## [1] "cat"

Indexing lists is slightly more complicated than indexing a vector

A[1] # a list of length one
## [[1]]
## [1] 1
class(A[1])
## [1] "list"
A[[1]] # the first element of the list
## [1] 1
class(A[[1]])
## [1] "numeric"
A[[1]] + 1
## [1] 2

We can also pull out list elements by name

# the same list, but with named elements
B <- list(number=1,mammal=TRUE,taxa='cat')
B
## $number
## [1] 1
## 
## $mammal
## [1] TRUE
## 
## $taxa
## [1] "cat"
# different ways of accessing list elements
B[[3]] # indexed by number
## [1] "cat"
B[['taxa']] # indexed by name
## [1] "cat"
B$taxa # "slot" by name
## [1] "cat"
  1. 2 points
    • Create a list C with your favorite number, color, and species—named number, color, species (the color must be selected from the output of colors()). – 1.5 point(s)
    • Run the following: plot(C$number,col=C$color,main=C$species). – 0.5 point(s)

for loops

Let’s say we have a repeated task like naming all of R’s colours (note the print()function will print out whatever you input).

COLS <- colors()[1:10]
print(COLS[1])
## [1] "white"
print(COLS[2])
## [1] "aliceblue"
print(COLS[3])
## [1] "antiquewhite"

but we don’t want to type (or copy/paste/edit) the same code over and over. A for loop is a way to repeat a blocked sequence of instructions. This will allow you to automate parts of your code that are in need of repetition.

for(i in 1:length(COLS)){
  print(COLS[i])
}
## [1] "white"
## [1] "aliceblue"
## [1] "antiquewhite"
## [1] "antiquewhite1"
## [1] "antiquewhite2"
## [1] "antiquewhite3"
## [1] "antiquewhite4"
## [1] "aquamarine"
## [1] "aquamarine1"
## [1] "aquamarine2"

An easy way to understand what is going on in the for loop is by reading it as follows: For each number from 1 to the number of colours in your vector COLS, you execute the code chunk print(COLS[i]), which prints the indexed colour. Once the for loop has executed the code chunk for every colour in the vector, the loop stops and goes to the next command after the loop block.

  1. 5 points
    • Create a for loop that prints out the numbers 1 through 10 squared. – 5 point(s)

Saving your progress

R makes it easy to re-create the steps of an analysis, but some analyses can take a long time to run. It is always a good idea to save important variables for future use. To save a data set as a file you can open in other applications use the write.csv() function. To save a data set as an R specific file format you can use the save() function.

Note: NEVER SAVE YOUR ENTIRE WORKING DIRECTORY! RStudio will ask you this question when you exit the program, but it carries risks and is unreliable. Never rely on this option for saving your progress.


Review

After completing these assignments you should know how to:

  • Pass commands to R
  • Import a data set
  • Inspect a data set after import
  • Subset your data according to desired conditions
  • Apply functions to your data
  • Repeat a task multiple times
  • Save your progress after completing important steps