Programming is a structured way of telling the computer what to do. A
key advantage of R relative to other statistics packages is that it is
also a full-featured programming language. This means that by becoming
proficient in R, you can do far more than just statistics. Some of the
many benefits of working in R include:
However, working with R is not without its downsides, including:
R does not guide you down a specific analytical path, so you need to
know what tests to apply to each situation. In other words, it assumes
you know what you want to do and gives you the tools to get there, but
does not provide you with recipe-book-style guidance along the way.
Package quality in R is highly variable. Anybody can make an R package
and there is no quality control on whether it is doing anything
meaningful under the hood. This means you need to be very careful with
which packages you choose to use.
In this practical you will be asked to:
R commands
R functions
You are asked to complete the following exercises and submit them to Canvas before the deadline. In addition to the points detailed below, 5 points are assigned to the cleanliness of the code and resulting PDF document. Only knit documents (.pdf, .doc, or .html) will be accepted. Unknit .Rmd files will not be graded.
In order to complete the practicals for this course you will need to
install R, RStudio, and rmarkdown. If these are already installed on
your computer, feel free to skip this section.
You will first need to download and install the most recent version
of R. The software is freely available from https://cran.r-project.org.
Once complete, the installation process should provide you with a
clickable icon that can be used to start the program. Run the program
to ensure your installation ran through to completion without issue.
After installing R, you are encouraged to install RStudio. While not
strictly necessary, RStudio provides an enhanced interface that makes
working with R significantly more user-friendly. RStudio Desktop is
freely available from https://rstudio.com/products/rstudio/download/.
The final piece of software that is required for completing the
practicals is rmarkdown. RMarkdown documents weave together narrative
text and code to produce formatted, fully reproducible outputs. If you
are unfamiliar with RMarkdown, a short tutorial is available from
https://rmarkdown.rstudio.com/articles_intro.html
RMarkdown files are designed to be used with the rmarkdown package.
rmarkdown comes installed with RStudio, but you can also acquire
rmarkdown from CRAN by entering the following command (see below for
more information on installing packages):
install.packages("rmarkdown")
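Before installing, it can be handy to check whether a package is already available on your computer. The short sketch below uses the base R function requireNamespace() to do this; the variable name has_rmarkdown is purely illustrative, and the code only reports the result rather than installing anything.

```r
# Check whether rmarkdown is already installed (TRUE/FALSE), without attaching it
has_rmarkdown <- requireNamespace("rmarkdown", quietly = TRUE)

if (has_rmarkdown) {
  message("rmarkdown is already installed")
} else {
  message("rmarkdown is missing; install it with install.packages(\"rmarkdown\")")
}
```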
R is a command line program that takes in written commands and passes
them to the computer to run. When you start R, the first thing you see
is a command prompt >. This tells you that the program is waiting for a
command. After entering a command, and issuing it by hitting ENTER, one
of three things can happen:
A correctly issued command will run through to completion
2+2
## [1] 4
An incomplete command will ask to be completed (indicated by a hanging + sign)
2+
+
Note: if this happens and you don’t know how to proceed, you can hit
esc to cancel the incomplete command.
An incorrect command will return an error message
2+X
## Error in eval(expr, envir, enclos): object 'X' not found
When entering commands, it is important to know that R will ignore any
text that follows a hash symbol #. This is a useful way of adding
comments to an R script, or blocking out certain unwanted commands
without deleting them outright.
The following text returns an error so the command below does not run
2+2
## Error: <text>:1:5: unexpected symbol
## 1: The following
## ^
#But the hash allows you to include readable text without generating errors
#Allowing for code to be `commented` without issue
2+2
## [1] 4
Note: Commenting code is a critically important aspect of command-line based data analysis. It allows you to remember steps long after a project was conducted, and also allows you to share your code with others more easily.
When working in R, you issue statements that the program evaluates
sequentially.
#Statement 1
2+2
## [1] 4
#Statement 2
2-2
## [1] 0
Blocks of code denoted by {} define statements spanning multiple lines
#Statement 1
2+2
## [1] 4
#Statement 2
{2-2
2/2
}
## [1] 1
Note how only the last result of the blocked statement has been printed to the screen. Everything within the block is run, but only a single, final output is returned. This is the intended behaviour of a blocked statement.
Variables are a core aspect of the way that R functions. Variables are
named pieces of the computer’s memory (stored in RAM). A variable can
be named almost anything, but names need to start with a letter. They
can contain letters, numbers, ., or _, but cannot be one of R’s
reserved words/names/symbols. You can store values in variables (which
stores them in the computer’s memory), and use those values in later
calculations. Variables are assigned using the assignment operator <-.
Variables are also case sensitive, which means that x is not the same
as X.
X <- 2+7
Y <- FALSE
pizza <- "tasty"
Note: Although variables can have nearly any name, informative names are ideal. You should try to develop a simple yet flexible naming structure instead of relying on interesting, yet difficult to remember naming structures.
Variables can also be deleted (i.e., removed from the computer’s
memory) using the rm() function
X
## [1] 9
rm(X)
X
## Error in eval(expr, envir, enclos): object 'X' not found
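If you are unsure whether a variable is still in memory, you can check before using it. The sketch below uses the base R function exists(), which is not covered above; tmp_value, before, and after are made-up names used only for illustration.

```r
tmp_value <- 10                     # create a variable
before <- exists("tmp_value")       # TRUE: the variable is in memory
rm(tmp_value)                       # delete it
after <- exists("tmp_value")        # FALSE: it has been removed
c(before = before, after = after)
```

The related function ls() lists the names of all variables currently in memory.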
Variables do not have to be a single value and can take on more complex structure. For example, vectors are fundamental to programming, and many programs you write will build up vectors. All elements of a vector must be of the same type: integers, real numbers, character strings, etc… Note: more details on vectors will show up below.
Z <- c(1,2,3)
#
– 0.5 point(s)
Functions are the core workhorses of the R environment. They are
pieces of packaged code that take some input (the arguments) and return
some output. R has many built-in functions: mean(), sd(), cor(),
anova(), t.test(), etc… If you want more information on a function you
can use the ? operator to see the documentation (e.g., ?mean).
Alternatively, you can use help('mean').
Z <- c(1,2,3,4,5)
mean(Z)
## [1] 3
You can also write your own functions using the function() function.
Let’s write an R function that multiplies any input number by 5.
times.five <- function(input){
input*5
}
times.five(1)
## [1] 5
times.five(5)
## [1] 25
times.five(100)
## [1] 500
Note the use of the block code operator.
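Functions can also take more than one argument, and arguments can be given default values. As an illustrative variation on times.five(), the sketch below adds a second argument; the name times.n and its multiplier argument are assumptions made for this example and are not part of the practical.

```r
# Multiply an input by a chosen multiplier (defaults to 5)
times.n <- function(input, multiplier = 5){
  input * multiplier
}

times.n(2)                  # uses the default multiplier: 10
times.n(2, multiplier = 3)  # overrides the default: 6
times.n(c(1, 2, 3))         # works element-wise on vectors: 5 10 15
```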
Hint: The function should be designed to handle both text (class
character) and numbers (class numeric).
When you run an R session, the program is always ‘pointing’ towards a
location on your computer. This is called the working directory. It is
the location where R will search for, and/or save any files. The first
step in any project should be to set the working directory so you know
where R will be pulling files from (so you can import the right data),
and saving files (so you can find and inspect any results).
The working directory can be set using the setwd() function (see
help(setwd)). You can also identify the current working directory using
the getwd() function. You can also list all the files in the current
working directory using list.files(). Note: The process of setting the
working directory in RMarkdown (i.e., .Rmd files) is slightly different
than when using R scripts directly (i.e., .R files). For help on this see: https://bookdown.org/yihui/rmarkdown-cookbook/working-directory.html
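As a minimal sketch of how these three functions fit together, the example below switches into a temporary folder and back again. tempdir() is used here only so that the example runs on any computer; in a real project you would point setwd() at your own project folder. The name old_wd is illustrative.

```r
old_wd <- getwd()        # remember the current working directory
setwd(tempdir())         # point R at a temporary folder
getwd()                  # confirm the change
list.files()             # list the files in the new working directory
setwd(old_wd)            # return to where we started
```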
Data sets are essentially complex variables. They can be imported
into R using the read.csv() function. R and many R packages also have a
number of built-in data sets that can be loaded using the data()
function. For example, the iris data set gives the measurements in
centimeters of the variables sepal length, sepal width, petal length
and petal width, for 50 flowers from each of 3 species of iris.
data("iris")
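To illustrate read.csv() without needing a file of your own, the sketch below writes a small, made-up table to a temporary file and then imports it again. The data frame pets, its contents, and the file name pets.csv are hypothetical stand-ins for your own data.

```r
# A small made-up data set
pets <- data.frame(name = c("Rex", "Tom"), legs = c(4, 4))

# Write it to a temporary .csv file, then import it again
csv_path <- file.path(tempdir(), "pets.csv")
write.csv(pets, csv_path, row.names = FALSE)
imported <- read.csv(csv_path)
imported
```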
When you import a data set, R tries to automatically determine what
class of information is in each column (e.g., numeric, factor, string of
text, etc…). The first steps after importing a data set should always be
to inspect the data to make sure the import was correct. This is done by
applying the following functions to the data set variable: str(),
summary(), class(), names(), head(), tail(), View().
Apply str(), summary(), class(), names(), head(), and tail() to the
mtcars dataset and briefly describe the outcome. – 2 point(s)
There are three operators that can be used to extract subsets of R
objects.
The [ operator always returns an object of the same class as the
original. It can be used to select multiple elements of an object.
The [[ operator is used to extract elements of a list or a data frame.
It can only be used to extract a single element and the class of the
returned object will not necessarily be a list or data frame.
The $ operator is used to extract elements of a list or data frame by
name.
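The difference between the three operators is easiest to see by checking the class of what each one returns. The sketch below applies them to the mpg column of the built-in mtcars data set; the variable names by_single, by_double, and by_dollar are illustrative.

```r
by_single <- mtcars["mpg"]     # single bracket: still a data frame
by_double <- mtcars[["mpg"]]   # double bracket: the column itself (a numeric vector)
by_dollar <- mtcars$mpg        # dollar sign: the same column, extracted by name

class(by_single)               # "data.frame"
class(by_double)               # "numeric"
identical(by_double, by_dollar)  # TRUE
```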
In a standard data set with some number of rows and columns, the [
operator can be used to extract specific values from a data set:
DATA[row,column]. Rows are typically indexed by i and columns indexed
by j, such that DATA[i,j] denotes the ith row in the jth column.
The command [,1] will return the first column of a data set
mtcars[,1]
## [1] 21.0 21.0 22.8 21.4 18.7 18.1 14.3 24.4 22.8 19.2 17.8 16.4 17.3 15.2 10.4
## [16] 10.4 14.7 32.4 30.4 33.9 21.5 15.5 15.2 13.3 19.2 27.3 26.0 30.4 15.8 19.7
## [31] 15.0 21.4
The first column of a data set can also be extracted by name
mtcars$mpg
## [1] 21.0 21.0 22.8 21.4 18.7 18.1 14.3 24.4 22.8 19.2 17.8 16.4 17.3 15.2 10.4
## [16] 10.4 14.7 32.4 30.4 33.9 21.5 15.5 15.2 13.3 19.2 27.3 26.0 30.4 15.8 19.7
## [31] 15.0 21.4
The command [1,] will return the first row of a data set
mtcars[1,]
## mpg cyl disp hp drat wt qsec vs am gear carb
## Mazda RX4 21 6 160 110 3.9 2.62 16.46 0 1 4 4
You will often want to select certain parts of a data set if some
condition is TRUE, and/or remove parts if that condition is FALSE. The
which() function allows you to identify elements that satisfy a
condition (e.g., ==, >, <=, !=).
#Identify which cars have a miles/gallon greater than or equal to 25
KEEPERS <- which(mtcars$mpg >=25)
#Extract those rows
mtcars[KEEPERS,]
## mpg cyl disp hp drat wt qsec vs am gear carb
## Fiat 128 32.4 4 78.7 66 4.08 2.200 19.47 1 1 4 1
## Honda Civic 30.4 4 75.7 52 4.93 1.615 18.52 1 1 4 2
## Toyota Corolla 33.9 4 71.1 65 4.22 1.835 19.90 1 1 4 1
## Fiat X1-9 27.3 4 79.0 66 4.08 1.935 18.90 1 1 4 1
## Porsche 914-2 26.0 4 120.3 91 4.43 2.140 16.70 0 1 5 2
## Lotus Europa 30.4 4 95.1 113 3.77 1.513 16.90 1 1 5 2
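A logical condition can also be placed directly inside the brackets, and several conditions can be combined with & (and) or | (or). The sketch below is one possible variation on the example above; the horsepower threshold of 90 is arbitrary and the variable names are illustrative.

```r
# Same rows as the which() example, using the condition directly
efficient <- mtcars[mtcars$mpg >= 25, ]

# Combine conditions: at least 25 mpg AND more than 90 horsepower
efficient_strong <- mtcars[mtcars$mpg >= 25 & mtcars$hp > 90, ]
rownames(efficient_strong)   # "Porsche 914-2" "Lotus Europa"
```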
A vector is the simplest type of data structure in R. Simply, a
vector is a sequence of data elements, each of the same basic type.
Vectors can be created by using the c() function, which is a generic
function that combines its arguments.
help('c')
X <- c(2,3,5)
X
## [1] 2 3 5
X[1] # 2
## [1] 2
X[2] # 3
## [1] 3
X[3] # 5
## [1] 5
class(X)
## [1] "numeric"
Y that is the reverse of X. – 0.5 point(s)
Y by printing it out. – 0.5 point(s)
As was just noted, a vector is a sequence of data elements, each of the SAME BASIC TYPE. This means that all arguments are coerced to a common type which is the type of the returned value, and all attributes except names are removed.
# numeric vector
2:4
## [1] 2 3 4
# character vector
c('cat','dog','bat')
## [1] "cat" "dog" "bat"
# logical vector
c(TRUE,FALSE,TRUE)
## [1] TRUE FALSE TRUE
# vector (and array) elements must all be the same 'class'
Z <- c(1,'cat','dog')
# vector (and array) elements can only be simple classes
Z[1]
## [1] "1"
Z[1] + 1
## Error in Z[1] + 1: non-numeric argument to binary operator
class(Z[1])
## [1] "character"
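Coercion follows a predictable order: logical values become numbers when mixed with numbers, and everything becomes text when mixed with text. The sketch below shows this, along with one way to convert text back into a number using the base R function as.numeric(); the variable names are illustrative.

```r
mixed_logical <- c(1, TRUE)     # TRUE is coerced to the number 1
class(mixed_logical)            # "numeric"

mixed_text <- c(1, 'cat')       # 1 is coerced to the text "1"
class(mixed_text)               # "character"

# Text that looks like a number can be converted back explicitly
Z <- c(1,'cat','dog')
as.numeric(Z[1]) + 1            # 2
```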
Sequences of numbers are used in many different tasks, from plotting
the axes of graphs to generating simulated data. The simplest way to
create a sequence of numbers in R is by using the : operator, or the
seq() function
help('seq')
seq(from=2,to=4,by=1)
## [1] 2 3 4
seq(2,4,1)
## [1] 2 3 4
seq(2,4)
## [1] 2 3 4
help(':')
2:4
## [1] 2 3 4
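seq() is more flexible than : because the step size or the number of values can be set directly. The sketch below shows a few common variations, including the related base R helpers seq_len() and seq_along(); the variable names are illustrative.

```r
quarters <- seq(0, 1, by = 0.25)         # fixed step size: 0.00 0.25 0.50 0.75 1.00
five_pts <- seq(0, 10, length.out = 5)   # fixed number of values: 0.0 2.5 5.0 7.5 10.0
first_four <- seq_len(4)                 # 1 2 3 4
positions <- seq_along(c('a','b','c'))   # one index per element: 1 2 3
```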
A list is a generic vector containing other objects. The elements of
a list (also called slots) can be of different types, such as numbers,
strings, vectors, and even other lists. A list can also contain a
matrix or a function as one of its elements. A list is created using
the list() function.
#Vectors cannot combine multiple input types
A <- vector(1,TRUE,'cat')
## Error in vector(1, TRUE, "cat"): unused argument ("cat")
#But lists are more flexible
A <- list(1,TRUE,'cat')
A
## [[1]]
## [1] 1
##
## [[2]]
## [1] TRUE
##
## [[3]]
## [1] "cat"
Indexing lists is slightly more complicated than indexing a vector
A[1] # a list of length one
## [[1]]
## [1] 1
class(A[1])
## [1] "list"
A[[1]] # the first element of the list
## [1] 1
class(A[[1]])
## [1] "numeric"
A[[1]] + 1
## [1] 2
We can also pull out list elements by name
# the same list, but with named elements
B <- list(number=1,mammal=TRUE,taxa='cat')
B
## $number
## [1] 1
##
## $mammal
## [1] TRUE
##
## $taxa
## [1] "cat"
# different ways of accessing list elements
B[[3]] # indexed by number
## [1] "cat"
B[['taxa']] # indexed by name
## [1] "cat"
B$taxa # "slot" by name
## [1] "cat"
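The same operators can be used to add or change list elements, not just to read them. The sketch below rebuilds the list B from above and modifies it; the new element legs and the replacement value 'dog' are made up for illustration.

```r
B <- list(number=1,mammal=TRUE,taxa='cat')

names(B)               # "number" "mammal" "taxa"
B$legs <- 4            # add a new named element
B[['taxa']] <- 'dog'   # replace an existing element
length(B)              # 4
```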
C with your favorite number, color, and species, named number, color,
and species (the color must be selected from the output of colors()). – 1.5 point(s)
plot(C$number,col=C$color,main=C$species). – 0.5 point(s)
Let’s say we have a repeated task like naming all of R’s colours
(note the print() function will print out whatever you input).
COLS <- colors()[1:10]
print(COLS[1])
## [1] "white"
print(COLS[2])
## [1] "aliceblue"
print(COLS[3])
## [1] "antiquewhite"
but we don’t want to type (or copy/paste/edit) the same code over and over. A for loop is a way to repeat a blocked sequence of instructions. This will allow you to automate parts of your code that are in need of repetition.
for(i in 1:length(COLS)){
print(COLS[i])
}
## [1] "white"
## [1] "aliceblue"
## [1] "antiquewhite"
## [1] "antiquewhite1"
## [1] "antiquewhite2"
## [1] "antiquewhite3"
## [1] "antiquewhite4"
## [1] "aquamarine"
## [1] "aquamarine1"
## [1] "aquamarine2"
An easy way to understand what is going on in the for loop is by
reading it as follows: For each number from 1 to the number of colours
in your vector COLS, you execute the code chunk print(COLS[i]), which
prints the indexed colour. Once the for loop has executed the code
chunk for every colour in the vector, the loop stops and goes to the
next command after the loop block.
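Loops are often used to store results rather than print them. The sketch below is one possible variation that converts each colour name to upper case and saves it in a new vector. It uses seq_along(COLS) in place of 1:length(COLS), which gives the same indices here but also behaves sensibly if the vector happens to be empty; the name upper_cols is illustrative.

```r
COLS <- colors()[1:3]
upper_cols <- character(length(COLS))   # pre-allocate an empty result vector

for(i in seq_along(COLS)){
  upper_cols[i] <- toupper(COLS[i])     # store each result instead of printing
}
upper_cols   # "WHITE" "ALICEBLUE" "ANTIQUEWHITE"
```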
R makes it easy to re-create the steps of an analysis, but some
analyses can take a long time to run. It is always a good idea to save
important variables for future use. To save a data set as a file you
can open in other applications use the write.csv() function. To save a
data set as an R specific file format you can use the save() function.
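A minimal sketch of both approaches is shown below, assuming you want to save a copy of the built-in mtcars data set. The files are written to a temporary folder only so the example runs on any computer, and the names out_dir, my_cars, and the file names are illustrative. The base R function load(), not covered above, restores a variable saved with save().

```r
out_dir <- tempdir()   # temporary folder, used here so the example runs anywhere

# Save as a .csv file that other applications can open
write.csv(mtcars, file.path(out_dir, "mtcars.csv"))

# Save as an R-specific file, remove the variable, then restore it with load()
my_cars <- mtcars
save(my_cars, file = file.path(out_dir, "my_cars.RData"))
rm(my_cars)
load(file.path(out_dir, "my_cars.RData"))
head(my_cars, 3)
```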
Note: NEVER SAVE YOUR ENTIRE WORKSPACE! RStudio will ask whether you want to do this when you exit the program, but it carries risks and is unreliable. Never rely on this option for saving your progress.
After completing these assignments you should know how to: