The Interactive Interpreter

Open R and type 2 + 2. R tells you that the answer is 4.

Congrats! You are now using R.

This is the most basic way to interact with the interpreter.

Later we will see how you can save a complicated series of commands to a script file and execute them automatically.

Assignment

Sometimes it is useful to assign the results of computations to a named variable that we will use later.

We can create a named variable using the <- assignment operator.

Note: It is technically possible to use the = symbol for assignment, but this is considered bad form. Get used to typing the <- symbol.

myVariable <- 2 + 2

Now the result from this computation is assigned to myVariable and saved for later use. Later, we can call this variable like so:

myVariable * myVariable
## [1] 16

Atomic vectors

An atomic vector is a series of values stored in a single object.

You can create a vector using the c() function (c stands for combine).

Note: you can only store a single type of data in a vector (e.g. numeric or character data).

vector1 <- c(1,2,3,4,5,6,7,8,9,10)
vector2 <- c(11,12,13,14,15,16,17,18,19,20)
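
As a quick check, the sketch below inspects vectors with the base R functions class() and length(). The two vectors are redefined so that the snippet runs on its own.

```r
# Recreate the two vectors from the example above
vector1 <- c(1, 2, 3, 4, 5, 6, 7, 8, 9, 10)
vector2 <- c(11, 12, 13, 14, 15, 16, 17, 18, 19, 20)

# Every element of an atomic vector shares one type
class(vector1)            # "numeric"
length(vector2)           # 10

# A vector of text values has a different type
class(c("a", "b", "c"))   # "character"
```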

Atomic vectors

Many of the basic functions and operations in R are vectorized, meaning that they operate on every element of a vector in turn, without an explicit loop.

For example, to add each element of vector1 to the corresponding element of vector2, you simply add the vectors, because the + function is vectorized.

This is one of the fundamental advantages of R.

vector1 + vector2
##  [1] 12 14 16 18 20 22 24 26 28 30
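
Vectorization also applies when you combine a vector with a single number, and to many built-in functions. A minimal sketch (the names doubled and roots are made up for this illustration):

```r
vector1 <- c(1, 2, 3, 4, 5, 6, 7, 8, 9, 10)

# Arithmetic with a single number is applied to every element
doubled <- vector1 * 2
doubled              # 2 4 6 8 10 12 14 16 18 20

# Many functions are vectorized too
roots <- sqrt(c(1, 4, 9))
roots                # 1 2 3
```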

Atomic vectors

What happens when I try this?

vector3 <- c(21,22,23,24,25,26,27,28,29,"thirty")

What type of vector is vector3?

What happens when I do this? vector2 + vector3

Accessing values within a vector

Vectors can be indexed with brackets [] to get a subset of values.

What value do you get by typing vector2[3] in the interpreter?

What about vector2[3:5]? HINT: the : operator makes a sequence of integers.
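
Here is the same idea on a different vector, so you can still work out the questions above yourself. The vector heights and its values are made up for this illustration. Note that R counts positions starting from 1.

```r
# A made-up vector used only for this illustration
heights <- c(150, 160, 170, 180, 190)

heights[2]          # one element: 160
heights[2:4]        # a run of elements: 160 170 180
heights[c(1, 5)]    # any positions you list with c(): 150 190
```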

Challenge

Extract the first, fifth, and ninth element of vector2 that you created earlier.

Negative Indexing

When you use negative indices, the referenced elements are left out of the resulting vector.

dwarves <- c("Dopey", "Gimli", "Larry")
dwarves[-3] #real dwarves
## [1] "Dopey" "Gimli"

Note: comments, which start with the octothorpe (AKA the hashtag #), are for people. R ignores everything after the # on a line.
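
One detail worth seeing in code: negative indexing returns a new vector and leaves the original alone. In this sketch the name real_dwarves is made up for the illustration.

```r
dwarves <- c("Dopey", "Gimli", "Larry")

# Negative indexing returns a new, shorter vector...
real_dwarves <- dwarves[-3]
real_dwarves       # "Dopey" "Gimli"

# ...while the original is unchanged until you reassign it
length(dwarves)    # 3
```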

Challenge

Remove the 1st, 5th and 9th elements from vector2

Logical Tests

Logical tests are assertions that R evaluates as either TRUE or FALSE.

For instance, you might assert that "1 plus 1 equals 10 minus 8".

In R that looks like:

1 + 1 == 10 - 8
## [1] TRUE

R tells us that this is TRUE.

Note the double equals symbol ==, which tests for equality. A single equals symbol = means something different: it assigns a value.

Logical operators

The comparison operators used in logical tests are:

  • == is equal to
  • != does not equal
  • > greater than
  • < less than
  • >= greater than or equal to
  • <= less than or equal to
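
Each operator in the list can be tried directly in the interpreter. In this sketch a and b are arbitrary example values.

```r
a <- 3
b <- 5

a == b    # FALSE
a != b    # TRUE
a > b     # FALSE
a < b     # TRUE
a >= 3    # TRUE
b <= 4    # FALSE
```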

Logical indexing

Like most things in R, logical tests work on vectors.

Remember vector2 from before?

vector2
##  [1] 11 12 13 14 15 16 17 18 19 20

Let's find out which values are greater than 17.

vector2 > 17
##  [1] FALSE FALSE FALSE FALSE FALSE FALSE FALSE  TRUE  TRUE  TRUE

Logical indexing

This becomes useful when you want only a subset of the values in a vector, a technique called logical indexing.

To get only the values of vector2 that are greater than 17, we can use a logical (AKA boolean) vector as the index.

vector2[vector2 > 17] 
## [1] 18 19 20

You can also combine logical tests using the logical AND operator & or the logical OR operator |.

vector2[vector2 > 17 & vector2 < 20]
## [1] 18 19
vector2[vector2 > 17 | vector2 == 13]
## [1] 13 18 19 20
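
A logical vector can also be summarized rather than used as an index. This sketch uses the base R functions sum() and which(). The name is_big is made up for the illustration. Because TRUE counts as 1, sum() gives the number of values that pass the test.

```r
vector2 <- c(11, 12, 13, 14, 15, 16, 17, 18, 19, 20)

is_big <- vector2 > 17

sum(is_big)      # how many values pass the test: 3
which(is_big)    # positions of the values that pass: 8 9 10
```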

Assignment & Indexing

You can also use <- to replace particular elements in a vector.

primates <- c("gorilla", "gibbon", "langur", "gibbon", "gorilla")
primates
## [1] "gorilla" "gibbon"  "langur"  "gibbon"  "gorilla"
primates[3] <- "bushbaby"
primates
## [1] "gorilla"  "gibbon"   "bushbaby" "gibbon"   "gorilla"
primates[primates == "gorilla"] <- "chimpanzee"
primates
## [1] "chimpanzee" "gibbon"     "bushbaby"   "gibbon"     "chimpanzee"

Factors

A factor is a special type of vector for storing categorical data.

pets <- factor(c("cat", "cat", "dog", "pony", "dog", "dog"))
pets
## [1] cat  cat  dog  pony dog  dog 
## Levels: cat dog pony

R will now treat this differently from other vectors.

These come in useful later on, when we want to summarize by different factor levels.
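
As a small taste of summarizing by level, the sketch below uses the base R functions levels() and table(). The name pet_counts is made up for the illustration.

```r
pets <- factor(c("cat", "cat", "dog", "pony", "dog", "dog"))

levels(pets)               # "cat" "dog" "pony"

# Count how many times each level occurs
pet_counts <- table(pets)
pet_counts                 # cat 2, dog 3, pony 1
```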

Ordered factors

By default, R assumes the order of the levels in your factor is alphabetical.

You can change this by replacing the normal factor pets with a new ordered factor:

pets <- ordered(pets, levels=c("pony", "dog", "cat"))
pets
## [1] cat  cat  dog  pony dog  dog 
## Levels: pony < dog < cat
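
One practical consequence of an ordered factor is that comparison operators respect the level order. This is a minimal sketch, and the name at_least_dog is made up for the illustration.

```r
pets <- factor(c("cat", "cat", "dog", "pony", "dog", "dog"))
pets <- ordered(pets, levels = c("pony", "dog", "cat"))

# Which pets rank at "dog" or above in the pony < dog < cat ordering?
at_least_dog <- pets >= "dog"
at_least_dog    # TRUE TRUE TRUE FALSE TRUE TRUE
```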

Data Frames

Data frames store related vectors of data together in a single object.

They are analogous to a spreadsheet:

  • each row corresponds to an individual (e.g., specimen, species)
  • each column corresponds to some observation about that individual

You will use the read.table() function to read a dataframe directly from a .csv or .txt file.
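
To see how vectors line up as columns, you can also build a small dataframe by hand with the base R function data.frame(). The name specimens and all the values below are made up for this illustration.

```r
# Made-up measurements, purely for illustration
specimens <- data.frame(
  species  = c("gorilla", "gibbon", "langur"),
  femur_mm = c(370, 200, 185)
)

specimens          # prints the table
nrow(specimens)    # 3 rows (individuals)
ncol(specimens)    # 2 columns (observations)
```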

Accessing data in a dataframe

$ is used to access a named column within a dataframe.

The [row, column] syntax selects values by the index number of the desired row and column.

Accessing data in a dataframe

Let's extract particular data from the built-in dataframe iris.

head(iris)
##   Sepal.Length Sepal.Width Petal.Length Petal.Width Species
## 1          5.1         3.5          1.4         0.2  setosa
## 2          4.9         3.0          1.4         0.2  setosa
## 3          4.7         3.2          1.3         0.2  setosa
## 4          4.6         3.1          1.5         0.2  setosa
## 5          5.0         3.6          1.4         0.2  setosa
## 6          5.4         3.9          1.7         0.4  setosa

Getting data by named column

iris$Sepal.Length
##   [1] 5.1 4.9 4.7 4.6 5.0 5.4 4.6 5.0 4.4 4.9 5.4 4.8 4.8 4.3 5.8 5.7 5.4
##  [18] 5.1 5.7 5.1 5.4 5.1 4.6 5.1 4.8 5.0 5.0 5.2 5.2 4.7 4.8 5.4 5.2 5.5
##  [35] 4.9 5.0 5.5 4.9 4.4 5.1 5.0 4.5 4.4 5.0 5.1 4.8 5.1 4.6 5.3 5.0 7.0
##  [52] 6.4 6.9 5.5 6.5 5.7 6.3 4.9 6.6 5.2 5.0 5.9 6.0 6.1 5.6 6.7 5.6 5.8
##  [69] 6.2 5.6 5.9 6.1 6.3 6.1 6.4 6.6 6.8 6.7 6.0 5.7 5.5 5.5 5.8 6.0 5.4
##  [86] 6.0 6.7 6.3 5.6 5.5 5.5 6.1 5.8 5.0 5.6 5.7 5.7 6.2 5.1 5.7 6.3 5.8
## [103] 7.1 6.3 6.5 7.6 4.9 7.3 6.7 7.2 6.5 6.4 6.8 5.7 5.8 6.4 6.5 7.7 7.7
## [120] 6.0 6.9 5.6 7.7 6.3 6.7 7.2 6.2 6.1 6.4 7.2 7.4 7.9 6.4 6.3 6.1 7.7
## [137] 6.3 6.4 6.0 6.9 6.7 6.9 5.8 6.8 6.7 6.7 6.3 6.5 6.2 5.9

Getting data by row and column number

iris[1,3]
## [1] 1.4

Getting a whole row of data

iris[1, ]
##   Sepal.Length Sepal.Width Petal.Length Petal.Width Species
## 1          5.1         3.5          1.4         0.2  setosa

Note that leaving the row or column position blank selects ALL rows or columns.

So iris$Species is the same as iris[,5], because the fifth column of the iris dataframe is named 'Species'.
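
The equivalence can be checked with the base R function identical(). The sketch also selects several rows or columns at once. The names same_column and two_cols are made up for the illustration.

```r
# Both expressions pull out the same column
same_column <- identical(iris$Species, iris[, 5])
same_column        # TRUE

# Rows 1 to 3, all columns
iris[1:3, ]

# All rows, two columns chosen by name
two_cols <- iris[, c("Sepal.Length", "Species")]
dim(two_cols)      # 150 rows, 2 columns
```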

Creating a new column with $

You can also create new columns in a dataframe using the $ operator.

There is no column named Petal.Area:

iris$Petal.Area
## NULL

But we can add one:

iris$Petal.Area <- iris$Petal.Length * iris$Petal.Width
iris$Petal.Area
##   [1]  0.28  0.28  0.26  0.30  0.28  0.68  0.42  0.30  0.28  0.15  0.30
##  [12]  0.32  0.14  0.11  0.24  0.60  0.52  0.42  0.51  0.45  0.34  0.60
##  [23]  0.20  0.85  0.38  0.32  0.64  0.30  0.28  0.32  0.32  0.60  0.15
##  [34]  0.28  0.30  0.24  0.26  0.14  0.26  0.30  0.39  0.39  0.26  0.96
##  [45]  0.76  0.42  0.32  0.28  0.30  0.28  6.58  6.75  7.35  5.20  6.90
##  [56]  5.85  7.52  3.30  5.98  5.46  3.50  6.30  4.00  6.58  4.68  6.16
##  [67]  6.75  4.10  6.75  4.29  8.64  5.20  7.35  5.64  5.59  6.16  6.72
##  [78]  8.50  6.75  3.50  4.18  3.70  4.68  8.16  6.75  7.20  7.05  5.72
##  [89]  5.33  5.20  5.28  6.44  4.80  3.30  5.46  5.04  5.46  5.59  3.30
## [100]  5.33 15.00  9.69 12.39 10.08 12.76 13.86  7.65 11.34 10.44 15.25
## [111] 10.20 10.07 11.55 10.00 12.24 12.19  9.90 14.74 15.87  7.50 13.11
## [122]  9.80 13.40  8.82 11.97 10.80  8.64  8.82 11.76  9.28 11.59 12.80
## [133] 12.32  7.65  7.84 14.03 13.44  9.90  8.64 11.34 13.44 11.73  9.69
## [144] 13.57 14.25 11.96  9.50 10.40 12.42  9.18

Adding a new column with $

Note: Adding a calculated column like this is less error prone and more repeatable than doing it in Excel.

You should store the data in a spreadsheet, and then manipulate it in R.

Functions

Functions are the heart of R.

A function is just a series of commands that is assigned a name.

Functions

  • accept arguments
  • perform a series of commands using the argument values
  • return a single object.

You can create your own functions (and you will!) but there are many hundreds of pre-defined functions available for your use.

A huge part of the learning curve of R is learning which functions exist, which is why they invented google.com!
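
A minimal sketch of writing your own function follows. The name petal_area and its argument names are hypothetical, chosen for this illustration. It is not a built-in R function.

```r
# petal_area is a hypothetical helper, not a built-in R function
petal_area <- function(petal_length, petal_width) {
  # The value of the last expression is what the function returns
  petal_length * petal_width
}

petal_area(1.4, 0.2)    # 0.28

# Because * is vectorized, the function also works on whole columns
areas <- petal_area(iris$Petal.Length, iris$Petal.Width)
length(areas)           # 150, one value per row of iris
```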

Functions

You call a function by typing its name, followed by parentheses containing 0 or more arguments.

Arguments are the way you pass data and/or options to a function.

For example, the paste() function simply pastes together its arguments into a single text string.

part1 <- "The quick brown fox"
part2 <- "jumped over the lazy dog."
paste(part1, part2)
## [1] "The quick brown fox jumped over the lazy dog."

Note: You can see all the arguments and default values for any function using the ? operator like this ?NameOfFunction.
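
Options are usually passed as named arguments. For example, paste() has a sep argument that controls what goes between the pieces, and its default is a single space. The name joined is made up for this sketch.

```r
part1 <- "The quick brown fox"
part2 <- "jumped over the lazy dog."

# Override the default separator with a named argument
joined <- paste(part1, part2, sep = " ... ")
joined
## [1] "The quick brown fox ... jumped over the lazy dog."
```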

Scripts

Typing commands interactively into the command-line interpreter is fine for experimentation.

Ultimately we want to save every single command to a text file, so that the analysis can be run later, shared with collaborators, or published online with the article as supplementary information.

This element of reproducibility is a critical benefit of doing scripted data analysis.

In RStudio, you can create a new script by using the File >> New File menu.
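
A saved script can also be run from the interpreter with the base R function source(). The sketch below writes a tiny script to a temporary folder so that it runs anywhere. The file name analysis.R and the variable names are made up for this illustration.

```r
# Write a tiny two-line script to a temporary location
script_path <- file.path(tempdir(), "analysis.R")
writeLines(c("numbers_total <- sum(1:10)",
             "numbers_mean  <- mean(1:10)"),
           script_path)

# Run every command in the script, top to bottom
source(script_path)

numbers_total    # 55
numbers_mean     # 5.5
```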

Challenge

  1. Create a new script file.
  2. Create a variable called numbers that contains the numbers 1 through 100. Hint: use the ? operator to investigate the seq function or explore the : operator.
  3. Create a new vector that contains only the elements of numbers that are greater than 36. Save this to a variable called bignumbers.
  4. Calculate the natural logarithm of each value in bignumbers.
  5. Make a histogram of bignumbers using the hist() function.
  6. Save your script file, then clear your workspace to delete ALL saved variables, and re-run your script file to redo all the steps.

R Markdown

R Markdown documents are like scripts, but allow you to mix human readable text with snippets of R code.

At any time you can "knit" the R Markdown file, which causes the code snippets to be run.

Any results (figures or text output) are combined with the human readable parts into a single PDF, Word, or HTML file.

You can create a new R Markdown document using the File >> New File menu.

Example R Markdown document

This sentence is just normal text, made for humans to read.

The following bit is a "code chunk", which is set off by the ``` characters and consists of actual R code that gets run, and the results knitted back into the document.

You can also put R code inline like this. What is 2 + 2? The answer is `r 2 + 2`.

```{r examplechunk}
x <- rnorm(100)
y <- (x * 0.3) + rnorm(100, sd=0.2)
plot(x,y,pch=16)
```

Output of R Markdown after knitting

This sentence is just normal text, made for humans to read.

The following bit is a "code chunk", which is set off by the ``` characters and consists of actual R code that gets run, and the results knitted back into the document.

You can also put R code inline like this. What is 2 + 2? The answer is 4.

Packages

One of the most important features of R is that it has a vast ecosystem of user-contributed packages, which extend the base functionality of R.

Packages can be installed from the command line with install.packages('ggplot2') or by using the graphical package manager in the Packages tab of RStudio.

A package only needs to be installed once, but in each new R session where you want to use its functions, you must make the package available with the library() function.
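
A cautious sketch of the install-then-load pattern follows. The install line is commented out because it needs internet access and only has to be run once. requireNamespace() is a base R function that reports whether a package is installed, and the name has_ggplot2 is made up for this illustration.

```r
# One-time installation (needs internet access), left commented out here:
# install.packages("ggplot2")

# Check whether the package is installed before trying to load it
has_ggplot2 <- requireNamespace("ggplot2", quietly = TRUE)

if (has_ggplot2) {
  library(ggplot2)   # makes ggplot2's functions available in this session
} else {
  message("ggplot2 is not installed; run install.packages('ggplot2') first")
}
```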

Data Input

Getting data into R can be very frustrating for new users.

This is because we tend to be sloppy when collecting data in a spreadsheet.

R has rigid expectations.

  • Each column of a text file should contain a single type of data (text, numeric, factor).
  • Each row should hold the observations for each column for a single individual.
  • That's it. Period. No extra rows that are just for formatting and don't contain data observations.

Data Input

The primary function for reading data into R is the read.table() function. There are a few important arguments.

  • file = The full path to the file as a text string. (Use the forward slash /, even on Windows.) You can also read from a remote URL (i.e. from the web).
  • sep = The character that separates columns in your text file. The default is "" (an empty string, meaning any run of whitespace), which kind of sucks because most of our files will be separated by either \t or ,.
  • header = Whether or not there is a header row in your text file. Defaults to FALSE, but usually we need it to be TRUE.
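
The three arguments can be seen together in a small sketch. To keep it self-contained, it first writes a tab-separated file to a temporary folder. The file name example_lengths.txt, the column names, and the values are all made up for this illustration.

```r
# Create a small tab-separated file with a header row (made-up values)
data_path <- file.path(tempdir(), "example_lengths.txt")
writeLines(c("species\tlength_mm",
             "gorilla\t370",
             "gibbon\t200"),
           data_path)

# Read it back, stating the separator and the header row explicitly
example_data <- read.table(file = data_path, sep = "\t", header = TRUE)

str(example_data)     # shows the data type of each column
nrow(example_data)    # 2
```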

Challenge

Read in the file called "femur_lengths.txt" directly from the following URL

http://hompal-stats.wabarr.com/datasets/femur_lengths.txt

Save to an object called femora.

Use the str() function to examine the data type of each column. You can also do the same thing in RStudio in the Environment tab.

Use the mean() function to calculate the mean of all the femur lengths.

Data Output

Usually, I would recommend not writing out modified data frames to text files.

It is far better to have a single input file, and to do all necessary manipulations in your saved R script file.

But if you need to write data, do it with write.table(). The important arguments are:

  • x = The name of the data frame to be saved.
  • file = Path to the output file. You can't use file.choose() though.
  • quote = Do you want quotation marks around strings? Defaults to TRUE
  • sep = Same as for read.table()
  • row.names = Do you want row names? Defaults to TRUE, but usually you will set to FALSE
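
A minimal sketch of these arguments in use follows. It writes to a temporary folder so that it runs anywhere. The data frame, the file name example_output.txt, and the values are made up for this illustration.

```r
# Made-up data frame for illustration
example_data <- data.frame(species   = c("gorilla", "gibbon"),
                           length_mm = c(370, 200))

out_path <- file.path(tempdir(), "example_output.txt")

# Tab-separated, no quotation marks, no row names
write.table(x = example_data, file = out_path,
            quote = FALSE, sep = "\t", row.names = FALSE)

# Look at the raw text that was written: a header line, then one line per row
written <- readLines(out_path)
length(written)    # 3
```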

Challenge

Add a new column of log transformed lengths to the femora dataframe and then write out the new file to the Desktop.