Data Shapes - short (wide) versus long

Short

animal    nToes   size     smell
chicken   4       small    so-so
cow       2       big      objectionable
pig       2       medium   more_objectionable

Long

animal    variable   value
cow       size       big
cow       nToes      2
cow       smell      objectionable
pig       size       medium
pig       nToes      2
pig       smell      more_objectionable
chicken   size       small
chicken   nToes      4
chicken   smell      so-so
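As a minimal sketch, the short table above can be typed in by hand and reshaped into the long one with gather() from tidyr. The object names short and long are placeholders chosen here for illustration. nToes is converted to text first, because the single value column has to hold one type.

```r
library(tidyr)

# The short table from above, entered by hand
short <- data.frame(
  animal = c("chicken", "cow", "pig"),
  nToes  = c(4, 2, 2),
  size   = c("small", "big", "medium"),
  smell  = c("so-so", "objectionable", "more_objectionable"),
  stringsAsFactors = FALSE
)

# The value column will mix numbers and words, so store nToes as text
short$nToes <- as.character(short$nToes)

# Collapse every column except animal into variable/value pairs
long <- gather(short, key = "variable", value = "value", -animal)
print(long)
```

The rows come out grouped by variable rather than by animal, but each animal/variable/value combination matches the long table.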

But the long form is stupid!

Why would I ever use it??

I'm glad you asked.

Two reasons come up most often:

  • You want to take advantage of group_by() in the dplyr package to compute summaries for each group
  • You want to take advantage of facets in ggplot2

dplyr and ggplot2 just work better on the long form

(which makes sense, as all these packages are part of the tidyverse)
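To illustrate the group_by() point, here is a small sketch on data that is already in long form. The values are the first two students from the example grades data; the object names grades_long and exam_means are placeholders invented for this example.

```r
library(dplyr)

# A few exam grades in long form: one row per student per exam
grades_long <- data.frame(
  students = rep(c("Guy May", "Genevieve Clark"), times = 3),
  test     = rep(c("exam1", "exam2", "exam3"), each = 2),
  grade    = c(57.18, 56.51, 56.79, 64.14, 74.91, 61.45)
)

# One group per exam, one summary row per group
exam_means <- grades_long %>%
  group_by(test) %>%
  summarise(mean_grade = mean(grade))
print(exam_means)
```

Because the exam name lives in a single column, adding a fourth exam would not change this code at all.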

Example data

filepath <- "http://hompal-stats.wabarr.com/datasets/studentgrades.txt"
grades <- read.table(filepath, header=TRUE, sep=',')
head(grades)
##            students            TA exam1 exam2 exam3
## 1           Guy May       Raphael 57.18 56.79 74.91
## 2   Genevieve Clark       Raphael 56.51 64.14 61.45
## 3     Jody Phillips     Donatello 71.49 71.95 71.14
## 4 Claire Fitzgerald Michaelangelo 80.93 71.33 68.11
## 5    Everett Graves       Raphael 62.64 45.31 55.21
## 6       Lance Logan      Leonardo 70.18 64.03 64.81

Two main functions (from the tidyr package)

gather() takes multiple columns and collapses them into two key-value columns

spread() takes two columns (a key-value pair) and spreads them into separate columns

gather()

Collapses multiple columns into two columns: a key and a value

library(tidyverse)
head(grades)
##            students            TA exam1 exam2 exam3
## 1           Guy May       Raphael 57.18 56.79 74.91
## 2   Genevieve Clark       Raphael 56.51 64.14 61.45
## 3     Jody Phillips     Donatello 71.49 71.95 71.14
## 4 Claire Fitzgerald Michaelangelo 80.93 71.33 68.11
## 5    Everett Graves       Raphael 62.64 45.31 55.21
## 6       Lance Logan      Leonardo 70.18 64.03 64.81
grades <- gather(grades, key = "test", value="grade", exam1, exam2, exam3)
head(grades)
##            students            TA  test grade
## 1           Guy May       Raphael exam1 57.18
## 2   Genevieve Clark       Raphael exam1 56.51
## 3     Jody Phillips     Donatello exam1 71.49
## 4 Claire Fitzgerald Michaelangelo exam1 80.93
## 5    Everett Graves       Raphael exam1 62.64
## 6       Lance Logan      Leonardo exam1 70.18
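The columns to collapse do not have to be listed one by one. As a sketch, the three calls below select the same exam columns by name, by range, and by excluding the other columns. grades_small is a hand-entered stand-in for the first three rows of the example data, and the by_* names are placeholders.

```r
library(tidyr)

# First three rows of the example grades data, entered by hand
grades_small <- data.frame(
  students = c("Guy May", "Genevieve Clark", "Jody Phillips"),
  TA       = c("Raphael", "Raphael", "Donatello"),
  exam1    = c(57.18, 56.51, 71.49),
  exam2    = c(56.79, 64.14, 71.95),
  exam3    = c(74.91, 61.45, 71.14)
)

# Name each column to gather
by_name  <- gather(grades_small, key = "test", value = "grade", exam1, exam2, exam3)
# Give a range of adjacent columns
by_range <- gather(grades_small, key = "test", value = "grade", exam1:exam3)
# Gather everything except the identifier columns
by_excl  <- gather(grades_small, key = "test", value = "grade", -students, -TA)

# Compare the three results
print(all.equal(by_name, by_range))
print(all.equal(by_name, by_excl))
```

The exclusion form is handy when there are many measurement columns and only a few identifier columns.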

This allows you to use the test variable in facets, to make multiple plots easily

gather()

qplot(x=TA, y=grade, geom='boxplot', data=grades, fill=TA) + 
  facet_wrap(~test) + 
  theme_bw(15) + 
  theme(axis.title.x=element_blank(),
        axis.text.x=element_blank(),
        axis.ticks.x=element_blank())
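qplot() is deprecated in recent ggplot2 releases (3.4.0 onward), so the same figure can also be written with ggplot(). The sketch below is one such rewrite, not the original slide code. It uses a small hand-entered subset of the long grades data, and grades_subset and p are placeholder names.

```r
library(ggplot2)

# Six students x three exams, already in long form (values from the example data)
grades_subset <- data.frame(
  TA    = rep(c("Raphael", "Raphael", "Donatello",
                "Michaelangelo", "Raphael", "Leonardo"), times = 3),
  test  = rep(c("exam1", "exam2", "exam3"), each = 6),
  grade = c(57.18, 56.51, 71.49, 80.93, 62.64, 70.18,
            56.79, 64.14, 71.95, 71.33, 45.31, 64.03,
            74.91, 61.45, 71.14, 68.11, 55.21, 64.81)
)

# One boxplot panel per exam, coloured by TA
p <- ggplot(grades_subset, aes(x = TA, y = grade, fill = TA)) +
  geom_boxplot() +
  facet_wrap(~test) +
  theme_bw(15) +
  theme(axis.title.x = element_blank(),
        axis.text.x  = element_blank(),
        axis.ticks.x = element_blank())
print(p)
```

The facet_wrap(~test) line only works because test is a single column, which is what gather() produced.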

spread()

Does the opposite of gather(). Spreads a key-value pair of columns across multiple columns

head(grades)
##            students            TA  test grade
## 1           Guy May       Raphael exam1 57.18
## 2   Genevieve Clark       Raphael exam1 56.51
## 3     Jody Phillips     Donatello exam1 71.49
## 4 Claire Fitzgerald Michaelangelo exam1 80.93
## 5    Everett Graves       Raphael exam1 62.64
## 6       Lance Logan      Leonardo exam1 70.18
grades %>% spread(key=test, value=grade)
##             students            TA exam1 exam2 exam3
## 1       Alexis Woods     Donatello 66.57 66.19 74.16
## 2        Angel Baker     Donatello 73.80 65.49 62.45
## 3    Angelina Wright Michaelangelo 70.07 69.66 62.55
## 4         Anne Craig Michaelangelo 74.36 79.10 72.60
## 5    Brandon Pearson Michaelangelo 72.33 67.72 68.75
## 6      Caleb Morales       Raphael 58.12 66.29 55.83
## 7      Candace Bryan      Leonardo 66.11 60.70 76.52
## 8     Cecelia Glover     Donatello 68.67 72.53 68.84
## 9       Cesar Carter      Leonardo 65.71 79.94 76.30
## 10         Chris Kim Michaelangelo 80.46 67.64 63.89
## 11 Claire Fitzgerald Michaelangelo 80.93 71.33 68.11
## 12      Clint Steele Michaelangelo 84.87 87.51 60.22
## 13 Connie Williamson       Raphael 64.74 59.98 62.32
## 14   Conrad Chambers      Leonardo 74.79 68.82 70.42
## 15   Courtney Tucker      Leonardo 74.69 80.15 69.20
## 16       Doyle Terry      Leonardo 71.02 81.99 73.68
## 17         Ed Cannon     Donatello 75.06 66.85 75.04
## 18    Everett Graves       Raphael 62.64 45.31 55.21
## 19      Floyd Willis Michaelangelo 79.15 68.01 76.69
## 20      Fred Simmons Michaelangelo 87.32 75.46 74.15
## 21      Gail Jenkins     Donatello 83.59 64.83 67.02
## 22   Genevieve Clark       Raphael 56.51 64.14 61.45
## 23    Gerard Jackson       Raphael 68.52 74.46 59.83
## 24           Guy May       Raphael 57.18 56.79 74.91
## 25       Jackie Hart       Raphael 61.68 60.92 55.58
## 26     Janet Padilla Michaelangelo 71.09 69.10 80.10
## 27          Jim Todd     Donatello 69.89 75.22 64.05
## 28     Jody Phillips     Donatello 71.49 71.95 71.14
## 29   Johnnie Spencer Michaelangelo 72.78 72.80 69.07
## 30   Kathleen Fisher Michaelangelo 78.13 79.42 70.70
## 31       Lance Logan      Leonardo 70.18 64.03 64.81
## 32     Latoya Duncan       Raphael 77.13 74.39 64.27
## 33  Lynette Alvarado      Leonardo 73.96 68.73 78.85
## 34     Lynn Holloway      Leonardo 63.08 73.22 72.99
## 35      Marsha White     Donatello 68.36 72.03 64.06
## 36  Michelle Robbins       Raphael 62.67 60.28 62.80
## 37       Oscar Weber Michaelangelo 68.87 75.66 81.27
## 38          Pam King      Leonardo 73.45 76.66 66.63
## 39        Pat Torres Michaelangelo 66.44 66.05 66.45
## 40       Patsy Payne       Raphael 67.29 59.00 73.88
## 41       Peter Quinn       Raphael 58.04 62.54 72.15
## 42     Raul Gonzales       Raphael 63.67 70.20 63.90
## 43     Roberto Gross       Raphael 67.74 67.27 65.71
## 44   Rolando Schultz      Leonardo 76.79 78.89 65.70
## 45         Ron Hines Michaelangelo 75.03 67.03 73.04
## 46     Stacy Salazar      Leonardo 70.11 81.56 63.11
## 47          Ted Hunt      Leonardo 72.98 79.85 87.50
## 48       Teri Guzman     Donatello 73.73 76.57 73.53
## 49       Terry Kelly Michaelangelo 72.32 78.66 71.07
## 50      Willard Lamb     Donatello 71.09 76.34 78.14
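In tidyr 1.0.0 and later, gather() and spread() are marked as superseded in favor of pivot_longer() and pivot_wider(). They still work, but new code may prefer the newer pair. The sketch below repeats the round trip with those functions on a hand-entered stand-in for the first three rows of the grades data; grades_small, long and wide are placeholder names.

```r
library(tidyr)

# First three rows of the example grades data, entered by hand
grades_small <- data.frame(
  students = c("Guy May", "Genevieve Clark", "Jody Phillips"),
  TA       = c("Raphael", "Raphael", "Donatello"),
  exam1    = c(57.18, 56.51, 71.49),
  exam2    = c(56.79, 64.14, 71.95),
  exam3    = c(74.91, 61.45, 71.14)
)

# Equivalent of gather(): exam columns become test/grade pairs
long <- pivot_longer(grades_small,
                     cols = c(exam1, exam2, exam3),
                     names_to = "test",
                     values_to = "grade")

# Equivalent of spread(): test/grade pairs become exam columns again
wide <- pivot_wider(long, names_from = test, values_from = grade)
print(wide)
```

The argument names (names_to, values_to, names_from, values_from) play the same roles as key and value in gather() and spread().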

Challenge - using this dataset

filepath <- "http://hompal-stats.wabarr.com/datasets/barr_astrag_2014.txt"
astrag <- read.table(filepath, header=TRUE, sep="\t")
head(astrag)
##   individual                 Taxon Habitat    ACF  APD     B DistRad DMTD
## 1  AMNH81690    Aepyceros_melampus      LC 384.72 6.89 15.16    9.22 1.78
## 2  AMNH82050    Aepyceros_melampus      LC     NA 6.09 14.00    8.73 1.64
## 3  AMNH83534    Aepyceros_melampus      LC     NA 6.14 17.22    9.19 1.70
## 4  AMNH85150    Aepyceros_melampus      LC     NA 5.80 15.27    8.85 1.72
## 5 AMNH233038 Alcelaphus_buselaphus       O 607.83 7.46 18.47   11.33 2.26
## 6  AMNH34717 Alcelaphus_buselaphus       O 465.91 6.56 15.47    9.79 1.78
##   DTArea   LML   MIN   MML PMTD ProxRad  PTArea   WAF   WAT
## 1 553.14 35.78 28.04 34.39 6.33   10.00  915.50 21.39 21.83
## 2 558.17 35.46 27.28 33.18 6.54   10.50  841.79 22.47 21.11
## 3 554.00 39.15 29.99 36.94 7.63   10.65  976.25 22.07 21.54
## 4 498.67 36.61 28.02 34.09 6.97   10.00  841.71 22.55 21.03
## 5 900.76 45.25 36.03 43.55 6.83   13.85 1557.05 28.98 27.07
## 6 649.47 36.81 30.80 35.95 4.28   10.91 1139.50 24.04 25.84

Challenge - Make this Plot