August 9, 2023

Course Schedule

Daily agenda:

  • 9:30 - 10:35 Morning Session #1
  • 10:35 - 10:50 Morning Break
  • 10:50 - 11:55 Morning Session #2
  • 11:55 - 1:10 Lunch
  • 1:10 - 2:15 Afternoon Session #1
  • 2:15 - 2:30 Afternoon Break
  • 2:30 - 3:35 Afternoon Session #2
  • 3:35 - 3:40 Feedback/Q&A

Outline

  • R Markdown Basics

  • R Markdown Options: Part 1

  • R Markdown Options: Part 2

  • R Markdown with Numerical Summaries

  • R Markdown with Graphics

Data Analysis Project Overview

  • Prepare data

    • Read
    • Manipulate
    • Reshape
  • Explore data

    • Univariate Summaries
    • Multivariate summaries
    • Model
  • Documentation

    • Explain process
    • Explain findings

  • Usability/reproducibility

    • Collocate code/output/notes

RStudio - Project Feature

  • Often working for multiple clients/on multiple projects at one time

  • Takes time to manage data/programs/analyses/results separately

  • Project in RStudio: way to divide work into multiple silos. Each with own:

    • Working directory

    • Workspace

    • History

    • Source documents

RStudio - Project

  • Easy to create!

  • Can save workspace, etc. and pick up right where you left off!

  • Let’s create one for today!

R Markdown Basics

  • Integrate data management, analysis, documentation, results
  • Doesn’t matter how great your analysis is unless you can explain it to others :)
  • Need to communicate results effectively!
  • (Traditional) Basic Use of R

    • Write a script with all code

    • Execute code via console

    • Outside R: present results

Using a Notebook Instead

  • Notebook tools allow

    • Multiple languages (R, Python, Java, C, etc.)
    • Multiple output styles (PDF, HTML, Presentation, etc.)
    • Collocated code/analysis/results
  • May have heard of Jupyter notebooks
  • R Markdown - built-in notebook for RStudio

RMarkdown and (Data) Science

As scientist/consultant I tend to interface with three “groups”

  • Clients/subject-matter experts: not statisticians/data scientists, so they need conclusions, not code
  • Colleagues: Code, results, process all equally important
  • Past/Present/Future Me: Documentation, documentation, documentation

R Markdown Basics: Vocabulary

  • HTML (HyperText Mark-up Language) is common concept
    • Plain text that browser (e.g., Chrome) interprets and renders
    • Flat file with .html extension
  • RMarkdown is a specific markup language
    • Easier syntax
    • Not as powerful
    • Flat file with .Rmd extension

R Markdown Basics: File Contents

  • Markdown is not R-specific: just converts plain text to HTML
  • R Markdown just implements the Markdown language in R
  • Header that sets parameters for the R Markdown file
    • Separate from other text with ---
  • Plain text chunks
  • Code chunks
    • Separate from other text with ```
  • Comments
    • <!-- This is how you style a comment -->
  • Mix in HTML/CSS too if you know that!

R Markdown Basics: Creating an .Rmd

  • RStudio makes it easy!

R Markdown Basics: Output Type

  • Commonly used document types can be created

R Markdown Basics: Presentations

  • Slide presentations

R Markdown Basics: Header

  • Top section: YAML header
---
title: "Untitled"
author: "Jonathan W. Duggins"
date: "August 10, 2022"
output: html_document
---
  • Define settings for document
  • Author, Title, etc.
  • Output type/Options

R Markdown Basics: Code Chunk

  • Everything else (set-up code, Markdown, R code, plain text) comes after the YAML header

  • Start code chunk with ```{r} or CTRL/CMD + Alt + I
  • Can specify options on individual code chunks
    • toyData is title of chunk
    • echo = TRUE prints chunk contents (code) to output
    • Many other options available

R Markdown Basics: Text + Markdown Chunk

### R Markdown

This is an R Markdown document. Markdown is a simple formatting syntax
for authoring HTML, PDF, and MS Word documents. For more details on
using R Markdown see [RStudio](http://rmarkdown.rstudio.com) at <http://rmarkdown.rstudio.com>.

When you click the **Knit** button a document will be generated that
includes _both_ content as well as the output of any embedded R code
chunks within the document. 
  • On execution:

    • ### becomes a level 3 header
    • [text](url) becomes a link with the URL hidden; <url> becomes a visible link
    • **Knit** or __Knit__ is in bold
    • *both* or _both_ is in italics

R Markdown Basics

R Markdown

This is an R Markdown document. Markdown is a simple formatting syntax for authoring HTML, PDF, and MS Word documents. For more details on using R Markdown see RStudio at http://rmarkdown.rstudio.com.

When you click the Knit button a document will be generated that includes both content as well as the output of any embedded R code chunks within the document.

Where do we go from here?

  • Expand knowledge of Markdown syntax
  • Look at “Notebook” feature
  • Check options for code chunks
  • Change type of output
  • Mix in numerical and graphical summaries as we go!

R Markdown Syntax Can Include…

  • Plain text

  • End a line with two spaces to force a line break (leave a blank line to start a new paragraph)

    • Line breaks are not always added when you return!

    • Use two spaces and a return!

    • Can specify <br> to get line break like HTML

  • *italics* and _italics_
  • **bold** and __bold__
  • superscript^2^ becomes superscript²
  • ~~strikethrough~~ becomes struck-through text

Links, Headers, and Code, oh my!

Lists!

  • Can do lists: be sure to end each line with two spaces!
    • Indent sub lists four spaces
* unordered list  
* item 2  

    + sub-item 1  
    + sub-item 2  

1. ordered list  
1. item 2  

    + sub-item 1  
    + sub-item 2  
  • unordered list

  • item 2

    • sub-item 1
    • sub-item 2
  1. ordered list

  2. item 2

    • sub-item 1
    • sub-item 2

A Bit of HTML: Code

  • Some HTML can be helpful!

<div style = "float: left; width: 50%">

- unordered list

</div>

<div style = "float: right; width: 50%">

1. ordered list

</div>

A Bit of HTML: Results

  • Some HTML can be helpful!
  • unordered list
  1. ordered list

Text Tables

Table Header  | Second Header | Col 3
------------- | ------------- | -----------
Table Cell    | Cell (1, 2)   | Cell (1, 3)
Cell (2, 1)   | Cell (2, 2)   | Cell (2, 3)

Table Header   Second Header   Col 3
Table Cell     Cell (1, 2)     Cell (1, 3)
Cell (2, 1)    Cell (2, 2)     Cell (2, 3)

Activity #1

  • Open the Activity1StarterCode.Rmd file from our resources

  • Use the RMarkdown syntax we’ve learned so far to edit the document so that it produces the output shown in EdaActivity1.html

Including Code Chunks & Inline Code

We’ve already seen how to include an R code chunk:

  • Add chunk via Ctrl/Cmd + Alt + I or by typing
    ```{r}
    code
    ```
  • Can use code inline: e.g., ToothGrowth has 60 observations
  • Begin with `r, enter code, & end with another back-tick
    - ToothGrowth has `r nrow(ToothGrowth)` observations
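
Inline code is ordinary R that knitr evaluates at knit time and pastes into the sentence. A minimal sketch of the computation behind the ToothGrowth sentence, using the built-in dataset (the names `n_obs` and `sentence` are illustrative only, not part of the course materials):

```r
# ToothGrowth ships with base R (datasets package), so no import is needed
n_obs <- nrow(ToothGrowth)

# Roughly what the knitted sentence looks like once the inline code is evaluated
sentence <- paste("ToothGrowth has", n_obs, "observations")
print(sentence)
```

Because the number is computed rather than typed, the sentence stays correct if the data change.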

Notebook Functionality: Single Code Chunk

  • Execute code with Cmd/Ctrl + Shift + Enter or with “Run”

  • Results show up in editor!

Notebook Functionality: Across Chunks

  • Allows for quick iteration within a chunk: editing and re-executing - when you are happy, you move on and start a new chunk.
  • Can run all code chunks with Ctrl/Cmd + Alt + R
  • Can develop code and record your thoughts - similar to classic lab notebook in the physical sciences

Code Chunk Options

  • Many options depending on chunk purpose!
  • Hide/show code with echo = FALSE/TRUE
  • Evaluate with eval = TRUE/FALSE
  • include = FALSE runs the code but hides both the code and its output (stronger than echo = FALSE, which hides only the code)
  • message = TRUE/FALSE and warning = TRUE/FALSE can turn on/off displaying messages/warnings
  • error = TRUE allows file to be created with code that has an error

Set-Up Code

  • Chunk immediately following YAML; e.g.,

```{r setup, include=FALSE}
knitr::opts_chunk$set(echo = FALSE)
library(tidyverse)
library(huxtable)
```

  • Load any packages you need
  • Can set global options for all chunks
  • Local option supersedes global option

Caching Code Chunks

In a large analysis it may take a long time to run code chunks/knit your document

  • Can “Cache” results! Code will only rerun if it has changed.
  • Use cache = TRUE in code chunk definition
  • Delete the cache folders knitr creates to force everything to rerun

Adding Images

Adding images in markdown: ![](path/to/file)

  • Not ideal… difficult to control size/scale
  • Better way to add images – use R function!
  • knitr package has include_graphics function
    • (It’s what I used to include all those screenshots you’ve seen today!)
  • Use knitr or code chunk options to control size/scale!
  • Ex:
    ```{r graphics, out.width = "800px", echo = FALSE}
    knitr::include_graphics("path/to/file")
    ```

Adding Equations

  • Inline equation: $A = \pi*r^{2}$ becomes \(A = \pi*r^{2}\)
  • Block equation $$A = \pi*r^{2}$$ becomes \[A = \pi*r^{2}\]
  • Outputting equations for HTML is done through MathJax (javascript)
  • For PDFs it is done through LaTeX (may need to install)

Tables with huxtable

Data tables generally look a little underwhelming for presentations…

summary(cars)
##      speed           dist       
##  Min.   : 4.0   Min.   :  2.00  
##  1st Qu.:12.0   1st Qu.: 26.00  
##  Median :15.0   Median : 36.00  
##  Mean   :15.4   Mean   : 42.98  
##  3rd Qu.:19.0   3rd Qu.: 56.00  
##  Max.   :25.0   Max.   :120.00

Tables with huxtable

But look better with help from packages like kableExtra or huxtable!

as_hux(summary(cars), add_rownames = FALSE) %>%
  set_bold(row = 1, col = everywhere) %>%
  set_font_size(row = 1, col = everywhere, value = 24) %>%
  set_all_padding(1) %>% theme_striped()

speed          | dist
-------------- | ---------------
Min.   : 4.0   | Min.   :  2.00
1st Qu.:12.0   | 1st Qu.: 26.00
Median :15.0   | Median : 36.00
Mean   :15.4   | Mean   : 42.98
3rd Qu.:19.0   | 3rd Qu.: 56.00
Max.   :25.0   | Max.   :120.00

Selecting an Output Type

  • Change output type in YAML and use CTRL+SHIFT+K to knit to declared type

output: html_document

  • Use code explicitly:

rmarkdown::render("file.Rmd", output_format = "word_document")

  • Use Knit menu:

HTML Output Options

For HTML, you can include Table of Contents with options

output:
  html_document:
    toc: true
    toc_float: true

For html_documents, another option is to make the code chunks hidden by default, but visible with a click:

output:
  html_document:
    code_folding: hide

Common Outputs

  • Word

    output: word_document

  • PDF

    output: pdf_document

  • PDF typically done with LaTeX (beyond scope for today)

Producing Presentations

Presentations/Slides (new slides with ##)

  • output: ioslides_presentation - HTML presentation
  • output: slidy_presentation - HTML presentation
  • output: beamer_presentation - PDF presentation with LaTeX Beamer
  • output: powerpoint_presentation - PowerPoint

Interactivity with Leaflet

HTML documents inherently interactive

  • Widgets can be included
library(leaflet)
leaflet() %>%
  setView(-87.6553,41.9485 , zoom = 16) %>% 
  addTiles() %>%
  addMarkers(-87.6553,41.9485 , popup = "Wrigley!") 

Interactivity with DataTables

Interactive tables with DT library

library(DT)
datatable(ToothGrowth)

Interactivity with JavaScript

  • 3d scatterplots with rthreejs package

Interactivity Summary

Previous interactivity happened in the browser

  • Great because anyone can access with a browser
  • Bad because you can’t have as much functionality as you want…
  • Shiny allows for interactivity with R!
    • Major con: Requires R running somewhere
    • R Shiny beyond scope of this course
    • Covered in other sessions later in the week

Recap

  • RMarkdown combines two languages: R + Markdown
  • Power of computing with R
  • Flexibility of document creation with Markdown
  • Options/best practices vary by output type
  • My approach: select output type, create/refine/document R chunks, add in aesthetics last!

Activity #2

  • Open Activity2StarterCode.Rmd

  • Modify it to produce output you see in EdaActivity2.html

Types of Data

  • Numeric: Values are numbers with magnitude
    • May be discrete or continuous
    • Examples: Number of tattoos (discrete), Height (continuous)
    • Not Examples: Rating (0 to 5), Zip Code
  • Categorical: Values are levels/categories from a list
    • May be ordinal or nominal
    • Ordinal Examples: Likert rating (0 to 5), Size (Small, Medium, Large)
    • Nominal Examples: Job Title, College Major, Gender
  • Analysis tools are specific to variable type
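
As a rough illustration of how these data types look in R, here is a small sketch with made-up vectors (all values and names below are hypothetical, chosen only to mirror the examples on the slide):

```r
# Numeric: values have magnitude
tattoos <- c(0L, 2L, 1L, 0L)          # discrete (counts)
height  <- c(62.5, 70.1, 68.0, 65.3)  # continuous

# Categorical, nominal: levels have no inherent order
major <- factor(c("Math", "Biology", "Math", "History"))

# Categorical, ordinal: levels have a meaningful order
size <- factor(c("Small", "Large", "Medium", "Small"),
               levels = c("Small", "Medium", "Large"),
               ordered = TRUE)

mean(height)    # a numeric summary makes sense here
table(major)    # frequencies make sense for categories
size < "Large"  # comparisons are only defined for ordered factors
```

Storing a variable with the right type lets R pick sensible behavior, e.g., `summary()` reports quantiles for numerics and counts for factors.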

Possible Numeric Analyses

  • Common Goals:
    • Summarize distribution: pattern and frequency of variable’s values
    • Communicate summary: tables (now) and graphs (later)
  • Univariate: investigate one variable at a time
    • Numeric data: mean, median, variance, quantiles, skewness, etc.
    • Categorical data: frequency, relative frequency, cumulative freq and rel. freq.
  • Multivariate: investigate joint relationship of variables
    • Numeric: Correlation/covariance
    • Categorical: contingency tables
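
A minimal base-R sketch of each kind of summary listed above. It uses the built-in `mtcars` data rather than the course data, and the object names are illustrative only:

```r
# Univariate, numeric: center, spread, and quantiles of one variable
x <- mtcars$mpg
num_summary <- c(mean = mean(x), median = median(x), variance = var(x))
quantile(x, probs = c(0.05, 0.25, 0.75, 0.95))

# Univariate, categorical: frequency, relative frequency, cumulative relative frequency
freq     <- table(mtcars$cyl)
rel_freq <- prop.table(freq)
cum_rel  <- cumsum(rel_freq)

# Multivariate, numeric: correlation matrix for three measurements
cor_mat <- cor(mtcars[, c("mpg", "hp", "wt")])

# Multivariate, categorical: contingency table (cylinders by transmission type)
two_way <- table(cyl = mtcars$cyl, am = mtcars$am)
two_way
```

The tidyverse pipelines later in these notes compute the same kinds of quantities with a more uniform syntax.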

Know Your Data!

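#install.packages(c("dslabs", "pander"))
library(dslabs)  # gapminder data
library(pander)  # pander() for formatted tables
library(dplyr)   # glimpse() and %>%; already attached if tidyverse is loaded
glimpse(gapminder)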
## Rows: 10,545
## Columns: 9
## $ country          <fct> "Albania", "Algeria", "Angola", "Antigua and Barbuda"…
## $ year             <int> 1960, 1960, 1960, 1960, 1960, 1960, 1960, 1960, 1960,…
## $ infant_mortality <dbl> 115.40, 148.20, 208.00, NA, 59.87, NA, NA, 20.30, 37.…
## $ life_expectancy  <dbl> 62.87, 47.50, 35.98, 62.97, 65.39, 66.86, 65.66, 70.8…
## $ fertility        <dbl> 6.19, 7.65, 7.32, 4.43, 3.11, 4.55, 4.82, 3.45, 2.70,…
## $ population       <dbl> 1636054, 11124892, 5270844, 54681, 20619075, 1867396,…
## $ gdp              <dbl> NA, 13828152297, NA, NA, 108322326649, NA, NA, 966778…
## $ continent        <fct> Europe, Africa, Africa, Americas, Americas, Asia, Ame…
## $ region           <fct> Southern Europe, Northern Africa, Middle Africa, Cari…

Choosing Your Summary Statistics

gapminder %>% 
  select(infant_mortality) %>% 
    summarise(var = "Infant Mortality",
              Avg = mean(infant_mortality, na.rm = TRUE),
              SD = sd(infant_mortality, na.rm = TRUE),
              Med = median(infant_mortality, na.rm = TRUE),
              IQR = IQR(infant_mortality, na.rm = TRUE),
              mid90 = quantile(infant_mortality,0.95, na.rm = TRUE) -
                        quantile(infant_mortality, 0.05, na.rm = TRUE)
             ) %>%
      pander()
var              | Avg   | SD    | Med  | IQR  | mid90
---------------- | ----- | ----- | ---- | ---- | -----
Infant Mortality | 55.31 | 47.73 | 41.5 | 69.1 | 143

Displaying Multiple Univariate Analyses

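var              | Avg   | SD    | Med  | IQR  | mid90
---------------- | ----- | ----- | ---- | ---- | -----
Infant Mortality | 55.31 | 47.73 | 41.5 | 69.1 | 143

var              | Avg   | SD    | Med   | IQR  | mid90
---------------- | ----- | ----- | ----- | ---- | -----
Life Expectancy  | 64.81 | 10.67 | 67.54 | 15.5 | 34.05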

Combining Multiple Univariate Analyses

var              | Avg   | SD    | Med   | IQR  | mid90
---------------- | ----- | ----- | ----- | ---- | -----
Infant Mortality | 55.31 | 47.73 | 41.5  | 69.1 | 143
Life Expectancy  | 64.81 | 10.67 | 67.54 | 15.5 | 34.05
Fertility        | 4.084 | 2.027 | 3.75  | 3.8  | 5.811

Highlights of Summarize/Summarise

Combine multiple analyses using:

  • Built-in analysis functions: var, IQR, quantile, etc.
  • Expressions: e.g., 95th percentile - 5th percentile
  • User-defined functions
  • Constants
  • Essentially, anything that returns a single value

Grouped Analysis

gapminder %>% 
  select(continent, infant_mortality) %>% 
    group_by(continent) %>% 
      summarise(Avg = mean(infant_mortality, na.rm=TRUE), 
                SD = sd(infant_mortality, na.rm=TRUE), 
                Med = median(infant_mortality, na.rm=TRUE), 
                IQR = IQR(infant_mortality, na.rm=TRUE), 
                Mid90 = quantile(infant_mortality,.95,na.rm=TRUE) - 
                          quantile(infant_mortality,.05,na.rm=TRUE))
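continent | Avg  | SD   | Med  | IQR  | Mid90
--------- | ---- | ---- | ---- | ---- | -----
Africa    | 95.1 | 43.9 | 93.4 | 62.5 | 148
Americas  | 42.9 | 34.6 | 30.8 | 39.5 | 110
Asia      | 55.3 | 46.9 | 43.1 | 59   | 146
Europe    | 15.3 | 14.2 | 11.2 | 13.7 | 38.5
Oceania   | 39.1 | 29.1 | 29.1 | 35.9 | 94.7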

Goal: Build a Summary Table

How can we create the following?

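continent | Rank | Count | Relative Frequency | Cumulative Relative Frequency
--------- | ---- | ----- | ------------------ | -----------------------------
Africa    | 1    | 2,907 | 27.6               | 27.6
Americas  | 4    | 2,052 | 19.5               | 47
Asia      | 2    | 2,679 | 25.4               | 72.4
Europe    | 3    | 2,223 | 21.1               | 93.5
Oceania   | 5    | 684   | 6.5                | 100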

Step 1: Count Within Group

gapminder %>%
  group_by(continent) %>%
    summarise(Count = n())
continent | Count
--------- | -----
Africa    | 2907
Americas  | 2052
Asia      | 2679
Europe    | 2223
Oceania   | 684

Step 2: Calculate Relative Frequency Variables

gapminder %>%
  group_by(continent) %>%
    summarise(Count = n()) %>%
        mutate(.rf = Count/sum(Count), #relative freq
               .crf = cumsum(.rf)) #cumulative relative freq
continent | Count | .rf    | .crf
--------- | ----- | ------ | -----
Africa    | 2907  | 0.276  | 0.276
Americas  | 2052  | 0.195  | 0.47
Asia      | 2679  | 0.254  | 0.724
Europe    | 2223  | 0.211  | 0.935
Oceania   | 684   | 0.0649 | 1

Step 3: Determine Row Rank

gapminder %>%
  group_by(continent) %>%
    summarise(Count = n()) %>%
      mutate(.rf = Count/sum(Count),
             .crf = cumsum(.rf),
             Rank = row_number(desc(.rf))) #row rank, descending order
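continent | Count | .rf    | .crf  | Rank
--------- | ----- | ------ | ----- | ----
Africa    | 2907  | 0.276  | 0.276 | 1
Americas  | 2052  | 0.195  | 0.47  | 4
Asia      | 2679  | 0.254  | 0.724 | 2
Europe    | 2223  | 0.211  | 0.935 | 3
Oceania   | 684   | 0.0649 | 1     | 5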

Step 4: Clean Up Derived Columns

#install.packages("scales")
library(scales) # provides comma()
gapminder %>%
  group_by(continent) %>%
    summarise(Count = n()) %>%
      mutate(.rf = Count/sum(Count), .crf = cumsum(.rf), Rank = row_number(desc(.rf))) %>%
        mutate(Count = comma(Count), #format with commas
               `Relative Frequency` = round(100*.rf,1), #note the names!
               `Cumulative Relative Frequency` = round(100*.crf,1)) 
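continent | Count | .rf    | .crf  | Rank | Relative Frequency | Cumulative Relative Frequency
--------- | ----- | ------ | ----- | ---- | ------------------ | -----------------------------
Africa    | 2,907 | 0.276  | 0.276 | 1    | 27.6               | 27.6
Americas  | 2,052 | 0.195  | 0.47  | 4    | 19.5               | 47
Asia      | 2,679 | 0.254  | 0.724 | 2    | 25.4               | 72.4
Europe    | 2,223 | 0.211  | 0.935 | 3    | 21.1               | 93.5
Oceania   | 684   | 0.0649 | 1     | 5    | 6.5                | 100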

Step 5: Select and Order Columns

#install.packages("scales")
library(scales) # provides comma()
gapminder %>%
  group_by(continent) %>%
    summarise(Count = n()) %>%
      mutate(.rf = Count/sum(Count), .crf = cumsum(.rf), Rank = row_number(desc(.rf))) %>%
        mutate(Count = comma(Count), `Relative Frequency` = round(100*.rf,1), `Cumulative Relative Frequency` = round(100*.crf,1)) %>%
          select(continent, Rank, !(starts_with(".")))
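continent | Rank | Count | Relative Frequency | Cumulative Relative Frequency
--------- | ---- | ----- | ------------------ | -----------------------------
Africa    | 1    | 2,907 | 27.6               | 27.6
Americas  | 4    | 2,052 | 19.5               | 47
Asia      | 2    | 2,679 | 25.4               | 72.4
Europe    | 3    | 2,223 | 21.1               | 93.5
Oceania   | 5    | 684   | 6.5                | 100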

Frequency Table- Option #2

#install.packages("clean")
library(clean) # provides freq()
gapminder %>% select(continent) %>% freq()
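item     | count | percent | cum_count | cum_percent
-------- | ----- | ------- | --------- | -----------
Africa   | 2907  | 0.276   | 2907      | 0.276
Asia     | 2679  | 0.254   | 5586      | 0.53
Europe   | 2223  | 0.211   | 7809      | 0.741
Americas | 2052  | 0.195   | 9861      | 0.935
Oceania  | 684   | 0.0649  | 10545     | 1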

Option #1 is More Flexible!

  • tidyverse offers many other summary tools

  • Say we wanted to create the following table

continent | Count | Difference | CumDiff
--------- | ----- | ---------- | ---------
Africa    | 2907  |            | 0
Asia      | 2679  | -228       | -228
Europe    | 2223  | -456       | -684
Americas  | 2052  | -171       | -855
Oceania   | 684   | -1368      | -2.22e+03

Step 1: Group + Summarize + Arrange

  • Count can be used to derive other columns
gapminder %>%
  group_by(continent) %>%
    summarise(Count = n()) %>%
      arrange(desc(Count))
continent | Count
--------- | -----
Africa    | 2907
Asia      | 2679
Europe    | 2223
Americas  | 2052
Oceania   | 684

Step 2: Difference

  • lag(x,n) looks back n entries in x
  • lead(x,n) looks ahead
gapminder %>%
  group_by(continent) %>%
    summarise(Count = n()) %>%
      arrange(desc(Count)) %>% 
        mutate(Difference = Count - lag(Count,1))
continent | Count | Difference
--------- | ----- | ----------
Africa    | 2907  |
Asia      | 2679  | -228
Europe    | 2223  | -456
Americas  | 2052  | -171
Oceania   | 684   | -1368
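
To see what `lag()` and `lead()` do on their own, here is a short sketch on a plain vector holding the continent counts from this slide (the vector name `counts` is illustrative). The calls are written as `dplyr::lag()` because base R also has a `stats::lag()` that behaves differently:

```r
library(dplyr)

counts <- c(2907, 2679, 2223, 2052, 684)  # continent counts, largest first

prev_count <- dplyr::lag(counts, 1)   # previous entry; nothing precedes the first, so NA
next_count <- dplyr::lead(counts, 1)  # next entry; nothing follows the last, so NA
difference <- counts - prev_count     # change from the previous row

prev_count
next_count
difference
```

The leading `NA` in `difference` is why the first row of the Difference column is empty.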

Step 3: Cumulative Difference

  • Uh oh. What happened?
gapminder %>%
  group_by(continent) %>%
    summarise(Count = n()) %>%
      arrange(desc(Count)) %>% 
        mutate(Difference = Count - lag(Count,1),
               CumDiff = cumsum(Difference))
continent | Count | Difference | CumDiff
--------- | ----- | ---------- | -------
Africa    | 2907  |            |
Asia      | 2679  | -228       |
Europe    | 2223  | -456       |
Americas  | 2052  | -171       |
Oceania   | 684   | -1368      |

Aside: cumsum function

  • Need to remove NA entries
  • No na.rm option!
gapminder %>%
  group_by(continent) %>%
    summarise(Count = n()) %>%
      arrange(desc(Count)) %>% 
        mutate(Difference = Count - lag(Count,1),
               CumDiff = cumsum(Difference, na.rm = TRUE))
## Error in `mutate()`:
## ℹ In argument: `CumDiff = cumsum(Difference, na.rm = TRUE)`.
## Caused by error in `cumsum()`:
## ! 2 arguments passed to 'cumsum' which requires 1

Step 3 (again): Try Using a function!

  • na.omit to the rescue?
  • No.
gapminder %>%
  group_by(continent) %>%
    summarise(Count = n()) %>%
      arrange(desc(Count)) %>% 
        mutate(Difference = Count - lag(Count,1),
               CumDiff = cumsum(na.omit(Difference)))
## Error in `mutate()`:
## ℹ In argument: `CumDiff = cumsum(na.omit(Difference))`.
## Caused by error:
## ! `CumDiff` must be size 5 or 1, not 4.

Step 3 (yet again): Using Logic!

  • Replace missing values with zero
gapminder %>%
  group_by(continent) %>%
    summarise(Count = n()) %>%
      arrange(desc(Count)) %>% 
        mutate(Difference = Count - lag(Count,1),
               CumDiff = cumsum(ifelse(is.na(Difference),0,Difference)))
continent | Count | Difference | CumDiff
--------- | ----- | ---------- | ---------
Africa    | 2907  |            | 0
Asia      | 2679  | -228       | -228
Europe    | 2223  | -456       | -684
Americas  | 2052  | -171       | -855
Oceania   | 684   | -1368      | -2.22e+03
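
The three attempts at Step 3 can be reproduced on a plain vector, which makes the cause easier to see. A minimal base-R sketch (the object names are illustrative):

```r
# Difference column from the slides: the first entry is NA because
# lag() has nothing to look back to
difference <- c(NA, -228, -456, -171, -1368)

# Attempt 1: cumsum() propagates NA, so every entry after the first NA is NA
all_na <- cumsum(difference)

# Attempt 2: na.omit() drops the NA, leaving 4 values for a 5-row table
too_short <- cumsum(na.omit(difference))

# Attempt 3: replace NA with 0 first, keeping all 5 positions
fixed <- cumsum(ifelse(is.na(difference), 0, difference))

all_na
length(too_short)
fixed
```

The last value of `fixed` is -2223, which the rendered table displays as -2.22e+03.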

Frequency Table - Option #3

gapminder %>% select(continent) %>% table()
## continent
##   Africa Americas     Asia   Europe  Oceania 
##     2907     2052     2679     2223      684

Quick Recap

  • Many ways to create a frequency table
  • In general, many ways to program everything!
  • Choose best package/function and get it done!
  • tidyverse popular because
    • consistent syntax and approach
    • actively maintained
  • tidyverse not all-inclusive though, branch out!

On To Bivariate Statistics!

  • Two-way tables and beyond

  • Covariance

  • Correlation

    • Pearson
    • Spearman

Two-Way Tables

  • Add TFLcutoff via mutate

  • Use table with two arguments table(row,column)

g<- gapminder %>% mutate(TFLcutoff = ifelse(fertility<=2.1,"At or below", "Exceeds"))
table(g$continent, g$TFLcutoff)
##           
##            At or below Exceeds
##   Africa            34    2822
##   Americas         326    1688
##   Asia             436    2196
##   Europe          1493     691
##   Oceania           64     608
  • Can add more variables to get three-way tables, etc.
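
A quick sketch of a three-way table, using the built-in `mtcars` data instead of gapminder so it runs without extra packages (the name `three_way` is illustrative):

```r
# One argument per variable: cylinders x gears x transmission type
three_way <- table(cyl = mtcars$cyl, gear = mtcars$gear, am = mtcars$am)

dim(three_way)     # one dimension per variable
ftable(three_way)  # flatten the layers into a single printable table
```

`ftable()` is optional; without it R prints one two-way table per level of the third variable.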

Covariance

gapminder[complete.cases(gapminder),] %>% 
  select(infant_mortality:population) %>% 
    cov()
##                  infant_mortality life_expectancy     fertility    population
## infant_mortality       2170.30577      -446.48580  7.892329e+01  8.829752e+06
## life_expectancy        -446.48580       108.41315 -1.734594e+01  9.496002e+06
## fertility                78.92329       -17.34594  4.020522e+00 -2.272713e+07
## population          8829751.70792   9496002.35247 -2.272713e+07  1.380468e+16
  • What is complete.cases for?
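
One way to answer the question: `complete.cases()` flags the rows with no missing values, and `cov()` returns `NA` by default when its input contains `NA`. A small sketch on a made-up data frame (values are hypothetical):

```r
df <- data.frame(x = c(1, 2, NA, 4),
                 y = c(2, NA, 6, 8))

keep <- complete.cases(df)  # TRUE only where every column is non-missing
keep

cov_raw      <- cov(df)          # NA everywhere: both columns contain a missing value
cov_complete <- cov(df[keep, ])  # computed from rows 1 and 4 only

cov_raw
cov_complete
```

`cov()` and `cor()` also accept `use = "complete.obs"`, which drops incomplete rows internally.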

Pearson Correlation

gapminder[complete.cases(gapminder),] %>% 
  select(infant_mortality:population) %>% 
    cor() #Pearson is default method
##                  infant_mortality life_expectancy   fertility   population
## infant_mortality       1.00000000    -0.920462741  0.84489650  0.001613150
## life_expectancy       -0.92046274     1.000000000 -0.83083660  0.007762232
## fertility              0.84489650    -0.830836601  1.00000000 -0.096469514
## population             0.00161315     0.007762232 -0.09646951  1.000000000

Other Correlation Methods

gapminder[complete.cases(gapminder),] %>% 
  select(infant_mortality:population) %>% 
    cor(method = "spearman")
##                  infant_mortality life_expectancy   fertility  population
## infant_mortality       1.00000000     -0.93578029  0.88870740  0.01592765
## life_expectancy       -0.93578029      1.00000000 -0.83688091  0.04472858
## fertility              0.88870740     -0.83688091  1.00000000 -0.09106856
## population             0.01592765      0.04472858 -0.09106856  1.00000000

P-values!

  • What kind of course would this be if we didn’t see at least one p-value?!
#install.packages("Hmisc")
library(Hmisc)
  • This gives access to rcorr function which computes correlations and p-values

Example of rcorr

gapminder[complete.cases(gapminder),] %>% 
  select(infant_mortality:population) %>% as.matrix() %>% rcorr()
##                  infant_mortality life_expectancy fertility population
## infant_mortality             1.00           -0.92      0.84       0.00
## life_expectancy             -0.92            1.00     -0.83       0.01
## fertility                    0.84           -0.83      1.00      -0.10
## population                   0.00            0.01     -0.10       1.00
## 
## n= 7139 
## 
## 
## P
##                  infant_mortality life_expectancy fertility population
## infant_mortality                  0.0000          0.0000    0.8916    
## life_expectancy  0.0000                           0.0000    0.5120    
## fertility        0.0000           0.0000                    0.0000    
## population       0.8916           0.5120          0.0000
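
If only one pair of variables is of interest, base R's `cor.test()` gives a correlation and its p-value without Hmisc. This is an alternative to `rcorr`, not what the slides use, sketched here on the built-in `mtcars` data:

```r
# Pearson correlation between fuel economy and weight, with a p-value
result <- cor.test(mtcars$mpg, mtcars$wt, method = "pearson")

result$estimate  # the correlation coefficient
result$p.value   # p-value for H0: true correlation is zero
```

`rcorr` remains the more convenient choice when a full matrix of correlations and p-values is needed.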

Recap

  • Analyses depend on data type: numeric vs categorical

  • Select (or create!) function to get summary statistic of interest

  • As with most things, multiple ways to get frequencies/contingency tables, numeric summaries, etc.

Activity #3

  1. Use the built-in data frame, iris

  2. Summarize the categorical variable, Species

  3. Compute the five number summary (min, Q1, median, Q3, max) for each of the length variables

  4. Compute the mean and standard deviation for the width variables

  5. Determine the Spearman correlation matrix for the four numeric variables

  6. Advanced #1: Get a two-way table with Species as the column variable and with rows based on whether Sepal.Length<=5.8

  7. Advanced #2: Find all observations that have a Sepal.Length that is more than two standard deviations away from its mean.

Graphics via ggplot2

  • ggplot2 often associated with tidyverse
  • ggplot2 works in layers
    • Use ggplot(data = data_frame) to prepare a plot area (canvas)
    • Add layers using additional code
    • Final graph is produced by displaying all layers

ggplot2 Layers

  • ggplot(data = data_frame) prepares the canvas

  • Layer examples:

    • geoms are geometric objects like bars, histograms, lines, points, text
    • labs controls graph and axis labels
  • Apply/modify layer settings by:

    • using aes to control aesthetics within a layer
    • using additional functions to change how a previous layer is rendered
  • Use facet to build panel of graphs

Goal: Build a Bar Chart

How do we build the following plot?

Step 1: Canvas

ggplot(data = gapminder)

Step 2: Adding aes

ggplot(data = gapminder, aes(x = continent))

Step 3: Add geom_bar Layer

ggplot(data = gapminder, aes(x = continent)) + 
  geom_bar()

Aside

  • Default stat is count, i.e., geom_bar(stat = "count")
    • Often want other stats though
    • I find it easiest to pre-summarize
  • If you’ve pre-summarized to find, say, IQR
    • aes(x = continent, y = IQR)
    • geom_bar(stat = "identity")
  • If you plan to make multiple plots, you can save layers
    • g <- ggplot(data = gapminder) saves just the canvas
    • g <- ggplot(data = gapminder, aes(x=continent))+geom_bar() saves basic bar chart
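
The pre-summarize pattern and the saved-layer pattern can be combined. A minimal sketch on the built-in `mtcars` data rather than gapminder; the names `iqr_by_cyl`, `canvas`, and `bar_chart` are illustrative:

```r
library(ggplot2)

# Pre-summarize: IQR of mpg within each cylinder group (base R aggregate)
iqr_by_cyl <- aggregate(mpg ~ cyl, data = mtcars, FUN = IQR)
names(iqr_by_cyl) <- c("cyl", "IQR")

# Save the canvas and aesthetics once so they can be reused
canvas <- ggplot(data = iqr_by_cyl, aes(x = factor(cyl), y = IQR))

# stat = "identity" plots the y values as given instead of counting rows
bar_chart <- canvas + geom_bar(stat = "identity")
bar_chart
```

`geom_col()` is shorthand for `geom_bar(stat = "identity")`.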

Step 4: Add Labels

ggplot(data = gapminder, aes(x = continent)) + 
  geom_bar() + 
  labs(x = "Continent", y = "Frequency", title = "Absolute Frequency of Responses in Gapminder Data")

Step 5: Adjust Y-Axis Format
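
The code for this step is not shown in these notes. Based on the later slides, which add `scale_y_continuous(labels = comma)`, the step likely resembles the sketch below. It uses the `diamonds` data bundled with ggplot2 as a stand-in for gapminder so that it runs on its own; that substitution is an assumption made for illustration:

```r
library(ggplot2)
library(scales)  # comma() label formatter

# diamonds stands in for gapminder here; only the last layer is the point
p <- ggplot(data = diamonds, aes(x = cut)) +
  geom_bar() +
  labs(x = "Cut", y = "Frequency") +
  scale_y_continuous(labels = comma)  # axis shows 20,000 instead of 20000
p
```

For the course's chart, the same final layer would be appended to the gapminder bar chart from Step 4.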

Recap

  • ggplot for canvas

  • geom_bar for graph of choice

  • labs for labels

  • scale_*_* to adjust axis elements

    • scale_y_continuous makes changes to a y axis with continuous variable

Building a Grouped Bar Chart

How do we update our previous bar chart to look like this?

Step 1: Remove Records with fertility = NA

g2 <- gapminder %>% subset(is.na(fertility) == FALSE)

Step 2: Add fill to aes

g2 <- gapminder %>% subset(is.na(fertility) == FALSE)
ggplot(data = g2, aes(x = continent, fill = (fertility<2.1))) + 
  geom_bar() +
  labs(x = "Continent", y = "Frequency", 
       title = "Absolute Frequency of Responses in Gapminder Data") + 
  scale_y_continuous(labels = comma) 

Step 3: Stacked to Side-by-Side via position=

g2 <- gapminder %>% subset(is.na(fertility) == FALSE)
ggplot(data = g2, aes(x = continent, fill = (fertility<2.1))) + 
  geom_bar(position = "dodge") +
  labs(x = "Continent", y = "Frequency", 
       title = "Absolute Frequency of Responses in Gapminder Data") + 
  scale_y_continuous(labels = comma)

Step 4: Add Subtitle and Fix Legend

g2 <- gapminder %>% subset(is.na(fertility) == FALSE)
ggplot(data = g2, aes(x = continent, fill = (fertility<2.1))) + 
  geom_bar(position = "dodge") +
  labs(x = "Continent", y = "Frequency", 
       title = "Absolute Frequency of Responses in Gapminder Data", 
       subtitle = "Grouped by Total Fertility Rate < 2.1") + 
  scale_y_continuous(labels = comma) + 
  scale_fill_discrete(name = "TFR<2.1", labels = c("No", "Yes"))

Faceting Your Graph

g2 <- gapminder %>% subset(is.na(fertility) == FALSE & year > 2011)
ggplot(data = g2, aes(x = continent, fill = (fertility<2.1))) + 
  geom_bar(position = "dodge") +
  labs(x = "Continent", y = "Frequency", 
       title = "Absolute Frequency of Responses in Gapminder Data", 
       subtitle = "Grouped by Total Fertility Rate < 2.1") + 
  scale_y_continuous(labels = comma) + 
  scale_fill_discrete(name = "TFR<2.1", labels = c("No", "Yes")) + 
  facet_wrap(~year)

Syntax of facet

  • facet_wrap(~var)
    • group data using levels of var
    • separate plot for each group
    • common axes
    • nrow and ncol options let you specify the number of rows and columns in the panel layout
  • facet_grid(var1 ~ var2)
    • levels of var1 create rows
    • levels of var2 create columns
    • force single row/column with . ~ var2 or var1 ~ .
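
A short sketch of the two facet functions side by side, again using mpg from ggplot2 as a stand-in data set (an assumption); base, p_wrap, p_grid, and p_row are illustrative names.

```r
library(ggplot2)

base <- ggplot(data = mpg, aes(x = class)) + geom_bar()

# facet_wrap: one panel per level of year, stacked in a single column
p_wrap <- base + facet_wrap(~year, ncol = 1)

# facet_grid: levels of drv form rows, levels of year form columns
p_grid <- base + facet_grid(drv ~ year)

# Force a single row of panels
p_row <- base + facet_grid(. ~ year)

p_wrap
p_grid
p_row
```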

Using facet_grid

  • Note: I do not advocate this as a good graph!
g3 <- gapminder %>% subset(is.na(fertility) == FALSE & year > 2013)
ggplot(data = g3, aes(x = continent, fill = (fertility<2.1))) + 
  geom_bar(position = "fill") +
  facet_grid((population > 1000000)~year)

Graphing Continuous Data

  • Several common univariate graphs
    • Histogram
    • Density
    • Boxplot
  • New functions, but same approach!

Creating a Histogram

c <- ggplot(data = gapminder, aes(x = fertility))
c + geom_histogram(na.rm = TRUE)

Modifications

  • I think R histogram defaults are ugly! (Even if you like them, you often need to change some settings!)
    • binwidth: width of bins
    • fill: fill color
    • color: outline color
    • linetype: dotted, dashed, etc.
    • linewidth: thickness of the outline (size in ggplot2 versions before 3.4.0)
    • alpha: transparency of fill
  • Other options available for even more customizations!

Modifying our Histogram

  • Please never make a graph that looks like this.
c + geom_histogram(color = "blue", linetype = 2, linewidth = 0.75,
                   fill = "#FF0000", alpha = 0.5, 
                   binwidth = 0.25, na.rm = TRUE)

Adding Densities

  • Densities added with geom_density
  • Use kernel= option to select density type
  • Some new options, but mostly same as geom_histogram
    • adjust controls how much smoothing
  • Kernel smoother just smooths out the boxes of a histogram
  • How exactly they smooth is beyond our scope for today!
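
To get a feel for adjust without drawing anything, the sketch below uses base R's density(), which follows the same convention: adjust multiplies the default bandwidth. The faithful data set is a stand-in for the fertility column (an assumption).

```r
# Built-in numeric vector used in place of gapminder$fertility
x <- faithful$eruptions

d_default <- density(x, kernel = "gaussian")                # adjust = 1
d_rough   <- density(x, kernel = "gaussian", adjust = 0.1)  # less smoothing
d_smooth  <- density(x, kernel = "gaussian", adjust = 10)   # more smoothing

# Compare the bandwidths actually used
c(default = d_default$bw, rough = d_rough$bw, smooth = d_smooth$bw)
```

A small bandwidth follows every bump in the data, while a large one flattens the curve toward a single hump.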

First, Just the Density

ggplot(gapminder, aes(x=fertility)) + 
  geom_density(kernel = "gaussian", na.rm = TRUE)

Testing out adjust

ggplot(gapminder, aes(x=fertility)) + 
  geom_density(kernel = "gaussian", 
               adjust = .1, 
               na.rm = TRUE)

ggplot(gapminder, aes(x=fertility)) + 
  geom_density(kernel = "gaussian", 
               adjust = 10, 
               na.rm = TRUE)

Histogram + (New) Density

ggplot(gapminder, aes(x=fertility)) + 
  geom_histogram(aes(y = after_stat(density)), alpha = 0.5, 
                   binwidth = 0.25, na.rm = TRUE) + 
  geom_density(kernel = "triangular", linewidth = 1.25, color = "red", na.rm = TRUE)

Grouping with Histograms

  • Default is position = "stack" which is … problematic
g4 <- gapminder %>% subset(is.na(fertility)==FALSE & year > 2013)
ggplot(g4, aes(x=fertility, fill=factor(year))) + 
  geom_histogram(alpha=.5)

Grouping with Densities

  • No more stacking!
  • Heights make sense, but patterns hard to discern
g4 <- gapminder %>% subset(is.na(fertility)==FALSE & year > 2013)
ggplot(g4, aes(x=fertility, fill=factor(year))) + 
  geom_density(alpha=.5)

Grouping with Overlain Histograms and Densities

  • Oh no. Histogram uses "stack" but density uses "identity"
g4 <- gapminder %>% subset(is.na(fertility)==FALSE & year > 2013)
ggplot(g4, aes(x=fertility, fill=as.factor(year))) + 
  geom_histogram(aes(y = after_stat(density)), bins = 50) + 
  geom_density(alpha=.5)  

“Proper” Grouping

  • Match the position across layers
g4 <- gapminder %>% subset(is.na(fertility)==FALSE & year > 2013)
ggplot(g4, aes(x=fertility, fill=factor(year))) + 
  geom_histogram(aes(y = after_stat(density)), bins = 50) + 
  geom_density(alpha=.5, position = "stack")  
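
Stacking both layers is one way to make them agree. Another option, different from the slide's choice, is to set both to position = "identity" and rely on transparency, so each group's histogram and density are drawn from the same baseline. A sketch with mpg from ggplot2 as stand-in data (an assumption); p_overlay is an illustrative name.

```r
library(ggplot2)

p_overlay <- ggplot(mpg, aes(x = hwy, fill = factor(year))) +
  # Both layers use "identity", so groups overlap instead of stacking
  geom_histogram(aes(y = after_stat(density)), bins = 30,
                 alpha = 0.4, position = "identity") +
  geom_density(alpha = 0.4, position = "identity") +
  labs(fill = "Year")
p_overlay
```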

Box Plot

  • Box plot is fairly straightforward
    • Options like alpha and fill as well as lower, middle, and upper
g5 <- gapminder %>% subset(is.na(fertility)==FALSE & year > 2009)
c <- ggplot(g5, aes(x=factor(year), y = fertility))
c + geom_boxplot(fill="grey") + 
      labs(x="Year")
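
The lower, middle, and upper options come into play when the box statistics are computed ahead of time. A sketch under that reading, using mpg from ggplot2 as stand-in data; box_stats and p_box are hypothetical names, and the whiskers here run to the minimum and maximum rather than following the default 1.5 × IQR rule.

```r
library(dplyr)
library(ggplot2)

# Pre-compute the five numbers each box needs, one row per year
box_stats <- mpg %>%
  group_by(year) %>%
  summarize(ymin   = min(hwy),
            lower  = quantile(hwy, 0.25),
            middle = median(hwy),
            upper  = quantile(hwy, 0.75),
            ymax   = max(hwy))

# stat = "identity" draws the boxes from the supplied values
p_box <- ggplot(box_stats, aes(x = factor(year))) +
  geom_boxplot(aes(ymin = ymin, lower = lower, middle = middle,
                   upper = upper, ymax = ymax),
               stat = "identity", fill = "grey", alpha = 0.5) +
  labs(x = "Year")
p_box
```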

Box Plot + Points

  • Overlay points with geom_jitter by adding this layer second
g5 <- gapminder %>% subset(is.na(fertility)==FALSE & year > 2009)
c <- ggplot(g5, aes(x=factor(year), y = fertility))
c + geom_boxplot(fill="grey") + 
      geom_jitter(width=.15, color = "blue") + labs(x="Year")

Scatter

  • Great for joint inspection of numeric variables
g6 <- gapminder %>% subset(is.na(fertility) == FALSE & 
                           is.na(infant_mortality) == FALSE &
                           year > 2012)
ggplot(g6, aes(x=fertility, y=infant_mortality)) + 
  geom_point(color="blue", shape = 3, size=2, stroke = .5)

Grouping: Code

  • Use aes to set data-driven aesthetics
  • color and shape now cycle based on data
ggplot(g6, aes(x=fertility, y=infant_mortality)) + 
  geom_point(size=2, stroke = .5, aes(shape = factor(year), color = continent)) + 
  labs(x="Total Fertility Rate", y = "Infant Mortality Rate") + 
  scale_shape_discrete(name = "Year", labels = c("'13", "'14", "'15")) + 
  scale_color_discrete(name = "Continent")

Scatter + Trend: Code

  • Choosing the right model is stats, not programming! :D

  • Use col to name the models for easy referencing!

    • Careful – those names are the default labels!
ggplot(g6, aes(x = fertility, y = infant_mortality)) +
  geom_point() +  
  geom_smooth(aes(col = "loess")) +
  geom_smooth(method = lm, aes(col = "bob")) + 
  scale_colour_manual(name = 'Smoother', 
                      values =c('bob'='red', 'loess'='purple'), 
                      labels = c('Linear','LOESS'), guide = 'legend')
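
When geom_smooth is called without method, ggplot2 picks the smoother based on the size of the data (loess for fewer than 1,000 observations, otherwise a GAM), so the legend label can end up not matching what was fitted. A sketch that names both methods explicitly and pins the legend order with breaks, using mpg from ggplot2 as stand-in data (an assumption); p_trend is an illustrative name.

```r
library(ggplot2)

p_trend <- ggplot(mpg, aes(x = displ, y = hwy)) +
  geom_point() +
  # State the method so the legend label matches the fit
  geom_smooth(method = "loess", formula = y ~ x, aes(col = "loess")) +
  geom_smooth(method = "lm", formula = y ~ x, aes(col = "linear")) +
  scale_colour_manual(name = "Smoother",
                      breaks = c("linear", "loess"),   # fixes the order
                      values = c(linear = "red", loess = "purple"),
                      labels = c("Linear", "LOESS"))   # same order as breaks
p_trend
```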

Recap

  • ggplot2 just one way to do graphics

  • Works by adding layers and changing aesthetics

    • Order of layers very important!
  • General plan: canvas + graph + labels/legends/etc.

  • Graphs can be quite tedious, so plan ahead!
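
A quick way to see why layer order matters: the same two layers added in opposite order give different pictures, because later layers are drawn on top of earlier ones. A sketch with mpg from ggplot2 as stand-in data (an assumption); all object names are illustrative.

```r
library(ggplot2)

base <- ggplot(mpg, aes(x = class, y = hwy))

# Points added second are drawn on top of the boxes
points_on_top <- base + geom_boxplot(fill = "grey") + geom_jitter(width = 0.15)

# Boxes added second cover many of the points
boxes_on_top <- base + geom_jitter(width = 0.15) + geom_boxplot(fill = "grey")

# Layers are stored in the order they were added
order_a <- sapply(points_on_top$layers, function(l) class(l$geom)[1])
order_b <- sapply(boxes_on_top$layers, function(l) class(l$geom)[1])
order_a
order_b
```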

Activity #4

  • Produce the following three graphs
    • Graph 1 uses the starwars data set from the tidyverse
    • Graphs 2 and 3 use the iris data set from base R

Activity 4 - Graph 2

Activity 4 - Graph 3

The End

  • Thank you!

  • Questions?