Exploratory Data Analysis Using R Markdown

August 9, 2023

Course Schedule

Daily agenda:

9:30 - 10:35 Morning Session #1
10:35 - 10:50 Morning Break
10:50 - 11:55 Morning Session #2
11:55 - 1:10 Lunch
1:10 - 2:15 Afternoon Session #1
2:15 - 2:30 Afternoon Break
2:30 - 3:35 Afternoon Session #2
3:35 - 3:40 Feedback/Q&A

Outline

R Markdown Basics
R Markdown Options: Part 1
R Markdown Options: Part 2
R Markdown with Numerical Summaries
R Markdown with Graphics

Data Analysis Project Overview

Prepare data
- Read
- Manipulate
- Reshape
Explore data
- Univariate Summaries
- Multivariate summaries
- Model

Documentation
- Explain process
- Explain findings
Usability/reproducibility
- Collocate code/output/notes

RStudio - Project Feature

Often working for multiple clients/on multiple projects at one time
Takes time to manage data/programs/analyses/results separately

Project in RStudio: way to divide work into multiple silos. Each with own:
- Working directory
- Workspace
- History
- Source documents

RStudio - Project

Easy to create!

Can save workspace, etc. and pick up right where you left off!
Let’s create one for today!

R Markdown Basics

Integrate data management, analysis, documentation, results

Doesn’t matter how great your analysis is unless you can explain it to others :)

Need to communicate results effectively!

(Traditional) Basic Use of R
- Write a script with all code
- Execute code via console
- Outside R: present results

Using a Notebook Instead

Notebook tools allow
- Multiple languages (R, Python, Java, C, etc.)
- Multiple output styles (PDF, HTML, Presentation, etc.)
- Collocated code/analysis/results

You may have heard of Jupyter notebooks

R Markdown - built-in notebook for RStudio

RMarkdown and (Data) Science

As scientist/consultant I tend to interface with three “groups”

Clients/subject-matter experts: non-stats/non-data science, so they need conclusions, not code
Colleagues: Code, results, process all equally important
Past/Present/Future Me: Documentation, documentation, documentation

Examples of markdown documents

R Markdown Basics: Vocabulary

HTML (HyperText Mark-up Language) is common concept
- Plain text that browser (e.g., Chrome) interprets and renders
- Flat file with .html extension

RMarkdown is a specific markup language
- Easier syntax
- Not as powerful
- Flat file with .Rmd extension

R Markdown Basics: File Contents

Markdown is not R-specific: just converts plain text to HTML

R Markdown just implements Markdown language in R
- Header that sets parameters for the Rmarkdown file

Plain text chunks

Code chunks
- Separate from other text with ```

Comments
- <!-- This is how you write a comment -->

Mix in HTML/CSS too if you know that!

R Markdown Basics: Creating an .Rmd

RStudio makes it easy!

R Markdown Basics: Output Type

Commonly used document types can be created

R Markdown Basics: Presentations

Slide presentations

R Markdown Basics: Header

Top section: YAML header

---
title: "Untitled"
author: "Jonathan W. Duggins"
date: "August 10, 2022"
output: html_document
---

Define settings for document

Author, Title, etc.

Output type/Options

R Markdown Basics: Code Chunk

Everything else (set-up code, Markdown, R code, plain text) comes after the YAML header

Start code chunk with ```{r} or CTRL/CMD + Alt + I

Can specify options on individual code chunks
- toyData is title of chunk
- echo = TRUE prints chunk contents (code) to output
- Many other options available

R Markdown Basics: Text + Markdown Chunk

### R Markdown

This is an R Markdown document. Markdown is a simple formatting syntax
for authoring HTML, PDF, and MS Word documents. For more details on
using R Markdown see [RStudio](http://rmarkdown.rstudio.com) at <http://rmarkdown.rstudio.com>.

When you click the **Knit** button a document will be generated that
includes _both_ content as well as the output of any embedded R code
chunks within the document.

On execution:
- ### becomes a level 3 header
- [](...) becomes hidden link or <...> becomes visible link
- **Knit** or __Knit__ is in bold
- *both* or _both_ is in italics

R Markdown Basics

R Markdown

This is an R Markdown document. Markdown is a simple formatting syntax for authoring HTML, PDF, and MS Word documents. For more details on using R Markdown see RStudio at http://rmarkdown.rstudio.com.

When you click the Knit button a document will be generated that includes both content as well as the output of any embedded R code chunks within the document.

Where do we go from here?

Expand knowledge of Markdown syntax

Look at “Notebook” feature

Check options for code chunks

Change type of output

Mix in numerical and graphical summaries as we go!

R Markdown Syntax Can Include…

Plain text
End a line with two spaces to start a new line (leave a blank line to start a new paragraph)
- Line breaks are not always added when you return!
- Use two spaces and a return!
- Can specify <br> to get line break like HTML

*italics* and _italics_

**bold** and __bold__

superscript^2^ becomes superscript²

~~strikethrough~~ becomes ~~strikethrough~~

Links, Headers, and Code, oh my!

[link](https://www.rstudio.com/) becomes link

# Header 1 becomes a large font header

## Header 2 is slightly smaller header (up to 6 levels!)

Use of headers can automatically create a Table of Contents!

`code` becomes code

Lists!

Can do lists: be sure to end each line with two spaces!
- Indent sub-lists four spaces

* unordered list  
* item 2  

    + sub-item 1  
    + sub-item 2  

1. ordered list  
1. item 2  

    + sub-item 1  
    + sub-item 2

unordered list
item 2
- sub-item 1
- sub-item 2

ordered list
item 2
- sub-item 1
- sub-item 2

A Bit of HTML: Code

Some HTML can be helpful!

<div style = "float: left; width: 50%">

- unordered list

</div>

<div style = "float: right; width: 50%">

1. ordered list

</div>

A Bit of HTML: Results

Some HTML can be helpful!

unordered list

ordered list

Text Tables

Table Header | Second Header | Col 3
------------- | ------------- | -----------
Table Cell | Cell (1, 2) | Cell (1, 3)
Cell (2, 1) | Cell (2, 2) | Cell (2, 3)

Table Header	Second Header	Col 3
Table Cell	Cell (1, 2)	Cell (1, 3)
Cell (2, 1)	Cell (2, 2)	Cell (2, 3)

Activity #1

Open the Activity1StarterCode.Rmd file from our resources
Use the RMarkdown syntax we’ve learned so far to edit the document so that it produces the output shown in EdaActivity1.html

Including Code Chunks & Inline Code

We’ve already seen how to include an R code chunk:

Add chunk via Ctrl/Cmd + Alt + I or by typing
```{r}
code
```

Can use code inline: e.g., ToothGrowth has 60 observations

Begin with `r, enter code, & end with another back-tick
- ToothGrowth has `r nrow(ToothGrowth)` observations
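
Before placing an expression inline, it can help to run it in the console and confirm it returns a single value. This is a minimal sketch using the built-in ToothGrowth data frame; the variable name n_obs is only for illustration.

```r
# Count the rows (observations) in the built-in ToothGrowth data frame
n_obs <- nrow(ToothGrowth)
n_obs  # 60

# Inline code works best with expressions like this one that return
# a single value, which knitr pastes into the sentence when knitting
```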

Notebook Functionality: Single Code Chunk

Execute code with Cmd/Ctrl + Shift + Enter or with “Run”

Results show up in editor!

Notebook Functionality: Across Chunks

Allows for quick iteration within a chunk: editing and re-executing - when you are happy, you move on and start a new chunk.

Can run all code chunks with Ctrl/Cmd + Alt + R

Can develop code and record your thoughts - similar to classic lab notebook in the physical sciences

Code Chunk Options

Many options depending on chunk purpose!

Hide/show code with echo = FALSE/TRUE

Evaluate with eval = TRUE/FALSE

include = FALSE runs the code but hides both the code and its results (echo = FALSE alone still shows the results)

message = TRUE/FALSE and warning = TRUE/FALSE can turn on/off displaying messages/warnings

error = TRUE allows file to be created with code that has an error

Set-Up Code

Chunk immediately following YAML; e.g.,

```{r setup, include=FALSE}
knitr::opts_chunk$set(echo = FALSE)
library(tidyverse)
library(huxtable)
```

Load any packages you need

Can set global options for all chunks

Local option supersedes global option
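
A minimal sketch of how the global defaults behave, assuming the knitr package is installed. opts_chunk$set() changes the default applied to later chunks and opts_chunk$get() reads a default back; a chunk whose own header sets echo = TRUE would still show its code regardless of the global value.

```r
library(knitr)

# Set global defaults, as a setup chunk would
opts_chunk$set(echo = FALSE, message = FALSE)

# Read the current defaults back
opts_chunk$get("echo")     # FALSE
opts_chunk$get("message")  # FALSE
```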

Caching Code Chunks

In a large analysis it may take a long time to run code chunks/knit your document

Can “Cache” results! Code will only rerun if it has changed.

Use cache = TRUE in code chunk definition

Delete the cache folders knitr creates to force everything to rerun

Adding Images

Adding images in markdown: ![](path/to/file)

Not ideal… difficult to control size/scale

Better way to add images – use R function!

knitr package has include_graphics function
- (It’s what I used to include all those screenshots you’ve seen today!)

Use knitr or code chunk options to control size/scale!

Ex:
```{r graphics, out.width = "800px", echo = FALSE}
knitr::include_graphics("path/to/file")
```

Adding Equations

Inline equation: $A = \pi*r^{2}$ becomes $A = \pi*r^{2}$

Block equation $$A = \pi*r^{2}$$ becomes \[A = \pi*r^{2}\]

Outputting equations for HTML is done through MathJax (JavaScript)

For PDFs it is done through LaTeX (may need to install)

Tables with `huxtable`

Data tables generally look a little underwhelming for presentations…

summary(cars)

##      speed           dist       
##  Min.   : 4.0   Min.   :  2.00  
##  1st Qu.:12.0   1st Qu.: 26.00  
##  Median :15.0   Median : 36.00  
##  Mean   :15.4   Mean   : 42.98  
##  3rd Qu.:19.0   3rd Qu.: 56.00  
##  Max.   :25.0   Max.   :120.00

Tables with `huxtable`

But look better with help from packages like kableExtra or huxtable!

as_hux(summary(cars), add_rownames = FALSE) %>%
  set_bold(row = 1, col = everywhere) %>%
  set_font_size(row = 1, col = everywhere, value = 24) %>%
  set_all_padding(1) %>%
  theme_striped()

speed	dist
Min. : 4.0	Min. : 2.00
1st Qu.:12.0	1st Qu.: 26.00
Median :15.0	Median : 36.00
Mean :15.4	Mean : 42.98
3rd Qu.:19.0	3rd Qu.: 56.00
Max. :25.0	Max. :120.00

Selecting an Output Type

Change output type in YAML and use CTRL+SHIFT+K to knit to declared type

output: html_document

Use code explicitly:

rmarkdown::render("file.Rmd", output_format = "word_document")

Use Knit menu:

HTML Output Options

For HTML, you can include Table of Contents with options

output:
  html_document:
    toc: true
    toc_float: true

For html_documents, another option is to make the code chunks hidden by default, but visible with a click:

output:
  html_document:
    code_folding: hide

Common Outputs

Word

output: word_document

PDF

output: pdf_document

PDF typically done with LaTeX (beyond scope for today)
- Many PDF options

Producing Presentations

Presentations/Slides (new slides with ##)

output: ioslides_presentation - HTML presentation

output: slidy_presentation - HTML presentation

output: beamer_presentation - PDF presentation with LaTeX Beamer

output: powerpoint_presentation - PowerPoint

Interactivity with Leaflet

HTML documents inherently interactive

Widgets can be included

library(leaflet)
leaflet() %>%
  setView(-87.6553,41.9485 , zoom = 16) %>% 
  addTiles() %>%
  addMarkers(-87.6553,41.9485 , popup = "Wrigley!")

Interactivity with DataTables

Interactive tables with DT library

library(DT)
datatable(ToothGrowth)

Interactivity with JavaScript

3D scatterplots with the threejs package

Interactivity Summary

Previous interactivity happened in the browser

Great because anyone can access with a browser

Bad because you can’t have as much functionality as you want…

Shiny allows for interactivity with R!
- Major con: Requires R running somewhere
- R Shiny beyond scope of this course
- Covered in other sessions later in the week

Recap

RMarkdown combines two languages: R + Markdown

Power of computing with R

Flexibility of document creation with Markdown

Options/best practices vary by output type

My approach: select output type, create/refine/document R chunks, add in aesthetics last!

Activity #2

Open Activity2StarterCode.Rmd
Modify it to produce output you see in EdaActivity2.html

Types of Data

Numeric: Values are numbers with magnitude
- May be discrete or continuous
- Examples: Number of tattoos (discrete), Height (continuous)
- Not Examples: Rating (0 to 5), Zip Code
Categorical: Values are levels/categories from a list
- May be ordinal or nominal
- Ordinal Examples: Likert rating (0 to 5), Size (Small, Medium, Large)
- Nominal Examples: Job Title, College Major, Gender
Analysis tools are specific to variable type
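
As a rough illustration of how these types might be stored in R, the sketch below uses made-up values (every vector here is hypothetical). Numeric data maps to integer or double vectors, nominal data to factors, and ordinal data to ordered factors.

```r
# Numeric: discrete counts and continuous measurements
tattoos <- c(0L, 2L, 1L)           # discrete
height  <- c(160.5, 172.3, 181.0)  # continuous

# Categorical, nominal: no natural order among the levels
major <- factor(c("Statistics", "Biology", "Statistics"))

# Categorical, ordinal: levels have a meaningful order
size <- factor(c("Small", "Large", "Medium"),
               levels = c("Small", "Medium", "Large"),
               ordered = TRUE)

# Zip codes look numeric but carry no magnitude, so store them as text
zip <- c("00001", "00002", "00003")

is.numeric(height)  # TRUE
is.ordered(size)    # TRUE
is.ordered(major)   # FALSE
size < "Large"      # TRUE FALSE TRUE
```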

Possible Numeric Analyses

Common Goals:
- Summarize distribution: pattern and frequency of variable’s values
- Communicate summary: tables (now) and graphs (later)
Univariate: investigate one variable at a time
- Numeric data: mean, median, variance, quantiles, skewness, etc.
- Categorical data: frequency, relative frequency, cumulative freq and rel. freq.
Multivariate: investigate joint relationship of variables
- Numeric: Correlation/covariance
- Categorical: contingency tables

Know Your Data!

#install.packages(c("dslabs", "pander"))
library(dslabs)
library(pander)
glimpse(gapminder)

## Rows: 10,545
## Columns: 9
## $ country          <fct> "Albania", "Algeria", "Angola", "Antigua and Barbuda"…
## $ year             <int> 1960, 1960, 1960, 1960, 1960, 1960, 1960, 1960, 1960,…
## $ infant_mortality <dbl> 115.40, 148.20, 208.00, NA, 59.87, NA, NA, 20.30, 37.…
## $ life_expectancy  <dbl> 62.87, 47.50, 35.98, 62.97, 65.39, 66.86, 65.66, 70.8…
## $ fertility        <dbl> 6.19, 7.65, 7.32, 4.43, 3.11, 4.55, 4.82, 3.45, 2.70,…
## $ population       <dbl> 1636054, 11124892, 5270844, 54681, 20619075, 1867396,…
## $ gdp              <dbl> NA, 13828152297, NA, NA, 108322326649, NA, NA, 966778…
## $ continent        <fct> Europe, Africa, Africa, Americas, Americas, Asia, Ame…
## $ region           <fct> Southern Europe, Northern Africa, Middle Africa, Cari…

Choosing Your Summary Statistics

gapminder %>% 
  select(infant_mortality) %>% 
    summarise(var = "Infant Mortality",
              Avg = mean(infant_mortality, na.rm = TRUE),
              SD = sd(infant_mortality, na.rm = TRUE),
              Med = median(infant_mortality, na.rm = TRUE),
              IQR = IQR(infant_mortality, na.rm = TRUE),
              mid90 = quantile(infant_mortality,0.95, na.rm = TRUE) -
                        quantile(infant_mortality, 0.05, na.rm = TRUE)
             ) %>%
      pander()

var	Avg	SD	Med	IQR	mid90
Infant Mortality	55.31	47.73	41.5	69.1	143

Displaying Multiple Univariate Analyses

var	Avg	SD	Med	IQR	mid90
Infant Mortality	55.31	47.73	41.5	69.1	143

var	Avg	SD	Med	IQR	mid90
Life Expectancy	64.81	10.67	67.54	15.5	34.05

Combining Multiple Univariate Analyses

var	Avg	SD	Med	IQR	mid90
Infant Mortality	55.31	47.73	41.5	69.1	143
Life Expectancy	64.81	10.67	67.54	15.5	34.05
Fertility	4.084	2.027	3.75	3.8	5.811

Highlights of Summarize/Summarise

Combine multiple analyses using:

Built-in analysis functions: var, IQR, quantile, etc.

Expressions: e.g., 95th percentile - 5th percentile

User-defined functions

Constants

Essentially, anything that returns a single value
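
The four kinds of input listed above can appear together in a single summarise() call. This is a minimal sketch, assuming dplyr is installed; it uses the built-in ToothGrowth data so it runs on its own, and the helper mid90() is a hypothetical name introduced only for this example.

```r
library(dplyr)

# Hypothetical helper: width of the middle 90% of a numeric vector
mid90 <- function(x) {
  unname(quantile(x, 0.95, na.rm = TRUE) - quantile(x, 0.05, na.rm = TRUE))
}

tooth_summary <- ToothGrowth %>%
  summarise(var   = "Tooth Length",           # constant
            Avg   = mean(len, na.rm = TRUE),  # built-in function
            Range = max(len) - min(len),      # expression
            Mid90 = mid90(len))               # user-defined function

tooth_summary
```

Each argument returns a single value, so the result is a one-row data frame.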

Grouped Analysis

gapminder %>% 
  select(continent, infant_mortality) %>% 
    group_by(continent) %>% 
      summarise(Avg = mean(infant_mortality, na.rm=TRUE), 
                SD = sd(infant_mortality, na.rm=TRUE), 
                Med = median(infant_mortality, na.rm=TRUE), 
                IQR = IQR(infant_mortality, na.rm=TRUE), 
                Mid90 = quantile(infant_mortality,.95,na.rm=TRUE) - 
                          quantile(infant_mortality,.05,na.rm=TRUE))

continent	Avg	SD	Med	IQR	Mid90
Africa	95.1	43.9	93.4	62.5	148
Americas	42.9	34.6	30.8	39.5	110
Asia	55.3	46.9	43.1	59	146
Europe	15.3	14.2	11.2	13.7	38.5
Oceania	39.1	29.1	29.1	35.9	94.7

Goal: Build a Summary Table

How can we create the following?

continent	Rank	Count	Relative Frequency	Cumulative Relative Frequency
Africa	1	2,907	27.6	27.6
Americas	4	2,052	19.5	47
Asia	2	2,679	25.4	72.4
Europe	3	2,223	21.1	93.5
Oceania	5	684	6.5	100

Step 1: Count Within Group

gapminder %>%
  group_by(continent) %>%
    summarise(Count = n())

continent	Count
Africa	2907
Americas	2052
Asia	2679
Europe	2223
Oceania	684

Step 2: Calculate Relative Frequency Variables

gapminder %>%
  group_by(continent) %>%
    summarise(Count = n()) %>%
        mutate(.rf = Count/sum(Count), #relative freq
               .crf = cumsum(.rf)) #cumulative relative freq

continent	Count	.rf	.crf
Africa	2907	0.276	0.276
Americas	2052	0.195	0.47
Asia	2679	0.254	0.724
Europe	2223	0.211	0.935
Oceania	684	0.0649	1

Step 3: Determine Row Rank

gapminder %>%
  group_by(continent) %>%
    summarise(Count = n()) %>%
      mutate(.rf = Count/sum(Count),
             .crf = cumsum(.rf),
             Rank = row_number(desc(.rf))) #row rank, descending order

continent	Count	.rf	.crf	Rank
Africa	2907	0.276	0.276	1
Americas	2052	0.195	0.47	4
Asia	2679	0.254	0.724	2
Europe	2223	0.211	0.935	3
Oceania	684	0.0649	1	5

Step 4: Clean Up Derived Columns

#install.packages("scales")
library(scales) #provides comma()
gapminder %>%
  group_by(continent) %>%
    summarise(Count = n()) %>%
      mutate(.rf = Count/sum(Count), .crf = cumsum(.rf), Rank = row_number(desc(.rf))) %>%
        mutate(Count = comma(Count), #format with commas
               `Relative Frequency` = round(100*.rf,1), #note the names!
               `Cumulative Relative Frequency` = round(100*.crf,1))

continent	Count	.rf	.crf	Rank	Relative Frequency	Cumulative Relative Frequency
Africa	2,907	0.276	0.276	1	27.6	27.6
Americas	2,052	0.195	0.47	4	19.5	47
Asia	2,679	0.254	0.724	2	25.4	72.4
Europe	2,223	0.211	0.935	3	21.1	93.5
Oceania	684	0.0649	1	5	6.5	100

Step 5: Select and Order Columns

#install.packages("scales")
library(scales) #provides comma()
gapminder %>%
  group_by(continent) %>%
    summarise(Count = n()) %>%
      mutate(.rf = Count/sum(Count), .crf = cumsum(.rf), Rank = row_number(desc(.rf))) %>%
        mutate(Count = comma(Count), `Relative Frequency` = round(100*.rf,1), `Cumulative Relative Frequency` = round(100*.crf,1)) %>%
          select(continent, Rank, !(starts_with(".")))

continent	Rank	Count	Relative Frequency	Cumulative Relative Frequency
Africa	1	2,907	27.6	27.6
Americas	4	2,052	19.5	47
Asia	2	2,679	25.4	72.4
Europe	3	2,223	21.1	93.5
Oceania	5	684	6.5	100

Frequency Table - Option #2

#install.packages("clean")
library(clean) #provides freq()
gapminder %>% select(continent) %>% freq()

item	count	percent	cum_count	cum_percent
Africa	2907	0.276	2907	0.276
Asia	2679	0.254	5586	0.53
Europe	2223	0.211	7809	0.741
Americas	2052	0.195	9861	0.935
Oceania	684	0.0649	10545	1

Option #1 is More Flexible!

tidyverse offers many other summary tools
Say we wanted to create the following table

continent	Count	Difference	CumDiff
Africa	2907		0
Asia	2679	-228	-228
Europe	2223	-456	-684
Americas	2052	-171	-855
Oceania	684	-1368	-2.22e+03

Step 1: Group + Summarize + Arrange

Count can be used to derive other columns

gapminder %>%
  group_by(continent) %>%
    summarise(Count = n()) %>%
      arrange(desc(Count))

continent	Count
Africa	2907
Asia	2679
Europe	2223
Americas	2052
Oceania	684

Step 2: Difference

lag(x,n) looks back n entries in x

lead(x,n) looks ahead

gapminder %>%
  group_by(continent) %>%
    summarise(Count = n()) %>%
      arrange(desc(Count)) %>% 
        mutate(Difference = Count - lag(Count,1))

continent	Count	Difference
Africa	2907
Asia	2679	-228
Europe	2223	-456
Americas	2052	-171
Oceania	684	-1368
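
To see what lag() and lead() return on their own, here is a small sketch on a made-up vector, assuming dplyr is installed. Positions with nothing to look back (or ahead) to are filled with NA, which is why the first Difference above is blank.

```r
library(dplyr)

x <- c(10, 20, 30, 40)

lag(x, 1)   # NA 10 20 30
lead(x, 1)  # 20 30 40 NA
lag(x, 2)   # NA NA 10 20

# Difference from the previous entry; the first entry has no predecessor
x - lag(x, 1)  # NA 10 10 10
```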

Step 3: Cumulative Difference

Uh oh. What happened?

gapminder %>%
  group_by(continent) %>%
    summarise(Count = n()) %>%
      arrange(desc(Count)) %>% 
        mutate(Difference = Count - lag(Count,1),
               CumDiff = cumsum(Difference))

continent	Count	Difference
Africa	2907
Asia	2679	-228
Europe	2223	-456
Americas	2052	-171
Oceania	684	-1368

Aside: `cumsum` function

Need to remove NA entries

No na.rm option!

gapminder %>%
  group_by(continent) %>%
    summarise(Count = n()) %>%
      arrange(desc(Count)) %>% 
        mutate(Difference = Count - lag(Count,1),
               CumDiff = cumsum(Difference, na.rm = TRUE))

## Error in `mutate()`:
## ℹ In argument: `CumDiff = cumsum(Difference, na.rm = TRUE)`.
## Caused by error in `cumsum()`:
## ! 2 arguments passed to 'cumsum' which requires 1

Step 3 (again): Try Using a Function!

na.omit to the rescue?

gapminder %>%
  group_by(continent) %>%
    summarise(Count = n()) %>%
      arrange(desc(Count)) %>% 
        mutate(Difference = Count - lag(Count,1),
               CumDiff = cumsum(na.omit(Difference)))

## Error in `mutate()`:
## ℹ In argument: `CumDiff = cumsum(na.omit(Difference))`.
## Caused by error:
## ! `CumDiff` must be size 5 or 1, not 4.

Step 3 (yet again): Using Logic!

Replace missing values with zero

gapminder %>%
  group_by(continent) %>%
    summarise(Count = n()) %>%
      arrange(desc(Count)) %>% 
        mutate(Difference = Count - lag(Count,1),
               CumDiff = cumsum(ifelse(is.na(Difference),0,Difference)))

continent	Count	Difference	CumDiff
Africa	2907		0
Asia	2679	-228	-228
Europe	2223	-456	-684
Americas	2052	-171	-855
Oceania	684	-1368	-2.22e+03
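
An alternative to the ifelse()/is.na() pair is dplyr's coalesce(), which replaces each NA with a fallback value. The sketch below is one possible variation, not the approach used in the slides; it rebuilds the sorted counts from the table above as a plain integer vector so it runs without the gapminder data.

```r
library(dplyr)

# Continent counts in descending order, copied from the table above
counts <- c(2907L, 2679L, 2223L, 2052L, 684L)

difference <- counts - lag(counts, 1)          # first entry is NA
cum_diff   <- cumsum(coalesce(difference, 0L)) # NA replaced with 0 before summing

difference  # NA -228 -456 -171 -1368
cum_diff    # 0 -228 -684 -855 -2223
```

The fallback is written as 0L so it has the same integer type as the differences.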

Frequency Table - Option #3

gapminder %>% select(continent) %>% table()

## continent
##   Africa Americas     Asia   Europe  Oceania 
##     2907     2052     2679     2223      684
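
Base R can also turn counts like these into relative and cumulative relative frequencies. The sketch below copies the counts from the output above into a named vector (so it runs without the gapminder data); the column names in the final data frame are illustrative only.

```r
# Continent counts copied from the table() output above
counts <- c(Africa = 2907, Americas = 2052, Asia = 2679,
            Europe = 2223, Oceania = 684)

rel_freq <- round(100 * prop.table(counts), 1)            # relative frequency (%)
cum_freq <- round(100 * cumsum(counts) / sum(counts), 1)  # cumulative relative frequency (%)

data.frame(Count = counts,
           RelFreq = rel_freq,
           CumRelFreq = cum_freq)
```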

Quick Recap

Many ways to create a frequency table

In general, many ways to program everything!

Choose best package/function and get it done!

tidyverse popular because
- consistent syntax and approach
- actively maintained

tidyverse not all-inclusive though, branch out!

On To Bivariate Statistics!

Two-way tables and beyond
Covariance
Correlation
- Pearson
- Spearman

Two-Way Tables

Add TFLcutoff via mutate
Use table with two arguments table(row,column)

g <- gapminder %>% mutate(TFLcutoff = ifelse(fertility <= 2.1, "At or below", "Exceeds"))
table(g$continent, g$TFLcutoff)

##           
##            At or below Exceeds
##   Africa            34    2822
##   Americas         326    1688
##   Asia             436    2196
##   Europe          1493     691
##   Oceania           64     608

Can add more variables to get three-way tables, etc.
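
A short sketch of what can be done with a table once it exists, using the built-in mtcars data as a stand-in (an assumption made so the example runs without dslabs). addmargins() and prop.table() are base R; ftable() prints a three-way table in a flat layout.

```r
# Two-way table: rows = cylinders, columns = transmission (0 = automatic, 1 = manual)
two_way <- table(mtcars$cyl, mtcars$am)

addmargins(two_way)                        # add row and column totals
round(prop.table(two_way, margin = 1), 2)  # proportions within each row

# Three-way table: pass a third variable; ftable() flattens it for printing
three_way <- table(mtcars$cyl, mtcars$am, mtcars$gear)
ftable(three_way)
```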

Covariance

gapminder[complete.cases(gapminder),] %>% 
  select(infant_mortality:population) %>% 
    cov()

##                  infant_mortality life_expectancy     fertility    population
## infant_mortality       2170.30577      -446.48580  7.892329e+01  8.829752e+06
## life_expectancy        -446.48580       108.41315 -1.734594e+01  9.496002e+06
## fertility                78.92329       -17.34594  4.020522e+00 -2.272713e+07
## population          8829751.70792   9496002.35247 -2.272713e+07  1.380468e+16

What is complete.cases for?
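
One way to answer the question is to try it on a tiny data frame with missing values (the data below are made up). complete.cases() flags rows with no NA in any column; without that filter, cov() with its default settings returns NA wherever a column containing NA is involved.

```r
df <- data.frame(x = c(1, 2, NA, 4),
                 y = c(2, 4, 6, NA))

keep <- complete.cases(df)
keep  # TRUE TRUE FALSE FALSE

cov(df)          # every entry is NA because both columns contain an NA
cov(df[keep, ])  # computed from the two complete rows only

# cov() can also drop incomplete rows itself
cov(df, use = "complete.obs")
```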

Pearson Correlation

gapminder[complete.cases(gapminder),] %>% 
  select(infant_mortality:population) %>% 
    cor() #Pearson is default method

##                  infant_mortality life_expectancy   fertility   population
## infant_mortality       1.00000000    -0.920462741  0.84489650  0.001613150
## life_expectancy       -0.92046274     1.000000000 -0.83083660  0.007762232
## fertility              0.84489650    -0.830836601  1.00000000 -0.096469514
## population             0.00161315     0.007762232 -0.09646951  1.000000000

Other Correlation Methods

gapminder[complete.cases(gapminder),] %>% 
  select(infant_mortality:population) %>% 
    cor(method = "spearman")

##                  infant_mortality life_expectancy   fertility  population
## infant_mortality       1.00000000     -0.93578029  0.88870740  0.01592765
## life_expectancy       -0.93578029      1.00000000 -0.83688091  0.04472858
## fertility              0.88870740     -0.83688091  1.00000000 -0.09106856
## population             0.01592765      0.04472858 -0.09106856  1.00000000
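
Spearman correlation is the Pearson correlation computed on the ranks of the data, which is why it is less sensitive to skewed variables such as population. A minimal check of that relationship, using two columns of the built-in mtcars data as stand-ins:

```r
x <- mtcars$mpg
y <- mtcars$wt

spearman         <- cor(x, y, method = "spearman")
pearson_on_ranks <- cor(rank(x), rank(y))  # Pearson is the default method

all.equal(spearman, pearson_on_ranks)  # TRUE
```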

P-values!

What kind of course would this be if we didn’t see at least one p-value?!

#install.packages("Hmisc")
library(Hmisc)

This gives access to rcorr function which computes correlations and p-values

Example of `rcorr`

gapminder[complete.cases(gapminder),] %>% 
  select(infant_mortality:population) %>% as.matrix() %>% rcorr()

##                  infant_mortality life_expectancy fertility population
## infant_mortality             1.00           -0.92      0.84       0.00
## life_expectancy             -0.92            1.00     -0.83       0.01
## fertility                    0.84           -0.83      1.00      -0.10
## population                   0.00            0.01     -0.10       1.00
## 
## n= 7139 
## 
## 
## P
##                  infant_mortality life_expectancy fertility population
## infant_mortality                  0.0000          0.0000    0.8916    
## life_expectancy  0.0000                           0.0000    0.5120    
## fertility        0.0000           0.0000                    0.0000    
## population       0.8916           0.5120          0.0000

Recap

Analyses depend on data type: numeric vs categorical
Select (or create!) function to get summary statistic of interest
As with most things, multiple ways to get frequencies/contingency tables, numeric summaries, etc.

Activity #3

Use the built-in data frame, iris
Summarize the categorical variable, Species
Compute the five number summary (min, Q1, median, Q3, max) for each of the length variables
Compute the mean and standard deviation for the width variables
Determine the Spearman correlation matrix for the four numeric variables
Advanced #1: Get a two-way table with Species as the column variable and with rows based on whether Sepal.Length<=5.8
Advanced #2: Find all observations that have a Sepal.Length that is more than two standard deviations away from its mean.

Graphics via `ggplot2`

ggplot2 often associated with tidyverse
- Alternative to Base R or lattice plotting
- Good entry point for R graphics
- Cheatsheet (PDF)
- Chapter 7 of this online book compares systems
ggplot2 works in layers
- Use ggplot(data = data_frame) to prepare a plot area (canvas)
- Add layers using additional code
- Final graph is produced by displaying all layers

`ggplot2` Layers

ggplot(data = data_frame) prepares the canvas
Layer examples:
- geoms are geometric objects like bars, histograms, lines, points, text
- labs controls graph and axis labels
Apply/modify layer settings by:
- using aes to control aesthetics within a layer
- using additional functions to change how a previous layer is rendered
Use facet to build panel of graphs

Goal: Build a Bar Chart

How do we build the following plot?

Step 1: Canvas

ggplot(data = gapminder)

Step 2: Adding `aes`

ggplot(data = gapminder, aes(x = continent))

Step 3: Add `geom_bar` Layer

ggplot(data = gapminder, aes(x = continent)) + 
  geom_bar()

Aside

Default stat is count, i.e., geom_bar(stat = "count")
- Often want other stats though
- I find it easiest to pre-summarize

If you’ve pre-summarized to find, say, IQR
- aes(x = continent, y = IQR)
- geom_bar(stat = "identity")

If you plan to make multiple plots, you can save layers
- g <- ggplot(data = gapminder) saves just the canvas

Step 4: Add Labels

ggplot(data = gapminder, aes(x = continent)) + 
  geom_bar() + 
  labs(x = "Continent", y = "Frequency", title = "Absolute Frequency of Responses in Gapminder Data")

Step 5: Adjust Y-Axis Format
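
The code for this step was not captured here; judging from the later slides, it presumably adds scale_y_continuous(labels = comma) from the scales package. The sketch below makes that assumption and rebuilds a stand-in data frame (the name demo is hypothetical) from the continent counts shown earlier so it runs without dslabs; with the course data, gapminder would take its place.

```r
library(ggplot2)
library(scales)  # provides comma()

# Stand-in data rebuilt from the continent counts shown earlier
continent_counts <- c(Africa = 2907, Americas = 2052, Asia = 2679,
                      Europe = 2223, Oceania = 684)
demo <- data.frame(continent = rep(names(continent_counts),
                                   times = continent_counts))

p <- ggplot(data = demo, aes(x = continent)) +
  geom_bar() +
  labs(x = "Continent", y = "Frequency",
       title = "Absolute Frequency of Responses in Gapminder Data") +
  scale_y_continuous(labels = comma)  # y-axis labels such as 2,000 instead of 2000

p
```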

Recap

ggplot for canvas
geom_bar for graph of choice
labs for labels
scale_*_* to adjust axis elements
- scale_y_continuous makes changes to a y axis with continuous variable

Building a Grouped Bar Chart

How do we update our previous bar chart to look like this?

Step 1: Remove Records with `fertility = NA`

g2 <- gapminder %>% subset(is.na(fertility) == FALSE)

Step 2: Add `fill` to `aes`

g2 <- gapminder %>% subset(is.na(fertility) == FALSE)
ggplot(data = g2, aes(x = continent, fill = (fertility<2.1))) + 
  geom_bar() +
  labs(x = "Continent", y = "Frequency", 
       title = "Absolute Frequency of Responses in Gapminder Data") + 
  scale_y_continuous(labels = comma)

Step 3: Stacked to Side-by-Side via `position=`

g2 <- gapminder %>% subset(is.na(fertility) == FALSE)
ggplot(data = g2, aes(x = continent, fill = (fertility<2.1))) + 
  geom_bar(position = "dodge") +
  labs(x = "Continent", y = "Frequency", 
       title = "Absolute Frequency of Responses in Gapminder Data") + 
  scale_y_continuous(labels = comma)

Step 4: Add Subtitle and Fix Legend

g2 <- gapminder %>% subset(is.na(fertility) == FALSE)
ggplot(data = g2, aes(x = continent, fill = (fertility<2.1))) + 
  geom_bar(position = "dodge") +
  labs(x = "Continent", y = "Frequency", 
       title = "Absolute Frequency of Responses in Gapminder Data", 
       subtitle = "Grouped by Total Fertility Rate < 2.1") + 
  scale_y_continuous(labels = comma) + 
  scale_fill_discrete(name = "TFR<2.1", labels = c("No", "Yes"))

`facet`ing Your Graph

g2 <- gapminder %>% subset(is.na(fertility) == FALSE & year > 2011)
ggplot(data = g2, aes(x = continent, fill = (fertility<2.1))) + 
  geom_bar(position = "dodge") +
  labs(x = "Continent", y = "Frequency", 
       title = "Absolute Frequency of Responses in Gapminder Data", 
       subtitle = "Grouped by Total Fertility Rate < 2.1") + 
  scale_y_continuous(labels = comma) + 
  scale_fill_discrete(name = "TFR<2.1", labels = c("No", "Yes")) + 
  facet_wrap(~year)

Syntax of `facet`
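
A sketch of the two facet functions, using the built-in mtcars data as a stand-in (an assumption; the object names are only for illustration). facet_wrap takes one formula side and wraps the panels; facet_grid takes rows ~ columns.

```r
library(ggplot2)

base <- ggplot(mtcars, aes(x = factor(cyl))) +
  geom_bar()

# facet_wrap(~ variable): one panel per level, wrapped into rows as needed
by_gear <- base + facet_wrap(~ gear)

# facet_grid(rows ~ columns): a full grid, one row per am, one column per gear
am_by_gear <- base + facet_grid(am ~ gear)

# A dot means "nothing on this side": columns only
cols_only <- base + facet_grid(. ~ gear)

by_gear
am_by_gear
```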

Using `facet_grid`

Note: I do not advocate this as a good graph!

g3 <- gapminder %>% subset(is.na(fertility) == FALSE & year > 2013)
ggplot(data = g3, aes(x = continent, fill = (fertility<2.1))) + 
  geom_bar(position = "fill") +
  facet_grid((population > 1000000)~year)

Graphing Continuous Data

Several common univariate graphs
- Histogram
- Density
- Boxplot
New functions, but same approach!

Creating a Histogram

c <- ggplot(data = gapminder, aes(x = fertility))
c + geom_histogram(na.rm = TRUE)

Modifications

I think R histogram defaults are ugly! (Even if you like them, often need to change some settings!)
- binwidth: width of bins
- fill: fill color
- color: outline color
- linetype: dotted, dashed, etc.
- linewidth: thickness of the outline (called size before ggplot2 3.4.0)
- alpha: transparency of fill

Other options available for even more customizations!

Modifying our Histogram

Please never make a graph that looks like this.

c + geom_histogram(color = "blue", linetype = 2, linewidth = 0.75,
                   fill = "#FF0000", alpha = 0.5, 
                   binwidth = 0.25, na.rm = TRUE)

Adding Densities

Densities added with geom_density

Use kernel= option to select density type

Some new options, but mostly same as geom_histogram
- adjust controls how much smoothing

Kernel smoother just smooths out the boxes of a histogram

How exactly they smooth is beyond our scope for today!

First, Just the Density

ggplot(gapminder, aes(x=fertility)) + 
  geom_density(kernel = "gaussian", na.rm = TRUE)

Testing out `adjust`

ggplot(gapminder, aes(x=fertility)) + 
  geom_density(kernel = "gaussian", 
               adjust = .1, 
               na.rm = TRUE)

ggplot(gapminder, aes(x=fertility)) + 
  geom_density(kernel = "gaussian", 
               adjust = 10, 
               na.rm = TRUE)
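
What adjust does can be seen numerically with base R's density(), where adjust is a multiplier on the default bandwidth. This is a sketch on the built-in faithful data (an assumption, chosen so it runs without the course data); the object names are only for illustration.

```r
x <- faithful$eruptions

d_default <- density(x, kernel = "gaussian")               # adjust = 1
d_wiggly  <- density(x, kernel = "gaussian", adjust = 0.1) # narrow bandwidth
d_smooth  <- density(x, kernel = "gaussian", adjust = 10)  # wide bandwidth

# The bandwidth actually used is the default bandwidth times adjust
d_wiggly$bw / d_default$bw   # 0.1
d_smooth$bw / d_default$bw   # 10
```

Small values follow every bump in the data; large values flatten almost everything.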

Histogram + (New) Density

ggplot(gapminder, aes(x=fertility)) + 
  geom_histogram(aes(y = after_stat(density)), alpha = 0.5, 
                   binwidth = 0.25, na.rm = TRUE) + 
  geom_density(kernel = "triangular", linewidth = 1.25, color = "red", na.rm = TRUE)

Grouping with Histograms

Default is position = "stack" which is … problematic

g4 <- gapminder %>% subset(is.na(fertility)==FALSE & year > 2013)
ggplot(g4, aes(x=fertility, fill=factor(year))) + 
  geom_histogram(alpha=.5)
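
One possible alternative is position = "identity", which overlays the groups instead of stacking them. This is a sketch on the built-in iris data (an assumption, so it runs on its own); p_stack and p_overlay are names used only for illustration.

```r
library(ggplot2)

# Default: bars for each group are stacked on top of each other
p_stack <- ggplot(iris, aes(x = Sepal.Length, fill = Species)) +
  geom_histogram(alpha = 0.5, binwidth = 0.25)

# Overlay: every group starts at zero, so heights are counts for that group
p_overlay <- ggplot(iris, aes(x = Sepal.Length, fill = Species)) +
  geom_histogram(alpha = 0.5, binwidth = 0.25, position = "identity")

p_stack
p_overlay
```

With overlays, alpha matters: without transparency the front group hides the ones behind it.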

Grouping with Densities

No more stacking!

Heights make sense, but patterns hard to discern

g4 <- gapminder %>% subset(is.na(fertility)==FALSE & year > 2013)
ggplot(g4, aes(x=fertility, fill=factor(year))) + 
  geom_density(alpha=.5)

Grouping with Overlain Histograms and Densities

Oh no. Histogram uses "stack" but density uses "identity"

g4 <- gapminder %>% subset(is.na(fertility)==FALSE & year > 2013)
ggplot(g4, aes(x=fertility, fill=as.factor(year))) + 
  geom_histogram(aes(y = after_stat(density)), bins = 50) + 
  geom_density(alpha=.5)

“Proper” Grouping

Match the position across layers

g4 <- gapminder %>% subset(is.na(fertility)==FALSE & year > 2013)
ggplot(g4, aes(x=fertility, fill=factor(year))) + 
  geom_histogram(aes(y = after_stat(density)), bins = 50) + 
  geom_density(alpha=.5, position = "stack")

Box Plot

Box plot fairly straightforward
- Options like alpha and fill as well as lower, middle, and upper

g5 <- gapminder %>% subset(is.na(fertility)==FALSE & year > 2009)
c <- ggplot(g5, aes(x=factor(year), y = fertility))
c + geom_boxplot(fill="grey") + 
      labs(x="Year")

Box Plot + Points

Overlay points with geom_jitter by adding this layer second

g5 <- gapminder %>% subset(is.na(fertility)==FALSE & year > 2009)
c <- ggplot(g5, aes(x=factor(year), y = fertility))
c + geom_boxplot(fill="grey") + 
      geom_jitter(width=.15, color = "blue") + labs(x="Year")
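
A sketch of why layer order matters here, using the built-in iris data (an assumption; p and layer_order are names used only for illustration). Layers are drawn in the order they are added, so the points land on top of the boxes. Two optional tweaks are shown: outlier.shape = NA stops the box plot from drawing outliers that the jittered points already show, and height = 0 keeps the jitter horizontal only.

```r
library(ggplot2)

p <- ggplot(iris, aes(x = Species, y = Sepal.Length)) +
  geom_boxplot(fill = "grey", outlier.shape = NA) +        # added first: drawn underneath
  geom_jitter(width = 0.15, height = 0, color = "blue") +  # added second: drawn on top
  labs(x = "Species")

# Layers are stored in the order they were added
layer_order <- sapply(p$layers, function(l) class(l$geom)[1])
layer_order  # "GeomBoxplot" "GeomPoint"

p
```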

Scatter

Great for joint inspection of numeric variables

g6 <- gapminder %>% subset(is.na(fertility) == FALSE & 
                           is.na(infant_mortality) == FALSE &
                           year > 2012)
ggplot(g6, aes(x=fertility, y=infant_mortality)) + 
  geom_point(color="blue", shape = 3, size=2, stroke = .5)

Grouping: Code

Use aes to set data-driven aesthetics

color and shape now cycle based on data

ggplot(g6, aes(x=fertility, y=infant_mortality)) + 
  geom_point(size=2, stroke = .5, aes(shape = factor(year), color = continent)) + 
  labs(x="Total Fertility Rate", y = "Infant Mortality Rate") + 
  scale_shape_discrete(name = "Year", labels = c("'13", "'14", "'15")) + 
  scale_color_discrete(name = "Continent")

Scatter + Trend: Code

Choosing the right model is stats, not programming! :D
Use col to name the models for easy referencing!
- Careful – those names are the default labels!

ggplot(g6, aes(x = fertility, y = infant_mortality)) +
  geom_point() +  
  geom_smooth(aes(col = "loess")) +
  geom_smooth(method = lm, aes(col = "bob")) + 
  scale_colour_manual(name = 'Smoother', 
                      values =c('bob'='red', 'loess'='purple'), 
                      labels = c('Linear','LOESS'), guide = 'legend')

Recap

ggplot2 just one way to do graphics
Works by adding layers and changing aesthetics
- Order of layers very important!
General plan: canvas + graph + labels/legends/etc.
Graphs can be quite tedious, so plan ahead!

Activity #4

Produce the following three graphs
- Graph 1 uses the starwars data set from the tidyverse
- Graphs 2 and 3 use the iris data set from base R

Activity 4 - Graph 2

Activity 4 - Graph 3

The End

Thank you!
Questions?
