Daily agenda:
- 9:30 - 10:35 Morning Session #1
- 10:35 - 10:50 Morning Break
- 10:50 - 11:55 Morning Session #2
- 11:55 - 1:10 Lunch
- 1:10 - 2:15 Afternoon Session #1
- 2:15 - 2:30 Afternoon Break
- 2:30 - 3:35 Afternoon Session #2
- 3:35 - 3:40 Feedback/Q&A
August 9, 2023
Daily agenda:
R Markdown Basics
R Markdown Options: Part 1
R Markdown Options: Part 2
R Markdown with Numerical Summaries
R Markdown with Graphics
Prepare data
Explore data
Documentation
Usability/reproducibility
Often working for multiple clients/on multiple projects at one time
Takes time to manage data/programs/analyses/results separately
Project in RStudio: way to divide work into multiple silos. Each with own:
Working directory
Workspace
History
Source documents
Can save workspace, etc. and pick up right where you left off!
Let’s create one for today!
(Traditional) Basic Use of R
Write a script with all code
Execute code via console
Outside R: present results
Notebook tools allow
As scientist/consultant I tend to interface with three “groups”
---
title: "Untitled"
author: "Jonathan W. Duggins"
date: "August 10, 2022"
output: html_document
---
### R Markdown

This is an R Markdown document. Markdown is a simple formatting syntax for authoring HTML, PDF, and MS Word documents. For more details on using R Markdown see [RStudio](http://rmarkdown.rstudio.com) at <http://rmarkdown.rstudio.com>.

When you click the **Knit** button a document will be generated that includes _both_ content as well as the output of any embedded R code chunks within the document.
On execution:
- `###` becomes a level 3 header
- `[](...)` becomes a hidden link and `<...>` becomes a visible link
- `**Knit**` or `__Knit__` is in bold
- `*both*` or `_both_` is in italics

This is an R Markdown document. Markdown is a simple formatting syntax for authoring HTML, PDF, and MS Word documents. For more details on using R Markdown see RStudio at http://rmarkdown.rstudio.com.
When you click the Knit button a document will be generated that includes both content as well as the output of any embedded R code chunks within the document.
- Plain text
- End a line with two spaces to start a new line
  - Line breaks are not always added when you return!
  - Use two spaces and a return!
  - Can specify `<br>` to get line break like HTML
- `*italics*` and `_italics_`
- `**bold**` and `__bold__`
- `superscript^2^` becomes superscript²
- `~~strikethrough~~` becomes ~~strikethrough~~
- `[link](https://www.rstudio.com/)` becomes [link](https://www.rstudio.com/)
- `# Header 1` becomes a large font header
- `## Header 2` is slightly smaller header (up to 6 levels!)
- Backticks around text give inline `code`
* unordered list
* item 2
    + sub-item 1
    + sub-item 2

1. ordered list
1. item 2
    + sub-item 1
    + sub-item 2
unordered list
item 2
ordered list
item 2
<div style = "float: left; width: 50%">
- unordered list
</div>
<div style = "float: right; width: 50%">
1. ordered list
</div>
Table Header | Second Header | Col 3
------------- | ------------- | -----------
Table Cell | Cell (1, 2) | Cell (1, 3)
Cell (2, 1) | Cell (2, 2) | Cell (2, 3)
Table Header | Second Header | Col 3 |
---|---|---|
Table Cell | Cell (1, 2) | Cell (1, 3) |
Cell (2, 1) | Cell (2, 2) | Cell (2, 3) |
Open the `Activity1StarterCode.Rmd` file from our resources
Use the R Markdown syntax we’ve learned so far to edit the document so that it produces the output shown in `EdaActivity1.html`
R Markdown Basics
R Markdown Options: Part 1
R Markdown Options: Part 2
R Markdown with Numerical Summaries
R Markdown with Graphics
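We’ve already seen how to include an R code chunk:
- Add a chunk with Ctrl/Cmd + Alt + I or by typing the chunk delimiters
- Inline code: `` `r length(iris$Sepal.Length)` observations ``
- Run a chunk with Cmd/Ctrl + Shift + Enter or with “Run”
- Run all chunks with Ctrl/Cmd + Alt + R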
- `echo = FALSE/TRUE` hides/shows the code
- `eval = TRUE/FALSE` controls whether the code is run
- `include = FALSE` runs the code but hides both the code and its output (so it goes further than `echo = FALSE, eval = TRUE`)
- `message = TRUE/FALSE` and `warning = TRUE/FALSE` can turn on/off displaying messages/warnings
- `error = TRUE` allows file to be created with code that has an error
```{r setup, include=FALSE}
knitr::opts_chunk$set(echo = FALSE)
library(tidyverse)
library(huxtable)
```
In a large analysis it may take a long time to run code chunks/knit your document
- Set `cache = TRUE` in code chunk definition
Adding images in markdown: `![](path/to/file)`
- `knitr` package has `include_graphics` function
Adding Equations
- `$A = \pi*r^{2}$` becomes \(A = \pi*r^{2}\)
- `$$A = \pi*r^{2}$$` becomes \[A = \pi*r^{2}\]
huxtable
Data tables generally look a little underwhelming for presentations…
summary(cars)
##      speed           dist
##  Min.   : 4.0   Min.   :  2.00
##  1st Qu.:12.0   1st Qu.: 26.00
##  Median :15.0   Median : 36.00
##  Mean   :15.4   Mean   : 42.98
##  3rd Qu.:19.0   3rd Qu.: 56.00
##  Max.   :25.0   Max.   :120.00
huxtable
But look better with help from packages like `kableExtra` or `huxtable`!
as_hux(summary(cars), add_rownames = FALSE) %>%
  set_bold(row = 1, col = everywhere) %>%
  set_font_size(row = 1, , 24) %>%
  set_all_padding(1) %>%
  theme_striped()
speed | dist |
---|---|
Min. : 4.0 | Min. : 2.00 |
1st Qu.:12.0 | 1st Qu.: 26.00 |
Median :15.0 | Median : 36.00 |
Mean :15.4 | Mean : 42.98 |
3rd Qu.:19.0 | 3rd Qu.: 56.00 |
Max. :25.0 | Max. :120.00 |
output: html_document
rmarkdown::render("file.Rmd", output_format = "word_document")
For HTML, you can include Table of Contents with options
output:
  html_document:
    toc: true
    toc_float: true
For `html_document`s, another option is to make the code chunks hidden by default, but visible with a click:
output:
  html_document:
    code_folding: hide
Word
output: word_document
PDF
output: pdf_document
Presentations/Slides (new slides with `##`)
- `output: ioslides_presentation` - HTML presentation
- `output: slidy_presentation` - HTML presentation
- `output: beamer_presentation` - PDF presentation with LaTeX Beamer
- `output: powerpoint_presentation` - PowerPoint
HTML documents inherently interactive
library(leaflet)
leaflet() %>%
  setView(-87.6553, 41.9485, zoom = 16) %>%
  addTiles() %>%
  addMarkers(-87.6553, 41.9485, popup = "Wrigley!")
Interactive tables with `DT` library
library(DT)
datatable(ToothGrowth)
`rthreejs` package
Previous interactivity happened in the browser
Open Activity2StarterCode.Rmd
Modify it to produce output you see in EdaActivity2.html
R Markdown Basics
R Markdown Options: Part 1
R Markdown Options: Part 2
R Markdown with Numerical Summaries
R Markdown with Graphics
#install.packages(c("dslabs", "pander"))
library(dslabs)
library(pander)
glimpse(gapminder)
## Rows: 10,545
## Columns: 9
## $ country          <fct> "Albania", "Algeria", "Angola", "Antigua and Barbuda"…
## $ year             <int> 1960, 1960, 1960, 1960, 1960, 1960, 1960, 1960, 1960,…
## $ infant_mortality <dbl> 115.40, 148.20, 208.00, NA, 59.87, NA, NA, 20.30, 37.…
## $ life_expectancy  <dbl> 62.87, 47.50, 35.98, 62.97, 65.39, 66.86, 65.66, 70.8…
## $ fertility        <dbl> 6.19, 7.65, 7.32, 4.43, 3.11, 4.55, 4.82, 3.45, 2.70,…
## $ population       <dbl> 1636054, 11124892, 5270844, 54681, 20619075, 1867396,…
## $ gdp              <dbl> NA, 13828152297, NA, NA, 108322326649, NA, NA, 966778…
## $ continent        <fct> Europe, Africa, Africa, Americas, Americas, Asia, Ame…
## $ region           <fct> Southern Europe, Northern Africa, Middle Africa, Cari…
gapminder %>%
  select(infant_mortality) %>%
  summarise(var = "Infant Mortality",
            Avg = mean(infant_mortality, na.rm = TRUE),
            SD = sd(infant_mortality, na.rm = TRUE),
            Med = median(infant_mortality, na.rm = TRUE),
            IQR = IQR(infant_mortality, na.rm = TRUE),
            mid90 = quantile(infant_mortality, 0.95, na.rm = TRUE) -
              quantile(infant_mortality, 0.05, na.rm = TRUE)) %>%
  pander()
var | Avg | SD | Med | IQR | mid90 |
---|---|---|---|---|---|
Infant Mortality | 55.31 | 47.73 | 41.5 | 69.1 | 143 |
var | Avg | SD | Med | IQR | mid90 |
---|---|---|---|---|---|
Life Expectancy | 64.81 | 10.67 | 67.54 | 15.5 | 34.05 |
var | Avg | SD | Med | IQR | mid90 |
---|---|---|---|---|---|
Infant Mortality | 55.31 | 47.73 | 41.5 | 69.1 | 143 |
Life Expectancy | 64.81 | 10.67 | 67.54 | 15.5 | 34.05 |
Fertility | 4.084 | 2.027 | 3.75 | 3.8 | 5.811 |
Combine multiple analyses using:
`var`, `IQR`, `quantile`, etc.
gapminder %>%
  select(continent, infant_mortality) %>%
  group_by(continent) %>%
  summarise(Avg = mean(infant_mortality, na.rm = TRUE),
            SD = sd(infant_mortality, na.rm = TRUE),
            Med = median(infant_mortality, na.rm = TRUE),
            IQR = IQR(infant_mortality, na.rm = TRUE),
            Mid90 = quantile(infant_mortality, .95, na.rm = TRUE) -
              quantile(infant_mortality, .05, na.rm = TRUE))
continent | Avg | SD | Med | IQR | Mid90 |
---|---|---|---|---|---|
Africa | 95.1 | 43.9 | 93.4 | 62.5 | 148 |
Americas | 42.9 | 34.6 | 30.8 | 39.5 | 110 |
Asia | 55.3 | 46.9 | 43.1 | 59 | 146 |
Europe | 15.3 | 14.2 | 11.2 | 13.7 | 38.5 |
Oceania | 39.1 | 29.1 | 29.1 | 35.9 | 94.7 |
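The same five summaries can be wrapped in a small function and applied to several variables at once. This is a minimal base R sketch, not the slides' approach: `num_summary` is a hypothetical helper name, and the built-in `cars` data stands in for `gapminder` so the block runs without extra packages.

```r
# Hypothetical helper: the five summaries used above, for one numeric vector.
num_summary <- function(x) {
  q <- quantile(x, c(0.05, 0.95), na.rm = TRUE, names = FALSE)
  c(Avg   = mean(x, na.rm = TRUE),
    SD    = sd(x, na.rm = TRUE),
    Med   = median(x, na.rm = TRUE),
    IQR   = IQR(x, na.rm = TRUE),
    mid90 = q[2] - q[1])          # width of the middle 90% of the data
}

# sapply() runs the helper on every column; t() puts one variable per row.
cars_summary <- t(sapply(cars, num_summary))
round(cars_summary, 2)
```

Each row of `cars_summary` plays the role of one `summarise()` result above, so no separate step is needed to stack the per-variable tables.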
How can we create the following?
continent | Rank | Count | Relative Frequency | Cumulative Relative Frequency |
---|---|---|---|---|
Africa | 1 | 2,907 | 27.6 | 27.6 |
Americas | 4 | 2,052 | 19.5 | 47 |
Asia | 2 | 2,679 | 25.4 | 72.4 |
Europe | 3 | 2,223 | 21.1 | 93.5 |
Oceania | 5 | 684 | 6.5 | 100 |
gapminder %>% group_by(continent) %>% summarise(Count = n())
continent | Count |
---|---|
Africa | 2907 |
Americas | 2052 |
Asia | 2679 |
Europe | 2223 |
Oceania | 684 |
gapminder %>%
  group_by(continent) %>%
  summarise(Count = n()) %>%
  mutate(.rf = Count/sum(Count),  #relative freq
         .crf = cumsum(.rf))      #cumulative relative freq
continent | Count | .rf | .crf |
---|---|---|---|
Africa | 2907 | 0.276 | 0.276 |
Americas | 2052 | 0.195 | 0.47 |
Asia | 2679 | 0.254 | 0.724 |
Europe | 2223 | 0.211 | 0.935 |
Oceania | 684 | 0.0649 | 1 |
gapminder %>% group_by(continent) %>% summarise(Count = n()) %>% mutate(.rf = Count/sum(Count), .crf = cumsum(.rf), Rank = row_number(desc(.rf))) #row rank, descending order
continent | Count | .rf | .crf | Rank |
---|---|---|---|---|
Africa | 2907 | 0.276 | 0.276 | 1 |
Americas | 2052 | 0.195 | 0.47 | 4 |
Asia | 2679 | 0.254 | 0.724 | 2 |
Europe | 2223 | 0.211 | 0.935 | 3 |
Oceania | 684 | 0.0649 | 1 | 5 |
#install.packages("scales")
library(scales)  #provides comma()
gapminder %>%
  group_by(continent) %>%
  summarise(Count = n()) %>%
  mutate(.rf = Count/sum(Count),
         .crf = cumsum(.rf),
         Rank = row_number(desc(.rf))) %>%
  mutate(Count = comma(Count),                                      #format with commas
         `Relative Frequency` = round(100*.rf, 1),                  #note the names!
         `Cumulative Relative Frequency` = round(100*.crf, 1))
continent | Count | .rf | .crf | Rank | Relative Frequency | Cumulative Relative Frequency |
---|---|---|---|---|---|---|
Africa | 2,907 | 0.276 | 0.276 | 1 | 27.6 | 27.6 |
Americas | 2,052 | 0.195 | 0.47 | 4 | 19.5 | 47 |
Asia | 2,679 | 0.254 | 0.724 | 2 | 25.4 | 72.4 |
Europe | 2,223 | 0.211 | 0.935 | 3 | 21.1 | 93.5 |
Oceania | 684 | 0.0649 | 1 | 5 | 6.5 | 100 |
#install.packages("scales")
library(scales)  #provides comma()
gapminder %>%
  group_by(continent) %>%
  summarise(Count = n()) %>%
  mutate(.rf = Count/sum(Count),
         .crf = cumsum(.rf),
         Rank = row_number(desc(.rf))) %>%
  mutate(Count = comma(Count),
         `Relative Frequency` = round(100*.rf, 1),
         `Cumulative Relative Frequency` = round(100*.crf, 1)) %>%
  select(continent, Rank, !(starts_with(".")))
continent | Rank | Count | Relative Frequency | Cumulative Relative Frequency |
---|---|---|---|---|
Africa | 1 | 2,907 | 27.6 | 27.6 |
Americas | 4 | 2,052 | 19.5 | 47 |
Asia | 2 | 2,679 | 25.4 | 72.4 |
Europe | 3 | 2,223 | 21.1 | 93.5 |
Oceania | 5 | 684 | 6.5 | 100 |
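For comparison, the same count, relative frequency, cumulative relative frequency, and rank columns can be built in base R. This is only a sketch under assumptions: it uses the built-in `mtcars` data (cylinder counts) rather than `gapminder`, and the column names simply mirror the ones used above.

```r
# Counts per category (number of cars for each cylinder count).
counts <- table(mtcars$cyl)

freq <- data.frame(cyl = names(counts), Count = as.vector(counts))
freq$rf   <- freq$Count / sum(freq$Count)               # relative frequency
freq$crf  <- cumsum(freq$rf)                            # cumulative relative frequency
freq$Rank <- rank(-freq$Count, ties.method = "first")   # 1 = most frequent

# Presentation columns, then drop the helper columns.
freq$`Relative Frequency` <- round(100 * freq$rf, 1)
freq$`Cumulative Relative Frequency` <- round(100 * freq$crf, 1)
freq[, c("cyl", "Rank", "Count", "Relative Frequency", "Cumulative Relative Frequency")]
```

`rank()` with `ties.method = "first"` on the negated counts behaves like `row_number(desc(...))` in the pipeline above.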
#install.packages("clean")
library(clean)  #provides freq()
gapminder %>%
  select(continent) %>%
  freq()
item | count | percent | cum_count | cum_percent |
---|---|---|---|---|
Africa | 2907 | 0.276 | 2907 | 0.276 |
Asia | 2679 | 0.254 | 5586 | 0.53 |
Europe | 2223 | 0.211 | 7809 | 0.741 |
Americas | 2052 | 0.195 | 9861 | 0.935 |
Oceania | 684 | 0.0649 | 10545 | 1 |
`tidyverse` offers many other summary tools
Say we wanted to create the following table
continent | Count | Difference | CumDiff |
---|---|---|---|
Africa | 2907 | | 0 |
Asia | 2679 | -228 | -228 |
Europe | 2223 | -456 | -684 |
Americas | 2052 | -171 | -855 |
Oceania | 684 | -1368 | -2.22e+03 |
`Count` can be used to derive other columns
gapminder %>%
  group_by(continent) %>%
  summarise(Count = n()) %>%
  arrange(desc(Count))
continent | Count |
---|---|
Africa | 2907 |
Asia | 2679 |
Europe | 2223 |
Americas | 2052 |
Oceania | 684 |
- `lag(x,n)` looks back `n` entries in `x`
- `lead(x,n)` looks ahead
gapminder %>%
  group_by(continent) %>%
  summarise(Count = n()) %>%
  arrange(desc(Count)) %>%
  mutate(Difference = Count - lag(Count, 1))
continent | Count | Difference |
---|---|---|
Africa | 2907 | |
Asia | 2679 | -228 |
Europe | 2223 | -456 |
Americas | 2052 | -171 |
Oceania | 684 | -1368 |
gapminder %>% group_by(continent) %>% summarise(Count = n()) %>% arrange(desc(Count)) %>% mutate(Difference = Count - lag(Count,1), CumDiff = cumsum(Difference))
continent | Count | Difference | CumDiff |
---|---|---|---|
Africa | 2907 | ||
Asia | 2679 | -228 | |
Europe | 2223 | -456 | |
Americas | 2052 | -171 | |
Oceania | 684 | -1368 |
`cumsum` function propagates `NA` entries
Try the `na.rm` option!
gapminder %>%
  group_by(continent) %>%
  summarise(Count = n()) %>%
  arrange(desc(Count)) %>%
  mutate(Difference = Count - lag(Count, 1),
         CumDiff = cumsum(Difference, na.rm = TRUE))
## Error in `mutate()`:
## ℹ In argument: `CumDiff = cumsum(Difference, na.rm = TRUE)`.
## Caused by error in `cumsum()`:
## ! 2 arguments passed to 'cumsum' which requires 1
`na.omit` to the rescue?
gapminder %>%
  group_by(continent) %>%
  summarise(Count = n()) %>%
  arrange(desc(Count)) %>%
  mutate(Difference = Count - lag(Count, 1),
         CumDiff = cumsum(na.omit(Difference)))
## Error in `mutate()`:
## ℹ In argument: `CumDiff = cumsum(na.omit(Difference))`.
## Caused by error:
## ! `CumDiff` must be size 5 or 1, not 4.
gapminder %>% group_by(continent) %>% summarise(Count = n()) %>% arrange(desc(Count)) %>% mutate(Difference = Count - lag(Count,1), CumDiff = cumsum(ifelse(is.na(Difference),0,Difference)))
continent | Count | Difference | CumDiff |
---|---|---|---|
Africa | 2907 | | 0 |
Asia | 2679 | -228 | -228 |
Europe | 2223 | -456 | -684 |
Americas | 2052 | -171 | -855 |
Oceania | 684 | -1368 | -2.22e+03 |
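The `NA` problem can be reproduced on a plain vector, which makes it easier to see why the `ifelse()` replacement works. A minimal base R sketch using the continent counts from the table above; `c(NA, diff(count))` is used here as a stand-in for `Count - lag(Count, 1)`.

```r
# Counts in descending order, copied from the table above.
count <- c(2907, 2679, 2223, 2052, 684)

# Difference from the previous entry; the first entry has no predecessor.
difference <- c(NA, diff(count))

# cumsum() has no na.rm argument, and a single NA makes every later sum NA.
cumsum(difference)

# Replacing NA by 0 keeps the length at 5 and gives the running total.
cum_diff <- cumsum(ifelse(is.na(difference), 0, difference))
cum_diff
```

The last value is -2223, which the formatted table above displays as `-2.22e+03`.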
gapminder %>% select(continent) %>% table()
## continent
##   Africa Americas     Asia   Europe  Oceania
##     2907     2052     2679     2223      684
`tidyverse` popular because
`tidyverse` not all-inclusive though, branch out!
Two-way tables and beyond
Covariance
Correlation
- Add `TFLcutoff` via `mutate`
- Use `table` with two arguments: `table(row, column)`
g <- gapminder %>%
  mutate(TFLcutoff = ifelse(fertility <= 2.1, "At or below", "Exceeds"))
table(g$continent, g$TFLcutoff)
##
##            At or below Exceeds
##   Africa            34    2822
##   Americas         326    1688
##   Asia             436    2196
##   Europe          1493     691
##   Oceania           64     608
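Once a two-way table exists, margins and proportions are one function call away. The sketch below uses the built-in `mtcars` data instead of `gapminder`, and the 20 mpg cutoff is an arbitrary choice made only for illustration.

```r
# Derived grouping variable, built the same way as TFLcutoff above.
mpg_group <- ifelse(mtcars$mpg <= 20, "At or below", "Exceeds")

# Rows = cylinders, columns = mpg group.
two_way <- table(mtcars$cyl, mpg_group)
two_way

# Row and column totals.
addmargins(two_way)

# Row proportions: each row sums to 1.
row_props <- prop.table(two_way, margin = 1)
round(row_props, 2)

# table() silently drops missing values unless asked to keep them.
table(c("a", "b", NA), useNA = "ifany")
```

The silent dropping of `NA` is why the rows in the continent table above add up to less than the raw continent counts.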
gapminder[complete.cases(gapminder),] %>% select(infant_mortality:population) %>% cov()
##                  infant_mortality life_expectancy     fertility    population
## infant_mortality       2170.30577      -446.48580  7.892329e+01  8.829752e+06
## life_expectancy        -446.48580       108.41315 -1.734594e+01  9.496002e+06
## fertility                78.92329       -17.34594  4.020522e+00 -2.272713e+07
## population          8829751.70792   9496002.35247 -2.272713e+07  1.380468e+16
What is `complete.cases` for?
gapminder[complete.cases(gapminder),] %>%
  select(infant_mortality:population) %>%
  cor() #Pearson is default method
##                  infant_mortality life_expectancy   fertility   population
## infant_mortality       1.00000000    -0.920462741  0.84489650  0.001613150
## life_expectancy       -0.92046274     1.000000000 -0.83083660  0.007762232
## fertility              0.84489650    -0.830836601  1.00000000 -0.096469514
## population             0.00161315     0.007762232 -0.09646951  1.000000000
gapminder[complete.cases(gapminder),] %>%
  select(infant_mortality:population) %>%
  cor(method = "spearman")
##                  infant_mortality life_expectancy   fertility  population
## infant_mortality       1.00000000     -0.93578029  0.88870740  0.01592765
## life_expectancy       -0.93578029      1.00000000 -0.83688091  0.04472858
## fertility              0.88870740     -0.83688091  1.00000000 -0.09106856
## population             0.01592765      0.04472858 -0.09106856  1.00000000
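A tiny data frame shows what `complete.cases` returns and why it is applied before `cov`/`cor`. This is an illustrative sketch with made-up values in a hypothetical data frame `df`, not part of the gapminder analysis.

```r
# Toy data with one missing value in each column (values are invented).
df <- data.frame(x = c(1, 2, NA, 4),
                 y = c(10, NA, 30, 40))

# TRUE only for rows with no NA in any column.
keep <- complete.cases(df)
keep

# Row subsetting keeps just the fully observed rows.
df_complete <- df[keep, ]
df_complete

# Without filtering, a single NA makes the correlation NA.
cor(df$x, df$y)

# cor() can also do the filtering itself through its `use` argument.
cor(df$x, df$y, use = "complete.obs")
```

Note that `complete.cases(gapminder)` checks every column, so rows missing a value in a column that is not selected afterwards are dropped as well.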
install.packages("Hmisc")
library(Hmisc)
`rcorr` function which computes correlations and p-values
`rcorr`
gapminder[complete.cases(gapminder),] %>%
  select(infant_mortality:population) %>%
  as.matrix() %>%
  rcorr()
##                  infant_mortality life_expectancy fertility population
## infant_mortality             1.00           -0.92      0.84       0.00
## life_expectancy             -0.92            1.00     -0.83       0.01
## fertility                    0.84           -0.83      1.00      -0.10
## population                   0.00            0.01     -0.10       1.00
##
## n= 7139
##
##
## P
##                  infant_mortality life_expectancy fertility population
## infant_mortality                  0.0000          0.0000    0.8916
## life_expectancy  0.0000                           0.0000    0.5120
## fertility        0.0000           0.0000                    0.0000
## population       0.8916           0.5120          0.0000
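For a single pair of variables, base R's `cor.test` gives the correlation and its p-value without an extra package. A minimal sketch on the built-in `cars` data, chosen here only so the block is self-contained; it is not a replacement for the full `rcorr` matrix.

```r
# Pearson correlation (the default) with a test of H0: correlation = 0.
pearson <- cor.test(cars$speed, cars$dist)
pearson$estimate   # sample correlation
pearson$p.value    # p-value for the test

# Spearman version; exact = FALSE avoids the warning about ties.
spearman <- cor.test(cars$speed, cars$dist, method = "spearman", exact = FALSE)
spearman$estimate
```

The `estimate` component matches what `cor(cars$speed, cars$dist)` returns for the same method.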
Analyses depend on data type: numeric vs categorical
Select (or create!) function to get summary statistic of interest
As with most things, multiple ways to get frequencies/contingency tables, numeric summaries, etc.
- Use the built-in data frame, `iris`
- Summarize the categorical variable, `Species`
- Compute the five number summary (min, Q1, median, Q3, max) for each of the length variables
- Compute the mean and standard deviation for the width variables
- Determine the Spearman correlation matrix for the four numeric variables
- Advanced #1: Get a two-way table with `Species` as the column variable and with rows based on whether `Sepal.Length<=5.8`
- Advanced #2: Find all observations that have a `Sepal.Length` that is more than two standard deviations away from its mean.
R Markdown Basics
R Markdown Options: Part 1
R Markdown Options: Part 2
R Markdown with Numerical Summaries
R Markdown with Graphics
ggplot2
- `ggplot2` often associated with tidyverse
- `lattice` plotting
- `ggplot2` works in layers
  - `ggplot(data = data_frame)` to prepare a plot area (canvas)
ggplot2 Layers
- `ggplot(data = data_frame)` prepares the canvas
- Layer examples:
  - `geoms` are geometric objects like bars, histograms, lines, points, text
  - `labs` controls graph and axis labels
- Apply/modify layer settings by:
  - `aes` to control aesthetics within a layer
- Use `facet` to build panel of graphs
How do we build the following plot?
ggplot(data = gapminder)
`aes`
ggplot(data = gapminder, aes(x = continent))
`geom_bar` Layer
ggplot(data = gapminder, aes(x = continent)) + geom_bar()
geom_bar(stat = "count")
aes(x = continent, y = IQR)
geom_bar(stat = "identity")
g <- ggplot(data = gapminder)
saves just the canvas
g <- ggplot(data = gapminder, aes(x=continent))+geom_bar()
saves basic bar chart
ggplot(data = gapminder, aes(x = continent)) + geom_bar() + labs(x = "Continent", y = "Frequency", title = "Absolute Frequency of Responses in Gapminder Data")
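Because each step returns a plot object, the canvas, the bar chart, and the labelled chart can be stored and extended separately. The sketch below assumes `ggplot2` is installed and uses its bundled `mpg` data rather than `gapminder`; the object names and label text are illustrative.

```r
library(ggplot2)

# Step 1: canvas only (data and aesthetics, no layers yet).
canvas <- ggplot(data = mpg, aes(x = class))

# Step 2: add a geom layer to get a basic bar chart of counts.
bars <- canvas + geom_bar()

# Step 3: add labels; earlier objects are left unchanged.
labelled <- bars +
  labs(x = "Vehicle class", y = "Frequency",
       title = "Absolute Frequency of Vehicle Classes")

# The number of layers shows what each object contains.
length(canvas$layers)
length(bars$layers)

# Printing a plot object draws it.
labelled
```

Saving intermediate objects this way makes it easy to try several label or scale variations on the same base chart.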
- `ggplot` for canvas
- `geom_bar` for graph of choice
- `labs` for labels
- `scale_*_*` to adjust axis elements
- `scale_y_continuous` makes changes to a y axis with continuous variable
How do we update our previous bar chart to look like this?
Remove rows with `fertility = NA`
g2 <- gapminder %>% subset(is.na(fertility) == FALSE)
Add `fill` to `aes`
g2 <- gapminder %>% subset(is.na(fertility) == FALSE)
ggplot(data = g2, aes(x = continent, fill = (fertility<2.1))) +
  geom_bar() +
  labs(x = "Continent", y = "Frequency",
       title = "Absolute Frequency of Responses in Gapminder Data") +
  scale_y_continuous(labels = comma)
Set `position=`
g2 <- gapminder %>% subset(is.na(fertility) == FALSE)
ggplot(data = g2, aes(x = continent, fill = (fertility<2.1))) +
  geom_bar(position = "dodge") +
  labs(x = "Continent", y = "Frequency",
       title = "Absolute Frequency of Responses in Gapminder Data") +
  scale_y_continuous(labels = comma)
g2 <- gapminder %>% subset(is.na(fertility) == FALSE)
ggplot(data = g2, aes(x = continent, fill = (fertility<2.1))) +
  geom_bar(position = "dodge") +
  labs(x = "Continent", y = "Frequency",
       title = "Absolute Frequency of Responses in Gapminder Data",
       subtitle = "Grouped by Total Fertility Rate < 2.1") +
  scale_y_continuous(labels = comma) +
  scale_fill_discrete(name = "TFR<2.1", labels = c("No", "Yes"))
`facet`ing Your Graph
g2 <- gapminder %>% subset(is.na(fertility) == FALSE & year > 2011)
ggplot(data = g2, aes(x = continent, fill = (fertility<2.1))) +
  geom_bar(position = "dodge") +
  labs(x = "Continent", y = "Frequency",
       title = "Absolute Frequency of Responses in Gapminder Data",
       subtitle = "Grouped by Total Fertility Rate < 2.1") +
  scale_y_continuous(labels = comma) +
  scale_fill_discrete(name = "TFR<2.1", labels = c("No", "Yes")) +
  facet_wrap(~year)
`facet`
- `facet_wrap(~var)`
  - one panel for each level of `var`
  - `nrow` and `ncol` options let you specify size
- `facet_grid(var1 ~ var2)`
  - `var1` create rows
  - `var2` create columns
  - `. ~ var2` or `var1 ~ .`
`facet_grid`
g3 <- gapminder %>% subset(is.na(fertility) == FALSE & year > 2013)
ggplot(data = g3, aes(x = continent, fill = (fertility<2.1))) +
  geom_bar(position = "fill") +
  facet_grid((population > 1000000)~year)
c <- ggplot(data = gapminder, aes(x = fertility))
c + geom_histogram(na.rm = TRUE)
- `binwidth`: width of bins!
- `fill`: fill color
- `color`: outline color
- `linetype`: dotted, dashed, etc.
- `linewidth`: thickness of the outline
- `alpha`: transparency of fill
c + geom_histogram(color = "blue", linetype = 2, linewidth = 0.75, fill = "#FF0000", alpha = 0.5, binwidth = 0.25, na.rm = TRUE)
`geom_density`
- `kernel=` option to select density type
- `adjust` controls how much smoothing (the counterpart of `binwidth` in `geom_histogram`)
ggplot(gapminder, aes(x=fertility)) + geom_density(kernel = "gaussian", na.rm = TRUE)
adjust
ggplot(gapminder, aes(x=fertility)) + geom_density(kernel = "gaussian", adjust = .1, na.rm = TRUE)
ggplot(gapminder, aes(x=fertility)) + geom_density(kernel = "gaussian", adjust = 10, na.rm = TRUE)
ggplot(gapminder, aes(x=fertility)) + geom_histogram(aes(y = after_stat(density)), alpha = 0.5, binwidth = 0.25, na.rm = TRUE) + geom_density(kernel = "triangular", linewidth = 1.25, color = "red", na.rm = TRUE)
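The effect of `adjust` can be checked numerically with base R's `density()`, which takes the same `kernel` and `adjust` arguments. A minimal sketch on the built-in `faithful` data; treating this as a stand-in for what `geom_density` computes is an assumption made for illustration.

```r
x <- faithful$eruptions

# Default bandwidth, then a much rougher and a much smoother estimate.
d_default <- density(x, kernel = "gaussian")
d_rough   <- density(x, kernel = "gaussian", adjust = 0.1)
d_smooth  <- density(x, kernel = "gaussian", adjust = 10)

# adjust multiplies the default bandwidth.
c(default = d_default$bw, rough = d_rough$bw, smooth = d_smooth$bw)

# A smaller bandwidth follows the data more closely, so the peak is taller.
c(rough = max(d_rough$y), default = max(d_default$y), smooth = max(d_smooth$y))
```

A very small `adjust` chases individual observations, while a very large one can hide real features such as the two modes in this data.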
Histogram default is `position = "stack"` which is … problematic
g4 <- gapminder %>% subset(is.na(fertility)==FALSE & year > 2013)
ggplot(g4, aes(x=fertility, fill=factor(year))) +
  geom_histogram(alpha=.5)
g4 <- gapminder %>% subset(is.na(fertility)==FALSE & year > 2013)
ggplot(g4, aes(x=fertility, fill=factor(year))) +
  geom_density(alpha=.5)
Histogram uses `"stack"` but density uses `"identity"`
g4 <- gapminder %>% subset(is.na(fertility)==FALSE & year > 2013)
ggplot(g4, aes(x=fertility, fill=as.factor(year))) +
  geom_histogram(aes(y = after_stat(density)), bins = 50) +
  geom_density(alpha=.5)
Match `position` across layers
g4 <- gapminder %>% subset(is.na(fertility)==FALSE & year > 2013)
ggplot(g4, aes(x=fertility, fill=factor(year))) +
  geom_histogram(aes(y = after_stat(density)), bins = 50) +
  geom_density(alpha=.5, position = "stack")
`alpha` and `fill` as well as `lower`, `middle`, and `upper`
g5 <- gapminder %>% subset(is.na(fertility)==FALSE & year > 2009)
c <- ggplot(g5, aes(x=factor(year), y = fertility))
c + geom_boxplot(fill="grey") + labs(x="Year")
`geom_jitter` by adding this layer second
g5 <- gapminder %>% subset(is.na(fertility)==FALSE & year > 2009)
c <- ggplot(g5, aes(x=factor(year), y = fertility))
c + geom_boxplot(fill="grey") +
  geom_jitter(width=.15, color = "blue") +
  labs(x="Year")
g6 <- gapminder %>%
  subset(is.na(fertility) == FALSE & is.na(infant_mortality) == FALSE & year > 2012)
ggplot(g6, aes(x=fertility, y=infant_mortality)) +
  geom_point(color="blue", shape = 3, size=2, stroke = .5)
- `aes` to set data-driven aesthetics
- `color` and `shape` now cycle based on data
ggplot(g6, aes(x=fertility, y=infant_mortality)) +
  geom_point(size=2, stroke = .5, aes(shape = factor(year), color = continent)) +
  labs(x="Total Fertility Rate", y = "Infant Mortality Rate") +
  scale_shape_discrete(name = "Year", labels = c("'13", "'14", "'15")) +
  scale_color_discrete(name = "Continent")
Choosing the right model is stats, not programming! :D
Use `col` to name the models for easy referencing!
ggplot(g6, aes(x = fertility, y = infant_mortality)) +
  geom_point() +
  geom_smooth(aes(col = "loess")) +
  geom_smooth(method = lm, aes(col = "bob")) +
  scale_colour_manual(name = 'Smoother',
                      values = c('bob' = 'red', 'loess' = 'purple'),
                      labels = c('Linear', 'LOESS'),
                      guide = 'legend')
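The two kinds of smoother can also be fit directly in base R, which helps when the fitted values are needed as numbers and not only as lines on a plot. A minimal sketch on the built-in `cars` data; it is an assumption here that these fits are comparable to what `geom_smooth` draws with `method = lm` and with its default for small data sets.

```r
# Straight-line fit and local-regression (loess) fit of the same relationship.
fit_lm    <- lm(dist ~ speed, data = cars)
fit_loess <- loess(dist ~ speed, data = cars)

# Compare predictions on a small grid inside the observed range of speed.
grid <- data.frame(speed = seq(min(cars$speed), max(cars$speed), length.out = 5))
fits <- cbind(grid,
              linear = predict(fit_lm, newdata = grid),
              loess  = predict(fit_loess, newdata = grid))
round(fits, 1)

# Slope of the linear fit: extra stopping distance per unit of speed.
coef(fit_lm)[["speed"]]
```

Where the two columns of `fits` diverge, the straight line is missing curvature, and deciding whether that matters is the statistical judgment mentioned above.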
`ggplot2` just one way to do graphics
Works by adding layers and changing aesthetics
General plan: canvas + graph + labels/legends/etc.
Graphs can be quite tedious, so plan ahead!
`starwars` data set from the tidyverse
`iris` data set from base R
Thank you!
Questions?