August 8, 2023

Daily Schedule

  • 9:30 - 10:35 Morning Session #1
  • 10:35 - 10:50 Morning Break
  • 10:50 - 11:55 Morning Session #2
  • 11:55 - 1:10 Lunch
  • 1:10 - 2:15 Afternoon Session #1
  • 2:15 - 2:30 Afternoon Break
  • 2:30 - 3:35 Afternoon Session #2
  • 3:35 - 3:40 Feedback/Q&A

Outline

  • R packages
  • Reading and writing data files (readr package)
  • Subsetting data by rows, columns, or both (dplyr package)
  • Deriving new variables unconditionally and conditionally (dplyr package)
  • Introduction to summarizing data

R Packages

  • Packages = bundle of features/functions

  • Download from CRAN

  • Careful: no central group → no constant syntax + no constant quality

  • Two actions with every new package:
  • Install: Downloads package to your computer. Only needed once.
    • Like local library getting a new book
  • Load: Allow current R session to access a package. Needed each time you want to use it.
    • Like checking that book out from your local library

Installing a Package

  • Use install.packages("name-of-package-goes-here") to download a package

  • Use library(name-of-package-goes-here) to use a package in your current session

  • You’ll see progress and messages in your Console
  • Common info:
    • Locations (red): where on your computer is package going and from where is it being downloaded
    • Size (red): Size of the download and how much is downloaded to your computer
    • Message (black): package “unpacked” and “sums checked”
    • Conflicts: Functions that have the same name as other functions

Aside: Comments on Packages

  • Do not blindly update packages

  • R does not ensure backwards compatibility

  • Updates can cause your program to stop working until you update with new code

  • Big/important project? → Document which version of packages you used

    • This way you can roll back an update to the right package if you need to!

Tidyverse

  • Collection of packages all designed to work together
    • (fairly) consistent syntax and solid quality!
install.packages("tidyverse") #Order your book

library(tidyverse) #Check out your book
  • If you plan to follow along, run that code!

Tidyverse

  • Collection of packages all designed to work together
    • (fairly) consistent syntax and solid quality!
install.packages("tidyverse") #Order your book

library(tidyverse) #Check out your book
  • If you plan to follow along, run that code!

  • Should see the following about conflicts

What’s in the tidyverse?

  • tibble - better data frames

  • readr - easier to read in external data

  • tidyr and dplyr - easier to clean up (aka tidy!) your data

  • Other packages are useful, but not for this intro course

tibble

  • Tibble is a table that has been “tidyverse-d”

  • Essentially, nicer printing properties + forcing better data management habits

  • Data created by tidyverse is tibble by default

  • For OG data frames, use as_tibble

growth <- as_tibble(ToothGrowth)
growth
## # A tibble: 60 × 3
##      len supp   dose
##    <dbl> <fct> <dbl>
##  1   4.2 VC      0.5
##  2  11.5 VC      0.5
##  3   7.3 VC      0.5
##  4   5.8 VC      0.5
##  5   6.4 VC      0.5
##  6  10   VC      0.5
##  7  11.2 VC      0.5
##  8  11.2 VC      0.5
##  9   5.2 VC      0.5
## 10   7   VC      0.5
## # ℹ 50 more rows

Tibble Wasn’t a Data Type!

  • Concerned we only learned about atomic vector, matrix, data frame, and list?

  • Those are the fundamental data types!

is.data.frame(growth)
## [1] TRUE
is_tibble(growth) #tidyverse functions tend to use _ and not .
## [1] TRUE

Tibble Wasn’t a Data Type!

  • Concerned we only learned about atomic vector, matrix, data frame, and list?

  • Those are the fundamental data types!

is.data.frame(growth)
## [1] TRUE
is_tibble(growth) #tidyverse functions tend to use _ and not .
## [1] TRUE
  • Think of a tibble like it’s an Instagram filter.

    • Still just a data frame underneath, but fancier!

Accessing External Data

  • Data can come to you in (almost) infinitely many different ways!
  • External, fully-structured data: SAS, SPSS, Stata, SQL database, MS Access
  • External, semi-structured data: MS Excel, HTML/XML, even an API
  • External, unstructured data: delimited, fixed-field, combo platter
  • Less structure ⇒ more programming work
  • Structured data doesn’t magically happen; programmers make it happen!

Reading Data with readr

  • Some base functions for reading raw data, but we’ll focus on tidyverse functions

  • read_csv: files delimited by a comma

  • read_tsv: files delimited by a tab

  • read_delim: files delimited by any character (including comma & tab)

  • read_fwf: files with fixed-width fields

  • read_table: files “structured” like a table using spaces (not tabs!)

  • Important Note: tabs and spaces are NOT the same thing!

Reading External Data – Adding Context!

  • Pulling from a case study on a clinical trial.
  • Multiple files for Site 1: Lab and Visit data at Baseline and 3, 6, and 9 months.
  • Each file helps us learn new functions/features of tidyverse
  • To read raw data, you need a good text editor
    • I strongly recommend Visual Studio Code
    • Default editor on your computer is usually insufficient
  • Data available here if you have not already downloaded it
    • Make sure you know where you download it!
    • Strongly suggest you put it and your code in your working directory

Example 1 - Site 1, Baseline Visit.CSV

  • Always look at data first!

  • What do we notice?

Example 1 - Site 1, Baseline Visit.CSV

  • Always look at data first!

  • What do we notice?

vis00 <- read_csv("data files/Site 1, Baseline Visit.csv") #Fingers crossed!?
## New names:
## Rows: 127 Columns: 17
## ── Column specification
## ──────────────────────────────────────────────────────── Delimiter: "," chr
## (8): ...2, M, 22JAN2018, ...6, mm-hg, beats/min, RECLINED, lb dbl (8): 1...1,
## 1...3, 110, 80, 75, 98.4, 192, 2 lgl (1): F
## ℹ Use `spec()` to retrieve the full column specification for this data. ℹ
## Specify the column types or set `show_col_types = FALSE` to quiet this message.
## • `1` -> `1...1`
## • `` -> `...2`
## • `1` -> `1...3`
## • `` -> `...6`

Example 1 - Adding Column Names

vis00 <- read_csv("data files/Site 1, Baseline Visit.csv", 
                  col_names=c("Subject", "ScreenFailReas", "Screen", "Sex", 
                              "VisitDate", "NotifDate", "SBP", "DBP", "BPUnit",
                              "Pulse", "PulseUnit", "Pos", "Temp", "TempUnit",
                              "Wt", "WtUnit", "Pain")) 
## Rows: 128 Columns: 17
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (8): ScreenFailReas, Sex, VisitDate, NotifDate, BPUnit, PulseUnit, Pos, ...
## dbl (8): Subject, Screen, SBP, DBP, Pulse, Temp, Wt, Pain
## lgl (1): TempUnit
## 
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.

Example 1 - Adding Column Names: Round 2

visNames <- c("Subject", "ScreenFailReas", "Screen", "Sex", "VisitDate", "NotifDate", 
              "SBP", "DBP", "BPUnit", "Pulse", "PulseUnit", "Pos", "Temp", "TempUnit", 
              "Wt", "WtUnit", "Pain")

Example 1 - Adding Column Names: Round 2

visNames <- c("Subject", "ScreenFailReas", "Screen", "Sex", "VisitDate", "NotifDate",
              "SBP", "DBP", "BPUnit", "Pulse", "PulseUnit", "Pos", "Temp", "TempUnit",
              "Wt", "WtUnit", "Pain")
vis00 <- read_csv("data files/Site 1, Baseline Visit.csv", col_names=visNames) 
## Rows: 128 Columns: 17
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (8): ScreenFailReas, Sex, VisitDate, NotifDate, BPUnit, PulseUnit, Pos, ...
## dbl (8): Subject, Screen, SBP, DBP, Pulse, Temp, Wt, Pain
## lgl (1): TempUnit
## 
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.

Example 1 - Setting Column Class

vis00 <- read_csv("data files/Site 1, Baseline Visit.csv", col_names=visNames, 
                  col_types=cols(TempUnit=col_character()))
  • col_types= option sets column types
  • cols() function allows you to specify types multiple ways
  • Var-Name = var-specification allows you to choose for specific column
  • Other options like col_numeric, col_date, col_factor and many others exist
  • View the vis00 tibble to confirm TempUnit is no longer FALSE!

Example 1 - More Column Classses: Dates

vis00 <- read_csv("data files/Site 1, Baseline Visit.csv", col_names=visNames, 
                  col_types=cols(TempUnit=col_character(), 
                          VisitDate=col_date(format="%d%b%Y"), 
                          NotifDate=col_date(format="%d%b%Y")))

Example 2 - Site 1, 3 Month Visit.txt

  • View raw data to confirm it is tab-delimited & has variable names in first row!

Example 2 - Site 1, 3 Month Visit.txt

  • View raw data to confirm it is tab-delimited & has variable names in first row!

  • Now that you know what raw data looks like, read it in!

vis03 <- read_tsv("data files/Site 1, 3 Month Visit.txt", 
           col_types=cols(TempUnit=col_character(), 
                          VisitDate=col_date(format="%d%b%Y"), 
                          NotifDate=col_date(format="%d%b%Y")))

Example 3 - Site 1, 6 Month Visit.txt

  • View raw data to determine how to read it in

Example 3 - Site 1, 6 Month Visit.txt

  • View raw data to determine how to read it in
vis06 <- read_delim("data files/Site 1, 6 Month Visit.txt", delim="/",
           col_names = visNames,
           col_types=cols(TempUnit=col_character(), 
                          VisitDate=col_date(format="%d%b%Y"), 
                          NotifDate=col_date(format="%d%b%Y")))
  • How did R distinguish between “/” as delimiter and “/” in beats/min?

Aside – Fixed-Width Data

  • Delimited data is easier to parse – delimiters do the heavy lifting!
  • In fixed-field data, values are always in the same position
    • I.e., always begin in the same column
  • Not necessarily any break between fields!
  • Take a look at Site 1, 9 Month Visit.dat as an example
    • Don’t confuse spaces for delimiters!

Fixed-Width Data: Identifying Field Positions

  • read_fwf offers multiple options to identify positions

  • fwf_empty guesses based on empty columns

  • fwf_widths uses column widths that you provide

  • fwf_positions uses column positions that you provided

Example 4 - Site 1, 9 Month Visit.dat

myFile<-"data files/Site 1, 9 Month Visit.dat"
vis09 <- read_fwf(myFile, col_positions = fwf_empty(myFile))
  • Ignoring variable name and type issues, did we get the data we wanted?

Example 4 - Self-Validating the Results

myFile<-"data files/Site 1, 9 Month Visit.dat"
vis09 <- read_fwf(myFile, col_positions = fwf_empty(myFile))
vis09
## # A tibble: 74 × 18
##       X1 X2    X3       X4 X5    X6    X7       X8    X9 X10     X11 X12   X13  
##    <dbl> <chr> <chr> <dbl> <chr> <chr> <chr> <dbl> <dbl> <chr> <dbl> <chr> <chr>
##  1     1 <NA>  <NA>      1 M     25JU… <NA>    114    82 mm-hg    77 beat… RECL…
##  2     2 <NA>  <NA>      1 F     28JU… <NA>    101    84 mm-hg    79 beat… RECU…
##  3     4 DROP… HBP       0 M     17JU… 17JU…   110    91 mm-hg    89 beat… RECL…
##  4     6 <NA>  <NA>      1 M     25JU… <NA>     82    50 mm-hg    43 beat… RECU…
##  5     8 <NA>  <NA>      1 F     25JU… <NA>     82    64 mm-hg    57 beat… RECU…
##  6     9 <NA>  <NA>      1 M     02AU… <NA>     99    79 mm-hg    83 beat… RECU…
##  7    10 <NA>  <NA>      1 M     22JU… <NA>     99    84 mm-hg    75 beat… RECU…
##  8    11 <NA>  <NA>      1 F     25JU… <NA>     98    88 mm-hg    86 beat… RECL…
##  9    13 <NA>  <NA>      1 F     25JU… <NA>     96    61 mm-hg    57 beat… RECU…
## 10    16 <NA>  <NA>      1 M     24JU… <NA>    108    86 mm-hg    81 beat… RECU…
## # ℹ 64 more rows
## # ℹ 5 more variables: X14 <dbl>, X15 <lgl>, X16 <dbl>, X17 <chr>, X18 <dbl>

Example 4 - Using fwf_widths

myFile<-"data files/Site 1, 9 Month Visit.dat"
vis09 <- read_fwf(myFile, 
                  col_positions = fwf_widths(c(4,26,2,2,10,10,4,4,9,4,10,10,6,9,4,9,8),
                                             visNames),
                  col_types=cols(TempUnit=col_character(),
                                 VisitDate=col_date(format="%d%b%Y"), 
                                 NotifDate=col_date(format="%d%b%Y")))
vis09
## # A tibble: 74 × 17
##    Subject ScreenFailReas Screen Sex   VisitDate  NotifDate    SBP   DBP BPUnit
##      <dbl> <chr>           <dbl> <chr> <date>     <date>     <dbl> <dbl> <chr> 
##  1       1 <NA>                1 M     2018-07-25 NA           114    82 mm-hg 
##  2       2 <NA>                1 F     2018-07-28 NA           101    84 mm-hg 
##  3       4 DROPPED: HBP        0 M     2018-07-17 2018-07-17   110    91 mm-hg 
##  4       6 <NA>                1 M     2018-07-25 NA            82    50 mm-hg 
##  5       8 <NA>                1 F     2018-07-25 NA            82    64 mm-hg 
##  6       9 <NA>                1 M     2018-08-02 NA            99    79 mm-hg 
##  7      10 <NA>                1 M     2018-07-22 NA            99    84 mm-hg 
##  8      11 <NA>                1 F     2018-07-25 NA            98    88 mm-hg 
##  9      13 <NA>                1 F     2018-07-25 NA            96    61 mm-hg 
## 10      16 <NA>                1 M     2018-07-24 NA           108    86 mm-hg 
## # ℹ 64 more rows
## # ℹ 8 more variables: Pulse <dbl>, PulseUnit <chr>, Pos <chr>, Temp <dbl>,
## #   TempUnit <chr>, Wt <dbl>, WtUnit <chr>, Pain <dbl>

Widths vs Positions

  • fwf_widths has to account for spaces between fields and read them as data

  • fwf_positions can specify (start, end) positions of each variable

Widths vs Positions

  • fwf_widths has to account for spaces between fields and read them as data

  • fwf_positions can specify (start, end) positions of each variable

  • For our data, you can use your text editor to find that those positions are:

Subject ScreenFailReas Screen Sex VisitDate NotifDate SBP DBP BPUnit
1-3 5-29 31 33 35-43 45-53 55-57 59-61 63-67
Pulse PulseUnit Pos Temp TempUnit Wt WtUnit Pain
72-74 76-84 86-94 96-100 102 111-113 115- 116 131

Example 4 - Using fwf_positions

  • Getting those positions is annoying, but is only half the battle
myFile <- "data files/Site 1, 9 Month Visit.dat"
startPos <- c(1, 5, 31, 33, 35, 45, 55, 59, 63, 72, 76, 86, 96, 102, 111, 115, 131)
endPos <- c(3, 29, 31, 33, 43, 53, 57, 61, 67, 74, 84, 94, 100, 102, 113, 116, 131)
vis09 <- read_fwf(myFile, 
                  col_positions = fwf_positions(start = startPos,
                                                end = endPos,
                                                col_names = visNames),
                  col_types = cols(TempUnit=col_character(),
                                   VisitDate=col_date(format="%d%b%Y"),
                                   NotifDate=col_date(format="%d%b%Y")))

How Did We Do?

vis09
## # A tibble: 74 × 17
##    Subject ScreenFailReas Screen Sex   VisitDate  NotifDate    SBP   DBP BPUnit
##      <dbl> <chr>           <dbl> <chr> <date>     <date>     <dbl> <dbl> <chr> 
##  1       1 <NA>                1 M     2018-07-25 NA           114    82 mm-hg 
##  2       2 <NA>                1 F     2018-07-28 NA           101    84 mm-hg 
##  3       4 DROPPED: HBP        0 M     2018-07-17 2018-07-17   110    91 mm-hg 
##  4       6 <NA>                1 M     2018-07-25 NA            82    50 mm-hg 
##  5       8 <NA>                1 F     2018-07-25 NA            82    64 mm-hg 
##  6       9 <NA>                1 M     2018-08-02 NA            99    79 mm-hg 
##  7      10 <NA>                1 M     2018-07-22 NA            99    84 mm-hg 
##  8      11 <NA>                1 F     2018-07-25 NA            98    88 mm-hg 
##  9      13 <NA>                1 F     2018-07-25 NA            96    61 mm-hg 
## 10      16 <NA>                1 M     2018-07-24 NA           108    86 mm-hg 
## # ℹ 64 more rows
## # ℹ 8 more variables: Pulse <dbl>, PulseUnit <chr>, Pos <chr>, Temp <dbl>,
## #   TempUnit <chr>, Wt <dbl>, WtUnit <chr>, Pain <dbl>

Fixed-Field Final Comments

  • I hate fixed-field data so much. So, so, so much.
  • Good data practice says fixed data should come with a code book!
  • Great text editor is a must – again, Visual Studio Code is one of the best
  • Very common because no delimiters needed → compact storage
  • Easier to skip variables if you only want to read in part of the data

Reading Data with readxl and haven

  • So far, only looked at unstructured data

  • Data handling/manipulation is a huuuuuuge part of data analysis

    • Doesn’t mean it is everyone’s favorite task
    • Positive attitude goes a long way :D
  • What about data that is already semi-/fully-structured?
    • readxl package for reading MS Excel files
    • haven package for reading from other stat software

Example 5 - Sheets of Site 1 Lab.xlsx

library(readxl) #Part of tidyverse, but not loaded with it
excel_sheets("data files/Site 1 Lab.xlsx") #list all possible sheets
## [1] "Site1Lab00" "Site1Lab03" "Site1Lab06"

Example 5 - Reading a Sheet from Site 1 Lab.xlsx

library(readxl) #Part of tidyverse, but not loaded with it
excel_sheets("data files/Site 1 Lab.xlsx") #list all possible sheets
## [1] "Site1Lab00" "Site1Lab03" "Site1Lab06"
lab00<-read_excel("data files/Site 1 Lab.xlsx", sheet = "Site1Lab00")
#how did we do?
head(lab00)
## # A tibble: 6 × 19
##   Subject VisitDate           NotifDate ScreenFailReas Screen Sex     ALB
##     <dbl> <dttm>              <lgl>     <lgl>           <dbl> <chr> <dbl>
## 1       1 2018-01-22 00:00:00 NA        NA                  1 M       4.4
## 2       2 2018-01-23 00:00:00 NA        NA                  1 F       4.8
## 3       4 2018-01-20 00:00:00 NA        NA                  1 M       4.3
## 4       6 2018-01-24 00:00:00 NA        NA                  1 M       4.4
## 5       8 2018-01-21 00:00:00 NA        NA                  1 F       4.2
## 6       9 2018-01-27 00:00:00 NA        NA                  1 M       4.5
## # ℹ 12 more variables: AlkPhos <dbl>, ALT <dbl>, AST <dbl>, DBili <dbl>,
## #   GGTP <dbl>, CGluc <dbl>, UGluc <dbl>, TBili <dbl>, Prot <dbl>,
## #   Hemoglob <dbl>, Hematocr <dbl>, Preg <dbl>

## Example 5 - Setting Column Types

myFile <- "data files/Site 1 Lab.xlsx"
colTypes = c("numeric", "date", "date", "text", "numeric", "text", rep("numeric",13))
lab00<-read_excel(myFile, sheet = "Site1Lab00", col_types = colTypes)
lab03<-read_excel(myFile, sheet = "Site1Lab03", col_types = colTypes)
lab06<-read_excel(myFile, sheet = "Site1Lab06", col_types = colTypes)
head(lab00)
## # A tibble: 6 × 19
##   Subject VisitDate           NotifDate ScreenFailReas Screen Sex  
##     <dbl> <dttm>              <dttm>    <chr>           <dbl> <chr>
## 1       1 2018-01-22 00:00:00 NA        <NA>                1 M    
## 2       2 2018-01-23 00:00:00 NA        <NA>                1 F    
## 3       4 2018-01-20 00:00:00 NA        <NA>                1 M    
## 4       6 2018-01-24 00:00:00 NA        <NA>                1 M    
## 5       8 2018-01-21 00:00:00 NA        <NA>                1 F    
## 6       9 2018-01-27 00:00:00 NA        <NA>                1 M    
## # ℹ 13 more variables: ALB <dbl>, AlkPhos <dbl>, ALT <dbl>, AST <dbl>,
## #   DBili <dbl>, GGTP <dbl>, CGluc <dbl>, UGluc <dbl>, TBili <dbl>, Prot <dbl>,
## #   Hemoglob <dbl>, Hematocr <dbl>, Preg <dbl>

read_excel Features

  • Familiar features like col_names and col_types

  • Excel-specific features like range

  • Professional/Personal advice: you should never store data in Excel
    • Excel is a spreadsheet tool, not a database
    • If you want a Microsoft tool, then use MS Access

Example 6 - Site1Lab09.sas7bdat

library(haven)
lab09<-read_sas("data files/Site1Lab09.sas7bdat")
  • Just like reading from any third-party software, some special features

  • Notably, SAS variables usually have labels

    • Adds a new attribute when you investigate the tibble

Reading Data Wrap Up

  • Scratching the surface of data reading

  • Infinite* ways to “creatively” store data ⇒ many combinations of our tools

  • Worst case scenario, read entire file into a single string and parse it by hand
    • Definitely not a task for new programmers!
  • Writing data is as easy as using write_* functions instead of read_*

Activity #1

Three versions of a data set regarding patients’ blood measurements have been provided. Each file contains the same data, just presented differently.

  1. Read in bloodv1.dat

  2. Read in bloodv2.prn (Hint: use skip= to skip any rows you do not want to read.)

  3. Read in bloodv3.xlsx

Outline

  • R packages
  • Reading and writing data files (readr package)
  • Subsetting data by rows, columns, or both (dplyr package)
  • Deriving new variables unconditionally and conditionally (dplyr package)
  • Introduction to summarizing data

Types of Manipulations

  • Subset by:
    • filtering out rows
    • selecting columns
  • Combine data sets:
    • vertically
    • horizontally
  • Reshape data sets:
    • tall (stacked) to wide (unstacked)
    • wide (unstacked) to tall (stacked)

Review Logical Filtering

  • Already seen how to filter an object based on a Boolean (logical) vector
lab06$ScreenFailReas == "PREG."
##  [1]   NA   NA   NA   NA   NA   NA   NA   NA   NA   NA   NA   NA   NA   NA   NA
## [16]   NA   NA   NA   NA   NA   NA   NA   NA   NA   NA   NA   NA   NA   NA   NA
## [31]   NA   NA   NA   NA   NA   NA   NA   NA   NA   NA   NA   NA TRUE   NA   NA
## [46]   NA   NA   NA   NA   NA   NA   NA   NA   NA   NA   NA   NA   NA   NA   NA
## [61]   NA   NA   NA   NA   NA   NA   NA   NA   NA   NA TRUE   NA

Review Logical Filtering

  • Already seen how to filter an object based on a Boolean (logical) vector
sf<-lab06$Screen == 0
lab06[sf,]
## # A tibble: 2 × 19
##   Subject VisitDate           NotifDate           ScreenFailReas Screen Sex  
##     <dbl> <dttm>              <dttm>              <chr>           <dbl> <chr>
## 1      78 2018-09-17 00:00:00 2018-09-19 00:00:00 PREG.               0 F    
## 2     132 2018-10-18 00:00:00 2018-10-20 00:00:00 PREG.               0 F    
## # ℹ 13 more variables: ALB <dbl>, AlkPhos <dbl>, ALT <dbl>, AST <dbl>,
## #   DBili <dbl>, GGTP <dbl>, CGluc <dbl>, UGluc <dbl>, TBili <dbl>, Prot <dbl>,
## #   Hemoglob <dbl>, Hematocr <dbl>, Preg <dbl>

Example of filter

filter(lab06, Screen == 0) #filter provided by dplyr package
## # A tibble: 2 × 19
##   Subject VisitDate           NotifDate           ScreenFailReas Screen Sex  
##     <dbl> <dttm>              <dttm>              <chr>           <dbl> <chr>
## 1      78 2018-09-17 00:00:00 2018-09-19 00:00:00 PREG.               0 F    
## 2     132 2018-10-18 00:00:00 2018-10-20 00:00:00 PREG.               0 F    
## # ℹ 13 more variables: ALB <dbl>, AlkPhos <dbl>, ALT <dbl>, AST <dbl>,
## #   DBili <dbl>, GGTP <dbl>, CGluc <dbl>, UGluc <dbl>, TBili <dbl>, Prot <dbl>,
## #   Hemoglob <dbl>, Hematocr <dbl>, Preg <dbl>

Another Example of filter

  • Get to use those comparsion and Boolean operators!
filter(lab06, Screen == 0 | (AlkPhos > 110.0 & ALT > 28.0))
## # A tibble: 6 × 19
##   Subject VisitDate           NotifDate           ScreenFailReas Screen Sex  
##     <dbl> <dttm>              <dttm>              <chr>           <dbl> <chr>
## 1      13 2018-07-25 00:00:00 NA                  <NA>                1 F    
## 2      18 2018-08-02 00:00:00 NA                  <NA>                1 M    
## 3      46 2018-08-25 00:00:00 NA                  <NA>                1 M    
## 4      78 2018-09-17 00:00:00 2018-09-19 00:00:00 PREG.               0 F    
## 5      88 2018-09-11 00:00:00 NA                  <NA>                1 M    
## 6     132 2018-10-18 00:00:00 2018-10-20 00:00:00 PREG.               0 F    
## # ℹ 13 more variables: ALB <dbl>, AlkPhos <dbl>, ALT <dbl>, AST <dbl>,
## #   DBili <dbl>, GGTP <dbl>, CGluc <dbl>, UGluc <dbl>, TBili <dbl>, Prot <dbl>,
## #   Hemoglob <dbl>, Hematocr <dbl>, Preg <dbl>

slice vs. filter

  • filter is logical while slice is positional
slice(lab06, 12:16)
## # A tibble: 5 × 19
##   Subject VisitDate           NotifDate ScreenFailReas Screen Sex  
##     <dbl> <dttm>              <dttm>    <chr>           <dbl> <chr>
## 1      19 2018-08-02 00:00:00 NA        <NA>                1 M    
## 2      20 2018-07-29 00:00:00 NA        <NA>                1 M    
## 3      24 2018-08-02 00:00:00 NA        <NA>                1 M    
## 4      25 2018-08-03 00:00:00 NA        <NA>                1 M    
## 5      26 2018-07-27 00:00:00 NA        <NA>                1 F    
## # ℹ 13 more variables: ALB <dbl>, AlkPhos <dbl>, ALT <dbl>, AST <dbl>,
## #   DBili <dbl>, GGTP <dbl>, CGluc <dbl>, UGluc <dbl>, TBili <dbl>, Prot <dbl>,
## #   Hemoglob <dbl>, Hematocr <dbl>, Preg <dbl>

Slice Features

  • slice does have cool helper functions
## slice_head(lab06, n=5) #grab the top 5 observations
## slice_tail(lab06, n=10) #grab the bottom 10 observations
## slice_sample(lab06, n=3) #simple random sample of 3 rows
## slice_sample(lab06, prop=0.25) #sample 25% of the data set
slice_min(lab06, ALB) #also, slice_max
## # A tibble: 1 × 19
##   Subject VisitDate           NotifDate ScreenFailReas Screen Sex  
##     <dbl> <dttm>              <dttm>    <chr>           <dbl> <chr>
## 1      90 2018-09-12 00:00:00 NA        <NA>                1 F    
## # ℹ 13 more variables: ALB <dbl>, AlkPhos <dbl>, ALT <dbl>, AST <dbl>,
## #   DBili <dbl>, GGTP <dbl>, CGluc <dbl>, UGluc <dbl>, TBili <dbl>, Prot <dbl>,
## #   Hemoglob <dbl>, Hematocr <dbl>, Preg <dbl>

Arranging Rows

  • Arranging rows is as simple as using arrange
arrange(lab06,Screen,Sex,VisitDate)
## # A tibble: 72 × 19
##    Subject VisitDate           NotifDate           ScreenFailReas Screen Sex  
##      <dbl> <dttm>              <dttm>              <chr>           <dbl> <chr>
##  1      78 2018-09-17 00:00:00 2018-09-19 00:00:00 PREG.               0 F    
##  2     132 2018-10-18 00:00:00 2018-10-20 00:00:00 PREG.               0 F    
##  3      17 2018-07-24 00:00:00 NA                  <NA>                1 F    
##  4       8 2018-07-25 00:00:00 NA                  <NA>                1 F    
##  5      11 2018-07-25 00:00:00 NA                  <NA>                1 F    
##  6      13 2018-07-25 00:00:00 NA                  <NA>                1 F    
##  7      26 2018-07-27 00:00:00 NA                  <NA>                1 F    
##  8       2 2018-07-28 00:00:00 NA                  <NA>                1 F    
##  9      27 2018-08-03 00:00:00 NA                  <NA>                1 F    
## 10      30 2018-08-03 00:00:00 NA                  <NA>                1 F    
## # ℹ 62 more rows
## # ℹ 13 more variables: ALB <dbl>, AlkPhos <dbl>, ALT <dbl>, AST <dbl>,
## #   DBili <dbl>, GGTP <dbl>, CGluc <dbl>, UGluc <dbl>, TBili <dbl>, Prot <dbl>,
## #   Hemoglob <dbl>, Hematocr <dbl>, Preg <dbl>

Descending Order?

arrange(lab06,Screen,Sex,desc(VisitDate)) #Flip order of a variable with desc()
## # A tibble: 72 × 19
##    Subject VisitDate           NotifDate           ScreenFailReas Screen Sex  
##      <dbl> <dttm>              <dttm>              <chr>           <dbl> <chr>
##  1     132 2018-10-18 00:00:00 2018-10-20 00:00:00 PREG.               0 F    
##  2      78 2018-09-17 00:00:00 2018-09-19 00:00:00 PREG.               0 F    
##  3     126 2018-10-15 00:00:00 NA                  <NA>                1 F    
##  4     131 2018-10-15 00:00:00 NA                  <NA>                1 F    
##  5     122 2018-10-12 00:00:00 NA                  <NA>                1 F    
##  6     115 2018-10-07 00:00:00 NA                  <NA>                1 F    
##  7     114 2018-10-03 00:00:00 NA                  <NA>                1 F    
##  8     111 2018-09-30 00:00:00 NA                  <NA>                1 F    
##  9     103 2018-09-25 00:00:00 NA                  <NA>                1 F    
## 10     101 2018-09-24 00:00:00 NA                  <NA>                1 F    
## # ℹ 62 more rows
## # ℹ 13 more variables: ALB <dbl>, AlkPhos <dbl>, ALT <dbl>, AST <dbl>,
## #   DBili <dbl>, GGTP <dbl>, CGluc <dbl>, UGluc <dbl>, TBili <dbl>, Prot <dbl>,
## #   Hemoglob <dbl>, Hematocr <dbl>, Preg <dbl>

Combining Functions

  • Obviously, you might want to filter and arrange your data!

  • Typical approach is to nest functions

arrange(filter(lab06,Screen == 0 | (AlkPhos > 110.0 & ALT > 28.0)),Screen, Sex, desc(VisitDate))
## # A tibble: 6 × 19
##   Subject VisitDate           NotifDate           ScreenFailReas Screen Sex  
##     <dbl> <dttm>              <dttm>              <chr>           <dbl> <chr>
## 1     132 2018-10-18 00:00:00 2018-10-20 00:00:00 PREG.               0 F    
## 2      78 2018-09-17 00:00:00 2018-09-19 00:00:00 PREG.               0 F    
## 3      13 2018-07-25 00:00:00 NA                  <NA>                1 F    
## 4      88 2018-09-11 00:00:00 NA                  <NA>                1 M    
## 5      46 2018-08-25 00:00:00 NA                  <NA>                1 M    
## 6      18 2018-08-02 00:00:00 NA                  <NA>                1 M    
## # ℹ 13 more variables: ALB <dbl>, AlkPhos <dbl>, ALT <dbl>, AST <dbl>,
## #   DBili <dbl>, GGTP <dbl>, CGluc <dbl>, UGluc <dbl>, TBili <dbl>, Prot <dbl>,
## #   Hemoglob <dbl>, Hematocr <dbl>, Preg <dbl>

Combining Functions

  • Difficult for some to read, often space it this way

  • But… more functions ⇒???

arrange(
        filter(lab06,Screen == 0 | (AlkPhos > 110.0 & ALT > 28.0)),
        Screen, Sex, desc(VisitDate)
       )

Piping/Chaining

  • tidyverse offers ability to pipe or chain functions
  • Output from one function is sent as input into next function
    • Think links in a chain

Piping/Chaining

  • tidyverse offers ability to pipe or chain functions

  • Output from one function is sent as input into next function

    • Think links in a chain

#filter is your first link - we know it outputs a subset
filter(lab06,Screen == 0 | (AlkPhos > 110.0 & ALT > 28.0)) %>% #chain operator!
  arrange(Screen, Sex, desc(VisitDate))
  #arrange is your next link
  #note -- no first argument anymore, we start with the **2nd** argument

What About Columns?

select(lab06, Subject, VisitDate, CGluc, UGluc, Preg) #no chaining
lab06 %>% select(Subject, VisitDate, CGluc, UGluc, Preg) #same thing, with chaining
## # A tibble: 72 × 5
##    Subject VisitDate           CGluc UGluc  Preg
##      <dbl> <dttm>              <dbl> <dbl> <dbl>
##  1       1 2018-07-25 00:00:00 110       1     0
##  2       2 2018-07-28 00:00:00 102       0     0
##  3       6 2018-07-25 00:00:00  97.2     0     0
##  4       8 2018-07-25 00:00:00 113.      1     0
##  5       9 2018-08-02 00:00:00 106.      0     0
##  6      10 2018-07-22 00:00:00 103.      0     0
##  7      11 2018-07-25 00:00:00 103       0     0
##  8      13 2018-07-25 00:00:00 103.      0     0
##  9      16 2018-07-24 00:00:00 113.      1     0
## 10      17 2018-07-24 00:00:00 108.      0     0
## # ℹ 62 more rows

Helpful Functions for Selecting Columns

  • After lab06 %>% use:
select(starts_with("Screen")) #All variables starting with "Screen"
select(ends_with("date")) #All variables ending with "date"
select(starts_with("Screen") | ends_with("date")) #Either of the above
select(!(starts_with("Screen"))) #All variables that DO NOT start with "Screen"
select(contains('gluc')) #All variables contianing string 'gluc' in name
select(ALT:Prot) #All variables from ALT to Prot in data frame
select(where(is.numeric)) # Applies is.numeric to all variables and returns when TRUE

What Comes Next?

  • Reading data in is only half the battle
  • Still need to manipulate data sets in other ways:
    • Concatenate/stack records from different data sets
    • Merge/join records across data sets
    • Clean/tidy data
    • Derive new variables unconditionally
    • Derive new variables conditionally

Concatenation

  • Concatenation: vertically combining data sets by stacking records “on top” of each other
  • dplyr provides bind_rows to get it done
  • Important to consider certain metadata:
    • Names of columns
    • Data types of columns
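
A minimal sketch (invented toy data frames) of how those two metadata concerns play out with bind_rows:

```r
library(dplyr)

#Hypothetical toy data frames
a <- tibble(Subject = c(1, 2), ALB = c(4.4, 4.8))
b <- tibble(Subject = c(3, 4), ALT = c(20, 27))

bind_rows(a, b)      #4 rows; unmatched columns ALB/ALT are kept and filled with NA

d <- tibble(Subject = c("5", "6"), ALB = c(4.1, 4.5))  #Subject is character here!
try(bind_rows(a, d)) #errors: Subject is numeric in one set, character in the other
```

Mismatched names are tolerated (with NAs); mismatched types for a shared column are not.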

Example of Concatenation

#I want to use small subsets so we can see what is happening!
set1 <- lab00 %>% slice(1:3) #first 3 rows of lab00
set2 <- lab03 %>% slice(1:3) #first 3 rows of lab03
bind_rows(set1, set2) #3 from lab00 + 3 from lab03 = 6 vertically combined records
## # A tibble: 6 × 19
##   Subject VisitDate           NotifDate ScreenFailReas Screen Sex  
##     <dbl> <dttm>              <dttm>    <chr>           <dbl> <chr>
## 1       1 2018-01-22 00:00:00 NA        <NA>                1 M    
## 2       2 2018-01-23 00:00:00 NA        <NA>                1 F    
## 3       4 2018-01-20 00:00:00 NA        <NA>                1 M    
## 4       1 2018-04-24 00:00:00 NA        <NA>                1 M    
## 5       2 2018-04-26 00:00:00 NA        <NA>                1 F    
## 6       4 2018-04-19 00:00:00 NA        <NA>                1 M    
## # ℹ 13 more variables: ALB <dbl>, AlkPhos <dbl>, ALT <dbl>, AST <dbl>,
## #   DBili <dbl>, GGTP <dbl>, CGluc <dbl>, UGluc <dbl>, TBili <dbl>, Prot <dbl>,
## #   Hemoglob <dbl>, Hematocr <dbl>, Preg <dbl>

Concatenating Mismatched Columns

set1 <- lab00 %>% select(1:2) %>% slice(1:3) #first 2 columns and 3 rows of lab00
set2 <- lab03 %>% select(c(5,7)) %>% slice(1:3) #first 3 rows of columns 5 & 7 from lab03
bind_rows(set1, set2) # six mismatched records
## # A tibble: 6 × 4
##   Subject VisitDate           Screen   ALB
##     <dbl> <dttm>               <dbl> <dbl>
## 1       1 2018-01-22 00:00:00     NA  NA  
## 2       2 2018-01-23 00:00:00     NA  NA  
## 3       4 2018-01-20 00:00:00     NA  NA  
## 4      NA NA                       1   4.4
## 5      NA NA                       1   4.7
## 6      NA NA                       1   4.2

Combining Horizontally: Merges

  • Merges also called match-merges or joins
  • Concatenations are simple operations: layer records from data frame 1 on top of data frame 2
  • Merges are more complicated because we need rules for combining records
  • Only appropriate when records can be linked
    • I.e., record in Table A and record in Table B combine to form single record
  • Lots of flavors of join: inner, left, right, full, semi, anti

Building Toy Data Sets

  • Easier to visualize with small data sets
#rows 1 - 5 of Subject, CGluc, UGluc in Lab00
tableA <- lab00 %>% slice(1:5) %>% select(Subject|contains("gluc"))
#rows 3 - 7 of Subject, SBP, DBP in Vis00
tableB <- vis00 %>% slice(3:7) %>% select(Subject|ends_with("bp")) 
Subject CGluc UGluc
1 106 0
2 101 0
4 102 0
6 101 0
8 116 1
Subject SBP DBP
3 100 75
4 106 89
5 144 115
6 78 49
7 132 92

Inner Join - Implicit Key Variables

Table A
Subject CGluc UGluc
1 106 0
2 101 0
4 102 0
6 101 0
8 116 1
Table B
Subject SBP DBP
3 100 75
4 106 89
5 144 115
6 78 49
7 132 92
inner_join(tableA, tableB)
Join
Subject CGluc UGluc SBP DBP
4 102 0 106 89
6 101 0 78 49
  • How did R decide which A and B records to combine?

Inner Join - Explicit Key Variables

Table A
Subject CGluc UGluc
1 106 0
2 101 0
4 102 0
6 101 0
8 116 1
Table B
Subject SBP DBP
3 100 75
4 106 89
5 144 115
6 78 49
7 132 92
inner_join(tableA, tableB, 
           by = "Subject")
Join
Subject CGluc UGluc SBP DBP
4 102 0 106 89
6 101 0 78 49
  • What records are included in an inner join?

Joining by Differently-Named Variables

tableC <- tableB %>% rename(Subj = Subject)
Subject CGluc UGluc
1 106 0
2 101 0
4 102 0
6 101 0
8 116 1
Subj SBP DBP
3 100 75
4 106 89
5 144 115
6 78 49
7 132 92

Joining by Differently-Named Variables

Table A
Subject CGluc UGluc
1 106 0
2 101 0
4 102 0
6 101 0
8 116 1
Table C
Subj SBP DBP
3 100 75
4 106 89
5 144 115
6 78 49
7 132 92
inner_join(tableA, tableC, 
           by = c("Subject"="Subj"))
Join
Subject CGluc UGluc SBP DBP
4 102 0 106 89
6 101 0 78 49

Left Join

Table A
Subject CGluc UGluc
1 106 0
2 101 0
4 102 0
6 101 0
8 116 1
Table B
Subject SBP DBP
3 100 75
4 106 89
5 144 115
6 78 49
7 132 92
left_join(tableA, tableB, 
           by = "Subject")
Join
Subject CGluc UGluc SBP DBP
1 106 0 NA NA
2 101 0 NA NA
4 102 0 106 89
6 101 0 78 49
8 116 1 NA NA
  • What rule creates the row space now? Column space?

Right Join

Table A
Subject CGluc UGluc
1 106 0
2 101 0
4 102 0
6 101 0
8 116 1
Table B
Subject SBP DBP
3 100 75
4 106 89
5 144 115
6 78 49
7 132 92
right_join(tableA, tableB, 
           by = "Subject")
Join
Subject CGluc UGluc SBP DBP
4 102 0 106 89
6 101 0 78 49
3 NA NA 100 75
5 NA NA 144 115
7 NA NA 132 92
  • Row space? Column space?

(Full) Outer Join

Table A
Subject CGluc UGluc
1 106 0
2 101 0
4 102 0
6 101 0
8 116 1
Table B
Subject SBP DBP
3 100 75
4 106 89
5 144 115
6 78 49
7 132 92
  • Row space? Column space?
full_join(tableA, tableB, 
           by = "Subject")
Join
Subject CGluc UGluc SBP DBP
1 106 0 NA NA
2 101 0 NA NA
4 102 0 106 89
6 101 0 78 49
8 116 1 NA NA
3 NA NA 100 75
5 NA NA 144 115
7 NA NA 132 92

Semi Join

Table A
Subject CGluc UGluc
1 106 0
2 101 0
4 102 0
6 101 0
8 116 1
Table B
Subject SBP DBP
3 100 75
4 106 89
5 144 115
6 78 49
7 132 92
semi_join(tableA, tableB, 
           by = "Subject")
Join
Subject CGluc UGluc
4 102 0
6 101 0
  • Row space? Column space?

Anti Join

Table A
Subject CGluc UGluc
1 106 0
2 101 0
4 102 0
6 101 0
8 116 1
Table B
Subject SBP DBP
3 100 75
4 106 89
5 144 115
6 78 49
7 132 92
anti_join(tableA, tableB, 
           by = "Subject")
Join
Subject CGluc UGluc
1 106 0
2 101 0
8 116 1
  • Row space? Column space?

Closing Comments on Joining Data

  • Only seen how to do a one-to-one join

    • Critical to use enough by = variables to get a unique record per table!
  • One-to-many joins are common and important

  • Many-to-many joins should rarely be done!

  • R never coerces types – key variables must have the same type in both tables!
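
To see why unique keys matter, here is a toy sketch (invented tables) of a one-to-many left join: the single demographics row for a subject is repeated once per matching visit row.

```r
library(dplyr)

#Hypothetical toy tables: one demographics row per subject, many visit rows
demog  <- tibble(Subject = c(1, 2), Sex = c("M", "F"))
visits <- tibble(Subject = c(1, 1, 2), SBP = c(120, 118, 131))

left_join(demog, visits, by = "Subject")
#Subject 1 matches two visit rows, so its Sex value is repeated -> 3 rows total
```

If both tables had duplicate keys, every copy would match every copy (a many-to-many join), which is rarely what you want.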

Reshaping: Starting Data Set

  • Transposing/reshaping/restructuring data is an important tool

  • Let’s build a data set to use and look at two ways to restructure data:

    • Wide to tall
    • Tall to wide
tableC <- lab00 %>% select(Subject, ALB:DBili)
slice_head(tableC, n=2) 
## # A tibble: 2 × 6
##   Subject   ALB AlkPhos   ALT   AST DBili
##     <dbl> <dbl>   <dbl> <dbl> <dbl> <dbl>
## 1       1   4.4      80    20    22  0.11
## 2       2   4.8      27    27    30  0.1

Reshape Wide to Tall

  • You may see gather in older code

  • pivot_longer is its updated replacement and handles more scenarios

taller <- tableC %>% pivot_longer(!Subject, #old columns to be reshaped
                                  names_to = "lbtest", #new column for old column names
                                  values_to = "result", #new column for old column values
                                 )
slice_head(taller, n=10) 
## # A tibble: 10 × 3
##    Subject lbtest  result
##      <dbl> <chr>    <dbl>
##  1       1 ALB       4.4 
##  2       1 AlkPhos  80   
##  3       1 ALT      20   
##  4       1 AST      22   
##  5       1 DBili     0.11
##  6       2 ALB       4.8 
##  7       2 AlkPhos  27   
##  8       2 ALT      27   
##  9       2 AST      30   
## 10       2 DBili     0.1

Reshape: Tall to Wide

  • spread is the opposite of gather

  • pivot_wider is the opposite of pivot_longer

wider <- taller %>% pivot_wider(names_from = lbtest, #what should new column names be?
                                values_from = result, #what should new column values be?
                               )
slice_head(wider, n=2) 
## # A tibble: 2 × 6
##   Subject   ALB AlkPhos   ALT   AST DBili
##     <dbl> <dbl>   <dbl> <dbl> <dbl> <dbl>
## 1       1   4.4      80    20    22  0.11
## 2       2   4.8      27    27    30  0.1

Activity #2

  1. Concatenate lab00 through lab06 into a single data frame named lab

  2. Concatenate vis00 through vis06 into a single data frame named vis

  3. Combine the Lab and Visit data frames under the following conditions. In each case, ensure you do not duplicate any variables!

    • Version 1: Only patients with Visit data and Lab data are important

    • Version 2: All patients are important!

    • Version 3: Want Visit and Lab data for any patients who have Lab data

    • Version 4: We want a subset of the Lab data set based on whether the patient is not in the Visit data set

  4. Advanced: Suppose you concatenate the results of semi_join(A,B) and anti_join(A,B). How does the resulting table compare to A and B?

Outline

  • R packages
  • Reading and writing data files (readr package)
  • Subsetting data by rows, columns, or both (dplyr package)
  • Deriving new variables unconditionally and conditionally (dplyr package)
  • Introduction to summarizing data

Creating a New Variable Unconditionally

Month <- 3 #Creating a new variable is easy :)
  • But how do we attach it to a data set, say, Lab03?
lab03 %>% mutate(month = 3)
## # A tibble: 75 × 20
##    Subject VisitDate           NotifDate ScreenFailReas Screen Sex  
##      <dbl> <dttm>              <dttm>    <chr>           <dbl> <chr>
##  1       1 2018-04-24 00:00:00 NA        <NA>                1 M    
##  2       2 2018-04-26 00:00:00 NA        <NA>                1 F    
##  3       4 2018-04-19 00:00:00 NA        <NA>                1 M    
##  4       6 2018-04-26 00:00:00 NA        <NA>                1 M    
##  5       8 2018-04-25 00:00:00 NA        <NA>                1 F    
##  6       9 2018-04-30 00:00:00 NA        <NA>                1 M    
##  7      10 2018-04-22 00:00:00 NA        <NA>                1 M    
##  8      11 2018-04-27 00:00:00 NA        <NA>                1 F    
##  9      13 2018-04-27 00:00:00 NA        <NA>                1 F    
## 10      16 2018-04-26 00:00:00 NA        <NA>                1 M    
## # ℹ 65 more rows
## # ℹ 14 more variables: ALB <dbl>, AlkPhos <dbl>, ALT <dbl>, AST <dbl>,
## #   DBili <dbl>, GGTP <dbl>, CGluc <dbl>, UGluc <dbl>, TBili <dbl>, Prot <dbl>,
## #   Hemoglob <dbl>, Hematocr <dbl>, Preg <dbl>, month <dbl>

How Do We Confirm?

#I *really* encourage you to avoid overwriting objects!
lab03v2 <- lab03 %>% mutate(month = 3) 
lab03v2 %>% select(Subject, month)
## # A tibble: 75 × 2
##    Subject month
##      <dbl> <dbl>
##  1       1     3
##  2       2     3
##  3       4     3
##  4       6     3
##  5       8     3
##  6       9     3
##  7      10     3
##  8      11     3
##  9      13     3
## 10      16     3
## # ℹ 65 more rows

New Variable from Other Variables?

lab03v3 <- lab03 %>% mutate(CGratio = Hematocr/Hemoglob, 
                            CGRatioDefn = "Hematocrit/Hemoglobin") 
lab03v3 %>% select(Subject, starts_with("CGrat"))
## # A tibble: 75 × 3
##    Subject CGratio CGRatioDefn          
##      <dbl>   <dbl> <chr>                
##  1       1    3.33 Hematocrit/Hemoglobin
##  2       2    2.92 Hematocrit/Hemoglobin
##  3       4    2.77 Hematocrit/Hemoglobin
##  4       6    3.18 Hematocrit/Hemoglobin
##  5       8    2.71 Hematocrit/Hemoglobin
##  6       9    2.98 Hematocrit/Hemoglobin
##  7      10    2.5  Hematocrit/Hemoglobin
##  8      11    3.44 Hematocrit/Hemoglobin
##  9      13    3.08 Hematocrit/Hemoglobin
## 10      16    3.15 Hematocrit/Hemoglobin
## # ℹ 65 more rows

Only Want to See New Variable?

lab03 %>% transmute(CGratio = Hematocr/Hemoglob, CGRatioDefn = "Hematocrit/Hemoglobin") 
## # A tibble: 75 × 2
##    CGratio CGRatioDefn          
##      <dbl> <chr>                
##  1    3.33 Hematocrit/Hemoglobin
##  2    2.92 Hematocrit/Hemoglobin
##  3    2.77 Hematocrit/Hemoglobin
##  4    3.18 Hematocrit/Hemoglobin
##  5    2.71 Hematocrit/Hemoglobin
##  6    2.98 Hematocrit/Hemoglobin
##  7    2.5  Hematocrit/Hemoglobin
##  8    3.44 Hematocrit/Hemoglobin
##  9    3.08 Hematocrit/Hemoglobin
## 10    3.15 Hematocrit/Hemoglobin
## # ℹ 65 more rows

Quick Recap

  • mutate creates new columns AND attaches them to the source data frame
    • Useful for extending current data frame
  • transmute turns old columns into new columns
    • Useful for standalone variables to add to a different data frame

Introducing Conditional Logic

  • Sometimes we need to create a variable based on certain conditions

  • Basic concept: if-then/else logic

if (if-condition) {
    action-statement
}
  • if-condition is the logical condition applied to the record
  • action-statement is what you want R to do when if-condition is TRUE
  • Many languages use keywords if and then just like English syntax

if-then Example

name <- "Jonathan"
if (name == "Jonathan") {
    fuzzy <- "Yes"
}
fuzzy
## [1] "Yes"
  • Of course, commonly have multiple conditions

if-then/else Examples

name <- "Fozzy Bear"
if (name == "Jonathan") {
    fuzzy <- "Yes"
} else {
  fuzzy <- "Unknown"
}
fuzzy
## [1] "Unknown"
name <- "Fozzy Bear"
if (name == "Jonathan") {
    fuzzy <- "Yes"
} else if (name == "Fozzy Bear") {
  fuzzy <- "Fozzy, not fuzzy"
} else { fuzzy <- "Unknown"
}
fuzzy
## [1] "Fozzy, not fuzzy"

Cascading Conditions

name <- "Fozzy Bear"
if ((name == "Jonathan") | (str_detect(name,"Fozzy"))) {
    fuzzy <- "Yes"
} else if (name == "Fozzy Bear") {
  fuzzy <- "Fozzy, not fuzzy"
} else { fuzzy <- "Unknown"
}
fuzzy
## [1] "Yes"
  • str_detect searches name for the string “Fozzy”
  • It finds it, so FALSE | TRUE = TRUE
  • Meeting the first condition → don’t check other conditions

More Meaningful Example?

if (lab00$ALB < 4.0 | lab00$ALB > 5.4){
  ScreenFlag <- "ALB"
}
## Error in if (lab00$ALB < 4 | lab00$ALB > 5.4) {: the condition has length > 1
  • Sadly, basic version of if-then cannot handle more than one value
  • That means no vectors or data frames!
  • We need an approach that can handle vectors!

Vectorized Functions

  • Useful functions tend to be vectorized
    • Means they can operate on a vector
  • Remember: data frame is just a bound collection of vectors!
  • Functions like mean and median accept vectors

Vectorized IF-THEN

#Vectorized if-then is done with the ifelse function; its syntax (pseudocode):
#ifelse(condition-to-apply-to-vector, action-if-true, action-if-false)
ifelse(lab00$ALB < 4.0 | lab00$ALB > 5.4, "HighLow", "Normal")
##  [1] "Normal"  "Normal"  "Normal"  "Normal"  "Normal"  "Normal"  "Normal" 
##  [8] "Normal"  "HighLow" "Normal"  "HighLow" "HighLow" "HighLow" "Normal" 
## [15] "Normal"  "HighLow" "HighLow" "Normal"  "Normal"  "Normal"  "Normal" 
## [22] "Normal"  "Normal"  "Normal"  "Normal"  "HighLow" "Normal"  "Normal" 
## [29] "Normal"  "Normal"  "Normal"  "HighLow" "Normal"  "Normal"  "Normal" 
## [36] "Normal"  "Normal"  "Normal"  "Normal"  "Normal"  "Normal"  "HighLow"
## [43] "Normal"  "HighLow" "Normal"  "Normal"  "Normal"  "Normal"  "Normal" 
## [50] "Normal"  "Normal"  "Normal"  "HighLow" "HighLow" "Normal"  "Normal" 
## [57] "Normal"  "Normal"  "HighLow" "HighLow" "Normal"  "Normal"  "Normal" 
## [64] "Normal"  "Normal"  "Normal"  "HighLow" "HighLow" "Normal"  "Normal" 
## [71] "Normal"  "HighLow" "Normal"  "Normal"  "Normal"  "Normal"  "Normal"

Nesting ifelse

#Can still have multiple conditions
ifelse(lab00$ALB < 4.0, "Low", 
       ifelse(lab00$ALB > 5.4, "High", 
              "Normal"))
##  [1] "Normal" "Normal" "Normal" "Normal" "Normal" "Normal" "Normal" "Normal"
##  [9] "Low"    "Normal" "Low"    "Low"    "Low"    "Normal" "Normal" "Low"   
## [17] "Low"    "Normal" "Normal" "Normal" "Normal" "Normal" "Normal" "Normal"
## [25] "Normal" "Low"    "Normal" "Normal" "Normal" "Normal" "Normal" "Low"   
## [33] "Normal" "Normal" "Normal" "Normal" "Normal" "Normal" "Normal" "Normal"
## [41] "Normal" "Low"    "Normal" "Low"    "Normal" "Normal" "Normal" "Normal"
## [49] "Normal" "Normal" "Normal" "Normal" "Low"    "High"   "Normal" "Normal"
## [57] "Normal" "Normal" "Low"    "Low"    "Normal" "Normal" "Normal" "Normal"
## [65] "Normal" "Normal" "Low"    "Low"    "Normal" "Normal" "Normal" "Low"   
## [73] "Normal" "Normal" "Normal" "Normal" "Normal"

Syntax Improvement

#chain and transmute or mutate as needed
lab00 %>% transmute(AlbFlag = ifelse(ALB < 4.0, "Low", 
                                ifelse(ALB > 5.4, "High", 
                                       "Normal")))
## # A tibble: 77 × 1
##    AlbFlag
##    <chr>  
##  1 Normal 
##  2 Normal 
##  3 Normal 
##  4 Normal 
##  5 Normal 
##  6 Normal 
##  7 Normal 
##  8 Normal 
##  9 Low    
## 10 Normal 
## # ℹ 67 more rows
  • Leaving “Normal” as my catchall is probably not a good idea…

GPP Improvement

#chain and transmute or mutate as needed
lab00 %>% transmute(AlbFlag = ifelse(ALB < 4.0, "Low", 
                                ifelse(ALB > 5.4, "High", 
                                  ifelse(ALB >= 4.0 & ALB <= 5.4, "Normal", 
                                         "INVALID"))))
## # A tibble: 77 × 1
##    AlbFlag
##    <chr>  
##  1 Normal 
##  2 Normal 
##  3 Normal 
##  4 Normal 
##  5 Normal 
##  6 Normal 
##  7 Normal 
##  8 Normal 
##  9 Low    
## 10 Normal 
## # ℹ 67 more rows
  • With an explicit “Normal” range and an “INVALID” catchall, unexpected values can no longer masquerade as “Normal”
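
As an aside, deeply nested ifelse calls get hard to read; dplyr’s case_when expresses the same cascading logic as a top-to-bottom list of condition ~ value pairs (a sketch with invented ALB values, using the cutoffs from the slides):

```r
library(dplyr)

alb <- c(3.5, 4.4, 5.6, NA)  #invented toy ALB values
case_when(alb < 4.0               ~ "Low",
          alb > 5.4               ~ "High",
          alb >= 4.0 & alb <= 5.4 ~ "Normal",
          TRUE                    ~ "INVALID")  #catchall, including NA
## [1] "Low"     "Normal"  "High"    "INVALID"
```

The first condition that evaluates TRUE wins, just like a cascade of if/else if blocks.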

Recap

  • Create variables unconditionally as constant or from other variables

  • Attach with mutate or view on their own with transmute

  • Vectorized functions accept vectors as arguments

  • Vectorized logic with ifelse

  • Caution: Avoid a “miscellaneous” category also catching one of your expected categories!

  • Combining logic with filter, arrange, slice, *_join, bind_rows allows for lots of data manipulation

Activity #3

  1. Create three date constants using the following code:
     cut00 <- as.Date("2018-04-19") #Approximate last date of Baseline Visits
     cut03 <- as.Date("2018-07-21") #Approximate last date of 3 Month Visits
     cut06 <- as.Date("2018-10-21") #Approximate last date of 6 Month Visits
  2. Update the vis data frame from Activity #2 to include the variable VisitNum. If a record has a visit date prior to the cutoff date for those visits, assign the appropriate value (1,2,3, etc.).
    • E.g., records with VisitDate = 2018-10-20 should get VisitNum = 6
    • Use NA for records with dates not covered by these three cutoff dates
  3. Advanced: View the vis00, vis03, vis06 data frames and focus on dates near the cutoff. Did we do a good job creating our VisitNum variable? What should be done instead? If you want, update your code to derive the VisitNum variable more accurately.

Outline

  • R packages
  • Reading and writing data files (readr package)
  • Subsetting data by rows, columns, or both (dplyr package)
  • Deriving new variables unconditionally and conditionally (dplyr package)
  • Introduction to summarizing data

Summarizing Data with tidyverse Functions

  • summarise applies a summary function to your data

  • group_by collects data into groups for separate analyses

    • Undone by using ungroup

Basic Summary Statistics

#Basic syntax of summarise (pseudocode):
#summarise(NewVarName = summary-function(<variable>))
lab03 %>% summarise(meanALB = mean(ALB), medianALB = median(ALB), count = n())
## # A tibble: 1 × 3
##   meanALB medianALB count
##     <dbl>     <dbl> <int>
## 1    4.39       4.4    75

Grouping Basic Statistics

#Note: using lab00, lab03, and lab06
lab <- lab00 %>% bind_rows(lab03) %>% bind_rows(lab06) #From Activity #2
lab %>% group_by(Subject) %>% summarise(meanALB = mean(ALB), 
                                        medianALB = median(ALB), 
                                        count = n())
## # A tibble: 77 × 4
##    Subject meanALB medianALB count
##      <dbl>   <dbl>     <dbl> <int>
##  1       1    4.4       4.4      3
##  2       2    4.7       4.7      3
##  3       4    4.25      4.25     2
##  4       6    4.4       4.4      3
##  5       8    4.1       4.1      3
##  6       9    4.37      4.4      3
##  7      10    5         5        3
##  8      11    4.03      4        3
##  9      13    3.53      3.5      3
## 10      16    4.63      4.6      3
## # ℹ 67 more rows

Grouping Better Statistics

lab %>% group_by(Sex) %>% summarise(count = n(), 
                                    p05 = quantile(ALB, 0.05), p15 = quantile(ALB, 0.15), 
                                    p25 = quantile(ALB, 0.25), p50 = quantile(ALB, 0.50), 
                                    p75 = quantile(ALB, 0.75), p85 = quantile(ALB, 0.85),
                                    p95 = quantile(ALB, 0.95), IQR = p75-p25)
## # A tibble: 2 × 10
##   Sex   count   p05   p15   p25   p50   p75   p85   p95   IQR
##   <chr> <int> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 F       102  3.40  3.83   4.1   4.6  4.8   4.9   5.09 0.7  
## 2 M       122  3.4   3.6    4.1   4.4  4.77  4.98  5.2  0.675

Expanding Summary Techniques

  • Gets tedious for lots of variables and/or statistics

  • Applying to a group of variables? → use apply family of functions

  • Three choices: apply, lapply, sapply

Using apply

#Example syntax (pseudocode); MARGIN = 1 summarizes rows, MARGIN = 2 summarizes columns
#apply(X = columns-of-interest, MARGIN = row-or-column-choice, 
#      FUN = function-of-interest, na.rm = your-choice)
apply(select(lab,ALB:CGluc), MARGIN = 2, FUN = summary, na.rm = TRUE)
##              ALB  AlkPhos      ALT      AST DBili     GGTP    CGluc
## Min.    2.900000  22.8000  2.00000  7.70000 -0.01 -0.30000  95.7000
## 1st Qu. 4.100000  62.5500 16.00000 18.17500  0.11 19.00000 101.4750
## Median  4.500000  84.1500 21.05000 22.85000  0.15 28.00000 105.0000
## Mean    4.391518  85.5433 21.30625 22.63661  0.16 27.79509 105.3312
## 3rd Qu. 4.800000 108.0000 27.00000 27.15000  0.22 37.12500 109.0000
## Max.    6.000000 149.3000 39.50000 37.40000  0.41 49.50000 118.6000

Effect of MARGIN=

apply(select(lab[1:5,],ALB:CGluc), MARGIN = 1, FUN = summary, na.rm = TRUE)
##              [,1]  [,2]      [,3]      [,4]  [,5]
## Min.      0.11000   0.1   0.21000   0.25000   0.3
## 1st Qu.  12.20000   7.9   6.15000  12.20000   5.6
## Median   22.00000  27.0  19.00000  24.00000  20.0
## Mean     39.50143  28.7  35.21571  41.66429  34.5
## 3rd Qu.  62.00000  28.5  56.50000  63.00000  47.0
## Max.    106.00000 101.0 102.00000 117.00000 116.0
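
For column-wise work, lapply and sapply can be simpler than apply: a data frame is just a list of columns, so there is no MARGIN to specify (a sketch with invented toy data, base R only):

```r
#A data frame is a list of columns, so sapply/lapply map a function over
#the columns directly -- no MARGIN argument (invented toy data)
toy <- data.frame(ALB = c(4.4, 4.8, 2.9), ALT = c(20, 27, 16))

sapply(toy, mean)  #simplifies the result to a named numeric vector
lapply(toy, mean)  #same values, but returned as a list
```

Rule of thumb: sapply when you want a vector or matrix back, lapply when you want a list.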

On To Bivariate Statistics!

Pearson Correlation!

lab %>% select(ALB:AST) %>% cor()
##                  ALB    AlkPhos        ALT          AST
## ALB      1.000000000 0.12051637 0.07624307 -0.004772245
## AlkPhos  0.120516369 1.00000000 0.02127493  0.136677803
## ALT      0.076243075 0.02127493 1.00000000  0.025933663
## AST     -0.004772245 0.13667780 0.02593366  1.000000000

Other Correlation Methods

lab %>% select(ALB:AST) %>% cor(method = "spearman")
##                ALB      AlkPhos          ALT          AST
## ALB     1.00000000  0.120722030  0.048230339  0.065368590
## AlkPhos 0.12072203  1.000000000 -0.004754992  0.172736863
## ALT     0.04823034 -0.004754992  1.000000000 -0.004662076
## AST     0.06536859  0.172736863 -0.004662076  1.000000000

P-values!

  • What kind of course would this be if we didn’t see at least one p-value?!
install.packages("Hmisc")
library(Hmisc)
  • This gives access to rcorr function which computes correlations and p-values

Example of rcorr

lab %>% select(ALB:AST) %>% as.matrix() %>% rcorr()
##          ALB AlkPhos  ALT  AST
## ALB     1.00    0.12 0.08 0.00
## AlkPhos 0.12    1.00 0.02 0.14
## ALT     0.08    0.02 1.00 0.03
## AST     0.00    0.14 0.03 1.00
## 
## n= 224 
## 
## 
## P
##         ALB    AlkPhos ALT    AST   
## ALB            0.0718  0.2558 0.9434
## AlkPhos 0.0718         0.7515 0.0410
## ALT     0.2558 0.7515         0.6995
## AST     0.9434 0.0410  0.6995

Recap

  • More math and stats functions than you can shake a stick at

  • Approach: know what analysis you should do, find the function that does that analysis, read the help page, implement the function

    • tidyverse is wonderful but it doesn’t have everything (yet!)
    • apply family good for working with multiple variables
    • Other packages good for more specific analysis
  • Want to do more? Need to expand your skills!

    • for and while loops allow you to carry out repetitive actions
    • Vectorized functions are better
    • Not too challenging to write your own function after you’ve programmed for a while
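
As a taste of writing your own function, here is a sketch (base R; the name five_stats is invented) that bundles several of the summary statistics used above into one reusable helper:

```r
#Sketch of a small user-written helper: returns five summary statistics
#for a numeric vector as a named vector
five_stats <- function(x, na.rm = TRUE) {
  c(mean   = mean(x, na.rm = na.rm),
    sd     = sd(x, na.rm = na.rm),
    median = median(x, na.rm = na.rm),
    p25    = unname(quantile(x, 0.25, na.rm = na.rm)),
    p75    = unname(quantile(x, 0.75, na.rm = na.rm)))
}

five_stats(c(4.4, 4.8, 2.9, 5.0))
```

Once defined, it drops straight into sapply (e.g., sapply(toy, five_stats)) or into a summarise call.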

Activity #4

  1. Use tidyverse functions to get the overall mean, standard deviation (which is square root of variance), median, 25th, and 75th percentile for SBP from the vis data frame
  2. Repeat the previous analyses, but summarized separately for each Subject
  3. Obtain the Spearman rank correlations for SBP, DBP, Pulse, and Temp from the vis data frame
  4. Advanced: It may be important to look at SBP broken down by DBP. Create a variable to indicate whether a record has a DBP that is Normal (below 80), Stage 1 (80-89), Stage 2 (90-120), or Crisis (above 120). Obtain the mean, standard deviation, median, 25th percentile, and 75th percentile after being grouped by DBP classification.

The End

  • Thank you!

  • Questions?