August 8, 2023

Daily Schedule

  • 9:30 - 10:35 Morning Session #1
  • 10:35 - 10:50 Morning Break
  • 10:50 - 11:55 Morning Session #2
  • 11:55 - 1:10 Lunch
  • 1:10 - 2:15 Afternoon Session #1
  • 2:15 - 2:30 Afternoon Break
  • 2:30 - 3:35 Afternoon Session #2
  • 3:35 - 3:40 Feedback/Q&A

Outline

  • R packages
  • Reading and writing data files (readr package)
  • Subsetting data by rows, columns, or both (dplyr package)
  • Deriving new variables unconditionally and conditionally (dplyr package)
  • Introduction to summarizing data

R Packages

  • Packages = bundle of features/functions

  • Download from CRAN

  • Careful: no central group → no constant syntax + no constant quality

  • Two actions with every new package:
  • Install: Downloads package to your computer. Only needed once.
    • Like local library getting a new book
  • Load: Allow current R session to access a package. Needed each time you want to use it.
    • Like checking that book out from your local library

Installing a Package

  • Use install.packages("name-of-package-goes-here") to download a package

  • Use library(name-of-package-goes-here) to use a package in your current session

  • You’ll see progress and messages in your Console
  • Common info:
    • Locations (red): where on your computer is package going and from where is it being downloaded
    • Size (red): Size of the download and how much is downloaded to your computer
    • Message (black): package “unpacked” and “sums checked”
    • Conflicts: Functions that have the same name as other functions

Aside: Comments on Packages

  • Do not blindly update packages

  • R does not ensure backwards compatibility

  • Updates can cause your program to stop working until you update with new code

  • Big/important project? → Document which version of packages you used

    • This way you can roll back an update to the right package if you need to!

Tidyverse

  • Collection of packages all designed to work together
    • (fairly) consistent syntax and solid quality!
install.packages("tidyverse") #Order your book

library(tidyverse) #Check out your book
  • If you plan to follow along, run that code!

Tidyverse

  • Collection of packages all designed to work together
    • (fairly) consistent syntax and solid quality!
install.packages("tidyverse") #Order your book

library(tidyverse) #Check out your book
  • If you plan to follow along, run that code!

  • Should see the following about conflicts

What’s in the tidyverse?

  • tibble - better data frames

  • readr - easier to read in external data

  • tidyr and dplyr - easier to clean up (aka tidy!) your data

  • Other packages are useful, but not for this intro course

tibble

  • Tibble is a table that has been “tidyverse-d”

  • Essentially, nicer printing properties + forcing better data management habits

  • Data created by tidyverse is tibble by default

  • For OG data frames, use as_tibble

growth <- as_tibble(ToothGrowth)
growth
## # A tibble: 60 × 3
##      len supp   dose
##    <dbl> <fct> <dbl>
##  1   4.2 VC      0.5
##  2  11.5 VC      0.5
##  3   7.3 VC      0.5
##  4   5.8 VC      0.5
##  5   6.4 VC      0.5
##  6  10   VC      0.5
##  7  11.2 VC      0.5
##  8  11.2 VC      0.5
##  9   5.2 VC      0.5
## 10   7   VC      0.5
## # ℹ 50 more rows

Tibble Wasn’t a Data Type!

  • Concerned we only learned about atomic vector, matrix, data frame, and list?

  • Those are the fundamental data types!

is.data.frame(growth)
## [1] TRUE
is_tibble(growth) #tidyverse functions tend to use _ and not .
## [1] TRUE

Tibble Wasn’t a Data Type!

  • Concerned we only learned about atomic vector, matrix, data frame, and list?

  • Those are the fundamental data types!

is.data.frame(growth)
## [1] TRUE
is_tibble(growth) #tidyverse functions tend to use _ and not .
## [1] TRUE
  • Think of a tibble like it’s an Instagram filter.

    • Still just a data frame underneath, but fancier!

Accessing External Data

  • Data can come to you in (almost) infinitely many different ways!
  • External, fully-structured data: SAS, SPSS, Stata, SQL database, MS Access
  • External, semi-structured data: MS Excel, HTML/XML, even an API
  • External, unstructured data: delimited, fixed-field, combo platter
  • Less structure ⇒ more programming work
  • Structured data doesn’t magically happen; programmers make it happen!

Reading Data with readr

  • Some base functions for reading raw data, but we’ll focus on tidyverse functions

  • read_csv: files delimited by a comma

  • read_tsv: files delimited by a tab

  • read_delim: files delimited by any character (including comma & tab)

  • read_fwf: files with fixed-width fields

  • read_table: files “structured” like a table using spaces (not tabs!)

  • Important Note: tabs and spaces are NOT the same thing!

Reading External Data – Adding Context!

  • Pulling from a case study on a clinical trial.
  • Multiple files for Site 1: Lab and Visit data at Baseline and 3, 6, and 9 months.
  • Each file helps us learn new functions/features of tidyverse
  • To read raw data, you need a good text editor
    • I strongly recommend Visual Studio Code
    • Default editor on your computer is usually insufficient
  • Data available here if you have not already downloaded it
    • Make sure you know where you download it!
    • Strongly suggest you put it and your code in your working directory

Example 1 - Site 1, Baseline Visit.CSV

  • Always look at data first!

  • What do we notice?

Example 1 - Site 1, Baseline Visit.CSV

  • Always look at data first!

  • What do we notice?

vis00 <- read_csv("data files/Site 1, Baseline Visit.csv") #Fingers crossed!?
## New names:
## Rows: 127 Columns: 17
## ── Column specification
## ──────────────────────────────────────────────────────── Delimiter: "," chr
## (8): ...2, M, 22JAN2018, ...6, mm-hg, beats/min, RECLINED, lb dbl (8): 1...1,
## 1...3, 110, 80, 75, 98.4, 192, 2 lgl (1): F
## ℹ Use `spec()` to retrieve the full column specification for this data. ℹ
## Specify the column types or set `show_col_types = FALSE` to quiet this message.
## • `1` -> `1...1`
## • `` -> `...2`
## • `1` -> `1...3`
## • `` -> `...6`

Example 1 - Adding Column Names

vis00 <- read_csv("data files/Site 1, Baseline Visit.csv", 
                  col_names=c("Subject", "ScreenFailReas", "Screen", "Sex", 
                              "VisitDate", "NotifDate", "SBP", "DBP", "BPUnit",
                              "Pulse", "PulseUnit", "Pos", "Temp", "TempUnit",
                              "Wt", "WtUnit", "Pain")) 
## Rows: 128 Columns: 17
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (8): ScreenFailReas, Sex, VisitDate, NotifDate, BPUnit, PulseUnit, Pos, ...
## dbl (8): Subject, Screen, SBP, DBP, Pulse, Temp, Wt, Pain
## lgl (1): TempUnit
## 
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.

Example 1 - Adding Column Names: Round 2

visNames <- c("Subject", "ScreenFailReas", "Screen", "Sex", "VisitDate", "NotifDate", 
              "SBP", "DBP", "BPUnit", "Pulse", "PulseUnit", "Pos", "Temp", "TempUnit", 
              "Wt", "WtUnit", "Pain")

Example 1 - Adding Column Names: Round 2

visNames <- c("Subject", "ScreenFailReas", "Screen", "Sex", "VisitDate", "NotifDate",
              "SBP", "DBP", "BPUnit", "Pulse", "PulseUnit", "Pos", "Temp", "TempUnit",
              "Wt", "WtUnit", "Pain")
vis00 <- read_csv("data files/Site 1, Baseline Visit.csv", col_names=visNames) 
## Rows: 128 Columns: 17
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (8): ScreenFailReas, Sex, VisitDate, NotifDate, BPUnit, PulseUnit, Pos, ...
## dbl (8): Subject, Screen, SBP, DBP, Pulse, Temp, Wt, Pain
## lgl (1): TempUnit
## 
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.

Example 1 - Setting Column Class

vis00 <- read_csv("data files/Site 1, Baseline Visit.csv", col_names=visNames, 
                  col_types=cols(TempUnit=col_character()))
  • col_types= option sets column types
  • cols() function allows you to specify types multiple ways
  • Var-Name = var-specification allows you to choose for specific column
  • Other options like col_numeric, col_date, col_factor and many others exist
  • View the vis00 tibble to confirm TempUnit is no longer FALSE!

Example 1 - More Column Classses: Dates

vis00 <- read_csv("data files/Site 1, Baseline Visit.csv", col_names=visNames, 
                  col_types=cols(TempUnit=col_character(), 
                          VisitDate=col_date(format="%d%b%Y"), 
                          NotifDate=col_date(format="%d%b%Y")))

Example 2 - Site 1, 3 Month Visit.txt

  • View raw data to confirm it is tab-delimited & has variable names in first row!

Example 2 - Site 1, 3 Month Visit.txt

  • View raw data to confirm it is tab-delimited & has variable names in first row!

  • Now that you know what raw data looks like, read it in!

vis03 <- read_tsv("data files/Site 1, 3 Month Visit.txt", 
           col_types=cols(TempUnit=col_character(), 
                          VisitDate=col_date(format="%d%b%Y"), 
                          NotifDate=col_date(format="%d%b%Y")))

Example 3 - Site 1, 6 Month Visit.txt

  • View raw data to determine how to read it in

Example 3 - Site 1, 6 Month Visit.txt

  • View raw data to determine how to read it in
vis06 <- read_delim("data files/Site 1, 6 Month Visit.txt", delim="/",
           col_names = visNames,
           col_types=cols(TempUnit=col_character(), 
                          VisitDate=col_date(format="%d%b%Y"), 
                          NotifDate=col_date(format="%d%b%Y")))
  • How did R distinguish between “/” as delimiter and “/” in beats/min?

Aside – Fixed-Width Data

  • Delimited data is easier to parse – delimiters do the heavy lifting!
  • In fixed-field data, values are always in the same position
    • I.e., always begin in the same column
  • Not necessarily any break between fields!
  • Take a look at Site 1, 9 Month Visit.dat as an example
    • Don’t confuse spaces for delimiters!

Fixed-Width Data: Identifying Field Positions

  • read_fwf offers multiple options to identify positions

  • fwf_empty guesses based on empty columns

  • fwf_widths uses column widths that you provide

  • fwf_positions uses column positions that you provided

Example 4 - Site 1, 9 Month Visit.dat

myFile<-"data files/Site 1, 9 Month Visit.dat"
vis09 <- read_fwf(myFile, col_positions = fwf_empty(myFile))
  • Ignoring variable name and type issues, did we get the data we wanted?

Example 4 - Self-Validating the Results

myFile<-"data files/Site 1, 9 Month Visit.dat"
vis09 <- read_fwf(myFile, col_positions = fwf_empty(myFile))
vis09
## # A tibble: 74 × 18
##       X1 X2    X3       X4 X5    X6    X7       X8    X9 X10     X11 X12   X13  
##    <dbl> <chr> <chr> <dbl> <chr> <chr> <chr> <dbl> <dbl> <chr> <dbl> <chr> <chr>
##  1     1 <NA>  <NA>      1 M     25JU… <NA>    114    82 mm-hg    77 beat… RECL…
##  2     2 <NA>  <NA>      1 F     28JU… <NA>    101    84 mm-hg    79 beat… RECU…
##  3     4 DROP… HBP       0 M     17JU… 17JU…   110    91 mm-hg    89 beat… RECL…
##  4     6 <NA>  <NA>      1 M     25JU… <NA>     82    50 mm-hg    43 beat… RECU…
##  5     8 <NA>  <NA>      1 F     25JU… <NA>     82    64 mm-hg    57 beat… RECU…
##  6     9 <NA>  <NA>      1 M     02AU… <NA>     99    79 mm-hg    83 beat… RECU…
##  7    10 <NA>  <NA>      1 M     22JU… <NA>     99    84 mm-hg    75 beat… RECU…
##  8    11 <NA>  <NA>      1 F     25JU… <NA>     98    88 mm-hg    86 beat… RECL…
##  9    13 <NA>  <NA>      1 F     25JU… <NA>     96    61 mm-hg    57 beat… RECU…
## 10    16 <NA>  <NA>      1 M     24JU… <NA>    108    86 mm-hg    81 beat… RECU…
## # ℹ 64 more rows
## # ℹ 5 more variables: X14 <dbl>, X15 <lgl>, X16 <dbl>, X17 <chr>, X18 <dbl>

Example 4 - Using fwf_widths

myFile<-"data files/Site 1, 9 Month Visit.dat"
vis09 <- read_fwf(myFile, 
                  col_positions = fwf_widths(c(4,26,2,2,10,10,4,4,9,4,10,10,6,9,4,9,8),
                                             visNames),
                  col_types=cols(TempUnit=col_character(),
                                 VisitDate=col_date(format="%d%b%Y"), 
                                 NotifDate=col_date(format="%d%b%Y")))
vis09
## # A tibble: 74 × 17
##    Subject ScreenFailReas Screen Sex   VisitDate  NotifDate    SBP   DBP BPUnit
##      <dbl> <chr>           <dbl> <chr> <date>     <date>     <dbl> <dbl> <chr> 
##  1       1 <NA>                1 M     2018-07-25 NA           114    82 mm-hg 
##  2       2 <NA>                1 F     2018-07-28 NA           101    84 mm-hg 
##  3       4 DROPPED: HBP        0 M     2018-07-17 2018-07-17   110    91 mm-hg 
##  4       6 <NA>                1 M     2018-07-25 NA            82    50 mm-hg 
##  5       8 <NA>                1 F     2018-07-25 NA            82    64 mm-hg 
##  6       9 <NA>                1 M     2018-08-02 NA            99    79 mm-hg 
##  7      10 <NA>                1 M     2018-07-22 NA            99    84 mm-hg 
##  8      11 <NA>                1 F     2018-07-25 NA            98    88 mm-hg 
##  9      13 <NA>                1 F     2018-07-25 NA            96    61 mm-hg 
## 10      16 <NA>                1 M     2018-07-24 NA           108    86 mm-hg 
## # ℹ 64 more rows
## # ℹ 8 more variables: Pulse <dbl>, PulseUnit <chr>, Pos <chr>, Temp <dbl>,
## #   TempUnit <chr>, Wt <dbl>, WtUnit <chr>, Pain <dbl>

Widths vs Positions

  • fwf_widths has to account for spaces between fields and read them as data

  • fwf_positions can specify (start, end) positions of each variable

Widths vs Positions

  • fwf_widths has to account for spaces between fields and read them as data

  • fwf_positions can specify (start, end) positions of each variable

  • For our data, you can use your text editor to find that those positions are:

Subject ScreenFailReas Screen Sex VisitDate NotifDate SBP DBP BPUnit
1-3 5-29 31 33 35-43 45-53 55-57 59-61 63-67
Pulse PulseUnit Pos Temp TempUnit Wt WtUnit Pain
72-74 76-84 86-94 96-100 102 111-113 115- 116 131

Example 4 - Using fwf_positions

  • Getting those positions is annoying, but is only half the battle
myFile <- "data files/Site 1, 9 Month Visit.dat"
startPos <- c(1, 5, 31, 33, 35, 45, 55, 59, 63, 72, 76, 86, 96, 102, 111, 115, 131)
endPos <- c(3, 29, 31, 33, 43, 53, 57, 61, 67, 74, 84, 94, 100, 102, 113, 116, 131)
vis09 <- read_fwf(myFile, 
                  col_positions = fwf_positions(start = startPos,
                                                end = endPos,
                                                col_names = visNames),
                  col_types = cols(TempUnit=col_character(),
                                   VisitDate=col_date(format="%d%b%Y"),
                                   NotifDate=col_date(format="%d%b%Y")))

How Did We Do?

vis09
## # A tibble: 74 × 17
##    Subject ScreenFailReas Screen Sex   VisitDate  NotifDate    SBP   DBP BPUnit
##      <dbl> <chr>           <dbl> <chr> <date>     <date>     <dbl> <dbl> <chr> 
##  1       1 <NA>                1 M     2018-07-25 NA           114    82 mm-hg 
##  2       2 <NA>                1 F     2018-07-28 NA           101    84 mm-hg 
##  3       4 DROPPED: HBP        0 M     2018-07-17 2018-07-17   110    91 mm-hg 
##  4       6 <NA>                1 M     2018-07-25 NA            82    50 mm-hg 
##  5       8 <NA>                1 F     2018-07-25 NA            82    64 mm-hg 
##  6       9 <NA>                1 M     2018-08-02 NA            99    79 mm-hg 
##  7      10 <NA>                1 M     2018-07-22 NA            99    84 mm-hg 
##  8      11 <NA>                1 F     2018-07-25 NA            98    88 mm-hg 
##  9      13 <NA>                1 F     2018-07-25 NA            96    61 mm-hg 
## 10      16 <NA>                1 M     2018-07-24 NA           108    86 mm-hg 
## # ℹ 64 more rows
## # ℹ 8 more variables: Pulse <dbl>, PulseUnit <chr>, Pos <chr>, Temp <dbl>,
## #   TempUnit <chr>, Wt <dbl>, WtUnit <chr>, Pain <dbl>

Fixed-Field Final Comments

  • I hate fixed-field data so much. So, so, so much.
  • Good data practice says fixed data should come with a code book!
  • Great text editor is a must – again, Visual Studio Code is one of the best
  • Very common because no delimiters needed → compact storage
  • Easier to skip variables if you only want to read in part of the data

Reading Data with readxl and haven

  • So far, only looked at unstructured data

  • Data handling/manipulation is a huuuuuuge part of data analysis

    • Doesn’t mean it is everyone’s favorite task
    • Positive attitude goes a long way :D
  • What about data that is already semi-/fully-structured?
    • readxl package for reading MS Excel files
    • haven package for reading from other stat software

Example 5 - Sheets of Site 1 Lab.xlsx

library(readxl) #Part of tidyverse, but not loaded with it
excel_sheets("data files/Site 1 Lab.xlsx") #list all possible sheets
## [1] "Site1Lab00" "Site1Lab03" "Site1Lab06"

Example 5 - Reading a Sheet from Site 1 Lab.xlsx

library(readxl) #Part of tidyverse, but not loaded with it
excel_sheets("data files/Site 1 Lab.xlsx") #list all possible sheets
## [1] "Site1Lab00" "Site1Lab03" "Site1Lab06"
lab00<-read_excel("data files/Site 1 Lab.xlsx", sheet = "Site1Lab00")
#how did we do?
head(lab00)
## # A tibble: 6 × 19
##   Subject VisitDate           NotifDate ScreenFailReas Screen Sex     ALB
##     <dbl> <dttm>              <lgl>     <lgl>           <dbl> <chr> <dbl>
## 1       1 2018-01-22 00:00:00 NA        NA                  1 M       4.4
## 2       2 2018-01-23 00:00:00 NA        NA                  1 F       4.8
## 3       4 2018-01-20 00:00:00 NA        NA                  1 M       4.3
## 4       6 2018-01-24 00:00:00 NA        NA                  1 M       4.4
## 5       8 2018-01-21 00:00:00 NA        NA                  1 F       4.2
## 6       9 2018-01-27 00:00:00 NA        NA                  1 M       4.5
## # ℹ 12 more variables: AlkPhos <dbl>, ALT <dbl>, AST <dbl>, DBili <dbl>,
## #   GGTP <dbl>, CGluc <dbl>, UGluc <dbl>, TBili <dbl>, Prot <dbl>,
## #   Hemoglob <dbl>, Hematocr <dbl>, Preg <dbl>

## Example 5 - Setting Column Types

myFile <- "data files/Site 1 Lab.xlsx"
colTypes = c("numeric", "date", "date", "text", "numeric", "text", rep("numeric",13))
lab00<-read_excel(myFile, sheet = "Site1Lab00", col_types = colTypes)
lab03<-read_excel(myFile, sheet = "Site1Lab03", col_types = colTypes)
lab06<-read_excel(myFile, sheet = "Site1Lab06", col_types = colTypes)
head(lab00)
## # A tibble: 6 × 19
##   Subject VisitDate           NotifDate ScreenFailReas Screen Sex  
##     <dbl> <dttm>              <dttm>    <chr>           <dbl> <chr>
## 1       1 2018-01-22 00:00:00 NA        <NA>                1 M    
## 2       2 2018-01-23 00:00:00 NA        <NA>                1 F    
## 3       4 2018-01-20 00:00:00 NA        <NA>                1 M    
## 4       6 2018-01-24 00:00:00 NA        <NA>                1 M    
## 5       8 2018-01-21 00:00:00 NA        <NA>                1 F    
## 6       9 2018-01-27 00:00:00 NA        <NA>                1 M    
## # ℹ 13 more variables: ALB <dbl>, AlkPhos <dbl>, ALT <dbl>, AST <dbl>,
## #   DBili <dbl>, GGTP <dbl>, CGluc <dbl>, UGluc <dbl>, TBili <dbl>, Prot <dbl>,
## #   Hemoglob <dbl>, Hematocr <dbl>, Preg <dbl>

read_excel Features

  • Familiar features like col_names and col_types

  • Excel-specific features like range

  • Professional/Personal advice: you should never store data in Excel
    • Excel is a spreadsheet tool, not a database
    • If you want a Microsoft tool, then use MS Access

Example 6 - Site1Lab09.sas7bdat

library(haven)
lab09<-read_sas("data files/Site1Lab09.sas7bdat")
  • Just like reading from any third-party software, some special features

  • Notably, SAS variables usually have labels

    • Adds a new attribute when you investigate the tibble

Reading Data Wrap Up

  • Scratching the surface of data reading

  • Infinite* ways to “creatively” store data ⇒ many combinations of our tools

  • Worst case scenario, read entire file into a single string and parse it by hand
    • Definitely not a task for new programmers!
  • Writing data is as easy as using write_* functions instead of read_*

Activity #1

Three versions of a data set regarding patients’ blood measurements have been provided. Each file contains the same data, just presented differently.

  1. Read in bloodv1.dat

  2. Read in bloodv2.prn (Hint: use skip= to skip any rows you do not want to read.)

  3. Read in bloodv3.xlsx

Outline

  • R packages
  • Reading and writing data files (readr package)
  • Subsetting data by rows, columns, or both (dplyr package)
  • Deriving new variables unconditionally and conditionally (dplyr package)
  • Introduction to summarizing data

Types of Manipulations

  • Subset by:
    • filtering out rows
    • selecting columns
  • Combine data sets:
    • vertically
    • horizontally
  • Reshape data sets:
    • tall (stacked) to wide (unstacked)
    • wide (unstacked) to tall (stacked)

Review Logical Filtering

  • Already seen how to filter an object based on a Boolean (logical) vector
lab06$ScreenFailReas == "PREG."
##  [1]   NA   NA   NA   NA   NA   NA   NA   NA   NA   NA   NA   NA   NA   NA   NA
## [16]   NA   NA   NA   NA   NA   NA   NA   NA   NA   NA   NA   NA   NA   NA   NA
## [31]   NA   NA   NA   NA   NA   NA   NA   NA   NA   NA   NA   NA TRUE   NA   NA
## [46]   NA   NA   NA   NA   NA   NA   NA   NA   NA   NA   NA   NA   NA   NA   NA
## [61]   NA   NA   NA   NA   NA   NA   NA   NA   NA   NA TRUE   NA

Review Logical Filtering

  • Already seen how to filter an object based on a Boolean (logical) vector
sf<-lab06$Screen == 0
lab06[sf,]
## # A tibble: 2 × 19
##   Subject VisitDate           NotifDate           ScreenFailReas Screen Sex  
##     <dbl> <dttm>              <dttm>              <chr>           <dbl> <chr>
## 1      78 2018-09-17 00:00:00 2018-09-19 00:00:00 PREG.               0 F    
## 2     132 2018-10-18 00:00:00 2018-10-20 00:00:00 PREG.               0 F    
## # ℹ 13 more variables: ALB <dbl>, AlkPhos <dbl>, ALT <dbl>, AST <dbl>,
## #   DBili <dbl>, GGTP <dbl>, CGluc <dbl>, UGluc <dbl>, TBili <dbl>, Prot <dbl>,
## #   Hemoglob <dbl>, Hematocr <dbl>, Preg <dbl>

Example of filter

filter(lab06, Screen == 0) #filter provided by dplyr package
## # A tibble: 2 × 19
##   Subject VisitDate           NotifDate           ScreenFailReas Screen Sex  
##     <dbl> <dttm>              <dttm>              <chr>           <dbl> <chr>
## 1      78 2018-09-17 00:00:00 2018-09-19 00:00:00 PREG.               0 F    
## 2     132 2018-10-18 00:00:00 2018-10-20 00:00:00 PREG.               0 F    
## # ℹ 13 more variables: ALB <dbl>, AlkPhos <dbl>, ALT <dbl>, AST <dbl>,
## #   DBili <dbl>, GGTP <dbl>, CGluc <dbl>, UGluc <dbl>, TBili <dbl>, Prot <dbl>,
## #   Hemoglob <dbl>, Hematocr <dbl>, Preg <dbl>

Another Example of filter

  • Get to use those comparsion and Boolean operators!
filter(lab06, Screen == 0 | (AlkPhos > 110.0 & ALT > 28.0))
## # A tibble: 6 × 19
##   Subject VisitDate           NotifDate           ScreenFailReas Screen Sex  
##     <dbl> <dttm>              <dttm>              <chr>           <dbl> <chr>
## 1      13 2018-07-25 00:00:00 NA                  <NA>                1 F    
## 2      18 2018-08-02 00:00:00 NA                  <NA>                1 M    
## 3      46 2018-08-25 00:00:00 NA                  <NA>                1 M    
## 4      78 2018-09-17 00:00:00 2018-09-19 00:00:00 PREG.               0 F    
## 5      88 2018-09-11 00:00:00 NA                  <NA>                1 M    
## 6     132 2018-10-18 00:00:00 2018-10-20 00:00:00 PREG.               0 F    
## # ℹ 13 more variables: ALB <dbl>, AlkPhos <dbl>, ALT <dbl>, AST <dbl>,
## #   DBili <dbl>, GGTP <dbl>, CGluc <dbl>, UGluc <dbl>, TBili <dbl>, Prot <dbl>,
## #   Hemoglob <dbl>, Hematocr <dbl>, Preg <dbl>

slice vs. filter

  • filter is logical while slice is positional
slice(lab06, 12:16)
## # A tibble: 5 × 19
##   Subject VisitDate           NotifDate ScreenFailReas Screen Sex  
##     <dbl> <dttm>              <dttm>    <chr>           <dbl> <chr>
## 1      19 2018-08-02 00:00:00 NA        <NA>                1 M    
## 2      20 2018-07-29 00:00:00 NA        <NA>                1 M    
## 3      24 2018-08-02 00:00:00 NA        <NA>                1 M    
## 4      25 2018-08-03 00:00:00 NA        <NA>                1 M    
## 5      26 2018-07-27 00:00:00 NA        <NA>                1 F    
## # ℹ 13 more variables: ALB <dbl>, AlkPhos <dbl>, ALT <dbl>, AST <dbl>,
## #   DBili <dbl>, GGTP <dbl>, CGluc <dbl>, UGluc <dbl>, TBili <dbl>, Prot <dbl>,
## #   Hemoglob <dbl>, Hematocr <dbl>, Preg <dbl>

Slice Features

  • slice does have cool helper functions
## slice_head(lab06, n=5) #grab the top 5 observations
## slice_tail(lab06, n=10) #grab the bottom 10 observations
## slice_sample(lab06, n=3) #simple random sample of 3 rows
## slice_sample(lab06, prop=0.25) #sample 25% of the data set
slice_min(lab06, ALB) #also, slice_max
## # A tibble: 1 × 19
##   Subject VisitDate           NotifDate ScreenFailReas Screen Sex  
##     <dbl> <dttm>              <dttm>    <chr>           <dbl> <chr>
## 1      90 2018-09-12 00:00:00 NA        <NA>                1 F    
## # ℹ 13 more variables: ALB <dbl>, AlkPhos <dbl>, ALT <dbl>, AST <dbl>,
## #   DBili <dbl>, GGTP <dbl>, CGluc <dbl>, UGluc <dbl>, TBili <dbl>, Prot <dbl>,
## #   Hemoglob <dbl>, Hematocr <dbl>, Preg <dbl>

Arranging Rows

  • Arranging rows is as simple as using arrange
arrange(lab06,Screen,Sex,VisitDate)
## # A tibble: 72 × 19
##    Subject VisitDate           NotifDate           ScreenFailReas Screen Sex  
##      <dbl> <dttm>              <dttm>              <chr>           <dbl> <chr>
##  1      78 2018-09-17 00:00:00 2018-09-19 00:00:00 PREG.               0 F    
##  2     132 2018-10-18 00:00:00 2018-10-20 00:00:00 PREG.               0 F    
##  3      17 2018-07-24 00:00:00 NA                  <NA>                1 F    
##  4       8 2018-07-25 00:00:00 NA                  <NA>                1 F    
##  5      11 2018-07-25 00:00:00 NA                  <NA>                1 F    
##  6      13 2018-07-25 00:00:00 NA                  <NA>                1 F    
##  7      26 2018-07-27 00:00:00 NA                  <NA>                1 F    
##  8       2 2018-07-28 00:00:00 NA                  <NA>                1 F    
##  9      27 2018-08-03 00:00:00 NA                  <NA>                1 F    
## 10      30 2018-08-03 00:00:00 NA                  <NA>                1 F    
## # ℹ 62 more rows
## # ℹ 13 more variables: ALB <dbl>, AlkPhos <dbl>, ALT <dbl>, AST <dbl>,
## #   DBili <dbl>, GGTP <dbl>, CGluc <dbl>, UGluc <dbl>, TBili <dbl>, Prot <dbl>,
## #   Hemoglob <dbl>, Hematocr <dbl>, Preg <dbl>

Descending Order?

arrange(lab06,Screen,Sex,desc(VisitDate)) #Flip order of a variable with desc()
## # A tibble: 72 × 19
##    Subject VisitDate           NotifDate           ScreenFailReas Screen Sex  
##      <dbl> <dttm>              <dttm>              <chr>           <dbl> <chr>
##  1     132 2018-10-18 00:00:00 2018-10-20 00:00:00 PREG.               0 F    
##  2      78 2018-09-17 00:00:00 2018-09-19 00:00:00 PREG.               0 F    
##  3     126 2018-10-15 00:00:00 NA                  <NA>                1 F    
##  4     131 2018-10-15 00:00:00 NA                  <NA>                1 F    
##  5     122 2018-10-12 00:00:00 NA                  <NA>                1 F    
##  6     115 2018-10-07 00:00:00 NA                  <NA>                1 F    
##  7     114 2018-10-03 00:00:00 NA                  <NA>                1 F    
##  8     111 2018-09-30 00:00:00 NA                  <NA>                1 F    
##  9     103 2018-09-25 00:00:00 NA                  <NA>                1 F    
## 10     101 2018-09-24 00:00:00 NA                  <NA>                1 F    
## # ℹ 62 more rows
## # ℹ 13 more variables: ALB <dbl>, AlkPhos <dbl>, ALT <dbl>, AST <dbl>,
## #   DBili <dbl>, GGTP <dbl>, CGluc <dbl>, UGluc <dbl>, TBili <dbl>, Prot <dbl>,
## #   Hemoglob <dbl>, Hematocr <dbl>, Preg <dbl>

Combining Functions

  • Obviously, you might want to filter and arrange your data!

  • Typical approach is to nest functions

arrange(filter(lab06,Screen == 0 | (AlkPhos > 110.0 & ALT > 28.0)),Screen, Sex, desc(VisitDate))
## # A tibble: 6 × 19
##   Subject VisitDate           NotifDate           ScreenFailReas Screen Sex  
##     <dbl> <dttm>              <dttm>              <chr>           <dbl> <chr>
## 1     132 2018-10-18 00:00:00 2018-10-20 00:00:00 PREG.               0 F    
## 2      78 2018-09-17 00:00:00 2018-09-19 00:00:00 PREG.               0 F    
## 3      13 2018-07-25 00:00:00 NA                  <NA>                1 F    
## 4      88 2018-09-11 00:00:00 NA                  <NA>                1 M    
## 5      46 2018-08-25 00:00:00 NA                  <NA>                1 M    
## 6      18 2018-08-02 00:00:00 NA                  <NA>                1 M    
## # ℹ 13 more variables: ALB <dbl>, AlkPhos <dbl>, ALT <dbl>, AST <dbl>,
## #   DBili <dbl>, GGTP <dbl>, CGluc <dbl>, UGluc <dbl>, TBili <dbl>, Prot <dbl>,
## #   Hemoglob <dbl>, Hematocr <dbl>, Preg <dbl>

Combining Functions

  • Difficult for some to read, often space it this way

  • But… more functions ⇒???

arrange(
        filter(lab06,Screen == 0 | (AlkPhos > 110.0 & ALT > 28.0)),
        Screen, Sex, desc(VisitDate)
       )

Piping/Chaining

  • tidyverse offers ability to pipe or chain functions
  • Output from one function is sent as input into next function
    • Think links in a chain

Piping/Chaining

  • tidyverse offers ability to pipe or chain functions

  • Output from one function is sent as input into next function

    • Think links in a chain

#filter is your first link - we know it outputs a subset
filter(lab06,Screen == 0 | (AlkPhos > 110.0 & ALT > 28.0)) %>% #chain operator!
  arrange(Screen, Sex, desc(VisitDate))
  #arrange is your next link
  #note -- no first argument anymore, we start with the **2nd** argument

What About Columns?

select(lab06, Subject, VisitDate, CGluc, UGluc, Preg) #no chaining
lab06 %>% select(Subject, VisitDate, CGluc, UGluc, Preg) #same thing, with chaining
## # A tibble: 72 × 5
##    Subject VisitDate           CGluc UGluc  Preg
##      <dbl> <dttm>              <dbl> <dbl> <dbl>
##  1       1 2018-07-25 00:00:00 110       1     0
##  2       2 2018-07-28 00:00:00 102       0     0
##  3       6 2018-07-25 00:00:00  97.2     0     0
##  4       8 2018-07-25 00:00:00 113.      1     0
##  5       9 2018-08-02 00:00:00 106.      0     0
##  6      10 2018-07-22 00:00:00 103.      0     0
##  7      11 2018-07-25 00:00:00 103       0     0
##  8      13 2018-07-25 00:00:00 103.      0     0
##  9      16 2018-07-24 00:00:00 113.      1     0
## 10      17 2018-07-24 00:00:00 108.      0     0
## # ℹ 62 more rows

Helpful Functions for Selecting Columns

  • After lab06 %>% use:
select(starts_with("Screen")) #All variables starting with "Screen"
select(ends_with("date")) #All variables ending with "date"
select(starts_with("Screen") | ends_with("date")) #Either of the above
select(!(starts_with("Screen"))) #All variables that DO NOT start with "Screen"
select(contains('gluc')) #All variables contianing string 'gluc' in name
select(ALT:Prot) #All variables from ALT to Prot in data frame
select(where(is.numeric)) # Applies is.numeric to all variables and returns when TRUE

What Comes Next?

  • Reading data in is only half the battle
  • Still need to manipulate data sets in other ways:
    • Concatenate/stack records from different data sets
    • Merge/join records across data sets
    • Clean/tidy data
    • Derive new variables unconditionally
    • Derive new variables conditionally

Concatenation

  • Concatenation: vertically combining data sets by stacking records “on top” of each other
  • dplyr provides bind_rows to get it done
  • Important to consider certain metadata:
    • Names of columns
    • Data types of columns
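
A minimal sketch (invented toy data frames) of how those two metadata concerns play out with bind_rows:

```r
library(dplyr)

#Hypothetical toy data frames
a <- tibble(Subject = c(1, 2), ALB = c(4.4, 4.8))
b <- tibble(Subject = c(3, 4), ALT = c(20, 27))

bind_rows(a, b)      #4 rows; unmatched columns ALB/ALT are kept and filled with NA

d <- tibble(Subject = c("5", "6"), ALB = c(4.1, 4.5))  #Subject is character here!
try(bind_rows(a, d)) #errors: Subject is numeric in one set, character in the other
```

Mismatched names are tolerated (with NAs); mismatched types for a shared column are not.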

Example of Concatenation

#I want to use small subsets so we can see what is happening!
set1 <- lab00 %>% slice(1:3) #first 3 rows of lab00
set2 <- lab03 %>% slice(1:3) #first 3 rows of lab03
bind_rows(set1, set2) #3 from lab00 + 3 from lab03 = 6 vertically combined records
## # A tibble: 6 × 19
##   Subject VisitDate           NotifDate ScreenFailReas Screen Sex  
##     <dbl> <dttm>              <dttm>    <chr>           <dbl> <chr>
## 1       1 2018-01-22 00:00:00 NA        <NA>                1 M    
## 2       2 2018-01-23 00:00:00 NA        <NA>                1 F    
## 3       4 2018-01-20 00:00:00 NA        <NA>                1 M    
## 4       1 2018-04-24 00:00:00 NA        <NA>                1 M    
## 5       2 2018-04-26 00:00:00 NA        <NA>                1 F    
## 6       4 2018-04-19 00:00:00 NA        <NA>                1 M    
## # ℹ 13 more variables: ALB <dbl>, AlkPhos <dbl>, ALT <dbl>, AST <dbl>,
## #   DBili <dbl>, GGTP <dbl>, CGluc <dbl>, UGluc <dbl>, TBili <dbl>, Prot <dbl>,
## #   Hemoglob <dbl>, Hematocr <dbl>, Preg <dbl>

Concatenating Mismatched Columns

set1 <- lab00 %>% select(1:2) %>% slice(1:3) #first 2 columns and 3 rows of lab00
set2 <- lab03 %>% select(c(5,7)) %>% slice(1:3) #first 3 rows of columns 5 & 7 from lab03
bind_rows(set1, set2) # six mismatched records
## # A tibble: 6 × 4
##   Subject VisitDate           Screen   ALB
##     <dbl> <dttm>               <dbl> <dbl>
## 1       1 2018-01-22 00:00:00     NA  NA  
## 2       2 2018-01-23 00:00:00     NA  NA  
## 3       4 2018-01-20 00:00:00     NA  NA  
## 4      NA NA                       1   4.4
## 5      NA NA                       1   4.7
## 6      NA NA                       1   4.2

Combining Horizontally: Merges

  • Merges also called match-merges or joins
  • Concatenations are simple operations: layer records from data frame 1 on top of data frame 2
  • Merges are more complicated because we need rules for combining records
  • Only appropriate when records can be linked
    • I.e., record in Table A and record in Table B combine to form single record
  • Lots of flavors of join: inner, left, right, full, semi, anti

Building Toy Data Sets

  • Easier to visualize with small data sets
#rows 1 - 5 of Subject, CGluc, UGluc in Lab00
tableA <- lab00 %>% slice(1:5) %>% select(Subject|contains("gluc"))
#rows 3 - 7 of Subject, SBP, DBP in Vis00
tableB <- vis00 %>% slice(3:7) %>% select(Subject|ends_with("bp")) 
Subject CGluc UGluc
1 106 0
2 101 0
4 102 0
6 101 0
8 116 1
Subject SBP DBP
3 100 75
4 106 89
5 144 115
6 78 49
7 132 92

Inner Join - Implicit Key Variables

Table A
Subject CGluc UGluc
1 106 0
2 101 0
4 102 0
6 101 0
8 116 1
Table B
Subject SBP DBP
3 100 75
4 106 89
5 144 115
6 78 49
7 132 92
inner_join(tableA, tableB)
Join
Subject CGluc UGluc SBP DBP
4 102 0 106 89
6 101 0 78 49
  • How did R decide which A and B records to combine?

Inner Join - Explicit Key Variables

Table A
Subject CGluc UGluc
1 106 0
2 101 0
4 102 0
6 101 0
8 116 1
Table B
Subject SBP DBP
3 100 75
4 106 89
5 144 115
6 78 49
7 132 92
inner_join(tableA, tableB, 
           by = "Subject")
Join
Subject CGluc UGluc SBP DBP
4 102 0 106 89
6 101 0 78 49
  • What records are included in an inner join?

Joining by Differently-Named Variables

tableC <- tableB %>% rename(Subj = Subject)
Subject CGluc UGluc
1 106 0
2 101 0
4 102 0
6 101 0
8 116 1
Subj SBP DBP
3 100 75
4 106 89
5 144 115
6 78 49
7 132 92

Joining by Differently-Named Variables

Table A
Subject CGluc UGluc
1 106 0
2 101 0
4 102 0
6 101 0
8 116 1
Table C
Subj SBP DBP
3 100 75
4 106 89
5 144 115
6 78 49
7 132 92
inner_join(tableA, tableC, 
           by = c("Subject"="Subj"))
Join
Subject CGluc UGluc SBP DBP
4 102 0 106 89
6 101 0 78 49

Left Join

Table A
Subject CGluc UGluc
1 106 0
2 101 0
4 102 0
6 101 0
8 116 1
Table B
Subject SBP DBP
3 100 75
4 106 89
5 144 115
6 78 49
7 132 92
left_join(tableA, tableB, 
           by = "Subject")
Join
Subject CGluc UGluc SBP DBP
1 106 0 NA NA
2 101 0 NA NA
4 102 0 106 89
6 101 0 78 49
8 116 1 NA NA
  • What rule creates the row space now? Column space?

Right Join

Table A
Subject CGluc UGluc
1 106 0
2 101 0
4 102 0
6 101 0
8 116 1
Table B
Subject SBP DBP
3 100 75
4 106 89
5 144 115
6 78 49
7 132 92
right_join(tableA, tableB, 
           by = "Subject")
Join
Subject CGluc UGluc SBP DBP
4 102 0 106 89
6 101 0 78 49
3 NA NA 100 75
5 NA NA 144 115
7 NA NA 132 92
  • Row space? Column space?

(Full) Outer Join

Table A
Subject CGluc UGluc
1 106 0
2 101 0
4 102 0
6 101 0
8 116 1
Table B
Subject SBP DBP
3 100 75
4 106 89
5 144 115
6 78 49
7 132 92
  • Row space? Column space?
full_join(tableA, tableB, 
           by = "Subject")
Join
Subject CGluc UGluc SBP DBP
1 106 0 NA NA
2 101 0 NA NA
4 102 0 106 89
6 101 0 78 49
8 116 1 NA NA
3 NA NA 100 75
5 NA NA 144 115
7 NA NA 132 92

Semi Join

Table A
Subject CGluc UGluc
1 106 0
2 101 0
4 102 0
6 101 0
8 116 1
Table B
Subject SBP DBP
3 100 75
4 106 89
5 144 115
6 78 49
7 132 92
semi_join(tableA, tableB, 
           by = "Subject")
Join
Subject CGluc UGluc
4 102 0
6 101 0
  • Row space? Column space?

Anti Join

Table A
Subject CGluc UGluc
1 106 0
2 101 0
4 102 0
6 101 0
8 116 1
Table B
Subject SBP DBP
3 100 75
4 106 89
5 144 115
6 78 49
7 132 92
anti_join(tableA, tableB, 
           by = "Subject")
Join
Subject CGluc UGluc
1 106 0
2 101 0
8 116 1
  • Row space? Column space?

Closing Comments on Joining Data

  • Only seen how to do a one-to-one join

    • Critical to use enough by = variables to get a unique record per table!
  • One-to-many joins are common and important

  • Many-to-many joins should rarely be done!

  • R never coerces types – key variables must have the same type in both tables!
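
To see why unique keys matter, here is a toy sketch (invented tables) of a one-to-many left join: the single demographics row for a subject is repeated once per matching visit row.

```r
library(dplyr)

#Hypothetical toy tables: one demographics row per subject, many visit rows
demog  <- tibble(Subject = c(1, 2), Sex = c("M", "F"))
visits <- tibble(Subject = c(1, 1, 2), SBP = c(120, 118, 131))

left_join(demog, visits, by = "Subject")
#Subject 1 matches two visit rows, so its Sex value is repeated -> 3 rows total
```

If both tables had duplicate keys, every copy would match every copy (a many-to-many join), which is rarely what you want.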

Reshaping: Starting Data Set

  • Transposing/reshaping/restructuring data is an important tool

  • Let’s build a data set to use and look at two ways to restructure data:

    • Wide to tall
    • Tall to wide
tableC <- lab00 %>% select(Subject, ALB:DBili)
slice_head(tableC, n=2) 
## # A tibble: 2 × 6
##   Subject   ALB AlkPhos   ALT   AST DBili
##     <dbl> <dbl>   <dbl> <dbl> <dbl> <dbl>
## 1       1   4.4      80    20    22  0.11
## 2       2   4.8      27    27    30  0.1

Reshape Wide to Tall

  • You may see gather in older code

  • pivot_longer is its updated replacement and handles more scenarios

taller <- tableC %>% pivot_longer(!Subject, #old columns to be reshaped
                                  names_to = "lbtest", #new column for old column names
                                  values_to = "result", #new column for old column values
                                 )
slice_head(taller, n=10) 
## # A tibble: 10 × 3
##    Subject lbtest  result
##      <dbl> <chr>    <dbl>
##  1       1 ALB       4.4 
##  2       1 AlkPhos  80   
##  3       1 ALT      20   
##  4       1 AST      22   
##  5       1 DBili     0.11
##  6       2 ALB       4.8 
##  7       2 AlkPhos  27   
##  8       2 ALT      27   
##  9       2 AST      30   
## 10       2 DBili     0.1

Reshape: Tall to Wide

  • spread is the opposite of gather

  • pivot_wider is the opposite of pivot_longer

wider <- taller %>% pivot_wider(names_from = lbtest, #what should new column names be?
                                values_from = result, #what should new column values be?
                               )
slice_head(wider, n=2) 
## # A tibble: 2 × 6
##   Subject   ALB AlkPhos   ALT   AST DBili
##     <dbl> <dbl>   <dbl> <dbl> <dbl> <dbl>
## 1       1   4.4      80    20    22  0.11
## 2       2   4.8      27    27    30  0.1

Activity #2

  1. Concatenate lab00 through lab06 into a single data frame named lab

  2. Concatenate vis00 through vis06 into a single data frame named vis

  3. Combine the Lab and Visit data frames under the following conditions. In each case, ensure you do not duplicate any variables!

    • Version 1: Only patients with Visit data and Lab data are important

    • Version 2: All patients are important!

    • Version 3: Want Visit and Lab data for any patients who have Lab data

    • Version 4: We want a subset of the Lab data set based on whether the patient is not in the Visit data set

  4. Advanced: Suppose you concatenate the results of semi_join(A,B) and anti_join(A,B). How does the resulting table compare to A and B?

Outline

  • R packages
  • Reading and writing data files (readr package)
  • Subsetting data by rows, columns, or both (dplyr package)
  • Deriving new variables unconditionally and conditionally (dplyr package)
  • Introduction to summarizing data

Creating a New Variable Unconditionally

Month <- 3 #Creating a new variable is easy :)
  • But how do we attach it to a data set, say, Lab03?
lab03 %>% mutate(month = 3)
## # A tibble: 75 × 20
##    Subject VisitDate           NotifDate ScreenFailReas Screen Sex  
##      <dbl> <dttm>              <dttm>    <chr>           <dbl> <chr>
##  1       1 2018-04-24 00:00:00 NA        <NA>                1 M    
##  2       2 2018-04-26 00:00:00 NA        <NA>                1 F    
##  3       4 2018-04-19 00:00:00 NA        <NA>                1 M    
##  4       6 2018-04-26 00:00:00 NA        <NA>                1 M    
##  5       8 2018-04-25 00:00:00 NA        <NA>                1 F    
##  6       9 2018-04-30 00:00:00 NA        <NA>                1 M    
##  7      10 2018-04-22 00:00:00 NA        <NA>                1 M    
##  8      11 2018-04-27 00:00:00 NA        <NA>                1 F    
##  9      13 2018-04-27 00:00:00 NA        <NA>                1 F    
## 10      16 2018-04-26 00:00:00 NA        <NA>                1 M    
## # ℹ 65 more rows
## # ℹ 14 more variables: ALB <dbl>, AlkPhos <dbl>, ALT <dbl>, AST <dbl>,
## #   DBili <dbl>, GGTP <dbl>, CGluc <dbl>, UGluc <dbl>, TBili <dbl>, Prot <dbl>,
## #   Hemoglob <dbl>, Hematocr <dbl>, Preg <dbl>, month <dbl>

How Do We Confirm?

#I *really* encourage you to avoid overwriting objects!
lab03v2 <- lab03 %>% mutate(month = 3) 
lab03v2 %>% select(Subject, month)
## # A tibble: 75 × 2
##    Subject month
##      <dbl> <dbl>
##  1       1     3
##  2       2     3
##  3       4     3
##  4       6     3
##  5       8     3
##  6       9     3
##  7      10     3
##  8      11     3
##  9      13     3
## 10      16     3
## # ℹ 65 more rows

New Variable from Other Variables?

lab03v3 <- lab03 %>% mutate(CGratio = Hematocr/Hemoglob, 
                            CGRatioDefn = "Hematocrit/Hemoglobin") 
lab03v3 %>% select(Subject, starts_with("CGrat"))
## # A tibble: 75 × 3
##    Subject CGratio CGRatioDefn          
##      <dbl>   <dbl> <chr>                
##  1       1    3.33 Hematocrit/Hemoglobin
##  2       2    2.92 Hematocrit/Hemoglobin
##  3       4    2.77 Hematocrit/Hemoglobin
##  4       6    3.18 Hematocrit/Hemoglobin
##  5       8    2.71 Hematocrit/Hemoglobin
##  6       9    2.98 Hematocrit/Hemoglobin
##  7      10    2.5  Hematocrit/Hemoglobin
##  8      11    3.44 Hematocrit/Hemoglobin
##  9      13    3.08 Hematocrit/Hemoglobin
## 10      16    3.15 Hematocrit/Hemoglobin
## # ℹ 65 more rows

Only Want to See New Variable?

lab03 %>% transmute(CGratio = Hematocr/Hemoglob, CGRatioDefn = "Hematocrit/Hemoglobin") 
## # A tibble: 75 × 2
##    CGratio CGRatioDefn          
##      <dbl> <chr>                
##  1    3.33 Hematocrit/Hemoglobin
##  2    2.92 Hematocrit/Hemoglobin
##  3    2.77 Hematocrit/Hemoglobin
##  4    3.18 Hematocrit/Hemoglobin
##  5    2.71 Hematocrit/Hemoglobin
##  6    2.98 Hematocrit/Hemoglobin
##  7    2.5  Hematocrit/Hemoglobin
##  8    3.44 Hematocrit/Hemoglobin
##  9    3.08 Hematocrit/Hemoglobin
## 10    3.15 Hematocrit/Hemoglobin
## # ℹ 65 more rows

Quick Recap

  • mutate creates new columns AND attaches them to the source data frame
    • Useful for extending current data frame
  • transmute turns old columns into new columns
    • Useful for standalone variables to add to a different data frame

Introducing Conditional Logic

  • Sometimes we need to create a variable based on certain conditions

  • Basic concept: if-then/else logic

if (if-condition) {
    action-statement
}
  • if-condition is the logical condition applied to the record
  • action-statement is what you want R to do when if-condition is TRUE
  • Many languages use keywords if and then just like English syntax

if-then Example

name <- "Jonathan"
if (name == "Jonathan") {
    fuzzy <- "Yes"
}
fuzzy
## [1] "Yes"
  • Of course, commonly have multiple conditions

if-then/else Examples

name <- "Fozzy Bear"
if (name == "Jonathan") {
    fuzzy <- "Yes"
} else {
  fuzzy <- "Unknown"
}
fuzzy
## [1] "Unknown"
name <- "Fozzy Bear"
if (name == "Jonathan") {
    fuzzy <- "Yes"
} else if (name == "Fozzy Bear") {
  fuzzy <- "Fozzy, not fuzzy"
} else { fuzzy <- "Unknown"
}
fuzzy
## [1] "Fozzy, not fuzzy"

Cascading Conditions

name <- "Fozzy Bear"
if ((name == "Jonathan") | (str_detect(name,"Fozzy"))) {
    fuzzy <- "Yes"
} else if (name == "Fozzy Bear") {
  fuzzy <- "Fozzy, not fuzzy"
} else { fuzzy <- "Unknown"
}
fuzzy
## [1] "Yes"
  • str_detect searches name for the string “Fozzy”
  • It finds it, so FALSE | TRUE = TRUE
  • Meeting the first condition → don’t check other conditions

More Meaningful Example?

if (lab00$ALB < 4.0 | lab00$ALB > 5.4){
  ScreenFlag <- "ALB"
}
## Error in if (lab00$ALB < 4 | lab00$ALB > 5.4) {: the condition has length > 1
  • Sadly, basic version of if-then cannot handle more than one value
  • That means no vectors or data frames!
  • We need an approach that can handle vectors!

Vectorized Functions

  • Useful functions tend to be vectorized
    • Means they can operate on a vector
  • Remember: data frame is just a bound collection of vectors!
  • Functions like mean and median accept vectors

Vectorized IF-THEN

#Vectorized if-then is done with the ifelse function; its syntax (pseudocode):
#ifelse(condition-to-apply-to-vector, action-if-true, action-if-false)
ifelse(lab00$ALB < 4.0 | lab00$ALB > 5.4, "HighLow", "Normal")
##  [1] "Normal"  "Normal"  "Normal"  "Normal"  "Normal"  "Normal"  "Normal" 
##  [8] "Normal"  "HighLow" "Normal"  "HighLow" "HighLow" "HighLow" "Normal" 
## [15] "Normal"  "HighLow" "HighLow" "Normal"  "Normal"  "Normal"  "Normal" 
## [22] "Normal"  "Normal"  "Normal"  "Normal"  "HighLow" "Normal"  "Normal" 
## [29] "Normal"  "Normal"  "Normal"  "HighLow" "Normal"  "Normal"  "Normal" 
## [36] "Normal"  "Normal"  "Normal"  "Normal"  "Normal"  "Normal"  "HighLow"
## [43] "Normal"  "HighLow" "Normal"  "Normal"  "Normal"  "Normal"  "Normal" 
## [50] "Normal"  "Normal"  "Normal"  "HighLow" "HighLow" "Normal"  "Normal" 
## [57] "Normal"  "Normal"  "HighLow" "HighLow" "Normal"  "Normal"  "Normal" 
## [64] "Normal"  "Normal"  "Normal"  "HighLow" "HighLow" "Normal"  "Normal" 
## [71] "Normal"  "HighLow" "Normal"  "Normal"  "Normal"  "Normal"  "Normal"

Nesting ifelse

#Can still have multiple conditions
ifelse(lab00$ALB < 4.0, "Low", 
       ifelse(lab00$ALB > 5.4, "High", 
              "Normal"))
##  [1] "Normal" "Normal" "Normal" "Normal" "Normal" "Normal" "Normal" "Normal"
##  [9] "Low"    "Normal" "Low"    "Low"    "Low"    "Normal" "Normal" "Low"   
## [17] "Low"    "Normal" "Normal" "Normal" "Normal" "Normal" "Normal" "Normal"
## [25] "Normal" "Low"    "Normal" "Normal" "Normal" "Normal" "Normal" "Low"   
## [33] "Normal" "Normal" "Normal" "Normal" "Normal" "Normal" "Normal" "Normal"
## [41] "Normal" "Low"    "Normal" "Low"    "Normal" "Normal" "Normal" "Normal"
## [49] "Normal" "Normal" "Normal" "Normal" "Low"    "High"   "Normal" "Normal"
## [57] "Normal" "Normal" "Low"    "Low"    "Normal" "Normal" "Normal" "Normal"
## [65] "Normal" "Normal" "Low"    "Low"    "Normal" "Normal" "Normal" "Low"   
## [73] "Normal" "Normal" "Normal" "Normal" "Normal"

Syntax Improvement

#chain and transmute or mutate as needed
lab00 %>% transmute(AlbFlag = ifelse(ALB < 4.0, "Low", 
                                ifelse(ALB > 5.4, "High", 
                                       "Normal")))
## # A tibble: 77 × 1
##    AlbFlag
##    <chr>  
##  1 Normal 
##  2 Normal 
##  3 Normal 
##  4 Normal 
##  5 Normal 
##  6 Normal 
##  7 Normal 
##  8 Normal 
##  9 Low    
## 10 Normal 
## # ℹ 67 more rows
  • Leaving “Normal” as my catchall is probably not a good idea…

GPP Improvement

#chain and transmute or mutate as needed
lab00 %>% transmute(AlbFlag = ifelse(ALB < 4.0, "Low", 
                                ifelse(ALB > 5.4, "High", 
                                  ifelse(ALB >= 4.0 & ALB <= 5.4, "Normal", 
                                         "INVALID"))))
## # A tibble: 77 × 1
##    AlbFlag
##    <chr>  
##  1 Normal 
##  2 Normal 
##  3 Normal 
##  4 Normal 
##  5 Normal 
##  6 Normal 
##  7 Normal 
##  8 Normal 
##  9 Low    
## 10 Normal 
## # ℹ 67 more rows
  • With an explicit “Normal” range and an “INVALID” catchall, unexpected values can no longer masquerade as “Normal”
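
As an aside, deeply nested ifelse calls get hard to read; dplyr’s case_when expresses the same cascading logic as a top-to-bottom list of condition ~ value pairs (a sketch with invented ALB values, using the cutoffs from the slides):

```r
library(dplyr)

alb <- c(3.5, 4.4, 5.6, NA)  #invented toy ALB values
case_when(alb < 4.0               ~ "Low",
          alb > 5.4               ~ "High",
          alb >= 4.0 & alb <= 5.4 ~ "Normal",
          TRUE                    ~ "INVALID")  #catchall, including NA
## [1] "Low"     "Normal"  "High"    "INVALID"
```

The first condition that evaluates TRUE wins, just like a cascade of if/else if blocks.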

Recap

  • Create variables unconditionally as constant or from other variables

  • Attach with mutate or view on their own with transmute

  • Vectorized functions accept vectors as arguments

  • Vectorized logic with ifelse

  • Caution: Avoid a “miscellaneous” category also catching one of your expected categories!

  • Combining logic with filter, arrange, slice, *_join, bind_rows allows for lots of data manipulation

Activity #3

  1. Create three date constants using the following code:
     cut00 <- as.Date("2018-04-19") #Approximate last date of Baseline Visits
     cut03 <- as.Date("2018-07-21") #Approximate last date of 3 Month Visits
     cut06 <- as.Date("2018-10-21") #Approximate last date of 6 Month Visits
  2. Update the vis data frame from Activity #2 to include the variable VisitNum. If a record has a visit date prior to the cutoff date for those visits, assign the appropriate value (1,2,3, etc.).
    • E.g., records with VisitDate = 2018-10-20 should get VisitNum = 6
    • Use NA for records with dates not covered by these three cutoff dates
  3. Advanced: View the vis00, vis03, vis06 data frames and focus on dates near the cutoff. Did we do a good job creating our VisitNum variable? What should be done instead? If you want, update your code to derive the VisitNum variable more accurately.

Outline

  • R packages
  • Reading and writing data files (readr package)
  • Subsetting data by rows, columns, or both (dplyr package)
  • Deriving new variables unconditionally and conditionally (dplyr package)
  • Introduction to summarizing data

Summarizing Data with tidyverse Functions

  • summarise applies a summary function to your data

  • group_by collects data into groups for separate analyses

    • Undone by using ungroup

Basic Summary Statistics

#Basic syntax of summarise (pseudocode):
#summarise(NewVarName = summary-function(<variable>))
lab03 %>% summarise(meanALB = mean(ALB), medianALB = median(ALB), count = n())
## # A tibble: 1 × 3
##   meanALB medianALB count
##     <dbl>     <dbl> <int>
## 1    4.39       4.4    75

Grouping Basic Statistics

#Note: using lab00, lab03, and lab06
lab <- lab00 %>% bind_rows(lab03) %>% bind_rows(lab06) #From Activity #2
lab %>% group_by(Subject) %>% summarise(meanALB = mean(ALB), 
                                        medianALB = median(ALB), 
                                        count = n())
## # A tibble: 77 × 4
##    Subject meanALB medianALB count
##      <dbl>   <dbl>     <dbl> <int>
##  1       1    4.4       4.4      3
##  2       2    4.7       4.7      3
##  3       4    4.25      4.25     2
##  4       6    4.4       4.4      3
##  5       8    4.1       4.1      3
##  6       9    4.37      4.4      3
##  7      10    5         5        3
##  8      11    4.03      4        3
##  9      13    3.53      3.5      3
## 10      16    4.63      4.6      3
## # ℹ 67 more rows

Grouping Better Statistics

lab %>% group_by(Sex) %>% summarise(count = n(), 
                                    p05 = quantile(ALB, 0.05), p15 = quantile(ALB, 0.15), 
                                    p25 = quantile(ALB, 0.25), p50 = quantile(ALB, 0.50), 
                                    p75 = quantile(ALB, 0.75), p85 = quantile(ALB, 0.85),
                                    p95 = quantile(ALB, 0.95), IQR = p75-p25)
## # A tibble: 2 × 10
##   Sex   count   p05   p15   p25   p50   p75   p85   p95   IQR
##   <chr> <int> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 F       102  3.40  3.83   4.1   4.6  4.8   4.9   5.09 0.7  
## 2 M       122  3.4   3.6    4.1   4.4  4.77  4.98  5.2  0.675

Expanding Summary Techniques

  • Gets tedious for lots of variables and/or statistics

  • Applying to a group of variables? → use apply family of functions

  • Three choices: apply, lapply, sapply

Using apply

#Example syntax (pseudocode); MARGIN = 1 summarizes rows, MARGIN = 2 summarizes columns
#apply(X = columns-of-interest, MARGIN = row-or-column-choice, 
#      FUN = function-of-interest, na.rm = your-choice)
apply(select(lab,ALB:CGluc), MARGIN = 2, FUN = summary, na.rm = TRUE)
##              ALB  AlkPhos      ALT      AST DBili     GGTP    CGluc
## Min.    2.900000  22.8000  2.00000  7.70000 -0.01 -0.30000  95.7000
## 1st Qu. 4.100000  62.5500 16.00000 18.17500  0.11 19.00000 101.4750
## Median  4.500000  84.1500 21.05000 22.85000  0.15 28.00000 105.0000
## Mean    4.391518  85.5433 21.30625 22.63661  0.16 27.79509 105.3312
## 3rd Qu. 4.800000 108.0000 27.00000 27.15000  0.22 37.12500 109.0000
## Max.    6.000000 149.3000 39.50000 37.40000  0.41 49.50000 118.6000

Effect of MARGIN=

apply(select(lab[1:5,],ALB:CGluc), MARGIN = 1, FUN = summary, na.rm = TRUE)
##              [,1]  [,2]      [,3]      [,4]  [,5]
## Min.      0.11000   0.1   0.21000   0.25000   0.3
## 1st Qu.  12.20000   7.9   6.15000  12.20000   5.6
## Median   22.00000  27.0  19.00000  24.00000  20.0
## Mean     39.50143  28.7  35.21571  41.66429  34.5
## 3rd Qu.  62.00000  28.5  56.50000  63.00000  47.0
## Max.    106.00000 101.0 102.00000 117.00000 116.0
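
For column-wise work, lapply and sapply can be simpler than apply: a data frame is just a list of columns, so there is no MARGIN to specify (a sketch with invented toy data, base R only):

```r
#A data frame is a list of columns, so sapply/lapply map a function over
#the columns directly -- no MARGIN argument (invented toy data)
toy <- data.frame(ALB = c(4.4, 4.8, 2.9), ALT = c(20, 27, 16))

sapply(toy, mean)  #simplifies the result to a named numeric vector
lapply(toy, mean)  #same values, but returned as a list
```

Rule of thumb: sapply when you want a vector or matrix back, lapply when you want a list.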

On To Bivariate Statistics!

Pearson Correlation!

lab %>% select(ALB:AST) %>% cor()
##                  ALB    AlkPhos        ALT          AST
## ALB      1.000000000 0.12051637 0.07624307 -0.004772245
## AlkPhos  0.120516369 1.00000000 0.02127493  0.136677803
## ALT      0.076243075 0.02127493 1.00000000  0.025933663
## AST     -0.004772245 0.13667780 0.02593366  1.000000000

Other Correlation Methods

lab %>% select(ALB:AST) %>% cor(method = "spearman")
##                ALB      AlkPhos          ALT          AST
## ALB     1.00000000  0.120722030  0.048230339  0.065368590
## AlkPhos 0.12072203  1.000000000 -0.004754992  0.172736863
## ALT     0.04823034 -0.004754992  1.000000000 -0.004662076
## AST     0.06536859  0.172736863 -0.004662076  1.000000000

P-values!

  • What kind of course would this be if we didn’t see at least one p-value?!
install.packages("Hmisc")
library(Hmisc)
  • This gives access to rcorr function which computes correlations and p-values

Example of rcorr

lab %>% select(ALB:AST) %>% as.matrix() %>% rcorr()
##          ALB AlkPhos  ALT  AST
## ALB     1.00    0.12 0.08 0.00
## AlkPhos 0.12    1.00 0.02 0.14
## ALT     0.08    0.02 1.00 0.03
## AST     0.00    0.14 0.03 1.00
## 
## n= 224 
## 
## 
## P
##         ALB    AlkPhos ALT    AST   
## ALB            0.0718  0.2558 0.9434
## AlkPhos 0.0718         0.7515 0.0410
## ALT     0.2558 0.7515         0.6995
## AST     0.9434 0.0410  0.6995

Recap

  • More math and stats functions than you can shake a stick at

  • Approach: know what analysis you should do, find the function that does that analysis, read the help page, implement the function

    • tidyverse is wonderful but it doesn’t have everything (yet!)
    • apply family good for working with multiple variables
    • Other packages good for more specific analysis
  • Want to do more? Need to expand your skills!

    • for and while loops allow you to carry out repetitive actions
    • Vectorized functions are better
    • Not too challenging to write your own function after you’ve programmed for a while
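
As a taste of writing your own function, here is a sketch (base R; the name five_stats is invented) that bundles several of the summary statistics used above into one reusable helper:

```r
#Sketch of a small user-written helper: returns five summary statistics
#for a numeric vector as a named vector
five_stats <- function(x, na.rm = TRUE) {
  c(mean   = mean(x, na.rm = na.rm),
    sd     = sd(x, na.rm = na.rm),
    median = median(x, na.rm = na.rm),
    p25    = unname(quantile(x, 0.25, na.rm = na.rm)),
    p75    = unname(quantile(x, 0.75, na.rm = na.rm)))
}

five_stats(c(4.4, 4.8, 2.9, 5.0))
```

Once defined, it drops straight into sapply (e.g., sapply(toy, five_stats)) or into a summarise call.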

Activity #4

  1. Use tidyverse functions to get the overall mean, standard deviation (which is square root of variance), median, 25th, and 75th percentile for SBP from the vis data frame
  2. Repeat the previous analyses, but summarized separately for each Subject
  3. Obtain the Spearman rank correlations for SBP, DBP, Pulse, and Temp from the vis data frame
  4. Advanced: It may be important to look at SBP broken down by DBP. Create a variable to indicate whether a record has a DBP that is Normal (below 80), Stage 1 (80-89), Stage 2 (90-120), or Crisis (above 120). Obtain the mean, standard deviation, median, 25th percentile, and 75th percentile after being grouped by DBP classification.

The End

  • Thank you!

  • Questions?