- 9:30 - 10:35 Morning Session #1
- 10:35 - 10:50 Morning Break
- 10:50 - 11:55 Morning Session #2
- 11:55 - 1:10 Lunch
- 1:10 - 2:15 Afternoon Session #1
- 2:15 - 2:30 Afternoon Break
- 2:30 - 3:35 Afternoon Session #2
- 3:35 - 3:40 Feedback/Q&A
August 8, 2023
readr
package)dplyr
package)dplyr
package)Packages = bundle of features/functions
Download from CRAN
Careful: no central group → no constant syntax + no constant quality
Use install.packages("name-of-package-goes-here")
to download a package
Use library(name-of-package-goes-here)
to use a package in your current session
Do not blindly update packages
R does not ensure backwards compatibility
Updates can cause your program to stop working until you update with new code
Big/important project? → Document which version of packages you used
install.packages("tidyverse") #Order your book library(tidyverse) #Check out your book
install.packages("tidyverse") #Order your book library(tidyverse) #Check out your book
If you plan to follow along, run that code!
Should see the following about conflicts
tidyverse
?tibble
- better data frames
readr
- easier to read in external data
tidyr
and dplyr
- easier to clean up (aka tidy!) your data
Other packages are useful, but not for this intro course
tibble
Tibble is a table that has been “tidyverse-d”
Essentially, nicer printing properties + forcing better data management habits
Data created by tidyverse is tibble by default
For OG data frames, use as_tibble
growth <- as_tibble(ToothGrowth) growth
## # A tibble: 60 × 3 ## len supp dose ## <dbl> <fct> <dbl> ## 1 4.2 VC 0.5 ## 2 11.5 VC 0.5 ## 3 7.3 VC 0.5 ## 4 5.8 VC 0.5 ## 5 6.4 VC 0.5 ## 6 10 VC 0.5 ## 7 11.2 VC 0.5 ## 8 11.2 VC 0.5 ## 9 5.2 VC 0.5 ## 10 7 VC 0.5 ## # ℹ 50 more rows
Concerned we only learned about atomic vector, matrix, data frame, and list?
Those are the fundamental data types!
is.data.frame(growth)
## [1] TRUE
is_tibble(growth) #tidyverse functions tend to use _ and not .
## [1] TRUE
Concerned we only learned about atomic vector, matrix, data frame, and list?
Those are the fundamental data types!
is.data.frame(growth)
## [1] TRUE
is_tibble(growth) #tidyverse functions tend to use _ and not .
## [1] TRUE
Think of a tibble like it’s an Instagram filter.
readr
Some base functions for reading raw data, but we’ll focus on tidyverse
functions
read_csv
: files delimited by a comma
read_tsv
: files delimited by a tab
read_delim
: files delimited by any character (including comma & tab)
read_fwf
: files with fixed-width fields
read_table
: files “structured” like a table using spaces (not tabs!)
tidyverse
Always look at data first!
What do we notice?
Always look at data first!
What do we notice?
vis00 <- read_csv("data files/Site 1, Baseline Visit.csv") #Fingers crossed!?
## New names: ## Rows: 127 Columns: 17 ## ── Column specification ## ──────────────────────────────────────────────────────── Delimiter: "," chr ## (8): ...2, M, 22JAN2018, ...6, mm-hg, beats/min, RECLINED, lb dbl (8): 1...1, ## 1...3, 110, 80, 75, 98.4, 192, 2 lgl (1): F ## ℹ Use `spec()` to retrieve the full column specification for this data. ℹ ## Specify the column types or set `show_col_types = FALSE` to quiet this message. ## • `1` -> `1...1` ## • `` -> `...2` ## • `1` -> `1...3` ## • `` -> `...6`
vis00 <- read_csv("data files/Site 1, Baseline Visit.csv", col_names=c("Subject", "ScreenFailReas", "Screen", "Sex", "VisitDate", "NotifDate", "SBP", "DBP", "BPUnit", "Pulse", "PulseUnit", "Pos", "Temp", "TempUnit", "Wt", "WtUnit", "Pain"))
## Rows: 128 Columns: 17 ## ── Column specification ──────────────────────────────────────────────────────── ## Delimiter: "," ## chr (8): ScreenFailReas, Sex, VisitDate, NotifDate, BPUnit, PulseUnit, Pos, ... ## dbl (8): Subject, Screen, SBP, DBP, Pulse, Temp, Wt, Pain ## lgl (1): TempUnit ## ## ℹ Use `spec()` to retrieve the full column specification for this data. ## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
visNames <- c("Subject", "ScreenFailReas", "Screen", "Sex", "VisitDate", "NotifDate", "SBP", "DBP", "BPUnit", "Pulse", "PulseUnit", "Pos", "Temp", "TempUnit", "Wt", "WtUnit", "Pain")
visNames <- c("Subject", "ScreenFailReas", "Screen", "Sex", "VisitDate", "NotifDate", "SBP", "DBP", "BPUnit", "Pulse", "PulseUnit", "Pos", "Temp", "TempUnit", "Wt", "WtUnit", "Pain")
vis00 <- read_csv("data files/Site 1, Baseline Visit.csv", col_names=visNames)
## Rows: 128 Columns: 17 ## ── Column specification ──────────────────────────────────────────────────────── ## Delimiter: "," ## chr (8): ScreenFailReas, Sex, VisitDate, NotifDate, BPUnit, PulseUnit, Pos, ... ## dbl (8): Subject, Screen, SBP, DBP, Pulse, Temp, Wt, Pain ## lgl (1): TempUnit ## ## ℹ Use `spec()` to retrieve the full column specification for this data. ## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
vis00 <- read_csv("data files/Site 1, Baseline Visit.csv", col_names=visNames, col_types=cols(TempUnit=col_character()))
col_types=
option sets column typescols()
function allows you to specify types multiple waysVar-Name = var-specification
allows you to choose for specific columncol_numeric
, col_date
, col_factor
and many others existvis00
tibble to confirm TempUnit is no longer FALSE!vis00 <- read_csv("data files/Site 1, Baseline Visit.csv", col_names=visNames, col_types=cols(TempUnit=col_character(), VisitDate=col_date(format="%d%b%Y"), NotifDate=col_date(format="%d%b%Y")))
View raw data to confirm it is tab-delimited & has variable names in first row!
Now that you know what raw data looks like, read it in!
vis03 <- read_tsv("data files/Site 1, 3 Month Visit.txt", col_types=cols(TempUnit=col_character(), VisitDate=col_date(format="%d%b%Y"), NotifDate=col_date(format="%d%b%Y")))
vis06 <- read_delim("data files/Site 1, 6 Month Visit.txt", delim="/", col_names = visNames, col_types=cols(TempUnit=col_character(), VisitDate=col_date(format="%d%b%Y"), NotifDate=col_date(format="%d%b%Y")))
beats/min
?read_fwf
offers multiple options to identify positions
fwf_empty
guesses based on empty columns
fwf_widths
uses column widths that you provide
fwf_positions
uses column positions that you provided
myFile<-"data files/Site 1, 9 Month Visit.dat" vis09 <- read_fwf(myFile, col_positions = fwf_empty(myFile))
myFile<-"data files/Site 1, 9 Month Visit.dat" vis09 <- read_fwf(myFile, col_positions = fwf_empty(myFile)) vis09
## # A tibble: 74 × 18 ## X1 X2 X3 X4 X5 X6 X7 X8 X9 X10 X11 X12 X13 ## <dbl> <chr> <chr> <dbl> <chr> <chr> <chr> <dbl> <dbl> <chr> <dbl> <chr> <chr> ## 1 1 <NA> <NA> 1 M 25JU… <NA> 114 82 mm-hg 77 beat… RECL… ## 2 2 <NA> <NA> 1 F 28JU… <NA> 101 84 mm-hg 79 beat… RECU… ## 3 4 DROP… HBP 0 M 17JU… 17JU… 110 91 mm-hg 89 beat… RECL… ## 4 6 <NA> <NA> 1 M 25JU… <NA> 82 50 mm-hg 43 beat… RECU… ## 5 8 <NA> <NA> 1 F 25JU… <NA> 82 64 mm-hg 57 beat… RECU… ## 6 9 <NA> <NA> 1 M 02AU… <NA> 99 79 mm-hg 83 beat… RECU… ## 7 10 <NA> <NA> 1 M 22JU… <NA> 99 84 mm-hg 75 beat… RECU… ## 8 11 <NA> <NA> 1 F 25JU… <NA> 98 88 mm-hg 86 beat… RECL… ## 9 13 <NA> <NA> 1 F 25JU… <NA> 96 61 mm-hg 57 beat… RECU… ## 10 16 <NA> <NA> 1 M 24JU… <NA> 108 86 mm-hg 81 beat… RECU… ## # ℹ 64 more rows ## # ℹ 5 more variables: X14 <dbl>, X15 <lgl>, X16 <dbl>, X17 <chr>, X18 <dbl>
fwf_widths
myFile<-"data files/Site 1, 9 Month Visit.dat" vis09 <- read_fwf(myFile, col_positions = fwf_widths(c(4,26,2,2,10,10,4,4,9,4,10,10,6,9,4,9,8), visNames), col_types=cols(TempUnit=col_character(), VisitDate=col_date(format="%d%b%Y"), NotifDate=col_date(format="%d%b%Y"))) vis09
## # A tibble: 74 × 17 ## Subject ScreenFailReas Screen Sex VisitDate NotifDate SBP DBP BPUnit ## <dbl> <chr> <dbl> <chr> <date> <date> <dbl> <dbl> <chr> ## 1 1 <NA> 1 M 2018-07-25 NA 114 82 mm-hg ## 2 2 <NA> 1 F 2018-07-28 NA 101 84 mm-hg ## 3 4 DROPPED: HBP 0 M 2018-07-17 2018-07-17 110 91 mm-hg ## 4 6 <NA> 1 M 2018-07-25 NA 82 50 mm-hg ## 5 8 <NA> 1 F 2018-07-25 NA 82 64 mm-hg ## 6 9 <NA> 1 M 2018-08-02 NA 99 79 mm-hg ## 7 10 <NA> 1 M 2018-07-22 NA 99 84 mm-hg ## 8 11 <NA> 1 F 2018-07-25 NA 98 88 mm-hg ## 9 13 <NA> 1 F 2018-07-25 NA 96 61 mm-hg ## 10 16 <NA> 1 M 2018-07-24 NA 108 86 mm-hg ## # ℹ 64 more rows ## # ℹ 8 more variables: Pulse <dbl>, PulseUnit <chr>, Pos <chr>, Temp <dbl>, ## # TempUnit <chr>, Wt <dbl>, WtUnit <chr>, Pain <dbl>
fwf_widths
has to account for spaces between fields and read them as data
fwf_positions
can specify (start, end)
positions of each variable
fwf_widths
has to account for spaces between fields and read them as data
fwf_positions
can specify (start, end)
positions of each variable
For our data, you can use your text editor to find that those positions are:
Subject | ScreenFailReas | Screen | Sex | VisitDate | NotifDate | SBP | DBP | BPUnit |
---|---|---|---|---|---|---|---|---|
1-3 | 5-29 | 31 | 33 | 35-43 | 45-53 | 55-57 | 59-61 | 63-67 |
Pulse | PulseUnit | Pos | Temp | TempUnit | Wt | WtUnit | Pain |
---|---|---|---|---|---|---|---|
72-74 | 76-84 | 86-94 | 96-100 | 102 | 111-113 | 115- 116 | 131 |
fwf_positions
myFile <- "data files/Site 1, 9 Month Visit.dat" startPos <- c(1, 5, 31, 33, 35, 45, 55, 59, 63, 72, 76, 86, 96, 102, 111, 115, 131) endPos <- c(3, 29, 31, 33, 43, 53, 57, 61, 67, 74, 84, 94, 100, 102, 113, 116, 131) vis09 <- read_fwf(myFile, col_positions = fwf_positions(start = startPos, end = endPos, col_names = visNames), col_types = cols(TempUnit=col_character(), VisitDate=col_date(format="%d%b%Y"), NotifDate=col_date(format="%d%b%Y")))
vis09
## # A tibble: 74 × 17 ## Subject ScreenFailReas Screen Sex VisitDate NotifDate SBP DBP BPUnit ## <dbl> <chr> <dbl> <chr> <date> <date> <dbl> <dbl> <chr> ## 1 1 <NA> 1 M 2018-07-25 NA 114 82 mm-hg ## 2 2 <NA> 1 F 2018-07-28 NA 101 84 mm-hg ## 3 4 DROPPED: HBP 0 M 2018-07-17 2018-07-17 110 91 mm-hg ## 4 6 <NA> 1 M 2018-07-25 NA 82 50 mm-hg ## 5 8 <NA> 1 F 2018-07-25 NA 82 64 mm-hg ## 6 9 <NA> 1 M 2018-08-02 NA 99 79 mm-hg ## 7 10 <NA> 1 M 2018-07-22 NA 99 84 mm-hg ## 8 11 <NA> 1 F 2018-07-25 NA 98 88 mm-hg ## 9 13 <NA> 1 F 2018-07-25 NA 96 61 mm-hg ## 10 16 <NA> 1 M 2018-07-24 NA 108 86 mm-hg ## # ℹ 64 more rows ## # ℹ 8 more variables: Pulse <dbl>, PulseUnit <chr>, Pos <chr>, Temp <dbl>, ## # TempUnit <chr>, Wt <dbl>, WtUnit <chr>, Pain <dbl>
readxl
and haven
So far, only looked at unstructured data
Data handling/manipulation is a huuuuuuge part of data analysis
readxl
package for reading MS Excel files
haven
package for reading from other stat software
library(readxl) #Part of tidyverse, but not loaded with it excel_sheets("data files/Site 1 Lab.xlsx") #list all possible sheets
## [1] "Site1Lab00" "Site1Lab03" "Site1Lab06"
library(readxl) #Part of tidyverse, but not loaded with it excel_sheets("data files/Site 1 Lab.xlsx") #list all possible sheets
## [1] "Site1Lab00" "Site1Lab03" "Site1Lab06"
lab00<-read_excel("data files/Site 1 Lab.xlsx", sheet = "Site1Lab00") #how did we do? head(lab00)
## # A tibble: 6 × 19 ## Subject VisitDate NotifDate ScreenFailReas Screen Sex ALB ## <dbl> <dttm> <lgl> <lgl> <dbl> <chr> <dbl> ## 1 1 2018-01-22 00:00:00 NA NA 1 M 4.4 ## 2 2 2018-01-23 00:00:00 NA NA 1 F 4.8 ## 3 4 2018-01-20 00:00:00 NA NA 1 M 4.3 ## 4 6 2018-01-24 00:00:00 NA NA 1 M 4.4 ## 5 8 2018-01-21 00:00:00 NA NA 1 F 4.2 ## 6 9 2018-01-27 00:00:00 NA NA 1 M 4.5 ## # ℹ 12 more variables: AlkPhos <dbl>, ALT <dbl>, AST <dbl>, DBili <dbl>, ## # GGTP <dbl>, CGluc <dbl>, UGluc <dbl>, TBili <dbl>, Prot <dbl>, ## # Hemoglob <dbl>, Hematocr <dbl>, Preg <dbl>
## Example 5 - Setting Column Types
myFile <- "data files/Site 1 Lab.xlsx" colTypes = c("numeric", "date", "date", "text", "numeric", "text", rep("numeric",13)) lab00<-read_excel(myFile, sheet = "Site1Lab00", col_types = colTypes) lab03<-read_excel(myFile, sheet = "Site1Lab03", col_types = colTypes) lab06<-read_excel(myFile, sheet = "Site1Lab06", col_types = colTypes) head(lab00)
## # A tibble: 6 × 19 ## Subject VisitDate NotifDate ScreenFailReas Screen Sex ## <dbl> <dttm> <dttm> <chr> <dbl> <chr> ## 1 1 2018-01-22 00:00:00 NA <NA> 1 M ## 2 2 2018-01-23 00:00:00 NA <NA> 1 F ## 3 4 2018-01-20 00:00:00 NA <NA> 1 M ## 4 6 2018-01-24 00:00:00 NA <NA> 1 M ## 5 8 2018-01-21 00:00:00 NA <NA> 1 F ## 6 9 2018-01-27 00:00:00 NA <NA> 1 M ## # ℹ 13 more variables: ALB <dbl>, AlkPhos <dbl>, ALT <dbl>, AST <dbl>, ## # DBili <dbl>, GGTP <dbl>, CGluc <dbl>, UGluc <dbl>, TBili <dbl>, Prot <dbl>, ## # Hemoglob <dbl>, Hematocr <dbl>, Preg <dbl>
read_excel
FeaturesFamiliar features like col_names
and col_types
Excel-specific features like range
library(haven) lab09<-read_sas("data files/Site1Lab09.sas7bdat")
Just like reading from any third-party software, some special features
Notably, SAS variables usually have labels
attribute
when you investigate the tibbleScratching the surface of data reading
Infinite* ways to “creatively” store data ⇒ many combinations of our tools
write_*
functions instead of read_*
Three versions of a data set regarding patients’ blood measurements have been provided. Each file contains the same data, just presented differently.
Read in bloodv1.dat
Read in bloodv2.prn
(Hint: use skip=
to skip any rows you do not want to read.)
Read in bloodv3.xlsx
readr
package)dplyr
package)dplyr
package)lab06$ScreenFailReas == "PREG."
## [1] NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA ## [16] NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA ## [31] NA NA NA NA NA NA NA NA NA NA NA NA TRUE NA NA ## [46] NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA ## [61] NA NA NA NA NA NA NA NA NA NA TRUE NA
sf<-lab06$Screen == 0 lab06[sf,]
## # A tibble: 2 × 19 ## Subject VisitDate NotifDate ScreenFailReas Screen Sex ## <dbl> <dttm> <dttm> <chr> <dbl> <chr> ## 1 78 2018-09-17 00:00:00 2018-09-19 00:00:00 PREG. 0 F ## 2 132 2018-10-18 00:00:00 2018-10-20 00:00:00 PREG. 0 F ## # ℹ 13 more variables: ALB <dbl>, AlkPhos <dbl>, ALT <dbl>, AST <dbl>, ## # DBili <dbl>, GGTP <dbl>, CGluc <dbl>, UGluc <dbl>, TBili <dbl>, Prot <dbl>, ## # Hemoglob <dbl>, Hematocr <dbl>, Preg <dbl>
filter
filter(lab06, Screen == 0) #filter provided by dplyr package
## # A tibble: 2 × 19 ## Subject VisitDate NotifDate ScreenFailReas Screen Sex ## <dbl> <dttm> <dttm> <chr> <dbl> <chr> ## 1 78 2018-09-17 00:00:00 2018-09-19 00:00:00 PREG. 0 F ## 2 132 2018-10-18 00:00:00 2018-10-20 00:00:00 PREG. 0 F ## # ℹ 13 more variables: ALB <dbl>, AlkPhos <dbl>, ALT <dbl>, AST <dbl>, ## # DBili <dbl>, GGTP <dbl>, CGluc <dbl>, UGluc <dbl>, TBili <dbl>, Prot <dbl>, ## # Hemoglob <dbl>, Hematocr <dbl>, Preg <dbl>
filter
filter(lab06, Screen == 0 | (AlkPhos > 110.0 & ALT > 28.0))
## # A tibble: 6 × 19 ## Subject VisitDate NotifDate ScreenFailReas Screen Sex ## <dbl> <dttm> <dttm> <chr> <dbl> <chr> ## 1 13 2018-07-25 00:00:00 NA <NA> 1 F ## 2 18 2018-08-02 00:00:00 NA <NA> 1 M ## 3 46 2018-08-25 00:00:00 NA <NA> 1 M ## 4 78 2018-09-17 00:00:00 2018-09-19 00:00:00 PREG. 0 F ## 5 88 2018-09-11 00:00:00 NA <NA> 1 M ## 6 132 2018-10-18 00:00:00 2018-10-20 00:00:00 PREG. 0 F ## # ℹ 13 more variables: ALB <dbl>, AlkPhos <dbl>, ALT <dbl>, AST <dbl>, ## # DBili <dbl>, GGTP <dbl>, CGluc <dbl>, UGluc <dbl>, TBili <dbl>, Prot <dbl>, ## # Hemoglob <dbl>, Hematocr <dbl>, Preg <dbl>
slice
vs. filter
filter
is logical while slice
is positionalslice(lab06, 12:16)
## # A tibble: 5 × 19 ## Subject VisitDate NotifDate ScreenFailReas Screen Sex ## <dbl> <dttm> <dttm> <chr> <dbl> <chr> ## 1 19 2018-08-02 00:00:00 NA <NA> 1 M ## 2 20 2018-07-29 00:00:00 NA <NA> 1 M ## 3 24 2018-08-02 00:00:00 NA <NA> 1 M ## 4 25 2018-08-03 00:00:00 NA <NA> 1 M ## 5 26 2018-07-27 00:00:00 NA <NA> 1 F ## # ℹ 13 more variables: ALB <dbl>, AlkPhos <dbl>, ALT <dbl>, AST <dbl>, ## # DBili <dbl>, GGTP <dbl>, CGluc <dbl>, UGluc <dbl>, TBili <dbl>, Prot <dbl>, ## # Hemoglob <dbl>, Hematocr <dbl>, Preg <dbl>
slice
does have cool helper functions## slice_head(lab06, n=5) #grab the top 5 observations ## slice_tail(lab06, n=10) #grab the bottom 10 observations ## slice_sample(lab06, n=3) #simple random sample of 3 rows ## slice_sample(lab06, prop=0.25) #sample 25% of the data set slice_min(lab06, ALB) #also, slice_max
## # A tibble: 1 × 19 ## Subject VisitDate NotifDate ScreenFailReas Screen Sex ## <dbl> <dttm> <dttm> <chr> <dbl> <chr> ## 1 90 2018-09-12 00:00:00 NA <NA> 1 F ## # ℹ 13 more variables: ALB <dbl>, AlkPhos <dbl>, ALT <dbl>, AST <dbl>, ## # DBili <dbl>, GGTP <dbl>, CGluc <dbl>, UGluc <dbl>, TBili <dbl>, Prot <dbl>, ## # Hemoglob <dbl>, Hematocr <dbl>, Preg <dbl>
arrange
arrange(lab06,Screen,Sex,VisitDate)
## # A tibble: 72 × 19 ## Subject VisitDate NotifDate ScreenFailReas Screen Sex ## <dbl> <dttm> <dttm> <chr> <dbl> <chr> ## 1 78 2018-09-17 00:00:00 2018-09-19 00:00:00 PREG. 0 F ## 2 132 2018-10-18 00:00:00 2018-10-20 00:00:00 PREG. 0 F ## 3 17 2018-07-24 00:00:00 NA <NA> 1 F ## 4 8 2018-07-25 00:00:00 NA <NA> 1 F ## 5 11 2018-07-25 00:00:00 NA <NA> 1 F ## 6 13 2018-07-25 00:00:00 NA <NA> 1 F ## 7 26 2018-07-27 00:00:00 NA <NA> 1 F ## 8 2 2018-07-28 00:00:00 NA <NA> 1 F ## 9 27 2018-08-03 00:00:00 NA <NA> 1 F ## 10 30 2018-08-03 00:00:00 NA <NA> 1 F ## # ℹ 62 more rows ## # ℹ 13 more variables: ALB <dbl>, AlkPhos <dbl>, ALT <dbl>, AST <dbl>, ## # DBili <dbl>, GGTP <dbl>, CGluc <dbl>, UGluc <dbl>, TBili <dbl>, Prot <dbl>, ## # Hemoglob <dbl>, Hematocr <dbl>, Preg <dbl>
arrange(lab06,Screen,Sex,desc(VisitDate)) #Flip order of a variable with desc()
## # A tibble: 72 × 19 ## Subject VisitDate NotifDate ScreenFailReas Screen Sex ## <dbl> <dttm> <dttm> <chr> <dbl> <chr> ## 1 132 2018-10-18 00:00:00 2018-10-20 00:00:00 PREG. 0 F ## 2 78 2018-09-17 00:00:00 2018-09-19 00:00:00 PREG. 0 F ## 3 126 2018-10-15 00:00:00 NA <NA> 1 F ## 4 131 2018-10-15 00:00:00 NA <NA> 1 F ## 5 122 2018-10-12 00:00:00 NA <NA> 1 F ## 6 115 2018-10-07 00:00:00 NA <NA> 1 F ## 7 114 2018-10-03 00:00:00 NA <NA> 1 F ## 8 111 2018-09-30 00:00:00 NA <NA> 1 F ## 9 103 2018-09-25 00:00:00 NA <NA> 1 F ## 10 101 2018-09-24 00:00:00 NA <NA> 1 F ## # ℹ 62 more rows ## # ℹ 13 more variables: ALB <dbl>, AlkPhos <dbl>, ALT <dbl>, AST <dbl>, ## # DBili <dbl>, GGTP <dbl>, CGluc <dbl>, UGluc <dbl>, TBili <dbl>, Prot <dbl>, ## # Hemoglob <dbl>, Hematocr <dbl>, Preg <dbl>
Obviously, you might want to filter
and arrange
your data!
Typical approach is to nest functions
arrange(filter(lab06,Screen == 0 | (AlkPhos > 110.0 & ALT > 28.0)),Screen, Sex, desc(VisitDate))
## # A tibble: 6 × 19 ## Subject VisitDate NotifDate ScreenFailReas Screen Sex ## <dbl> <dttm> <dttm> <chr> <dbl> <chr> ## 1 132 2018-10-18 00:00:00 2018-10-20 00:00:00 PREG. 0 F ## 2 78 2018-09-17 00:00:00 2018-09-19 00:00:00 PREG. 0 F ## 3 13 2018-07-25 00:00:00 NA <NA> 1 F ## 4 88 2018-09-11 00:00:00 NA <NA> 1 M ## 5 46 2018-08-25 00:00:00 NA <NA> 1 M ## 6 18 2018-08-02 00:00:00 NA <NA> 1 M ## # ℹ 13 more variables: ALB <dbl>, AlkPhos <dbl>, ALT <dbl>, AST <dbl>, ## # DBili <dbl>, GGTP <dbl>, CGluc <dbl>, UGluc <dbl>, TBili <dbl>, Prot <dbl>, ## # Hemoglob <dbl>, Hematocr <dbl>, Preg <dbl>
Difficult for some to read, often space it this way
But… more functions ⇒???
arrange( filter(lab06,Screen == 0 | (AlkPhos > 110.0 & ALT > 28.0)), Screen, Sex, desc(VisitDate) )
tidyverse
offers ability to pipe or chain functionstidyverse
offers ability to pipe or chain functions
Output from one function is sent as input into next function
Think links in a chain
#filter is your first link - we know it outputs a subset filter(lab06,Screen == 0 | (AlkPhos > 110.0 & ALT > 28.0)) %>% #chain operator! arrange(Screen, Sex, desc(VisitDate)) #arrange is your next link #note -- no first argument anymore, we start with the **2nd** argument
select(lab06, Subject, VisitDate, CGluc, UGluc, Preg) #no chaining
lab06 %>% select(Subject, VisitDate, CGluc, UGluc, Preg) #same thing, with chaining
## # A tibble: 72 × 5 ## Subject VisitDate CGluc UGluc Preg ## <dbl> <dttm> <dbl> <dbl> <dbl> ## 1 1 2018-07-25 00:00:00 110 1 0 ## 2 2 2018-07-28 00:00:00 102 0 0 ## 3 6 2018-07-25 00:00:00 97.2 0 0 ## 4 8 2018-07-25 00:00:00 113. 1 0 ## 5 9 2018-08-02 00:00:00 106. 0 0 ## 6 10 2018-07-22 00:00:00 103. 0 0 ## 7 11 2018-07-25 00:00:00 103 0 0 ## 8 13 2018-07-25 00:00:00 103. 0 0 ## 9 16 2018-07-24 00:00:00 113. 1 0 ## 10 17 2018-07-24 00:00:00 108. 0 0 ## # ℹ 62 more rows
lab06 %>%
use:select(starts_with("Screen")) #All variables starting with "Screen" select(ends_with("date")) #All variables ending with "date" select(starts_with("Screen") | ends_with("date")) #Either of the above select(!(starts_with("Screen"))) #All variables that DO NOT start with "Screen" select(contains('gluc')) #All variables contianing string 'gluc' in name select(ALT:Prot) #All variables from ALT to Prot in data frame select(where(is.numeric)) # Applies is.numeric to all variables and returns when TRUE
dplyr
provides bind_rows
to get it done#I want to use small subsets so we can see what is happening! set1 <- lab00 %>% slice(1:3) #first 3 rows of lab00 set2 <- lab03 %>% slice(1:3) #first 3 rows of lab03 bind_rows(set1, set2) #3 from lab00 + 3 from lab03 = 6, vertically combined, records
## # A tibble: 6 × 19 ## Subject VisitDate NotifDate ScreenFailReas Screen Sex ## <dbl> <dttm> <dttm> <chr> <dbl> <chr> ## 1 1 2018-01-22 00:00:00 NA <NA> 1 M ## 2 2 2018-01-23 00:00:00 NA <NA> 1 F ## 3 4 2018-01-20 00:00:00 NA <NA> 1 M ## 4 1 2018-04-24 00:00:00 NA <NA> 1 M ## 5 2 2018-04-26 00:00:00 NA <NA> 1 F ## 6 4 2018-04-19 00:00:00 NA <NA> 1 M ## # ℹ 13 more variables: ALB <dbl>, AlkPhos <dbl>, ALT <dbl>, AST <dbl>, ## # DBili <dbl>, GGTP <dbl>, CGluc <dbl>, UGluc <dbl>, TBili <dbl>, Prot <dbl>, ## # Hemoglob <dbl>, Hematocr <dbl>, Preg <dbl>
set1 <- lab00 %>% select(1:2) %>% slice(1:3) #first 2 columns and 3 rows of lab00 set2 <- lab03 %>% select(c(5,7)) %>% slice(1:3) #first 3 rows of columns 5 & 7 from lab03 bind_rows(set1, set2) # six mismatched records
## # A tibble: 6 × 4 ## Subject VisitDate Screen ALB ## <dbl> <dttm> <dbl> <dbl> ## 1 1 2018-01-22 00:00:00 NA NA ## 2 2 2018-01-23 00:00:00 NA NA ## 3 4 2018-01-20 00:00:00 NA NA ## 4 NA NA 1 4.4 ## 5 NA NA 1 4.7 ## 6 NA NA 1 4.2
#rows 1 - 5 of Subject, CGluc, UGluc in Lab00 tableA <- lab00 %>% slice(1:5) %>% select(Subject|contains("gluc")) #rows 3 - 7 of Subject, SBP, DBP in Vis00 tableB <- vis00 %>% slice(3:7) %>% select(Subject|ends_with("bp"))
Subject | CGluc | UGluc |
---|---|---|
1 | 106 | 0 |
2 | 101 | 0 |
4 | 102 | 0 |
6 | 101 | 0 |
8 | 116 | 1 |
Subject | SBP | DBP |
---|---|---|
3 | 100 | 75 |
4 | 106 | 89 |
5 | 144 | 115 |
6 | 78 | 49 |
7 | 132 | 92 |
Subject | CGluc | UGluc |
---|---|---|
1 | 106 | 0 |
2 | 101 | 0 |
4 | 102 | 0 |
6 | 101 | 0 |
8 | 116 | 1 |
Subject | SBP | DBP |
---|---|---|
3 | 100 | 75 |
4 | 106 | 89 |
5 | 144 | 115 |
6 | 78 | 49 |
7 | 132 | 92 |
inner_join(tableA, tableB)
Subject | CGluc | UGluc |
---|---|---|
1 | 106 | 0 |
2 | 101 | 0 |
4 | 102 | 0 |
6 | 101 | 0 |
8 | 116 | 1 |
Subject | SBP | DBP |
---|---|---|
3 | 100 | 75 |
4 | 106 | 89 |
5 | 144 | 115 |
6 | 78 | 49 |
7 | 132 | 92 |
inner_join(tableA, tableB)
Subject | CGluc | UGluc | SBP | DBP |
---|---|---|---|---|
4 | 102 | 0 | 106 | 89 |
6 | 101 | 0 | 78 | 49 |
Subject | CGluc | UGluc |
---|---|---|
1 | 106 | 0 |
2 | 101 | 0 |
4 | 102 | 0 |
6 | 101 | 0 |
8 | 116 | 1 |
Subject | SBP | DBP |
---|---|---|
3 | 100 | 75 |
4 | 106 | 89 |
5 | 144 | 115 |
6 | 78 | 49 |
7 | 132 | 92 |
inner_join(tableA, tableB, by = "Subject")
Subject | CGluc | UGluc | SBP | DBP |
---|---|---|---|---|
4 | 102 | 0 | 106 | 89 |
6 | 101 | 0 | 78 | 49 |
tableC <- tableB %>% rename(Subj = Subject)
Subject | CGluc | UGluc |
---|---|---|
1 | 106 | 0 |
2 | 101 | 0 |
4 | 102 | 0 |
6 | 101 | 0 |
8 | 116 | 1 |
Subj | SBP | DBP |
---|---|---|
3 | 100 | 75 |
4 | 106 | 89 |
5 | 144 | 115 |
6 | 78 | 49 |
7 | 132 | 92 |
Subject | CGluc | UGluc |
---|---|---|
1 | 106 | 0 |
2 | 101 | 0 |
4 | 102 | 0 |
6 | 101 | 0 |
8 | 116 | 1 |
Subj | SBP | DBP |
---|---|---|
3 | 100 | 75 |
4 | 106 | 89 |
5 | 144 | 115 |
6 | 78 | 49 |
7 | 132 | 92 |
inner_join(tableA, tableC, by = c("Subject"="Subj"))
Subject | CGluc | UGluc | SBP | DBP |
---|---|---|---|---|
4 | 102 | 0 | 106 | 89 |
6 | 101 | 0 | 78 | 49 |
Subject | CGluc | UGluc |
---|---|---|
1 | 106 | 0 |
2 | 101 | 0 |
4 | 102 | 0 |
6 | 101 | 0 |
8 | 116 | 1 |
Subject | SBP | DBP |
---|---|---|
3 | 100 | 75 |
4 | 106 | 89 |
5 | 144 | 115 |
6 | 78 | 49 |
7 | 132 | 92 |
left_join(tableA, tableB, by = "Subject")
Subject | CGluc | UGluc | SBP | DBP |
---|---|---|---|---|
1 | 106 | 0 | NA | NA |
2 | 101 | 0 | NA | NA |
4 | 102 | 0 | 106 | 89 |
6 | 101 | 0 | 78 | 49 |
8 | 116 | 1 | NA | NA |
Subject | CGluc | UGluc |
---|---|---|
1 | 106 | 0 |
2 | 101 | 0 |
4 | 102 | 0 |
6 | 101 | 0 |
8 | 116 | 1 |
Subject | SBP | DBP |
---|---|---|
3 | 100 | 75 |
4 | 106 | 89 |
5 | 144 | 115 |
6 | 78 | 49 |
7 | 132 | 92 |
right_join(tableA, tableB, by = "Subject")
Subject | CGluc | UGluc | SBP | DBP |
---|---|---|---|---|
4 | 102 | 0 | 106 | 89 |
6 | 101 | 0 | 78 | 49 |
3 | NA | NA | 100 | 75 |
5 | NA | NA | 144 | 115 |
7 | NA | NA | 132 | 92 |
Subject | CGluc | UGluc |
---|---|---|
1 | 106 | 0 |
2 | 101 | 0 |
4 | 102 | 0 |
6 | 101 | 0 |
8 | 116 | 1 |
Subject | SBP | DBP |
---|---|---|
3 | 100 | 75 |
4 | 106 | 89 |
5 | 144 | 115 |
6 | 78 | 49 |
7 | 132 | 92 |
full_join(tableA, tableB, by = "Subject")
Subject | CGluc | UGluc | SBP | DBP |
---|---|---|---|---|
1 | 106 | 0 | NA | NA |
2 | 101 | 0 | NA | NA |
4 | 102 | 0 | 106 | 89 |
6 | 101 | 0 | 78 | 49 |
8 | 116 | 1 | NA | NA |
3 | NA | NA | 100 | 75 |
5 | NA | NA | 144 | 115 |
7 | NA | NA | 132 | 92 |
Subject | CGluc | UGluc |
---|---|---|
1 | 106 | 0 |
2 | 101 | 0 |
4 | 102 | 0 |
6 | 101 | 0 |
8 | 116 | 1 |
Subject | SBP | DBP |
---|---|---|
3 | 100 | 75 |
4 | 106 | 89 |
5 | 144 | 115 |
6 | 78 | 49 |
7 | 132 | 92 |
semi_join(tableA, tableB, by = "Subject")
Subject | CGluc | UGluc |
---|---|---|
4 | 102 | 0 |
6 | 101 | 0 |
Subject | CGluc | UGluc |
---|---|---|
1 | 106 | 0 |
2 | 101 | 0 |
4 | 102 | 0 |
6 | 101 | 0 |
8 | 116 | 1 |
Subject | SBP | DBP |
---|---|---|
3 | 100 | 75 |
4 | 106 | 89 |
5 | 144 | 115 |
6 | 78 | 49 |
7 | 132 | 92 |
anti_join(tableA, tableB, by = "Subject")
Subject | CGluc | UGluc |
---|---|---|
1 | 106 | 0 |
2 | 101 | 0 |
8 | 116 | 1 |
Only seen how to do a one-to-one join
One-to-many joins are common and important
Many-to-many joins should rarely be done!
R never coercies types – common variables must have same type!
Transposing/reshaping/restructuring data is an important tool
Let’s build a data set to use and look at two ways to restructure data:
Transposing/reshaping/restructuring data is an important tool
Let’s build a data set to use and look at two ways to restructure data:
tableC <- lab00 %>% select(Subject, ALB:DBili) slice_head(tableC, n=2)
## # A tibble: 2 × 6 ## Subject ALB AlkPhos ALT AST DBili ## <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> ## 1 1 4.4 80 20 22 0.11 ## 2 2 4.8 27 27 30 0.1
may see gather
in older code
pivot_longer
is updated version, handles more scenarios
taller <- tableC %>% pivot_longer(!Subject, #old columns to be reshaped names_to = "lbtest", #new column for old column names values_to = "result", #new column for old column values ) slice_head(taller, n=10)
## # A tibble: 10 × 3 ## Subject lbtest result ## <dbl> <chr> <dbl> ## 1 1 ALB 4.4 ## 2 1 AlkPhos 80 ## 3 1 ALT 20 ## 4 1 AST 22 ## 5 1 DBili 0.11 ## 6 2 ALB 4.8 ## 7 2 AlkPhos 27 ## 8 2 ALT 27 ## 9 2 AST 30 ## 10 2 DBili 0.1
spread
is opposite of gather
pivot_wider
is opposite of pivot_longer
wider <- taller %>% pivot_wider(names_from = lbtest, #what should new column names be? values_from = result, #what should new column values be? ) slice_head(wider, n=2)
## # A tibble: 2 × 6 ## Subject ALB AlkPhos ALT AST DBili ## <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> ## 1 1 4.4 80 20 22 0.11 ## 2 2 4.8 27 27 30 0.1
Concatenate lab00 through lab06 into a single data frame named lab
Concatenate vis00 through vis06 into a single data frame named vis
Combine the Lab and Visit data frames under the following conditions. In each case, ensure you do not duplicate any variables!
Version 1: Only patients with Visit data and Lab data are important
Version 2: All patients are important!
Version 3: Want Visit and Lab data for any patients who have Lab data
Version 4: We want a subset of the Lab data set based on whether the patient is not in the Visit data set
Advanced: Suppose you concatenate the results of semi_join(A,B)
and anti_join(A,B)
. How does the resulting table compare to A
and B
?
readr
package)dplyr
package)dplyr
package)Month <- 3 #Creating a new variable is easy :)
Month <- 3 #Creating a new variable is easy :)
lab03 %>% mutate(month = 3)
## # A tibble: 75 × 20 ## Subject VisitDate NotifDate ScreenFailReas Screen Sex ## <dbl> <dttm> <dttm> <chr> <dbl> <chr> ## 1 1 2018-04-24 00:00:00 NA <NA> 1 M ## 2 2 2018-04-26 00:00:00 NA <NA> 1 F ## 3 4 2018-04-19 00:00:00 NA <NA> 1 M ## 4 6 2018-04-26 00:00:00 NA <NA> 1 M ## 5 8 2018-04-25 00:00:00 NA <NA> 1 F ## 6 9 2018-04-30 00:00:00 NA <NA> 1 M ## 7 10 2018-04-22 00:00:00 NA <NA> 1 M ## 8 11 2018-04-27 00:00:00 NA <NA> 1 F ## 9 13 2018-04-27 00:00:00 NA <NA> 1 F ## 10 16 2018-04-26 00:00:00 NA <NA> 1 M ## # ℹ 65 more rows ## # ℹ 14 more variables: ALB <dbl>, AlkPhos <dbl>, ALT <dbl>, AST <dbl>, ## # DBili <dbl>, GGTP <dbl>, CGluc <dbl>, UGluc <dbl>, TBili <dbl>, Prot <dbl>, ## # Hemoglob <dbl>, Hematocr <dbl>, Preg <dbl>, month <dbl>
#I *really* encourage you to avoid overwriting objects! lab03v2 <- lab03 %>% mutate(month = 3) lab03v2 %>% select(Subject, month)
## # A tibble: 75 × 2 ## Subject month ## <dbl> <dbl> ## 1 1 3 ## 2 2 3 ## 3 4 3 ## 4 6 3 ## 5 8 3 ## 6 9 3 ## 7 10 3 ## 8 11 3 ## 9 13 3 ## 10 16 3 ## # ℹ 65 more rows
lab03v3 <- lab03 %>% mutate(CGratio = Hematocr/Hemoglob, CGRatioDefn = "Hematocrit/Hemoglobin") lab03v3 %>% select(Subject, starts_with("CGrat"))
## # A tibble: 75 × 3 ## Subject CGratio CGRatioDefn ## <dbl> <dbl> <chr> ## 1 1 3.33 Hematocrit/Hemoglobin ## 2 2 2.92 Hematocrit/Hemoglobin ## 3 4 2.77 Hematocrit/Hemoglobin ## 4 6 3.18 Hematocrit/Hemoglobin ## 5 8 2.71 Hematocrit/Hemoglobin ## 6 9 2.98 Hematocrit/Hemoglobin ## 7 10 2.5 Hematocrit/Hemoglobin ## 8 11 3.44 Hematocrit/Hemoglobin ## 9 13 3.08 Hematocrit/Hemoglobin ## 10 16 3.15 Hematocrit/Hemoglobin ## # ℹ 65 more rows
lab03 %>% transmute(CGratio = Hematocr/Hemoglob, CGRatioDefn = "Hematocrit/Hemoglobin")
## # A tibble: 75 × 2 ## CGratio CGRatioDefn ## <dbl> <chr> ## 1 3.33 Hematocrit/Hemoglobin ## 2 2.92 Hematocrit/Hemoglobin ## 3 2.77 Hematocrit/Hemoglobin ## 4 3.18 Hematocrit/Hemoglobin ## 5 2.71 Hematocrit/Hemoglobin ## 6 2.98 Hematocrit/Hemoglobin ## 7 2.5 Hematocrit/Hemoglobin ## 8 3.44 Hematocrit/Hemoglobin ## 9 3.08 Hematocrit/Hemoglobin ## 10 3.15 Hematocrit/Hemoglobin ## # ℹ 65 more rows
mutate
creates new columns AND attaches them to the source data frame
transmute
turns old columns into new columns
Sometimes we need to create a variable based on certain conditions
Basic concept: if-then/else
logic
if (if-condition) { action-statement }
if-condtion
is the logical condition applied to the recordaction-statement
is what you want R to do when if-condition
is TRUEif
and then
just like English syntaxif-then
Examplename <- "Jonathan" if (name == "Jonathan") { fuzzy <- "Yes" } fuzzy
## [1] "Yes"
if-then/else
Examplesname <- "Fozzy Bear" if (name == "Jonathan") { fuzzy <- "Yes" } else { fuzzy <- "Unknown" } fuzzy
## [1] "Unknown"
name <- "Fozzy Bear" if (name == "Jonathan") { fuzzy <- "Yes" } else if (name == "Fozzy Bear") { fuzzy <- "Fozzy, not fuzzy" } else { fuzzy <- "Unknown" } fuzzy
## [1] "Fozzy, not fuzzy"
name <- "Fozzy Bear" if ((name == "Jonathan") | (str_detect(name,"Fozzy"))) { fuzzy <- "Yes" } else if (name == "Fozzy Bear") { fuzzy <- "Fozzy, not fuzzy" } else { fuzzy <- "Unknown" } fuzzy
## [1] "Yes"
str_detect
searches name
for the string “Fozzy”if (lab00$ALB < 4.0 | lab00$ALB > 5.4){ ScreenFlag <- "ALB" }
## Error in if (lab00$ALB < 4 | lab00$ALB > 5.4) {: the condition has length > 1
if-then
cannot handle more than one valuemean
and median
accepted vectors#Vectorized if-then is done with ifelse functions ifelse(condition-to-apply-to-vector, action-if-true, action-if-false)
ifelse(lab00$ALB < 4.0 | lab00$ALB > 5.4, "HighLow", "Normal")
## [1] "Normal" "Normal" "Normal" "Normal" "Normal" "Normal" "Normal" ## [8] "Normal" "HighLow" "Normal" "HighLow" "HighLow" "HighLow" "Normal" ## [15] "Normal" "HighLow" "HighLow" "Normal" "Normal" "Normal" "Normal" ## [22] "Normal" "Normal" "Normal" "Normal" "HighLow" "Normal" "Normal" ## [29] "Normal" "Normal" "Normal" "HighLow" "Normal" "Normal" "Normal" ## [36] "Normal" "Normal" "Normal" "Normal" "Normal" "Normal" "HighLow" ## [43] "Normal" "HighLow" "Normal" "Normal" "Normal" "Normal" "Normal" ## [50] "Normal" "Normal" "Normal" "HighLow" "HighLow" "Normal" "Normal" ## [57] "Normal" "Normal" "HighLow" "HighLow" "Normal" "Normal" "Normal" ## [64] "Normal" "Normal" "Normal" "HighLow" "HighLow" "Normal" "Normal" ## [71] "Normal" "HighLow" "Normal" "Normal" "Normal" "Normal" "Normal"
ifelse
#Can still have multiple conditions ifelse(lab00$ALB < 4.0, "Low", ifelse(lab00$ALB > 5.4, "High", "Normal"))
## [1] "Normal" "Normal" "Normal" "Normal" "Normal" "Normal" "Normal" "Normal" ## [9] "Low" "Normal" "Low" "Low" "Low" "Normal" "Normal" "Low" ## [17] "Low" "Normal" "Normal" "Normal" "Normal" "Normal" "Normal" "Normal" ## [25] "Normal" "Low" "Normal" "Normal" "Normal" "Normal" "Normal" "Low" ## [33] "Normal" "Normal" "Normal" "Normal" "Normal" "Normal" "Normal" "Normal" ## [41] "Normal" "Low" "Normal" "Low" "Normal" "Normal" "Normal" "Normal" ## [49] "Normal" "Normal" "Normal" "Normal" "Low" "High" "Normal" "Normal" ## [57] "Normal" "Normal" "Low" "Low" "Normal" "Normal" "Normal" "Normal" ## [65] "Normal" "Normal" "Low" "Low" "Normal" "Normal" "Normal" "Low" ## [73] "Normal" "Normal" "Normal" "Normal" "Normal"
#chain and transmute or mutate as needed lab00 %>% transmute(AlbFlag = ifelse(lab00$ALB < 4.0, "Low", ifelse(lab00$ALB > 5.4, "High", "Normal")))
## # A tibble: 77 × 1 ## AlbFlag ## <chr> ## 1 Normal ## 2 Normal ## 3 Normal ## 4 Normal ## 5 Normal ## 6 Normal ## 7 Normal ## 8 Normal ## 9 Low ## 10 Normal ## # ℹ 67 more rows
#chain and transmute or mutate as needed lab00 %>% transmute(AlbFlag = ifelse(lab00$ALB < 4.0, "Low", ifelse(lab00$ALB > 5.4, "High", ifelse(lab00$ALB >= 4.0 & lab00$ALB <= 5.4, "Normal", "INVALID"))))
## # A tibble: 77 × 1 ## AlbFlag ## <chr> ## 1 Normal ## 2 Normal ## 3 Normal ## 4 Normal ## 5 Normal ## 6 Normal ## 7 Normal ## 8 Normal ## 9 Low ## 10 Normal ## # ℹ 67 more rows
Create variables unconditionally as constant or from other variables
Attach with mutate
or view on their own with transmute
Vectorized functions accept vectors as arguments
Vectorized logic with ifelse
Caution: Avoid “miscelaneous” category also catching one of your expected categories!
Combining logic with filter
, arrange
, slice
, *_join
, bind_rows
allows for lots of data manipulation
cut00 <- as.Date("2018-04-19") #Approximate last date of Baseline Visits cut03 <- as.Date("2018-07-21") #Approximate last date of 3 Month Visits cut06 <- as.Date("2018-10-21") #Approximate last date of 6 Month Visits
VisitDate = 2018-10-20
should get VisitNum = 6
NA
for records with dates not covered by these three cutoff dates
readr
package)dplyr
package)dplyr
package)tidyverse
Functionssummarise
applies a summary function to your data
group_by
collects data into groups for separate analyses
ungroup
#Basic syntax of summarise summarise(NewVarName = summary-function(<variable>))
lab03 %>% summarise(meanALB = mean(ALB), medianALB = median(ALB), count = n())
## # A tibble: 1 × 3 ## meanALB medianALB count ## <dbl> <dbl> <int> ## 1 4.39 4.4 75
#Note: using lab00, lab03, and lab06 lab <- lab00 %>% bind_rows(lab03) %>% bind_rows(lab06) #From Activity #2 lab %>% group_by(Subject) %>% summarise(meanALB = mean(ALB), medianALB = median(ALB), count = n())
## # A tibble: 77 × 4 ## Subject meanALB medianALB count ## <dbl> <dbl> <dbl> <int> ## 1 1 4.4 4.4 3 ## 2 2 4.7 4.7 3 ## 3 4 4.25 4.25 2 ## 4 6 4.4 4.4 3 ## 5 8 4.1 4.1 3 ## 6 9 4.37 4.4 3 ## 7 10 5 5 3 ## 8 11 4.03 4 3 ## 9 13 3.53 3.5 3 ## 10 16 4.63 4.6 3 ## # ℹ 67 more rows
lab %>% group_by(Sex) %>% summarise(count = n(), p05 = quantile(ALB, 0.05), p15 = quantile(ALB, 0.15), p25 = quantile(ALB, 0.25), p50 = quantile(ALB, 0.50), p75 = quantile(ALB, 0.75), p85 = quantile(ALB, 0.85), p95 = quantile(ALB, 0.95), IQR = p75-p25)
## # A tibble: 2 × 10 ## Sex count p05 p15 p25 p50 p75 p85 p95 IQR ## <chr> <int> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> ## 1 F 102 3.40 3.83 4.1 4.6 4.8 4.9 5.09 0.7 ## 2 M 122 3.4 3.6 4.1 4.4 4.77 4.98 5.2 0.675
Gets tedious for lots of variables and/or statistics
Applying to a group of variables? → use apply
family of functions
Three choices: apply
, lapply
, sapply
apply
#Example syntax apply(X = columns-of-interest, MARGIN = type-of-summary, FUN= function-of-interest, na.rm = your-choice)
apply(select(lab,ALB:CGluc), MARGIN = 2, FUN = summary, na.rm = TRUE)
## ALB AlkPhos ALT AST DBili GGTP CGluc ## Min. 2.900000 22.8000 2.00000 7.70000 -0.01 -0.30000 95.7000 ## 1st Qu. 4.100000 62.5500 16.00000 18.17500 0.11 19.00000 101.4750 ## Median 4.500000 84.1500 21.05000 22.85000 0.15 28.00000 105.0000 ## Mean 4.391518 85.5433 21.30625 22.63661 0.16 27.79509 105.3312 ## 3rd Qu. 4.800000 108.0000 27.00000 27.15000 0.22 37.12500 109.0000 ## Max. 6.000000 149.3000 39.50000 37.40000 0.41 49.50000 118.6000
MARGIN=
apply(select(lab[1:5,],ALB:CGluc), MARGIN = 1, FUN = summary, na.rm = TRUE)
## [,1] [,2] [,3] [,4] [,5] ## Min. 0.11000 0.1 0.21000 0.25000 0.3 ## 1st Qu. 12.20000 7.9 6.15000 12.20000 5.6 ## Median 22.00000 27.0 19.00000 24.00000 20.0 ## Mean 39.50143 28.7 35.21571 41.66429 34.5 ## 3rd Qu. 62.00000 28.5 56.50000 63.00000 47.0 ## Max. 106.00000 101.0 102.00000 117.00000 116.0
Pearson Correlation!
lab %>% select(ALB:AST) %>% cor()
## ALB AlkPhos ALT AST ## ALB 1.000000000 0.12051637 0.07624307 -0.004772245 ## AlkPhos 0.120516369 1.00000000 0.02127493 0.136677803 ## ALT 0.076243075 0.02127493 1.00000000 0.025933663 ## AST -0.004772245 0.13667780 0.02593366 1.000000000
lab %>% select(ALB:AST) %>% cor(method = "spearman")
## ALB AlkPhos ALT AST ## ALB 1.00000000 0.120722030 0.048230339 0.065368590 ## AlkPhos 0.12072203 1.000000000 -0.004754992 0.172736863 ## ALT 0.04823034 -0.004754992 1.000000000 -0.004662076 ## AST 0.06536859 0.172736863 -0.004662076 1.000000000
install.packages("Hmisc") library(Hmisc)
rcorr
function which computes correlations and p-valuesrcorr
lab %>% select(ALB:AST) %>% as.matrix() %>% rcorr()
## ALB AlkPhos ALT AST ## ALB 1.00 0.12 0.08 0.00 ## AlkPhos 0.12 1.00 0.02 0.14 ## ALT 0.08 0.02 1.00 0.03 ## AST 0.00 0.14 0.03 1.00 ## ## n= 224 ## ## ## P ## ALB AlkPhos ALT AST ## ALB 0.0718 0.2558 0.9434 ## AlkPhos 0.0718 0.7515 0.0410 ## ALT 0.2558 0.7515 0.6995 ## AST 0.9434 0.0410 0.6995
More math and stats functions than you can shake a stick at
Approach: know what analysis you should do, find the function that does that analysis, read the help page, implement the function
tidyverse
is wonderful but it doesn’t have everything (yet!)apply
family good for working with multiple variablesWant to do more? Need to expand your skills!
do
loops allow you to carry out repetitive actionstidyverse
functions to get the overall mean, standard deviation (which is square root of variance), median, 25th, and 75th percentile for SBP from the vis data frame
Thank you!
Questions?