Powering up! Descriptive Statistics and Inferential Tests

Descriptive and inferential statistics form the bedrock of any complex analysis. Descriptive statistics summarize raw data and provide a snapshot of the sample’s features, revealing trends, patterns, and distributions, essential for making the data comprehensible. Inferential statistics allow for conclusions or predictions about a larger population from the sampled data, providing the necessary basis for hypothesis testing. Together, they provide an initial understanding of data, and an informed context for the application of more complex statistical or machine learning techniques.

As we mentioned in the Introduction- there are many ways to skin a cat in R.

`tidyverse()` for Describing data

As we learned yesterday- the tidyverse has revolutionised data wranggling and can be extended likewise into the realm of descriptive statistics. Creating tables with dplyr functions summarise() and count() is a useful approach to calculating summary statistics, summarize by group, or pass tables to ggplot() or flextable(). In yesterdays tutorial we briefly did visit this, but we will extend on this in the next 10 minutes or so before transitioning into a simpler more efficient way of describing data in R.

Lets get some important packages loaded

required_packages <- c("tidyverse", "lubridate", "gtsummary", "rstatix", "janitor", "corrr")
not_installed <- required_packages[!(required_packages %in% installed.packages()[ , "Package"])]    
if(length(not_installed)) install.packages(not_installed)                                           
suppressWarnings(lapply(required_packages, require, character.only = TRUE))

Loading required package: tidyverse

── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
✔ dplyr     1.1.2     ✔ readr     2.1.4
✔ forcats   1.0.0     ✔ stringr   1.5.0
✔ ggplot2   3.4.2     ✔ tibble    3.2.1
✔ lubridate 1.9.2     ✔ tidyr     1.3.0
✔ purrr     1.0.1     
── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
✖ dplyr::filter() masks stats::filter()
✖ dplyr::lag()    masks stats::lag()
ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
Loading required package: gtsummary

Loading required package: rstatix


Attaching package: 'rstatix'


The following object is masked from 'package:stats':

    filter


Loading required package: janitor


Attaching package: 'janitor'


The following object is masked from 'package:rstatix':

    make_clean_names


The following objects are masked from 'package:stats':

    chisq.test, fisher.test


Loading required package: corrr

Importing some data

Lets call in the data:

c19_df <- read.csv("https://raw.githubusercontent.com/MoH-Malaysia/covid19-public/main/epidemic/linelist/linelist_deaths.csv")

Just to be consistent we what I said yesterday- before diving into the data always always skim the data first to get a quick feels

c19_df %>% 
  skimr::skim()

Data summary
Name	Piped data
Number of rows	37152
Number of columns	15
_______________________
Column type frequency:
character	10
numeric	5
________________________
Group variables	None

Variable type: character

skim_variable	complete_rate	min	max	empty	n_unique
date	1	10	10	0	1004
date_announced	1	10	10	0	1000
date_positive	1	10	10	0	999
date_dose1	1	0	10	22437	298
date_dose2	1	0	10	28034	267
date_dose3	1	0	10	35719	161
brand1	1	0	16	22437	8
brand2	1	0	16	28034	7
brand3	1	0	16	35719	6
state	1	5	17	0	16

Variable type: numeric

skim_variable	complete_rate	mean	sd	p25	p50	p75	p100	hist
age	1	62.65	16.59	51	64	75	130	▁▃▇▃▁
male	1	0.58	0.49	0	1	1	1	▆▁▁▁▇
bid	1	0.21	0.41	0	0	0	1	▇▁▁▁▂
malaysian	1	0.89	0.31	1	1	1	1	▁▁▁▁▇
comorb	1	0.79	0.41	1	1	1	1	▂▁▁▁▇

Get counts

The most simple function to apply within summarise() is n(). Leave the parentheses empty to count the number of rows. You may have seen this being used several times yesterday. Let

c19_df %>%                 # begin with linelist
  summarise(n_rows = n())    # return new summary dataframe with column n_rows

  n_rows
1  37152

Lets try and stratify that by nationality and BID status:

c19_df %>% 
  group_by(malaysian, bid) %>%     # group data by unique values in column age_cat
  summarise(n_rows = n())   # return number of rows *per group*

`summarise()` has grouped output by 'malaysian'. You can override using the
`.groups` argument.

# A tibble: 4 × 3
# Groups:   malaysian [2]
  malaysian   bid n_rows
      <int> <int>  <int>
1         0     0   1958
2         0     1   2076
3         1     0  27288
4         1     1   5830

The above command can be shortened by using the count() function instead. count() does the following:

Groups the data by the columns provided to it
Summarises them with n() (creating column n)
Un-groups the data

c19_df %>%
  count(malaysian, bid)

  malaysian bid     n
1         0   0  1958
2         0   1  2076
3         1   0 27288
4         1   1  5830

Proportions

Proportions can be added by piping the table to mutate() to create a new column. Define the new column as the counts column (n by default) divided by the sum() of the counts column (this will return a proportion).

bid_summary <- c19_df %>% 
  count(malaysian, bid) %>%                     # group and count by gender (produces "n" column)
  mutate(                                # create percent of column - note the denominator
    percent = round((n / sum(n))*100,2)) 

# print
bid_summary

  malaysian bid     n percent
1         0   0  1958    5.27
2         0   1  2076    5.59
3         1   0 27288   73.45
4         1   1  5830   15.69

Tip

Using these structure we can very easily modify these summary statistics using ggplot() or html tables kable() or presentation ready tables using flextable() (flextable is not covered in this course but you can check it out here). An example of a plot is as follows:

c19_df %>% 
  count(malaysian, bid) %>%                     
  mutate(percent = round((n / sum(n))*100,2)) %>%     
  ggplot()+                       # pass new data frame to ggplot
    geom_col(                     # create bar plot
      mapping = aes(   
        x = malaysian,            # map outcome to x-axis
        fill = bid,               # map age_cat to the fill
        y = percent))             # map the counts column `n` to the height

Or a nice little table summary:

c19_df %>% 
  count(malaysian, bid) %>%
  mutate(malaysian = factor(malaysian,
                               levels=c("0","1"),
                               labels=c("non-Malaysian", "Malaysian")),
         bid = factor(bid,
                               levels=c("0","1"),
                               labels=c("Hospital", "BID")),) %>%
  mutate(percent = round((n / sum(n))*100,2))%>%
  knitr::kable(format="html", caption = "COVID-19 fatalities by Nationality and Place of Death",) %>% kableExtra::kable_minimal()

COVID-19 fatalities by Nationality and Place of Death
malaysian	bid	n	percent
non-Malaysian	Hospital	1958	5.27
non-Malaysian	BID	2076	5.59
Malaysian	Hospital	27288	73.45
Malaysian	BID	5830	15.69

Summary statistics

One major advantage of dplyr and summarise() is the ability to return more advanced statistical summaries like median(), mean(), max(), min(), sd() (standard deviation), and percentiles. You can also use sum() to return the number of rows that meet certain logical criteria. As above, these outputs can be produced for the whole data frame set, or by group.

The syntax is the same - within the summarise() parentheses you provide the names of each new summary column followed by an equals sign and a statistical function to apply. Within the statistical function, give the column(s) to be operated on and any relevant arguments (e.g. na.rm = TRUE for most mathematical functions).

You can also use sum() to return the number of rows that meet a logical criteria. The expression within is counted if it evaluates to TRUE. For example:

sum(age_years < 18, na.rm=T)
sum(gender == "male", na.rm=T)
sum(response %in% c("Likely", "Very Likely"))

Below, c19_df data are summarised to describe the days delay from death to announcement (column days_death_state), by state.

c19_df %>%                 # begin with linelist, save out as new object
  group_by(state) %>%      # group all calculations by hospital
  mutate(across(contains("date"), ~as.Date(., format = "%Y-%m-%d")), #change character to dates\ fromat
         days_death_state=date_announced-date) %>% # calculate the delay for each death
  summarise(                                                         # only the below summary columns will be returned
    deaths       = n(),                                                # number of rows per group
    delay_max   = max(days_death_state, na.rm = T),                    # max delay
    delay_mean  = round(mean(days_death_state, na.rm=T), digits = 1),  # mean delay, rounded
    delay_sd    = round(sd(days_death_state, na.rm = T), digits = 1),  # standard deviation of delays, rounded
    delay_3     = sum(days_death_state >= 3, na.rm = T),               # number of rows with delay of 3 or more days
    pct_delay_3 = scales::percent(delay_3 / deaths)                    # convert previously-defined delay column to percent (scales gives the % sign behind)
  )

# A tibble: 16 × 7
   state             deaths delay_max delay_mean delay_sd delay_3 pct_delay_3
   <chr>              <int> <drtn>    <drtn>        <dbl>   <int> <chr>      
 1 Johor               4740 253 days   4.8 days      11.8    1987 42%        
 2 Kedah               2756 338 days  10.0 days      18.6    1531 56%        
 3 Kelantan            1428 153 days   7.5 days      12.5     920 64%        
 4 Melaka              1213 386 days   5.2 days      15       460 38%        
 5 Negeri Sembilan     1546 274 days   8.7 days      19.5     659 43%        
 6 Pahang              1037 325 days   5.2 days      19       392 38%        
 7 Perak               2164 178 days   4.0 days       9.4    1029 48%        
 8 Perlis               199  16 days   2.5 days       2.4      67 34%        
 9 Pulau Pinang        2085 396 days   3.7 days      10.9     958 46%        
10 Sabah               3211 237 days   4.9 days       8.7    2062 64%        
11 Sarawak             1795 167 days   4.5 days       9.6     900 50%        
12 Selangor           11024 370 days  17.5 days      23.5    8688 79%        
13 Terengganu           905 150 days   3.2 days       7       421 47%        
14 W.P. Kuala Lumpur   2861 315 days  11.0 days      21.3    2026 71%        
15 W.P. Labuan          159  14 days   1.3 days       1.3      13 8%         
16 W.P. Putrajaya        29 388 days  20.9 days      71.6      14 48%

Tip

Use sum() with a logic statement to “count” rows that meet certain criteria (==)
Note the use of na.rm = TRUE within mathematical functions like sum(), otherwise NA will be returned if there are any missing values
Use the function percent() from the scales package to easily convert to percents
- Set accuracy = to 0.1 or 0.01 to ensure 1 or 2 decimal places respectively
Use round() from base R to specify decimals
To calculate these statistics on the entire dataset, use summarise() without group_by()
You may create columns for the purposes of later calculations (e.g. denominators) that you eventually drop from your data frame with select().

Conditional statistics

You may want to return conditional statistics - e.g. the maximum of rows that meet certain criteria. This can be done by subsetting the column with brackets [ ].

c19_df %>% 
  group_by(state) %>% 
  summarise(
    max_age_msian = median(age[malaysian == "1"], na.rm = T),
    max_age_non_msian = median(age[malaysian == "0"], na.rm = T)
  )

# A tibble: 16 × 3
   state             max_age_msian max_age_non_msian
   <chr>                     <dbl>             <dbl>
 1 Johor                      64                45  
 2 Kedah                      65                45  
 3 Kelantan                   68                58  
 4 Melaka                     64                42.5
 5 Negeri Sembilan            67                45  
 6 Pahang                     65                46  
 7 Perak                      70                46  
 8 Perlis                     71                44  
 9 Pulau Pinang               70                45  
10 Sabah                      69                59  
11 Sarawak                    71                42  
12 Selangor                   63                47  
13 Terengganu                 67.5              50  
14 W.P. Kuala Lumpur          66                47  
15 W.P. Labuan                61                56  
16 W.P. Putrajaya             68                NA

Percentiles

Percentiles and quantiles in dplyr deserve a special mention. To return quantiles, use quantile() with the defaults or specify the value(s) you would like with probs =.

# get default percentile values of age (0%, 25%, 50%, 75%, 100%)
c19_df %>% 
  summarise(age_percentiles = quantile(age, na.rm = TRUE))

Warning: Returning more (or less) than 1 row per `summarise()` group was deprecated in
dplyr 1.1.0.
ℹ Please use `reframe()` instead.
ℹ When switching from `summarise()` to `reframe()`, remember that `reframe()`
  always returns an ungrouped data frame and adjust accordingly.

  age_percentiles
1               0
2              51
3              64
4              75
5             130

Or manually defined percentiles that are grouped

# get manually-specified percentile values of age (5%, 50%, 75%, 98%)
c19_df %>% 
  group_by(malaysian) %>%
  summarise(
    age_percentiles = quantile(
      age,
      probs = c(.05, 0.5, 0.75, 0.98), 
      na.rm=TRUE)
    )

Warning: Returning more (or less) than 1 row per `summarise()` group was deprecated in
dplyr 1.1.0.
ℹ Please use `reframe()` instead.
ℹ When switching from `summarise()` to `reframe()`, remember that `reframe()`
  always returns an ungrouped data frame and adjust accordingly.

`summarise()` has grouped output by 'malaysian'. You can override using the
`.groups` argument.

# A tibble: 8 × 2
# Groups:   malaysian [2]
  malaysian age_percentiles
      <int>           <dbl>
1         0              30
2         0              48
3         0              56
4         0              86
5         1              35
6         1              66
7         1              76
8         1              92

Tip

Do keep in mind that there any many ways to skin the cat! And always there will be more efficient ways to do things as you progress through R- Here is an example from the rstatix package

c19_df %>% 
  group_by(malaysian) %>%
  rstatix::get_summary_stats(age, type = "quantile")

# A tibble: 2 × 8
  malaysian variable     n  `0%` `25%` `50%` `75%` `100%`
      <int> <fct>    <dbl> <dbl> <dbl> <dbl> <dbl>  <dbl>
1         0 age       4034     0    41    48    56    130
2         1 age      33118     0    54    66    76    110

`across()` multiple columns

You can use summarise() across multiple columns using across(). This makes life easier when you want to calculate the same statistics for many columns. Place across() within summarise() and specify the following:

.cols = as either a vector of column names c() or “tidyselect” helper functions (explained below)
.fns = the function to perform (no parentheses) - you can provide multiple within a list()

c19_df %>% 
  group_by(state) %>% 
  mutate(across(contains("date"), ~as.Date(., format = "%Y-%m-%d")), #change character to dates\ fromat
         days_toAnnounce_state=date_announced-date,
         day_toDeath_state=date-date_positive) %>%
  summarise(across(.cols = c(day_toDeath_state, days_toAnnounce_state), # columns
                   .fns = list("mean" = mean, "sd" = sd),    # multiple functions 
                   na.rm=T))                                 # extra arguments

Warning: There was 1 warning in `summarise()`.
ℹ In argument: `across(...)`.
ℹ In group 1: `state = "Johor"`.
Caused by warning:
! The `...` argument of `across()` is deprecated as of dplyr 1.1.0.
Supply arguments directly to `.fns` through an anonymous function instead.

  # Previously
  across(a:b, mean, na.rm = TRUE)

  # Now
  across(a:b, \(x) mean(x, na.rm = TRUE))

# A tibble: 16 × 5
   state      day_toDeath_state_mean day_toDeath_state_sd days_toAnnounce_stat…¹
   <chr>      <drtn>                                <dbl> <drtn>                
 1 Johor      5.563924 days                          9.17  4.796414 days        
 2 Kedah      6.058418 days                          6.43  9.989115 days        
 3 Kelantan   5.427871 days                          6.74  7.468487 days        
 4 Melaka     7.693322 days                          9.28  5.209398 days        
 5 Negeri Se… 6.683700 days                          6.99  8.657827 days        
 6 Pahang     7.459016 days                         13.1   5.176471 days        
 7 Perak      4.489372 days                          7.60  3.978281 days        
 8 Perlis     6.015075 days                         10.0   2.457286 days        
 9 Pulau Pin… 4.852278 days                          4.75  3.738129 days        
10 Sabah      5.930551 days                         15.8   4.892245 days        
11 Sarawak    5.973816 days                          9.05  4.523120 days        
12 Selangor   7.434960 days                         12.1  17.509434 days        
13 Terengganu 5.896133 days                          6.78  3.240884 days        
14 W.P. Kual… 6.166375 days                         12.1  11.021671 days        
15 W.P. Labu… 6.232704 days                          6.47  1.314465 days        
16 W.P. Putr… 7.862069 days                          7.29 20.896552 days        
# ℹ abbreviated name: ¹days_toAnnounce_state_mean
# ℹ 1 more variable: days_toAnnounce_state_sd <dbl>

Here are those “tidyselect” helper functions you can provide to .cols = to select columns:

everything() - all other columns not mentioned
last_col() - the last column
where() - applies a function to all columns and selects those which are TRUE
starts_with() - matches to a specified prefix. Example: starts_with("date")
ends_with() - matches to a specified suffix. Example: ends_with("_end")
contains() - columns containing a character string. Example: contains("time")
matches() - to apply a regular expression (regex). Example: contains("[pt]al")
num_range() -
any_of() - matches if column is named. Useful if the name might not exist. Example: any_of(date_onset, date_death, cardiac_arrest)

c19_df %>% 
  group_by(state) %>%
  summarise(across(
    .cols = where(is.numeric),  # all numeric columns in the data frame
    .fns = mean,
    na.rm=T))

# A tibble: 16 × 6
   state               age  male   bid malaysian comorb
   <chr>             <dbl> <dbl> <dbl>     <dbl>  <dbl>
 1 Johor              60.9 0.558 0.150     0.923  0.802
 2 Kedah              63.1 0.536 0.191     0.975  0.812
 3 Kelantan           66.7 0.505 0.233     0.978  0.852
 4 Melaka             62.2 0.564 0.166     0.964  0.823
 5 Negeri Sembilan    64.5 0.563 0.133     0.948  0.829
 6 Pahang             63.2 0.606 0.149     0.959  0.808
 7 Perak              66.9 0.577 0.187     0.976  0.859
 8 Perlis             68.1 0.568 0.121     0.980  0.915
 9 Pulau Pinang       66.0 0.584 0.227     0.911  0.791
10 Sabah              65.4 0.601 0.379     0.800  0.799
11 Sarawak            68.1 0.581 0.199     0.987  0.930
12 Selangor           59.4 0.592 0.216     0.830  0.740
13 Terengganu         64.9 0.534 0.145     0.981  0.871
14 W.P. Kuala Lumpur  61.7 0.585 0.256     0.799  0.659
15 W.P. Labuan        59.3 0.591 0.277     0.824  0.686
16 W.P. Putrajaya     64.7 0.690 0.103     1      0.793

`gtsummary()` package

Descriptives statistics approaches in R are numerous.

I initially heavily utilised dplyr and the janitor (you can find a tutorial here) and tableone (you can find a tutorial here) packages which are both fantastic packages. More recently however, I discovered gtsummary. And lets just say its the bomb! Its my absolutely favourite package for descriptive analysis (and we will explore some of its other powerful extensions later).

If you want to print your summary statistics in a pretty, publication-ready graphic, you can use the gtsummary package and its function tbl_summary(). The code can seem complex at first, but the outputs look very nice and print to your RStudio Viewer panel as an HTML image. s

Summary table

The default behavior of tbl_summary() is quite incredible - it takes the columns you provide and creates a summary table in one command. The function prints statistics appropriate to the column class: median and inter-quartile range (IQR) for numeric columns, and counts (%) for categorical columns. Missing values are converted to “Unknown”. Footnotes are added to the bottom to explain the statistics, while the total N is shown at the top.

c19_df %>% 
  select(age, state, male, malaysian, bid) %>%  # keep only the columns of interest
  tbl_summary()                                 # default

Characteristic	N = 37,152¹
age	64 (51, 75)
state
Johor	4,740 (13%)
Kedah	2,756 (7.4%)
Kelantan	1,428 (3.8%)
Melaka	1,213 (3.3%)
Negeri Sembilan	1,546 (4.2%)
Pahang	1,037 (2.8%)
Perak	2,164 (5.8%)
Perlis	199 (0.5%)
Pulau Pinang	2,085 (5.6%)
Sabah	3,211 (8.6%)
Sarawak	1,795 (4.8%)
Selangor	11,024 (30%)
Terengganu	905 (2.4%)
W.P. Kuala Lumpur	2,861 (7.7%)
W.P. Labuan	159 (0.4%)
W.P. Putrajaya	29 (<0.1%)
male	21,369 (58%)
malaysian	33,118 (89%)
bid	7,906 (21%)
¹ Median (IQR); n (%)

Adjustments

by =
You can stratify your table by a column (e.g. by outcome), creating a 2-way table.

statistic =
Use an equations to specify which statistics to show and how to display them. There are two sides to the equation, separated by a tilde ~. On the right side, in quotes, is the statistical display desired, and on the left are the columns to which that display will apply.

c19_df %>% 
  select(age) %>%               # keep only columns of interest 
  tbl_summary(                  # create summary table
    statistic = age ~ "{mean} ({sd})") # print mean of age

Characteristic	N = 37,152¹
age	63 (17)
¹ Mean (SD)

digits =
Adjust the digits and rounding. Optionally, this can be specified to be for continuous columns only (as below).

label =
Adjust how the column name should be displayed. Provide the column name and its desired label separated by a tilde. The default is the column name.

missing_text =
Adjust how missing values are displayed. The default is “Unknown”.

type =
This is used to adjust how many levels of the statistics are shown. The syntax is similar to statistic = in that you provide an equation with columns on the left and a value on the right. Two common scenarios include:

type = all_categorical() ~ "categorical" Forces dichotomous columns (e.g. fever yes/no) to show all levels instead of only the “yes” row
type = all_continuous() ~ "continuous2" Allows multi-line statistics per variable, as shown in a later section

c19_df %>% 
  mutate(across(contains("date"), ~as.Date(., format = "%Y-%m-%d")), #change character to dates\ fromat
         days_delay=date_announced-date,
         days_admitted=date-date_positive,
         vaccinated=ifelse(is.na(date_dose2), "unvaccinated", "vaccinated")) %>%
  select(age, male, malaysian, bid, vaccinated,
         comorb, days_delay, days_admitted) %>% # keep only columns of interest
  tbl_summary(     
    by = malaysian,                                               # stratify entire table by outcome
    statistic = list(all_continuous() ~ "{mean} ({sd})",        # stats and format for continuous columns
                     all_categorical() ~ "{n} / {N} ({p}%)"),   # stats and format for categorical columns
    digits = all_continuous() ~ 1,                              # rounding for continuous columns
    type   = all_categorical() ~ "categorical",                 # force all categorical levels to display
    label  = list(                                              # display labels for column names
      malaysian      ~ "Nationality",                           
      age            ~ "Age (years)",
      male           ~ "Gender",
      bid            ~ "Brought-in-dead",
      comorb         ~ "Comorbids",
      vaccinated     ~ "Vaccine status",
      days_admitted  ~ "Duration between diagnosis and death (days) ",
      days_delay     ~ "Duration between death and announcement (days)"),
    missing_text = "NA"                                    # how missing values should display
  )

Characteristic	0, N = 4,034¹	1, N = 33,118¹
Age (years)	49.5 (14.3)	64.3 (16.1)
Gender
0	1,649 / 4,034 (41%)	14,134 / 33,118 (43%)
1	2,385 / 4,034 (59%)	18,984 / 33,118 (57%)
Brought-in-dead
0	1,958 / 4,034 (49%)	27,288 / 33,118 (82%)
1	2,076 / 4,034 (51%)	5,830 / 33,118 (18%)
Vaccine status
unvaccinated	3,808 / 4,034 (94%)	24,226 / 33,118 (73%)
vaccinated	226 / 4,034 (5.6%)	8,892 / 33,118 (27%)
Comorbids
0	2,175 / 4,034 (54%)	5,717 / 33,118 (17%)
1	1,859 / 4,034 (46%)	27,401 / 33,118 (83%)
Duration between death and announcement (days)	15.0 (19.4)	8.9 (18.2)
Duration between diagnosis and death (days)	4.1 (14.6)	6.6 (10.0)
¹ Mean (SD); n / N (%)

Multi-line stats for continuous variables

If you want to print multiple lines of statistics for continuous variables, you can indicate this by setting the type = to “continuous2”. You can combine all of the previously shown elements in one table by choosing which statistics you want to show. To do this you need to tell the function that you want to get a table back by entering the type as “continuous2”. The number of missing values is shown as “Unknown”.

c19_df %>%
  mutate(across(contains("date"), ~as.Date(., format = "%Y-%m-%d")), #change character to dates\ fromat
         days_delay=date_announced-date,
         days_admitted=date-date_positive) %>%
  select(age, days_delay, days_admitted) %>% # keep only columns of interest
  tbl_summary(                               # create summary table
    type = all_continuous() ~ "continuous2", # indicate that you want to print multiple statistics 
    statistic = all_continuous() ~ c(
      "{mean} ({sd})",                       # line 1: mean and SD
      "{median} ({p25}, {p75})",             # line 2: median and IQR
      "{min}, {max}")                        # line 3: min and max
    )

Characteristic	N = 37,152
age
Mean (SD)	63 (17)
Median (IQR)	64 (51, 75)
Range	0, 130
days_delay
Mean (SD)	10 (18)
Median (IQR)	3 (2, 7)
Range	0, 396
days_admitted
Mean (SD)	6 (11)
Median (IQR)	4 (0, 9)
Range	0, 724

`rstatix()` for inferential statistics

Inferential statistics is a set of statistical procedures that allows us to draw conclusions about an entire population from a representative sample. The importance of inferential statistics lies in its ability to:

Generalize about a population: Inferential statistics enables researchers to make predictions or inferences about a population based on the observations made in a sample.
Test hypotheses: With inferential statistics, researchers can test a hypothesis to determine its statistical significance.
Study relationships: It allows researchers to examine the relationships between different variables in a sample and to generalize these relationships to the broader population.

We will not focus on the mathematics behind inferential statistics but more so the impementation within R. Nontheless, here is a quick summary of commonly used inferential statistical tests and when they are used:

T-tests (One-sample, Independent Two-sample, and Paired): These are used when we want to compare the means of one or two groups. For example, comparing the average height of men and women.
Analysis of Variance (ANOVA): ANOVA is used when comparing the means of more than two groups. For example, comparing the average income of people in three different cities.
Chi-square test: The chi-square test is used to determine if there is a significant association between two categorical variables. For example, examining the relationship between gender and voting behavior.
Correlation and Regression: Correlation is used to measure the strength and direction of the linear relationship between two variables. Regression is used to predict the value of one variable based on the value of another.

Each test has assumptions that need to be satisfied for the results to be accurate, so it’s essential to choose the right test for the data and research question at hand.

While it is possible to run these tests in base() R, we will skip that in this course since our grounding has all been carried out within the tidyverse(). If you wish to study inferential statistics using base () R you can have a look at this.

For the purposes of this course we shall take a deeper look at the rstatix package:

T-test

Use a formula syntax to specify the numeric and categorical columns for a two sample t-test:

c19_df %>% 
  t_test(age ~ male)

# A tibble: 1 × 8
  .y.   group1 group2    n1    n2 statistic     df        p
* <chr> <chr>  <chr>  <int> <int>     <dbl>  <dbl>    <dbl>
1 age   0      1      15783 21369      9.05 32644. 1.55e-19

Or a one-sample t-test

c19_df %>% 
  t_test(age ~ 1, mu = 60)

# A tibble: 1 × 7
  .y.   group1 group2         n statistic    df         p
* <chr> <chr>  <chr>      <int>     <dbl> <dbl>     <dbl>
1 age   1      null model 37152      30.8 37151 2.73e-206

Or one sample t-tests by group

c19_df %>% 
  group_by(male) %>% 
  t_test(age ~ 1, mu = 60)

# A tibble: 2 × 8
   male .y.   group1 group2         n statistic    df         p
* <int> <chr> <chr>  <chr>      <int>     <dbl> <dbl>     <dbl>
1     0 age   1      null model 15783      26.0 15782 2.3 e-146
2     1 age   1      null model 21369      18.0 21368 6.59e- 72

Shapiro-Wilk test

Note: Sample size must be between 3-5000

c19_df %>% 
  head(500) %>%            # first 500 rows of case linelist, for example only
  shapiro_test(age)

# A tibble: 1 × 3
  variable statistic             p
  <chr>        <dbl>         <dbl>
1 age          0.967 0.00000000413

Wilcoxon rank sum test

c19_df %>% 
  wilcox_test(age ~ malaysian)

# A tibble: 1 × 7
  .y.   group1 group2    n1    n2 statistic     p
* <chr> <chr>  <chr>  <int> <int>     <dbl> <dbl>
1 age   0      1       4034 33118 31201584.     0

Kruskal-Wallis test

Also known as the Mann-Whitney U test.

c19_df %>% 
  kruskal_test(age ~ state)

# A tibble: 1 × 6
  .y.       n statistic    df         p method        
* <chr> <int>     <dbl> <int>     <dbl> <chr>         
1 age   37152     1367.    15 2.03e-282 Kruskal-Wallis

Chi-squared test

The chi-square test function accepts a table, so first we create a cross-tabulation. There are many ways to create a cross-tabulation but here we use tabyl() from janitor and remove the left-most column of value labels before passing to chisq_test().

c19_df %>% 
  tabyl(malaysian, bid) %>% 
  select(-1) %>% 
  chisq_test()

# A tibble: 1 × 6
      n statistic     p    df method          p.signif
* <dbl>     <dbl> <dbl> <int> <chr>           <chr>   
1 37152     2459.     0     1 Chi-square test ****

`gtsummary()` package for Inferential statistics

Use gtsummary if you are looking to add the results of a statistical test to a pretty table that was created with this package. Performing statistical tests of comparison with tbl_summary is done by adding the add_p function to a table and specifying which test to use. It is possible to get p-values corrected for multiple testing by using the add_q function. Run ?tbl_summary for details.

T-tests

Compare the difference in means for a continuous variable in two groups. For example, compare the mean age by patient outcome.

c19_df %>% 
  select(age, malaysian) %>%                 # keep variables of interest
  tbl_summary(                               # produce summary table
    statistic = age ~ "{mean} ({sd})",       # specify what statistics to show
    by = malaysian) %>%                      # specify the grouping variable
  add_p(age ~ "t.test")                      # specify what tests to perform

Characteristic	0, N = 4,034¹	1, N = 33,118¹	p-value²
age	49 (14)	64 (16)	<0.001
¹ Mean (SD)
² Welch Two Sample t-test

Wilcoxon rank sum test

Compare the distribution of a continuous variable in two groups. The default is to use the Wilcoxon rank sum test and the median (IQR) when comparing two groups. However for non-normally distributed data or comparing multiple groups, the Kruskal-wallis test is more appropriate.

c19_df %>% 
  select(age, malaysian) %>%                     # keep variables of interest
  tbl_summary(                                   # produce summary table
    statistic = age ~ "{median} ({p25}, {p75})", # specify what statistic to show (this is default so could remove)
    by = malaysian) %>%                          # specify the grouping variable
  add_p(age ~ "wilcox.test")                     # specify what test to perform (default so could leave brackets empty)

Characteristic	0, N = 4,034¹	1, N = 33,118¹	p-value²
age	48 (41, 56)	66 (54, 76)	<0.001
¹ Median (IQR)
² Wilcoxon rank sum test

Kruskal-wallis test

Compare the distribution of a continuous variable in two or more groups, regardless of whether the data is normally distributed.

c19_df %>% 
  select(age, state) %>%                         # keep variables of interest
  tbl_summary(                                   # produce summary table
    statistic = age ~ "{median} ({p25}, {p75})", # specify what statistic to show (default, so could remove)
    by = state) %>%                              # specify the grouping variable
  add_p(age ~ "kruskal.test")                    # specify what test to perform

Characteristic	Johor, N = 4,740¹	Kedah, N = 2,756¹	Kelantan, N = 1,428¹	Melaka, N = 1,213¹	Negeri Sembilan, N = 1,546¹	Pahang, N = 1,037¹	Perak, N = 2,164¹	Perlis, N = 199¹	Pulau Pinang, N = 2,085¹	Sabah, N = 3,211¹	Sarawak, N = 1,795¹	Selangor, N = 11,024¹	Terengganu, N = 905¹	W.P. Kuala Lumpur, N = 2,861¹	W.P. Labuan, N = 159¹	W.P. Putrajaya, N = 29¹	p-value²
age	62 (49, 73)	64 (52, 76)	68 (58, 77)	64 (50, 75)	66 (55, 76)	64 (53, 75)	70 (58, 79)	71 (60, 81)	69 (55, 79)	67 (56, 78)	71 (59, 79)	60 (48, 71)	67 (57, 77)	63 (50, 74)	60 (48, 72)	68 (60, 73)	<0.001
¹ Median (IQR)
² Kruskal-Wallis rank sum test

Chi-squared test

Compare the proportions of a categorical variable in two groups. The default statistical test for add_p() when applied to a categorical variable is to perform a chi-squared test of independence with continuity correction, but if any expected call count is below 5 then a Fisher’s exact test is used.

c19_df %>% 
  select(malaysian, bid) %>% # keep variables of interest
  tbl_summary(by = bid) %>%  # produce summary table and specify grouping variable
  add_p()                    # specify what test to perform

Characteristic	0, N = 29,246¹	1, N = 7,906¹	p-value²
malaysian	27,288 (93%)	5,830 (74%)	<0.001
¹ n (%)
² Pearson's Chi-squared test

Correlations in R

Correlation between numeric variables can be investigated using the tidyverse
corrr package. It allows you to compute correlations using Pearson, Kendall tau or Spearman rho. The package creates a table and also has a function to automatically plot the values.

correlation_tab <- c19_df %>%
  mutate(across(contains("date"), ~as.Date(., format = "%Y-%m-%d")), #change character to dates\ fromat
         days_delay=as.numeric(date_announced-date),
         days_admitted=as.numeric(date-date_positive)) %>%
  select(age, days_delay, days_admitted) %>%                         # keep only columns of interest
  correlate()                                        # create correlation table (using default pearson)

Correlation computed with
• Method: 'pearson'
• Missing treated using: 'pairwise.complete.obs'

correlation_tab                                                      # print

# A tibble: 3 × 4
  term              age days_delay days_admitted
  <chr>           <dbl>      <dbl>         <dbl>
1 age           NA         -0.0998       -0.0307
2 days_delay    -0.0998    NA            -0.0531
3 days_admitted -0.0307    -0.0531       NA

Plot a scatteplot of correlations

## plot correlations 
rplot(correlation_tab)

Finally you can create a nifty little heatmap. You can calculate a correlation data frame using the correlate() function from corrr, reshape it into a long format with melt(), and then create a heatmap with ggplot2:

correlation_tab %>% autoplot(triangular="full") +
  geom_text(aes(label=round(r, digits=2)), size=4)+
  theme_minimal(16)

Acknowledgements

Material for this lecture was borrowed and adopted from

tidyverse() for Describing data

Importing some data

Get counts

Proportions

Summary statistics

Conditional statistics

Percentiles

across() multiple columns

gtsummary() package

Summary table

Adjustments

Multi-line stats for continuous variables

rstatix() for inferential statistics

T-test

Shapiro-Wilk test

Wilcoxon rank sum test

Kruskal-Wallis test

Chi-squared test

gtsummary() package for Inferential statistics

T-tests

Wilcoxon rank sum test

Kruskal-wallis test

Chi-squared test

Correlations in R

Acknowledgements

`tidyverse()` for Describing data

`across()` multiple columns

`gtsummary()` package

`rstatix()` for inferential statistics

`gtsummary()` package for Inferential statistics