Practical Statistics 3: Powering up! Descriptive and Inferential Statistics
COVID-19 Vaccinations and Death in Malaysia
Task 1: Descriptive statistics using tidyverse
Question: Compute the summary statistics (count, mean, standard deviation, minimum, and maximum) of age using tidyverse functions.
Steps:
Install and load the tidyverse package.
Filter the dataset to remove missing values in the “age” column (Note there are no missing values in the dataset- the task is simply meant to simulate the code that would be required if there were).
Use the summary functions from dplyr to compute the required summary statistics. In this case- count, mean, standard deviation, minimum, and maximum
# A tibble: 1 × 8
.y. group1 group2 n1 n2 statistic df p
* <chr> <chr> <chr> <int> <int> <dbl> <dbl> <dbl>
1 age Female Male 15783 21369 9.05 32644. 1.55e-19
Task 4: Inferential statistics using gtsummary
Question: Test if there is a significant difference in age between Malaysians and non-Malaysians using the t-test, and present the results in a table using gtsummary.
Steps:
Recode the “malaysian” variable to factor (Tip: Use the factor function).
Use the tbl_summary() function to present the results.
Solution:
# Step 1c19_df$malaysian <-factor(c19_df$malaysian, levels =c(0, 1), labels =c("Non-Malaysian", "Malaysian"))# Step 2t_test_result <- c19_df %>%select(age, malaysian) %>%# keep variables of interesttbl_summary( # produce summary tablestatistic = age ~"{mean} ({sd})", # specify what statistics to showby = malaysian) %>%# specify the grouping variableadd_p(age ~"t.test") t_test_result
Characteristic
Non-Malaysian, N = 4,0341
Malaysian, N = 33,1181
p-value2
age
49 (14)
64 (16)
<0.001
1 Mean (SD)
2 Welch Two Sample t-test
Task 5: Correlations using corrr
Question: Compute the correlation between age, male, bid, and malaysian variables, and represent it in a correlation plot (Note: The selection of categorical variables is by design- just to practice the selection and presentation)
Steps:
Install and load the corrr package.
Create a subset of the data with the selected variables.
Compute the correlation matrix (Note: Try ?network_plot and see how this can be used)