Practical session 1: Untangling Data with Tidyverse: Data wranggling in R

COVID-19 Vaccinations and Death in Malaysia

Task 1: Calculate the average age for deaths by state and find the state with the highest average age.

Steps:

  1. First, we need to group the data by state.

  2. Then, we can summarize the average age per state using summarise.

  3. Use arrange to sort the average age in descending order to find the state with the highest average age.

Solution:

c19_df %>% 
  group_by(state) %>%
  summarise(avg_age = mean(age, na.rm = TRUE)) %>%
  arrange(desc(avg_age))
# A tibble: 16 × 2
   state             avg_age
   <chr>               <dbl>
 1 Perlis               68.1
 2 Sarawak              68.1
 3 Perak                66.9
 4 Kelantan             66.7
 5 Pulau Pinang         66.0
 6 Sabah                65.4
 7 Terengganu           64.9
 8 W.P. Putrajaya       64.7
 9 Negeri Sembilan      64.5
10 Pahang               63.2
11 Kedah                63.1
12 Melaka               62.2
13 W.P. Kuala Lumpur    61.7
14 Johor                60.9
15 Selangor             59.4
16 W.P. Labuan          59.3

Task 2: Determine the proportion of male to female deaths in each state.

Steps:

  1. Using mutate, create a new column called gender using ifelse to convert the male column to ‘Male’ and ‘Female’.

  2. Group the data by state and gender.

  3. Summarise the count of each gender in each state.

  4. Create a new column with the proportion of each gender in each state.

Solution:

c19_df %>%
  mutate(gender = ifelse(male == 1, "Male", "Female")) %>%
  group_by(state, gender) %>%
  summarise(count = n()) %>%
  mutate(proportion = count / sum(count))
`summarise()` has grouped output by 'state'. You can override using the
`.groups` argument.
# A tibble: 32 × 4
# Groups:   state [16]
   state           gender count proportion
   <chr>           <chr>  <int>      <dbl>
 1 Johor           Female  2096      0.442
 2 Johor           Male    2644      0.558
 3 Kedah           Female  1280      0.464
 4 Kedah           Male    1476      0.536
 5 Kelantan        Female   707      0.495
 6 Kelantan        Male     721      0.505
 7 Melaka          Female   529      0.436
 8 Melaka          Male     684      0.564
 9 Negeri Sembilan Female   675      0.437
10 Negeri Sembilan Male     871      0.563
# ℹ 22 more rows

Task 3: Determine the total number of deaths by month and year.

Steps:

  1. Convert the date column to Date type if it’s not already.

  2. Use mutate to create new columns year and month using the year and month functions from the lubridate package.

  3. Group the data by year and month.

  4. Use summarise to count the number of deaths.

Solution:

c19_df %>%
  mutate(date = as.Date(date),
         year = year(date),
         month = month(date)) %>%
  group_by(year, month) %>%
  summarise(deaths = n())
`summarise()` has grouped output by 'year'. You can override using the
`.groups` argument.
# A tibble: 40 × 3
# Groups:   year [4]
    year month deaths
   <dbl> <dbl>  <int>
 1  2020     3     48
 2  2020     4     55
 3  2020     5     12
 4  2020     6      6
 5  2020     7      4
 6  2020     8      3
 7  2020     9     10
 8  2020    10    129
 9  2020    11    116
10  2020    12    133
# ℹ 30 more rows

Task 4: Determine if comorbidities are more common in Malaysian or non-Malaysian deaths.

Steps:

  1. Create a new column nationality that categorizes malaysian into ‘Malaysian’ and ‘Non-Malaysian’ using mutate and ifelse.

  2. Group by nationality.

  3. Summarise the average comorbidity rate (comorb).

Solution:

c19_df %>%
  mutate(nationality = ifelse(malaysian == 1, "Malaysian", "Non-Malaysian")) %>%
  group_by(nationality) %>%
  summarise(avg_comorb = mean(comorb, na.rm = TRUE))
# A tibble: 2 × 2
  nationality   avg_comorb
  <chr>              <dbl>
1 Malaysian          0.827
2 Non-Malaysian      0.461

Task 5: Find out the most common vaccine brand combination that was administered.

Steps:

  1. Use mutate to create a new column brands_combo that concatenates brand1, brand2, and brand3.

  2. filter to keep only those rows where brands_combo is not empty.

  3. Group by brands_combo.

  4. Count the number of occurrences for each vaccine brand combination using summarise.

Solution:

c19_df %>%
  unite(col="profile", 
        brand1:brand3, 
        sep="_")  %>%
  filter(profile != "--") %>%
  group_by(profile) %>%
  summarise(count = n()) %>%
  arrange(desc(count))
# A tibble: 32 × 2
   profile                  count
   <chr>                    <int>
 1 __                       22437
 2 Sinovac_Sinovac_          4693
 3 Pfizer_Pfizer_            2733
 4 Sinovac__                 2672
 5 Pfizer__                  2141
 6 AstraZeneca__              707
 7 Pfizer_Pfizer_Pfizer       591
 8 Sinovac_Sinovac_Pfizer     548
 9 AstraZeneca_AstraZeneca_   242
10 Sinovac_Sinovac_Sinovac    145
# ℹ 22 more rows