Untangling Data with Tidyverse: Data wranggling in R

Data cleaning is arguably the most crucial step in the data analysis pipeline because it directly impacts the accuracy and reliability of insights drawn from the data.

Garbage in, garbage out, as the saying goes.

Without thoroughly cleaning the data, we might be working with incorrect, inconsistent, or irrelevant information, leading to potentially faulty conclusions. Data cleaning ensures the data is correctly formatted, accurate, complete, and ready for analysis. It involves dealing with missing values, removing duplicates, correcting inconsistencies, and handling outliers. Only after a rigorous data cleaning process can we trust that our analysis or model will give us meaningful, actionable insights.

Thus, while it can be a time-consuming process, skipping or skimping on data cleaning can lead to wasting even more time and resources downstream, as we try to interpret misleading results or troubleshoot models that aren’t performing as expected.

The “Tidyverse”

There are a number of R packages that take advantage of the tidy data form and can be used to do interesting things with data. Many (but not all) of these packages are written by Hadley Wickham and the collection of packages is often referred to as the “tidyverse” because of their dependence on and presumption of tidy data.

A subset of the “Tidyverse” packages include:

  • ggplot2: a plotting system based on the grammar of graphics

  • magrittr: defines the %>% operator for chaining functions together in a series of operations on data

  • dplyr: a suite of (fast) functions for working with data frames

  • tidyr: easily tidy data with pivot_wider() and pivot_longer() functions (also separate() and unite())

A complete list can be found here (https://www.tidyverse.org/packages).

We can call in the tidyverse as mentioned in the introduction like this:

library(tidyverse)
Warning: package 'tidyverse' was built under R version 4.2.3
Warning: package 'ggplot2' was built under R version 4.2.3
Warning: package 'tibble' was built under R version 4.2.3
Warning: package 'tidyr' was built under R version 4.2.3
Warning: package 'readr' was built under R version 4.2.3
Warning: package 'purrr' was built under R version 4.2.3
Warning: package 'dplyr' was built under R version 4.2.3
Warning: package 'stringr' was built under R version 4.2.3
Warning: package 'forcats' was built under R version 4.2.3
Warning: package 'lubridate' was built under R version 4.2.3
── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
✔ dplyr     1.1.2     ✔ readr     2.1.4
✔ forcats   1.0.0     ✔ stringr   1.5.0
✔ ggplot2   3.4.2     ✔ tibble    3.2.1
✔ lubridate 1.9.2     ✔ tidyr     1.3.0
✔ purrr     1.0.1     
── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
✖ dplyr::filter() masks stats::filter()
✖ dplyr::lag()    masks stats::lag()
ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors

Data Frames

The data frame (or data.frame) is a key data structure in statistics and in R.

The basic structure of a data frame is that there is one observation per row and each column represents a variable, a measure, feature, or characteristic of that observation.

Tibbles

Another type of data structure that we need to discuss is called the tibble! It’s best to think of tibbles as an updated and stylish version of the data.frame.

Before we go any further, tibbles are data frames, but they have some new bells and whistles to make your life easier.

How tibbles differ from data.frame

  • Input type remains unchanged - data.frame is notorious for treating strings as factors; this will not happen with tibbles
  • Variable names remain unchanged - In base R, creating data.frames will remove spaces from names, converting them to periods or add “x” before numeric column names. Creating tibbles will not change variable (column) names.
  • There are no row.names() for a tibble - Tidy data requires that variables be stored in a consistent way, removing the need for row names.
  • Tibbles print first ten rows and columns that fit on one screen - Printing a tibble to screen will never print the entire huge data frame out. By default, it just shows what fits to your screen.

Converting to data.frame or tibble

Use the as.data.frame() or as_tibble()

Calling in some data

For the purposes of this session lets utilise the Malaysian COVID-19 deaths linelinest maintained by the Ministry of Health on their Github page. Codes for each column are as follows:

  1. date: yyyy-mm-dd format; date of death

  2. date_announced: date on which the death was announced to the public (i.e. registered in the public linelist)

  3. date_positive: date of positive sample

  4. date_doseN: date of the individual’s first/second/third dose (if any)

  5. brandN: p = Pfizer, s = Sinovac, a = AstraZeneca, c = Cansino, m = Moderna, h = Sinopharm, j = Janssen, u = unverified (pending sync with VMS)

  6. state: state of residence

  7. age: age as an integer; note that it is possible for age to be 0, denoting infants less than 6 months old

  8. male: binary variable with 1 denoting male and 0 denoting female

  9. bid: binary variable with 1 denoting brought-in-dead and 0 denoting an inpatient death

  10. malaysian: binary variable with 1 denoting Malaysian and 0 denoting non-Malaysian

  11. comorb: binary variable with 1 denoting that the individual has comorbidities and 0 denoting no comorbidities declared

Lets call in the data:

c19_df <- read.csv("https://raw.githubusercontent.com/MoH-Malaysia/covid19-public/main/epidemic/linelist/linelist_deaths.csv")

Check the data structure

str(c19_df)
'data.frame':   37152 obs. of  15 variables:
 $ date          : chr  "2020-03-17" "2020-03-17" "2020-03-20" "2020-03-21" ...
 $ date_announced: chr  "2020-03-17" "2020-03-17" "2020-03-20" "2020-03-21" ...
 $ date_positive : chr  "2020-03-12" "2020-03-15" "2020-03-11" "2020-03-13" ...
 $ date_dose1    : chr  "" "" "" "" ...
 $ date_dose2    : chr  "" "" "" "" ...
 $ date_dose3    : chr  "" "" "" "" ...
 $ brand1        : chr  "" "" "" "" ...
 $ brand2        : chr  "" "" "" "" ...
 $ brand3        : chr  "" "" "" "" ...
 $ state         : chr  "Johor" "Sarawak" "Sabah" "Melaka" ...
 $ age           : int  34 60 58 50 80 39 57 69 48 73 ...
 $ male          : int  1 1 1 1 0 0 1 1 1 1 ...
 $ bid           : int  0 0 0 0 1 0 0 0 0 0 ...
 $ malaysian     : int  1 1 1 1 1 1 1 1 1 1 ...
 $ comorb        : int  1 1 1 1 1 1 1 1 1 1 ...
dim(c19_df)
[1] 37152    15

The dplyr Package

The dplyr package was developed by Posit (formely RStudio) and is an optimized and distilled version of the older plyr package for data manipulation or wrangling.

Artwork by Allison Horst on the dplyr package

[Source: Artwork by Allison Horst]

The dplyr package does not provide any “new” functionality to R per se, in the sense that everything dplyr does could already be done with base R, but it greatly simplifies existing functionality in R.

One important contribution of the dplyr package is that it provides a “grammar” (in particular, verbs) for data manipulation and for operating on data frames.

With this grammar, you can sensibly communicate what it is that you are doing to a data frame that other people can understand (assuming they also know the grammar). This is useful because it provides an abstraction for data manipulation that previously did not exist.

Another useful contribution is that the dplyr functions are very fast, as many key operations are coded in C++.

dplyr grammar

Some of the key “verbs” provided by the dplyr package are

  • select(): return a subset of the columns of a data frame, using a flexible notation

  • filter(): extract a subset of rows from a data frame based on logical conditions

  • arrange(): reorder rows of a data frame

  • rename(): rename variables in a data frame

  • mutate(): add new variables/columns or transform existing variables

  • summarise() / summarize(): generate summary statistics of different variables in the data frame, possibly within strata

  • %>%: the “pipe” operator is used to connect multiple verb actions together into a pipelineArtwork by

select()

Lets convert the COVID-19 deaths linelist into a tibble first

c19_df <- as_tibble(c19_df)
str(c19_df)
tibble [37,152 × 15] (S3: tbl_df/tbl/data.frame)
 $ date          : chr [1:37152] "2020-03-17" "2020-03-17" "2020-03-20" "2020-03-21" ...
 $ date_announced: chr [1:37152] "2020-03-17" "2020-03-17" "2020-03-20" "2020-03-21" ...
 $ date_positive : chr [1:37152] "2020-03-12" "2020-03-15" "2020-03-11" "2020-03-13" ...
 $ date_dose1    : chr [1:37152] "" "" "" "" ...
 $ date_dose2    : chr [1:37152] "" "" "" "" ...
 $ date_dose3    : chr [1:37152] "" "" "" "" ...
 $ brand1        : chr [1:37152] "" "" "" "" ...
 $ brand2        : chr [1:37152] "" "" "" "" ...
 $ brand3        : chr [1:37152] "" "" "" "" ...
 $ state         : chr [1:37152] "Johor" "Sarawak" "Sabah" "Melaka" ...
 $ age           : int [1:37152] 34 60 58 50 80 39 57 69 48 73 ...
 $ male          : int [1:37152] 1 1 1 1 0 0 1 1 1 1 ...
 $ bid           : int [1:37152] 0 0 0 0 1 0 0 0 0 0 ...
 $ malaysian     : int [1:37152] 1 1 1 1 1 1 1 1 1 1 ...
 $ comorb        : int [1:37152] 1 1 1 1 1 1 1 1 1 1 ...

The select() function can be used to select columns of a data frame that you want to focus on.

Example

Suppose we wanted to take the last 5 columns only. There are a few ways to do this.

We could for example use numerical indices:

names(c19_df)[10:15]
[1] "state"     "age"       "male"      "bid"       "malaysian" "comorb"   

But we can also use the names directly:

subset <- select(c19_df, state:comorb)
head(subset)
# A tibble: 6 × 6
  state     age  male   bid malaysian comorb
  <chr>   <int> <int> <int>     <int>  <int>
1 Johor      34     1     0         1      1
2 Sarawak    60     1     0         1      1
3 Sabah      58     1     0         1      1
4 Melaka     50     1     0         1      1
5 Sarawak    80     0     1         1      1
6 Sarawak    39     0     0         1      1
Note

The : normally cannot be used with names or strings, but inside the select() function you can use it to specify a range of variable names.

You can also omit variables using the select() function by using the negative sign. With select() you can do

select(c19_df, -(state:comorb))

The select() function also allows a special syntax that allows you to specify variable names based on patterns. So, for example, if you wanted to keep every variable that ends with a “2”, we could do

subset <- select(c19_df, ends_with(c("1", "2", "3")))
str(subset)
tibble [37,152 × 6] (S3: tbl_df/tbl/data.frame)
 $ date_dose1: chr [1:37152] "" "" "" "" ...
 $ brand1    : chr [1:37152] "" "" "" "" ...
 $ date_dose2: chr [1:37152] "" "" "" "" ...
 $ brand2    : chr [1:37152] "" "" "" "" ...
 $ date_dose3: chr [1:37152] "" "" "" "" ...
 $ brand3    : chr [1:37152] "" "" "" "" ...

Or if we wanted to keep every variable that starts with a “date”, we could do

subset <- select(c19_df, contains("date"))
str(subset)
tibble [37,152 × 6] (S3: tbl_df/tbl/data.frame)
 $ date          : chr [1:37152] "2020-03-17" "2020-03-17" "2020-03-20" "2020-03-21" ...
 $ date_announced: chr [1:37152] "2020-03-17" "2020-03-17" "2020-03-20" "2020-03-21" ...
 $ date_positive : chr [1:37152] "2020-03-12" "2020-03-15" "2020-03-11" "2020-03-13" ...
 $ date_dose1    : chr [1:37152] "" "" "" "" ...
 $ date_dose2    : chr [1:37152] "" "" "" "" ...
 $ date_dose3    : chr [1:37152] "" "" "" "" ...

You can also use more general regular expressions if necessary. See the help page (?select) for more details.

filter()

The filter() function is used to extract subsets of rows from a data frame. This function is similar to the existing subset() function in R but is quite a bit faster in my experience.

Artwork by Allison Horst on filter() function

[Source: Artwork by Allison Horst]

Example

Suppose we wanted to extract the rows of the c19_df data frame of deaths in individuals older than 60, we could do

c19_filter <- filter(c19_df, age > 59)
str(c19_filter)
tibble [22,276 × 15] (S3: tbl_df/tbl/data.frame)
 $ date          : chr [1:22276] "2020-03-17" "2020-03-22" "2020-03-21" "2020-03-22" ...
 $ date_announced: chr [1:22276] "2020-03-17" "2020-03-22" "2020-03-21" "2020-03-22" ...
 $ date_positive : chr [1:22276] "2020-03-15" "2020-03-20" "2020-03-17" "2020-03-14" ...
 $ date_dose1    : chr [1:22276] "" "" "" "" ...
 $ date_dose2    : chr [1:22276] "" "" "" "" ...
 $ date_dose3    : chr [1:22276] "" "" "" "" ...
 $ brand1        : chr [1:22276] "" "" "" "" ...
 $ brand2        : chr [1:22276] "" "" "" "" ...
 $ brand3        : chr [1:22276] "" "" "" "" ...
 $ state         : chr [1:22276] "Sarawak" "Sarawak" "Kelantan" "Pulau Pinang" ...
 $ age           : int [1:22276] 60 80 69 73 70 70 70 75 66 68 ...
 $ male          : int [1:22276] 1 0 1 1 1 1 1 1 1 0 ...
 $ bid           : int [1:22276] 0 1 0 0 0 0 0 0 0 0 ...
 $ malaysian     : int [1:22276] 1 1 1 1 1 1 1 1 1 1 ...
 $ comorb        : int [1:22276] 1 1 1 1 1 1 1 1 1 1 ...

You can see that there are now only 22276 rows in the data frame and the distribution of the age values is.

summary(c19_filter$age)
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
   60.0    66.0    73.0    73.7    81.0   130.0 

We can place an arbitrarily complex logical sequence inside of filter(), so we could for example extract the rows where age is greater than 60 and nationality (malaysian) is equal to 1

c19_filter <- filter(c19_df, age < 21 & malaysian==1)
select(c19_filter, date, malaysian, age)
# A tibble: 221 × 3
   date       malaysian   age
   <chr>          <int> <int>
 1 2020-10-05         1     1
 2 2020-11-08         1     2
 3 2020-11-15         1     8
 4 2020-11-04         1    15
 5 2021-02-03         1     0
 6 2021-01-21         1    18
 7 2021-02-09         1    20
 8 2021-02-13         1     9
 9 2021-05-10         1    17
10 2021-04-04         1     3
# ℹ 211 more rows

Now there are only 221 observations where both of those conditions are met.

Other logical operators you should be aware of include:

Operator Meaning Example
== Equals malaysian== 1
!= Does not equal malaysian!= 1
> Greater than age> 60
>= Greater than or equal to age>= 60
< Less than age< 60
<= Less than or equal to age<= 60
%in% Included in state %in% c("Selangor", "Johor")
is.na() Is a missing value is.na(date_dose_2)
Note

If you are ever unsure of how to write a logical statement, but know how to write its opposite, you can use the ! operator to negate the whole statement.

A common use of this is to identify observations with non-missing data (e.g., !(is.na(date_dose_2))).

arrange()

The arrange() function is used to reorder rows of a data frame according to one of the variables/columns. Reordering rows of a data frame (while preserving corresponding order of other columns) is normally a pain to do in R. The arrange() function simplifies the process quite a bit.

Here we can order the rows of the data frame by date, so that the first row is the earliest (oldest) observation and the last row is the latest (most recent) observation.

c19_df <- arrange(c19_df, date)

We can now check the first few rows

head(select(c19_df, date, age), 3)
# A tibble: 3 × 2
  date         age
  <chr>      <int>
1 2020-03-17    34
2 2020-03-17    60
3 2020-03-20    58

Columns can be arranged in descending order too by useing the special desc() operator.

c19_df <- arrange(c19_df, desc(date))

Looking at the first three and last three rows shows the dates in descending order.

head(select(c19_df, date, age), 3)
# A tibble: 3 × 2
  date         age
  <chr>      <int>
1 2023-06-27    79
2 2023-06-26    83
3 2023-06-22    68
tail(select(c19_df, date, age), 3)
# A tibble: 3 × 2
  date         age
  <chr>      <int>
1 2020-03-20    58
2 2020-03-17    34
3 2020-03-17    60

rename()

Renaming a variable in a data frame in R is surprisingly hard to do! The rename() function is designed to make this process easier.

Here you can see the names of the first six variables in the c19_df data frame.

head(c19_df[, 1:6], 3)
# A tibble: 3 × 6
  date       date_announced date_positive date_dose1   date_dose2   date_dose3  
  <chr>      <chr>          <chr>         <chr>        <chr>        <chr>       
1 2023-06-27 2023-07-01     2023-06-26    ""           ""           ""          
2 2023-06-26 2023-07-01     2023-06-10    "2021-06-30" "2021-07-21" ""          
3 2023-06-22 2023-06-24     2023-06-19    "2021-06-15" "2021-07-06" "2022-01-06"

These names are (arbitrarily again) unnecessarily long. Date doesn’t need to be repeated for each column name are no other columns that could potentially be confused with the first six. So we can modify cause we’re lazy to type long column names when analysing later on.

c19_df <- rename(c19_df, death = date,
                 announced = date_announced,
                 positive = date_positive,
                 dose1 = date_dose1,
                 dose2 = date_dose2,
                 dose3 = date_dose3)
head(c19_df[, 1:5], 3)
# A tibble: 3 × 5
  death      announced  positive   dose1        dose2       
  <chr>      <chr>      <chr>      <chr>        <chr>       
1 2023-06-27 2023-07-01 2023-06-26 ""           ""          
2 2023-06-26 2023-07-01 2023-06-10 "2021-06-30" "2021-07-21"
3 2023-06-22 2023-06-24 2023-06-19 "2021-06-15" "2021-07-06"

The syntax inside the rename() function is to have the new name on the left-hand side of the = sign and the old name on the right-hand side.

mutate()

The mutate() function exists to compute transformations of variables in a data frame. Often, you want to create new variables that are derived from existing variables and mutate() provides a clean interface for doing that.

Artwork by Allison Horst on mutate() function

[Source: Artwork by Allison Horst]

For example, with demographic data, we often want to categorise data

Here, we create a age_cat variable:

c19_df <- mutate(c19_df, age_cat= ifelse(age<20, "<19", 
                                         ifelse(age>59, ">60", "20-59")))

we can also recode category values within a variable, such as

c19_df <- mutate(c19_df,  brand2 = recode(brand2, 
                          "Pfizer" = 0, 
                          "Sinovac" = 1, 
                          "AstraZeneca" = 2, 
                          "Moderna" = 2, 
                          "Pending VMS sync" = 2, 
                          "Sinopharm" = 2))
Warning: There was 1 warning in `mutate()`.
ℹ In argument: `brand2 = recode(...)`.
Caused by warning:
! Unreplaced values treated as NA as `.x` is not compatible.
Please specify replacements exhaustively or supply `.default`.

group_by()

The group_by() function is used to generate summary statistics from the data frame within strata defined by a variable.

For example, in this dataset, you might want to know what the number of deaths in each state is?

In conjunction with the group_by() function, we often use the summarise() function

Note

The general operation here is a combination of

  1. Splitting a data frame into separate pieces defined by a variable or group of variables (group_by())
  2. Then, applying a summary function across those subsets (summarise())
Example

We can create a separate data frame that splits the original data frame by state

state <- group_by(c19_df, state)

We can then compute summary statistics for each year in the data frame with the summarise() function.

summarise(state, age = mean(age, na.rm = TRUE),
          age_median= median(age, na.rm = TRUE))
# A tibble: 16 × 3
   state               age age_median
   <chr>             <dbl>      <dbl>
 1 Johor              60.9       60.9
 2 Kedah              63.1       63.1
 3 Kelantan           66.7       66.7
 4 Melaka             62.2       62.2
 5 Negeri Sembilan    64.5       64.5
 6 Pahang             63.2       63.2
 7 Perak              66.9       66.9
 8 Perlis             68.1       68.1
 9 Pulau Pinang       66.0       66.0
10 Sabah              65.4       65.4
11 Sarawak            68.1       68.1
12 Selangor           59.4       59.4
13 Terengganu         64.9       64.9
14 W.P. Kuala Lumpur  61.7       61.7
15 W.P. Labuan        59.3       59.3
16 W.P. Putrajaya     64.7       64.7

%>%

The pipeline operator %>% is very handy for stringing together multiple dplyr functions in a sequence of operations.

Notice above that every time we wanted to apply more than one function, the sequence gets buried in a sequence of nested function calls that is difficult to read, i.e. This nesting is not a natural way to think about a sequence of operations.

The %>% operator allows you to string operations in a left-to-right fashion, i.e

Example

Take the example that we just did in the last section.

That can be done with the following sequence in a single R expression.

c19_df %>% 
  select(state, age) %>%
  arrange(age) %>%
  mutate(age_cat = ifelse(age<20, "<19", 
                         ifelse(age>59, ">60", "20-59"))) %>%   
  group_by(state, age_cat) %>% 
  summarise(age = mean(age, na.rm = TRUE),
          age_median= median(age, na.rm = TRUE))
`summarise()` has grouped output by 'state'. You can override using the
`.groups` argument.
# A tibble: 46 × 4
# Groups:   state [16]
   state    age_cat   age age_median
   <chr>    <chr>   <dbl>      <dbl>
 1 Johor    20-59   45.8       45.8 
 2 Johor    <19      8.62       8.62
 3 Johor    >60     73.5       73.5 
 4 Kedah    20-59   46.9       46.9 
 5 Kedah    <19      7.47       7.47
 6 Kedah    >60     73.7       73.7 
 7 Kelantan 20-59   47.7       47.7 
 8 Kelantan <19     14.4       14.4 
 9 Kelantan >60     74.1       74.1 
10 Melaka   20-59   46.0       46.0 
# ℹ 36 more rows

This way we do not have to create a set of temporary variables along the way or create a massive nested sequence of function calls.

slice_*()

The slice_sample() function of the dplyr package will allow you to see a sample of random rows in random order.

The number of rows to show is specified by the n argument.

  • This can be useful if you do not want to print the entire tibble, but you want to get a greater sense of the values.
  • This is a good option for data analysis reports, where printing the entire tibble would not be appropriate if the tibble is quite large.
Example
slice_sample(c19_df, n = 10)
# A tibble: 10 × 16
   death   announced positive dose1 dose2 dose3 brand1 brand2 brand3 state   age
   <chr>   <chr>     <chr>    <chr> <chr> <chr> <chr>   <dbl> <chr>  <chr> <int>
 1 2021-0… 2021-09-… 2021-08… ""    ""    ""    ""         NA ""     Kedah    58
 2 2021-0… 2021-08-… 2021-08… ""    ""    ""    ""         NA ""     Johor    63
 3 2022-0… 2022-03-… 2022-03… ""    ""    ""    ""         NA ""     Nege…    91
 4 2020-0… 2020-09-… 2020-09… ""    ""    ""    ""         NA ""     Sabah    57
 5 2022-0… 2022-04-… 2022-03… "202… "202… ""    "Pfiz…      0 ""     Johor    76
 6 2021-0… 2021-05-… 2021-05… ""    ""    ""    ""         NA ""     Johor    58
 7 2022-0… 2022-03-… 2022-03… ""    ""    ""    ""         NA ""     Kela…    75
 8 2021-0… 2021-08-… 2021-08… "202… ""    ""    "Pfiz…     NA ""     Mela…    47
 9 2021-0… 2021-09-… 2021-08… ""    ""    ""    ""         NA ""     Sela…    55
10 2022-0… 2022-02-… 2022-02… ""    ""    ""    ""         NA ""     Sela…    96
# ℹ 5 more variables: male <int>, bid <int>, malaysian <int>, comorb <int>,
#   age_cat <chr>

You can also use slice_head() or slice_tail() to take a look at the top rows or bottom rows of your tibble. Again the number of rows can be specified with the n argument.

This will show the first 5 rows.

slice_head(c19_df, n = 5)
# A tibble: 5 × 16
  death    announced positive dose1 dose2 dose3 brand1 brand2 brand3 state   age
  <chr>    <chr>     <chr>    <chr> <chr> <chr> <chr>   <dbl> <chr>  <chr> <int>
1 2023-06… 2023-07-… 2023-06… ""    ""    ""    ""         NA ""     Perak    79
2 2023-06… 2023-07-… 2023-06… "202… "202… ""    "Sino…      1 ""     Kedah    83
3 2023-06… 2023-06-… 2023-06… "202… "202… "202… "Pfiz…      0 "Pfiz… Tere…    68
4 2023-06… 2023-06-… 2023-06… "202… "202… "202… "Sino…      1 "Pfiz… Perak    75
5 2023-06… 2023-06-… 2023-06… ""    ""    ""    ""         NA ""     Pula…    82
# ℹ 5 more variables: male <int>, bid <int>, malaysian <int>, comorb <int>,
#   age_cat <chr>

This will show the last 5 rows.

slice_tail(c19_df, n = 5)
# A tibble: 5 × 16
  death    announced positive dose1 dose2 dose3 brand1 brand2 brand3 state   age
  <chr>    <chr>     <chr>    <chr> <chr> <chr> <chr>   <dbl> <chr>  <chr> <int>
1 2020-03… 2020-03-… 2020-03… ""    ""    ""    ""         NA ""     W.P.…    57
2 2020-03… 2020-03-… 2020-03… ""    ""    ""    ""         NA ""     Kela…    69
3 2020-03… 2020-03-… 2020-03… ""    ""    ""    ""         NA ""     Sabah    58
4 2020-03… 2020-03-… 2020-03… ""    ""    ""    ""         NA ""     Johor    34
5 2020-03… 2020-03-… 2020-03… ""    ""    ""    ""         NA ""     Sara…    60
# ℹ 5 more variables: male <int>, bid <int>, malaysian <int>, comorb <int>,
#   age_cat <chr>

Pivoting in R

The tidyr package includes functions to transfer a data frame between long and wide.

  • Wide format data tends to have different attributes or variables describing an observation placed in separate columns.
  • Long format data tends to have different attributes encoded as levels of a single variable, followed by another column that contains values of the observation at those different levels.

Lets create a sample set based on the deaths dataset focussing only on brand2 uptake over time.

Warning: There was 1 warning in `mutate()`.
ℹ In argument: `across(where(is.character), na_if, "")`.
Caused by warning:
! The `...` argument of `across()` is deprecated as of dplyr 1.1.0.
Supply arguments directly to `.fns` through an anonymous function instead.

  # Previously
  across(a:b, mean, na.rm = TRUE)

  # Now
  across(a:b, \(x) mean(x, na.rm = TRUE))
`summarise()` has grouped output by 'date'. You can override using the
`.groups` argument.

pivot_wider()

The pivot_wider() function is less commonly needed to tidy data as compared to its sister pivot_longer. It can, however, be useful for creating summary tables. As out sample dataset is already in long form- for the sake of this example we will pivot_wider first.

dose2_df <- dose2_df %>%
  pivot_wider(id_cols="date",
              names_from = "brand2",
              values_from = "count")
Tippy-tip

You use the kable() function in dplyr to make nicer looking html tables

dose2_df %>%
  mutate_all(~replace_na(., 0)) %>%
  head(10) %>%
  knitr::kable(format="html", caption = "Vaccinations among COVID-19 fatalities by Brand") %>% kableExtra::kable_minimal()
`mutate_all()` ignored the following grouping variables:
• Column `date`
ℹ Use `mutate_at(df, vars(-group_cols()), myoperation)` to silence the message.
Vaccinations among COVID-19 fatalities by Brand
date unvaccinated Pfizer Sinovac Pending VMS sync AstraZeneca Moderna Sinopharm
2020-03-17 2 0 0 0 0 0 0
2020-03-20 1 0 0 0 0 0 0
2020-03-21 4 0 0 0 0 0 0
2020-03-22 4 0 0 0 0 0 0
2020-03-23 5 0 0 0 0 0 0
2020-03-24 2 0 0 0 0 0 0
2020-03-25 3 0 0 0 0 0 0
2020-03-26 5 0 0 0 0 0 0
2020-03-27 2 0 0 0 0 0 0
2020-03-28 6 0 0 0 0 0 0

pivot_longer()

Even if your data is in a tidy format, pivot_longer() is useful for pulling data together to take advantage of faceting, or plotting separate plots based on a grouping variable.

dose2_df %>%
  pivot_longer(-date, 
               names_to = "brand2",
               values_to = "count")
# A tibble: 7,028 × 3
# Groups:   date [1,004]
   date       brand2           count
   <chr>      <chr>            <int>
 1 2020-03-17 unvaccinated         2
 2 2020-03-17 Pfizer              NA
 3 2020-03-17 Sinovac             NA
 4 2020-03-17 Pending VMS sync    NA
 5 2020-03-17 AstraZeneca         NA
 6 2020-03-17 Moderna             NA
 7 2020-03-17 Sinopharm           NA
 8 2020-03-20 unvaccinated         1
 9 2020-03-20 Pfizer              NA
10 2020-03-20 Sinovac             NA
# ℹ 7,018 more rows

separate() and unite()

The same tidyr package also contains two useful functions:

  • unite(): combine contents of two or more columns into a single column
  • separate(): separate contents of a column into two or more columns

First, we combine the first three columns into one new column using unite().

c19_df %>% select(brand1, brand2, brand3) %>%
  unite(col="profile", 
        brand1:brand3, 
        sep="_") %>%
  tail(10)
                                  profile
37143                                  __
37144                                  __
37145              Sinovac_Sinovac_Pfizer
37146                      Pfizer_Pfizer_
37147                                  __
37148                Pfizer_Pfizer_Pfizer
37149                Pfizer_Pfizer_Pfizer
37150 AstraZeneca_AstraZeneca_AstraZeneca
37151                                  __
37152                    Sinovac_Sinovac_

Next, we show how to separate the columns into three separate columns using separate() using the col, into and sep arguments.

c19_df %>% select(brand1, brand2, brand3) %>%
  unite(col="profile", 
        brand1:brand3, 
        sep="_") %>% 
  select(profile) %>%
  separate(col="profile", 
           into=c("brand1", "brand2", "brand3"), 
           sep="_") %>% 
  tail(10)
           brand1      brand2      brand3
37143                                    
37144                                    
37145     Sinovac     Sinovac      Pfizer
37146      Pfizer      Pfizer            
37147                                    
37148      Pfizer      Pfizer      Pfizer
37149      Pfizer      Pfizer      Pfizer
37150 AstraZeneca AstraZeneca AstraZeneca
37151                                    
37152     Sinovac     Sinovac            

Mutating joins

The dplyr package provides a set of functions for joining two data frames into a single data frame based on a set of key columns.

There are several functions in the *_join() family.

  • These functions all merge together two data frames
  • They differ in how they handle observations that exist in one but not both data frames.

Here, are the four functions from this family that you will likely use the most often:

Function What it includes in merged data frame
`left_join()` Includes all observations in the left data frame, whether or not there is a match in the right data frame
`right_join()` Includes all observations in the right data frame, whether or not there is a match in the left data frame
`inner_join()` Includes only observations that are in both data frames
`full_join()` Includes all observations from both data frames

[Source from R for Data Science]

Suppose we want to create a table that combines the information about COVID-19 deaths (c19_df) with the information about the expenditure (hies_df) at each state.

First lets take c19_df and aggregate it at the state level.

state_df <- c19_df %>% select(state) %>%
  group_by(state) %>%
  summarise(deaths=n())

Lets call in an external object to join

hies_df <- read.csv("https://raw.githubusercontent.com/dosm-malaysia/data-open/main/datasets/economy/hies_2019.csv")

Lets look at the data

table(hies_df$area_type)

 district       dun parliment     state 
      160       600       222        16 

Left Join

We can use the left_join() function to merge the state_df and hies_df datasets.

left_join(x = state_df, y = hies_df, by = join_by(state==area))
# A tibble: 20 × 7
   state        deaths area_type income_mean expenditure_mean  gini poverty_rate
   <chr>         <int> <chr>           <int>            <int> <dbl>        <dbl>
 1 Johor          4740 state            8013             4793 0.366          3.9
 2 Kedah          2756 state            5522             3359 0.354          8.8
 3 Kelantan       1428 state            4874             3223 0.378         12.4
 4 Melaka         1213 state            7741             4955 0.383          3.9
 5 Negeri Semb…   1546 state            6707             4350 0.391          4.3
 6 Pahang         1037 state            5667             3652 0.33           4.3
 7 Perak          2164 state            5645             3564 0.377          7.3
 8 Perlis          199 state            5476             3468 0.334          3.9
 9 Perlis          199 district         5476             3468 0.334          3.9
10 Pulau Pinang   2085 state            7774             4630 0.359          1.9
11 Sabah          3211 state            5745             2792 0.397         19.5
12 Sarawak        1795 state            5959             3448 0.387          9  
13 Selangor      11024 state           10827             5830 0.393          1.2
14 Terengganu      905 state            6815             4336 0.335          6.1
15 W.P. Kuala …   2861 state           13257             6913 0.35           0.2
16 W.P. Kuala …   2861 district        13257             6913 0.35           0.2
17 W.P. Labuan     159 state            8319             4097 0.333          3.1
18 W.P. Labuan     159 district         8319             4097 0.333          3.1
19 W.P. Putraj…     29 state           12840             7980 0.361          0.4
20 W.P. Putraj…     29 district        12840             7980 0.361          0.4
Note

The by argument indicates the column (or columns) that the two tables have in common. One more than one joining variable an be used for this statement

Quite obviously the join should give you the total number of rows on the left side of your statement. Note in the above case there are 20 rows because there are four districts with the same name as states.

Inner Join

The inner_join() function only retains the rows of both tables that have corresponding values. Here we can see the difference.

inner_join(x = state_df, y = hies_df, by = join_by(state==area))
# A tibble: 20 × 7
   state        deaths area_type income_mean expenditure_mean  gini poverty_rate
   <chr>         <int> <chr>           <int>            <int> <dbl>        <dbl>
 1 Johor          4740 state            8013             4793 0.366          3.9
 2 Kedah          2756 state            5522             3359 0.354          8.8
 3 Kelantan       1428 state            4874             3223 0.378         12.4
 4 Melaka         1213 state            7741             4955 0.383          3.9
 5 Negeri Semb…   1546 state            6707             4350 0.391          4.3
 6 Pahang         1037 state            5667             3652 0.33           4.3
 7 Perak          2164 state            5645             3564 0.377          7.3
 8 Perlis          199 state            5476             3468 0.334          3.9
 9 Perlis          199 district         5476             3468 0.334          3.9
10 Pulau Pinang   2085 state            7774             4630 0.359          1.9
11 Sabah          3211 state            5745             2792 0.397         19.5
12 Sarawak        1795 state            5959             3448 0.387          9  
13 Selangor      11024 state           10827             5830 0.393          1.2
14 Terengganu      905 state            6815             4336 0.335          6.1
15 W.P. Kuala …   2861 state           13257             6913 0.35           0.2
16 W.P. Kuala …   2861 district        13257             6913 0.35           0.2
17 W.P. Labuan     159 state            8319             4097 0.333          3.1
18 W.P. Labuan     159 district         8319             4097 0.333          3.1
19 W.P. Putraj…     29 state           12840             7980 0.361          0.4
20 W.P. Putraj…     29 district        12840             7980 0.361          0.4

Does inner_join give different results to left_join in the above example?

Right Join

The right_join() function is like the left_join() function except that it gives priority to the “right” hand argument.

right_join(x = state_df, y = hies_df, by = join_by(state==area))
# A tibble: 998 × 7
   state        deaths area_type income_mean expenditure_mean  gini poverty_rate
   <chr>         <int> <chr>           <int>            <int> <dbl>        <dbl>
 1 Johor          4740 state            8013             4793 0.366          3.9
 2 Kedah          2756 state            5522             3359 0.354          8.8
 3 Kelantan       1428 state            4874             3223 0.378         12.4
 4 Melaka         1213 state            7741             4955 0.383          3.9
 5 Negeri Semb…   1546 state            6707             4350 0.391          4.3
 6 Pahang         1037 state            5667             3652 0.33           4.3
 7 Perak          2164 state            5645             3564 0.377          7.3
 8 Perlis          199 state            5476             3468 0.334          3.9
 9 Perlis          199 district         5476             3468 0.334          3.9
10 Pulau Pinang   2085 state            7774             4630 0.359          1.9
# ℹ 988 more rows

What about now?

Acknowledgements

Material for this lecture was borrowed and adopted from

Additional Resources