Progression to Regression: An Introduction to Regression in R

Regression analysis is a powerful statistical tool that allows researchers to investigate relationships between variables, specifically the relationship between a dependent variable and one or more independent variables. In healthcare data, regression analysis is used for numerous purposes:

  1. Prediction and Forecasting: Regression analysis can help forecast future health trends based on historical data. For example, it can predict the prevalence of a disease based on risk factors.

  2. Identifying Risk Factors: Regression analysis can be used to identify risk factors for diseases. For example, it can help determine whether smoking is a significant risk factor for lung cancer.

  3. Cost-effectiveness Analysis: In health economics, regression analysis can help analyze the cost-effectiveness of different treatments or interventions.

  4. Evaluating Treatment Effects: Regression can be used to compare the effects of different treatments in a population, which can inform healthcare decision-making.

Here are some important regression models used in healthcare data analysis:

  1. Linear Regression: Linear regression is used when the dependent variable is continuous. For example, it could be used to examine the relationship between age (independent variable) and blood pressure (dependent variable).

  2. Logistic Regression: Logistic regression is used when the dependent variable is binary, i.e., it has two possible outcomes. It is often used to identify risk factors for diseases. For example, it could be used to investigate the impact of various factors (like age, sex, BMI, smoking status) on the likelihood of developing heart disease (Yes/No).

  3. Cox Regression: Cox regression, or proportional hazards regression, is a type of survival analysis used to investigate the effect of various factors on the time a specified event takes to happen. For example, it can be used to study the survival time of patients after being diagnosed with a certain disease.

  4. Poisson Regression: Poisson regression is used for count data. For instance, it can model the number of hospital admissions over a certain period.

  5. Multilevel Regression: Multilevel (or hierarchical) regression is used when data is grouped, such as patients within hospitals. This takes into account the potential correlation of patients within the same group.

  6. Nonlinear Regression: Nonlinear regression can be used when the relationship between the independent and dependent variables is not linear.

Remember, selecting the appropriate regression model depends largely on the type of data at hand and the specific research question. We will be introducing the approach to univariable and multivariable linear and logistic regression analysis in R.
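
To make these concrete, here is a minimal sketch of the typical R calls for each model family, using a small simulated toy data frame (the variable names are made up for illustration only; the Cox and multilevel lines assume the survival and lme4 packages, which we do not use in this lecture).

## toy data purely for illustration
set.seed(42)
toy <- data.frame(
  age      = round(rnorm(200, 60, 15)),                 # continuous predictor
  smoker   = rbinom(200, 1, 0.3),                       # binary predictor
  hospital = factor(sample(1:5, 200, replace = TRUE)),  # grouping variable
  sbp      = rnorm(200, 130, 20),                       # continuous outcome
  disease  = rbinom(200, 1, 0.4),                       # binary outcome
  visits   = rpois(200, 2),                             # count outcome
  time     = rexp(200, 0.1),                            # follow-up time
  event    = rbinom(200, 1, 0.7)                        # event indicator
)

lm(sbp ~ age, data = toy)                                          # 1. linear regression
glm(disease ~ age + smoker, family = "binomial", data = toy)       # 2. logistic regression
# survival::coxph(survival::Surv(time, event) ~ age, data = toy)   # 3. Cox regression (survival package)
glm(visits ~ age, family = "poisson", data = toy)                  # 4. Poisson regression
# lme4::lmer(sbp ~ age + (1 | hospital), data = toy)               # 5. multilevel regression (lme4 package)
# nls(sbp ~ a + b * age^2, data = toy, start = list(a = 100, b = 0.01))  # 6. nonlinear regression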

Before we start, let's get some important packages loaded:

required_packages <- c("tidyverse", "lubridate", "gtsummary", "broom", "performance", "see", "flextable", "stats", "car")
## identify any required packages that are not yet installed
not_installed <- required_packages[!(required_packages %in% installed.packages()[ , "Package"])]
## install anything missing, then load everything
if(length(not_installed)) install.packages(not_installed)
suppressWarnings(lapply(required_packages, require, character.only = TRUE))
Loading required package: tidyverse
── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
✔ dplyr     1.1.2     ✔ readr     2.1.4
✔ forcats   1.0.0     ✔ stringr   1.5.0
✔ ggplot2   3.4.2     ✔ tibble    3.2.1
✔ lubridate 1.9.2     ✔ tidyr     1.3.0
✔ purrr     1.0.1     
── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
✖ dplyr::filter() masks stats::filter()
✖ dplyr::lag()    masks stats::lag()
ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
Loading required package: gtsummary

Loading required package: broom

Loading required package: performance

Loading required package: see

Loading required package: flextable


Attaching package: 'flextable'


The following objects are masked from 'package:gtsummary':

    as_flextable, continuous_summary


The following object is masked from 'package:purrr':

    compose


Loading required package: car

Loading required package: carData


Attaching package: 'car'


The following object is masked from 'package:dplyr':

    recode


The following object is masked from 'package:purrr':

    some

Let's call in the data:

c19_df <- read.csv("https://raw.githubusercontent.com/MoH-Malaysia/covid19-public/main/epidemic/linelist/linelist_deaths.csv")

#create a sequence of numbers to represent time and categorise age
c19_df <- c19_df %>%
  arrange(date) %>%
  group_by(date) %>%
  mutate(date_index = cur_group_id()) %>%   # one index number per calendar date
  ungroup() %>%
  mutate(age_cat = case_when(               # age groups
    age < 20       ~ "Less than 20",
    age %in% 20:40 ~ "20-40",
    age %in% 40:60 ~ "40-60",
    age %in% 60:80 ~ "60:80",
    TRUE           ~ ">80"
  ))

To be consistent with what I said yesterday: before diving into the data, always (always!) skim it first to get a quick feel for it.

c19_df %>% 
  skimr::skim()
Data summary
Name Piped data
Number of rows 37152
Number of columns 17
_______________________
Column type frequency:
character 11
numeric 6
________________________
Group variables None

Variable type: character

skim_variable n_missing complete_rate min max empty n_unique whitespace
date 0 1 10 10 0 1004 0
date_announced 0 1 10 10 0 1000 0
date_positive 0 1 10 10 0 999 0
date_dose1 0 1 0 10 22437 298 0
date_dose2 0 1 0 10 28034 267 0
date_dose3 0 1 0 10 35719 161 0
brand1 0 1 0 16 22437 8 0
brand2 0 1 0 16 28034 7 0
brand3 0 1 0 16 35719 6 0
state 0 1 5 17 0 16 0
age_cat 0 1 3 12 0 5 0

Variable type: numeric

skim_variable n_missing complete_rate mean sd p0 p25 p50 p75 p100 hist
age 0 1 62.65 16.59 0 51 64 75 130 ▁▃▇▃▁
male 0 1 0.58 0.49 0 0 1 1 1 ▆▁▁▁▇
bid 0 1 0.21 0.41 0 0 0 0 1 ▇▁▁▁▂
malaysian 0 1 0.89 0.31 0 1 1 1 1 ▁▁▁▁▇
comorb 0 1 0.79 0.41 0 1 1 1 1 ▂▁▁▁▇
date_index 0 1 419.07 122.14 1 362 393 444 1004 ▁▇▆▁▁

Linear regression

The R function lm() performs linear regression, assessing the relationship between a numeric response variable and explanatory variables that are assumed to have a linear relationship with it.

Provide the equation as a formula, with the response and explanatory column names separated by a tilde ~. Also, pass your dataset to data =. Save the model results as an R object to use later.

Univariate linear regression

lm_results <- lm(age ~ malaysian, data = c19_df)

You can then run summary() on the model results to see the coefficients (Estimates), P-value, residuals, and other measures.

summary(lm_results)

Call:
lm(formula = age ~ malaysian, data = c19_df)

Residuals:
   Min     1Q Median     3Q    Max 
-64.26 -10.26   0.74  11.74  80.53 

Coefficients:
            Estimate Std. Error t value Pr(>|t|)    
(Intercept)  49.4727     0.2509  197.14   <2e-16 ***
malaysian    14.7875     0.2658   55.63   <2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 15.94 on 37150 degrees of freedom
Multiple R-squared:  0.07691,   Adjusted R-squared:  0.07689 
F-statistic:  3095 on 1 and 37150 DF,  p-value: < 2.2e-16

Alternatively, you can use the tidy() function from the broom package to pull the results into a table. The results tell us that for Malaysians (a change from 0 to 1 in the malaysian variable) the mean age at COVID-19 death is 14.8 years higher, and this difference is statistically significant (p < 0.001): the model predicts a mean age at death of about 49.5 years for non-Malaysians and 49.5 + 14.8 ≈ 64.3 years for Malaysians.

tidy(lm_results)
# A tibble: 2 × 5
  term        estimate std.error statistic p.value
  <chr>          <dbl>     <dbl>     <dbl>   <dbl>
1 (Intercept)     49.5     0.251     197.        0
2 malaysian       14.8     0.266      55.6       0

You can also add this regression to a ggplot. To do this, we first pull the observed data points and the fitted line into one data frame using the augment() function from broom.

## pull the regression points and observed data in to one dataset
points <- augment(lm_results)

## plot the data using age as the x-axis 
ggplot(points, aes(x = malaysian)) + 
  ## add points for the observed ages 
  geom_point(aes(y = age)) + 
  ## add your regression line 
  geom_line(aes(y = .fitted), colour = "red")

Note

When using a categorical (here binary) variable as the predictor, as we did above, the plot can be somewhat difficult to read. With a continuous predictor we can do this more simply, directly in ggplot, and it should look like:

## add your data to a plot 
 ggplot(c19_df, aes(x = date_index, y = age)) + 
  ## show points
  geom_point() + 
  ## add a linear regression 
  geom_smooth(method = "lm", se = FALSE)
`geom_smooth()` using formula = 'y ~ x'

Multivariable linear regression

There is almost no difference when carrying out multivariable linear regression: we simply add more explanatory variables to the formula, separating them with +.

lm_results <- lm(age ~ malaysian + bid + male + state + comorb, 
                 data = c19_df)

Check the output

tidy(lm_results)
# A tibble: 20 × 5
   term                   estimate std.error statistic  p.value
   <chr>                     <dbl>     <dbl>     <dbl>    <dbl>
 1 (Intercept)             47.6        0.380  125.     0       
 2 malaysian               12.9        0.284   45.3    0       
 3 bid                     -1.25       0.209   -5.96   2.57e- 9
 4 male                    -1.41       0.165   -8.58   9.74e-18
 5 stateKedah               1.48       0.376    3.95   7.86e- 5
 6 stateKelantan            4.94       0.474   10.4    2.18e-25
 7 stateMelaka              0.763      0.504    1.51   1.30e- 1
 8 stateNegeri Sembilan     3.19       0.459    6.95   3.80e-12
 9 statePahang              1.92       0.537    3.57   3.56e- 4
10 statePerak               5.27       0.407   12.9    3.65e-38
11 statePerlis              6.10       1.13     5.38   7.61e- 8
12 statePulau Pinang        5.42       0.412   13.2    1.88e-39
13 stateSabah               6.45       0.362   17.8    1.26e-70
14 stateSarawak             6.06       0.436   13.9    6.18e-44
15 stateSelangor            0.0115     0.274    0.0422 9.66e- 1
16 stateTerengganu          3.05       0.569    5.36   8.55e- 8
17 stateW.P. Kuala Lumpur   2.96       0.373    7.93   2.21e-15
18 stateW.P. Labuan         0.245      1.26     0.194  8.46e- 1
19 stateW.P. Putrajaya      2.96       2.92     1.01   3.10e- 1
20 comorb                   3.00       0.210   14.3    4.79e-46

Pretty simple, right?

Univariate logistic regression

The function glm() from the stats package (part of base R) is used to fit Generalized Linear Models (GLM). glm() can be used for univariate and multivariable logistic regression (e.g. to get Odds Ratios). Here are the core parts:

  • formula = The model is provided to glm() as an equation, with the outcome on the left and explanatory variables on the right of a tilde ~.

  • family = This determines the type of model to run. For logistic regression, use family = "binomial"; for Poisson regression use family = "poisson". Other examples are in the table below.

  • data = Specify your data frame

If necessary, you can also specify the link function via the syntax family = familytype(link = "linkfunction") (see the short example after the table below). You can read more in the documentation about other families and about optional arguments such as weights = and subset = (?glm).

Family Default link function
"binomial" (link = "logit")
"gaussian" (link = "identity")
"Gamma" (link = "inverse")
"inverse.gaussian" (link = "1/mu^2")
"poisson" (link = "log")
"quasi" (link = "identity", variance = "constant")
"quasibinomial" (link = "logit")
"quasipoisson" (link = "log")

When running glm() it is most common to save the results as a named R object. Then you can print the results to your console using summary() as shown below, or perform other operations on the results (e.g. exponentiate them). If you need to run a negative binomial regression you can use the MASS package; its glm.nb() function uses the same syntax as glm(). For a walk-through of different regressions, see the UCLA stats page. So let's try to run a logistic regression model first:

model <- glm(malaysian ~ age_cat, family = "binomial", data = c19_df)
summary(model)

Call:
glm(formula = malaysian ~ age_cat, family = "binomial", data = c19_df)

Deviance Residuals: 
    Min       1Q   Median       3Q      Max  
-2.7224   0.2231   0.2747   0.6605   0.7777  

Coefficients:
                    Estimate Std. Error z value Pr(>|z|)    
(Intercept)          3.68075    0.08588  42.859  < 2e-16 ***
age_cat20-40        -2.63967    0.09382 -28.135  < 2e-16 ***
age_cat40-60        -2.26931    0.08895 -25.512  < 2e-16 ***
age_cat60:80        -0.42226    0.09566  -4.414 1.01e-05 ***
age_catLess than 20 -2.19915    0.18614 -11.814  < 2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

(Dispersion parameter for binomial family taken to be 1)

    Null deviance: 25526  on 37151  degrees of freedom
Residual deviance: 22412  on 37147  degrees of freedom
AIC: 22422

Number of Fisher Scoring iterations: 6

Can you spot a potential issue in the summary above?
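
The issue is the reference category: age_cat is a character column, so R converts it to a factor whose levels are taken in sorted order. Here ">80" ends up as the first level and is absorbed into the intercept, so every other age band is being compared against the over-80s. A quick way to see the level ordering (a minimal sketch):

levels(factor(c19_df$age_cat))   # the first level is used as the reference category

Below we relevel the factor so that "Less than 20" becomes the reference instead: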

c19_df %>% 
  mutate(age_cat = fct_relevel(age_cat, "Less than 20", after = 0)) %>% 
  glm(formula = malaysian ~ age_cat, family = "binomial") %>% 
  summary()

Call:
glm(formula = malaysian ~ age_cat, family = "binomial", data = .)

Deviance Residuals: 
    Min       1Q   Median       3Q      Max  
-2.7224   0.2231   0.2747   0.6605   0.7777  

Coefficients:
             Estimate Std. Error z value Pr(>|z|)    
(Intercept)   1.48160    0.16514   8.972  < 2e-16 ***
age_cat>80    2.19915    0.18614  11.814  < 2e-16 ***
age_cat20-40 -0.44052    0.16941  -2.600  0.00931 ** 
age_cat40-60 -0.07017    0.16676  -0.421  0.67394    
age_cat60:80  1.77689    0.17043  10.426  < 2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

(Dispersion parameter for binomial family taken to be 1)

    Null deviance: 25526  on 37151  degrees of freedom
Residual deviance: 22412  on 37147  degrees of freedom
AIC: 22422

Number of Fisher Scoring iterations: 6

Let's clean that up so we have nicely exponentiated estimates and round the hyper-precise output:

model <- glm(malaysian ~ age_cat, family = "binomial", data = c19_df) %>% 
  tidy(exponentiate = TRUE, conf.int = TRUE) %>%                 # exponentiate and produce CIs
  mutate(across(where(is.numeric), \(x) round(x, digits = 2)))   # round all numeric columns
model
# A tibble: 5 × 7
  term                estimate std.error statistic p.value conf.low conf.high
  <chr>                  <dbl>     <dbl>     <dbl>   <dbl>    <dbl>     <dbl>
1 (Intercept)            39.7       0.09     42.9        0    33.7      47.2 
2 age_cat20-40            0.07      0.09    -28.1        0     0.06      0.09
3 age_cat40-60            0.1       0.09    -25.5        0     0.09      0.12
4 age_cat60:80            0.66      0.1      -4.41       0     0.54      0.79
5 age_catLess than 20     0.11      0.19    -11.8        0     0.08      0.16

Multivariable logistic regression

The expressions used are exactly the same as in the multivariable linear regression model:

log_reg <- glm(malaysian ~ age_cat + bid + comorb + male + state, 
               family = "binomial", data = c19_df)

tidy(log_reg, exponentiate = TRUE, conf.int = TRUE)
# A tibble: 23 × 7
   term                estimate std.error statistic   p.value conf.low conf.high
   <chr>                  <dbl>     <dbl>     <dbl>     <dbl>    <dbl>     <dbl>
 1 (Intercept)          38.4       0.112      32.5  2.49e-232  30.9      48.0   
 2 age_cat20-40          0.0709    0.101     -26.3  3.00e-152   0.0580    0.0861
 3 age_cat40-60          0.106     0.0943    -23.8  9.11e-125   0.0881    0.128 
 4 age_cat60:80          0.535     0.100      -6.26 3.75e- 10   0.438     0.648 
 5 age_catLess than 20   0.142     0.205      -9.51 1.94e- 21   0.0955    0.214 
 6 bid                   0.282     0.0398    -31.8  2.41e-221   0.261     0.305 
 7 comorb                3.45      0.0396     31.3  1.59e-215   3.20      3.73  
 8 male                  1.09      0.0394      2.29 2.20e-  2   1.01      1.18  
 9 stateKedah            3.37      0.139       8.76 1.88e- 18   2.58      4.45  
10 stateKelantan         2.75      0.194       5.22 1.76e-  7   1.91      4.09  
# ℹ 13 more rows

If you want to include two variables and an interaction between them you can separate them with an asterisk * instead of a +. Separate them with a colon : if you are only specifying the interaction. For example:

log_intrx_reg <- glm(malaysian ~ age_cat*bid + comorb + male + state, 
               family = "binomial", data = c19_df)

tidy(log_intrx_reg, exponentiate = TRUE, conf.int = TRUE)

Model building

You can build your model step-by-step, saving various models that include certain explanatory variables. You can compare these models with likelihood-ratio tests using lrtest() from the package lmtest, as below:

model1 <- glm(malaysian ~ age_cat, family = "binomial", data = c19_df)
model2 <- glm(malaysian ~ age_cat + bid, family = "binomial", data = c19_df)

lmtest::lrtest(model1, model2)
Likelihood ratio test

Model 1: malaysian ~ age_cat
Model 2: malaysian ~ age_cat + bid
  #Df LogLik Df  Chisq Pr(>Chisq)    
1   5 -11206                         
2   6 -10375  1 1661.8  < 2.2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

If you prefer to let your processor do the heavy lifting instead, you can take the model object and apply the step() function from the stats package. Specify which variable-selection direction you want to use when building the model.

## choose a model using stepwise selection based on AIC
## adjust the direction to "forward", "backward", or "both"
final_log_reg <- log_reg %>%
  step(direction = "both", trace = FALSE)

Check the best-fitting model selected by the stepwise (forward and backward) procedure:

log_tab_base <- final_log_reg %>% 
  broom::tidy(exponentiate = TRUE, conf.int = TRUE) %>%  ## get a tidy dataframe of estimates 
  mutate(across(where(is.numeric), \(x) round(x, digits = 2)))   ## round 

#check
log_tab_base
# A tibble: 23 × 7
   term                estimate std.error statistic p.value conf.low conf.high
   <chr>                  <dbl>     <dbl>     <dbl>   <dbl>    <dbl>     <dbl>
 1 (Intercept)            38.4       0.11     32.5     0       31.0      48.0 
 2 age_cat20-40            0.07      0.1     -26.3     0        0.06      0.09
 3 age_cat40-60            0.11      0.09    -23.8     0        0.09      0.13
 4 age_cat60:80            0.53      0.1      -6.26    0        0.44      0.65
 5 age_catLess than 20     0.14      0.21     -9.51    0        0.1       0.21
 6 bid                     0.28      0.04    -31.8     0        0.26      0.31
 7 comorb                  3.45      0.04     31.3     0        3.2       3.73
 8 male                    1.09      0.04      2.29    0.02     1.01      1.18
 9 stateKedah              3.37      0.14      8.76    0        2.58      4.45
10 stateKelantan           2.75      0.19      5.22    0        1.91      4.09
# ℹ 13 more rows

gtsummary and regression

My favourite package again. The gtsummary package provides the tbl_regression() function, which will take the output from a regression (glm() in this case) and produce a nice summary table.

## show results table of final regression 
mv_tab <- tbl_regression(final_log_reg, exponentiate = TRUE)

#check
mv_tab
Characteristic OR1 95% CI1 p-value
age_cat
    >80
    20-40 0.07 0.06, 0.09 <0.001
    40-60 0.11 0.09, 0.13 <0.001
    60:80 0.53 0.44, 0.65 <0.001
    Less than 20 0.14 0.10, 0.21 <0.001
bid 0.28 0.26, 0.31 <0.001
comorb 3.45 3.20, 3.73 <0.001
male 1.09 1.01, 1.18 0.022
state
    Johor
    Kedah 3.37 2.58, 4.45 <0.001
    Kelantan 2.75 1.91, 4.09 <0.001
    Melaka 2.22 1.61, 3.14 <0.001
    Negeri Sembilan 1.30 1.00, 1.71 0.055
    Pahang 1.86 1.33, 2.65 <0.001
    Perak 2.85 2.11, 3.92 <0.001
    Perlis 2.75 1.10, 9.25 0.057
    Pulau Pinang 0.81 0.66, 1.00 0.044
    Sabah 0.26 0.23, 0.31 <0.001
    Sarawak 4.49 2.96, 7.15 <0.001
    Selangor 0.47 0.41, 0.54 <0.001
    Terengganu 3.16 1.96, 5.44 <0.001
    W.P. Kuala Lumpur 0.40 0.34, 0.47 <0.001
    W.P. Labuan 0.44 0.28, 0.72 <0.001
    W.P. Putrajaya 34,834 340, 10,238,707,683,122,566 >0.9
1 OR = Odds Ratio, CI = Confidence Interval

It's so good! But it doesn't end there. tbl_uvregression() from the gtsummary package produces a table of univariate regression results. We select only the necessary columns from the linelist (explanatory variables and the outcome variable) and pipe them into tbl_uvregression(), which runs a univariate regression on each of the columns we defined as explanatory_vars.

Within the function itself, we provide the method = as glm (no quotes), pass the outcome column to y = (here, malaysian), specify via method.args = that we want to run logistic regression (family = binomial), and tell it to exponentiate the results.

The output is an HTML table that also contains the counts (N) for each variable.

## define variables of interest 
explanatory_vars <- c("bid", "male", "comorb", "age_cat", "state")  # let's leave state in, even though it increases the computation time

# create a data subset and run the regression
univ_tab <- c19_df %>% 
  dplyr::select(all_of(explanatory_vars), malaysian) %>% ## select variables of interest

  tbl_uvregression(                         ## produce univariate table
    method = glm,                           ## define regression want to run (generalised linear model)
    y = malaysian,                            ## define outcome variable
    method.args = list(family = binomial),  ## define what type of glm want to run (logistic)
    exponentiate = TRUE                     ## exponentiate to produce odds ratios (rather than log odds)
  )

## view univariate results table 
univ_tab
Characteristic N OR1 95% CI1 p-value
bid 37,152 0.20 0.19, 0.22 <0.001
male 37,152 0.93 0.87, 0.99 0.029
comorb 37,152 5.61 5.24, 6.00 <0.001
age_cat 37,152
    >80
    20-40 0.07 0.06, 0.09 <0.001
    40-60 0.10 0.09, 0.12 <0.001
    60:80 0.66 0.54, 0.79 <0.001
    Less than 20 0.11 0.08, 0.16 <0.001
state 37,152
    Johor
    Kedah 3.21 2.49, 4.20 <0.001
    Kelantan 3.65 2.57, 5.37 <0.001
    Melaka 2.22 1.63, 3.10 <0.001
    Negeri Sembilan 1.51 1.19, 1.95 0.001
    Pahang 1.98 1.45, 2.79 <0.001
    Perak 3.40 2.55, 4.61 <0.001
    Perlis 4.08 1.72, 13.3 0.006
    Pulau Pinang 0.85 0.71, 1.03 0.094
    Sabah 0.34 0.29, 0.38 <0.001
    Sarawak 6.45 4.32, 10.1 <0.001
    Selangor 0.41 0.36, 0.46 <0.001
    Terengganu 4.37 2.76, 7.43 <0.001
    W.P. Kuala Lumpur 0.33 0.29, 0.38 <0.001
    W.P. Labuan 0.39 0.26, 0.61 <0.001
    W.P. Putrajaya 65,203 434, 166,983,382,533,055,616 >0.9
1 OR = Odds Ratio, CI = Confidence Interval

Now for some magic

## combine with univariate results 
mv_merge <- tbl_merge(
  tbls = list(univ_tab, mv_tab),                          # combine
  tab_spanner = c("**Univariate**", "**Multivariable**")) # set header names

#check
mv_merge
Characteristic Univariate Multivariable
N OR1 95% CI1 p-value OR1 95% CI1 p-value
bid 37,152 0.20 0.19, 0.22 <0.001 0.28 0.26, 0.31 <0.001
male 37,152 0.93 0.87, 0.99 0.029 1.09 1.01, 1.18 0.022
comorb 37,152 5.61 5.24, 6.00 <0.001 3.45 3.20, 3.73 <0.001
age_cat 37,152
    >80
    20-40 0.07 0.06, 0.09 <0.001 0.07 0.06, 0.09 <0.001
    40-60 0.10 0.09, 0.12 <0.001 0.11 0.09, 0.13 <0.001
    60:80 0.66 0.54, 0.79 <0.001 0.53 0.44, 0.65 <0.001
    Less than 20 0.11 0.08, 0.16 <0.001 0.14 0.10, 0.21 <0.001
state 37,152
    Johor
    Kedah 3.21 2.49, 4.20 <0.001 3.37 2.58, 4.45 <0.001
    Kelantan 3.65 2.57, 5.37 <0.001 2.75 1.91, 4.09 <0.001
    Melaka 2.22 1.63, 3.10 <0.001 2.22 1.61, 3.14 <0.001
    Negeri Sembilan 1.51 1.19, 1.95 0.001 1.30 1.00, 1.71 0.055
    Pahang 1.98 1.45, 2.79 <0.001 1.86 1.33, 2.65 <0.001
    Perak 3.40 2.55, 4.61 <0.001 2.85 2.11, 3.92 <0.001
    Perlis 4.08 1.72, 13.3 0.006 2.75 1.10, 9.25 0.057
    Pulau Pinang 0.85 0.71, 1.03 0.094 0.81 0.66, 1.00 0.044
    Sabah 0.34 0.29, 0.38 <0.001 0.26 0.23, 0.31 <0.001
    Sarawak 6.45 4.32, 10.1 <0.001 4.49 2.96, 7.15 <0.001
    Selangor 0.41 0.36, 0.46 <0.001 0.47 0.41, 0.54 <0.001
    Terengganu 4.37 2.76, 7.43 <0.001 3.16 1.96, 5.44 <0.001
    W.P. Kuala Lumpur 0.33 0.29, 0.38 <0.001 0.40 0.34, 0.47 <0.001
    W.P. Labuan 0.39 0.26, 0.61 <0.001 0.44 0.28, 0.72 <0.001
    W.P. Putrajaya 65,203 434, 166,983,382,533,055,616 >0.9 34,834 340, 10,238,707,683,122,566 >0.9
1 OR = Odds Ratio, CI = Confidence Interval

One final piece of magic! We can export our publication-ready regression table to a Word document. Lovely!

mv_merge %>%
  as_flex_table() %>%
  flextable::save_as_docx(path = "regression.docx")

Model evaluation in R using the performance and car packages

Let's quickly run through how we can evaluate linear and logistic models in R.

Linear model evaluation

Linearity and Homoscedasticity: These can be checked using a residuals vs fitted values plot.

For linearity, we expect to see no pattern or curve in the points, while for homoscedasticity, we expect the spread of residuals to be approximately constant across all levels of the independent variables.

check_model(lm_results)
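
check_model() from the performance package produces a panel of diagnostic plots covering these checks. If you just want the classic residuals-versus-fitted plot on its own, base R can draw it directly from the model object (a quick sketch):

plot(lm_results, which = 1)   # residuals vs fitted values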

Normality of residuals: This can be checked using a QQ plot.

If the residuals are normally distributed, the points in the QQ plot will generally follow the straight line.

check_normality(lm_results)

Independence of residuals (No autocorrelation): This can be checked using the Durbin-Watson test.

If the test statistic is approximately 2, it indicates no autocorrelation.

check_autocorrelation(lm_results)

No multicollinearity: This can be checked using Variance Inflation Factor (VIF).

A VIF value greater than 5 (some suggest 10) might indicate problematic multicollinearity.

check_collinearity(lm_results)
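
The car package we loaded at the start also provides vif(), if you prefer the raw variance inflation factors (a quick sketch; for factors such as state it reports generalised VIFs):

car::vif(lm_results)   # VIF/GVIF for each predictor in the multivariable linear model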

Logistic model evaluation

Binary Outcome: Logistic regression requires the dependent variable to be binary (or ordinal, in the case of ordinal logistic regression).

Observation Independence: Observations should be independent of each other. This is more of a study design issue than something you would check with a statistical test.

No Multicollinearity: Just like in linear regression, the independent variables should not be highly correlated with each other. A common rule of thumb is that a VIF greater than 5 (or sometimes 10) indicates problematic multicollinearity.

# Checking for Multicollinearity:
check_collinearity(log_reg)

Linearity of Log Odds: Logistic regression does not assume a linear relationship between the independent variables and the outcome itself, but it does assume a linear relationship between each continuous independent variable and the log odds of the outcome.

Checking for Linearity of Log Odds:

Logistic regression assumes that the log odds of the outcome is a linear combination of the independent variables. This is a harder assumption to check, but one simple approach is to add a non-linear (for example quadratic) term for a continuous predictor and test whether it significantly improves the model.

First, fit your original model and a second model that allows for the possibility of a non-linear relationship. As a sketch, using age as a continuous predictor from our data:

logit_model <- glm(malaysian ~ age, family = binomial, data = c19_df)
logit_model_2 <- glm(malaysian ~ age + I(age^2), family = binomial, data = c19_df)

Then compare this model with your original model:

anova(logit_model, logit_model_2, test="Chisq")

If the model with the quadratic term is significantly better than the model without (i.e., p < 0.05), this could be a sign that the assumption of linearity of the log odds has been violated.

Acknowledgements

Material for this lecture was borrowed and adapted from