R for Medical Research

Welcome to R for researchers

The purpose of this course is to teach you how to handle, analyze and present research data using R. In the modern era of data science and e-learning, researchers and students are increasingly aware of the importance of handling data themselves, irrespective of their background. Indeed, handling and analyzing data has become an integral part of many professions. The field of data science has exploded in recent years, such that students and researchers in all fields are exposed to data science in some way. The dissemination of powerful software and know-how has made it possible for anyone to handle data.

This course intends to teach you how to analyze and present data for research. This includes both experimental studies and observational research. The authors of this course are medical researchers with many years' experience in teaching research methods, programming and data science. This course is an attempt to provide you with all you need to understand how to design a research study, select appropriate analyses, perform the analyses and present the research findings. We will be using R, which is an extremely powerful tool for research. There are currently more than 15,000 packages (i.e. libraries with useful, ready-made functions) available for R. These packages facilitate all aspects of analyzing data in R.
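As a small illustration of how packages are handled in practice, the sketch below checks which packages are not yet installed. The helper name `missing_packages()` is hypothetical, and the package list is an assumption inferred from the functions used in the quick example in this section (for instance `sample_n()` from dplyr and `ggpairs()` from GGally). Installation itself is left as a commented-out call.

```r
# Packages assumed to provide the functions used in the quick example.
pkgs <- c("survival", "dplyr", "mlr", "GGally", "visdat",
          "mice", "forestmodel", "equatiomatic")

# Hypothetical helper: return the packages that are not installed yet.
missing_packages <- function(pkgs) {
  installed <- vapply(pkgs, requireNamespace, logical(1), quietly = TRUE)
  pkgs[!installed]
}

to_install <- missing_packages(pkgs)
print(to_install)

# Uncomment to install whatever is missing (needs an internet connection):
# install.packages(to_install)
```

Once a package is installed, `library(packagename)` makes its functions available in the current session.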

R is a comparatively easy language to learn and it offers very powerful tools for researchers, irrespective of the field. The basics can be learned in a few days, which is enough to start producing impressive reports.

This course will give you all the tools necessary to conduct research, regardless of whether your interest lies in medicine or finance. However, all examples and emphasis will be on methods in medical research.

A quick example

Let us see how quickly we can create the core components of a clinical study. We will be using data on patients with colon cancer.

1. Load data

The colon dataset ships with the survival package, so we load the package first and then the data on colon cancer patients.

library(survival)  # provides the colon dataset
data(colon)

2. View 10 random patients

Let’s see 10 random patients:

library(dplyr)  # provides sample_n()
sample_n(colon, 10)
      id study      rx sex age obstruct perfor adhere nodes status differ extent surg node4 time etype
1390 695     1 Lev+5FU   0  51        0      0      0     1      0      2      3    1     0 2008     1
797  399     1 Lev+5FU   1  66        0      0      0     7      0      2      3    0     1 2540     2
1088 544     1     Lev   0  46        0      1      0     7      1      2      3    0     1  313     1
513  257     1     Lev   0  57        1      0      0     8      1      2      3    0     1  232     2
335  168     1     Obs   1  50        0      0      0     4      1      3      3    1     0  717     2
284  142     1     Lev   1  28        0      0      0     9      1      3      3    0     1  246     1
456  228     1     Lev   1  53        0      0      0     1      1      2      3    0     0 1976     1
188   94     1 Lev+5FU   0  26        0      0      1    NA      0      2      4    1     1 2869     1
1598 799     1     Obs   0  69        0      0      0     1      0      2      2    1     0 1929     1
198   99     1     Lev   1  71        0      0      1    NA      1      2      4    0     1  219     1
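Because the patients are drawn at random, the rows shown will differ on every run. A minimal base-R sketch of a reproducible draw is given here; the seed value is arbitrary and the object names `rows` and `random_patients` are introduced only for illustration. It assumes the survival package is installed and accesses the dataset directly as `survival::colon`.

```r
# survival ships with standard R installations and provides `colon`.
colon <- survival::colon

set.seed(2024)                          # arbitrary seed, for illustration only
rows <- sample(nrow(colon), size = 10)  # 10 distinct row indices
random_patients <- colon[rows, ]

print(random_patients)
```

Running the block again with the same seed returns the same 10 rows, which is useful when a report has to be regenerated.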
3. Quick overview of all columns (variables, features)

Let us get a quick overview of variable names, types, missingness, means, etc.

library(mlr)  # provides summarizeColumns()
summarizeColumns(colon)
name      type     na          mean         disp  median        mad  min   max  nlevs
id        numeric   0   465.0000000  268.2512426   465.0   343.9632    1   929      0
study     numeric   0     1.0000000    0.0000000     1.0     0.0000    1     1      0
rx        factor    0            NA    0.6609257      NA         NA  608   630      3
sex       numeric   0     0.5209903    0.4996937     1.0     0.0000    0     1      0
age       numeric   0    59.7545748   11.9456696    61.0    11.8608   18    85      0
obstruct  numeric   0     0.1937567    0.3953469     0.0     0.0000    0     1      0
perfor    numeric   0     0.0290635    0.1680298     0.0     0.0000    0     1      0
adhere    numeric   0     0.1453175    0.3525156     0.0     0.0000    0     1      0
nodes     numeric  36     3.6597146    3.5715810     2.0     1.4826    0    33      0
status    numeric   0     0.4951561    0.5001111     0.0     0.0000    0     1      0
differ    numeric  46     2.0629139    0.5141981     2.0     0.0000    1     3      0
extent    numeric   0     2.8869752    0.4880173     3.0     0.0000    1     4      0
surg      numeric   0     0.2658773    0.4419182     0.0     0.0000    0     1      0
node4     numeric   0     0.2744887    0.4463764     0.0     0.0000    0     1      0
time      numeric   0  1537.5457481  946.7038105  1855.0  1184.5974    8  3329      0
etype     numeric   0     1.5000000    0.5001346     1.5     0.7413    1     2      0
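To make the column headers of this overview more concrete, the sketch below recomputes the entries for the age variable with base R. It rests on one assumption: that `disp` is the standard deviation for numeric variables. The object name `age_summary` is introduced only for illustration.

```r
colon <- survival::colon

age <- colon$age

age_summary <- c(
  na     = sum(is.na(age)),  # number of missing values
  mean   = mean(age),
  disp   = sd(age),          # assumed: dispersion = standard deviation
  median = median(age),
  mad    = mad(age),         # median absolute deviation, scaled by 1.4826
  min    = min(age),
  max    = max(age)
)

print(round(age_summary, 4))
```

For the factor rx, the numeric summaries do not apply, which is why the mean, median and mad entries are NA in that row.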
4. Create a correlation plot to inspect distributions among 8 selected variables

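We will use ggpairs() from the GGally package to create a pairs plot, i.e. a matrix of pairwise plots, distributions and correlations, using the first 8 columns in the colon dataset. We will also color the data points according to the treatment received, which is specified by the rx variable.

library(GGally)   # provides ggpairs()
library(ggplot2)  # provides aes()
ggpairs(colon,
        columns = 1:8,
        mapping = aes(color = rx))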
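The plot can be complemented with the underlying numbers. The following sketch computes Pearson correlations with base R for a subset of the first 8 columns. The choice of columns is an assumption made here for illustration: id is an identifier, study is constant and rx is a factor, so they are left out.

```r
colon <- survival::colon

# Among the first 8 columns, id is an identifier, study is constant and
# rx is a factor, so only these five are used here.
vars <- c("sex", "age", "obstruct", "perfor", "adhere")

# Pearson correlations; these five columns have no missing values.
cor_matrix <- cor(colon[, vars])

print(round(cor_matrix, 2))
```

Several of these variables are binary (0/1), so the coefficients should be read as rough indicators of association rather than as measures of a linear trend.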
5. Get a glimpse of missing patterns in data

The vis_dat() function from the visdat package draws an image that resembles our data frame: the columns represent the variables and the rows the patients. Cells with missing values are drawn in a separate color, denoted NA (Not Available) in the legend.

library(visdat)  # provides vis_dat()
vis_dat(colon)
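The same information can be obtained as numbers. This base-R sketch counts the missing values per column and the share of rows that are fully observed; the object names `na_counts` and `complete_share` are introduced only for illustration.

```r
colon <- survival::colon

# Number of missing values per column, showing only columns with any.
na_counts <- colSums(is.na(colon))
print(na_counts[na_counts > 0])

# Share of rows with no missing value in any column.
complete_share <- mean(complete.cases(colon))
print(round(complete_share, 3))
```

The counts agree with the na column of the overview in step 3: only nodes and differ contain missing values.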
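6. Use a sophisticated prediction model to impute (fill in) missing data

library(mice)

# Multiple imputation by chained equations; by default mice() creates 5 imputed datasets
myImputation <- mice(colon)

# complete() returns the first imputed dataset by default
colon2 <- complete(myImputation)

# Check that no missing values remain
vis_dat(colon2)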
7. Create a regression model to study association between patient characteristics and survival

We will create a logistic regression model using the glm() function. The outcome (status) is placed to the left of the ~ (tilde sign) in the formula, and the explanatory variables are listed to the right, separated by + signs. We also specify which data frame to use, and that the distribution is binomial (i.e. logistic regression). Finally, we use forest_model() from the forestmodel package to create a publication-ready forest plot with variable names, categories, number of observations, odds ratios and p-values.

library(forestmodel)  # provides forest_model()

# First, we create the logistic regression model
my_model <- glm(status ~ age + sex + rx + nodes,
                data = colon2,
                family = "binomial")

# Second, we plot the results
forest_model(my_model)
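The numbers behind a forest plot of this kind can also be computed directly. The sketch below is an illustration under stated assumptions, not the method used by forest_model(): it fits the same formula on the original colon data (glm() then drops rows with missing values, so the estimates can differ from those based on colon2) and uses Wald confidence intervals. The names `fit` and `or_table` are introduced only for this example.

```r
colon <- survival::colon

# Fit the same formula on the original data; glm() drops rows with
# missing values, so results can differ from the imputed colon2.
fit <- glm(status ~ age + sex + rx + nodes,
           data = colon,
           family = binomial)

# Odds ratios with 95% Wald confidence intervals.
or_table <- cbind(
  OR = exp(coef(fit)),
  exp(confint.default(fit))
)

# p-values from the Wald z-tests reported by summary().
or_table <- cbind(or_table, p = coef(summary(fit))[, "Pr(>|z|)"])

print(round(or_table, 3))
```

An odds ratio above 1 indicates higher odds of the event (status = 1) as the variable increases; below 1 indicates lower odds.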
8. Report the exact regression equation

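Suppose the journal requires us to report the exact regression equation of our logistic regression model. The extract_eq() function from the equatiomatic package writes it out in LaTeX.

library(equatiomatic)  # provides extract_eq()
extract_eq(my_model, use_coefs = TRUE)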
\(\log\left[ \frac { P( \operatorname{status} = \operatorname{1} ) }{ 1 - P( \operatorname{status} = \operatorname{1} ) } \right] = -0.5 + 0(\operatorname{age}) - 0.04(\operatorname{sex}) - 0.04(\operatorname{rx}_{\operatorname{Lev}}) - 0.62(\operatorname{rx}_{\operatorname{Lev+5FU}}) + 0.19(\operatorname{nodes})\)
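The equation above is on the log-odds (logit) scale. For reference, the standard inverse-logit step that turns it into a predicted probability is written out here. The symbol \(\eta\) is introduced only for illustration and denotes the right-hand side of the equation (the linear predictor), with the rounded coefficients as reported.

```latex
\eta = -0.5 + 0(\operatorname{age}) - 0.04(\operatorname{sex}) - 0.04(\operatorname{rx}_{\operatorname{Lev}}) - 0.62(\operatorname{rx}_{\operatorname{Lev+5FU}}) + 0.19(\operatorname{nodes})

P(\operatorname{status} = 1) = \frac{e^{\eta}}{1 + e^{\eta}} = \frac{1}{1 + e^{-\eta}}
```

Read this way, each additional positive node multiplies the odds of the event by about \(e^{0.19} \approx 1.21\), holding the other variables fixed.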

Pretty good, right?

As you can see, R is amazing.