Introductory Concepts

Introduction to R for Medical Researchers

The purpose of this course is to teach you how to manage, analyze and present research data (or any other data). Data and data science have become increasingly popular in recent years. This is explained partly by the following trends:

  • The tremendous increase in data collection. Almost everything is digital these days, meaning that the possibilities to collect data have increased and, consequently, we gather data at an unprecedented pace.
  • Advances in tools and techniques for data analysis. Software and techniques for data analysis have evolved rapidly, particularly within the fields of machine learning and artificial intelligence.
  • Dissemination of capabilities to analyze data. Nowadays anyone can analyze data. [1]

Data science is a very broad field which may include virtually any type of data analysis. It lies at the intersection of study design, statistics and programming. Interestingly, in recent years many students and professionals have gained knowledge in all three of these, meaning that they are capable of designing a study, selecting appropriate analysis methods and writing the code that produces the results. Indeed, nowadays physicians, nurses, chemists, physicists, developers, engineers and many others can deploy sophisticated analyses using the software and methods discussed in this book and course.

Objectives

This course will provide you with sufficient knowledge to carry out research projects yourself. However, you will not become an expert by reading only this book. To become an expert in data science, it is necessary to dive deeper into each topic covered here. That said, you probably don’t have to become an expert; having a basic understanding of most topics and perhaps a deeper understanding of a few selected ones is generally the best approach.

A typical research project includes the following:

  • Importing the data.
  • Cleaning the data.
  • Transforming the data, e.g., creating new variables.
  • Visualizing the data.
  • Describing the data using visualizations and descriptive measures.
  • Developing mathematical models to:
    • Elucidate relationships between variables, e.g., estimate the effect of age on the risk of cancer.
    • Enable prediction, e.g., predict cancer occurrence.
    • etc.
  • Presenting the data: communicating the results to the reader.

Additional details

Importing data simply means that you open the data and make it accessible in R, which is the software that we will be using. Here, data refers to a rectangular table consisting of rows and columns. The rows are observations (e.g., patients) and the columns are variables. In R, such tables are referred to as data frames (Figure 1).

Figure 1.
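As a minimal sketch of what such a table looks like in R, the following creates a small data frame by hand. The data frame name (patients), the column names and all values are made up for illustration.

```r
# A small, made-up data frame: each row is a patient, each column is a variable
patients <- data.frame(
  id  = 1:4,
  age = c(54, 61, 47, 70),
  sex = c("female", "male", "female", "male"),
  sbp = c(128, 141, 119, 152)  # systolic blood pressure (mmHg), invented values
)

str(patients)   # compact overview: one line per column
nrow(patients)  # number of observations (rows)
ncol(patients)  # number of variables (columns)
```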

R can import almost any kind of data. Excel, SPSS, SAS and CSV files, among others, can be imported seamlessly into R.
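A minimal sketch of importing a CSV file with base R is shown below. To keep the example self-contained, it first writes a tiny, made-up CSV file to a temporary location; in a real project you would instead point read.csv() at your own file. For Excel, SPSS and SAS files, add-on packages such as readxl and haven are commonly used.

```r
# Write a small, made-up CSV file to a temporary location
csv_path <- tempfile(fileext = ".csv")
writeLines(
  c("id,age,sex",
    "1,54,female",
    "2,61,male",
    "3,47,female"),
  csv_path
)

# Import the CSV file into a data frame
patients <- read.csv(csv_path)

head(patients)  # show the first rows of the imported data
```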

After you import data, you must tidy it. Tidying means making sure that the data structure is suitable for analysis and presentation. In practice, nearly all data sets require some tidying.
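One common tidying task, sketched here under assumptions, is reshaping a table that has one column per visit into a table with one row per patient and visit. The data and column names (sbp_visit1, sbp_visit2) are invented, and the example uses the base R function reshape(); add-on packages such as tidyr offer alternatives.

```r
# Made-up "wide" data: one row per patient, one blood pressure column per visit
wide <- data.frame(
  id         = 1:3,
  sbp_visit1 = c(120, 135, 150),
  sbp_visit2 = c(118, 130, 145)
)

# Reshape to "long" format: one row per patient and visit
long <- reshape(
  wide,
  direction = "long",
  varying   = c("sbp_visit1", "sbp_visit2"),  # columns to stack
  v.names   = "sbp",                          # name of the stacked column
  timevar   = "visit",                        # new column holding the visit number
  times     = 1:2,
  idvar     = "id"
)

long
```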

When the data structure is correct, you continue to data transformation, which means manipulating data. For example, you may need to create a new variable based on data in other variables, or based on a formula, etc. Modifying existing data, or creating new data, is referred to as data transformation, or mutation.
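As a small illustration of data transformation, the sketch below creates two new variables from existing ones. The data and column names are made up, and the cut-off of 25 is used only as an example.

```r
# Made-up data on three patients
patients <- data.frame(
  id        = 1:3,
  weight_kg = c(70, 85, 60),
  height_m  = c(1.75, 1.80, 1.60)
)

# New variable based on a formula: BMI = weight (kg) / height (m) squared
patients$bmi <- patients$weight_kg / patients$height_m^2

# New variable based on another variable: TRUE if BMI is 25 or above
patients$overweight <- patients$bmi >= 25

patients
```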

Tidying and transforming data may collectively be referred to as data wrangling. When data wrangling is completed, we continue to descriptive statistics, visualizations, statistical modeling, hypothesis testing and inference (i.e., drawing conclusions based on our results).
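Descriptive statistics can often be obtained with a few base R functions. The following is a minimal sketch using made-up ages and sexes for five patients.

```r
# Made-up data on five patients
patients <- data.frame(
  age = c(54, 61, 47, 70, 65),
  sex = c("female", "male", "female", "male", "male")
)

mean_age   <- mean(patients$age)   # average age
sd_age     <- sd(patients$age)     # standard deviation of age
sex_counts <- table(patients$sex)  # number of patients per sex

mean_age
sd_age
sex_counts
summary(patients$age)  # minimum, quartiles, mean and maximum
```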

Visualizing data is fundamental. A good illustration can change everything. Visualizations should be clear and exciting.

Statistical models, either classical regression models or modern machine learning algorithms, aim to decipher and elucidate patterns in the data. A statistical model is a mathematical description of data. It can be used for estimating the effect of variables or even for making predictions. Many statistical frameworks rest on complicated mathematical foundations. Fortunately, most statistical frameworks are straightforward to use in R. [2]
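To give a flavor of how compact model fitting can be in R, here is a minimal sketch of a linear regression. It uses women, a small example data set shipped with R (height and weight of American women), rather than any data from this course; the new height of 66 inches is an arbitrary choice.

```r
# Fit a linear regression of weight on height
fit <- lm(weight ~ height, data = women)

# Estimated intercept and slope (effect of height on weight)
coef(fit)

# Use the model to predict weight for a new observation
new_data <- data.frame(height = 66)
pred <- predict(fit, newdata = new_data)
pred
```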

Presentation / communication: if you cannot communicate your results in an efficient and clear way, then the impact of your research will decrease. We will therefore discuss presentation techniques in detail in this course.

Other data types

The majority of chapters in this book discuss data analysis using data frames (tables consisting of columns and rows). But there are many other data types that can be analyzed in R, such as sound recordings, photographs and text. Analysis of such data types lies within the field of artificial intelligence (neural networks), which we will also discuss briefly.

Prerequisites

Previous experience in programming is helpful but certainly not necessary to complete this course. Anyone can complete this course and learn how to use R effectively for research.

Programming for non-programmers

Everyone can learn to write computer programs. The better you get, the easier your work becomes. This course uses the programming language R, which is a powerful language for data science and research. We will emphasize the techniques most frequently used in research. Most students complete this course within 5 to 10 days.

Required software

  • R: R is a language and software environment for data analysis and presentation.
  • RStudio: RStudio was created to facilitate the work in R. RStudio makes life very easy.
  • R packages: You can extend the functionality of R by installing packages. For example, if you want to use Cox regression, you install a package which includes all necessary functions for Cox regression. There are approximately 15,000 packages available.
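As a sketch of how packages are installed and loaded, the example below uses the survival package, which provides the coxph() function for Cox regression. Installing requires internet access and only needs to be done once; the check with requireNamespace() is one possible way to avoid reinstalling a package that is already present.

```r
# Install the package only if it is not already available (needs internet access)
if (!requireNamespace("survival", quietly = TRUE)) {
  install.packages("survival")
}

# Load the package so that its functions can be used in this session
library(survival)

# coxph() is now available
is.function(coxph)
```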

About R and RStudio

R is a language developed specifically for analyzing and visualizing data, and it is very effective for this purpose. It was created by Ross Ihaka and Robert Gentleman at the University of Auckland as an implementation of the S language, which was developed at Bell Laboratories by John Chambers and colleagues. The basic installation of R provides a wide variety of statistical and graphical techniques and is highly extensible. In addition, there are over 15,000 freely available packages. Using these packages, you can build websites, e-books (this book is written in R) and web applications in R.

How to install R

Download R (the desktop version) from CRAN (The Comprehensive R Archive Network):

1. Download and install R (make sure you select the correct operating system).

How to install RStudio

RStudio is an IDE (integrated development environment), which means that it facilitates programming. RStudio provides an interface which is easy to use. Download RStudio from the official website:

  1. Download and install RStudio
  • Select RStudio for desktop, the free version, for your operating system.

The following image (Figure 2) shows the RStudio interface.

As seen in Figure 2, there are four panes in RStudio. The source/script pane is used for writing your R program. You can write multiple lines and execute them one by one. You can also enter R commands in the console pane, but it is not suitable for writing longer pieces of code. The console pane also presents the results of executed R commands. To run code in the console pane, just press Enter. To run code in the source/script pane, select the lines you want to run and press Run in the top right of the pane.
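To try this out, you could type the following two lines in the source/script pane and run them. This is only a toy sketch with made-up numbers; the variable name x is arbitrary.

```r
# Store three made-up measurements in a vector called x
x <- c(2, 4, 6)

# Compute their mean; the result is printed in the console pane
mean(x)
```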

An example

Let’s see a simple example using patients with colon cancer. We want to plot how the number of cancer metastases (called nodes in this data frame) relates to the number of days survived (called time), as well as the treatment given (called rx).

We will not go into details on the code itself, but the comments, which follow the # sign, may help you interpret the code. R does not interpret anything following the # sign.

# Load the package used for creating plots
library(ggplot2)

# Load the package that contains data on cancer patients
library(survival)

# Load the cancer data sets shipped with survival, including colon
data(cancer, package = "survival")

# Create the plot
ggplot(data = colon,                # tell R where the data is located
       aes(x = time, y = nodes)) +  # map variables to the x and y axes
  geom_point() +                    # add points to the plot
  geom_smooth() +                   # add a smoothed trend line to the plot
  facet_wrap(~rx)                   # create one subplot per treatment group

It appears as if there is an inverse association between survival and number of metastases. In the subsequent chapters, we will get into coding and you’ll either love it immediately, or grow to love it!
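One rough way to put a number on this visual impression is a rank correlation, sketched below. Two assumptions are made here: the colon data contain two rows per patient (etype 1 for recurrence, etype 2 for death), so only the death rows are kept, and censoring is ignored, which a proper survival analysis would not do.

```r
library(survival)

# Keep one row per patient: the rows describing death (etype == 2)
deaths <- subset(colon, etype == 2)

# Spearman rank correlation between follow-up time and number of nodes,
# using only patients for whom both values are available
rho <- cor(deaths$time, deaths$nodes,
           method = "spearman", use = "complete.obs")
rho
```

A negative value would be in line with the pattern seen in the plot.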


  1. Without getting into details, anyone can do this.
  2. Research is teamwork. Whenever you are in the slightest doubt, consult an expert (statistician, data scientist) in order to ensure the quality of the analyses.