Loading built-in datasets

Numerous R packages include built-in datasets. For example, the survival package contains both the lung dataset (a lung cancer study) and the colon dataset (a colon cancer study). The default installation of R also comes with a package called datasets, which is automatically loaded when you start R. You can list the datasets available in the packages that are currently loaded, including the datasets package, by typing the following command:

data()

This displays a list of the available datasets, each with a short description.

Built-in datasets are meant to be used as practice data, e.g. to learn plotting, modeling, writing functions, etc. If you wish to use any of the datasets in the datasets package, you can load it into your environment using the data() function. For example, this will load the AirPassengers dataset:

data("AirPassengers")

This will load the dataset, but it will remain inactive until you actually use it. In RStudio, you will notice the note <Promise> next to the AirPassengers dataset in the Environment pane.

Once you use the dataset it will be loaded and active. Don’t worry about the <Promise> note, just go ahead and use the dataset as desired.
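As a minimal sketch of the steps above, the following lists only the datasets in the datasets package and then loads and uses AirPassengers. The variable name available is our own choice for illustration.

```r
# List only the datasets that ship with the 'datasets' package
available <- data(package = "datasets")
head(available$results[, c("Item", "Title")])

# Load AirPassengers into the environment and use it
data("AirPassengers")
class(AirPassengers)    # a time series object
length(AirPassengers)   # number of monthly observations
summary(AirPassengers)
```

Once any of the last three lines has run, the <Promise> note is replaced by the actual data.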

If you wish to use a dataset located in an R package, you need to load the package first. For example, the lung dataset is located in the survival package. These commands will load the survival package, then load the lung dataset and print the first 10 rows of the dataset:

# Load the survival package
library(survival)

# Load the lung cancer data
data(lung)

# Print first 10 rows
head(lung, 10)
inst  time  status  age  sex  ph.ecog  ph.karno  pat.karno  meal.cal  wt.loss
   3   306       2   74    1        1        90        100      1175       NA
   3   455       2   68    1        0        90         90      1225       15
   3  1010       1   56    1        0        90         90        NA       15
   5   210       2   57    1        1        90         60      1150       11
   1   883       2   60    1        0       100         90        NA        0
  12  1022       1   74    1        1        50         80       513        0
   7   310       2   68    2        2        70         60       384       10
  11   361       2   71    2        2        60         80       538        1
   1   218       2   53    1        1        70         80       825       16
   7   166       2   61    1        2        70         70       271       34

Although built-in datasets are useful, you will need to import your own data at some point. There are several useful packages for importing data to R. Base R (the default installation) provides several functions for importing data. In addition, the readr and readxl packages are very useful. Both are installed with the tidyverse; readr is loaded by library(tidyverse), whereas readxl must be loaded separately with library(readxl).

Importing data to R

File formats

There are many file formats for rectangular data (data with columns and rows). Some software, e.g. SPSS, SAS, Stata and Excel, use their own file formats, which are not always readable by other programs. This can become an issue when transferring data between software. The best method for avoiding such issues is to use a universal file format, which is a file format where values are separated by a universal delimiter. Such files are referred to as delimited files or delimited text files.

Although it may sound complicated, it’s very simple. A delimited text file uses a delimiter to separate the values in each row. In other words, it is simply a text file which uses a delimiter to organize the data. Each line in a delimited text file corresponds to one row in a two-dimensional data table. Any character may be used to separate the values, but the most common delimiters are the comma, tab and semicolon.

CSV files (Comma Separated Values)

In a comma-separated values (CSV) file, the data items are separated using the comma sign as the delimiter. Column headers (i.e. variable names) can be included as the first line, and each subsequent line is a row of data. The lines are separated by line breaks.

In the following example, the fields in each record are delimited by commas, and the records by new lines.

month, patient, diagnosis
May, Adam, Diabetes
May, Uma, Rheumatoid arthritis
July, Robin, Heart disease

Note that the first row contains the column names (i.e. variable names). There are 3 columns (month, patient, diagnosis). Each subsequent row represents one person. The comma sign separates the columns.
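To make the structure concrete, the sketch below takes the example above as plain text and splits it by hand, first into records and then into fields. It is only an illustration of how a delimited file is organized; real import functions also handle quoting and other special cases. The object names (csv_text, records, fields) are our own.

```r
# The CSV example from the text, held in a character string
csv_text <- "month, patient, diagnosis
May, Adam, Diabetes
May, Uma, Rheumatoid arthritis
July, Robin, Heart disease"

# Each line of the file is one record
records <- strsplit(csv_text, "\n")[[1]]

# Within each record, the comma separates the fields;
# trimws() removes the space that follows each comma
fields <- lapply(strsplit(records, ","), trimws)

header <- fields[[1]]   # first line: column names
rows   <- fields[-1]    # remaining lines: one patient each

header
length(rows)
rows[[2]]
```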

Virtually any statistical software can import and export CSV files. Hence, CSV is a simple and safe way of storing and exporting data. One drawback of the CSV format is that it does not keep information on the variables themselves; e.g. in Excel you can define characteristics of a specific column (you can specify the variable as numeric, integer, date-time, etc.). This information cannot be stored in a CSV file and is therefore lost when exporting to CSV. You need to re-specify such variable types after importing a CSV file to R.
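The loss of type information can be seen with a small round trip. The sketch below uses made-up data and base R's write.csv() and read.csv(), so no extra packages are needed; the file is written to a temporary location.

```r
# A small data frame with a date column and a factor column (made-up data)
patients <- data.frame(
  patient = c("Adam", "Uma"),
  visit   = as.Date(c("2020-05-01", "2020-05-15")),
  sex     = factor(c("male", "female"))
)
class(patients$visit)    # "Date"

# Export to CSV and import again, using a temporary file
csv_file <- tempfile(fileext = ".csv")
write.csv(patients, csv_file, row.names = FALSE)
reimported <- read.csv(csv_file)

class(reimported$visit)  # no longer a Date: the type was not stored in the file

# Re-specify the variable types after importing
reimported$visit <- as.Date(reimported$visit)
reimported$sex   <- factor(reimported$sex)
class(reimported$visit)
```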

In most cases, you should prefer to obtain, store and export data in CSV format. There are, however, packages that enable you to import other file formats, such as Excel, Stata and SAS files.

In a tab-separated values (TSV) file, the data items are separated using tabs as the delimiter, instead of commas.

In summary: use CSV files.

Your working directory (wd)

Importing data to R requires that you know where the data is located. R can import data from your computer or from online data sources. Most users will just import data located on their computers. This is where your working directory comes in.

What is a working directory?

In the vast majority of cases, the working directory is simply the folder where most, if not all, files related to the project are located. This could, for example, be a folder on your desktop or elsewhere on your computer. It is recommended that you create a folder for your project and put all files related to the project in that folder. Keeping all your files in one folder will save you time and headaches going forward.

If you followed the recommendations regarding projects in R (refer to R Projects), then the working directory is equal to the project folder. We recommend that each research study is created as an R Project.

What is my working directory?

If you have created an R Project then your working directory is your project folder. If you need to, for any reason, check where the working directory is located, then type the following:

getwd()
[1] "/Users/Desktop/RForMedicalResearch"

If you need to change your working directory, then use the setwd() command, as follows:

setwd("/Users/Desktop/MyNewStudy")
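A slightly more careful pattern, sketched here under the assumption that you want to return to the original folder afterwards, is to store the current working directory before changing it. The folder tempdir() is used only so that the example runs on any computer.

```r
# Remember the current working directory so it can be restored later
old_wd <- getwd()

# Switch to another folder; tempdir() is a stand-in for your own folder
setwd(tempdir())
getwd()

# File names without a path now refer to files in the new working directory
list.files()

# Restore the original working directory
setwd(old_wd)
```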

Can I only read files from my working directory?

No, you can read files from any location on your computer. The difference is that files in your working directory can be referenced directly, without specifying the exact location (file path) on your computer. Let’s see two examples. We will import two files using the read_csv() function, which we will explain later.

# patients.csv is located in my working directory:
my_file_1 <- read_csv("patients.csv")

# diagnosis.csv is located in a folder called DiagnosisData on my desktop
my_file_2 <- read_csv("/Users/Desktop/DiagnosisData/diagnosis.csv")

As evident in this example, reading files in your working directory is easy since you only need to specify the file name. R will look for the file patients.csv inside your working directory. In the second example, we read the file diagnosis.csv from another folder, and therefore we must specify the exact file path.
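The difference between the two kinds of paths can be tried out without any existing files. The following sketch creates a small file in a temporary folder (a hypothetical stand-in for a project folder, here called MyStudy) and reads it once by file name and once by full path. It uses base R's read.csv() so that no package needs to be installed; read_csv() resolves paths in the same way.

```r
# Create a small CSV file in a temporary folder (stand-in for a project folder)
project_dir <- file.path(tempdir(), "MyStudy")
dir.create(project_dir, showWarnings = FALSE)
write.csv(data.frame(patient = c("Adam", "Uma")),
          file.path(project_dir, "patients.csv"), row.names = FALSE)

old_wd <- setwd(project_dir)   # make it the working directory

# 1. File name only: resolved against the working directory
by_name <- read.csv("patients.csv")

# 2. Full path: works regardless of the working directory
by_path <- read.csv(file.path(project_dir, "patients.csv"))

identical(by_name, by_path)

setwd(old_wd)                  # restore the previous working directory
```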

Importing data

We recommend the readr and readxl packages to import rectangular data to R. We will now install both packages:

install.packages("readr")
install.packages("readxl")

Then activate (load) them:

library(readr)
library(readxl)

The readr package loads many types of data. The readxl package reads only Excel files (xls, xlsx). Note the following:

  1. The readr package is part of the tidyverse, so if you load tidyverse, then readr is automatically loaded.
  2. Base R includes built-in functions that are very similar to those available in readr. However, readr is generally faster and more predictable.
  3. When the readr package imports data, it converts it to a special type of data frame called a tibble. A tibble is a data frame with a structure optimized for use with RStudio and other packages in the tidyverse. Nevertheless, a tibble is simply a data frame.
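The last point can be checked directly. In this sketch, inline CSV text with made-up values is passed to read_csv() instead of a file so that the example is self-contained; the object name blood is our own.

```r
library(readr)

# Inline CSV text is used instead of a file so the example is self-contained
blood <- read_csv("name,bloodvalue\nAdam,90.5\nJanet,80.1",
                  col_types = cols(name = col_character(),
                                   bloodvalue = col_double()))

class(blood)            # includes "tbl_df" (tibble) and "data.frame"
is.data.frame(blood)    # a tibble is still a data frame

# Ordinary data frame operations therefore work as usual
mean(blood$bloodvalue)
nrow(blood)
```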

readr includes the following parsers (a parser is a function for importing data):

  1. read_csv() for reading CSV files that use comma (,) as the separator. These files use period (.) as the decimal mark for numeric variables.
  2. read_csv2() for reading CSV files that use semicolon (;) as the separator. These files use comma (,) as the decimal mark for numeric variables.
  3. read_tsv() for reading tab-separated files.
  4. read_fwf() for reading fixed-width files.
  5. read_log() for reading web log files.
  6. read_delim() for reading files with any delimiter; the delimiter must be explicitly specified.

These functions all have similar syntax: once you’ve learned one, you can use the others with ease.
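To illustrate how similar the calls are, the sketch below reads the same two made-up records written with three different delimiters. Inline text is used in place of files, and the pipe character (|) is an arbitrary choice to demonstrate read_delim().

```r
library(readr)

# The same two records written with three different delimiters (made-up data)
comma_text <- "name,bloodvalue\nAdam,90.5\nJanet,80.1"
tab_text   <- "name\tbloodvalue\nAdam\t90.5\nJanet\t80.1"
pipe_text  <- "name|bloodvalue\nAdam|90.5\nJanet|80.1"

# One column specification shared by all three calls
types <- cols(name = col_character(), bloodvalue = col_double())

from_csv   <- read_csv(comma_text, col_types = types)
from_tsv   <- read_tsv(tab_text, col_types = types)
from_delim <- read_delim(pipe_text, delim = "|", col_types = types)

# All three calls yield the same columns and values
from_csv$bloodvalue
from_tsv$bloodvalue
from_delim$bloodvalue
```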

The following example shows a CSV file suitable for import using read_csv():

name, bloodvalue
Adam, 90.5
Janet, 80.1

Adam’s blood value is 90.5 and Janet’s is 80.1. Columns are separated using the comma (,) symbol and decimals use period (.). The next example shows a CSV file suitable for import using read_csv2():

name; bloodvalue
Adam; 90,5
Janet; 80,1

Decimal places for Adam’s and Janet’s blood values are denoted using the comma symbol (,).
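The two conventions can be compared side by side. In this sketch the same values are supplied as inline text in both formats; the object names english and european are our own labels for the two conventions.

```r
library(readr)

# Comma as separator, period as decimal mark
english <- read_csv("name,bloodvalue\nAdam,90.5\nJanet,80.1")

# Semicolon as separator, comma as decimal mark
european <- read_csv2("name;bloodvalue\nAdam;90,5\nJanet;80,1")

english$bloodvalue
european$bloodvalue

# Both imports should end up with the same numeric values
all.equal(english$bloodvalue, european$bloodvalue)
```

Reading a semicolon-separated file with read_csv() instead would typically put each whole line into a single column, which is a quick sign that the wrong parser was chosen.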

We will focus on the read_csv() function, which is the most commonly used parser for CSV files. To get help with the read_csv() function, run the command ?read_csv. The question mark (?) tells R that you want to view the documentation for the function, which is presented in the Help pane in RStudio.

Inspect the Usage heading. The documentation page covers four functions, read_delim(), read_csv(), read_csv2() and read_tsv(), because they perform the same task on differently delimited files. The Usage heading shows all the arguments that can be specified. The exact list depends on the installed version of readr; for read_csv() in the version shown here, the arguments are as follows:

read_csv(file, col_names = TRUE, col_types = NULL,
  locale = default_locale(), na = c("", "NA"), quoted_na = TRUE,
  quote = "\"", comment = "", trim_ws = TRUE, skip = 0,
  n_max = Inf, guess_max = min(1000, n_max),
  progress = show_progress(), skip_empty_rows = TRUE)

The first argument is file, which is simply the file path. If the file you wish to import is located in your current working directory, then you just type the file name. If the file is not located in your current working directory, then you must specify the full path to the file.

Example 1: import the file data.csv, which is located in the current working directory

heights <- read_csv(file="data.csv")

Example 2: import the file data.csv, which is located in a subfolder (myfiles) in my working directory:

# No leading slash: the path is relative to the working directory
heights <- read_csv(file = "myfiles/data.csv")

Example 3: import the file data.csv, which is located in the folder myfiles on my desktop:

heights <- read_csv(file="/Desktop/myfiles/data.csv")

In other words, if your specification in the argument file does not contain an absolute path, the file name is relative to the current working directory.

There are several things to note.

  1. read_csv() is a function. When you use R functions, you have to make sure that you have specified all arguments that must be specified. In most cases, only some of the arguments are mandatory. Those that are not mandatory will be set automatically to their default values by R. You can see what the default values are by viewing the specifications under the heading Usage; e.g. you’ll see that the argument col_names is set to TRUE by default, which means that R will set this argument to TRUE unless you specify something else.
  2. In the examples above, we are only supplying one input to the function, and that is the input to the file argument. We could have written read_csv("data.csv") instead of read_csv(file="data.csv"). Since we have only supplied one input, R will automatically assign it to the first argument, which is file, and therefore it would still have been correct. The following two examples would also provide identical results:
heights <- read_csv(file="data.csv", col_names = TRUE)
heights <- read_csv("data.csv", TRUE)

If you want to skip naming the arguments, then you must supply their input in the exact order that they appear in the Usage instructions. This is, however, not recommended.
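The reason unnamed arguments are discouraged can be shown with a small function of our own. The helper bmi() below is hypothetical and defined only to illustrate how R matches arguments by name and by position.

```r
# A hypothetical helper, defined only to illustrate argument matching
bmi <- function(weight_kg, height_m, digits = 1) {
  round(weight_kg / height_m^2, digits)
}

# Named arguments; digits is not supplied and falls back to its default
bmi(weight_kg = 70, height_m = 1.75)

# Same call with the inputs matched by position
bmi(70, 1.75)

# Named arguments can be given in any order
bmi(height_m = 1.75, weight_kg = 70, digits = 2)

# Unnamed arguments in the wrong order: R runs without complaint,
# but the result is wrong
bmi(1.75, 70)
```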

When you run read_csv() it will read the file and also print information on how it has defined the resulting column types. You should always check that the variables have been interpreted correctly.
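One way to check the column types, and to override them when the guess is not what you want, is sketched below. The data are made up and supplied as inline text rather than a file; the choice of a factor for sex and an integer for education is an assumption made for illustration.

```r
library(readr)

csv_text <- "name,bloodvalue,sex,education
Adam,90.5,male,1
Janet,80.1,female,2"

# Let readr guess the column types, then inspect what it decided
guessed <- read_csv(csv_text)
spec(guessed)

# State the types explicitly when the guess is not what you want
typed <- read_csv(
  csv_text,
  col_types = cols(
    name       = col_character(),
    bloodvalue = col_double(),
    sex        = col_factor(),
    education  = col_integer()
  )
)
sapply(typed, class)
```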

In the following example, we pass the CSV content to read_csv() directly as a character string instead of a file path, which is useful for testing purposes.

read_csv("name, bloodvalue, sex, education
Adam, 90.5, male, 1
Janet, 80.1, female, 2")