Pågår
avsnitt 8 av 17
Pågår

##### Johan Svensson mars 29, 2020

Numerous R packages have built in datasets. For example, in the survivalpackage you will find both the lung dataset (lung cancer study) and colon dataset (colon cancer study). In fact the default installation of R comes with a package called datasets which is automatically loaded when you start R. You can actually view all the datasets available in the datasets package by typing the following command:

data()

Which displays the following output (only first 22 datasets shown):

Built in datasets are meant to be used as practice data; e.g to learn plotting data, modelling data, creating functions etc. If you wish to use any of the datasets in the datasets package you need to load it using the data() function. For example, this will load the lung dataset:

This will load the dataset but it will remain inactive until you actually use it. You will notice the note <Promise> on the right hand side of the AirPassengers dataset in the environment pane, as shown below:

Once you use the dataset it will be loaded and active. Don’t worry about the <Promise> note, just go ahead and use the dataset as desired.

IF you wish to use a dataset located in a package, you need to load the package first. For example, the lung dataset is located in the survivalpackage. These commands will load the survival package, then load the lung dataset and finally print a summary of the variables in the datset:

##       inst            time            status           age
##  Min.   : 1.00   Min.   :   5.0   Min.   :1.000   Min.   :39.00
##  1st Qu.: 3.00   1st Qu.: 166.8   1st Qu.:1.000   1st Qu.:56.00
##  Median :11.00   Median : 255.5   Median :2.000   Median :63.00
##  Mean   :11.09   Mean   : 305.2   Mean   :1.724   Mean   :62.45
##  3rd Qu.:16.00   3rd Qu.: 396.5   3rd Qu.:2.000   3rd Qu.:69.00
##  Max.   :33.00   Max.   :1022.0   Max.   :2.000   Max.   :82.00
##  NA's   :1
##       sex           ph.ecog          ph.karno        pat.karno
##  Min.   :1.000   Min.   :0.0000   Min.   : 50.00   Min.   : 30.00
##  1st Qu.:1.000   1st Qu.:0.0000   1st Qu.: 75.00   1st Qu.: 70.00
##  Median :1.000   Median :1.0000   Median : 80.00   Median : 80.00
##  Mean   :1.395   Mean   :0.9515   Mean   : 81.94   Mean   : 79.96
##  3rd Qu.:2.000   3rd Qu.:1.0000   3rd Qu.: 90.00   3rd Qu.: 90.00
##  Max.   :2.000   Max.   :3.0000   Max.   :100.00   Max.   :100.00
##                  NA's   :1        NA's   :1        NA's   :3
##     meal.cal         wt.loss
##  Min.   :  96.0   Min.   :-24.000
##  1st Qu.: 635.0   1st Qu.:  0.000
##  Median : 975.0   Median :  7.000
##  Mean   : 928.8   Mean   :  9.832
##  3rd Qu.:1150.0   3rd Qu.: 15.750
##  Max.   :2600.0   Max.   : 68.000
##  NA's   :47       NA's   :14

For each variable the summary() function provides seven parameters. This is informative but not that tidy and neat that we would wish for a publication. Don’t worry, there is a package for that. You can create a descriptive table in just two lines of code, as follows:

#install.packages("table1")
library(table1)
##
## Attaching package: 'table1'
## The following objects are masked from 'package:base':
##
##     units, units<-
# Show descriptive data for age, status and time, according to sex
table1(~ age + status + time | sex, data=lung)
1
(n=138)
2
(n=90)
Overall
(n=228)
age
Mean (SD)63.3 (9.14)61.1 (8.85)62.4 (9.07)
Median [Min, Max]64.0 [39.0, 82.0]61.0 [41.0, 77.0]63.0 [39.0, 82.0]
status
Mean (SD)1.81 (0.392)1.59 (0.495)1.72 (0.448)
Median [Min, Max]2.00 [1.00, 2.00]2.00 [1.00, 2.00]2.00 [1.00, 2.00]
time
Mean (SD)283 (213)339 (203)305 (211)
Median [Min, Max]224 [11.0, 1020]293 [5.00, 965]256 [5.00, 1020]

However, this chapter is not about generating descriptive data so let’s get back to importing data. Although built in datasets are useful you will need to import your own data at some point. There are several useful packages for importing data to R. Base R (the default installation) does provide several functions for importing data. In addition the readr package and readxl package are also very useful and they are both included in the tidyverse.

## Prerequisites

### File formats

There are many file formats for rectangular data (data with columns and rows). Such data may also be called flat files since it is two-dimensional (rows and columns). Some software, such as SPSS, SAS, STATA, Excel etc, use specific file formats which cannot be used by other programs. This can become an issue when transfering data between software. The best method for avoiding such issues is to use a universal file format, which is a file format where values are separated by a delimiter. Such files are referred to as delimiter files or delimited text files. Although it may sound complicated, it’s extremely simple. A delimited text file uses a delimiter to separate the values in each row. In other words, it is simply a text file which uses a delimiter to organize the data. Each line in a delimited text file corresponds to one row in a two-dimensional data table. Any character may be used to separate the values, but the most common delimiters are the commatab and semicolon:

#### CSV files (Comma Separated Values)

In a comma separated values (CSV) file the data items are separated using commas as a delimiter, while in a tab-separated values (TSV) file, the data items are separated using tabs as a delimiter. Column headers (i.e variable names) can be included as the first line, and each subsequent line is a row of data. The lines are separated by newlines.

For example, the following fields in each record are delimited by commas, and each record by new lines.

date, patient, diagnosis
25 May, Uma, Rheumatoid arthritis
15 July, Robin, Heart disease

Note that the first row represents the column names (i.e variable names).

Virtually any statistical software can import and export CSV files. Hence, CSV is a pure, simple and safe way of keeping and transfering data. One drawback of the CSV format is that it does not keep information on the variables themselves; e.g in Excel you can define characteristics of a specific column (you can specify the variable as numeric, integer, date time etc). This information cannot be stored in a CSV file and is therefore lost when exporting to CSV. Moreover, you need to re-specify such variable definitions after importing the CSV file.

The above CSV file would be imported to the following table in R:

datepatientdiagnosis
25 MayUmaRheumatoid arthritis
15 JulyRobinHeart disease

You should always prefer to obtain data in CSV format and export data to CSV when you transfer data. The CSV format is one of the most common forms of data storage. There are, however, packages that enable you to import other file formats, such as Excel files, SAS files etc.

Importing data to R requires that you know where the data is located. R can import data from your computer or from online data bases. Most users will just import data located on their computers. This is where your working directory comes in.

What is a working directory? For the vast majority of data analysts, the working directory is simply the folder were all files related to the project are located. This could for example be a folder on your desktop. It is recommended that you create a folder for your project and put all files related to the project in that folder. That folder could include the data files, a log file (where you can keep a diary of what you’re doing) and other files related to the project. Keeping all your files in one place will save you time and headache going forward.

Start by freating a folder called MyProject (e.g on your desktop). Then we will tell R where that folder (i.e MyProject) is located. This is done using the function setwd(), which is short for set working directory, as follows:

setwd("/Users/Desktop/MyProject")

The file path may look different on your computer, depending on your system. To make sure that R now uses the correct working directory, you use the getwd() function, which returns the path to the current working directory:

getwd()

#> "/Users/Desktop/MyProject"

If you want to change your working directory to a folder located within MyProject, you simply specify that in the path. For example, let’s say you create a new folder within the MyProject folder and that new folder is named Data files, which you want to use as your working directory. To do this you simply write the following command:

setwd("/Users/Desktop/MyProject/Data files")

### Must you use a working directory?

No, but it is recommended that you do so. You can, however, read files from anywhere on your computer by specifying the full path to that file. Hence, you can set a working directory but still read files from other folders on your computer.

## Getting started

In this chapter, you’ll learn how to import rectangular data files in R with the readr package and readxl package, which are part of the core tidyverse. If you haven’t already, start by installing and loading tidyverse.

#install.packages("tidyverse")
library(tidyverse)

readr imports flat files and converts them to a special type of data frame called tibble. A tibble is a data frame with a structure suitable for use with other tidyverse packages. For most purposes a tibble is simply a data frame. readr includes the following main parsers (functions for importing data):

• read_csv() for reading csv files that use comma (,) as the separator. These files use period (.) as the decimal place for numeric variables.
• read_csv2() for reading csv files that use semicolon (;) as the separator. These files use comma (,) as the decimal place for numeric variables.
• read_tsv() for reading tabs separated files
• read_fwf() for reading fixed-width files
• read_log() for reading web log files
• read_delim() reads in files with any delimiter, but the delimiter must be explicitly specified.

These functions all have similar syntax: once you’ve learned one, you can use the others with ease.

The following example shows a CSV file suitable for import using read_csv():

name, bloodvalue,
Janet, 80.1,

Adam’s blood value is 90.5 and Janet’s is 80.1. Columns are separated using the comma (,) symbol and decimals use period (.). The next example shows a CSV file suitable for import using read_csv2():

name; bloodvalue;
Janet; 80,1;

Decimal places for Adam’s and Janet’s blood values are denoted using the comma symbol (,).

Henceforth, we will focus on read_csv() since knowing this function is mandatory and will also enable you to use other importing functions (parsers). To get help with the read_csv() function you simply run the command ?read_csv(). The question mark (?) tells R that you want to view the instructions for the function. Here is the full specification of the read_csv() function. In the Help pane you’ll see the following:

Note the Usage heading, which shows you all the arguments that can be specified:

read_csv(file, col_names = TRUE, col_types = NULL,
locale = default_locale(), na = c("", "NA"), quoted_na = TRUE,
quote = "\"", comment = "", trim_ws = TRUE, skip = 0,
n_max = Inf, guess_max = min(1000, n_max),
progress = show_progress(), skip_empty_rows = TRUE)

The first argument is file, which is simply the file path. If the file you wish to import is located in your current working directory, then you just type the file name. If the file is not located in your current working directory, then you must specify the full path to the file.

Example 1: import the file data.csv, which is located in the current working directory

heights <- read_csv(file="data.csv")

Example 2: import the file data.csv, which is located in a subfolder called myfiles, which is located in the current working directory:

# Data is located in the folder "My data" on the desktop
heights <- read_csv(file="/myfiles/data.csv")

Example 3: import the file data.csv, which is located in a folder on my desktop:

# Data is located in the folder "My data" on the desktop
heights <- read_csv(file="/Desktop/data.csv")

In other words, if your specification in the argument file does not contain an absolute path, the file name is relative to the current working directory.

There are several things to note.

1. When you use R functions, you have to make sure that you have specified all arguments that must be specified. In most cases, only some of the arguments are mandatory to specify. Those who are not mandatory to specify will be set automatically to their default settings by R. You can see what the default settings are by viewing the function specification under the heading Usage. Refer to the Usage instructions, you’ll see that col_names is set to TRUE as the default settings, which means that R will let the first row of the input data be used as the column names, and will not be included in the data frame.
2. In the examples above, we are only supplying one input to the function, and that is the input to the file argument. We could have written read_csv("data.csv") instead of read_csv(file="data.csv"). Since we have only supplied one input, R will automatically assign that to the frist argument, which is file and therefore it would still have been correct. These two following examples would also provide identical results:
heights <- read_csv(file="data.csv", col_names = TRUE)
heights <- read_csv("data.csv", TRUE)

If you want to skip naming the arguments, then you must supply their input in the exact order that they appear in the Usage instructions.

When you run read_csv() it prints out a column specification that gives the name and type of each column. It is important that you check that readr has interpreted the variables correctly.

In the following example, we will actually give read_csv() a CSV file directly (inline CSV), which is useful for testing purposes.

read_csv("name, bloodvalue, sex, education
Janet, 80.1, female, 2")
## # A tibble: 2 x 4
##   name  bloodvalue sex    education
##   <chr>      <dbl> <chr>      <int>
## 1 Adam        90.5 male           1
## 2 Janet       80.1 female         2

Note the following:

• read_csv() used the first line of the data for the column names.
• the columns name and sex are classified as <chr> which means character.
• the column bloodvalue is classified as <dbl>which means double. Numeric variables with decimals are referred to as double.
• the column education is classified as <int>which means integer. Numeric variables without decimals are integers.

### When the first row is not column names

The data might not have column names. You can use col_names = FALSE to tell read_csv() not to treat the first row as headings, and instead label them sequentially from X1 to Xn:

read_csv("Adam, 90.5, male, 1
Janet, 80.1, female, 2", col_names=FALSE)
## # A tibble: 2 x 4
##   X1       X2 X3        X4
##   <chr> <dbl> <chr>  <int>
## 1 Adam   90.5 male       1
## 2 Janet  80.1 female     2

You can also pass col_names a character vector which will be used as the column names:

read_csv("Adam, 90.5, male, 1
Janet, 80.1, female, 2", col_names=c("name", "bloodvalue", "sex", "education"))
## # A tibble: 2 x 4
##   name  bloodvalue sex    education
##   <chr>      <dbl> <chr>      <int>
## 1 Adam        90.5 male           1
## 2 Janet       80.1 female         2

### Specifying missing values

Most software for data analysis use some symbol for missing data. Missing data must always be explicitly classified as missing. R uses the symbol NA (Not Available) to indicate that a value is missing. Let’s read in the same data again but set Adam’s blood value to missing:

read_csv("name, bloodvalue, sex, education
Janet, 80.1, female, 2")
## # A tibble: 2 x 4
##   name  bloodvalue sex    education
##   <chr>      <dbl> <chr>      <int>
## 1 Adam        NA   male           1
## 2 Janet       80.1 female         2

The function detected that the blood value on the first row (Adam) was missing and set it to NA. If the data you’re importing has already specified missing data with a symbol or character, then you can pass that to the na argument of the read_csv() function. In the following example, the character string “MISSING” represents missing data and we will tell read_csv() that this is the case:

read_csv("name, bloodvalue, sex, education
Janet, 80.1, female, 2", na="MISSING")
## # A tibble: 2 x 4
##   name  bloodvalue sex    education
##   <chr>      <dbl> <chr>      <int>
## 1 Adam        NA   male           1
## 2 Janet       80.1 female         2

This is all you need to know to read most CSV files. You can also easily apply what you’ve learned to the other functions (e.g read_tsv()).

### Specifying the separator

CSV files may use other separators, such as semicolon. If that is the case then you need to use the read_delim() function and specify the separator, which in the following case is semicolon (;):

read_delim("name; bloodvalue; sex; education
Janet; 80.1; female; 2", delim=";")
## # A tibble: 2 x 4
##   name   bloodvalue  sex     education
##   <chr> <chr>         <chr>     <chr>
## 1 Adam  " 90.5"       " male"   " 1"
## 2 Janet " 80.1"       " female" " 2"

The function detected that the blood value on the first row (Adam) was missing and set it to NA. If the data you’re importing has already specified missing data with a symbol or character, then you can pass that to the na argument of the read_csv() function. In the following example, the character string “MISSING” represents missing data and we will tell read_csv() that this is the case:

read_csv("name, bloodvalue, sex, education
Janet, 80.1, female, 2", na="MISSING")
## # A tibble: 2 x 4
##   name  bloodvalue sex    education
##   <chr>      <dbl> <chr>      <int>
## 1 Adam        NA   male           1
## 2 Janet       80.1 female         2

This is all you need to know to read most CSV files. You can also easily apply what you’ve learned to the other functions (e.g read_tsv()).

### readr compared to base R

There is actually built-in functions to read CSV files. The base function is read.csv(). The reasons why we do not use read.csv() is the following: * The built-in functions are slower than readr. * readr provides a progress bar for large fiels so that you can see the import progress. * If speed is the most important aspect, see the fread() function in the data.table package.

• They produce tibbles, they don’t convert character vectors to factors, use row names, or munge the column names. These are common sources of frustration with the base R functions.

## Other types of data

To get other types of data into R, we recommend the packages listed below:

• haven reads SPSS, Stata, and SAS files.
• readxl reads excel files (both .xls and .xlsx).
• DBI, along with a database specific backend (e.g. RMySQLRSQLiteRPostgreSQL etc) allows you to run SQL queries against a database and return a data frame.

For hierarchical data: jsonlite for json, and xml2 for XML. Examples at https://jennybc.github.io/purrr-tutorial/.

5/5 (1 Review)