Chapter 10 Reading Data Files into R
All of the files storing data that we will read into R will be assumed to be text files.
Text files
Files with human readable characters such as letters, numbers, special characters (e.g., .red[“, ’, ?]), typically composed of multiple lines.
Can represent different characters by text encodings
There are many types of text files, e.g.,
-R files (.R)
-html files (.html) / cascading style sheet (.css)
-text files (.txt)
10.1 Downloading a file locally
Need to download a file to your local computer
To test out reading in datasets into R that you have stored locally on your computer, we will first download the gapminder dataset.
When you download a file, make sure you know which folder it is in and that you can find the file path for the downloaded file.
To load the gapminder dataset into R, you can use the read.csv function.
Before running
read.csv
, it is often convenient to first set the working directory in R first usingsetwd
This allows you to load in all the files from the folder containing your data files without typing in the full file path name every time
For example, if the gapminder data was stored in a folder with path “~/IntrotoR/Data, we could first type
- After running
setwd
, we can now load in the gapminder dataset (stored asgapminder_full.csv
on my computer) usingread.csv
To import a .csv file (or any other text file) in Rstudio, you can also click on the Import Dataset button on the top right panel.
This lets you view how the data would appear before it is loaded into R.
The read.table() function is the more general function for reading in rectangular data stored in text files.
- You can read in other types of text files: e.g., .tsv, .txt
read.table() can do the same thing as read.csv(). The main difference is that the default settings of read.csv() are setup to handle .csv files nicely.
For example, we could read in the gapminder data with read.table() using:
## country year population continent life_exp gdp_cap
## 1 Afghanistan 1952 8425333 Asia 28.801 779.4453
## 2 Afghanistan 1957 9240934 Asia 30.332 820.8530
## 3 Afghanistan 1962 10267083 Asia 31.997 853.1007
## 4 Afghanistan 1967 11537966 Asia 34.020 836.1971
## 5 Afghanistan 1972 13079460 Asia 36.088 739.9811
## 6 Afghanistan 1977 14880372 Asia 38.438 786.1134
- You can look at the first few rows of a loaded data frame using
head
:
## [1] 1704 6
## country year population continent life_exp gdp_cap
## 1 Afghanistan 1952 8425333 Asia 28.801 779.4453
## 2 Afghanistan 1957 9240934 Asia 30.332 820.8530
## 3 Afghanistan 1962 10267083 Asia 31.997 853.1007
## 4 Afghanistan 1967 11537966 Asia 34.020 836.1971
## 5 Afghanistan 1972 13079460 Asia 36.088 739.9811
## 6 Afghanistan 1977 14880372 Asia 38.438 786.1134
## 7 Afghanistan 1982 12881816 Asia 39.854 978.0114
## 8 Afghanistan 1987 13867957 Asia 40.822 852.3959
## [1] "Afghanistan" "Afghanistan" "Afghanistan" "Afghanistan" "Afghanistan"
- Looking at the structure of the data frame using
str()
## 'data.frame': 1704 obs. of 6 variables:
## $ country : chr "Afghanistan" "Afghanistan" "Afghanistan" "Afghanistan" ...
## $ year : int 1952 1957 1962 1967 1972 1977 1982 1987 1992 1997 ...
## $ population: int 8425333 9240934 10267083 11537966 13079460 14880372 12881816 13867957 16317921 22227415 ...
## $ continent : chr "Asia" "Asia" "Asia" "Asia" ...
## $ life_exp : num 28.8 30.3 32 34 36.1 ...
## $ gdp_cap : num 779 821 853 836 740 ...
- Reading strings as non-factors:
## country year population continent life_exp gdp_cap
## 1 Afghanistan 1952 8425333 Asia 28.801 779.4453
## 2 Afghanistan 1957 9240934 Asia 30.332 820.8530
## 3 Afghanistan 1962 10267083 Asia 31.997 853.1007
## 4 Afghanistan 1967 11537966 Asia 34.020 836.1971
## 5 Afghanistan 1972 13079460 Asia 36.088 739.9811
## 6 Afghanistan 1977 14880372 Asia 38.438 786.1134
## [1] "Afghanistan" "Afghanistan" "Afghanistan" "Afghanistan" "Afghanistan"
- You can skip lines when you read in a text file.
- This is useful if you don’t want to read in the “header” line of the .csv file:
## V1 V2 V3 V4 V5 V6
## 1 Afghanistan 1952 8425333 Asia 28.801 779.4453
## 2 Afghanistan 1957 9240934 Asia 30.332 820.8530
## 3 Afghanistan 1962 10267083 Asia 31.997 853.1007
## 4 Afghanistan 1967 11537966 Asia 34.020 836.1971
## 5 Afghanistan 1972 13079460 Asia 36.088 739.9811
## 6 Afghanistan 1977 14880372 Asia 38.438 786.1134
- Or, skip the header and the first two rows of the .csv file
## V1 V2 V3 V4 V5 V6
## 1 Afghanistan 1962 10267083 Asia 31.997 853.1007
## 2 Afghanistan 1967 11537966 Asia 34.020 836.1971
## 3 Afghanistan 1972 13079460 Asia 36.088 739.9811
## 4 Afghanistan 1977 14880372 Asia 38.438 786.1134
## 5 Afghanistan 1982 12881816 Asia 39.854 978.0114
## 6 Afghanistan 1987 13867957 Asia 40.822 852.3959
10.2 Opening a remote file using read.table()
- You can read in a dataset directly from a URL:
# Read in ignoring the fact that there is a header line
urlname <- paste("https://web.stanford.edu/~hastie/",
"ElemStatLearn/datasets/bone.data", sep="")
bone_dat <- read.table(urlname)
head(bone_dat, 4)
## V1 V2 V3 V4
## 1 idnum age gender spnbmd
## 2 1 11.7 male 0.01808067
## 3 1 12.7 male 0.06010929
## 4 1 13.75 male 0.005857545
# Set header=TRUE to read in top line as the variable names
bone_dat2 <- read.table(urlname, header=TRUE)
head(bone_dat2, 4)
## idnum age gender spnbmd
## 1 1 11.70 male 0.018080670
## 2 1 12.70 male 0.060109290
## 3 1 13.75 male 0.005857545
## 4 2 13.25 male 0.010263930