Chapter 1 Getting Started

1.1 R and RStudio

  • R

    • Free software developed by R Core Team
    • Available at https://www.r-project.org/
    • Software and packages are managed by the nonprofit organization “R Foundation”
  • RStudio

    • An integrated development environment (IDE) for programming in R.
    • Provides many add-ons to R available in a single interface.
    • Developed by RStudio, Inc.
    • Available in both free (AGPLv3) and commercial editions at https://www.rstudio.com

  • R and RStudio are separate things.

    • You should install R first before RStudio
  • R

  • RStudio

  • When you open Rstudio, it should look something like:

  • The left-hand panel of RStudio is where you can type in R code directly.

  • For example, we can treat R as a calculator and add and multiply numbers by typing them directly in the left-hand panel.

  • Typing and running R code line-by-line like this is referred to as using R in interactive mode.

  • When writing more complex code that you can reuse, it is usually better to write it in a separate file such as an R script (this type of file ends in .R).

  • To create a new R script, go to File –> New File –> R script in Rstudio.


  • As an example of writing and running R scripts, let’s write an R script that will simply print out the message “Hello World” whenever we run the script.

  • To do this we just write the following R code in the empty R script/l

"Hello World"

  • Before running the script, you can save the file as “hello_world.R”.

  • To run the script, just click the “Run” button located at the top right of your R script.

  • The message “Hello World” should appear in the R console below:

1.2 An Extended Example: the NYC flights data

  • To illustrate some of the capabilities of R for exploring and summarizing data, we will look at the “NYC flights” dataset.

  • This is a dataset that contains information on flights that departed from the New York City region in 2013.

  • This dataset is available in an R package called “nycflights13”

1.2.1 Installing R packages

  • To use an R package, you must first install it.

  • Installing the nycflights13 package can be done with the following command:

install.packages("nycflights13")
  • Note that, if a package has been installed previously, you don’t need to install it again in order to use it.

  • Once an R package has been installed, you can “load” it into your R session with the library function:
library(nycflights13)
  • Running the library command just makes the datasets in the nycflights13 package available for you to use in your R session.

1.2.2 NYC flights data details

  • There are 5 datasets in the nycflights13 package: airlines, airports, flights, planes, weather

  • Let’s first look at the planes dataset.

  • This dataset is stored as a data frame in R.

    • Using a dataframe is the most standard way to store a dataset in R
  • An R data frame has a certain number of rows (which usually represent different observations) and columns (which usually represent different variables).

    • I will refer to the variables in a data frame as “data variables”.

    • This is to distinguish it from R variables that you can create in your R session.


  • The planes data frame has 3322 rows and 9 columns.

  • The number of rows and columns of a data frame can be found by using the dim function.

dim(planes)
## [1] 3322    9
  • Each row of the planes data frame contains information about a specific airplane.

  • You can look at the contents of the first 6 rows of a data frame by using the head function

head(planes)
## # A tibble: 6 × 9
##   tailnum  year type               manufacturer model engines seats speed engine
##   <chr>   <int> <chr>              <chr>        <chr>   <int> <int> <int> <chr> 
## 1 N10156   2004 Fixed wing multi … EMBRAER      EMB-…       2    55    NA Turbo…
## 2 N102UW   1998 Fixed wing multi … AIRBUS INDU… A320…       2   182    NA Turbo…
## 3 N103US   1999 Fixed wing multi … AIRBUS INDU… A320…       2   182    NA Turbo…
## 4 N104UW   1999 Fixed wing multi … AIRBUS INDU… A320…       2   182    NA Turbo…
## 5 N10575   2002 Fixed wing multi … EMBRAER      EMB-…       2    55    NA Turbo…
## 6 N105UW   1999 Fixed wing multi … AIRBUS INDU… A320…       2   182    NA Turbo…
  • The planes data frame has 9 variables.

    • tailnum: The tail number of the plane. This number is a unique identifier for each plane.
    • year: The year the plane manufactured.
    • type: The type of plane.
    • manufacturer: The manufacturer of the plane.
    • model: The model of the plane.
    • engines: The number of engines that the plane has.
    • seats: The number of seats that the plane has.
    • speed: Average cruising speed in mph.
    • engine: Type of engine.
  • Running the command help(planes) can give more information about this dataset.

1.2.3 Summarizing specific data variables

  • You can access individual variables from planes by using the $ operator.

  • For example, if we want to assign the values in the year column into a new R variable named plane_year, we do the following:

plane_year <- planes$year
  • After running the above line of code, plane_year is an R vector that has 3322 elements.

  • The length function tells us how many elements are in a vector

length(plane_year)
## [1] 3322
  • We can look at the first x elements of plane_year by using the syntax plane_year[1:x].

  • For example, let’s look at the first 5 elements of plane_year:

plane_year[1:5]
## [1] 2004 1998 1999 1999 2002

  • We can get a count of how many times each value of year occurs by using the table function
table(plane_year)
## plane_year
## 1956 1959 1963 1965 1967 1968 1972 1973 1974 1975 1976 1977 1978 1979 1980 1983 
##    1    2    2    1    1    1    1    1    1    3    3    2    2    4    4    1 
## 1984 1985 1986 1987 1988 1989 1990 1991 1992 1993 1994 1995 1996 1997 1998 1999 
##    5   23   17   40   75   60   90  108  109   59   48   54   55   74  174  206 
## 2000 2001 2002 2003 2004 2005 2006 2007 2008 2009 2010 2011 2012 2013 
##  244  284  212  150  192  162  126  123  147   84   48   66   95   92
  • The above R output says that 147 of the planes in the planes data frame were manufactured in 2008 and 92 planes in the planes data frame were manufactured in 2013.

  • The table function is useful for data variables that have a relatively small number of distinct values.

  • For numeric data variables that are better thought of as continuous variables, one often summarizes these data variables by looking at things like the mean, median, or standard deviation.

  • Using the summary function on a single data variable gives you a useful “six-number summary” about that data variable:

summary(planes$seats)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##     2.0   140.0   149.0   154.3   182.0   450.0

  • Another dataset available in the nycflights13 package is the weather data frame.

  • This data frame has 26115 rows and 15 columns.

dim( weather )
## [1] 26115    15
  • You can output all of the data variable names by using the names function:
names( weather )
##  [1] "origin"     "year"       "month"      "day"        "hour"      
##  [6] "temp"       "dewp"       "humid"      "wind_dir"   "wind_speed"
## [11] "wind_gust"  "precip"     "pressure"   "visib"      "time_hour"
  • One of the data variables is month. This just records what month the weather observation was made in
table( weather$month )
## 
##    1    2    3    4    5    6    7    8    9   10   11   12 
## 2226 2010 2227 2159 2232 2160 2228 2217 2159 2212 2141 2144

1.2.4 Subsetting Data

  • An important part of many data analyses is looking at data summaries of specific subsets of interest.

  • To create a new data frame which is a subset of the original data frame, you can use the subset function.

  • For example, if we only want to look at weather in the month of January, we can a new data frame which only contains January observations (where month equals 1)

JanuaryWeather <- subset(weather, month==1)
  • The JanuaryWeather data frame has 2226 observations
dim( JanuaryWeather )
## [1] 2226   15
  • The average temperature over the month of January is
mean( JanuaryWeather$temp )
## [1] 35.63566

  • One can take more complex subsets of a data frame by using logical expressions in the second argument of the subset function.

  • For example, if you wanted to create a data frame that only has observations in February that are above 40 degrees Fahrenheit, you could use the following code:

FebAbove40 <- subset(weather, month==2 & temp > 40)

1.2.5 Plotting Data

  • R has many functions that can aid data visualization.

  • For example, you can create a simple histogram of the temperature variable by using the vector weather$temp inside the hist function:

hist( weather$temp )

  • You could create separate boxplots of temperature for each month by using the modeling syntax temp ~ month within the boxplot function:
## Use x-axis label "Month" and y-axis label "Temperature" in the figure:
boxplot(temp ~ month, data=weather, xlab="Month",
        ylab="Temperature")

1.3 Using R as a calculator

  • When first starting with R, it can be helpful to note that R can be used as a basic calculator.

  • For example, if we just type in 42 + 17 into the R console, it should print out the sum:

42 + 17 
## [1] 59
  • We can compute the square root of 243, \(1.56^{124}\), and \(7.21 \times 8^{4}\), just by typing these expressions into the R console
sqrt(243)
## [1] 15.58846
1.56*124
## [1] 193.44
7.21*8^4
## [1] 29532.16

1.4 Variables in R

  • When starting to work with more complicated mathematical operations in R, it is often useful to store intermediate values in named variables instead of using R as a calculator in interactive mode.

  • For example, the following R code creates the variables x, y, z and assigns them the values \((42 + 17)\sqrt{43}\), \(7.21(8^{4}) + \ln(2.34)\), and
    \((42 + 17)\sqrt{43}/[ 7.21(8^{4}) + \ln(2.34) ]\) respectively.

x <- (42 + 17)*sqrt(43)
y <- 7.21*8^4 + log(2.34)
z <- x/y
z  ## print out the value of z
## [1] 0.01310022
  • Here, x, y, and z are examples of variables.

  • The pair of characters <- used together is known as the assignment operator in R. x <- 2 assigns the value 2 to the variable x.


  • In general, a variable is the named storage of a value (or an object) in memory.

  • Why do we need variables?

    • To reuse the same value later on.
    • To generalize an expression to use in many cases.
  • How to use variables in R?

    • To set the value of a variable, use assignment operator <-

    • To use the value, simply use the variable name as if it were its stored value.

  • For example, …

1.4.1 Rules for choosing variable names in R

  • Variables can be named however you want as long as you follow the several variable-naming rules that R has.

  • In R variable names can include the following:

    • letters: A-Z a-z
    • digits: 0-9
    • underscore and period: _ .
  • Additional rules:

    • Variable names must start with letters or a period (not underscore or digits)
    • If a variable name starts with a period, it cannot be followed by a number.
    • Variable names are case sensitive.
  • The following tables shows examples of valid and invalid variable names in R

Valid Invalid
i 2things
my_variable location@
answer42 _user.name
.name .3rd
  • While you are free to choose variable names however you like as long as you follow the variable-naming rules of R, making variable names descriptive is highly recommended.

  • Descriptive variable names make it easier to read code. This is very helpful if:

    • You are sharing your code or
    • Looking back at code you wrote many weeks/months ago
  • Using a consistent convention for naming variables is recommended:

https://r4ds.had.co.nz/workflow-basics.html

1.4.2 Variable Assignment

  • Variables can be assigned using either <- or =
x = 123    # Use = to assign a variable
y <- 123   # Or use <- to assign a variable
x   # Retrieve the value of x
## [1] 123
y   # Retrieve the value of y
## [1] 123
  • The pair of characters <- is the classic symbol used for variable assignment in R.

  • The use of <- instead of = is often recommended in R style guides:


  • <- and = will work the same if they are both used in the “usual way” (when assigning variables within or outside of a function).

  • One exception, is when used inside a function call. For example, if we use = in the function sd(x):

sd(x = c(1,2,3,4,5)) # only sets the argument x in sd(x) to (1,2,3,4,5)
## [1] 1.581139
#x      ## will return an error if we try to print x
sd(x <- c(1,2,3,4,5)) # This actually assigns the vector (1,2,3,4,5) to x
## [1] 1.581139
x
## [1] 1 2 3 4 5
  • However, using something like sd(x <- c(1,2,3,4,5)) where we assign variables in a function call is not really done that often.

  • It is not common to assign variables in a function call (I never do it).

  • Whenever, using a function f with a keyword such as x, you will generally want to call that function using f(x = ...)

  • So, in my opinion, there is not really a strong reason to prefer using <- over = for assignment.

  • There are other justifications for using <- such as the ability to do assignment from the left by using the reverse symbol ->

c(1, 2, 3, 4) -> a # Using c(1,2,3,4) = a will not work!  
a
## [1] 1 2 3 4

1.4.3 Types of variables

  • Variables can be used to store different types of values.

  • Common types include numeric, text, and logical values.

x <- 3.2
x
## [1] 3.2
  • Here, x is actually a vector (basically a collection of elements storing the same type of data).

  • It is a vector of length one (i.e., it only has one element).

  • This is the reason why you see [1] printed out next to the number 3.2.

    • This means that the first element of the vector x is \(3.2\).
  • R treats every variable as some type of collection (e.g., vectors, matrices, lists, etc.).

    • There are no separate data types in R for individual numbers.
  • The elements in a vector can have different types (or modes).

  • You can find the types of the elements in a vector by using the function typeof

y <- sqrt(1743)
typeof(y)  # double and integer are the two numeric types 
## [1] "double"
z <- 3 # R automatically treats every number as double
z
## [1] 3
typeof(z)
## [1] "double"
  • The other common types for the elements in a vector include
    • logical (TRUE or FALSE) values
    • character basically text, e.g., “hello”, “car”, …
y <- TRUE
typeof(y)   
## [1] "logical"
z <- "dog" # to define a character variable, place it inside quotes
typeof(z)
## [1] "character"
  • We will discuss these types in more detail later on when we discuss vectors, matrices, and lists.

1.5 R Operations with numbers

  • As we mentioned before, …
Operator Meaning Example Result
+ addition 5 + 8 13
- subtraction 90 - 10 80
* multiplication 4 * 7 28
/ division 7 / 2 3.5
%% remainder 7 %% 2 1
^ exponent 3 ^ 4 81
** exponent 3 ** 4 81
  • R operations with numbers have similar precedence rules to arithmetic operations
Operator Description Precedence
+, - addition and subtraction low
*, /, %% multiplication, division, remainder
**, ^ exponentiation
(expressions…) Parenthesis high
  • Examples of operation precedence can be seen when typing the following expressions into the R console:
1 + 2 *3 ^ 4 # power > mult/div > add/sub
## [1] 163
(1 + 2 ) *3 ^ 4 # parenthesis > power
## [1] 243

1.6 Brief introduction to vectors in R

  • The vector is probably the most fundamental data structure in R.

  • A vector is essentially a collection of elements that all have the same “type”.

    • For example, a vector can be composed of a collection of numbers or a collection of characters.

    • However, a vector cannot contain both numbers and characters.


  • As an example, we can create a vector named x that contains the numbers 1, 7, and 4.

    • This is done with the following R code:
x <- c(1, 7, 4)
  • The variable x is a vector of length 3. The first element of x is 1, the second element of x is 7, and the third element of x is 4.

  • You can access elements of the vector x by using the [i] syntax.

  • For example, if you wanted to look at the second element of x you would use:

x[2]
## [1] 7

  • For vectors that contain numeric values, R has many built-in functions that can compute summary statistics about the numbers the vector.

  • For example, if we create the vector y that has values 1, 3, 10, 8,

y <- c(1, 3, 10, 8)

then we can easily compute the minimum, median, maximum, and standard deviation of this vector with the following R code:

min(y)     ## minimum of y
## [1] 1
median(y)  ## median of y
## [1] 5.5
max(y)     ## maximum of y
## [1] 10
sd(y)      ## standard deviation of y
## [1] 4.203173

  • As we saw in the nycflights example, when we extract a data variable from a data frame, R returns the data variable as a vector.

  • For example, if we extract the seats variable from the planes data frame and assign it to a variable named num_seats, then num_seats will be a numeric vector

num_seats <- planes$seats
  • The vector num_seats has 3322 elements in it:
length( num_seats )
## [1] 3322
  • The 10th element of num_seats is 182:
num_seats[10]
## [1] 182
  • The mean of the elements in num_seats is 154.3
mean(num_seats)
## [1] 154.3164

and the largest number inside the num_seats vector is 450:

450
## [1] 450

1.7 Writing Comments in R

  • The comment symbol in R is the hashmark symbol #.

  • Comments allow you to write notes in English (or any other human language) within your R programs.

  • Comments are basically pieces of text the computer will ignore when interpreting your code.

  • You can use comments to help explain what your code is doing.

  • Writing comments becomes more helpful as your code becomes more complex.

  • Writing comments can make code more readable for others.


  • In R, the hashmark symbol # marks the beginning of a comment.

  • Everything on a line following the hashmark symbol is ignored.

  • In the following example, both the text “This is an example of a comment” and the assignment x <- 64 are ignored

# This is an example of a comment

x <- 42

# x <- 64

x
## [1] 42
  • Note also that you can write comments on the same line as an R statement.

    • Everything to the right of the hashmark # symbol will be ignored.
# More 
# examples 
# of comments

x <- 42  ## x <- 24 

# x <- 64

x
## [1] 42

1.8 Exercises

  1. Compute the number
\[\begin{equation} \frac{\sqrt{1.43 + 5^{1.2}}}{3} \end{equation}\]

directly in the R console.

  1. Write an R script that assigns the value …
\[\begin{equation} \ln\Big( 1 + \exp(-2^{1.4}) \Big) + \ln\Big(1 + 2\exp(3^{1.7}) \Big) \end{equation}\]

to a variable named x and prints the result in the Console when you run the script.

  1. Which of the following is NOT a valid variable name in R?
    • .independent_variable3
    • _independent_variable3
    • independent_variable3
    • independent.variable3