R is a programming language and free software environment for statistical computing and graphics, supported by the R Foundation for Statistical Computing. The R language is widely used among statisticians and data analysts for developing statistical software and for data analysis. Together with Stata, R is the most commonly used programming language among academic economists today. One big advantage R has over Stata or SPSS is that it is open-source and requires no expensive license to use. Partly for this reason it is also widely used for statistics in industry, so you will continue to benefit from learning it after leaving university. Another advantage of R is its wealth of add-on packages, which provide all kinds of additional functionality, including support for machine learning, geographic information systems, scientific illustration, and data visualisation.
To install R on your personal computer, go to https://www.r-project.org/ and follow the instructions for your operating system. R is available for macOS, Windows, and UNIX. Follow the links to the mirror closest to your geographic location. If you're residing in the UK, I suggest that you choose a UK mirror, for example https://cran.ma.imperial.ac.uk/.
To program conveniently in R, you will need an IDE (Integrated Development Environment). The IDE you will use for R is RStudio. Once you have R installed on your machine, go to https://rstudio.com/products/rstudio/ and install the Desktop version. This is also free. RStudio comes with a workspace which makes it easier to manage your R code files and data, a debugger, and a customizable interface. It also comes with an interface to help you keep track of plots and packages. Upon first opening RStudio you should see something like the image below. Here I have an .R-file named 'intro', which is where I write my code. Once a piece of code is run, the output - e.g. printing 'Hello, World!' to the screen - is displayed in the console below. To the right, you see the environment window, where data and variables will be stored.
As current Durham University students, you can also access R and RStudio virtually via AppsAnywhere. Go to https://appsanywhere.durham.ac.uk/login and log in using your Durham ID and password.
Below is a very brief introduction to help you get started with the basics in R. However, there is an almost endless toolkit to learn to truly master the language, so I will link to some more detailed resources at the end. My advice to students is that whenever you face an exercise in your modules which asks you to compute something, work with data, or perform some statistical analysis, try to do it in R or your programming language of choice. By the time you graduate you will be much more comfortable with coding in general.
In R, as with most programming, we want to simplify computation by storing the values we use in variables that we can refer back to, rather than re-writing them every time. This also makes it easier to keep track of what the code does. To create the variable zero and assign it the value 0, we write:
zero <- 0 # Assign value 0 to variable 'zero'. Anything after a # on a line is not run by R; we use # to document our code!
zero
a <- b means that b is stored in a. Now whenever we write zero, R will remember that this means the value 0. We can also do arithmetic using variables:
one <- 1
two <- 2
one + two # 1+2=3
one / two # 1/2=0.5
two * one # 2*1=2
one ^ two # 1^2=1
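As a small sketch of why variables are useful, the result of a calculation can itself be stored and reused later. The names `three` and `ratio` below are made up for this illustration.

```r
one <- 1
two <- 2

three <- one + two       # store the result of an addition
ratio <- one / two       # store the result of a division

three * two              # reuse stored values: 3 * 2
(one + two) ^ two        # brackets are evaluated first: 3^2
one + two ^ two          # ^ is evaluated before +: 1 + 4

two <- 10                # reassigning overwrites the old value of 'two'
three                    # still 3: 'three' stored the value, not the formula
```

Note the last two lines: changing `two` afterwards does not change `three`, because R stored the number 3 at the time of assignment.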
Functions in R are much the same as in mathematics. They are pre-programmed pieces of code that take an input, perform a given operation, and return an output. Eventually you will learn to write your own custom functions, but we stick to functions already within base R for now. A function is recognized by its name followed by parentheses. The input goes within the parentheses. In RStudio, you can type ? followed by the function name in the console to get further information about the function:
?mean
mean(x, ...)
## Default S3 method:
mean(x, trim = 0, na.rm = FALSE, ...)
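To make the usage line above concrete, here is a minimal sketch of calling mean() with and without the na.rm argument, followed by a tiny custom function. The vector `x` and the function name `average_mark` are assumptions made for illustration; they are not part of base R.

```r
x <- c(2, 4, 6, NA)            # NA marks a missing value

mean(x)                        # returns NA because one value is missing
mean(x, na.rm = TRUE)          # drops the NA first: (2 + 4 + 6) / 3

# A hypothetical custom function (the name 'average_mark' is ours)
average_mark <- function(essay, exam) {
  mean(c(essay, exam))         # combine the inputs into one vector first
}

average_mark(70, 80)
```

The values to be averaged go into mean() as a single vector (the x argument). A second unnamed argument would be read as trim, not as another value to average.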
Data structures are really just what they sound like: they describe how data objects are structured. R terminology is similar to what you have encountered in algebra in this respect. One-dimensional data is a vector, two-dimensional data is a matrix, and data with more than two dimensions (D>2) is called an array. We can create data structures by combining existing data:
vector1 <- c(one, two)
vector2 <- c(two, zero)
length(vector1)
matrix <- cbind(vector1, vector2) # Vectors can be combined into matrices
matrix
vector1 | vector2 |
---|---|
1 | 2 |
2 | 0 |
The cbind() function combines the two vectors of length 2 into a two-by-two matrix. The horizontal sequences in a matrix are called rows, and the vertical sequences are called columns. Hence, cbind() is just short for column-bind: each vector becomes a column. Similarly, there is rbind(), which stacks the vectors on top of each other so that each one becomes a row. We can create data structures by combining existing data like this, but we can also create new, empty data structures and fill them later:
array <- array(dim = c(nrow(matrix), ncol(matrix), length(vector1))) # Creates a 3D 2x2x2 array (two 2x2 matrices) filled with NA
array
array[ , , 1] <- matrix
array[ , , 2] <- t(matrix)
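We populate the first 2x2 in array with the 2x2 matrix, and the second 2x2 with the transpose of the matrix. Because our matrix happens to be symmetric, its transpose is identical to it, which is why both slices below show the same values.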
array[, , 1]
1 | 2 |
2 | 0 |
array[, , 2]
1 | 2 |
2 | 0 |
You will see that the array is simply storing the values of each matrix one after the other. As such, arrays are the primary method to store higher-dimensional data. The bracket notation also provides a simple way to quickly retrieve data from a data object. For example, if I want to see what value is stored in the second row of the first column in matrix:
matrix[2 , 1] # the value before the comma within the bracket specifies the row number, and the value after the comma the column number
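The binding and indexing ideas above can be sketched together as follows. The object names `m_cols`, `m_rows` and `arr` are chosen for this illustration so that they do not overwrite the built-in matrix() and array() functions.

```r
vector1 <- c(1, 2)
vector2 <- c(2, 0)

m_cols <- cbind(vector1, vector2)   # each vector becomes a column
m_rows <- rbind(vector1, vector2)   # each vector becomes a row

m_cols[2, 1]      # second row, first column
m_cols[ , 2]      # leaving the row blank selects the whole second column
m_rows[1, ]       # leaving the column blank selects the whole first row

arr <- array(NA, dim = c(2, 2, 2))  # 2x2x2 array filled with NA
arr[ , , 1] <- m_cols
arr[ , , 2] <- t(m_cols)
arr[1, 2, 2]      # row 1, column 2 of the second 2x2 slice
dim(arr)          # the size of each dimension
```

An array simply needs one index per dimension, so a three-dimensional array takes three values separated by commas.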
Often we want to perform a given set of operations on multiple variables. In such cases it is not convenient to rewrite the code in every instance. Here is where for-loops come in handy. They allow us to loop an operation over several variables or elements in a data structure. If-statements allow us to specify conditions to the operations in the loop. For example, we may want to print every positive value in a vector. We also print a warning whenever the value is negative.
numbers <- c(10, -5, -2, 3, 1, 1, -0.5, 5)
for (i in 1:length(numbers)) {
  if (numbers[i] > 0) {
    print(numbers[i])
  } else {
    print('NEGATIVE!')
  }
}
[1] 10
[1] "NEGATIVE!"
[1] "NEGATIVE!"
[1] 3
[1] 1
[1] 1
[1] "NEGATIVE!"
[1] 5
The condition is written inside the parentheses, followed by the conditional operation inside curly brackets. Conditions are expressed using so-called Boolean operators, which compare values and evaluate to either TRUE or FALSE:
data.frame('Boolean operator' = c('<', '>', '<=', '>=', '==', '|', '&'),
'meaning' = c('less than', 'greater than', 'less than or equal to', 'greater than or equal to', 'equal to', 'or', 'and'))
Boolean.operator | meaning |
---|---|
<chr> | <chr> |
< | less than |
> | greater than |
<= | less than or equal to |
>= | greater than or equal to |
== | equal to |
| | or |
& | and |
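A short sketch of how the operators in the table behave in practice. The value of `x` is arbitrary, and the `numbers` vector is the same one used in the loop above.

```r
x <- 3

x > 0                 # TRUE
x == 3                # TRUE: note the double equals sign for comparison
x > 0 & x < 2         # FALSE: both conditions must hold
x < 0 | x >= 3        # TRUE: at least one condition holds

# Comparisons are applied element by element to a whole vector
numbers <- c(10, -5, -2, 3, 1, 1, -0.5, 5)
numbers > 0
numbers[numbers > 0]  # keep only the positive values, no loop needed
```

The last line uses a vector of TRUE/FALSE values inside the square brackets, which is often a compact alternative to a for-loop with an if-statement.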
Any value within any kind of data structure will have a class. It is important to know about classes because not all operations within R can be performed on all classes. Three classes that you will encounter frequently are 'numeric', 'character' and 'Date'. You can get the class of an object by giving it as the input to the function class():
class(matrix[1,1])
The value in the first row of the first column of matrix is numeric. It's 1. However, if we want, we can combine different classes in the same data structure. The most common way to do so is with the data frame. This is basically a table. Below we set up a table containing the names of students (curiously named after famous British economists) and the module they take:
students <- data.frame(STUDENT = c('J. Robinson', 'R. Coase', 'J.M. Keynes', 'A.C. Pigou'),
MODULE = c('Monetary Economics', 'Law and Economics', 'Macroeconomics', 'Welfare Economics'))
students # table of student name and module name
STUDENT | MODULE |
---|---|
<chr> | <chr> |
J. Robinson | Monetary Economics |
R. Coase | Law and Economics |
J.M. Keynes | Macroeconomics |
A.C. Pigou | Welfare Economics |
We also have a table of the marks awarded for each module:
marks <- data.frame(MODULE = c('Monetary Economics', 'Law and Economics', 'Macroeconomics', 'Welfare Economics'),
ESSAY_MARK = c(65, 70, 75, 59),
EXAM_MARK = c(60, 80, 67, 70))
marks
class(marks$MODULE)
class(marks$ESSAY_MARK)
class(marks$EXAM_MARK)
MODULE | ESSAY_MARK | EXAM_MARK |
---|---|---|
<chr> | <dbl> | <dbl> |
Monetary Economics | 65 | 60 |
Law and Economics | 70 | 80 |
Macroeconomics | 75 | 67 |
Welfare Economics | 59 | 70 |
We see that the MODULE variable has class 'character', while the mark variables have class 'numeric'. You can, for example, perform arithmetic operations on numeric data, but not on character data.
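As a minimal sketch of why classes matter, the lines below check the class of a few values and show which operations are allowed. The date used here is hypothetical and only serves as an example of the 'Date' class.

```r
class(65)                            # "numeric"
class('Macroeconomics')              # "character"
exam_date <- as.Date('2024-05-20')   # hypothetical date, for illustration only
class(exam_date)                     # "Date"

65 + 5                               # arithmetic works on numeric values
exam_date + 7                        # Dates support arithmetic too: one week later

# '65' + 5 would stop with an error, because '65' is text
as.numeric('65') + 5                 # convert the text to a number first
```

Functions such as as.numeric(), as.character() and as.Date() convert between classes when the content allows it.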
An important function for working with data frames is the merge() function. This allows us to combine multiple tables by matching rows on a variable that the tables have in common. Let's merge the two tables 'students' and 'marks' to find out the mark earned by each student.
student.grades <- merge(students, marks, by = 'MODULE')
student.grades
MODULE | STUDENT | ESSAY_MARK | EXAM_MARK |
---|---|---|---|
<chr> | <chr> | <dbl> | <dbl> |
Law and Economics | R. Coase | 70 | 80 |
Macroeconomics | J.M. Keynes | 75 | 67 |
Monetary Economics | J. Robinson | 65 | 60 |
Welfare Economics | A.C. Pigou | 59 | 70 |
by = 'MODULE' means that we merge based on the common variable 'MODULE' in the two tables. If we know what the mark was in each module, and which student took each module, we also know which mark each student got. Merge is a very valuable tool for working with multiple datasets. As economists, we often need to combine data from multiple sources, and in these cases we frequently combine data based on common identifying variables.
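In real data the identifying variable does not always match perfectly across tables. The sketch below shows how merge() treats unmatched rows by default and with all = TRUE. The student 'N. Kaldor' and the module 'Growth Theory' are hypothetical additions made only to create a row without a match.

```r
students <- data.frame(STUDENT = c('J. Robinson', 'R. Coase', 'N. Kaldor'),
                       MODULE  = c('Monetary Economics', 'Law and Economics', 'Growth Theory'))
marks <- data.frame(MODULE    = c('Monetary Economics', 'Law and Economics'),
                    EXAM_MARK = c(60, 80))

# Default: keep only modules that appear in both tables
inner <- merge(students, marks, by = 'MODULE')

# all = TRUE: keep every row from both tables; missing marks become NA
outer <- merge(students, marks, by = 'MODULE', all = TRUE)

nrow(inner)
outer
```

Checking the number of rows before and after a merge is a simple way to spot observations that were silently dropped.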
Now we can use a for-loop and the mean function to get the average mark of each student. We can also use an if-statement to show that a student achieved a 1st Class if they got an average mark of 70 or above:
student.grades$FINAL_MARK <- NA_real_ # Create an empty numeric column to fill in the loop
for (i in 1:nrow(student.grades)) {
  # mean() averages the elements of one vector, so we combine the two marks with c() first
  student.grades$FINAL_MARK[i] <- mean(c(student.grades$ESSAY_MARK[i], student.grades$EXAM_MARK[i]))
  if (student.grades$FINAL_MARK[i] >= 70) {
    print(paste(student.grades$STUDENT[i], 'earned a 1st!'))
  }
}
student.grades
[1] "R. Coase earned a 1st!"
[1] "J.M. Keynes earned a 1st!"
MODULE | STUDENT | ESSAY_MARK | EXAM_MARK | FINAL_MARK |
---|---|---|---|---|
<chr> | <chr> | <dbl> | <dbl> | <dbl> |
Law and Economics | R. Coase | 70 | 80 | 75.0 |
Macroeconomics | J.M. Keynes | 75 | 67 | 71.0 |
Monetary Economics | J. Robinson | 65 | 60 | 62.5 |
Welfare Economics | A.C. Pigou | 59 | 70 | 64.5 |
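The same calculation can also be written without a loop. This is a sketch of one alternative using the base R functions rowMeans() and ifelse(); the column name `CLASS` and the label 'below 1st' are assumptions made for illustration.

```r
student.grades <- data.frame(
  STUDENT    = c('R. Coase', 'J.M. Keynes', 'J. Robinson', 'A.C. Pigou'),
  ESSAY_MARK = c(70, 75, 65, 59),
  EXAM_MARK  = c(80, 67, 60, 70)
)

# rowMeans() averages across the chosen columns, one row at a time
student.grades$FINAL_MARK <- rowMeans(student.grades[ , c('ESSAY_MARK', 'EXAM_MARK')])

# ifelse() applies the condition to every row at once
student.grades$CLASS <- ifelse(student.grades$FINAL_MARK >= 70, '1st', 'below 1st')

student.grades
```

Loops are often easier to read when you are starting out, while vectorised functions like these tend to be shorter and faster on large datasets.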
If we want to visualize our data, R offers functionality for that too. To make pretty plots, we want to install and load the package 'ggplot2' which is used to produce publication-quality graphs and figures.
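# install.packages('ggplot2') # Run this line once if the package is not yet installed
library(ggplot2) # Load the package in every new R session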
ggplot(data = student.grades, aes(y = FINAL_MARK, x = MODULE, fill = STUDENT)) +
geom_bar(stat='identity') +
ylim(0,100) +
theme_bw()
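As a further sketch, the essay and exam marks can be shown side by side for each student. ggplot2 generally expects one row per bar, so the data is first stacked into 'long' format using base R. The object name `long`, the column names `ASSESSMENT` and `MARK`, and the axis labels are assumptions made for this example.

```r
student.grades <- data.frame(
  STUDENT    = c('R. Coase', 'J.M. Keynes', 'J. Robinson', 'A.C. Pigou'),
  ESSAY_MARK = c(70, 75, 65, 59),
  EXAM_MARK  = c(80, 67, 60, 70)
)

# Stack the two mark columns: one row per student and assessment
long <- data.frame(
  STUDENT    = rep(student.grades$STUDENT, times = 2),
  ASSESSMENT = rep(c('Essay', 'Exam'), each = nrow(student.grades)),
  MARK       = c(student.grades$ESSAY_MARK, student.grades$EXAM_MARK)
)

# Only draw the plot if ggplot2 is installed
if (requireNamespace('ggplot2', quietly = TRUE)) {
  library(ggplot2)
  p <- ggplot(data = long, aes(x = STUDENT, y = MARK, fill = ASSESSMENT)) +
    geom_col(position = 'dodge') +   # side-by-side bars instead of stacked
    ylim(0, 100) +
    labs(x = 'Student', y = 'Mark', fill = 'Assessment') +
    theme_bw()
  print(p)
}
```

geom_col() draws bars whose heights are taken directly from the data, which is the same behaviour as geom_bar(stat = 'identity').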
This document was meant to provide the basics to equip you to do simple exercises like the Economics of Sustainability seminar questions in R. However, there is much more to learn. Below I list some good resources:
A highly recommended textbook introducing statistical analysis and machine learning with R. Also comes in a more advanced version, 'The Elements of Statistical Learning'.
A free manual to R from the core R development team. A much more in-depth version of this document.
A practical guide to ggplot - learn to make highly AESTHETIC figures and graphs!