Our work today

This code was prepared for an intro to R for the undergraduate seminar Tweeting Political Crisis, offered by Ernesto Calvo at the University of Maryland, College Park.

This tutorial covers some basic steps for learning R. The goal is to teach students how to understand basic programming in R, how to navigate the environment, and how and where to ask for help and learn more in the future.

This presentation was built using Rmarkdown, but I strongly suggest that students work directly in the R code attached to this presentation. All the materials can be downloaded here.

This code has been adapted from previous materials by Eric Dunford, Natalia Bueno, and Rochelle Terman.

Who am I and how did I learn R?

My name is Tiago Ventura. I am a Ph.D. student in Government and Politics at the University of Maryland, College Park. I have been working in R for the last 5 years, since I started my Master’s degree in Brazil. You can find my website here, and feel free to send me emails if you need help with R. You can also find me at Chincoteague 4118.

I learned R through a few different sources. These are some of them:

Attending a lot of workshops and seminars about R.
Taking a lot of online courses, in particular on Datacamp.
Reading Hadley Wickham’s books, here and here.

What’s R?

R is a versatile, open source programming/scripting language that’s useful for both statistics and data science. It was inspired by the programming language S.

Open source software under GPL.
Comparable, if not superior, to commercial alternatives. As of January 2019, R ranks 12th in the TIOBE index, which measures the popularity of programming languages. It’s widely used in both academia and industry, especially among data scientists.
Available on all platforms (Unix, Windows, Linux).
As a result, if you do your analysis in R, anyone can easily replicate it.
Not just for statistics, but also general purpose programming.
Is object oriented (= R has objects) and functional (= You can write functions).
Large and growing community of peers.

Golden Rules of R

Everything that exists is an object. (object oriented)
Everything that happens is a function call. (functional)
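
To make the two rules concrete, here is a minimal sketch (the object name x is just an illustration). It checks the class of a number and of a function, and shows that even the + operator can be called like any other function.

```r
# Everything that exists is an object: numbers and even functions have a class.
x <- 3
class(x)      # "numeric"
class(mean)   # "function"

# Everything that happens is a function call: + is itself a function.
2 + 2
`+`(2, 2)     # same result as 2 + 2
```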

RStudio

RStudio is the premier R graphical user interface (GUI) and integrated development environment (IDE) that makes R easier to use.

Tools –> Global Options

Before we begin, let’s set a few RStudio settings to improve your experience.

Click “Tools –> Global Options –> Appearance” to change your color and font settings.
Click “Tools –> Global Options –> Code” and check the box that says “Soft-wrap R source files” to wrap the text in your script to the width of the script pane.
Click “Tools –> Global Options –> Code –> Display” and check the boxes that say “Highlight selected line” and “Highlight R function calls”.

The basics: Navigating RStudio

Open RStudio! Then, open a new script by clicking “File –> New File –> R Script” or by pressing Ctrl + Shift + N (PC) or command + shift + N (Mac). After opening a new script, you should see four window “panes”.

Top left pane (input/script)

Enter code in this savable script file in the top left pane. This is a plain text file but with a .R extension. Enter 2 + 2 in your script and run a line of code by pressing command + enter (Mac) or Ctrl + enter (PC). Or, click the “Run” button at the top of the script.

A hashtag # tells R that you do not want that particular line or block of code to be run. This is called commenting your code. It is handy for making notes to yourself, and you can even add hashtags after lines of runnable code, on the same line.

The name of your script file is in the tab at the top of your script window - the name defaults to Untitled1. Be sure to save your script by clicking “File –> Save” or command + s (Mac) or Ctrl + s (PC). You can also click the floppy disk icon to save.

Bottom left pane (output/console)

Code output is displayed in the console in the bottom left pane. This space is also good for just noodling around and trying out code that you do not wish to save in your script.

In the console, the prompt > looks like a greater than symbol. If your prompt begins to look like a + symbol by mistake, simply click in your console and press the esc key on your keyboard as many times as necessary to return to the prompt.

R uses + when code is broken up across multiple lines and R is still expecting more code. A line of code does not usually end until R finds an appropriate stop parameter or punctuation that completes some code such as a closed round parenthesis ), square bracket ], curly brace }, or quotation mark '.

If the output in your console gets too messy, you can clear it by pressing control + l on both Mac and PC. This will not erase any saved data - it will simply make your console easier to read.

Top right pane (global environment)

Data are saved in R’s memory as “variables”. Variables are simply placeholders for a value, mathematical expression, word, function, or dataset! The global “Environment” tab in the upper right pane displays the variables you have assigned/saved. “Global” simply means that these variables are available for any task.

Bottom right pane (files, plots, packages, and help)

Here you find useful tabs for navigating your file system, displaying plots, installing packages, and viewing help pages. Press the control key and a number (1 through 9) on your keyboard to shortcut between these panes and tabs.

Installing a package in R

A number of packages are supplied with the R distribution. These are known as “base packages”, and they are loaded in the background the second one starts a session in R.

Packages are collections of R functions, data, and compiled code in a well-defined format.

# Install the package

install.packages("ggplot2")
install.packages("tidyverse")

# Activate the package
library("ggplot2")
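
Installing a package every time you run a script is slow, so it can help to check first whether the package is already there. The sketch below assumes a small helper, is_installed(), which is my own name and not part of R or of the original materials. It relies on the base function requireNamespace().

```r
# Hypothetical helper (not part of base R): TRUE if a package is installed.
is_installed <- function(pkg) {
  isTRUE(requireNamespace(pkg, quietly = TRUE))
}

is_installed("stats")   # a package that ships with R

# Only suggest installing ggplot2 when it is missing.
if (!is_installed("ggplot2")) {
  message("ggplot2 is missing: run install.packages(\"ggplot2\")")
}
```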

Asking for help

? + object opens a help page for that specific object
?? + object searches help pages containing the name of the object

?mean
??mean
help(mean)

# ?mean and help(mean) open the same help page; ??mean searches all help pages for "mean".

example(ls) # provides example for how to use ls 

help.search("visualization") # search functions and packages that have "visualization" in their descriptions

R Basics

Assigning an Object

In simple terms, an object is a bit of text that represents a specific value. Variable names can only contain letters, numbers, the underscore character, and (unlike Python) the period character. Whereas an object name like myobject.thing would point to the subclass or method thing of myobject in Python, R treats myobject.thing as its own entity.
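
A short sketch of the naming rules (the object names are made up for illustration). The base function make.names() shows how R would repair a name that is not valid.

```r
my.object <- 10    # the period is simply part of the name
my_object <- 20    # a different object with a similar name
my.object + my_object

# make.names() returns a valid version of any name
make.names("my.object")   # already valid, returned unchanged
make.names("2nd_try")     # names cannot start with a number
```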

# Numeric object
x <- 3

# String
my_name <- "Tiago"

# Where are the objects? 
ls()

[1] "my_name" "x"

# Remove an object
rm(x)

# Check again
ls()

[1] "my_name"

# Objects are flexible: you can rewrite them

my_name <- "Tiago Da Silva Ventura"

my_name

[1] "Tiago Da Silva Ventura"

Class of the objects

x <- 3

# Class
class(x)

[1] "numeric"

class(my_name)

[1] "character"

Remove objects

rm(my_name)

Object Coercion

When need be, an object can be coerced to be a different class.

x

[1] 3

# convert to a character

as.character(x)

[1] "3"

Here we converted the value of x, the number 3, into a character: the result is a string with the text “3”. Note that x itself stays numeric unless we assign the result back with x <- as.character(x).
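
As a small sketch of how coercion behaves (y is a made-up object name), the code below stores the coerced value in a new object and then tries to go back from text to numbers. Text that does not look like a number becomes NA.

```r
x <- 3
y <- as.character(x)   # store the coerced copy in a new object
class(x)               # x is still "numeric"
class(y)               # "character"

# Coercing text to numbers only works when the text looks like a number
as.numeric("3")
suppressWarnings(as.numeric("three"))   # NA: R cannot read this as a number
```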

Data Structures

There are also many ways data can be organized in R.

The same object can be organized in different ways depending on the needs of the user. Some commonly used data structures include:

vector
matrix
data.frame
list
array

Data Structures: Vector

# vector of numbers
X <- c(1, 2.3, 4, 5, 6.78, 6:10)
X

 [1]  1.00  2.30  4.00  5.00  6.78  6.00  7.00  8.00  9.00 10.00

# Class
class(X)

[1] "numeric"

# Size
length(X)

[1] 10
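
Once a vector exists, square brackets pull out its elements. This is a minimal sketch using the same X as above. Note that R starts counting at 1, and that arithmetic is applied to every element at once.

```r
X <- c(1, 2.3, 4, 5, 6.78, 6:10)

X[1]        # first element
X[2:3]      # second and third elements
X[X > 6]    # only the elements greater than 6
X * 2       # arithmetic is applied element by element
```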

Data Structures: Data Frame

The most useful type of data for data analysis. It is like a spreadsheet in your R environment.

# Coercing
as.data.frame(X)

# Create a data frame

data <- data.frame(name="Tiago", last_name="ventura", school="UMD", age=30)

data

   name last_name school age
1 Tiago   ventura    UMD  30

Data Structures: Matrix

Similar to a data frame, but all columns must have the same data type.

# Coerce to a matrix
as.matrix(X)

       [,1]
 [1,]  1.00
 [2,]  2.30
 [3,]  4.00
 [4,]  5.00
 [5,]  6.78
 [6,]  6.00
 [7,]  7.00
 [8,]  8.00
 [9,]  9.00
[10,] 10.00
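
A matrix can also be built directly with matrix(). The sketch below (m and mixed are made-up names) shows how to index by row and column, and what happens when numbers and text are combined: everything is coerced to a single type.

```r
# A 2 x 3 matrix, filled column by column
m <- matrix(1:6, nrow = 2, ncol = 3)
m
dim(m)      # rows, then columns
m[2, 3]     # row 2, column 3

# Mixing numbers and text forces every cell to be character
mixed <- cbind(c(1, 2), c("a", "b"))
class(mixed[1, 1])
```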

Data Structures: List

Lists are extremely useful for more advanced applications. A list works as a repository of multiple objects. It is like a big drawer where you can save your mess.

# coerce to a list

as.list(X)

[[1]]
[1] 1

[[2]]
[1] 2.3

[[3]]
[1] 4

[[4]]
[1] 5

[[5]]
[1] 6.78

[[6]]
[1] 6

[[7]]
[1] 7

[[8]]
[1] 8

[[9]]
[1] 9

[[10]]
[1] 10

# or build a list directly (named my_list to avoid masking the list() function)

my_list <- list(X, data)

# See the list

str(my_list)

List of 2
 $ : num [1:10] 1 2.3 4 5 6.78 6 7 8 9 10
 $ :'data.frame':   1 obs. of  4 variables:
  ..$ name     : Factor w/ 1 level "Tiago": 1
  ..$ last_name: Factor w/ 1 level "ventura": 1
  ..$ school   : Factor w/ 1 level "UMD": 1
  ..$ age      : num 30
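
To get things back out of the drawer, use double brackets or, when the elements have names, the $ operator. This is a sketch with a small made-up list; the element names numbers and info are my own.

```r
my_list <- list(numbers = c(1, 2.3, 4),
                info = data.frame(name = "Tiago", age = 30))

names(my_list)
my_list[[1]]          # the first element itself (a numeric vector)
my_list$numbers       # the same element, accessed by name
my_list["numbers"]    # single brackets return a list of length one
my_list$info$age      # reach inside the data frame stored in the list
```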

Data Structures: Accessing Data

One must understand the structure of an object in order to systematically access the material contained within it.

# open a saved dataframe

data <- mtcars

# Class

class(data)

[1] "data.frame"

# Structure
str(data)

'data.frame':   32 obs. of  11 variables:
 $ mpg : num  21 21 22.8 21.4 18.7 18.1 14.3 24.4 22.8 19.2 ...
 $ cyl : num  6 6 4 6 8 6 8 4 4 6 ...
 $ disp: num  160 160 108 258 360 ...
 $ hp  : num  110 110 93 110 175 105 245 62 95 123 ...
 $ drat: num  3.9 3.9 3.85 3.08 3.15 2.76 3.21 3.69 3.92 3.92 ...
 $ wt  : num  2.62 2.88 2.32 3.21 3.44 ...
 $ qsec: num  16.5 17 18.6 19.4 17 ...
 $ vs  : num  0 0 1 1 0 1 0 1 1 1 ...
 $ am  : num  1 1 1 0 0 0 0 0 0 0 ...
 $ gear: num  4 4 4 3 3 3 3 4 4 4 ...
 $ carb: num  4 4 1 1 2 1 4 2 2 4 ...

# Accessing the columns
data[,1]

 [1] 21.0 21.0 22.8 21.4 18.7 18.1 14.3 24.4 22.8 19.2 17.8 16.4 17.3 15.2
[15] 10.4 10.4 14.7 32.4 30.4 33.9 21.5 15.5 15.2 13.3 19.2 27.3 26.0 30.4
[29] 15.8 19.7 15.0 21.4

# another way 
data$mpg

 [1] 21.0 21.0 22.8 21.4 18.7 18.1 14.3 24.4 22.8 19.2 17.8 16.4 17.3 15.2
[15] 10.4 10.4 14.7 32.4 30.4 33.9 21.5 15.5 15.2 13.3 19.2 27.3 26.0 30.4
[29] 15.8 19.7 15.0 21.4

# Or the rows
data[1,]

          mpg cyl disp  hp drat   wt  qsec vs am gear carb
Mazda RX4  21   6  160 110  3.9 2.62 16.46  0  1    4    4

More ways to get information about your data frame

nrow(data)

[1] 32

ncol(data)

[1] 11

head(data)

                   mpg cyl disp  hp drat    wt  qsec vs am gear carb
Mazda RX4         21.0   6  160 110 3.90 2.620 16.46  0  1    4    4
Mazda RX4 Wag     21.0   6  160 110 3.90 2.875 17.02  0  1    4    4
Datsun 710        22.8   4  108  93 3.85 2.320 18.61  1  1    4    1
Hornet 4 Drive    21.4   6  258 110 3.08 3.215 19.44  1  0    3    1
Hornet Sportabout 18.7   8  360 175 3.15 3.440 17.02  0  0    3    2
Valiant           18.1   6  225 105 2.76 3.460 20.22  1  0    3    1

tail(data)

                mpg cyl  disp  hp drat    wt qsec vs am gear carb
Porsche 914-2  26.0   4 120.3  91 4.43 2.140 16.7  0  1    5    2
Lotus Europa   30.4   4  95.1 113 3.77 1.513 16.9  1  1    5    2
Ford Pantera L 15.8   8 351.0 264 4.22 3.170 14.5  0  1    5    4
Ferrari Dino   19.7   6 145.0 175 3.62 2.770 15.5  0  1    5    6
Maserati Bora  15.0   8 301.0 335 3.54 3.570 14.6  0  1    5    8
Volvo 142E     21.4   4 121.0 109 4.11 2.780 18.6  1  1    4    2

dim(data)

[1] 32 11

Data Management

But, first, where is my data exactly?

In your working directory.

R doesn’t intuitively know where your data is. If the data is in a special folder entitled “super secret research”, we have to tell R how to get there.

We can do this two ways:

Set the working directory to that folder
Establish the path to that folder

Every time R boots up, it does so in the same place, unless we tell it to go somewhere else.

getwd() # Get the current working directory

[1] "C:/Users/venturat/Dropbox/Workshops/UFPA_Intro_to_data_science_in_R/crash_course_GVPT"

Setting a new working directory

setwd("C:/Users/Tiago Ventura/Dropbox/Workshops/UFPA_Intro_to_data_science_in_R/crash_course_GVPT")
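
The second option, establishing the path to the folder, can be sketched with file.path(), which glues folder and file names together with the right separator. The folder and file names below are only illustrations, and the objects data_dir and data_file are made up.

```r
# Hypothetical folder and file names, used only for illustration
data_dir  <- file.path(getwd(), "super secret research")
data_file <- file.path(data_dir, "results.csv")

data_file                # the full path as a single string
file.exists(data_file)   # check that the file is there before reading it
```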

Importing data

For basic programming tasks, you will mostly import data from .csv or Excel-type files. If you download data directly from Twitter using their API, the data comes in JSON format, and processing it is a little trickier. Professor Calvo will do that for you, so you will most likely deal with .csv files.

Download this data here, and save it in any folder on your computer.

library(tidyverse)

# Base R
data = read.csv(file = "results.csv",
                stringsAsFactors = FALSE)

# Alternative from the tidyverse (readr package)
data = read_csv("results.csv")

These functions take arguments that control how the file is read. In read.csv(), stringsAsFactors = FALSE means that we don’t want character vectors in the data.frame to be converted to factors. header = TRUE (the default) means the first row of the data contains the column names, and sep = "," (the default) means that entries are separated by commas.

Exporting data

Exporting data is the same process in reverse. Packages such as foreign, readstata13, and XLConnect export to other formats (e.g. Stata or Excel); for a .csv file, base R is enough:

# .csv
write.csv(data,file="data.csv",row.names = FALSE)
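
To see importing and exporting work together without touching your own files, here is a small round trip. It is only a sketch: the toy data frame is made up, and the file is written to a temporary location created by tempfile().

```r
# A made-up data frame, written to a temporary file and read back
toy <- data.frame(team = c("Brazil", "Argentina"),
                  goals = c(3, 1),
                  stringsAsFactors = FALSE)

tmp <- tempfile(fileext = ".csv")
write.csv(toy, file = tmp, row.names = FALSE)

back <- read.csv(file = tmp, header = TRUE, sep = ",",
                 stringsAsFactors = FALSE)
str(back)
```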

Descriptive Statistics

Now that we can get data into R, we want to explore and summarize what’s going on.

summary() lets you quickly summarize the distributions across a set of variables.

summary(data)

      date             home_team          away_team        
 Min.   :1872-11-30   Length:39669       Length:39669      
 1st Qu.:1977-02-02   Class :character   Class :character  
 Median :1996-10-06   Mode  :character   Mode  :character  
 Mean   :1989-10-17                                        
 3rd Qu.:2008-01-22                                        
 Max.   :2018-07-10                                        
   home_score       away_score      tournament            city          
 Min.   : 0.000   Min.   : 0.000   Length:39669       Length:39669      
 1st Qu.: 1.000   1st Qu.: 0.000   Class :character   Class :character  
 Median : 1.000   Median : 1.000   Mode  :character   Mode  :character  
 Mean   : 1.748   Mean   : 1.188                                        
 3rd Qu.: 2.000   3rd Qu.: 2.000                                        
 Max.   :31.000   Max.   :21.000                                        
   country           neutral       
 Length:39669       Mode :logical  
 Class :character   FALSE:29848    
 Mode  :character   TRUE :9821

There is a wealth of useful summary operators built into R.

mean()
sd()
var()
range()
min()
max()
median()
quantile()
fivenum()
colMeans()
rowMeans()
table()

…to name a few!
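
A few of these operators in action, sketched on the built-in mtcars data so the code runs without downloading anything:

```r
mean(mtcars$mpg)                                   # average miles per gallon
sd(mtcars$mpg)                                     # standard deviation
range(mtcars$mpg)                                  # minimum and maximum
quantile(mtcars$mpg, probs = c(0.25, 0.5, 0.75))   # quartiles
table(mtcars$cyl)                                  # how many cars per cylinder count
colMeans(mtcars[, c("mpg", "hp")])                 # mean of several columns at once
```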

Base Graphics

A rather flexible graphing language comes built into R. Though there are more powerful and easy to use graphical packages out there (e.g. ggplot2 and lattice), the base plotting functions offer a lot of functionality. The benefit of these functions is that they are easy to manipulate and use.

histograms: hist()
scatter plots: plot()
barplot: barplot()
pie chart: pie()
density plot: plot(density())

Histogram

hist(data$home_score)

hist(data$home_score,breaks=30,
     col="steelblue",border="white",
     ylab = "frequency",
     xlab = "Home score",
     main = "Cool Histogram")
# Add a dashed red line at the mean
abline(v=mean(data$home_score),
       lty=2,col="red",lwd=5)

Base Graphics: Scatter Plots

plot(data$home_score,data$away_score)

Base Graphics: Scatter Plots

plot(data$home_score,data$away_score,
     pch=20,cex=4,col="#E58526",
     xlab="Home score",
     ylab="Away score")
# Add a smoothed (lowess) trend line
l = lowess(data$home_score,data$away_score)
lines(l,col="#E52643",lwd=7)

Base Graphics: Density Plots

dens = density(data$home_score)
plot(dens)

Basic Data Manipulations

Here we are going to use the same dataset we opened before, with the data about soccer matches over time.

data = read.csv(file = "results.csv",
                stringsAsFactors = FALSE)

str(data)

'data.frame':   39669 obs. of  9 variables:
 $ date      : chr  "1872-11-30" "1873-03-08" "1874-03-07" "1875-03-06" ...
 $ home_team : chr  "Scotland" "England" "Scotland" "England" ...
 $ away_team : chr  "England" "Scotland" "England" "Scotland" ...
 $ home_score: int  0 4 2 2 3 4 1 0 7 9 ...
 $ away_score: int  0 2 1 2 0 0 3 2 2 0 ...
 $ tournament: chr  "Friendly" "Friendly" "Friendly" "Friendly" ...
 $ city      : chr  "Glasgow" "London" "Glasgow" "London" ...
 $ country   : chr  "Scotland" "England" "Scotland" "England" ...
 $ neutral   : logi  FALSE FALSE FALSE FALSE FALSE FALSE ...

Creating Variables

Recall that we can access a variable’s contents using the $ operator. We can use the same logic to create a new variable.

data$sum_of_gols <- data$home_score + data$away_score
head(data)

        date home_team away_team home_score away_score tournament    city
1 1872-11-30  Scotland   England          0          0   Friendly Glasgow
2 1873-03-08   England  Scotland          4          2   Friendly  London
3 1874-03-07  Scotland   England          2          1   Friendly Glasgow
4 1875-03-06   England  Scotland          2          2   Friendly  London
5 1876-03-04  Scotland   England          3          0   Friendly Glasgow
6 1876-03-25  Scotland     Wales          4          0   Friendly Glasgow
   country neutral sum_of_gols
1 Scotland   FALSE           0
2  England   FALSE           6
3 Scotland   FALSE           3
4  England   FALSE           4
5 Scotland   FALSE           3
6 Scotland   FALSE           4

We can also use other aspects of a data frame’s structure to the same end.

data[,'local'] <- paste(data$city, data$country)  # New column "local": city and country pasted together

head(data) # Check the new column

        date home_team away_team home_score away_score tournament    city
1 1872-11-30  Scotland   England          0          0   Friendly Glasgow
2 1873-03-08   England  Scotland          4          2   Friendly  London
3 1874-03-07  Scotland   England          2          1   Friendly Glasgow
4 1875-03-06   England  Scotland          2          2   Friendly  London
5 1876-03-04  Scotland   England          3          0   Friendly Glasgow
6 1876-03-25  Scotland     Wales          4          0   Friendly Glasgow
   country neutral sum_of_gols            local
1 Scotland   FALSE           0 Glasgow Scotland
2  England   FALSE           6   London England
3 Scotland   FALSE           3 Glasgow Scotland
4  England   FALSE           4   London England
5 Scotland   FALSE           3 Glasgow Scotland
6 Scotland   FALSE           4 Glasgow Scotland

The creation of any variable follows this same logic as long as the vector being inserted is of the correct length.

nrow(data)

[1] 39669

# This does not work: the vector has 20 elements, but the data frame has 39669 rows
#data[,"New_Variable"] <- 1:20

data[,"id"] <- 1:nrow(data) # Works! One value per row
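
A quick sketch of the length rule on a made-up five-row data frame. tryCatch() is used here only so that the failing line does not stop the script; the names toy and res are my own.

```r
toy <- data.frame(a = 1:5)

# A vector of the wrong length is rejected with an error
res <- tryCatch({
  toy[, "bad"] <- 1:3
  "worked"
}, error = function(e) "error")
res

# A vector with one value per row works
toy[, "id"] <- 1:nrow(toy)

# A single value also works: it is repeated for every row
toy[, "one"] <- 1
toy
```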

Ordinal Variables (`ifelse()` conditionals)

Often we need to chop up a distribution into an ordered variable. This is straightforward when using the ifelse() conditional statement. Essentially, we are saying: if the variable meets this criterion, code it as this; else code it as that.

For an example, let’s turn the match scores into a dichotomous indicator.

# Note: because of >=, draws are also coded as "home victory"
data$home_vic <- ifelse(data$home_score>=data$away_score,"home victory","away victory")
data[,c("home_score","away_score", "home_vic")]

   home_score away_score     home_vic
1           0          0 home victory
2           4          2 home victory
3           2          1 home victory
4           2          2 home victory
5           3          0 home victory
6           4          0 home victory
7           1          3 away victory
8           0          2 away victory
9           7          2 home victory
10          9          0 home victory
11          2          1 home victory
12          5          4 home victory
13          0          3 away victory
14          5          4 home victory
15          2          3 away victory
16          5          1 home victory
17          0          1 away victory
18          1          6 away victory
19          1          5 away victory
20          0         13 away victory
21          7          1 home victory
22          5          1 home victory
23          5          3 home victory
24          5          0 home victory
25          5          0 home victory
 [ reached 'max' / getOption("max.print") -- omitted 39644 rows ]
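
A dichotomous indicator cannot tell a draw apart from a win. One way around this, sketched below on four made-up scores, is to nest a second ifelse() inside the first so that there are three categories.

```r
# Made-up scores for four matches
home_score <- c(0, 4, 1, 2)
away_score <- c(0, 2, 3, 2)

# If home scored more: home victory; else, if away scored more: away victory; else: draw
result <- ifelse(home_score > away_score, "home victory",
                 ifelse(home_score < away_score, "away victory", "draw"))
result
table(result)
```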

Dropping Variables

Use negative values in the brackets to specify variables you’d like to drop.

head(data[,c(-3,-4,-5,-6,-7)])

        date home_team  country neutral sum_of_gols            local id
1 1872-11-30  Scotland Scotland   FALSE           0 Glasgow Scotland  1
2 1873-03-08   England  England   FALSE           6   London England  2
3 1874-03-07  Scotland Scotland   FALSE           3 Glasgow Scotland  3
4 1875-03-06   England  England   FALSE           4   London England  4
5 1876-03-04  Scotland Scotland   FALSE           3 Glasgow Scotland  5
6 1876-03-25  Scotland Scotland   FALSE           4 Glasgow Scotland  6
      home_vic
1 home victory
2 home victory
3 home victory
4 home victory
5 home victory
6 home victory

We can also keep only a subset of the variables.

new_data <- data[,c(1,2)]
head(new_data) # only selected two variables and made a new object.

        date home_team
1 1872-11-30  Scotland
2 1873-03-08   England
3 1874-03-07  Scotland
4 1875-03-06   England
5 1876-03-04  Scotland
6 1876-03-25  Scotland
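
Positions change whenever a column is added or removed, so selecting by name is often safer. This is a sketch on a small made-up data frame; the object names toy, drop, and kept are my own.

```r
toy <- data.frame(date = c("1872-11-30", "1873-03-08"),
                  home_team = c("Scotland", "England"),
                  away_team = c("England", "Scotland"),
                  home_score = c(0, 4),
                  stringsAsFactors = FALSE)

# Keep variables by name
toy[, c("date", "home_team")]

# Drop variables by name: keep every column whose name is not in the drop list
drop <- c("away_team", "home_score")
kept <- toy[, !(names(toy) %in% drop)]
names(kept)
```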

Renaming Variables

Inevitably, you will need to rename variables. Doing so is straightforward with the colnames() function.

colnames(data)

 [1] "date"        "home_team"   "away_team"   "home_score"  "away_score" 
 [6] "tournament"  "city"        "country"     "neutral"     "sum_of_gols"
[11] "local"       "id"          "home_vic"

# colnames behaves like any vector, and as such, we can access the information
# as we would any vector
colnames(data)[4]

[1] "home_score"

colnames(data)[4:5]

[1] "home_score" "away_score"

# Renaming a variable is as easy as inserting a new value in the data structure.
colnames(data)[4] <- "home-score"
colnames(data)

 [1] "date"        "home_team"   "away_team"   "home-score"  "away_score" 
 [6] "tournament"  "city"        "country"     "neutral"     "sum_of_gols"
[11] "local"       "id"          "home_vic"

colnames(data)[1:5] <- c("var1","var2","var3","var4","var5")
colnames(data)

 [1] "var1"        "var2"        "var3"        "var4"        "var5"       
 [6] "tournament"  "city"        "country"     "neutral"     "sum_of_gols"
[11] "local"       "id"          "home_vic"

head(data)

        var1     var2     var3 var4 var5 tournament    city  country
1 1872-11-30 Scotland  England    0    0   Friendly Glasgow Scotland
2 1873-03-08  England Scotland    4    2   Friendly  London  England
3 1874-03-07 Scotland  England    2    1   Friendly Glasgow Scotland
4 1875-03-06  England Scotland    2    2   Friendly  London  England
5 1876-03-04 Scotland  England    3    0   Friendly Glasgow Scotland
  neutral sum_of_gols            local id     home_vic
1   FALSE           0 Glasgow Scotland  1 home victory
2   FALSE           6   London England  2 home victory
3   FALSE           3 Glasgow Scotland  3 home victory
4   FALSE           4   London England  4 home victory
5   FALSE           3 Glasgow Scotland  5 home victory
 [ reached 'max' / getOption("max.print") -- omitted 1 rows ]
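
Renaming by position breaks if the column order changes. As a sketch, a column can also be renamed by matching its current name; the toy data frame and the new name home_goals are made up for illustration.

```r
toy <- data.frame(home_score = c(0, 4), away_score = c(0, 2))

# Find the column called "home_score" and give it a new name
colnames(toy)[colnames(toy) == "home_score"] <- "home_goals"
colnames(toy)
```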

Subsetting Data

As noted above, it’s straightforward to subset data given what we know about an object’s structure. But there are also a few functions that make our life easier.

# Let's subset the data just to games Brazil was playing. There are many ways to do
# this, let's explore a few.


data = read.csv(file = "results.csv",
                stringsAsFactors = FALSE)


# (1) Use what we know about boolean operators from last week. 
data[data$home_team=="Brazil",]

          date home_team away_team home_score away_score   tournament
424 1916-07-08    Brazil     Chile          1          1 Copa América
427 1916-07-12    Brazil   Uruguay          1          2 Copa América
460 1917-10-12    Brazil     Chile          5          0 Copa América
486 1919-05-11    Brazil     Chile          6          0 Copa América
491 1919-05-18    Brazil Argentina          3          1 Copa América
495 1919-05-26    Brazil   Uruguay          2          2 Copa América
496 1919-05-29    Brazil   Uruguay          1          0 Copa América
498 1919-06-01    Brazil Argentina          3          3     Friendly
              city   country neutral
424   Buenos Aires Argentina    TRUE
427   Buenos Aires Argentina    TRUE
460     Montevideo   Uruguay    TRUE
486 Rio de Janeiro    Brazil   FALSE
491 Rio de Janeiro    Brazil   FALSE
495 Rio de Janeiro    Brazil   FALSE
496 Rio de Janeiro    Brazil   FALSE
498 Rio de Janeiro    Brazil   FALSE
 [ reached 'max' / getOption("max.print") -- omitted 544 rows ]

# More complex?
data[data$home_team=="Brazil" & data$away_team=="Argentina",]

           date home_team away_team home_score away_score   tournament
491  1919-05-18    Brazil Argentina          3          1 Copa América
498  1919-06-01    Brazil Argentina          3          3     Friendly
655  1922-10-15    Brazil Argentina          2          0 Copa América
660  1922-10-22    Brazil Argentina          2          1    Copa Roca
2136 1939-01-15    Brazil Argentina          1          5    Copa Roca
2139 1939-01-22    Brazil Argentina          3          2    Copa Roca
2222 1940-02-18    Brazil Argentina          2          2    Copa Roca
2225 1940-02-25    Brazil Argentina          0          3    Copa Roca
               city country neutral
491  Rio de Janeiro  Brazil   FALSE
498  Rio de Janeiro  Brazil   FALSE
655  Rio de Janeiro  Brazil   FALSE
660       São Paulo  Brazil   FALSE
2136 Rio de Janeiro  Brazil   FALSE
2139 Rio de Janeiro  Brazil   FALSE
2222      São Paulo  Brazil   FALSE
2225      São Paulo  Brazil   FALSE
 [ reached 'max' / getOption("max.print") -- omitted 37 rows ]

# Subset and keep only the two score columns
data[data$home_team=="Brazil" & data$away_team=="Argentina", c("home_score", "away_score")]

      home_score away_score
491            3          1
498            3          3
655            2          0
660            2          1
2136           1          5
2139           3          2
2222           2          2
2225           0          3
2508           3          4
2509           6          2
2510           3          1
4129           1          2
4132           2          0
4684           5          1
5318           2          3
5319           5          2
5560           0          3
5814           0          0
6765           4          1
6768           3          2
7317           0          2
7319           2          1
9348           2          1
9634           2          0
10835          2          1
12660          0          0
12986          0          0
13611          2          1
15575          2          0
16035          0          1
16455          1          1
17736          1          1
18111          2          0
18926          2          2
21122          0          1
22039          2          1
22155          4          2
 [ reached 'max' / getOption("max.print") -- omitted 8 rows ]
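
One of the functions that makes subsetting easier is base R’s subset(), which lets you write the condition without repeating the data frame name. This is a sketch on the built-in mtcars data (the object names cars, a, and b are made up); the bracket version is shown next to it for comparison.

```r
cars <- mtcars

# Bracket version: rows with 4 cylinders and more than 30 mpg, two columns
a <- cars[cars$cyl == 4 & cars$mpg > 30, c("mpg", "cyl")]

# The same request with subset(): no need to repeat cars$ inside the condition
b <- subset(cars, cyl == 4 & mpg > 30, select = c(mpg, cyl))

a
b
```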

Merging Data

Merging data is a must in quantitative political analysis: by bringing various datasets together, we can enrich our analysis. But this isn’t always straightforward. Observations can be dropped if one is not attentive to the dimensions of each data frame being merged.

The Basics

# Let's create two example data frames. Note that rep() is a function to repeat
# a sequence a specific number of times.

countries <- rep(c("China","Russia","US","Benin"),2) 
years <- c(rep(1999,4),rep(2000,4))

data1 <- data.frame(country=countries,
                   year=years,
                   repress = c(1,2,4,3,2,3,4,1),stringsAsFactors = FALSE)

data2 <- data.frame(country=countries,
                   year=years,
                   GDPpc= round(runif(8,2e3,20e3),3),stringsAsFactors = FALSE)

head(data1);head(data2)

  country year repress
1   China 1999       1
2  Russia 1999       2
3      US 1999       4
4   Benin 1999       3
5   China 2000       2
6  Russia 2000       3

  country year     GDPpc
1   China 1999  8942.119
2  Russia 1999 17715.519
3      US 1999  4948.356
4   Benin 1999 18299.354
5   China 2000 15222.150
6  Russia 2000 10466.718

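# Merging the datasets: we merge on identifier columns that are common to both
# datasets. Merging on country alone is not unique here (each country appears in
# two years), so every row is paired with both years of the other data frame.

merge(data1,data2,by="country") # Just countries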

   country year.x repress year.y     GDPpc
1    Benin   1999       3   1999 18299.354
2    Benin   1999       3   2000  3674.782
3    Benin   2000       1   1999 18299.354
4    Benin   2000       1   2000  3674.782
5    China   1999       1   1999  8942.119
6    China   1999       1   2000 15222.150
7    China   2000       2   1999  8942.119
8    China   2000       2   2000 15222.150
9   Russia   1999       2   1999 17715.519
10  Russia   1999       2   2000 10466.718
11  Russia   2000       3   1999 17715.519
12  Russia   2000       3   2000 10466.718
13      US   1999       4   1999  4948.356
14      US   1999       4   2000  8071.605
15      US   2000       4   1999  4948.356
 [ reached 'max' / getOption("max.print") -- omitted 1 rows ]

merge(data1,data2,by=c("country","year")) # country-years

  country year repress     GDPpc
1   Benin 1999       3 18299.354
2   Benin 2000       1  3674.782
3   China 1999       1  8942.119
4   China 2000       2 15222.150
5  Russia 1999       2 17715.519
6  Russia 2000       3 10466.718
7      US 1999       4  4948.356
8      US 2000       4  8071.605
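
The warning about dropped observations can be sketched with two small data frames that do not cover the same countries. All values below are made up for illustration. By default merge() keeps only the rows that match in both data frames; the all and all.x arguments keep the unmatched rows and fill the gaps with NA.

```r
# Made-up example data: the two data frames cover different countries
left  <- data.frame(country = c("China", "Russia", "US"),
                    repress = c(1, 2, 4),
                    stringsAsFactors = FALSE)
right <- data.frame(country = c("China", "US", "Benin"),
                    GDPpc = c(8900, 4900, 18300),
                    stringsAsFactors = FALSE)

inner     <- merge(left, right, by = "country")                 # matches only
outer     <- merge(left, right, by = "country", all = TRUE)     # keep everything
keep_left <- merge(left, right, by = "country", all.x = TRUE)   # keep all rows of left

nrow(inner)       # Russia and Benin are dropped
nrow(outer)       # all four countries, with NA where data is missing
nrow(keep_left)   # the three countries in left
```

Comparing nrow() before and after a merge is a cheap way to notice when observations have gone missing.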

Loops

As one quickly notes, many tasks in R can become repetitive. Loops and functions can dramatically speed up our workflow when a task is systematic and repeatable.

Let’s say we want to calculate how many games each country won when playing at home.

# Indicator: 1 if the home team won, 0 otherwise (draws count as 0)
data$victory <- ifelse(data$home_score > data$away_score, 1, 0)

# To do this, we'd need to subset by each group and then sum the victories.
sub <- data[data$home_team=="Brazil",]
victory_brazil <- sum(sub$victory)

sub <- data[data$home_team=="Argentina",]
victory_argentina <- sum(sub$victory)

group_means <- c(victory_brazil,victory_argentina) # combine
group_means

[1] 395 354

This works for a few cases, but it would become quite the undertaking as the number of groups increased. Loops let you repeat the operation using some type of index on your data frame.

Here is where loops can make one’s life easier! By “looping through” all the respective groups, we can automate this process so that it goes a lot quicker.

A loop essentially works like this:

Specify a length of some thing you want to loop through. In our case, it’s the number of groups.
Set the code up so that every iteration only performs a manipulation on a single subset at a time.
Save the contents of each iteration in a new object that won’t be overwritten. Here we want to think in terms of “stacking” results or concatenating them.
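
The three steps above can be sketched on the built-in mtcars data, which needs no download. This is only an illustration under that assumption: it computes the average mpg for each number of cylinders, and the object names groups, sub, and container are my own.

```r
# (1) What to loop through: the unique groups
groups <- unique(mtcars$cyl)

# (3) An empty container that results are appended to
container <- c()

for (i in seq_along(groups)) {
  # (2) Each iteration works on a single subset
  sub <- mtcars[mtcars$cyl == groups[i], ]
  container <- c(container, mean(sub$mpg))
}

names(container) <- groups   # label each result with its group
container
```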

In practice…

# (1) Specify the length
no.of.groups = unique(data$home_team) # only unique entries
no.of.groups

 [1] "Scotland"            "England"             "Wales"              
 [4] "Northern Ireland"    "USA"                 "Uruguay"            
 [7] "Austria"             "Hungary"             "Argentina"          
[10] "Belgium"             "France"              "Netherlands"        
[13] "Czechoslovakia"      "Switzerland"         "Sweden"             
[16] "Germany"             "Italy"               "Chile"              
[19] "Norway"              "Finland"             "Luxembourg"         
[22] "Russia"              "Denmark"             "Brazil"             
[25] "Japan"               "Paraguay"            "Canada"             
[28] "Estonia"             "Costa Rica"          "Guatemala"          
[31] "Spain"               "Poland"              "Yugoslavia"         
[34] "New Zealand"         "Romania"             "Latvia"             
[37] "Portugal"            "China"               "Australia"          
[40] "Lithuania"           "Turkey"              "Mexico"             
[43] "Aruba"               "Egypt"               "Haiti"              
[46] "Philippines"         "Bulgaria"            "Jamaica"            
[49] "Kenya"               "Bolivia"             "Peru"               
[52] "Honduras"            "Guyana"              "Uganda"             
[55] "Belarus"             "El Salvador"         "Barbados"           
[58] "Ireland"             "Trinidad and Tobago" "Greece"             
[61] "Curaçao"             "Dominica"            "Guadeloupe"         
[64] "Israel"              "Suriname"            "French Guyana"      
[67] "Cuba"                "Colombia"            "Ecuador"            
[70] "St. Kitts and Nevis" "Panama"              "Slovakia"           
[73] "Manchukuo"           "Croatia"             "Nicaragua"          
 [ reached getOption("max.print") -- omitted 216 entries ]

1:length(no.of.groups)

 [1]  1  2  3  4  5  6  7  8  9 10 11 12 13 14 15 16 17 18 19 20 21 22 23
[24] 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46
[47] 47 48 49 50 51 52 53 54 55 56 57 58 59 60 61 62 63 64 65 66 67 68 69
[70] 70 71 72 73 74 75
 [ reached getOption("max.print") -- omitted 216 entries ]

# (2) Make the code iterable: Check what you are repeating 
sub = data[data$home_team==no.of.groups[1],]
# Here, just by changing where we are in the vector "no.of.groups", we can draw 
# out a unique subset

# (3) Save the contents
container = c() # create an empty container that we can append to.

Now combine all these elements in a special base R construct called for(){}. Note that all the code goes in between the curly braces. Here we need to establish an arbitrary iterator, which I’ll call i in the example below. i will take the value of each entry in the vector 1:length(no.of.groups), e.g. i=1, then i=2, and so on for as many groups as we have.
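To see how the iterator moves before applying it to the real data, here is a minimal sketch on a toy vector. The names `teams`, `visited`, and `empty` are made up for illustration. It also shows `seq_along()`, which gives the same positions as `1:length()` but behaves sensibly when the vector happens to be empty.

```r
# Toy vector standing in for the unique group names (hypothetical data)
teams <- c("Brazil", "Argentina", "Uruguay")

visited <- character(0)  # empty container to append to
for (i in seq_along(teams)) {
  # On each pass i is 1, then 2, then 3; teams[i] is the matching entry
  visited <- c(visited, paste(i, teams[i]))
}
visited

# Why seq_along() is the safer habit: with an empty vector,
# 1:length(x) counts down from 1 to 0 instead of skipping the loop
empty <- character(0)
1:length(empty)    # 1 0
seq_along(empty)   # integer(0)
```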

container = c() # Empty Container
for ( i in 1:length(no.of.groups) ){
  sub = data[data$home_team==no.of.groups[i],] # rows for the i-th team only
  mu <- sum(sub$victory)                       # total victories for that team
  container <- c(container,mu)                 # append to the results so far
}

container

 [1] 208 298 117 102 208 188 206 250 354 208 270 231 136 171 278 312 271
[18] 204 158 100  22 177 211 395 182 128  70  73 176 102 238 194 108  79
[35] 186  70 180 188 161  52 118 291  16 229 101  43 146 135 184  81 124
[52] 136  49 159  39 118  56 141 170 127  72  28  58  95  87  24  57 124
[69]  97  47  85  59   0  88  14
 [ reached getOption("max.print") -- omitted 216 entries ]

# Or

data.frame(country=no.of.groups, container)

            country container
1          Scotland       208
2           England       298
3             Wales       117
4  Northern Ireland       102
5               USA       208
6           Uruguay       188
7           Austria       206
8           Hungary       250
9         Argentina       354
10          Belgium       208
11           France       270
12      Netherlands       231
13   Czechoslovakia       136
14      Switzerland       171
15           Sweden       278
16          Germany       312
17            Italy       271
18            Chile       204
19           Norway       158
20          Finland       100
21       Luxembourg        22
22           Russia       177
23          Denmark       211
24           Brazil       395
25            Japan       182
26         Paraguay       128
27           Canada        70
28          Estonia        73
29       Costa Rica       176
30        Guatemala       102
31            Spain       238
32           Poland       194
33       Yugoslavia       108
34      New Zealand        79
35          Romania       186
36           Latvia        70
37         Portugal       180
 [ reached 'max' / getOption("max.print") -- omitted 254 rows ]
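For comparison, base R can compute the same kind of by-group sum without an explicit loop. The sketch below uses a small made-up data frame called `toy` (not the tutorial's dataset) and checks that the loop, written following the three steps above, agrees with `tapply()`. The loop here pre-allocates its container with `numeric()` instead of growing it, which is a common habit for longer loops.

```r
# Toy match data (made up for illustration, not the tutorial's dataset)
toy <- data.frame(
  home_team = c("Brazil", "Argentina", "Brazil", "Uruguay", "Argentina"),
  victory   = c(1, 0, 1, 1, 1),
  stringsAsFactors = FALSE
)

# Loop version, following the three steps above
groups <- unique(toy$home_team)
loop_sums <- numeric(length(groups))      # pre-allocated container
for (i in seq_along(groups)) {
  sub <- toy[toy$home_team == groups[i], ]
  loop_sums[i] <- sum(sub$victory)
}
names(loop_sums) <- groups

# Vectorised version: tapply() splits `victory` by `home_team` and sums each piece
tapply_sums <- tapply(toy$victory, toy$home_team, sum)

loop_sums
tapply_sums[groups]   # reorder to match the loop's order of appearance
```

The loop is still worth learning because it makes each step visible, but once the pattern is clear, a one-line helper like `tapply()` is usually easier to read.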

Functions

Very often, there are specific tasks that we need to carry out again and again.

Building a function for these tasks can really make life easier, and often it makes one’s work more reproducible and transparent.

Take the example above: we will likely need this by-group sum many times, so it is worth converting it into a function.

Let’s go through the process of building our own functions in R. In basic terms, a function is a named block of code that takes a set of arguments and performs a specific task.

Let’s build a simple function that adds two values. Here the function will have two arguments, or put differently, two values that need to be entered for the function to perform. As you’ll note, this looks a lot like the set up for a loop!

add_me <- function( argument1, argument2 ){
  value <- argument1 + argument2
  return(value) # "return" means "send this back once the function is done"
}

add_me(2,3)

[1] 5

add_me(100,123)

[1] 223

add_me(60,3^4)

[1] 141

# We can set "default" values for an argument, so if there are no inputs, the
# function will still run.
add_me <- function( argument1=1, argument2=2 ){
  value <- argument1 + argument2
  return(value) 
}
add_me()

[1] 3

add_me(4,5)

[1] 9
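Defaults also combine with named arguments, so you can override only the one you care about. A short sketch, repeating the definition of `add_me` so the block runs on its own:

```r
# Same definition as above, repeated so this block is self-contained
add_me <- function(argument1 = 1, argument2 = 2) {
  value <- argument1 + argument2
  return(value)
}

add_me(argument2 = 10)                 # argument1 keeps its default: 1 + 10
add_me(argument2 = 5, argument1 = 4)   # named arguments can come in any order
add_me(4)                              # positional: fills argument1, so 4 + 2
```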

The basic structure is the following:

## name.of.the.function <- function(x, y, z){
##     ## "function" tells R that this is a function and defines
##     ## the arguments it will have, here (x, y, z)
##
##     out <- what the function does
##
##     return(out) ## defines the output of the function
## }
## the closing brace ends the function

Now, let’s build a function for the sum loop that we constructed in the last section. The arguments we need are straightforward: the data, the name of the group column, and the name of the value column.

group_sum <- function(data, group.var, value.var) {
    
    no.of.groups = unique(data[, group.var])
    # Does anyone know why I am accessing the data this way?
    
    container = c()  # Empty Container
    
    for (super_arbitrary_iterator in 1:length(no.of.groups)) {
        sub = data[data[, group.var] == no.of.groups[super_arbitrary_iterator], ]
        mu <- sum(sub[, value.var])
        container <- rbind(container, mu)  # return as matrix
    }
    
    # Lastly, create a data frame
    data_frame = data.frame(no.of.groups, container)
    
    
    return(data_frame)
}


# Apply the function to the match data used in the loop example
group_sum(data, group.var = "home_team", value.var = "victory")  # beautiful!

          no.of.groups container
mu            Scotland       208
mu.1           England       298
mu.2             Wales       117
mu.3  Northern Ireland       102
mu.4               USA       208
mu.5           Uruguay       188
mu.6           Austria       206
mu.7           Hungary       250
mu.8         Argentina       354
mu.9           Belgium       208
mu.10           France       270
mu.11      Netherlands       231
mu.12   Czechoslovakia       136
mu.13      Switzerland       171
mu.14           Sweden       278
mu.15          Germany       312
mu.16            Italy       271
mu.17            Chile       204
mu.18           Norway       158
mu.19          Finland       100
mu.20       Luxembourg        22
mu.21           Russia       177
mu.22          Denmark       211
mu.23           Brazil       395
mu.24            Japan       182
mu.25         Paraguay       128
mu.26           Canada        70
mu.27          Estonia        73
mu.28       Costa Rica       176
mu.29        Guatemala       102
mu.30            Spain       238
mu.31           Poland       194
mu.32       Yugoslavia       108
mu.33      New Zealand        79
mu.34          Romania       186
mu.35           Latvia        70
mu.36         Portugal       180
 [ reached 'max' / getOption("max.print") -- omitted 254 rows ]
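A likely answer to the question in the function's comment, sketched on made-up data: `$` looks for a column literally named after whatever is typed to its right, so it cannot use a column name stored in a variable. Square brackets evaluate the variable first, which is what a function receiving column names as strings needs. The data frame `toy` below is hypothetical.

```r
toy <- data.frame(
  home_team = c("Brazil", "Peru"),
  victory   = c(1, 0),
  stringsAsFactors = FALSE
)
group.var <- "home_team"    # column name stored as a string

toy$group.var        # NULL: there is no column literally called "group.var"
toy[, group.var]     # the home_team column
toy[[group.var]]     # same column, always returned as a vector
```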

# change whatever you want here. The function is super general
group_sum(data, group.var = "away_team", value.var = "victory")  # beautiful!

          no.of.groups container
mu             England       109
mu.1          Scotland       154
mu.2             Wales       184
mu.3  Northern Ireland       189
mu.4            Canada        97
mu.5         Argentina       141
mu.6           Hungary       188
mu.7    Czechoslovakia       121
mu.8           Uruguay       219
mu.9            France       136
mu.10          Austria       154
mu.11      Switzerland       202
mu.12      Netherlands       126
mu.13          Belgium       155
mu.14          Germany       115
mu.15           Norway       190
mu.16           Sweden       191
mu.17            Italy       103
mu.18            Chile       195
mu.19          Finland       238
mu.20           Russia       104
mu.21       Luxembourg       146
mu.22          Denmark       164
mu.23           Brazil       100
mu.24              USA       140
mu.25      Philippines        84
mu.26          Estonia       134
mu.27      El Salvador       123
mu.28       Costa Rica       137
mu.29         Paraguay       207
mu.30       Yugoslavia       113
mu.31           Poland       167
mu.32         Portugal       116
mu.33            Spain        79
mu.34          Romania       141
mu.35        Australia        74
mu.36           Mexico       128
 [ reached 'max' / getOption("max.print") -- omitted 252 rows ]
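One way to push the generality a step further is to make the summary itself an argument. The sketch below is a hypothetical variant, not part of the original materials: the name `group_apply`, the `fun` argument, and the `toy` data are all assumptions made for illustration.

```r
# Hypothetical variant of group_sum(): `fun` is any function that
# reduces a vector to a single number (sum, mean, max, ...)
group_apply <- function(data, group.var, value.var, fun = sum) {
  groups <- unique(data[[group.var]])
  # vapply() runs the same subset-then-summarise step once per group
  values <- vapply(groups, function(g) {
    fun(data[data[[group.var]] == g, value.var])
  }, numeric(1), USE.NAMES = FALSE)
  data.frame(group = groups, value = values)
}

toy <- data.frame(
  home_team = c("Brazil", "Peru", "Brazil", "Peru"),
  victory   = c(1, 0, 1, 1),
  stringsAsFactors = FALSE
)
group_apply(toy, "home_team", "victory")              # sums: Brazil 2, Peru 1
group_apply(toy, "home_team", "victory", fun = mean)  # means: Brazil 1, Peru 0.5
```

Because `fun` defaults to `sum`, calling the function without it reproduces the by-group sums, while passing `mean` or `max` reuses the same code for a different summary.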

Final Suggestions

This was a super rushed crash course in R. If you need help, you can always find me here.

The next step in R would be to introduce you to the tidyverse packages. The tidyverse is a family of R packages developed by Hadley Wickham and his colleagues that apply the same language and structure to different tasks in R. In short, the tidyverse makes tasks such as data management, cleaning, and visualization much easier.

We don’t have time today, but here you can find a workshop I prepared for graduate students about using the tidyverse packages in R.

Crash Course in R

Tiago Ventura

2020-04-21
