Intro to R

Python is not the only language used in the field of data science. Another frequently used language is “R”. Before I dig deeper into the Udacity course, I have decided to spend a little time exploring this language. As my guide, I will be using the free Datacamp tutorial on the language, as well as looking at some data on my own. First, however, I need to get the thing installed. (The tutorial includes an online interactive R console. I prefer, however, to have the language installed on my Mac as well.)

Installing R

Here are the steps I took to install R with homebrew on my Mac (running macOS Sierra Version 10.12.5).

brew tap homebrew/science
brew install r

After the install was completed, I followed the instructions and ran the following script:

R CMD javareconf JAVA_CPPFLAGS="-I/System/Library/Frameworks/JavaVM.framework/Headers -I$(/usr/libexec/java_home | grep -o '.*jdk')"

Next, I followed instructions on this post to set up an R GUI:

brew install r-gui
brew linkapps r-gui

This created an app in the Applications folder for the R console. Lastly, I created a shortcut in the Dock to launch “R”.

(Note: I found that the R GUI is somewhat unstable on my Mac. After several crashes, I have opted to work with “R” from a terminal window, which seems much more stable.)

Before I start exploring data on my own, it makes sense to review some of the basics of R and R data types.

Basic R Data Types

Variables

R has a hierarchy of data structures similar to the one in Python's pandas. These include variables, vectors, and data frames.

To assign a value to a variable, we use the “<-” notation.

> some_variable <- 100
> some_variable
[1] 100

Note: you can display the contents of a variable by just entering the name in a prompt.
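A minimal sketch of two related points, using a made-up variable name (other_variable) that is not part of the tutorial: R also accepts "=" for top-level assignment, and a single number is really a vector of length one, which is why the output above starts with [1].

```r
# Assign with the arrow operator, as in the example above.
some_variable <- 100

# "=" also works for top-level assignment; other_variable is a hypothetical name.
other_variable = 250

# A single number is a numeric vector of length one.
class(some_variable)    # "numeric"
length(some_variable)   # 1

# Variables can be used directly in arithmetic.
total <- some_variable + other_variable
total                   # 350
```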

Vectors

To create a vector, we use a similar syntax combined with the “c()” function. This function is named “c()” because it combines values.

> some_vector <- c(10,20,30,0,-10)
> some_vector
[1]  10  20  30   0 -10

It is possible, with R, to name the elements in a vector. This makes it easier to reference the data by name rather than only by position (which is also possible).

> names(some_vector) <- c("Monday", "Tuesday", "Wednesday", "Thursday", "Friday")
> some_vector
   Monday   Tuesday Wednesday  Thursday    Friday
       10        20        30         0       -10

An interesting aspect of vectors is that indexing starts at 1 rather than 0. For example:

> some_vector <- c(10,20,30,0,-10)
> names(some_vector) <- c("Monday", "Tuesday", "Wednesday", "Thursday", "Friday")
> some_vector
   Monday   Tuesday Wednesday  Thursday    Friday
       10        20        30         0       -10
>
> some_vector[3]
Wednesday
       30
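Beyond a single position, a vector can be indexed in a few other ways. The following is a small sketch using the same vector; the result names (wednesday, first_two, without_monday, bare_value) are made up for illustration.

```r
some_vector <- c(10, 20, 30, 0, -10)
names(some_vector) <- c("Monday", "Tuesday", "Wednesday", "Thursday", "Friday")

# Select a single element by name instead of by position.
wednesday <- some_vector["Wednesday"]

# Select several elements at once with a vector of positions.
first_two <- some_vector[c(1, 2)]

# A negative index drops that position rather than counting from the end.
without_monday <- some_vector[-1]

# Double brackets return the bare value without the name attached.
bare_value <- some_vector[["Wednesday"]]
```

Note that the negative index behaves differently from Python, where -1 refers to the last element.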

Matrices

R also has the concept of matrices. As you might expect, these are two-dimensional arrays whose elements all share the same data type. To define a matrix, you can use the following notation:

> mymatrix <- matrix(1:9, byrow = TRUE, nrow=3)
> mymatrix
     [,1] [,2] [,3]
[1,]    1    2    3
[2,]    4    5    6
[3,]    7    8    9
>
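To see what byrow = TRUE changes, here is a short sketch that fills the same values column by column (the default) and then reads elements back out. The name by_column is made up for this example.

```r
# Same values as above, but filled column by column (the default).
by_column <- matrix(1:9, nrow = 3)

# dim() reports the number of rows, then columns.
dim(by_column)            # 3 3

# Index with [row, column]; both are 1-based.
by_column[2, 3]           # 8

# Leaving one index empty selects a whole row or column.
by_column[1, ]            # 1 4 7

# t() transposes, which gives the byrow = TRUE layout shown above.
mymatrix <- matrix(1:9, byrow = TRUE, nrow = 3)
identical(t(by_column), mymatrix)   # TRUE
```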

R Data Frame Basics

An R data frame “is used for storing data tables. It is a list of vectors of equal length.”

http://www.r-tutor.com/r-introduction/data-frame
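The "list of vectors of equal length" description can be sketched directly by building a small data frame by hand. The vectors and the name week below are made up for illustration and have nothing to do with the survey data.

```r
# Two hypothetical vectors of equal length.
day <- c("Monday", "Tuesday", "Wednesday")
temperature <- c(10, 20, 30)

# Each vector becomes one column of the data frame.
week <- data.frame(day = day, temperature = temperature)

# A data frame is a list underneath, with one element per column.
is.list(week)     # TRUE
nrow(week)        # 3
ncol(week)        # 2

# str() gives a compact summary of the column types.
str(week)
```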

The following GitHub repo contains the results of a survey on people’s opinions about the San Andreas Fault: https://github.com/fivethirtyeight/data/tree/master/san-andreas. The data from this set can be used to explore some of the features of R data frames.

Loading Data

R has built-in features that make it easy to load data from files. The following shows how to load data into a data frame from a “.csv” file.

> setwd("/Users/milesporter/data-science/data-sets/san-andreas")
> mydata = read.csv("earthquake_data.csv")

This creates a data frame object that we can then use to analyze the data in the CSV file.
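Here is a self-contained sketch of the same call, assuming a tiny made-up CSV written to a temporary file rather than the survey data. One thing worth flagging: since R 4.0.0, read.csv no longer converts text columns to factors by default, so stringsAsFactors = TRUE is passed here to get factor columns like the ones shown later in this post.

```r
# Write a small, made-up CSV to a temporary file.
csv_path <- tempfile(fileext = ".csv")
writeLines(c("Age,US Region",
             "18 - 29,Pacific",
             "30 - 44,Mountain",
             "18 - 29,Pacific"),
           csv_path)

# stringsAsFactors = TRUE turns text columns into factors.
sample_data <- read.csv(csv_path, stringsAsFactors = TRUE)

dim(sample_data)          # rows, then columns
class(sample_data$Age)    # "factor"
levels(sample_data$Age)   # the distinct age buckets

unlink(csv_path)          # remove the temporary file
```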

Once we load the data into a data frame, we can see what the columns are by using the “colnames” function like so:

> colnames(mydata)
 [1] "In.general..how.worried.are.you.about.earthquakes."                                                                                                  
 [2] "How.worried.are.you.about.the.Big.One..a.massive..catastrophic.earthquake."                                                                          
 [3] "Do.you.think.the..Big.One..will.occur.in.your.lifetime."                                                                                             
 [4] "Have.you.ever.experienced.an.earthquake."                                                                                                            
 [5] "Have.you.or.anyone.in.your.household.taken.any.precautions.for.an.earthquake..packed.an.earthquake.survival.kit..prepared.an.evacuation.plan..etc..."
 [6] "How.familiar.are.you.with.the.San.Andreas.Fault.line."                                                                                               
 [7] "How.familiar.are.you.with.the.Yellowstone.Supervolcano."                                                                                             
 [8] "Age"                                                                                                                                                 
 [9] "What.is.your.gender."                                                                                                                                
[10] "How.much.total.combined.money.did.all.members.of.your.HOUSEHOLD.earn.last.year."                                                                     
[11] "US.Region"
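The dots in these names come from read.csv, which by default (check.names = TRUE) passes each header through make.names so that it is a syntactically valid R name. Below is a short sketch of that conversion, plus one way to substitute shorter names. The stand-in data frame and the short names are hypothetical, not part of the data set.

```r
# make.names() replaces spaces and punctuation with dots.
make.names("What is your gender?")    # "What.is.your.gender."

# A stand-in data frame with two long, survey-style column names.
survey <- data.frame(
  What.is.your.gender. = c("Male", "Female"),
  Have.you.ever.experienced.an.earthquake. = c("Yes", "No")
)

# Assigning to colnames() renames the columns in place.
colnames(survey) <- c("gender", "experienced_quake")
colnames(survey)
```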

To view a specific column of data, we can address it by name or column number. For example:

mydata$Age

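(Note: There are several ways to accomplish this using slightly different syntax. mydata[["Age"]] and mydata[[8]] return the same result as mydata$Age, while mydata["Age"] and mydata[8] return a one-column data frame instead.)

This will return all of the items in the “Age” column. The result is returned as a vector (here a factor, as the “Levels” line at the end of the output shows).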

> mydata$Age
   [1] 18 - 29 18 - 29 18 - 29 18 - 29 18 - 29 18 - 29         18 - 29 18 - 29
  [10] 18 - 29 18 - 29 18 - 29         18 - 29 18 - 29 18 - 29 18 - 29 18 - 29
  [19] 30 - 44 30 - 44 18 - 29 30 - 44 18 - 29 18 - 29 30 - 44 18 - 29 30 - 44
  [28] 30 - 44 30 - 44 18 - 29 18 - 29 18 - 29 30 - 44 18 - 29 30 - 44 18 - 29
  [37] 18 - 29 18 - 29 18 - 29 30 - 44 30 - 44 18 - 29 45 - 59 45 - 59 18 - 29 ...
...
[1009] 60      60      30 - 44 30 - 44
Levels:  18 - 29 30 - 44 45 - 59 60

At first, it appears that when we reference a set of data in the data frame, R simply displays that data in a format that saves space. While this is true, there is more going on. mydata$Age extracts the column itself, so the result is a vector rather than a single-column data frame. The same happens with mydata[,"Age"], because R simplifies an object when it can. To keep the result as a data frame, we can do the following:

> mydata["Age"]
or
> mydata[,"Age", drop=FALSE]

Both of these will display the data as follows:

> mydata[,"Age", drop=FALSE]
         Age
1    18 - 29
2    18 - 29
3    18 - 29
4    18 - 29
5    18 - 29
...
1009      60
1010      60
1011 30 - 44
1012 30 - 44
1013
>
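One way to check which form returns what is to ask for the class of each result. This sketch uses a small stand-in data frame (ages_df, a made-up name) rather than the survey data.

```r
# A stand-in for the survey data with a single factor column.
ages_df <- data.frame(Age = factor(c("18 - 29", "30 - 44", "18 - 29")))

# These three forms return the column itself (a factor here).
class(ages_df$Age)                      # "factor"
class(ages_df[["Age"]])                 # "factor"
class(ages_df[, "Age"])                 # "factor"

# These two keep the data frame wrapper around the column.
class(ages_df["Age"])                   # "data.frame"
class(ages_df[, "Age", drop = FALSE])   # "data.frame"
```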

Some Basic Statistics With R

Now that we have the data, let’s do some basic statistical analysis. We will start with a simple frequency distribution of the ages in the dataset. (Note: This data set contains survey information about earthquakes. Each row is an individual response, and the “Age” column contains data in age “buckets”.)

> library(MASS)  # not needed for table(), which is part of base R
> age = mydata$Age
> age.freq = table(age)
> age.freq
age
        18 - 29 30 - 44 45 - 59      60
     12     215     257     275     254

I find this display to be somewhat hard to read, but we can re-format the data using the cbind function like so:

> cbind(age.freq)
        age.freq
              12
18 - 29      215
30 - 44      257
45 - 59      275
60           254

If I want to find out which bucket has the highest frequency, I can use the max function along with the which function like so:

> which(age.freq==max(age.freq))
45 - 59
      4

The number “4” in the display is the position of the maximum within age.freq, counting from 1 rather than 0. This is more obvious when you look at the results of the cbind function above: “45 - 59” is the 4th entry in age.freq, because the blank bucket comes first.
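As an alternative sketch, which.max does the same lookup in one step, and names recovers the bucket label instead of its position. The counts are copied from the table above into a made-up vector, age_counts, so the example runs on its own.

```r
# Rebuild the frequencies from the counts shown above.
age_counts <- c(12, 215, 257, 275, 254)
names(age_counts) <- c("", "18 - 29", "30 - 44", "45 - 59", "60")

# which.max() returns the position of the first maximum.
top <- which.max(age_counts)

# names() recovers the bucket label instead of its position.
names(top)                           # "45 - 59"

# sort() orders the buckets from most to least frequent.
sort(age_counts, decreasing = TRUE)
```

Unlike the which/max combination, which.max returns only the first maximum if there is a tie.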

What if we want to find the relative distribution of the ages of the people who took the survey? We can do that as follows:

> age.relfreq = age.freq / nrow(mydata)
> cbind(age.relfreq)
        age.relfreq
          0.0118460
18 - 29   0.2122409
30 - 44   0.2537019
45 - 59   0.2714709
60        0.2507404

As a sanity check, we can easily verify that the relative frequencies sum to one:

> sum(age.relfreq)
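The same relative frequencies can also be computed with prop.table, and all.equal gives a sum check that tolerates floating-point rounding. This is a sketch that rebuilds the Age column from the counts above; the names age_sample, age_freq, and age_relfreq are made up for the example.

```r
# Rebuild a factor with the same bucket counts as the survey's Age column.
age_sample <- factor(rep(c("", "18 - 29", "30 - 44", "45 - 59", "60"),
                         times = c(12, 215, 257, 275, 254)))

age_freq <- table(age_sample)

# prop.table() divides each count by the total.
age_relfreq <- prop.table(age_freq)

# all.equal() compares with a small tolerance, unlike ==.
isTRUE(all.equal(sum(age_relfreq), 1))   # TRUE

# Percentages are often easier to read than proportions.
round(100 * age_relfreq, 1)
```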

We can also graph this distribution like so:

> barplot(age.relfreq)
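A slightly more readable version of the plot might add a title and axis labels. This sketch copies the rounded relative frequencies from above into a made-up vector, and the "No answer" label for the blank bucket is my assumption about what the blank means.

```r
# Relative frequencies copied (rounded) from the output above.
age_share <- c(0.0118, 0.2122, 0.2537, 0.2715, 0.2507)

# "No answer" is an assumed label for the blank bucket.
names(age_share) <- c("No answer", "18 - 29", "30 - 44", "45 - 59", "60")

# barplot() draws the chart and returns the x position of each bar.
bar_midpoints <- barplot(age_share,
                         main = "Survey respondents by age",
                         xlab = "Age bucket",
                         ylab = "Relative frequency",
                         ylim = c(0, 0.3))
```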

Summary

This post has covered only the most basic aspects of R. Check out my next post as I explore more of the features of this language, and more topics in data science in general.