Statistical Programming with R
From I573Wiki2007
Contents |
Downloads
- Download binaries
- If you are on Unix, it's generally a good idea to compile from source.
- Documentation - theres a huge amount.
- The R manual is a good document to get started with
- R Data Import and Export is useful when you need work with data saved in other packages
- A variety of short introductory tutorials as well as documents on specific topics in statistical modeling can be found here
- Help
- The R-help mailing list is a superb resource. Has a lot of renoowned experts on the list and if you can learn quite a bit of statistics in addition to getting R help just be reading the archives of the list. Note: they do expect that you have done your homework! See the posting guide
- There are also special interest groups (SIG) for various topics such as financial modeling, geography, epidemiology etc.
- Some reference tables that map commands from Numpy and Matlab to R
- A useful reference for Python/Matlab users learning R
Books
A number of books are available ranging from intriductions to R along with examples up to advanced statistical modeling using R. Some books that I think are good for the new R user
- Data Analysis and Graphics Using R: An Example-based Approach by John Maindonald, John Braun
- Introductory Statistics with R by Peter Dalgaard
- Using R for Introductory Statistics by John Verzani
IDE's and Editors
In general, you will be writing code that is more than three lines, and you'll want to reuse code. So you'll be using some form of text editor. You can use Windows Notepad or vi and cut-and-paste code into your R session. However life is much easier (and more efficient) if you use a proper text editor that integrates with R.
- EMACS + ESS - works on Linux, Windows and OS X. Invaluable if you're an EMACS user
- TINN-R a small but useful editor for Windows
- KATE is the KDE text editor and has support for R code
- You can also do R in Eclipse
A number of GUI's also exist for R. They will let you do a lot of the basic analysis and will probably let you get up and running faster than just starting from the command line
The R Programming Language
What is R and Why Use It?
- R is a environment for modeling as well as a language for statistical computing
- R has been used in
- Chemometrics, cheminformatics
- Ecology
- Fishery
- Psychology
- Political Science
- Econometrics
- Finance
- Geography
- .. and many other areas
- The language is similar to Matlab
- R is strongly based on vectors and matrices, as is Matlab
- Matlab is good for numerical simulation and such
- R can be slow for this type of problem. R excels in statistical modeling
- It is possible to use R just for modeling purposes, as it has a large number of statistical modeling techniques built in. Thus you can load data, and immediately build a linear regression model, a neural network model and so on. If you use one of the R GUI's, there is no code involved
- If you need to develop and prototype algorithms, R is a good candidate
- Interpreted language - immediate results
- Good support for vectorization
- Avoids writing explicity loops
- Faster than explicit loops
- Similar to the map function in Python and Lisp
- For many use cases, interpreted R is sufficiently fast. If more speed is needed, core code can be written in C and usde from within R.
- R integrates with other languages well
- C code can be linked to R. C code can also call R functions
- Java code can be easily called from R and vice versa. See various pakages at rosuda.org
- Python can be used in R and vice versa using RPy
- R has good support for publication quality graphics, though there is a learning curve. See the R Graph Gallery to see examples of what R graphics can do. The code is also available for each graph, so this is a very nice learning tool.
- A variety of 2D plots are available - scatter, bar, pie, box etc.
- 3D plots are possible, including OpenGL based graphics, allowing for interactive 3D plots
Working in R
- Help
- To get help on a function do: ?function_name
- If you don't know the function name but have an idea of the concept use: help.search()
- R Site Search indexes package documentation and mailing list archives. An invaluable resource
- A very useful Firefox extension that can be used to search the above site
- You can also use the RSiteSearch() from within R to search the above site
- To exit from prompt: quit()
- To see the variables that have been declared in your workspace: ls()
- To enter code, you can type it at the prompt. For larger code in a separate file, say, code.R we can load it in the workspace by: source('code.R'). (You need to specify the full path to the file)
- After you've finished working, you can save all your work by doing: save.image(). This creates a binary file that is automatically loaded when you start R at a later time.
- You can save individual objects or multiple objects using the save() command
Basic Types in R
R has a number of primitive types: These will include
- numeric - indicates an integer or floating point number. In general, R does not care whether something is integer or floating point (though it is possible to specify this using a variables mode)
- character - an alphanumeric symbol
- logical - indicates a boolean value. Possible values are TRUE and FALSE
There are also some more complex types:
- vector - a 1D collection of objects of the same type
- list - a one dimensional container that can contain arbitrary primitives. Each element can have a name associated with it, so that we can using traditional indexing (x1) or use associative indexing (x$firstElement)
- matrix - a two dimensional structure that contains objects of a single type.
- data.frame - a generalization of the matrix type. In this case, each column can be of a different type. This is the workhorse data structure for import and export of data
Simple types
Some examples of these types
> x <- 1.2 > x <- 1.2 > y <- 'abcdegf' > y [1] "abcdegf" > z <- c(1,2,3,4,5) # a vector of numerics > z [1] 1 2 3 4 5
Matrices and data.frames
We can also make a matrix from a vector:
> z <- c(1,2,3,4,5,6,7,8,9,10)
> m <- matrix(z, nrow=2, ncol=5)
> m
[,1] [,2] [,3] [,4] [,5]
[1,] 1 3 5 7 9
[2,] 2 4 6 8 10
NOTE: when converting a vector to a matrix, the matrix is constructed column-wise
Now lets say we wanted to have a data structure with compound names and numbers of atoms, along with an indicator of whether they are toxic or not. These means we want to store a set of names (character), numbers (numeric) and TRUE/FALSE values (logical)
> x <- c('aspirin', 'potassium cyanide', 'penicillin', 'sodium hydroxide')
> y <- c(21, 3, 41,3)
> z <- c(FALSE, TRUE, FALSE, TRUE)
> d <- data.frame(x,y,z)
> d
x y z
1 aspirin 21 FALSE
2 potassium cyanide 3 TRUE
3 penicillin 41 FALSE
4 sodium hydroxide 3 TRUE
But what do the columns mean? If we were to look at it later on, we might forget what each column represents. So we should add names:
> names(d) <- c('Name', 'NumAtom', 'Toxic')
> d
Name NumAtom Toxic
1 aspirin 21 FALSE
2 potassium cyanide 3 TRUE
3 penicillin 41 FALSE
4 sodium hydroxide 3 TRUE
Lists
The 'list type is another very useful type and is essentially a 1D container of arbitrary objects:
> x <- 1.0 > y <- 'hello there' > z <- TRUE > s <- c(1,2,3) > l <- list(x,y,z,s) > l [[1]] [1] 1.0 [[2]] [1] "hello there" [[3]] [1] TRUE [[4]] [1] 1 2 3
We would then get the 2nd element of the list by doing
l[[2]]If we want to get say the 2nd and 4th elements we have to write it as
l[c(2,4)]
We can also name the elements of the list, so that we can access elements more naturally:
> x <- 'oxygen' > z <- 8 > m <- 32 > l <- list(name='oxygen', atomicNumber=z, molWeight=m) > l $name [1] "oxygen" $atomicNumber [1] 8 $molWeight [1] 32
We can then get the molecular weight by writing
> l$molWeight [1] 32
Identifying types
R allows us to easily identify the type of an object using the class() function. This also allows us to assign arbitrary types to any object (which is useful when we are working with models). An example of this is:
> x <- 1 > class(x) [1] "numeric" > x <- 'hello world' > class(x) [1] "character" > x <- matrix(c(1,2,3,4), nrow=2) > class(x) [1] "matrix"
Indexing
This is a fundamental operation in R and is applied to vectors and matrices (and data.frames and lists).
- All indices for whatever type start at 1 (and not 0 like other languages)
- For a 1D vector, x we get the i'th value by doing: x[i].
- For a 2D matrix, x, we get the i,j element by doing: x[i,j]
But it gets interesting when we can specify an index vector, i.e., a vector of values indicating which positions to select (or ignore).
Lets say we have a vector x and a matrix y:
> x
[1] 1 2 3 4 5 6 7 8 9 10
> y
[,1] [,2] [,3] [,4] [,5]
[1,] 1 4 7 10 13
[2,] 2 5 8 11 14
[3,] 3 6 9 12 15
Some ways to get the elements of x:
- The 5th element of x: x[5]
- Only the 3rd, 4th and 5th elements of x: x[c(3,4,5)]
- Everything but the the 3rd, 4th and 5th elements of x: x[-c(3,4,5)]
Some ways to get the elements of y:
- The element in the first column and second row: y[2,1]
- The elements in the first column only: y[,1]
- The elements in the first two columns: y[,c(1,2)]
- The elements in the first two columns and the last two rows: y[c(2,3), c(1,2)]
As in the case of the vector, we can ignore specific columns by preceding the index specification by a minus sign.
Functions in R
Functions are pretty much the same as in other languages. A simple function would be
my_add <- function(x,y) {
return(x+y)
}
We can then simply call the function as: my_add(1,2)
Also note, that we don't have to declare types (like Python & Ruby).
Loop Constructs and Functional Programming
Loops can be writtenn using the familiar for construct. Note that compared to the usual for loop in C or Java, we can loop directly over elements in a vector (or list). In fact the traditional loop that we write as:
for (i = 0; i < 10; i++) print(i)is written in R in terms of looping over the elements of a vector containing the numbers 0 to 9:
for (i in c(0,2,3,4,5,6,7,8,9)) {
print(i)
}
NOTE A simple way to generate contigous sequences of numbers is to write:
> x <- 1:10 > x [1] 1 2 3 4 5 6 7 8 9 10We can also generate sequences of different forms using the seq() command
Now, if we have the numbers 1 to 10 and we'd like to get the squares of them, we can write a simple loop:
for (i in 1:10) {
print(i*i)
}
Good R code tries to avoid loops. So lets first write a function that works on a single number:
my_square <- function(a) {
print(a*a)
}
Next, we want to apply this function to each of the elements in a:
> sapply(x, my_square) [1] 1 [1] 4 [1] 9 [1] 16 [1] 25 [1] 36 [1] 49 [1] 64 [1] 81 [1] 100 [1] 1 4 9 16 25 36 49 64 81 100
Note that at the end it also returns the squared values. This is because the print() command returns the value that was printed. Thus our function will print and then return the square of the argument
We can also do this type of thing with matrices. In this case, we can apply a function to each row or column (or both) of a matrix. In this case, the function will receive a vector of elements and not a single element.
So lets take a matrix:
> x <- matrix(1:10, nrow=5)
> x
[,1] [,2]
[1,] 1 6
[2,] 2 7
[3,] 3 8
[4,] 4 9
[5,] 5 10
and we now want the sums of the rows. So our function would be:
my_sum <- function(a) {
return(sum(a))
}
Then, to get the sums of each row we do
> apply(x, 1, my_sum) [1] 7 9 11 13 15
Loading Data From Files
R has a number of functions to read data from text files
For a CSV file we can do
> d <- read.csv('afile.txt', header=TRUE)
This will give us a data.frame and if there was a header line, the columns will be named appropriately.
If we're working with tab-delimited files we can do
> d <- read.table('afile.txt', header=TRUE, sep='\t')
R also has a number of functions to read data files generated by other programs (SAS, Stata, SPSS, Minitab)
Building a Regression Model
We have enough to go on to build a regression model. We will work with a data file that is in the SVN repository and looks like
y var1 temp var3 1 2 33 23 2 2.4 35 45 3 4 37 34 4 3.4 38 84 5 5.1 39 60 6 5.8 34 41
The first step is to read it into R
> d <- read.table('data.txt', header=TRUE)
> d
y var1 temp var3
1 1 2.0 33 23
2 2 2.4 35 45
3 3 4.0 37 34
4 4 3.4 38 84
5 5 5.1 39 60
6 6 5.8 34 41
We then build a regression model using y and temp, i.e., we want to fit:
y = a * temp + b
We do this by
> model <- lm(y ~ temp, d)
NOTE the intercept term is implied We can then get a summary of the model
> summary(model)
Call:
lm(formula = y ~ temp, data = d)
Residuals:
1 2 3 4 5 6
-1.5357 -1.1786 -0.8214 -0.1429 0.5357 3.1429
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) -8.0714 13.0244 -0.62 0.569
temp 0.3214 0.3611 0.89 0.424
Residual standard error: 1.911 on 4 degrees of freedom
Multiple R-Squared: 0.1653, Adjusted R-squared: -0.04337
F-statistic: 0.7922 on 1 and 4 DF, p-value: 0.4237
If we wanted to fit a model like
y = a1 * var1 + a2 * temp + b
we would do
> model <- lm(y ~ var1 + temp, d)
If we wanted to use all the independent variables we can write
> model <- lm(y ~ ., d)
> summary(model)
Call:
lm(formula = y ~ ., data = d)
Residuals:
1 2 3 4 5 6
-0.013202 0.056981 -0.046416 -0.037875 0.044112 -0.003601
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 3.961298 0.610508 6.489 0.022938 *
var1 1.183618 0.021485 55.091 0.000329 ***
temp -0.187403 0.019343 -9.689 0.010486 *
var3 0.037781 0.002039 18.526 0.002901 **
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 0.06697 on 2 degrees of freedom
Multiple R-Squared: 0.9995, Adjusted R-squared: 0.9987
F-statistic: 1300 on 3 and 2 DF, p-value: 0.0007687
The variable model is actually a list and has a number of components. We can get their names by doing
> names(model) [1] "coefficients" "residuals" "effects" "rank" [5] "fitted.values" "assign" "qr" "df.residual" [9] "xlevels" "call" "terms" "model"
Given this, we can get the residuals as a vector by doing
> model$residuals
1 2 3 4 5 6
-0.013202144 0.056981387 -0.046415634 -0.037874619 0.044112369 -0.003601359
Plotting Data
A common function is to plot two vectors against each other
> x <- 1:10 > y <- x*2 > plot(x,y)
This brings up a simple plot. All aspects of the plot can be modified. See the manpages for plot and par
Many model objects will implement their own plotting routines for that type of object. So using the model object we generated using lm() previously we can type
> plot(model)
and we will get a series of diagnostic plots characterizing th regression model.
Plots can be saved in a variety of formats
- JPEG
- PNG
- EPS
- WMF
Again, using our regression model we can plot the fitted versus observed y values by
> fitted <- model$fitted.values > plot(d$y, fitted)
Finally, we would save the graph as a PNG file by doing
> dev.copy(png, file='aplot.png') > dev.off()
