TP 0: R basics refresher
The purpose of this exercise is to introduce you to using
R (or remind you!) with a simple session,
and for you to start gaining some facility with
R commands; the code here is to get you started.
You will get the most out of the session if you also do
some exploring on your own:
check the help files for each function to learn
what default values and optional arguments are there,
and try out your own variations.
An excellent source of R documentation is the
Comprehensive R Archive Network, or CRAN.
There is a Swiss mirror site at
http://cran.ch.r-project.org/.
If you go to that site, you will find several links under 'Documentation'
(the fourth major entry on the left side).
'Official' documentation is available under 'Manuals';
other helpful documentation is under 'Contributed'.
To proceed, you will need to start R.
You can copy/paste the code below at the R prompt.
Basic R
0. Getting help in R.
You should become acquainted with the help facility within R,
it can be your friend!
The basic help command is
help();
within the parentheses
()
you would type (inside of double quotes) the name of a function
whose help file you want to see, e.g.
help("mean").
If you don't know the exact command name, use
help.search(),
with the name of the concept inside double quotes within the parentheses.
1. Working with data.
R has a number of functions to create data vectors, including:
c(), seq(), rep().
Find out what each of these do, and make some data vectors of your choice using each.
To get some practice using statistical functions
and performing small calculations in R,
create a weight and corresponding height
vector for computing body mass index
(bmi) (this example is from
the book Introductory Statistics with R by Peter Dalgaard):
weight <- c(60,72,57,90,95,72)
height <- c(1.75, 1.80, 1.65, 1.90, 1.74, 1.91)
bmi <- weight / height^2
bmi
# Type this in R to see the computed values
plot(height,weight)
# A basic scatterplot
The #
sign indicates a comment:
anything occurring after this sign on a line is ignored by
R
(but can be very useful in programming at it provides a means for
documenting your code).
These data vectors are too small to really require summaries;
we can quickly generate some artificial data to get practice using
R
for simple graphical and numerical summaries.
R can generate (pseudo) random numbers from different
probability distributions; here is an example:
simdata <- rchisq(100,8)
Now, make a number of different histograms for this same
data set, varying the number of bars (also called 'bins' or 'cells').
It will be easiest to see several at once if you first set up
the plotting region for multiple plots, as in the first line below:
par(mfrow=c(2,2))
hist(simdata)
# this is the default version
hist(simdata,freq=FALSE)
# what's the difference here?
hist(simdata,freq=FALSE,breaks=2)
# experiment with breaks
bps <- c(0,2,4,6,8,10,15,25)
# make your own breakpoints
hist(simdata,freq=FALSE,breaks=bps)
You should also make a boxplot of this simulated data set for a
different type of visualization:
boxplot(simdata).
Approximate the 5 number summary
(min, Q1, median, Q3, max) by looking
at the boxplot, and then obtain the 5 number summary in
R with
quantile(simdata).
Obtain other summary statistics for your simulated data: mean,
standard deviation (SD), median, interquartile range (IQR), median absolute
deviation from the median (MAD).
If you don't know which R
functions there are to compute these, use
help.search().
hellung data
It is a little more interesting to look at real data.
Load the package ISwR, and examine the variables in
hellung (don't forget to read the help file).
library(ISwR)
?hellung
data(hellung)
attach(hellung)
You can find the variable names with
names(hellung),
and can summarize the data set with
summary(hellung).
Which of the variables does it not make sense to summarize like this?
Make a boxplot of the variable
conc.
Now, make side by side boxplots of
conc,
one for each value of
glucose;
do the same with
diameter (we will learn more about this notation soon):
boxplot(conc ~ glucose)
boxplot(diameter ~ glucose)
Does the distribution (pattern of variability) of either
variable appear to depend on the presence or absence of glucose?
Do we have enough information to decide
whether glucose is causing any difference?
Simulated microarray data
Now, we will get a little closer to the scale of microarray data
by simulating a much larger data set.
We will use the normal (Gaussian) distribution for convenience -
you should not take this to mean that it is realistic for microarray data:
it is not!
Usually, preprocessed data values
are stored in a data matrix or data frame with
rows representing probes or probe sets (genes)
and columns representing mRNA samples (chips).
Note that this is the transpose of the usual data matrix in statistics,
where rows represent individuals and columns are for different variables.
With a microarray, we have measured thousands of variables on each sample,
but usually only with a relatively small number of samples.
In order to be able to reproduce results of a simulation experiment,
it is a good idea to set (or save) a seed value for the random number generator,
see ?set.seed.
You should set a seed here so that you will be able to reproduce your results,
you can choose the argument for
set.seed
to be your lucky number.
After you have set the seed, create a (simulated) data matrix of
gene expression values with 30,000 rows (genes) and 10 columns (arrays/chips)
with (independent) normal random variables.
Gene expression data are not really independent across different genes,
or normally distributed either, but we will just pretend here.
Useful functions here are
rnorm and
matrix.
If you want to be fancy, you could attach names to the arrays
(see ?colnames).
One extremely useful feature of R is its subsetting
(or indexing) capability, via the subset operator
[ ].
For example, to access the first column of
my.data, you would type
my.data[,1];
if instead you wanted to see rows 2, 4, and 5, you would type
my.data[c(2,4,5),].
Play around with subsetting your large matrix.
For each single array (column), you can visually assess how closely the data
appear to follow the normal distribution with a QQ plot.
You can set up multiple plots per page (4 here),
and change the default plotting character as follows:
par(mfrow=c(2,2),pch=".")
(You remembered to read the help for
par, right??)
If you have called your data matrix
my.data,
you can get a QQ plot for the first array (column) by typing:
qqnorm(my.data[,1])
qqline(my.data[,1])
If the data are approximately normally distributed,
then the points should mostly fall on the line.
Since we have simulated normally distributed data here,
deviations from normal are essentially random variation
(assuming that we trust the random number generator).
You can look at all 10 QQ plots.
Say that we are now interested in comparing the distributions of
expression values across the different chips.
Then we might look at side by side boxplots, for example.
Make the 10 side by side boxplots;
if you have called your data matrix
my.data, you can easily accomplish this by typing
par(mfrow=c(1,1))
# to get just 1 plot
boxplot(data.frame(my.data))
(Remember to read the help for the function
data.frame).
Now that you're happy with subsetting (!),
redo the boxplots, this time ordering them by their medians
(so that the array with the smallest median is first, largest is last).
Useful functions here are
apply and
order, along with subsetting.
It might take you a little while to figure out how to do this,
but spend some time trying before you click
here
to see how I did it.
You will not need to save any R objects that you created
(unless you wish to),
so feel free to 'clean up' after yourself with
rm().
To remove all objects in your workspace (permanently and
irreversibly, so be careful!), type
rm(list=ls()),
or simply answer
n
when asked if you wish to save
your workspace image.
This question appears on the screen when you quit R;
to quit, type
q().
Before quitting, try just typing
q
without any parentheses.
This might help you to remember that you need the parentheses!