In this TP, you will compare several multiple hypothesis testing p-value adjustments. Here, we use the ones implemented in limma ; you could also use multtest (the results are virtually identical, so we won't do this). As usual, you should always make sure you read the help documentation for each function you do not already know, and it might be a good idea to review the ones you think you do know!
We will again use the e. Coli data. You should read in the data, compute RMA, fit a linear model and carry out empirical Bayes shrinkage (see the practical from last week. You might just want to check with topTable that you get the same top 50 genes that you did last week.
The p-values for the mod t statistics are in fit$p.value (assuming that you have called your eBayes result fit). It is convenient to store those in a variable with a shorter name (pv say). Have a look at ?p.adjust, then compute (and save) each of the adjustments you can make with this function. For example, to get the Bonferroni adjustment, you could type:
pv.bonf <- p.adjust(pv,"bonferroni")
It will also be convenient to save the sort order of the raw p-values, for example:
pv.order <- order(pv)
Use cbind() to combine the vectors of raw and adjusted p-values into a matrix, then pairs() to make pairwise scatterplots of the p-value vectors. What do you notice about the adjustments?
Now you can make a plot with all the adjusted p-values together. Here is an example how you might get the raw and Bonferroni adjusted p-values; you can add the other adjustments with more calls to the function lines():
plot((1:length(pv)),pv[pv.order],pch=".",xlab="Sort order", ylab="p-value")
abline(c(0,1/length(pv))) # expected under uniform
abline(h=0.05,lty=3) # significance level of 0.05
lines((1:length(pv)),pv.bonf[pv.order],lty=2,col="blue")
Which genes have adjusted p-values less than 0.05? (You can use which() or topTable() to find this out).
One thing that I usually do when I am reporting on data I have analyzed is to present a 'volcano' plot (B vs. M or abs(mod-t) vs. M). For the simple design in this data set, the M (log2 fold change) values are in fit$coefficients[,2]. The B-stat is contained in the lods component. Make the plot (e.g. B vs. M), then use abline() to add horizontal and vertical lines corresponding to B values of 0, 2, 4 and M values of 1 and -1. Use the plotting character pch='.' (pch is a possible argument to plot ). It's also nice to highlight points with 'large' values of B and/or M, you can use the function points() to do this. After you have made your best effort, you can see how I did it here. You can also try the same thing using abs(mod-t) instead of B.
For practice (remember that exam??), you can send me a short report on identifying differentially expressed genes in the e. Coli data. You should include a brief purpose of the investigation (i.e. to look for genes differentially expressed between lrp+ and lrp-), the linear model you are fitting to identify DE genes (including the design matrix), the p-value adjustment you are using and a short explanation, a volcano plot and a list of genes that appear to be 'interesting for followup', along with a short justification for calling them 'interesting' (e.g. FDR < 0.05). State how many genes are identified as DE using your criterion. In addition, include on a separate page a table of DE genes (or just the top 50 if you have more than 50). This report should not be more than 3 pages.
For the real exam, you will also need to include an exploratory/quality analysis and explanation of the model, fitting procedure, and quality measures as well as an explanation of RMA. If you want, you could also include these in your report, in which case your entire report will be at most 6 pages.
As usual, your report can be in English or French. Please send your report as a pdf file, and follow the naming convention: lastname05.pdf (e.g. my report would be goldstein05.pdf). Please email your report to me (darlene.goldstein at epfl.ch) by next Tuesday (any time).