MATH-474: TP 7 comments

hello everyone,

I hope you are all having fun working on TP 7 !! A few additional comments / tips:

For the RMA-QC, the weight image plots should be SQUARE. You can set this parameter using

par(pty="s") # pty = Plot TYpe ; s = Square

You can also set other graphical parameters using par, see ?par in R. For example, you can set the size of the plots, amount of space between plots, etc.....

Please do not include the plots themselves as pdfs in the final report pdf - this results in an unnecessarily HUGE (>10 MB) file. Please save them as jpegs and include the jpegs into the final report.

RMA can be seen as the result of a 3-step process:

background correction
quantile normalization
parameter estimation (chip and probe effects - the chip effect for an individual for a particular probe set / gene is the RMA value.

The bg corr is described near the beginning of Ben Bolstad's phd dissertation, he also describes quantile normalization, as do the various papers including Rafael Irizarry and Terry Speed as authors. In the original RMA measure, parameter estimation was done by median polish - the reason for this is that it is fast. Due to improvements in both programming and computer speed, it is now fast to do the estimation by robust regression - M-estimation by Iteratively Re-weighted Least Squares (IRLS). By default in fitPLM, the loss function is the Huber loss function, but there are other possibilities for the loss function.

When you explain the quality assessment, you need to explain all of this. Please note that both the original RMA measure, estimated by median polish, and the RMA-QC measure of RMA, estimated by M-estimation, are using the SAME MODEL. And as for measure of expression that you will use in your statistical analyses (DE and clustering), either measure is ok for me. You will get different values and a different gene list though - but don't worry about this.

I have added the original paper on the B-stat, just under the limma (mod-t) paper. For deciding which genes are DE, you will use one of these statistics (or, equivalently to |mod-t| -log10 adj p). Your volcano plot should correspond to the statistical approach that you use. If you use |mod-t| / -log10 adj p (which I recommend), then your volcano plot should not be for the B-stat.

If you do use the B-stat, then you need to explain where it comes from - the hierarchical empirical Bayes model described in Lönnstedt+Speed. Note that the default implementation of its calculation assumes a prior value for the proportion of DE genes (I think this is called c). Changing this value will move the threshold for DE compared to other values of c. You can avoid having to explain all this by using mod-t, but you would also need to explain how to get mod-t. Choose your own poison!! :)

For the cluster analysis, I really like hierarchical clustering using Ward's criterion for distance between clusters (and 1-cor distance between objects, ie samples). The reason is that Ward's tends to produce well-defined groups. Single linkage tends to produce 'stringy' groups, complete linkage tends to produce clusters that are not well-separated. Average linkage also does ok, but in general when I look at both I tend to prefer the Ward's grouping. But from your part, I think it's good to examine several possibilities, including both hierarchical and divisive. You don't need to include everything you did in your report, but in 'real-world' data analyses you would typically examine many possibilities before deciding what should go into the final report.

I also suggest using just a small list of genes for the clustering, not all 22K. These should be chosen to be the MOST VARIABLE expressions across samples. Why is that? If the expression does not vary across samples, how can that gene be useful for dividing samples into groups based on that gene? So sort the genes according to some measure of variability (variance or SD for example) and use the top 20 or top 50 or something like that.

I will mark your report using the exact same criteria as for the final report, so you can think of this as a preliminary / draft report. Your final report will be based on a different dataset, but you will do the exact same tasks - identify DE genes and do a cluster analysis.

You should submit both the report as a pdf and your (reproducible) R code. Your R code can be submitted as either a plain (ASCII) text file (.txt), or your .Rmd or .Rnw or .Snw file - these three contain both text chunks and code chunks.

Please don't hesitate to ask if you have any questions or difficulties.

Best regards,

Darlene