reports

by Darlene Goldstein -

Hello everyone,

I think I have sent out data assignments for final reports to everyone, but please contact me if you have not received yours.

I am rather worried that I have received a tp 7 (practice exam) report from only about half of the students who are registered for the exam. It will be very difficult to get a good mark in this course if you do not make use of the comments that I can give you before your exam !! I therefore urge you to send me a report for tp 7 if you have not done so already.

I have been able to send back commented versions to around half of the students who submitted reports. The other half should get theirs soon - each report takes me 1-2 hours to comment, and my comments are very comprehensive so as to give you the most help possible.

Many of the reports do not appear to have followed the instructions, evaluation criteria and additional tips and guidelines. Here are links to those to refresh your memory:

https://moodlearchive.epfl.ch/2022-2023/pluginfile.php/2887540/mod_resource/content/2/eval.html

https://moodlearchive.epfl.ch/2022-2023/pluginfile.php/2887541/mod_resource/content/1/examGuidelines.html

https://moodlearchive.epfl.ch/2022-2023/pluginfile.php/3195247/mod_resource/content/1/Reports-Additional%20tips.pdf

tl;dr - you need to put mathematical explanations in your report

Here is some preliminary feedback for everyone:

1. Most reports are lacking in mathematical / statistical rigor. You absolutely MUST include in your report a (statistical) discussion of M-estimation (loss function, weights, residuals) as well as the fitting procedure (IRLS). Please see the course slides and the Fox-Weisberg short appendix to the book An R Companion to Applied Regression (third edition; also attached).
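
To make the idea in item 1 concrete, here is a minimal sketch of IRLS for the simplest possible case: a Huber M-estimate of a single location parameter. It is written in Python purely for illustration and is not taken from the course materials. The function names, the tuning constant k = 1.345 and the MAD-based scale are assumptions chosen for the example, and your report still needs the mathematics for the model you actually fit.

```python
import statistics

def huber_weight(u, k=1.345):
    """Huber weight for a scaled residual u: 1 if |u| <= k, else k/|u|."""
    a = abs(u)
    return 1.0 if a <= k else k / a

def irls_location(y, k=1.345, tol=1e-8, max_iter=100):
    """Huber M-estimate of a location parameter, fitted by IRLS (illustrative)."""
    mu = statistics.median(y)            # robust starting value
    weights = [1.0] * len(y)
    for _ in range(max_iter):
        resid = [yi - mu for yi in y]    # residuals at the current estimate
        # Robust scale estimate: median absolute residual / 0.6745
        scale = statistics.median([abs(r) for r in resid]) / 0.6745
        if scale == 0:
            break
        # Weights depend on the scaled residuals
        weights = [huber_weight(r / scale, k) for r in resid]
        # Weighted least squares step: here simply a weighted mean
        new_mu = sum(w * yi for w, yi in zip(weights, y)) / sum(weights)
        converged = abs(new_mu - mu) < tol
        mu = new_mu
        if converged:
            break
    return mu, weights

y = [1.0, 1.2, 0.9, 1.1, 1.0, 10.0]      # made-up data, last value is an outlier
mu_hat, w = irls_location(y)
```

In this toy example the outlier receives a weight far below 1, so the estimate stays close to the bulk of the data instead of being pulled towards 10 as the ordinary mean is.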

2. It's NOT a robust model; it's the fitting method that is robust (i.e., it downweights outliers).

3. Because there are many more chips in your final data set than in tp 7, you don't need to include weight pseudo-images for each chip. You should include pseudo-images for any chip that you end up excluding, as well as a few 'typical' chips and any other 'interesting' chips. You probably don't need to include more than 6.

4. You must also make sure to (mathematically) define the NUSE. Your NUSE boxplots should also contain a solid horizontal line at 1 and a dashed horizontal line at 1.05.
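
As an illustration of the definition asked for in item 4: for probe set g on chip i, the NUSE is the standard error of the estimated expression value divided by the median of that probe set's standard errors across all chips. The sketch below (Python, for illustration only) computes this from a small made-up table of standard errors. The function names and the numbers are hypothetical, and 1.05 is used only as the reference level mentioned above.

```python
import statistics

def nuse(se):
    """se[g][i] = standard error for probe set g on chip i.
    Each row is divided by its median across chips."""
    out = []
    for row in se:
        med = statistics.median(row)
        out.append([s / med for s in row])
    return out

def chip_medians(nuse_values):
    """Median NUSE per chip (the centre line of each boxplot)."""
    n_chips = len(nuse_values[0])
    return [statistics.median([row[i] for row in nuse_values])
            for i in range(n_chips)]

# Made-up standard errors: 3 probe sets (rows) by 3 chips (columns)
se = [[0.10, 0.10, 0.20],
      [0.20, 0.22, 0.40],
      [0.05, 0.05, 0.09]]
medians = chip_medians(nuse(se))
above_reference = [m > 1.05 for m in medians]   # compare with the dashed line
```

In this made-up table the third chip has systematically larger standard errors than the others, so its median NUSE lies well above both reference lines.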

5. It is not necessary to include both the NUSE boxplots and RLE boxplots. If you decide anyway to include RLE, you should make sure to carefully define it mathematically.

6. You need to define the linear model that you are using to identify DE genes. Many of the reports just jump into 'here's the design matrix' - design matrix for what model? Be clear and specific, again, mathematically / statistically.

7. The easiest (and in fact most generalizable) way to identify DE genes is using the mod-t statistic (and a threshold for its adjusted p-value). You need to explain (mathematically / statistically) what this is - see Smyth 2004. For the hypothesis testing, you need to (1) clearly state the null and alternative hypotheses in terms of a parameter in your linear model (or in terms of a contrast if you choose to parametrize your model in such a way that it requires a contrast); (2) explain what the test statistic is and how it is computed (mathematically / statistically, not how you get it in R); (3) explain how the p-value is obtained, i.e., what the null distribution of the test statistic is (including df).
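
For orientation on item 7, here is a minimal sketch of how the moderated t-statistic of Smyth 2004 is formed for one gene once the prior parameters are known. It is in Python for illustration only. The argument names are hypothetical, and the prior variance s0_2 and prior degrees of freedom d0 are simply passed in here, whereas in practice they are estimated from all genes by empirical Bayes (a step this sketch does not show).

```python
import math

def moderated_t(beta_hat, s2, v, d, s0_2, d0):
    """Moderated t for one gene (illustrative sketch).

    beta_hat : estimated coefficient or contrast
    s2       : residual sample variance for this gene
    v        : unscaled variance factor, so that Var(beta_hat) = v * sigma^2
    d        : residual degrees of freedom
    s0_2, d0 : prior variance and prior degrees of freedom (taken as given)
    """
    # Posterior variance: weighted average of prior and sample variance
    s2_tilde = (d0 * s0_2 + d * s2) / (d0 + d)
    t_mod = beta_hat / math.sqrt(s2_tilde * v)
    df = d0 + d          # degrees of freedom of the null t distribution
    return t_mod, df

# Made-up numbers for a gene with an unusually small sample variance
t_mod, df = moderated_t(beta_hat=1.0, s2=0.01, v=0.5, d=4, s0_2=0.05, d0=4)
t_ordinary = 1.0 / math.sqrt(0.01 * 0.5)
```

With these made-up numbers the small sample variance is shrunk towards the prior value, so the moderated statistic is smaller in absolute value than the ordinary t, and it is referred to a t distribution with d0 + d degrees of freedom.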

8. Multiple hypothesis testing: make sure to explain what the multiple hypothesis problem is, and clearly explain why it's a problem. Choose an appropriate p-value adjustment method (for probably all of you it should be BH, see Benjamini and Hochberg 1995), make sure to explain how to get it (mathematically, not in R), and be clear about exactly what error rate is being controlled.
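
To show the kind of explanation item 8 asks for, here is a minimal sketch of the BH adjustment: sort the m p-values, multiply the p-value of rank i by m/i, then enforce monotonicity by taking a running minimum from the largest p-value downwards. It is written in Python for illustration only, and the function name and the example p-values are made up.

```python
def bh_adjust(pvalues):
    """Benjamini-Hochberg adjusted p-values, returned in the input order."""
    m = len(pvalues)
    # Indices of the p-values sorted from smallest to largest
    order = sorted(range(m), key=lambda i: pvalues[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value (rank m) down to the smallest (rank 1)
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, pvalues[i] * m / rank)
        adjusted[i] = running_min
    return adjusted

p = [0.01, 0.04, 0.03, 0.005]        # made-up raw p-values
p_adj = bh_adjust(p)
significant = [q <= 0.05 for q in p_adj]
```

Declaring a gene DE when its adjusted p-value is at most 0.05 corresponds to applying the BH procedure at level 0.05, which controls the false discovery rate under the conditions given in Benjamini and Hochberg 1995.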

9. Most of you already use mod-t / adjusted p but some of you then plot the B-stat on the vertical axis of your volcano plot. Your volcano plot should correspond to the method you use to identify DE genes. There is no p-value associated with the B-stat. Your volcano plot should instead have on the vertical axis either |mod-t| or (better in my opinion but either is ok) -log10(adjusted p).

10. In the cluster analysis there seems to be some confusion. You are supposed to cluster samples NOT genes. In addition, you should NOT use the DE genes for clustering. You should reduce the number of genes to ~ 20-50 or so, and you should choose these genes based on how variable their expression is across the samples. Why? Because let's say a particular gene's expression is constant across all the samples. Then how could it be useful for identifying a potential subgroup? It can't !! 

So .... filter your genes to a smaller number (many of the expression values will be in the noise range anyway), then look at some measure of variability across samples (for example, variance or coefficient of variation) and use the top 20-50 as measurements for clustering the samples (testis in tp 7, some other tissue in your final data set).
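
The selection step described above can be sketched as follows (Python, for illustration only). The function name and the tiny expression table are made up, variance is used as the measure of variability, and the preliminary filtering of noise-range values is not shown.

```python
import statistics

def top_variable_genes(expr, k):
    """expr maps gene name -> list of expression values across samples.
    Returns the k gene names with the largest variance across samples."""
    variances = {gene: statistics.variance(values)
                 for gene, values in expr.items()}
    return sorted(variances, key=variances.get, reverse=True)[:k]

# Made-up expression values for 3 genes measured on 4 samples
expr = {
    "g1": [5.0, 5.0, 5.0, 5.0],   # constant: useless for finding subgroups
    "g2": [4.0, 6.0, 4.0, 6.0],
    "g3": [1.0, 9.0, 1.0, 9.0],
}
selected = top_variable_genes(expr, k=2)
```

The constant gene is never selected, which is exactly the point made in item 10: a gene that does not vary across samples cannot help separate them.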

11. You only need to include a single clustering; you should not show a bunch of different combinations of types of clustering. Of course, you can look at as many possibilities as you like, and you can comment on other possibilities if you like, but only include 1 in your report. My general preference is to do agglomerative hierarchical clustering with 1-correlation distance / dissimilarity between samples and Ward's criterion for joining clusters. However, you should take a flexible approach - there could be other situations where a different approach would be preferable. Make sure that whatever you do, you explain in specific, mathematical / statistical terms exactly what you are doing. I should be able to read your explanation, then program it and get the same results as you got. If that's not possible based on your explanation, then your explanation is not sufficiently clear.

Also: you do NOT cut a tree at a certain height to determine number of clusters - this is not a valid method. If you remember from the clustering lecture, I mentioned a study that compared a number of methods, all based on optimizing some criterion. Cutting the tree based on height does not optimize any criterion and you should never do that (no matter what you might read in some scientific journal articles !!).
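
As an example of the level of precision item 11 asks for, the 1-correlation dissimilarity between two samples is 1 minus the Pearson correlation of their expression vectors (taken over the selected genes). Here is a minimal sketch in Python, for illustration only. The function names and data are made up, and the Ward agglomeration step and the choice of the number of clusters are not shown.

```python
import math

def pearson(x, y):
    """Pearson correlation between two equal-length numeric vectors."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)

def correlation_dissimilarity(samples):
    """samples[i] = expression vector of sample i over the selected genes.
    Returns the matrix d[i][j] = 1 - cor(sample i, sample j)."""
    n = len(samples)
    return [[0.0 if i == j else 1.0 - pearson(samples[i], samples[j])
             for j in range(n)]
            for i in range(n)]

# Made-up data: 3 samples measured on 4 genes
samples = [[1.0, 2.0, 3.0, 4.0],
           [2.0, 4.0, 6.0, 8.0],
           [4.0, 3.0, 2.0, 1.0]]
d = correlation_dissimilarity(samples)
```

Samples with perfectly correlated profiles are at dissimilarity 0 even if their overall levels differ, and perfectly anti-correlated profiles are at the maximum value of 2.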

12. If you make a heatmap, make sure to include a legend indicating what the colors mean, and also make sure to explain what specifically is being depicted (i.e., RMA values as measures of gene expression).

13. You should include specific, primary references (NOT wikipedia, NOT course lecture notes / slides), and you should cite them in the text where you refer to them (no 'general' references). Please do not use footnotes for any reason.

Lastly, for those of you who didn't take Greek (!!), it's dendrogram NOT 'dendogram': from δένδρον (déndron), meaning "tree", and γράμμα (grámma), meaning "drawing, mathematical figure".


Please don't hesitate to get in touch with me if you have any questions / problems / etc....... NO STRESS !! :)

Best regards,

Darlene