---
title: "Essentials statistics for learning experiments: a case study about MOOCs"
author: "Patrick Jermann"
date: "11/6/2018"
output:
  html_document: default
  pdf_document: default
editor_options: 
  chunk_output_type: console
---

```{r setup, include=FALSE}
knitr::opts_chunk$set(echo = TRUE)
options(width=90)
knitr::opts_chunk$set(comment = "", warning = FALSE, message = FALSE, echo = TRUE, tidy = FALSE, size="small")
```

# The MOOC dataset
The data we use in this document to demonstrate the use of basic statistics is about students' performance in academic exams (e.g. the algebra course taken at the university) and their activitiy online in a related Massive Open Online Course (e.g. the algebra MOOC). The general question we try to answer is: what is the relation between the use of the MOOC by students and their academic achievement. Does the use of MOOCs help succeed in courses ? Do only good students use the MOOCs ? Are MOOCs more helpful for better students ?

As a general remark, we should note that because we did not conduct a controlled experiment with random assignment of subjects to experimental groups, we cannot establish any causal relationships between MOOC use and academic performance. Rather, we use statistical tools to "explore" data.

## Loading Data
Let's load data. The first step is to read a CSV file into a dataframe, which is the data structure used by R to store data.

```{r echo=FALSE, results='hide', message=FALSE}

# Uncomment and run once if these packages are not installed yet.
# install.packages("gdata")
# install.packages("gplots")

library(gdata) # reorder
library(gplots) # plotmeans
library(ggfortify) # regression diagnostics
library(plyr) # ddply

```

```{r echo=TRUE}
# TODO: Adapt the path to your own installation.
# Set the working directory
setwd("/Volumes/GoogleDrive/My Drive/CEDE/TEACHING/courses/CS411-2018/")

# Read a csv file (header = T signals that the first line of the file contains the names of the variables).
# The <- operator assigns the result of read.csv to the variable moocs.
# moocs is a dataframe, similar to an excel worksheet.
moocs <- read.csv(file="moocs.basic.csv", header = T)

```

```{r}
# head() prints the first rows of the dataframe. Useful to check if the import has worked. We see that there are many missing values signaled by "NA" == Not Available, namely for variables starting with MOOC_ ... this is because these students never logged in the MOOC.
head(moocs)
```

Our dataset contains variables which name starts with EPFL and other variables starting with MOOC. The variables with EPFL come from the academic database and describe student's performance in their EPFL exams. The variables starting with MOOC conern their online activity in the MOOC.

```{r}

# List the names of the variables
names(moocs)
```

# Descriptive Statistics

## Summary for variables
```{r}
# For each variable, print some information about the distribution (if the variable is quantitative) or about the distinct values taken by the variable (if the variable is categorical). 
summary(moocs)

# To get information about only one variable, use the $ sign to separate the name of the dataframe and the name of the variable like so: <name_dataframe>$<name_variable>
summary(moocs$EPFL_StudentSection)
summary(moocs$MOOC_PercentageVideosWatched)

mean(moocs$MOOC_PercentageVideosWatched, na.rm=T) # na.rm removes missing values
sd(moocs$MOOC_PercentageVideosWatched, na.rm=T) # standard deviation
median(moocs$MOOC_PercentageVideosWatched, na.rm=T) # median (50% observations below and 50% observations above)
max(moocs$MOOC_PercentageVideosWatched, na.rm=T)
min(moocs$MOOC_PercentageVideosWatched, na.rm=T)
```

## Computing new categorical variables
It is sometimes useful to transform a quantitative variable into a categorical variable. 

For example, below we create a variable that reflects whether students used the MOOC or not based on the number of videos they watched and the number of problems they attempted in the MOOC. In R, categorical variables are created with the function `factor`. The variable `didMOOC` is given the value `NO.MOOC` whenever the values for the number of videos watched and the number of problems attempted is missin. In other cases, the value is `MOOC`.   

```{r}


moocs$didMOOC = factor(
  ifelse(
    is.na(moocs$MOOC_NVideosWatched) & 
      is.na(moocs$MOOC_NProblemsSolved), 
    "NO.MOOC", 
    "MOOC")
)

summary(moocs$didMOOC)

# Now for people who did the MOOC
moocs$watched_videos = factor(
                          ifelse(
                            moocs$MOOC_PercentageVideosWatched == 0 |
                              is.na(moocs$MOOC_PercentageVideosWatched),
                            "NO.VIDEO",
                            "SOME.VIDEO")
                          )

summary(moocs$watched_videos)

# The table command reports the number of observations per category.
table(moocs$watched_videos)

moocs$solved_problems = factor(
                          ifelse(
                            moocs$MOOC_PercentageProblemsSolved == 0 |
                              is.na(moocs$MOOC_PercentageProblemsSolved),
                            "NO.PROBLEM",
                            "SOME.PROBLEM")
                          )

table(moocs$solved_problems)

# With two arguments, the table command cross-tabulates the number of observations per category.
table(moocs$watched_videos, moocs$solved_problems)

# Let's also replace the missing values with 0 for the percentage of videos watched and the percentage of problems solved
moocs$MOOC_PercentageVideosWatched <- ifelse(is.na(moocs$MOOC_PercentageVideosWatched), 0, moocs$MOOC_PercentageVideosWatched)
moocs$MOOC_PercentageProblemsSolved <- ifelse(is.na(moocs$MOOC_PercentageProblemsSolved), 0, moocs$MOOC_PercentageProblemsSolved)

```

> When using a MOOC, some students only watch videos and don't do exercices. But most of the students who do exercices also watch videos. Therefore we qualify MOOC useage with three categories: `NONE` corresponds to no use of MOOCs, `WATCH` corresponds to students mainly watching videos and `DO` corresponds to students doing some exercices.     

```{r}


# TODO:
# Define a categorical variable names mooc$MOOC that has 3 levels with:
# 1) NONE = "did not use the mooc at all"
# 2) WATCH = "used the mooc mostly to watch videos"
# 3) DO = "also did some assignments"

# Begin Solution:
moocs$MOOC <- factor(
  ifelse(moocs$watched_videos == "NO.VIDEO" & moocs$solved_problems == "NO.PROBLEM",
       "NONE",
       ifelse(moocs$solved_problems == "SOME.PROBLEM",
              "DO",
              "WATCH")
       )
)
# End Solution 

library(gdata)
# put the levels into a convenient order (for graphing, etc.)
moocs$MOOC <- reorder(moocs$MOOC, new.order=c("NONE", "WATCH", "DO"))
table(moocs$MOOC)

```

### Median split and quartile splits
Another popular way to define categorical variables is to use a median or quartiles of a distribution to define categories with equal number of observations. Below, we define `prior` based on whether the students' grade in highschool (`EPFL_BacGrade`) was above the median or below. Similarty `prior4` is defoned depending on the quartiles of the distribution of `EPFL_BacGrade`.

> We operationalise the previous ability of students based on their highschool performance. We perform a median split to define two categories of students: "High" performers obtained a grade above the median and "Low" performers below the median.

```{r}

hist(moocs$EPFL_BacGrade)
abline(v=median(moocs$EPFL_BacGrade, na.rm=T), col="red", lwd=3)

# TODO: Define two categorical variables named moocs$prior and moocs$prior4 that distinguishstudents given at their baccalaureat grade.

# 1) moocs$prior.
# The values of the variable are "HI"" for strong students and "LO"" for weakers students.
# => use median() to define high and low levels of ability

# 2) moocs$prior4 
# Levels are called XLO, LO, HI, XHI
# => use quantile() and cut() to definelevels 

# Begin solution
moocs$prior <- factor(
                  ifelse(moocs$EPFL_BacGrade > median(moocs$EPFL_BacGrade, na.rm=T),
                         "HI",
                         "LO"
                         )
)

# reorder the levels of prior so that low comes before hi in the graphs
moocs$prior <- reorder(moocs$prior, new.order=c("LO", "HI"))

# count the number of observations in LO and HI 
table(moocs$prior)

# define 
qq <- quantile(moocs$EPFL_BacGrade, na.rm=T)
moocs$prior4 <- cut(moocs$EPFL_BacGrade, breaks = qq, labels=c("XLO","LO","HI","XHI"))

table(moocs$prior4)
# end solution

```

## Basic Plotting
```{r}
## hist draws the historgram of a quantitative variable
##   main="title" sets the title.
##   xlab="title of the xaxis"
##   ylab="title of the yaxis"
##   breaks=n sets the number of bins used by the historgram

## par() sets graphical parameters. mfrow allows to plot several plots on one figure. It takes as argument, an array c(nrow,ncol) that contains the number of rows and columns to arrange the subplots.  
## png opens a png file of teh given size to output the results of the plotting.

png("histograms.png", width=600, height=600) # Opens the file
par(mfrow=c(2,2)) # splits the figure into 4 subplots.
hist(moocs$EPFL_CourseGrade, main ="Course Grades", xlab = "Grade")
hist(moocs$MOOC_PercentageVideosWatched, main="Number of videos watched", xlab="Videos", breaks=100)
hist(moocs$MOOC_Grade, main="MOOC Grade", xlab="Grade")
hist(moocs$MOOC_AverageMaxVideosWatched, main="Max Number of videos", xlab="Videos")
dev.off() # Closes the file

```

Boxplots can be obtained for one variable at the time with the boxplot command. R also creates boxplots when using the command plot with a categorical variable: plot(quantitative ~ categorical).

```{r}
par(mfrow=c(1,2)) 

# Boxplots
boxplot(moocs$MOOC_PercentageVideosWatched, horizontal=TRUE, varwidth=TRUE)

# PLotting a quantitative variable given a categorical variable.
plot(moocs$EPFL_CourseGrade ~ moocs$MOOC)

# Plotting two quantitative variables
plot(moocs$EPFL_CourseGrade ~ moocs$MOOC_PercentageVideosWatched, pch=21)

# Since all points are on top of each other, we use jitter() to add some random noise.
# pch is the plotting character used to draw points https://www.statmethods.net/advgraphs/parameters.html
# cex is a size adjusment
# color of the points is given by RGB and alpha
plot(jitter(moocs$EPFL_CourseGrade) ~ jitter(moocs$MOOC_PercentageVideosWatched), pch=19, cex=0.8, col=rgb(0,0,0,0.5))

```

## Plotting with GGPLOT
Let's now see how to plot these same graphs with ggplot2. See documentation here http://ggplot2.org/ and here http://www.cookbook-r.com/Graphs/ and https://www.rstudio.com/wp-content/uploads/2015/03/ggplot2-cheatsheet.pdf
```{r}
library(ggplot2)
ggplot(moocs, aes(x=MOOC_PercentageVideosWatched)) + geom_histogram(binwidth = 10)

ggplot(moocs, aes(x=MOOC, y=EPFL_CourseGrade)) + geom_boxplot(aes(fill = MOOC)) 

ggplot(moocs, aes(x=MOOC, y=EPFL_CourseGrade)) + geom_boxplot(aes(fill = MOOC)) + facet_grid(.~EPFL_Topic)

# In this plot we see that the "cloud of points" is strongly clustered to the left, since most of the people in the sample have watched zero videos.
ggplot(moocs, aes(x=MOOC_PercentageVideosWatched, y=EPFL_CourseGrade)) + geom_jitter(alpha = I(1/4)) 

# TODO Create a plot with ggplot that checks whether the course grade is linked with the grade obtained at the baccalaureat
# Find a way to add some jitter
# PLot this separately for eah EPFL_Topic
# Begin Solution
ggplot(moocs, aes(x=EPFL_BacGrade, y=EPFL_CourseGrade)) + 
  geom_point(size=3, alpha=0.1, position = position_jitter(width=0.1)) + 
  stat_smooth(method="lm") + 
  facet_grid(.~EPFL_Topic) 
# end solution
```

# Basic Statistics

## Correlation

cor.test() computes the pearson (or spearman) correlation coefficient and performs an associated t-test. The output of the command contains the Pearson correlation coefficiant (cor) as well as a t-test that is testing whether this correlation is statistically different from 0. When p << .05, the correlation is non-null. 

```{r}
# Correlation between 2 normally distributed quantitative variables
# H0: the correlation = 0 
# H1: the correlation != 0
cor.test(moocs$EPFL_CourseGrade, moocs$MOOC_PercentageVideosWatched)
```

> Students with good grades in highschool also have good grades at university as is shown by a positive correlation between the grades  (r[8593]=0.35, p < .001).  

```{r}
# TODO: find whether the CourseGrade is correlated with the BacGrade
cor.test(moocs$EPFL_CourseGrade, moocs$EPFL_BacGrade)
```

The same cor.test function can be used to computer Spearman rank correlation which is used whenever the distributions of the variables are not normal (which seems to be the case here for moocs$MOOC_PercentageVideosWatched). In such a case the Spearman rank correlation might be more appropriate (see http://geoinfo.amu.edu.pl/qg/archives/2011/QG302_087-093.pdf).

The difference bewteen a rank correlation and pearson correlation is well illustrated here: https://en.wikipedia.org/wiki/Spearman%27s_rank_correlation_coefficient

> Students with higher course grades also rank higher with regards to their video use (rs=0.24, p < .001).  

```{r}
cor.test(moocs$EPFL_CourseGrade, moocs$MOOC_PercentageVideosWatched, method="spearman")
```


## Normality Assumption
Testing normality visually with quantile plots. If the variable is normally distributed, the observations would follow the line drawn by qqline. We see from the plots below that the distribution of the variable "MOOC_PercentageVideosWatched" departs very strongly from a normal distribution.

```{r}
par(mfrow=c(2,2))
hist(moocs$MOOC_PercentageVideosWatched)
qqnorm(moocs$MOOC_PercentageVideosWatched)
qqline(moocs$MOOC_PercentageVideosWatched)

hist(moocs$EPFL_CourseGrade)
qqnorm(moocs$EPFL_CourseGrade)
qqline(moocs$EPFL_CourseGrade)
par(mfrow=c(1,1))


# TODO: Is Bac Grade normally distributed ?
# Plot a historgram and a qq plot
# Begin solution
par(mfrow=c(1,2))
hist(moocs$EPFL_BacGrade)
qqnorm(moocs$EPFL_BacGrade)
qqline(moocs$EPFL_BacGrade)
par(mfrow=c(1,1))
# End solution

# An alternative to qqline and qqnorm.
# install.packages("ggpubr")
library(ggpubr)
ggqqplot(moocs$EPFL_BacGrade)

```

Normality can also be tested statistically, however the normality tests are rather sensitive to small deviations.

For example, the Shapiro-Wilks test, is testing the null hypothesis that the variable is normally distributed. In this case the adequate p.value is .1. Hence, with p <<.1 we have to reject the null hypothesis and conclude that the variable is not normally distributed. This test work with samples smaller than 5000 observations.  
```{r}
# H0: variable is normally distributed
# H1: variable is not normally distributed

# shapiro.test(moocs$MOOC_PercentageVideosWatched)
# shapiro.test(moocs$EPFL_CourseGrade) # has too many observations
```

An alternative to shapiro is the Kolmogorov-Smirnov test which allows to test whether two samples were drawn from the same distribution. This allows to compare our observations with a sample that follows a normal distibution. The test for EPFL_BacGrade (D=0.04, p<.00) shows that the distribution of sample is unlikely to stem from a normal distribution.

> A Kolmogorov-Smirnov test was used to test for normality on the baccalaureat grade. The results of the test (D = 0.04, p < .001) show that the grade is not normally distributed.

```{r}
# Kolmogorov-Smirnov test that compares a variable x with a normal distribution that has the same mean and variance as x. The test relies on the maximum difference between the Cumulated Frequency Distributions of two samples.

# H0: the two samples comes from the same distribution.
# H1: the two samples come from a different distribution.

x <- moocs$EPFL_BacGrade
ks.test(x,"pnorm",mean(x,na.rm=T),sqrt(var(x,na.rm=T)))
```


```{r}
x <- moocs$MOOC_PercentageVideosWatched
ks.test(x,"pnorm",mean(x,na.rm=T),sqrt(var(x,na.rm=T)))
```

```{r}

# define afunction to plot the Kolmogorov Smirnov test     
# (adapted from https://rpubs.com/mharris/KSplot)

plot.normal.ks <-  function(sample1) {

  normal.sample <- rnorm(10000, mean(sample1,na.rm=T),sqrt(var(sample1,na.rm=T))) # a normal sample that has the same mean and sd

  cdf1 <- ecdf(sample1) 
  cdf2 <- ecdf(normal.sample)

  plot(cdf1, verticals=TRUE, do.points=FALSE, col="blue") 
  plot(cdf2, verticals=TRUE, do.points=FALSE, col="green", add=TRUE) 

  minMax <- seq(min(sample1, normal.sample, na.rm=T), max(sample1, normal.sample, na.rm=T), length.out=length(sample1)) 
  x0 <- minMax[which( abs(cdf1(minMax) - cdf2(minMax)) == max(abs(cdf1(minMax) - cdf2(minMax))) )] # identify where the maximum difference is.
  y0 <- cdf1(x0) 
  y1 <- cdf2(x0) 
  points(c(x0, x0), c(y0, y1), pch=16, col="red") 
  segments(x0, y0, x0, y1, col="red", lwd=5, lty="solid") 
}

par(mfrow=c(1,2)) # split the plotting area into 1 row and 2 columns
plot.normal.ks(moocs$MOOC_PercentageVideosWatched)
plot.normal.ks(moocs$EPFL_BacGrade)

par(mfrow=c(1,1)) # reset the plotting area

```

# ANOVA - Analysis of Variance

Comparing means across categories. We use the Analysis of Variance (ANOVA) to compare the mean of a variable across several categorical variables.

```{r}
# Comparing whether one variable has the same mean in different groups ?
# quantitative ~  categorical

# H0: mean_1 = mean_2 = ...= Mean_n
# H1: mean_1 != mean_2 != ... != mean_n
oneway.test(moocs$EPFL_CourseGrade ~ moocs$MOOC, var.equal = T) 
pairwise.t.test(moocs$EPFL_CourseGrade, moocs$MOOC, p.adj = "bonf") # Checks which group is different from what other group in doing 2 by 2 tests.

```

## Assumptions to be met before conducting anova

### 1) Data is normally distributed in each of the groups
```{r}
# by() is applying a function for each value of the categorical variable defined by "INDICES"
# here it takes every value of moocs$MOOC. See below how to do the graphs one by one.
par(mfrow=c(3,3))
by(moocs$EPFL_CourseGrade,
    INDICES = moocs$MOOC, function(x) {
    hist(x);
      qqnorm(x);
    qqline(x);
    plot.normal.ks(x)
}
)
par(mfrow=c(1,1))

# repeating the same graph 3 times, once for each possible value of moocs$MOOC. The syntax above with by() is more efficient. 
#
#par(mfrow=c(3,2))
#qqnorm(moocs$EPFL_CourseGrade[moocs$MOOC =="NONE"])
#qqline(moocs$EPFL_CourseGrade[moocs$MOOC =="NONE"])
#plot.normal.ks(moocs$EPFL_CourseGrade[moocs$MOOC =="NONE"])
#
#qqnorm(moocs$EPFL_CourseGrade[moocs$MOOC =="WATCH"])
#qqline(moocs$EPFL_CourseGrade[moocs$MOOC =="WATCH"])
#plot.normal.ks(moocs$EPFL_CourseGrade[moocs$MOOC =="WATCH"])
#
#qqnorm(moocs$EPFL_CourseGrade[moocs$MOOC =="DO"])
#qqline(moocs$EPFL_CourseGrade[moocs$MOOC =="DO"])
#plot.normal.ks(moocs$EPFL_CourseGrade[moocs$MOOC =="DO"])
#
#par(mfrow=c(1,1))

```
### 2) Variances are equal in all groups

```{r}
# H0: variances are the same
# H1: variances are different
# We have to reject the null hypothesis H0 for the bartlett test whih means that we the variances are different. This is appararent fom the boxplot as well, where we see that the variance for the DO group is smaller.
bartlett.test(moocs$EPFL_CourseGrade ~ moocs$MOOC)
tapply(moocs$EPFL_CourseGrade, moocs$MOOC, var, na.rm=T) #  tapply allows to apply the var() function to 3 subsamples defined by the 3 values of moocs$MOOC 
boxplot(moocs$EPFL_CourseGrade ~ moocs$MOOC, horizontal=T)

# Normality +- OK
# Variances different

# a) adapt the caluclations to take unequal variances into account 
oneway.test(moocs$EPFL_CourseGrade ~ moocs$MOOC, var.equal = F)

# b) => use non parametric alternative to anova. 
# Tests the null hypothesis that the location parameters are the same in each of the groups.
kruskal.test(moocs$EPFL_CourseGrade ~ moocs$MOOC)

######## 
# TODO
# make an anova / or another test to find out high ability students (prior or prior4) get better grades (EPFL_CourseGrade) 
# check assumptions

# Begin solution
# 1) Checking normality => looks ok
par(mfrow=c(2,3))
by(moocs$EPFL_CourseGrade,
    INDICES = moocs$prior, function(x) {
    hist(x);
    qqnorm(x);
    qqline(x);
    plot.normal.ks(x)
}
)
par(mfrow=c(1,1))

# testing homoscedasticity => the result shos that we cannot consider the variances equal between LO and HI prior. This seems not to be the case given the bartlett.test
bartlett.test(moocs$EPFL_CourseGrade ~ moocs$prior)
boxplot(moocs$EPFL_CourseGrade ~ moocs$prior, horizontal=T)

# non equal variances 
# option 1=> adapt oneway.test
oneway.test(moocs$EPFL_CourseGrade ~ moocs$prior, var.equal = F)
plotmeans(moocs$EPFL_CourseGrade ~ moocs$prior)

# option 2 => use kruskall.test instead
kruskal.test(moocs$EPFL_CourseGrade ~ moocs$prior)

# End Solution

```


## Two factor ANOVA

We can add more than one categorical predictor and also model the interaction of two factors (":" represents the interaction below). When there is an interaction, the effect of one variable is conditional on the effect of another variable. We use the lm function to express the linear model m underlying the Anova. Hence, y ~ A + B + A:B expresses a model where the dependent variable y depends on factor A, factor B and their interaction. It can also be written as y ~ A \* B. The \* sign means "single effects and interaction". 

```{r}
m <- lm(moocs$EPFL_CourseGrade ~ moocs$watched_videos + moocs$solved_problems + moocs$watched_videos:moocs$solved_problems) 
# a synonym for the model that contains single and interaction effects is A * B == A + B + A:B
# hence we can rewrite m as: m <- lm(moocs$EPFL_CourseGrade ~ moocs$watched_videos * moocs$solved_problems) 

# Following http://goanna.cs.rmit.edu.au/~fscholer/anova.php the Type III Sum of Square is more adapted  to interpret the contributions of the differnt variables. R and anova report Type I sums of square.
# Here is an explanation about how to obtain Type III sums of squares.
# https://www.r-bloggers.com/anova-%E2%80%93-type-iiiiii-ss-explained/
# And some explanations here: prometheus.scp.rochester.edu/zlab/sites/default/files/InteractionsAndTypesOfSS.pdf
# install.packages("car") 

# Doing Type III sums of squares. 
library(car)
### Default is: options(contrasts = c("contr.treatment", "contr.poly"))
options(contrasts = c("contr.sum", "contr.poly"))

# There are two simple effects, but no interaction. The effect of watching videos is indenpendent of the effect of solving problems. 
# I.e. the effect of watching videos is the same for people who solve problems and for people who do not solve problems. 
Anova(m, type=3)

# Checking assumptions with residuals plot
# 1) residuals need to be normally distributed for each subgroup
# 2) residuals' variance needs to be constant across different fitted values.
residualPlots(m)

```

## Plotting means and confidence intervals with plotmeans

```{r}
par(mfrow=c(1,1))
library(gplots)
plotmeans(moocs$EPFL_CourseGrade ~ moocs$watched_videos)
plotmeans(moocs$EPFL_CourseGrade ~ moocs$solved_problems)

# technique 1: plotting the interaction
plotmeans(moocs$EPFL_CourseGrade ~ interaction(moocs$watched_videos, moocs$solved_problems))

# technique 2: plotting the interaction with interaction.plot. There is no interaction, the lines are "parallel"
interaction.plot(moocs$watched_videos, moocs$solved_problems, moocs$EPFL_CourseGrade)

```

## Plotting means and confidence intervals with ggplot

```{r}

# Strategy 1: summarise the data first and plot the summary.
# 
# step 1: first we need to create a data.frame that contains only the summary values we wwant to plot 
require(dplyr)
df <- ddply(moocs,
      .(trace = watched_videos,
        xtrace = solved_problems), # specify which categorical variables are used to summarise data
      summarise,
      # now list the new variables thart we compute
      avg = mean(EPFL_CourseGrade), 
      ci = 1.96 * sd(EPFL_CourseGrade, na.rm=T)/sqrt(length(EPFL_CourseGrade))
      )

# Here is the dataframe that contains the poitns
df

# we will use this to shift the data points left and wirhg
pd <- position_dodge(0.1) 

# step 2: plot the summary
gp <- ggplot(df, aes(x=xtrace, y=avg, group=trace))
gp <- gp + geom_line(aes(linetype=trace), size=0.9, position=pd)
gp <- gp + geom_errorbar(aes(ymax=avg+ci, ymin=avg-ci), width=.2 , position=pd)
gp <- gp + geom_point(aes(shape=trace), size=3.5 , position=pd) 
gp <- gp +  theme_bw() 
gp 

```

## Exercice: add text label to the plot with the number of observations
```{r}
# TODO
# a) add the number of obsevations to df 
# (tip: length() counts the number of elements in a vector)
df <- ddply(moocs,
      .(trace = watched_videos,
        xtrace = solved_problems), # specify which categorical variables are used to summarise data
      summarise,
      # now list the new variables thart we compute
      avg = mean(EPFL_CourseGrade), 
      ci = 1.96 * sd(EPFL_CourseGrade, na.rm=T)/sqrt(length(EPFL_CourseGrade)),
      n = length(EPFL_CourseGrade)
      )

# b) add the number to the plot 
# (tip: geom_text() allows to add text)
gp <- ggplot(df, aes(x=xtrace, y=avg, group=trace))
gp <- gp + geom_line(aes(linetype=trace), size=0.9, position=pd)
gp <- gp + geom_errorbar(aes(ymax=avg+ci, ymin=avg-ci), width=.2 , position=pd)
gp <- gp + geom_point(aes(shape=trace), size=3.5 , position=pd) 
gp <- gp +  theme_bw() 
gp + geom_text(aes(x=xtrace, y=avg, label=paste("N=",n, sep=""), hjust=1.2, vjust=-1))
```


```{r}
## Strategy 2: use ggplot to compute the summary and confidence intervals

#install.packages("Hmisc") # needed for  mean_cl_boot in stat_summary
library(Hmisc)

pd = position_dodge(width=.1)

# Plotting direclty the course grades rather than summarizng as a separate data frame
gp <- ggplot(moocs, aes(x=watched_videos, y=EPFL_CourseGrade, group=solved_problems))
gp <- gp + geom_line(stat = "summary", fun.y = "mean", position=pd)
gp <- gp + geom_point(stat = "summary", fun.y = "mean", position=pd, size=3)
gp <- gp + stat_summary(fun.data = "mean_cl_boot", size = 1, position=pd, geom = "errorbar", width=0.2, 
                        aes(colour=solved_problems)) 
# gp <- gp + geom_jitter(aes(x=watched_videos, y=EPFL_CourseGrade, colour=solved_problems),  alpha=0.1)
gp <- gp + theme_bw()
gp
ggsave(filename = "grade-problems-videos.pdf",gp)

# Plotting with bars
pd = position_dodge(width=1)
gp <- ggplot(moocs, aes(x=watched_videos, y=EPFL_CourseGrade, group=solved_problems))
gp <- gp + geom_bar(aes(fill=solved_problems), stat = "summary", fun.y = "mean", position=pd)
gp <- gp + stat_summary(fun.data = "mean_cl_boot", colour = "black", size = 1, position=pd ) 
# gp + stat_summary(fun.y = mean, geom="line", position=pd)
gp
```

## Exercise: 

Run an ANOVA that answers the following question: 
* Does participating in MOOCs help gettng to better grades (choose one of didMOOC or MOOC which are two variables that define mooc usage) ?
* Do MOOCs help weak students as much as strong students (prior and prior4 are two categorical variables that define level) ?
* Is there an interaction between MOOC use and prior ability (e.g. do MOOCs help weaker students more than strong students ?) 

Check assumptions with residualsPlot.

```{r}

# TODO
# Let's use only observations for which we know the prior ability of the students
moocs.valid <- subset(moocs, ! is.na(moocs$prior))

# Begin Solution
m <- lm(EPFL_CourseGrade ~ prior + MOOC + prior:MOOC , data=moocs.valid)

Anova(m, data=moocs.valid, type=3)

# Check assuptions
residualPlots(m)
# End Solution

# TODO  Three way ANOVA
# Is the relation different for different levels of EPFL_topic (use facet_grid to plot the relation for different values of EPFL_topic)?


m3 <- lm(EPFL_CourseGrade ~ prior * MOOC * EPFL_Topic , data=moocs.valid)
Anova(m3,data=moocs.valid, type=3)

```

## Plotting interactions with GGPLOT

```{r}

# TODO: plot the graphs that represent the interactions between MOOC usage and prior ability  

interaction.plot(moocs.valid$MOOC, moocs.valid$prior, moocs.valid$EPFL_CourseGrade)

# Begin solution
# Plot means and confidence intervals
df <- ddply(moocs.valid,
      .(trace = prior,
        xtrace = MOOC), # specify which categorical variables are used to summarise data
      summarise, # transform is another version.
      # now list the new variables thart we compute
      avg = mean(EPFL_CourseGrade, na.rm=T), 
      ci = 1.96 * sd(EPFL_CourseGrade, na.rm=T)/sqrt(length(EPFL_CourseGrade))
      )

df

pd <- position_dodge(0.33) # move them .05 to the left and right

gp <- ggplot(df, aes(x=xtrace, y=avg, group=trace))
gp <- gp + geom_line(aes(linetype=trace), size=0.9, position=pd)
gp <- gp + geom_errorbar(aes(ymax=avg+ci, ymin=avg-ci), width=.5 , position=pd)
gp <- gp + geom_point(aes(shape=trace), size=3.5 , position=pd) 
gp <- gp +  theme_bw()
gp 
# End solution


# TODO: plot the graphs that represent the three way interactions between MOOC usage and prior ability for different topics  
# Begin solution
df <- ddply(moocs.valid,
      .(trace = prior,
        xtrace = MOOC,
        topic = EPFL_Topic), # specify which categorical variables are used to summarise data
      summarise, # transform is another version.
      # now list the new variables thart we compute
      avg = mean(EPFL_CourseGrade, na.rm=T), 
      ci = 1.96 * sd(EPFL_CourseGrade, na.rm=T)/sqrt(length(EPFL_CourseGrade))
      )

pd <- position_dodge(0.33) # move them .05 to the left and right

gp <- ggplot(df, aes(x=xtrace, y=avg, group=trace))
gp <- gp + geom_line(aes(linetype=trace), size=0.9, position=pd)
gp <- gp + geom_errorbar(aes(ymax=avg+ci, ymin=avg-ci), width=.5 , position=pd)
gp <- gp + geom_point(aes(shape=trace), size=3.5 , position=pd) 
gp <- gp +  theme_bw() + facet_grid(.~topic)
gp 
# End Solution

# Plotting direclty the course grades rather than summarizng
#install.packages("Hmisc")
library(Hmisc)
pd = position_dodge(width=1)
gp <- ggplot(moocs.valid, aes(x=MOOC, y=EPFL_CourseGrade, group=prior))
gp <- gp + geom_bar(aes(fill=prior), stat = "summary", fun.y = "mean", position=pd)
gp <- gp + stat_summary(fun.data = "mean_cl_boot", colour = "black", size = 1, position=pd ) 
gp

```

# Linear regression
```{r}
# Predicting one quantitative variable with a combination of other quantitative variables

# Assumptions for linear regression 
# * Linearity: The relationship between X and the mean of Y is linear.
# * Homoscedasticity: The variance of residual is the same for any value of X.
# * Independence: Observations are independent of each other.
# * Normality: For any fixed value of X, Y, and the residuals are normally distributed.


# lm estimates the parameters of y = ax + b so as to minimise the Error. 
# the null model consists of predicting the CourseGrade by its mean (~1):
m0 <- lm(EPFL_CourseGrade ~ 1, data=moocs)
summary(m0)

# Adding a predictor MOOC_PercentageVideosWatched to the null model.
m1 <- lm(EPFL_CourseGrade ~ MOOC_PercentageVideosWatched, data=moocs)

# The adjusted R^2 is interpreted as the proportion of variance explained.# 
# The F-statistic indicates whereh this model is better than the null model that would include only one parameter (the mean).
summary(m1)

moocs$pred <- fitted(m1)
moocs$res <- residuals(m1)
# let's plot the distribution of the residuals
par(mfrow=c(1,3))
hist(moocs$res)
plot.normal.ks(moocs$res)
qqnorm(moocs$res)
qqline(moocs$res)
par(mfrow=c(1,1))

# Plotting a model results in 4 plots that allow to diagnose the quality of the model
# http://data.library.virginia.edu/diagnostic-plots/

par(mfrow=c(2,2))
plot(m1)

# Plotting the regression line.
ggplot(moocs, aes(x = MOOC_PercentageVideosWatched, y = EPFL_CourseGrade)) +
  geom_jitter(alpha = .1) + # plot the points with jitter and transparency
  geom_point(aes(y = pred), col="red", fill="red", shape = 21, size=3)  + # Add the predicted values
  geom_smooth(method = "lm") + # Add the regression line
  xlab("Percentage of videos watched")+ # add an x-axis label
  ylab("Grade at the EPFL exam") # add an y-axis label


### Let's add a second predictor to our model
m2 <- lm(EPFL_CourseGrade ~ MOOC_PercentageVideosWatched + MOOC_PercentageProblemsSolved, data=moocs)

# The F statistic reported in the summary corresponds to the comparison of this model 
# with a null model that contains only an intercpt.
summary(m2)

# if we compute the null model by hand:
m0 <- lm(EPFL_CourseGrade ~ 1, data=moocs)
anova(m0,m2) # same as what is reported by summary(m2)


# We can compare the three models that we have built.
# The test makes sense if m1 is nested in m2 (we added one term at the time)
# Evaluating incrementally whether adding one predictor enhances the model.
anova(m0, m1, m2)
anova(m2)
```


## Assumptions for regression. Evaluate the quality of the model

Assumptions for linear regression 
* Linearity: The relationship between X and the mean of Y is linear.
* Homoscedasticity: The variance of residual is the same for any value of X.
* Independence: Observations are independent of each other.
* Normality: For any fixed value of X, Y is normally distributed.

```{r}

# Let's check how the model fits assumptions. There is a slight trend for residuals 
# to become negative as the predicted values augment. This is probably due to the large amount of 0 values 
# for the percentage of videos watched. 
par(mfrow=c(2,2))
plot(m2)

```

The anscombe Data set illustrates the importance to check assumptions and look at residuals.
```{r}
# TODO : load the anscombe data set 
# The dataset contains four pairs of x and y values (y1~x1; y2~x2; y3~x3 and y4~x4) 
# The mean and sd are identical across the 4 sets.
# and evaluate whether the assumptions to conduct a linear regression are met with the 4 sets of data in the dataframe

#load("anscombe.RData")

# x and y are correlated with r = 0.81
cor.test(anscombe$x1, anscombe$y1)
cor.test(anscombe$x2, anscombe$y2)
cor.test(anscombe$x3, anscombe$y3)
cor.test(anscombe$x4, anscombe$y4)

# the parameters for a linear model y~x are the same
m1 <- lm(anscombe$y1 ~ anscombe$x1)
m1

m2 <- lm(anscombe$y2 ~ anscombe$x2)
m2

m3 <- lm(anscombe$y3 ~ anscombe$x3)
m3

m4 <- lm(anscombe$y4 ~ anscombe$x4)
m4

# Plot the relation between y and x for the four datasets (y1~x1; y2~x2; y3~x3 and y4~x4) 
# What do you observe ?
par(mfrow=c(2,2))
plot(anscombe$y1~anscombe$x1, pch=19)
abline(m1)

plot(anscombe$y2~anscombe$x2, pch=19)
abline(m2)

plot(anscombe$y3~anscombe$x3, pch=19)
abline(m3)

plot(anscombe$y4~anscombe$x4, pch=19)
abline(m4)
par(mfrow=c(1,1))

### Checking assumptions
# TODO look at the diagnostic plots for m1, m2, m3, m4
# Do you see problems ?

# Begin solution
# linearity (there should not be huge residuals) 
# and homoscedasticity (there should not be a trend for residuals as we increase the fitted values)
par(mfrow=c(2,2))
plot(m1, which=1, main="Model 1") # 
plot(m2, which=1, main="Model 2") # Looks like there is a pattern
plot(m3, which=1, main="Model 3") # There is a trend downwards, the error becomes larger and more negative as we predict larger values. This means we surestimate more and more the predicted values.  
plot(m4, which=1, main="Model 4")
par(mfrow=c(1,1))

# Assumption: normality of residuals. 
par(mfrow=c(2,2))
plot(m1, which=2, main="Model 1") # 
plot(m2, which=2, main="Model 2") # Looks like there is a pattern
plot(m3, which=2, main="Model 3") # There is an outlier (observation 3)  
plot(m4, which=2, main="Model 4")
par(mfrow=c(1,1))

# Fitted agains standardised residuals. there shoudl be no trend in the data 
par(mfrow=c(2,2))
plot(m1, which=3, main="Model 1") # 
plot(m2, which=3, main="Model 2") # Looks like there is a pattern
plot(m3, which=3, main="Model 3") # There is an outlier (observation 3)  
plot(m4, which=3, main="Model 4")
par(mfrow=c(1,1))

# Cook's distance. Measures the influence of one particular observation on the parameters of the regression. Values close and above 1 have a too large influence and should be observed.  
par(mfrow=c(2,2))
plot(m1, which=4, main="Model 1") # 
plot(m2, which=4, main="Model 2") # Looks like observations 6 and 8 are quite large
plot(m3, which=4, main="Model 3") # There is an outlier (observation 3)  
plot(m4, which=4, main="Model 4")
par(mfrow=c(1,1))
# end solution


# if we want to remove the third obsevation in the data 
# anscombe = anscombe[-3,]

```


# Categorical variables

Do more people with high prior ability solve problems ? 
The chi-square test allows to test whether the observations in two categorical variables can be considered independent. 
```{r}

# table() and chisq.test()

tt <- table(moocs$prior, moocs$solved_problems)
tt

# The chi-square tests the independence of X and Y.
# H0: the observed and expected number of observations are the same

ct <- chisq.test(tt)
ct

ct$observed
ct$expected # expected values if the variables were independent


# Let's plot the observed and expected tallies
par(mfrow=c(1,2))
plot(as.table(ct$observed), main="Observed")
plot(as.table(ct$expected), main="Expected")

ct$residuals # pearson residuals larger than 1.96 are significantly different from 0 with p<.05
# the Pearson residuals, (observed - expected) / sqrt(expected).

# TODO: Do more people with high prior ability watch videos ?

tt <- table(moocs$prior, moocs$watched_videos)
ct <- chisq.test(tt)
ct

ct$residuals
```