---
title: "MOOCs"
author: "Patrick Jermann"
date: "11/14/2017"
output:
  html_document: default
  pdf_document: default
editor_options: 
  chunk_output_type: console
---

```{r setup, include=FALSE}
knitr::opts_chunk$set(echo = TRUE)
options(width=90)
knitr::opts_chunk$set(comment = "", warning = FALSE, message = FALSE, echo = TRUE, tidy = TRUE, size="small")
```

# Loading Data
Let's load data. The first step is to read a CSV file into a dataframe, which is the data structure used by R to store data.

```{r echo=FALSE, results='hide', message=FALSE}

# Uncomment and run if these packages are not installed yet.
# install.packages("gdata")
# install.packages("gplots")

library(gdata) # reorder
library(gplots) # plotmeans
library(ggfortify) # regression diagnostics
library(plyr) # ddply

```

```{r echo=TRUE}
# TODO: Adapt the path to your own installation
# Set the working directory
setwd("/Volumes/GoogleDrive/My Drive/CEDE/TEACHING/courses/CS411-2018/")

# Read a csv file (header = T signals that the first line of the file contains the names of the variables)
# the <- operator assigns the result of read.csv to the variable moocs 
# moocs is a dataframe, similar to an excel worksheet.
moocs <- read.csv(file="moocs.basic.csv", header = T)

```

```{r}
# prints the first rows of the dataframe. Useful to check if the import has worked. We see that there are many missing values signaled by "NA" == Not Available, namely for variables starting with MOOC_ ... this is because these students never logged in the MOOC.
head(moocs)
```

Our dataset contains variables starting with EPFL and variables starting with MOOC. The variables with EPFL come from the academic database and describe student's performance in their EPFL exams. The variables starting with MOOC conern their online activity in the MOOC.

```{r}

# List the names of the variables
names(moocs)
```

# Descriptive Statistics

## Summary for variables
```{r}
# For each variable, print some information about the distribution (if the variable is quantitative) or about the distinct values taken by the variable (if the variable is categorical). 
summary(moocs)

# To get information about only one variable, use the $ sign to separate the name of the dataframe and the name of the variable like so: <name_dataframe>$<name_variable>
summary(moocs$EPFL_StudentSection)
summary(moocs$MOOC_PercentageVideosWatched)

mean(moocs$MOOC_PercentageVideosWatched, na.rm=T) # na.rm removes missing values
sd(moocs$MOOC_PercentageVideosWatched, na.rm=T) # standard deviation
median(moocs$MOOC_PercentageVideosWatched, na.rm=T) # median (50% observations below and 50% observations above)
max(moocs$MOOC_PercentageVideosWatched, na.rm=T)
min(moocs$MOOC_PercentageVideosWatched, na.rm=T)
```

## Computing new categorical variables
```{r}

moocs$didMOOC = factor(
  ifelse(
    is.na(moocs$MOOC_NVideosWatched) & 
      is.na(moocs$MOOC_NProblemsSolved), 
    "NO.MOOC", 
    "MOOC")
)

summary(moocs$didMOOC)

# Now for people who did the MOOC
moocs$watched_videos = factor(
                          ifelse(
                            moocs$MOOC_PercentageVideosWatched == 0 |
                              is.na(moocs$MOOC_PercentageVideosWatched),
                            "NO.VIDEO",
                            "SOME.VIDEO")
                          )

summary(moocs$watched_videos)
# The table command reports the number of observations per category.
table(moocs$watched_videos)

moocs$solved_problems = factor(
                          ifelse(
                            moocs$MOOC_PercentageProblemsSolved == 0 |
                              is.na(moocs$MOOC_PercentageProblemsSolved),
                            "NO.PROBLEM",
                            "SOME.PROBLEM")
                          )

table(moocs$solved_problems)
# With two arguments, the table command cross-tabulates the number of observations per category.
table(moocs$watched_videos, moocs$solved_problems)

# Let's also replace the missing values with 0 for the percentage of videos watched and the percentage of problems solved
moocs$MOOC_PercentageVideosWatched <- ifelse(is.na(moocs$MOOC_PercentageVideosWatched), 0, moocs$MOOC_PercentageVideosWatched)
moocs$MOOC_PercentageProblemsSolved <- ifelse(is.na(moocs$MOOC_PercentageProblemsSolved), 0, moocs$MOOC_PercentageProblemsSolved)

# TODO:
# Define a categorical variable names mooc$MOOC that has 3 levels with:
# 1) NONE = "did not use the mooc at all"
# 2) WATCH = "used the mooc mostly to watch videos"
# 3) DO = "also did some assignments"

# Begin Solution

# End Solution 

library(gdata)
# put the levels into a convenient order (for graphing, etc.)
moocs$MOOC <- reorder(moocs$MOOC, new.order=c("NONE", "WATCH", "DO"))
table(moocs$MOOC)

```

### Median split and quartile splits
```{r}

hist(moocs$EPFL_BacGrade)
abline(v=median(moocs$EPFL_BacGrade, na.rm=T), col="red", lwd=3)

# TODO: Define two categorical variables named moocs$prior and moocs$prior4 that distinguishstudents given at their baccalaureat grade.

# 1) moocs$prior. 
# The values of the variable are "HI"" for strong students and "LO"" for weakers students.
# => use median() to define high and low levels of ability

# 2) moocs$prior4 
# Levels are called XLO, LO, HI, XHI
# => use quantile() and cut() to definelevels 

# Begin solution

# end solution

```

## Basic Plotting
```{r}
# hist draws the historgram of a quantitative variable
#   main="title" sets the title.
#   xlab="title of the xaxis"
#   ylab="title of the yaxis"
#   breaks=n sets the number of bins used by the historgram
# par() sets graphical parameters. mfrow allows to plot several plots on one figure. It takes as argument, an array c(nrow,ncol) that contains the number of rows and columns to arrange the subplots.  
# png opens a png file of teh given size to output the results of the plotting.

png("histograms.png", width=600, height=600) # Opens the file
par(mfrow=c(2,2)) # splits the figure into 4 subplots.
hist(moocs$EPFL_CourseGrade, main ="Course Grades", xlab = "Grade")
hist(moocs$MOOC_PercentageVideosWatched, main="Number of videos watched", xlab="Videos", breaks=100)
hist(moocs$MOOC_Grade, main="MOOC Grade", xlab="Grade")
hist(moocs$MOOC_AverageMaxVideosWatched, main="Max Number of videos", xlab="Videos")
dev.off() # Closes the file

```

Boxplots can be obtained for one variable at the time with the boxplot command. R also creates boxplots when using the command plot with a categorical variable: plot(quantitative ~ categorical).

```{r}
par(mfrow=c(1,2)) 

# Boxplots
boxplot(moocs$MOOC_PercentageVideosWatched, horizontal=TRUE, varwidth=TRUE)

# PLotting a quantitative variable given a categorical variable.
plot(moocs$EPFL_CourseGrade ~ moocs$MOOC)

# Plotting two quantitative variables
plot(moocs$EPFL_CourseGrade ~ moocs$MOOC_PercentageVideosWatched, pch=21)

# Since all points are on top of each other, we use jitter() to add some random noise.
# pch is the plotting character used to draw points https://www.statmethods.net/advgraphs/parameters.html
# cex is a size adjusment
# color of the points is given by RGB and alpha
plot(jitter(moocs$EPFL_CourseGrade) ~ jitter(moocs$MOOC_PercentageVideosWatched), pch=19, cex=0.8, col=rgb(0,0,0,0.5))

```

## Plotting with GGPLOT
Let's now see how to plot these same graphs with ggplot2. See documentation here http://ggplot2.org/ and here http://www.cookbook-r.com/Graphs/ and https://www.rstudio.com/wp-content/uploads/2015/03/ggplot2-cheatsheet.pdf
```{r}
library(ggplot2)
ggplot(moocs, aes(x=MOOC_PercentageVideosWatched)) + geom_histogram(binwidth = 10)

ggplot(moocs, aes(x=MOOC, y=EPFL_CourseGrade)) + geom_boxplot(aes(fill = MOOC)) 

ggplot(moocs, aes(x=MOOC, y=EPFL_CourseGrade)) + geom_boxplot(aes(fill = MOOC)) + facet_grid(.~EPFL_Topic)

# In this plot we see that the "cloud of points" is strongly clustered to the left, since most of the people in the sample have watched zero videos.
ggplot(moocs, aes(x=MOOC_PercentageVideosWatched, y=EPFL_CourseGrade)) + geom_jitter(alpha = I(1/4)) 

# TODO Create a plot with ggplot that checks whether the course grade is linked with the grade obtained at the baccalaureat
# Find a way to add some jitter
# PLot this separately for eah EPFL_Topic
# Begin Solution

# end solution
```

# Basic Statistics

## Correlation
```{r}
# Correlation between 2 normally distributed quantitative variables
cor.test(moocs$EPFL_CourseGrade, moocs$MOOC_PercentageVideosWatched)

# TODO: find whether the CourseGrade is correlated with the BacGrade

```

The output contains the Pearson correlation coefficiant (cor) as well as a t-test that is testing whether this correlation is statistically different from 0. In our case, p << .05 which indicates that the correlation is positive. 

The same cor.test function can be used to computer Spearman rank correlation which is used whenever the distributions of the variables are not normal (which seems to be the case here for moocs$MOOC_PercentageVideosWatched). In such a case the Spearman rank correlation might be more appropriate (see http://geoinfo.amu.edu.pl/qg/archives/2011/QG302_087-093.pdf).

```{r}
cor.test(moocs$EPFL_CourseGrade, moocs$MOOC_PercentageVideosWatched, method="spearman")
```
The difference bewteen a rank correlation and pearson correlation is well illustrated here: https://en.wikipedia.org/wiki/Spearman%27s_rank_correlation_coefficient

## Normality Assumption
Testing normality visually with quantile plots. If the variable is normally distributed, the observations would follow the line drawn by qqline. We see from the plots below that the distribution of the variable "MOOC_PercentageVideosWatched" departs very strongly from a normal distribution.

```{r}
par(mfrow=c(2,2))
hist(moocs$MOOC_PercentageVideosWatched)
qqnorm(moocs$MOOC_PercentageVideosWatched)
qqline(moocs$MOOC_PercentageVideosWatched)

hist(moocs$EPFL_CourseGrade)
qqnorm(moocs$EPFL_CourseGrade)
qqline(moocs$EPFL_CourseGrade)
par(mfrow=c(1,1))


# TODO: Is Bac Grade normally distributed ?
# Plot a historgram and a qq plot
# Begin solution

# End solution

# An alternative to qqline and qqnorm.
# install.packages("ggpubr")
library(ggpubr)
ggqqplot(moocs$EPFL_BacGrade)

```

Normality can also be tested statistically, however the normality tests are rather sensitive to small deviations.

For example, the Shapiro-Wilks test, is testing the null hypothesis that the variable is normally distributed. In this case the adequate p.value is .1. Hence, with p <<.1 we have to reject the null hypothesis and conclude that the variable is not normally distributed. This test work with samples smaller than 5000 observations.  
```{r}
# shapiro.test(moocs$MOOC_PercentageVideosWatched)
# shapiro.test(moocs$EPFL_CourseGrade) # has too many observations
```

An alternative to shapiro is the Kolmogorov-Smirnov test which allows to test whether two samples were drawn from the same distribution. This allows to compare our observations with a sample that follows a normal distibution. The test for EPFL_BacGrade (D=0.04, p<.00) shows that the distribution of sample is unlikely to stem from a normal distribution.

```{r}
# Kolmogorov-Smirnov test that compares a variable x with a normal distribution that has the same mean and variance as x. The test relies on the maximum difference between the Cumulated Frequency Distributions of two samples.
x <- moocs$EPFL_BacGrade
ks.test(x,"pnorm",mean(x,na.rm=T),sqrt(var(x,na.rm=T)))

# define afunction to plot the Kolmogorov Smirnov test     
# (adapted from https://rpubs.com/mharris/KSplot)

plot.normal.ks <-  function(sample1) {

  normal.sample <- rnorm(10000, mean(sample1,na.rm=T),sqrt(var(sample1,na.rm=T))) # a normal sample that has the same mean and sd

  cdf1 <- ecdf(sample1) 
  cdf2 <- ecdf(normal.sample)

  plot(cdf1, verticals=TRUE, do.points=FALSE, col="blue") 
  plot(cdf2, verticals=TRUE, do.points=FALSE, col="green", add=TRUE) 

  minMax <- seq(min(sample1, normal.sample, na.rm=T), max(sample1, normal.sample, na.rm=T), length.out=length(sample1)) 
  x0 <- minMax[which( abs(cdf1(minMax) - cdf2(minMax)) == max(abs(cdf1(minMax) - cdf2(minMax))) )] # identify where the maximum difference is.
  y0 <- cdf1(x0) 
  y1 <- cdf2(x0) 
  points(c(x0, x0), c(y0, y1), pch=16, col="red") 
  segments(x0, y0, x0, y1, col="red", lwd=5, lty="solid") 
}

par(mfrow=c(1,2)) # split the plotting area into 1 row and 2 columns
plot.normal.ks(moocs$MOOC_PercentageVideosWatched)
plot.normal.ks(moocs$EPFL_BacGrade)

par(mfrow=c(1,1)) # reset the plotting area

```

# ANOVA - Analysis of Variance

Comparing means across categories. We use the Analysis of Variance (ANOVA) to compare the mean of a variable across several categorical variables.

```{r}
# Comparing whether one variable has the same mean in different groups ?
# quantitative ~  categorical

# H0: mean_1 = mean_2 = ...= Mean_n
# H1: mean_1 != mean_2 != ... != mean_n
oneway.test(moocs$EPFL_CourseGrade ~ moocs$MOOC, var.equal = T) 
pairwise.t.test(moocs$EPFL_CourseGrade, moocs$MOOC) # Checks which group is different from what other group in doing 2 by 2 tests.
summary(aov(moocs$EPFL_CourseGrade ~ moocs$MOOC))

```

## Assumptions to be met before conducting anova

### 1) Data is normally distributed in each of the groups
```{r}
# by() is applying a function for each value of the categorical variable defined by "INDICES"
# here it takes every value of moocs$MOOC. See below how to do the graphs one by one.
par(mfrow=c(3,3))
by(moocs$EPFL_CourseGrade,
    INDICES = moocs$MOOC, function(x) {
    hist(x);
      qqnorm(x);
    qqline(x);
    plot.normal.ks(x)
}
)
par(mfrow=c(1,1))

# repeating the same graph 3 times, once for each possible value of moocs$MOOC. The syntax above with by() is more efficient. 
#
#par(mfrow=c(3,2))
#qqnorm(moocs$EPFL_CourseGrade[moocs$MOOC =="NONE"])
#qqline(moocs$EPFL_CourseGrade[moocs$MOOC =="NONE"])
#plot.normal.ks(moocs$EPFL_CourseGrade[moocs$MOOC =="NONE"])
#
#qqnorm(moocs$EPFL_CourseGrade[moocs$MOOC =="WATCH"])
#qqline(moocs$EPFL_CourseGrade[moocs$MOOC =="WATCH"])
#plot.normal.ks(moocs$EPFL_CourseGrade[moocs$MOOC =="WATCH"])
#
#qqnorm(moocs$EPFL_CourseGrade[moocs$MOOC =="DO"])
#qqline(moocs$EPFL_CourseGrade[moocs$MOOC =="DO"])
#plot.normal.ks(moocs$EPFL_CourseGrade[moocs$MOOC =="DO"])
#
#par(mfrow=c(1,1))

```
### 2) Variances are equal in both groups

```{r}
# H0: variances are the same
# H1: variances are different
# We have to reject the null hypothesis H0 for the bartlett test whih means that we the variances are different. This is appararent fom the boxplot as well, where we see that the variance for the DO group is smaller.
bartlett.test(moocs$EPFL_CourseGrade ~ moocs$MOOC)
tapply(moocs$EPFL_CourseGrade, moocs$MOOC, var, na.rm=T) #  tapply allows to apply the var() function to 3 subsamples defined by the 3 values of moocs$MOOC 
boxplot(moocs$EPFL_CourseGrade ~ moocs$MOOC, horizontal=T)

# Normality +- OK
# Variances different

# a) adapt the caluclations to take unequal variances into account 
oneway.test(moocs$EPFL_CourseGrade ~ moocs$MOOC, var.equal = F)

# b) => use non parametric alternative to anova. 
# Tests the null hypothesis that the location parameters are the same in each of the groups.
kruskal.test(moocs$EPFL_CourseGrade ~ moocs$MOOC)

######## 
# TODO
# make an anova / or another test to find out high ability students (prior or prior4) get better grades (EPFL_CourseGrade) 
# check assumptions

# Begin solution
# 

# End Solution

```

# Linear regression
```{r}
# Predicting one quantitative variable with a combination of other quantitative variables

# Assumptions for linear regression 
# * Linearity: The relationship between X and the mean of Y is linear.
# * Homoscedasticity: The variance of residual is the same for any value of X.
# * Independence: Observations are independent of each other.
# * Normality: For any fixed value of X, Y, and the residuals are normally distributed.

# lm estimates the parameters of y = ax + b so as to minimise the Error. 
m1 <- lm(EPFL_CourseGrade ~ MOOC_PercentageVideosWatched, data=moocs)

# The adjusted R^2 is interpreted as the proportion of variance explained.# 
# The F-statistic indicates whereh this model is better than the null model that would include only one parameter (the mean).
summary(m1)

moocs$pred <- fitted(m1)

moocs$res <- residuals(m1)
# let's plot the distribution of the residuals
par(mfrow=c(1,3))
hist(moocs$res)
plot.normal.ks(moocs$res)
qqnorm(moocs$res)
qqline(moocs$res)
par(mfrow=c(1,1))

# Plotting a model results in 4 plots that allow to diagnose the quality of the model
# http://data.library.virginia.edu/diagnostic-plots/

par(mfrow=c(2,2))
plot(m1)

# Plotting the regression line.
ggplot(moocs, aes(x = MOOC_PercentageVideosWatched, y = EPFL_CourseGrade)) +
  geom_jitter(alpha = .1) + # plot the points with jitter and transparency
  geom_point(aes(y = pred), col="red", fill="red", shape = 21, size=3)  + # Add the predicted values
  geom_smooth(method = "lm") + # Add the regression line
  xlab("Percentage of videos watched")+ # add an x-axis label
  ylab("Grade at the EPFL exam") # add an y-axis label


### Let's add one predictor to our model
m2 <- lm(EPFL_CourseGrade ~ MOOC_PercentageVideosWatched + MOOC_PercentageProblemsSolved, data=moocs)

# The F statistic reported in the summary corresponds to the comparison of this model 
# with a null model that contains only an intercpt.
summary(m2)

# if we compute the null model by hand:
m0 <- lm(EPFL_CourseGrade ~ 1, data=moocs)
anova(m0,m2) # same as what is reported by summary(m2)

# Let's check how the model fits assumptions. There is a slight trend for residuals 
# to become negative as the predicted values augment. This is probably due to the large amount of 0 values 
# for the percentage of videos watched. 
par(mfrow=c(2,2))
plot(m2)

# We can compare the two models that we have built.
# The test makes sense if m1 is nested in m2 (we added one term at the time)
anova(m1,m2)

# The same result is given by anova() if we provide only one model. 
# R computes the contribution of each term to a better fit of the model.
anova(m2) # same as anova(m0,m1,m2)

```
## Assumptions for regression. Evaluate the quality of the model

```{r}

# Assumptions for linear regression 
# * Linearity: The relationship between X and the mean of Y is linear.
# * Homoscedasticity: The variance of residual is the same for any value of X.
# * Independence: Observations are independent of each other.
# * Normality: For any fixed value of X, Y is normally distributed.

# TODO : load the anscombe data set 
# The dataset contains four pairs of x and y values (y1~x1; y2~x2; y3~x3 and y4~x4) 
# The mean and sd are identical across the 4 sets.
# 
# and evaluate whether the assumptions to conduct a linear regression are met with the 4 sets of data in the dataframe

#load("anscombe.RData")

# x and y are correlated with r = 0.81
cor.test(anscombe$x1, anscombe$y1)
cor.test(anscombe$x2, anscombe$y2)
cor.test(anscombe$x3, anscombe$y3)
cor.test(anscombe$x4, anscombe$y4)

# the parameters for a linear model y~x are the same
m1 <- lm(anscombe$y1 ~ anscombe$x1)
m1

m2 <- lm(anscombe$y2 ~ anscombe$x2)
m2

m3 <- lm(anscombe$y3 ~ anscombe$x3)
m3

m4 <- lm(anscombe$y4 ~ anscombe$x4)
m4

# Plot the relation between y and x for the four datasets (y1~x1; y2~x2; y3~x3 and y4~x4) 
# What do you observe ?
par(mfrow=c(2,2))
plot(anscombe$y1~anscombe$x1, pch=19)
abline(m1)

plot(anscombe$y2~anscombe$x2, pch=19)
abline(m2)

plot(anscombe$y3~anscombe$x3, pch=19)
abline(m3)

plot(anscombe$y4~anscombe$x4, pch=19)
abline(m4)
par(mfrow=c(1,1))

### Checking assumptions
# TODO look at the diagnostic plots for m1, m2, m3, m4
# Do you see problems ?

# Begin solution

# end solution


# if we want to remove the third obsevation in the data 
# anscombe = anscombe[-3,]

```
## ANOVA revisited

ANOVA as done with oneway.test can be implemented by fitting several linear models with a categorical predictor and comparing models with each other. 
Because the anova function in R uses SS type 1 by default we have to configure it so that it uses type III sums of squares.

```{r}
#install.packages("car") 
library(car)
options(contrasts = c("contr.sum", "contr.poly"))


m0 <- lm(moocs$EPFL_CourseGrade ~ 1)
m1 <- lm(moocs$EPFL_CourseGrade ~ moocs$MOOC)
Anova(m1,type = 3) 

# When we have onl one variable this is equivalent to:
# same as anova(m0,m1) 
# same as oneway.test(moocs$EPFL_CourseGrade ~ moocs$MOOC, equal.var=T)
# same as summary(aov(moocs$EPFL_CourseGrade ~ moocs$MOOC))

```

## Two factor ANOVA

We can add more than one categorical predictor and also model the interaction of two factors (":" represents the interaction below). When there is an interaction, the effect of one variable is conditional on the effect of another variable  

```{r}
m0 <- lm(moocs$EPFL_CourseGrade ~ 1)
m1 <- lm(moocs$EPFL_CourseGrade ~ moocs$watched_videos)
m2 <- lm(moocs$EPFL_CourseGrade ~ moocs$watched_videos + moocs$solved_problems)
m3 <- lm(moocs$EPFL_CourseGrade ~ moocs$watched_videos + moocs$solved_problems + moocs$watched_videos:moocs$solved_problems) 
# a synonym for the model that contains single and interaction effects is A * B == A + B + A:B
# hence we can rewrite m3 as: m3 <- lm(moocs$EPFL_CourseGrade ~ moocs$watched_videos * moocs$solved_problems) 

# Following http://goanna.cs.rmit.edu.au/~fscholer/anova.php the Type III Sum of Square is more adapted  to interpret the contributions of the differnt variables. R and anova report Type I sums of square.
# Here is an explanation about how to obtain Type III sums of squares.
# https://www.r-bloggers.com/anova-%E2%80%93-type-iiiiii-ss-explained/
# And some explanations here: prometheus.scp.rochester.edu/zlab/sites/default/files/InteractionsAndTypesOfSS.pdf
# install.packages("car") 

# There are two simple effects, but no interaction. The effect of watching videos is indenpendent of the effect of solving problems. 
# I.e. the effect of watching videos is the same for people who solve problems and for people who do not solve problems. 
Anova(m3, type=3)

# Type1 SSQ
#anova(m0,m1,m2,m3)
# same output as the aov wrapper
#summary(aov(moocs$EPFL_CourseGrade ~ moocs$watched_videos + moocs$solved_problems + moocs$watched_videos:moocs$solved_problems))

```

## Plotting means and confidence intervals with plotmeans

```{r}
par(mfrow=c(1,1))
library(gplots)
plotmeans(moocs$EPFL_CourseGrade ~ moocs$watched_videos)
plotmeans(moocs$EPFL_CourseGrade ~ moocs$solved_problems)

# technique 1: plotting the interaction
plotmeans(moocs$EPFL_CourseGrade ~ interaction(moocs$watched_videos, moocs$solved_problems))

# technique 2: plotting the interaction with interaction.plot. There is no interaction, the lines are "parallel"
interaction.plot(moocs$watched_videos, moocs$solved_problems, moocs$EPFL_CourseGrade)

```

## Plotting means and confidence intervals with ggplot

```{r}

# Strategy 1: summarise the data first and plot the summary.
# 
# step 1: first we need to create a data.frame that contains only the summary values we wwant to plot 
require(dplyr)
df <- ddply(moocs,
      .(trace = watched_videos,
        xtrace = solved_problems), # specify which categorical variables are used to summarise data
      summarise,
      # now list the new variables thart we compute
      avg = mean(EPFL_CourseGrade), 
      ci = 1.96 * sd(EPFL_CourseGrade, na.rm=T)/sqrt(length(EPFL_CourseGrade))
      )

# Here is the dataframe that contains the poitns
df

# we will use this to shift the data points left and wirhg
pd <- position_dodge(0.1) 

# step 2: plot the summary
gp <- ggplot(df, aes(x=xtrace, y=avg, group=trace))
gp <- gp + geom_line(aes(linetype=trace), size=0.9, position=pd)
gp <- gp + geom_errorbar(aes(ymax=avg+ci, ymin=avg-ci), width=.2 , position=pd)
gp <- gp + geom_point(aes(shape=trace), size=3.5 , position=pd) 
gp <- gp +  theme_bw() 
gp 

```

## Exercice: add text label to the plot with the number of observations
```{r}
# TODO

# a) add the number of obsevations to df 
# (tip: length() counts the number of elements in a vector)
df <- ddply(moocs,
      .(trace = watched_videos,
        xtrace = solved_problems), # specify which categorical variables are used to summarise data
      summarise,
      # now list the new variables thart we compute
      avg = mean(EPFL_CourseGrade), 
      ci = 1.96 * sd(EPFL_CourseGrade, na.rm=T)/sqrt(length(EPFL_CourseGrade)),
      n = length(EPFL_CourseGrade)
      )

# b) add the number to the plot 
# (tip: geom_text() allows to add text)
gp <- ggplot(df, aes(x=xtrace, y=avg, group=trace))
gp <- gp + geom_line(aes(linetype=trace), size=0.9, position=pd)
gp <- gp + geom_errorbar(aes(ymax=avg+ci, ymin=avg-ci), width=.2 , position=pd)
gp <- gp + geom_point(aes(shape=trace), size=3.5 , position=pd) 
gp <- gp +  theme_bw() 
gp + geom_text(aes(x=xtrace, y=avg, label=paste("N=",n, sep=""), hjust=1.2, vjust=-1))
```


```{r}
## Strategy 2: use ggplot to compute the summary and confidence intervals

#install.packages("Hmisc") # needed for  mean_cl_boot in stat_summary
library(Hmisc)

pd = position_dodge(width=.1)

# Plotting direclty the course grades rather than summarizng as a separate data frame
gp <- ggplot(moocs, aes(x=watched_videos, y=EPFL_CourseGrade, group=solved_problems))
gp <- gp + geom_line(stat = "summary", fun.y = "mean", position=pd)
gp <- gp + geom_point(stat = "summary", fun.y = "mean", position=pd, size=3)
gp <- gp + stat_summary(fun.data = "mean_cl_boot", size = 1, position=pd, geom = "errorbar", width=0.2, 
                        aes(colour=solved_problems)) 
# gp <- gp + geom_jitter(aes(x=watched_videos, y=EPFL_CourseGrade, colour=solved_problems),  alpha=0.1)
gp <- gp + theme_bw()
gp
ggsave(filename = "grade-problems-videos.pdf",gp)

# Plotting with bars
pd = position_dodge(width=1)
gp <- ggplot(moocs, aes(x=watched_videos, y=EPFL_CourseGrade, group=solved_problems))
gp <- gp + geom_bar(aes(fill=solved_problems), stat = "summary", fun.y = "mean", position=pd)
gp <- gp + stat_summary(fun.data = "mean_cl_boot", colour = "black", size = 1, position=pd ) 
# gp + stat_summary(fun.y = mean, geom="line", position=pd)
gp
```

## Exercise: 

```{r}
# TODO
# Build a series of linear models / anova that answers the following question: 
# Does participating in MOOCs help gettng to better grades (didMOOC and MOOC are two variables that define mooc usage) ?
# Do MOOCs help weak students as much as strong students (prior and prior4 are two categorical variables that define level) ?
# Is the relation different for different levels of EPFL_topic (use facet_grid to plot the relation for different values of EPFL_topic)?
# Run the analysis with the level of students as a categorical and a quantitative variable 

# Begin Solution

# End Solution


```

## Plotting interactions with GGPLOT

```{r}

# TODO: plot the graphs that represent the interactions between MOOC usage and prior ability  

# Begin solution

# End solution

```

# Categorical variables

Do more people with high prior ability solve problems ? 
The chi-square test allows to test whether the observations in two categorical variables can be considered independent. 
```{r}

# table() and chisq.test()

tt <- table(moocs$prior, moocs$solved_problems)
tt

# The chi-square tests the independence of X and Y.
# H0: the observed and expected number of observations are the same

ct <- chisq.test(tt)
ct

ct$observed
ct$expected # expected values if the variables were independent

# Let's plot the observed and expected tallies
par(mfrow=c(1,2))
plot(as.table(ct$observed), main="Observed")
plot(as.table(ct$expected), main="Expected")

ct$residuals # pearson residuals larger than 1.96 are significantly different from 0 with p<.05
# the Pearson residuals, (observed - expected) / sqrt(expected).

# TODO: Do more people with high prior ability watch videos ?

tt <- table(moocs$prior, moocs$watched_videos)
ct <- chisq.test(tt)
ct

ct$residuals
```