Loading Data

Let’s load data. The first step is to read a CSV file into a dataframe, which is the data structure used by R to store data.

# TODO: Adapt the path to your own installation Set the working directory
setwd("/Volumes/GoogleDrive/My Drive/CEDE/TEACHING/courses/CS411-2018/")

# Read a csv file (header = T signals that the first line of the file contains the names of
# the variables) the <- operator assigns the result of read.csv to the variable moocs moocs
# is a dataframe, similar to an excel worksheet.
moocs <- read.csv(file = "moocs.basic.csv", header = T)
# prints the first rows of the dataframe. Useful to check if the import has worked. We see
# that there are many missing values signaled by 'NA' == Not Available, namely for
# variables starting with MOOC_ ... this is because these students never logged in the
# MOOC.
head(moocs)
  EPFL_AcademicYear EPFL_IsRepeating EPFL_PropaedeuticGPA EPFL_CourseGrade
1         2015-2016                0                   NA              3.5
2         2015-2016                0                 4.51              3.5
3         2015-2016                1                 4.43              4.5
4         2015-2016                0                 4.75              5.0
5         2015-2016                0                 3.62              3.0
6         2015-2016                0                 2.95              3.0
  EPFL_StudentSection MOOC_NVideosWatched MOOC_AverageMaxVideosWatched
1                  IN                  NA                           NA
2                  SV                  NA                           NA
3                  SV                   1                           29
4                  EL                  NA                           NA
5                  GM                  NA                           NA
6                  GC                   2                           29
  MOOC_PercentageVideosWatched MOOC_NProblemsSolved MOOC_AverageMaxProblemsSolved
1                           NA                   NA                            NA
2                           NA                   NA                            NA
3                            0                   NA                            NA
4                           NA                   NA                            NA
5                           NA                   NA                            NA
6                           10                   NA                            NA
  MOOC_PercentageProblemsSolved MOOC_Grade EPFL_Topic EPFL_BacGrade EPFL_BacLevel
1                            NA       -Inf   MATH-111     0.8142857            HI
2                            NA          0   MATH-111     0.6833333            LO
3                            NA          0   MATH-111     0.8571429            HI
4                            NA       -Inf   MATH-111     0.8500000            HI
5                            NA       -Inf   MATH-111     0.7750000            LO
6                            NA          0   MATH-111     0.6666667            LO
  EPFL_CourseType EPFL_UniqueTopicID_Anonymous             EPFL_AnonymousUserID
1            MATH       MATH-111 2015-2016 T19 3824714ee78feab0dbfcf503ca2f4982
2            MATH       MATH-111 2015-2016 T19 07bea1a899d08b881015ae1018d6778b
3            MATH       MATH-111 2015-2016 T19 ba2579f42a4d5cee4fcea913fbff0a1e
4            MATH       MATH-111 2015-2016 T19 b9a3e806bc1104c6dace2e2393ef19ab
5            MATH       MATH-111 2015-2016 T19 b01e6559a855cf4c774ac9a6193d4268
6            MATH       MATH-111 2015-2016 T19 15917ccf6bfad439511f4a1276500b40

Our dataset contains variables starting with EPFL and variables starting with MOOC. The variables with EPFL come from the academic database and describe student’s performance in their EPFL exams. The variables starting with MOOC conern their online activity in the MOOC.

# List the names of the variables
names(moocs)
 [1] "EPFL_AcademicYear"             "EPFL_IsRepeating"             
 [3] "EPFL_PropaedeuticGPA"          "EPFL_CourseGrade"             
 [5] "EPFL_StudentSection"           "MOOC_NVideosWatched"          
 [7] "MOOC_AverageMaxVideosWatched"  "MOOC_PercentageVideosWatched" 
 [9] "MOOC_NProblemsSolved"          "MOOC_AverageMaxProblemsSolved"
[11] "MOOC_PercentageProblemsSolved" "MOOC_Grade"                   
[13] "EPFL_Topic"                    "EPFL_BacGrade"                
[15] "EPFL_BacLevel"                 "EPFL_CourseType"              
[17] "EPFL_UniqueTopicID_Anonymous"  "EPFL_AnonymousUserID"         

Descriptive Statistics

Summary for variables

# For each variable, print some information about the distribution (if the variable is
# quantitative) or about the distinct values taken by the variable (if the variable is
# categorical).
summary(moocs)
 EPFL_AcademicYear EPFL_IsRepeating EPFL_PropaedeuticGPA EPFL_CourseGrade
 2013-2014:2749    Min.   :0.0000   Min.   :1.690        Min.   :0.000   
 2014-2015:2694    1st Qu.:0.0000   1st Qu.:3.680        1st Qu.:3.000   
 2015-2016:3871    Median :0.0000   Median :4.250        Median :4.000   
                   Mean   :0.1846   Mean   :4.133        Mean   :3.719   
                   3rd Qu.:0.0000   3rd Qu.:4.630        3rd Qu.:5.000   
                   Max.   :1.0000   Max.   :5.890        Max.   :6.000   
                                    NA's   :1833                         
 EPFL_StudentSection MOOC_NVideosWatched MOOC_AverageMaxVideosWatched
 SV     :2199        Min.   : 1.00       Min.   :14.00               
 GM     :1157        1st Qu.: 7.50       1st Qu.:44.00               
 MT     :1128        Median :28.00       Median :45.00               
 MA     :1040        Mean   :24.46       Mean   :48.95               
 GC     : 970        3rd Qu.:39.00       3rd Qu.:58.00               
 PH     : 649        Max.   :85.00       Max.   :85.00               
 (Other):2171        NA's   :6339        NA's   :6339                
 MOOC_PercentageVideosWatched MOOC_NProblemsSolved MOOC_AverageMaxProblemsSolved
 Min.   :  0.00               Min.   :  1.00       Min.   : 17.00               
 1st Qu.: 20.00               1st Qu.:  3.00       1st Qu.: 17.00               
 Median : 60.00               Median :  7.00       Median : 20.00               
 Mean   : 54.28               Mean   : 10.67       Mean   : 36.44               
 3rd Qu.: 80.00               3rd Qu.: 14.00       3rd Qu.: 25.00               
 Max.   :100.00               Max.   :144.00       Max.   :142.00               
 NA's   :6339                 NA's   :7440         NA's   :7440                 
 MOOC_PercentageProblemsSolved   MOOC_Grade      EPFL_Topic   EPFL_BacGrade   
 Min.   :  0.00                Min.   :-Inf   CS-111  :1631   Min.   :0.1260  
 1st Qu.: 10.00                1st Qu.:-Inf   CS-112  :1316   1st Qu.:0.7500  
 Median : 30.00                Median :-Inf   EE-102  : 797   Median :0.8083  
 Mean   : 37.27                Mean   :-Inf   MATH-111:1404   Mean   :0.8062  
 3rd Qu.: 60.00                3rd Qu.:   0   PHYS-101:4166   3rd Qu.:0.8536  
 Max.   :100.00                Max.   : 100                   Max.   :1.0000  
 NA's   :7440                                                 NA's   :719     
 EPFL_BacLevel EPFL_CourseType         EPFL_UniqueTopicID_Anonymous
 HI  :4346     CS  :2947       PHYS-101 2014-2015 T63: 336         
 LO  :4249     EE  : 797       EE-102 2015-2016 T29  : 300         
 NA's: 719     MATH:1404       MATH-111 2015-2016 T18: 295         
               PHYS:4166       PHYS-101 2015-2016 T46: 295         
                               PHYS-101 2013-2014 T15: 288         
                               EE-102 2014-2015 T29  : 281         
                               (Other)               :7519         
                       EPFL_AnonymousUserID
 e2e988bc1a66ae2f50330fd60a0c2367:   8     
 16af0f5438b6f7983392ab058a7e4eeb:   7     
 29e1b05939960835d785c0aaa20ae388:   7     
 35ad976f794b6523fe4dd3388b02ff5b:   7     
 4abf1bcbc3198f4f3c2feba5bca4f6e6:   7     
 4c46db8f7de7dd9b3521efad9efe0aa9:   7     
 (Other)                         :9271     
# To get information about only one variable, use the $ sign to separate the name of the
# dataframe and the name of the variable like so: <name_dataframe>$<name_variable>
summary(moocs$EPFL_StudentSection)
AUDIT   CGC  EDBB    EL    GC    GM    IN    MA    MT    MX    PH    SC   SIE    SV 
    7   513     1   129   970  1157   535  1040  1128   168   649   346   472  2199 
summary(moocs$MOOC_PercentageVideosWatched)
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    NA's 
   0.00   20.00   60.00   54.28   80.00  100.00    6339 
mean(moocs$MOOC_PercentageVideosWatched, na.rm = T)  # na.rm removes missing values
[1] 54.28235
sd(moocs$MOOC_PercentageVideosWatched, na.rm = T)  # standard deviation
[1] 34.6713
median(moocs$MOOC_PercentageVideosWatched, na.rm = T)  # median (50% observations below and 50% observations above)
[1] 60
max(moocs$MOOC_PercentageVideosWatched, na.rm = T)
[1] 100
min(moocs$MOOC_PercentageVideosWatched, na.rm = T)
[1] 0

Computing new categorical variables

moocs$didMOOC = factor(ifelse(is.na(moocs$MOOC_NVideosWatched) & is.na(moocs$MOOC_NProblemsSolved), 
    "NO.MOOC", "MOOC"))

summary(moocs$didMOOC)
   MOOC NO.MOOC 
   3004    6310 
# Now for people who did the MOOC
moocs$watched_videos = factor(ifelse(moocs$MOOC_PercentageVideosWatched == 0 | is.na(moocs$MOOC_PercentageVideosWatched), 
    "NO.VIDEO", "SOME.VIDEO"))

summary(moocs$watched_videos)
  NO.VIDEO SOME.VIDEO 
      6591       2723 
# The table command reports the number of observations per category.
table(moocs$watched_videos)

  NO.VIDEO SOME.VIDEO 
      6591       2723 
moocs$solved_problems = factor(ifelse(moocs$MOOC_PercentageProblemsSolved == 0 | is.na(moocs$MOOC_PercentageProblemsSolved), 
    "NO.PROBLEM", "SOME.PROBLEM"))

table(moocs$solved_problems)

  NO.PROBLEM SOME.PROBLEM 
        7621         1693 
# With two arguments, the table command cross-tabulates the number of observations per
# category.
table(moocs$watched_videos, moocs$solved_problems)
            
             NO.PROBLEM SOME.PROBLEM
  NO.VIDEO         6555           36
  SOME.VIDEO       1066         1657
# Let's also replace the missing values with 0 for the percentage of videos watched and the
# percentage of problems solved
moocs$MOOC_PercentageVideosWatched <- ifelse(is.na(moocs$MOOC_PercentageVideosWatched), 0, 
    moocs$MOOC_PercentageVideosWatched)
moocs$MOOC_PercentageProblemsSolved <- ifelse(is.na(moocs$MOOC_PercentageProblemsSolved), 0, 
    moocs$MOOC_PercentageProblemsSolved)

# TODO: Define a categorical variable names mooc$MOOC that has 3 levels with: 1) NONE =
# 'did not use the mooc at all' 2) WATCH = 'used the mooc mostly to watch videos' 3) DO =
# 'also did some assignments'

# Begin Solution:
moocs$MOOC <- factor(ifelse(moocs$watched_videos == "NO.VIDEO" & moocs$solved_problems == "NO.PROBLEM", 
    "NONE", ifelse(moocs$solved_problems == "SOME.PROBLEM", "DO", "WATCH")))
# End Solution

library(gdata)
# put the levels into a convenient order (for graphing, etc.)
moocs$MOOC <- reorder(moocs$MOOC, new.order = c("NONE", "WATCH", "DO"))
table(moocs$MOOC)

 NONE WATCH    DO 
 6555  1066  1693 

Median split and quartile splits

hist(moocs$EPFL_BacGrade)
abline(v = median(moocs$EPFL_BacGrade, na.rm = T), col = "red", lwd = 3)

# TODO: Define two categorical variables named moocs$prior and moocs$prior4 that
# distinguishstudents given at their baccalaureat grade.

# 1) moocs$prior.  The values of the variable are 'HI'' for strong students and 'LO'' for
# weakers students.  => use median() to define high and low levels of ability

# 2) moocs$prior4 Levels are called XLO, LO, HI, XHI => use quantile() and cut() to
# definelevels

# Begin solution
moocs$prior <- factor(ifelse(moocs$EPFL_BacGrade > median(moocs$EPFL_BacGrade, na.rm = T), 
    "HI", "LO"))

# reorder the levels of prior so that low comes before hi in the graphs
moocs$prior <- reorder(moocs$prior, new.order = c("LO", "HI"))
# count the number of observations in LO and HI
table(moocs$prior)

  LO   HI 
4318 4277 
# define
qq <- quantile(moocs$EPFL_BacGrade, na.rm = T)
moocs$prior4 <- cut(moocs$EPFL_BacGrade, breaks = qq, labels = c("XLO", "LO", "HI", "XHI"))

table(moocs$prior4)

 XLO   LO   HI  XHI 
2285 2032 2128 2149 
# end solution

Basic Plotting

# hist draws the historgram of a quantitative variable main='title' sets the title.
# xlab='title of the xaxis' ylab='title of the yaxis' breaks=n sets the number of bins used
# by the historgram par() sets graphical parameters. mfrow allows to plot several plots on
# one figure. It takes as argument, an array c(nrow,ncol) that contains the number of rows
# and columns to arrange the subplots.  png opens a png file of teh given size to output
# the results of the plotting.

png("histograms.png", width = 600, height = 600)  # Opens the file
par(mfrow = c(2, 2))  # splits the figure into 4 subplots.
hist(moocs$EPFL_CourseGrade, main = "Course Grades", xlab = "Grade")
hist(moocs$MOOC_PercentageVideosWatched, main = "Number of videos watched", xlab = "Videos", 
    breaks = 100)
hist(moocs$MOOC_Grade, main = "MOOC Grade", xlab = "Grade")
hist(moocs$MOOC_AverageMaxVideosWatched, main = "Max Number of videos", xlab = "Videos")
dev.off()  # Closes the file
quartz_off_screen 
                2 

Boxplots can be obtained for one variable at the time with the boxplot command. R also creates boxplots when using the command plot with a categorical variable: plot(quantitative ~ categorical).

par(mfrow = c(1, 2))

# Boxplots
boxplot(moocs$MOOC_PercentageVideosWatched, horizontal = TRUE, varwidth = TRUE)

# PLotting a quantitative variable given a categorical variable.
plot(moocs$EPFL_CourseGrade ~ moocs$MOOC)

# Plotting two quantitative variables
plot(moocs$EPFL_CourseGrade ~ moocs$MOOC_PercentageVideosWatched, pch = 21)

# Since all points are on top of each other, we use jitter() to add some random noise.  pch
# is the plotting character used to draw points
# https://www.statmethods.net/advgraphs/parameters.html cex is a size adjusment color of
# the points is given by RGB and alpha
plot(jitter(moocs$EPFL_CourseGrade) ~ jitter(moocs$MOOC_PercentageVideosWatched), pch = 19, 
    cex = 0.8, col = rgb(0, 0, 0, 0.5))

Plotting with GGPLOT

Let’s now see how to plot these same graphs with ggplot2. See documentation here http://ggplot2.org/ and here http://www.cookbook-r.com/Graphs/ and https://www.rstudio.com/wp-content/uploads/2015/03/ggplot2-cheatsheet.pdf

library(ggplot2)
ggplot(moocs, aes(x = MOOC_PercentageVideosWatched)) + geom_histogram(binwidth = 10)

ggplot(moocs, aes(x = MOOC, y = EPFL_CourseGrade)) + geom_boxplot(aes(fill = MOOC))

ggplot(moocs, aes(x = MOOC, y = EPFL_CourseGrade)) + geom_boxplot(aes(fill = MOOC)) + facet_grid(. ~ 
    EPFL_Topic)

# In this plot we see that the 'cloud of points' is strongly clustered to the left, since
# most of the people in the sample have watched zero videos.
ggplot(moocs, aes(x = MOOC_PercentageVideosWatched, y = EPFL_CourseGrade)) + geom_jitter(alpha = I(1/4))

# TODO Create a plot with ggplot that checks whether the course grade is linked with the
# grade obtained at the baccalaureat Find a way to add some jitter PLot this separately for
# eah EPFL_Topic Begin Solution
ggplot(moocs, aes(x = EPFL_BacGrade, y = EPFL_CourseGrade)) + geom_point(size = 3, alpha = 0.1, 
    position = position_jitter(width = 0.1)) + stat_smooth(method = "lm") + facet_grid(. ~ 
    EPFL_Topic)

# end solution

Basic Statistics

 Correlation

# Correlation between 2 normally distributed quantitative variables
cor.test(moocs$EPFL_CourseGrade, moocs$MOOC_PercentageVideosWatched)

    Pearson's product-moment correlation

data:  moocs$EPFL_CourseGrade and moocs$MOOC_PercentageVideosWatched
t = 24.944, df = 9312, p-value < 2.2e-16
alternative hypothesis: true correlation is not equal to 0
95 percent confidence interval:
 0.2311348 0.2692098
sample estimates:
      cor 
0.2502691 
# TODO: find whether the CourseGrade is correlated with the BacGrade
cor.test(moocs$EPFL_CourseGrade, moocs$EPFL_BacGrade)

    Pearson's product-moment correlation

data:  moocs$EPFL_CourseGrade and moocs$EPFL_BacGrade
t = 34.656, df = 8593, p-value < 2.2e-16
alternative hypothesis: true correlation is not equal to 0
95 percent confidence interval:
 0.3314945 0.3685945
sample estimates:
      cor 
0.3501819 

The output contains the Pearson correlation coefficiant (cor) as well as a t-test that is testing whether this correlation is statistically different from 0. In our case, p << .05 which indicates that the correlation is positive.

The same cor.test function can be used to computer Spearman rank correlation which is used whenever the distributions of the variables are not normal (which seems to be the case here for moocs$MOOC_PercentageVideosWatched). In such a case the Spearman rank correlation might be more appropriate (see http://geoinfo.amu.edu.pl/qg/archives/2011/QG302_087-093.pdf).

cor.test(moocs$EPFL_CourseGrade, moocs$MOOC_PercentageVideosWatched, method = "spearman")

    Spearman's rank correlation rho

data:  moocs$EPFL_CourseGrade and moocs$MOOC_PercentageVideosWatched
S = 1.0206e+11, p-value < 2.2e-16
alternative hypothesis: true rho is not equal to 0
sample estimates:
      rho 
0.2421586 

The difference bewteen a rank correlation and pearson correlation is well illustrated here: https://en.wikipedia.org/wiki/Spearman%27s_rank_correlation_coefficient

Normality Assumption

Testing normality visually with quantile plots. If the variable is normally distributed, the observations would follow the line drawn by qqline. We see from the plots below that the distribution of the variable “MOOC_PercentageVideosWatched” departs very strongly from a normal distribution.

par(mfrow = c(2, 2))
hist(moocs$MOOC_PercentageVideosWatched)
qqnorm(moocs$MOOC_PercentageVideosWatched)
qqline(moocs$MOOC_PercentageVideosWatched)

hist(moocs$EPFL_CourseGrade)
qqnorm(moocs$EPFL_CourseGrade)
qqline(moocs$EPFL_CourseGrade)

par(mfrow = c(1, 1))


# TODO: Is Bac Grade normally distributed ?  Plot a historgram and a qq plot Begin solution
par(mfrow = c(1, 2))
hist(moocs$EPFL_BacGrade)
qqnorm(moocs$EPFL_BacGrade)
qqline(moocs$EPFL_BacGrade)

par(mfrow = c(1, 1))
# End solution

# An alternative to qqline and qqnorm.  install.packages('ggpubr')
library(ggpubr)
ggqqplot(moocs$EPFL_BacGrade)

Normality can also be tested statistically, however the normality tests are rather sensitive to small deviations.

For example, the Shapiro-Wilks test, is testing the null hypothesis that the variable is normally distributed. In this case the adequate p.value is .1. Hence, with p <<.1 we have to reject the null hypothesis and conclude that the variable is not normally distributed. This test work with samples smaller than 5000 observations.

# shapiro.test(moocs$MOOC_PercentageVideosWatched) shapiro.test(moocs$EPFL_CourseGrade)
# # has too many observations

An alternative to shapiro is the Kolmogorov-Smirnov test which allows to test whether two samples were drawn from the same distribution. This allows to compare our observations with a sample that follows a normal distibution. The test for EPFL_BacGrade (D=0.04, p<.00) shows that the distribution of sample is unlikely to stem from a normal distribution.

# Kolmogorov-Smirnov test that compares a variable x with a normal distribution that has
# the same mean and variance as x. The test relies on the maximum difference between the
# Cumulated Frequency Distributions of two samples.
x <- moocs$EPFL_BacGrade
ks.test(x, "pnorm", mean(x, na.rm = T), sqrt(var(x, na.rm = T)))

    One-sample Kolmogorov-Smirnov test

data:  x
D = 0.042517, p-value = 6.395e-14
alternative hypothesis: two-sided
# define afunction to plot the Kolmogorov Smirnov test (adapted from
# https://rpubs.com/mharris/KSplot)

plot.normal.ks <- function(sample1) {
    
    normal.sample <- rnorm(10000, mean(sample1, na.rm = T), sqrt(var(sample1, na.rm = T)))  # a normal sample that has the same mean and sd
    
    cdf1 <- ecdf(sample1)
    cdf2 <- ecdf(normal.sample)
    
    plot(cdf1, verticals = TRUE, do.points = FALSE, col = "blue")
    plot(cdf2, verticals = TRUE, do.points = FALSE, col = "green", add = TRUE)
    
    minMax <- seq(min(sample1, normal.sample, na.rm = T), max(sample1, normal.sample, na.rm = T), 
        length.out = length(sample1))
    x0 <- minMax[which(abs(cdf1(minMax) - cdf2(minMax)) == max(abs(cdf1(minMax) - cdf2(minMax))))]  # identify where the maximum difference is.
    y0 <- cdf1(x0)
    y1 <- cdf2(x0)
    points(c(x0, x0), c(y0, y1), pch = 16, col = "red")
    segments(x0, y0, x0, y1, col = "red", lwd = 5, lty = "solid")
}

par(mfrow = c(1, 2))  # split the plotting area into 1 row and 2 columns
plot.normal.ks(moocs$MOOC_PercentageVideosWatched)
plot.normal.ks(moocs$EPFL_BacGrade)

par(mfrow = c(1, 1))  # reset the plotting area

 ANOVA - Analysis of Variance

Comparing means across categories. We use the Analysis of Variance (ANOVA) to compare the mean of a variable across several categorical variables.

# Comparing whether one variable has the same mean in different groups ?  quantitative ~ 
# categorical

#  H0: mean_1 = mean_2 = ...= Mean_n H1: mean_1 != mean_2 != ... != mean_n
oneway.test(moocs$EPFL_CourseGrade ~ moocs$MOOC, var.equal = T)

    One-way analysis of means

data:  moocs$EPFL_CourseGrade and moocs$MOOC
F = 291.59, num df = 2, denom df = 9311, p-value < 2.2e-16
pairwise.t.test(moocs$EPFL_CourseGrade, moocs$MOOC, p.adj = "bonf")  # Checks which group is different from what other group in doing 2 by 2 tests.

    Pairwise comparisons using t tests with pooled SD 

data:  moocs$EPFL_CourseGrade and moocs$MOOC 

      NONE    WATCH  
WATCH 6.7e-11 -      
DO    < 2e-16 < 2e-16

P value adjustment method: bonferroni 

Assumptions to be met before conducting anova

1) Data is normally distributed in each of the groups

# by() is applying a function for each value of the categorical variable defined by
# 'INDICES' here it takes every value of moocs$MOOC. See below how to do the graphs one by
# one.
par(mfrow = c(3, 3))
by(moocs$EPFL_CourseGrade, INDICES = moocs$MOOC, function(x) {
    hist(x)
    qqnorm(x)
    qqline(x)
    plot.normal.ks(x)
})

moocs$MOOC: NONE
NULL
------------------------------------------------------------------- 
moocs$MOOC: WATCH
NULL
------------------------------------------------------------------- 
moocs$MOOC: DO
NULL
par(mfrow = c(1, 1))

# repeating the same graph 3 times, once for each possible value of moocs$MOOC. The syntax
# above with by() is more efficient.  par(mfrow=c(3,2))
# qqnorm(moocs$EPFL_CourseGrade[moocs$MOOC =='NONE'])
# qqline(moocs$EPFL_CourseGrade[moocs$MOOC =='NONE'])
# plot.normal.ks(moocs$EPFL_CourseGrade[moocs$MOOC =='NONE'])
# qqnorm(moocs$EPFL_CourseGrade[moocs$MOOC =='WATCH'])
# qqline(moocs$EPFL_CourseGrade[moocs$MOOC =='WATCH'])
# plot.normal.ks(moocs$EPFL_CourseGrade[moocs$MOOC =='WATCH'])
# qqnorm(moocs$EPFL_CourseGrade[moocs$MOOC =='DO'])
# qqline(moocs$EPFL_CourseGrade[moocs$MOOC =='DO'])
# plot.normal.ks(moocs$EPFL_CourseGrade[moocs$MOOC =='DO']) par(mfrow=c(1,1))

2) Variances are equal in both groups

# H0: variances are the same H1: variances are different  We have to reject the null
# hypothesis H0 for the bartlett test whih means that we the variances are different. This
# is appararent fom the boxplot as well, where we see that the variance for the DO group is
# smaller.
bartlett.test(moocs$EPFL_CourseGrade ~ moocs$MOOC)

    Bartlett test of homogeneity of variances

data:  moocs$EPFL_CourseGrade by moocs$MOOC
Bartlett's K-squared = 130, df = 2, p-value < 2.2e-16
tapply(moocs$EPFL_CourseGrade, moocs$MOOC, var, na.rm = T)  #  tapply allows to apply the var() function to 3 subsamples defined by the 3 values of moocs$MOOC 
    NONE    WATCH       DO 
2.222989 1.757752 1.433036 
boxplot(moocs$EPFL_CourseGrade ~ moocs$MOOC, horizontal = T)

# Normality +- OK Variances different

# a) adapt the caluclations to take unequal variances into account
oneway.test(moocs$EPFL_CourseGrade ~ moocs$MOOC, var.equal = F)

    One-way analysis of means (not assuming equal variances)

data:  moocs$EPFL_CourseGrade and moocs$MOOC
F = 366.38, num df = 2.0, denom df = 2495.4, p-value < 2.2e-16
# b) => use non parametric alternative to anova.  Tests the null hypothesis that the
# location parameters are the same in each of the groups.
kruskal.test(moocs$EPFL_CourseGrade ~ moocs$MOOC)

    Kruskal-Wallis rank sum test

data:  moocs$EPFL_CourseGrade by moocs$MOOC
Kruskal-Wallis chi-squared = 608.7, df = 2, p-value < 2.2e-16
######## TODO make an anova / or another test to find out high ability students (prior or prior4)
######## get better grades (EPFL_CourseGrade) check assumptions

# Begin solution 1) Checking normality

# 2) testing homoscedasticity => the result shos that we cannot consider the variances
# equal between LO and HI prior. This seems not to be the case given the bartlett.test

# if non equal variances option 1=> adapt oneway.test by adding var.equal=F

# option 2 => use kruskall.test instead

# End Solution

Two factor ANOVA

We can add more than one categorical predictor and also model the interaction of two factors (“:” represents the interaction below). When there is an interaction, the effect of one variable is conditional on the effect of another variable. We use the lm function to express the linear model m underlying the Anova. Hence, y ~ A + B + A:B expresses a model where the dependent variable y depends on factor A, factor B and their interaction. It can also be written as y ~ A * B. The * sign means “single effects and interaction”.

m <- lm(moocs$EPFL_CourseGrade ~ moocs$watched_videos + moocs$solved_problems + moocs$watched_videos:moocs$solved_problems)
# a synonym for the model that contains single and interaction effects is A * B == A + B +
# A:B hence we can rewrite m as: m <- lm(moocs$EPFL_CourseGrade ~ moocs$watched_videos *
# moocs$solved_problems)

# Following http://goanna.cs.rmit.edu.au/~fscholer/anova.php the Type III Sum of Square is
# more adapted to interpret the contributions of the differnt variables. R and anova report
# Type I sums of square.  Here is an explanation about how to obtain Type III sums of
# squares.  https://www.r-bloggers.com/anova-%E2%80%93-type-iiiiii-ss-explained/ And some
# explanations here:
# prometheus.scp.rochester.edu/zlab/sites/default/files/InteractionsAndTypesOfSS.pdf
# install.packages('car')

# Doing Type III sums of squares.
library(car)
### Default is: options(contrasts = c('contr.treatment', 'contr.poly'))
options(contrasts = c("contr.sum", "contr.poly"))

#  There are two simple effects, but no interaction. The effect of watching videos is
# indenpendent of the effect of solving problems.  I.e. the effect of watching videos is
# the same for people who solve problems and for people who do not solve problems.
Anova(m, type = 3)
Anova Table (Type III tests)

Response: moocs$EPFL_CourseGrade
                                           Sum Sq   Df    F value    Pr(>F)    
(Intercept)                                 80941    1 39960.6683 < 2.2e-16 ***
moocs$watched_videos                           91    1    44.8710 2.226e-11 ***
moocs$solved_problems                           7    1     3.4905   0.06175 .  
moocs$watched_videos:moocs$solved_problems      1    1     0.5596   0.45444    
Residuals                                   18857 9310                         
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
# Checking assumptions with residuals plot 1) residuals need to be normally distributed for
# each subgroup 2) residuals' variance needs to be constant across different fitted values.
residualPlots(m)

Plotting means and confidence intervals with plotmeans

par(mfrow = c(1, 1))
library(gplots)
plotmeans(moocs$EPFL_CourseGrade ~ moocs$watched_videos)

plotmeans(moocs$EPFL_CourseGrade ~ moocs$solved_problems)

# technique 1: plotting the interaction
plotmeans(moocs$EPFL_CourseGrade ~ interaction(moocs$watched_videos, moocs$solved_problems))

# technique 2: plotting the interaction with interaction.plot. There is no interaction, the
# lines are 'parallel'
interaction.plot(moocs$watched_videos, moocs$solved_problems, moocs$EPFL_CourseGrade)

Plotting means and confidence intervals with ggplot

# Strategy 1: summarise the data first and plot the summary.
# 
# step 1: first we need to create a data.frame that contains only the summary values we wwant to plot 
require(dplyr)
df <- ddply(moocs,
      .(trace = watched_videos,
        xtrace = solved_problems), # specify which categorical variables are used to summarise data
      summarise,
      # now list the new variables thart we compute
      avg = mean(EPFL_CourseGrade), 
      ci = 1.96 * sd(EPFL_CourseGrade, na.rm=T)/sqrt(length(EPFL_CourseGrade))
      )

# Here is the dataframe that contains the poitns
df
       trace       xtrace      avg         ci
1   NO.VIDEO   NO.PROBLEM 3.513959 0.03609428
2   NO.VIDEO SOME.PROBLEM 3.958333 0.61780481
3 SOME.VIDEO   NO.PROBLEM 3.828799 0.07958959
4 SOME.VIDEO SOME.PROBLEM 4.455944 0.05663132
# we will use this to shift the data points left and wirhg
pd <- position_dodge(0.1) 

# step 2: plot the summary
gp <- ggplot(df, aes(x=xtrace, y=avg, group=trace))
gp <- gp + geom_line(aes(linetype=trace), size=0.9, position=pd)
gp <- gp + geom_errorbar(aes(ymax=avg+ci, ymin=avg-ci), width=.2 , position=pd)
gp <- gp + geom_point(aes(shape=trace), size=3.5 , position=pd) 
gp <- gp +  theme_bw() 
gp 

Exercice: add text label to the plot with the number of observations

# TODO
# a) add the number of obsevations to df 
# (tip: length() counts the number of elements in a vector)
df <- ddply(moocs,
      .(trace = watched_videos,
        xtrace = solved_problems), # specify which categorical variables are used to summarise data
      summarise,
      # now list the new variables thart we compute
      avg = mean(EPFL_CourseGrade), 
      ci = 1.96 * sd(EPFL_CourseGrade, na.rm=T)/sqrt(length(EPFL_CourseGrade)),
      n = length(EPFL_CourseGrade)
      )

# b) add the number to the plot 
# (tip: geom_text() allows to add text)
gp <- ggplot(df, aes(x=xtrace, y=avg, group=trace))
gp <- gp + geom_line(aes(linetype=trace), size=0.9, position=pd)
gp <- gp + geom_errorbar(aes(ymax=avg+ci, ymin=avg-ci), width=.2 , position=pd)
gp <- gp + geom_point(aes(shape=trace), size=3.5 , position=pd) 
gp <- gp +  theme_bw() 
gp + geom_text(aes(x=xtrace, y=avg, label=paste("N=",n, sep=""), hjust=1.2, vjust=-1))

## Strategy 2: use ggplot to compute the summary and confidence intervals

# install.packages('Hmisc') # needed for mean_cl_boot in stat_summary
library(Hmisc)

pd = position_dodge(width = 0.1)

# Plotting direclty the course grades rather than summarizng as a separate data frame
gp <- ggplot(moocs, aes(x = watched_videos, y = EPFL_CourseGrade, group = solved_problems))
gp <- gp + geom_line(stat = "summary", fun.y = "mean", position = pd)
gp <- gp + geom_point(stat = "summary", fun.y = "mean", position = pd, size = 3)
gp <- gp + stat_summary(fun.data = "mean_cl_boot", size = 1, position = pd, geom = "errorbar", 
    width = 0.2, aes(colour = solved_problems))
# gp <- gp + geom_jitter(aes(x=watched_videos, y=EPFL_CourseGrade, colour=solved_problems),
# alpha=0.1)
gp <- gp + theme_bw()
gp

ggsave(filename = "grade-problems-videos.pdf", gp)

# Plotting with bars
pd = position_dodge(width = 1)
gp <- ggplot(moocs, aes(x = watched_videos, y = EPFL_CourseGrade, group = solved_problems))
gp <- gp + geom_bar(aes(fill = solved_problems), stat = "summary", fun.y = "mean", position = pd)
gp <- gp + stat_summary(fun.data = "mean_cl_boot", colour = "black", size = 1, position = pd)
# gp + stat_summary(fun.y = mean, geom='line', position=pd)
gp

Exercise:

Run an ANOVA that answers the following question: * Does participating in MOOCs help gettng to better grades (choose one of didMOOC or MOOC which are two variables that define mooc usage) ? * Do MOOCs help weak students as much as strong students (prior and prior4 are two categorical variables that define level) ? * Is there an interaction between MOOC use and prior ability (e.g. do MOOCs help weaker students more than strong students ?)

Check assumptions with residualsPlot.

# TODO Let's use only observations for which we know the prior ability of the students
moocs.valid <- subset(moocs, !is.na(moocs$prior))

# Begin Solution


# Check assuptions with residualPlots


# End Solution

# TODO Three way ANOVA Is the relation different for different levels of EPFL_topic (use
# facet_grid to plot the relation for different values of EPFL_topic)?

Plotting interactions with GGPLOT

# TODO: plot the graphs that represent the interactions between MOOC usage and prior
# ability

interaction.plot(moocs.valid$MOOC, moocs.valid$prior, moocs.valid$EPFL_CourseGrade)

# Begin solution with GGPLOT Plot means and confidence intervals


# End solution

# TODO: plot the graphs that represent the three way interactions between MOOC usage and
# prior ability for different topics Begin solution

# End Solution

Linear regression

# Predicting one quantitative variable with a combination of other quantitative variables

# Assumptions for linear regression 
# * Linearity: The relationship between X and the mean of Y is linear.
# * Homoscedasticity: The variance of residual is the same for any value of X.
# * Independence: Observations are independent of each other.
# * Normality: For any fixed value of X, Y, and the residuals are normally distributed.


# lm estimates the parameters of y = ax + b so as to minimise the Error. 
# the null model consists of predicting the CourseGrade by its mean (~1):
m0 <- lm(EPFL_CourseGrade ~ 1, data=moocs)
summary(m0)

Call:
lm(formula = EPFL_CourseGrade ~ 1, data = moocs)

Residuals:
    Min      1Q  Median      3Q     Max 
-3.7193 -0.7193  0.2807  1.2807  2.2807 

Coefficients:
            Estimate Std. Error t value Pr(>|t|)    
(Intercept)   3.7193     0.0152   244.6   <2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 1.467 on 9313 degrees of freedom
# Adding a predictor MOOC_PercentageVideosWatched to the null model.
m1 <- lm(EPFL_CourseGrade ~ MOOC_PercentageVideosWatched, data=moocs)

# The adjusted R^2 is interpreted as the proportion of variance explained.# 
# The F-statistic indicates whereh this model is better than the null model that would include only one parameter (the mean).
summary(m1)

Call:
lm(formula = EPFL_CourseGrade ~ MOOC_PercentageVideosWatched, 
    data = moocs)

Residuals:
    Min      1Q  Median      3Q     Max 
-4.6676 -0.9382  0.2502  0.9796  2.4796 

Coefficients:
                              Estimate Std. Error t value Pr(>|t|)    
(Intercept)                  3.5203862  0.0167408  210.29   <2e-16 ***
MOOC_PercentageVideosWatched 0.0114721  0.0004599   24.94   <2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 1.421 on 9312 degrees of freedom
Multiple R-squared:  0.06263,   Adjusted R-squared:  0.06253 
F-statistic: 622.2 on 1 and 9312 DF,  p-value: < 2.2e-16
moocs$pred <- fitted(m1)
moocs$res <- residuals(m1)
# let's plot the distribution of the residuals
par(mfrow=c(1,3))
hist(moocs$res)
plot.normal.ks(moocs$res)
qqnorm(moocs$res)
qqline(moocs$res)

par(mfrow=c(1,1))

# Plotting a model results in 4 plots that allow to diagnose the quality of the model
# http://data.library.virginia.edu/diagnostic-plots/

par(mfrow=c(2,2))
plot(m1)

# Plotting the regression line.
ggplot(moocs, aes(x = MOOC_PercentageVideosWatched, y = EPFL_CourseGrade)) +
  geom_jitter(alpha = .1) + # plot the points with jitter and transparency
  geom_point(aes(y = pred), col="red", fill="red", shape = 21, size=3)  + # Add the predicted values
  geom_smooth(method = "lm") + # Add the regression line
  xlab("Percentage of videos watched")+ # add an x-axis label
  ylab("Grade at the EPFL exam") # add an y-axis label

### Let's add a second predictor to our model
m2 <- lm(EPFL_CourseGrade ~ MOOC_PercentageVideosWatched + MOOC_PercentageProblemsSolved, data=moocs)

# The F statistic reported in the summary corresponds to the comparison of this model 
# with a null model that contains only an intercpt.
summary(m2)

Call:
lm(formula = EPFL_CourseGrade ~ MOOC_PercentageVideosWatched + 
    MOOC_PercentageProblemsSolved, data = moocs)

Residuals:
    Min      1Q  Median      3Q     Max 
-4.7506 -0.8793  0.2065  0.9784  2.4784 

Coefficients:
                               Estimate Std. Error t value Pr(>|t|)    
(Intercept)                   3.5216264  0.0167081 210.774  < 2e-16 ***
MOOC_PercentageVideosWatched  0.0085766  0.0006532  13.131  < 2e-16 ***
MOOC_PercentageProblemsSolved 0.0065298  0.0010480   6.231 4.85e-10 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 1.418 on 9311 degrees of freedom
Multiple R-squared:  0.06653,   Adjusted R-squared:  0.06633 
F-statistic: 331.8 on 2 and 9311 DF,  p-value: < 2.2e-16
# if we compute the null model by hand:
m0 <- lm(EPFL_CourseGrade ~ 1, data=moocs)
anova(m0,m2) # same as what is reported by summary(m2)
Analysis of Variance Table

Model 1: EPFL_CourseGrade ~ 1
Model 2: EPFL_CourseGrade ~ MOOC_PercentageVideosWatched + MOOC_PercentageProblemsSolved
  Res.Df   RSS Df Sum of Sq      F    Pr(>F)    
1   9313 20048                                  
2   9311 18714  2    1333.7 331.79 < 2.2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
# We can compare the three models that we have built.
# The test makes sense if m1 is nested in m2 (we added one term at the time)
# Evaluating incrementally whether adding one predictor enhances the model.
anova(m0, m1, m2)
Analysis of Variance Table

Model 1: EPFL_CourseGrade ~ 1
Model 2: EPFL_CourseGrade ~ MOOC_PercentageVideosWatched
Model 3: EPFL_CourseGrade ~ MOOC_PercentageVideosWatched + MOOC_PercentageProblemsSolved
  Res.Df   RSS Df Sum of Sq       F    Pr(>F)    
1   9313 20048                                   
2   9312 18792  1   1255.69 624.754 < 2.2e-16 ***
3   9311 18714  1     78.02  38.819  4.85e-10 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
anova(m2)
Analysis of Variance Table

Response: EPFL_CourseGrade
                                Df  Sum Sq Mean Sq F value    Pr(>F)    
MOOC_PercentageVideosWatched     1  1255.7 1255.69 624.754 < 2.2e-16 ***
MOOC_PercentageProblemsSolved    1    78.0   78.02  38.819  4.85e-10 ***
Residuals                     9311 18714.1    2.01                      
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Assumptions for regression. Evaluate the quality of the model

Assumptions for linear regression * Linearity: The relationship between X and the mean of Y is linear. * Homoscedasticity: The variance of residual is the same for any value of X. * Independence: Observations are independent of each other. * Normality: For any fixed value of X, Y is normally distributed.

# Let's check how the model fits assumptions. There is a slight trend for residuals to
# become negative as the predicted values augment. This is probably due to the large amount
# of 0 values for the percentage of videos watched.
par(mfrow = c(2, 2))
plot(m2)

The anscombe Data set illustrates the importance to check assumptions and look at residuals.

# TODO : load the anscombe data set The dataset contains four pairs of x and y values
# (y1~x1; y2~x2; y3~x3 and y4~x4) The mean and sd are identical across the 4 sets.  and
# evaluate whether the assumptions to conduct a linear regression are met with the 4 sets
# of data in the dataframe

# load('anscombe.RData')

# x and y are correlated with r = 0.81
cor.test(anscombe$x1, anscombe$y1)

    Pearson's product-moment correlation

data:  anscombe$x1 and anscombe$y1
t = 4.2415, df = 9, p-value = 0.00217
alternative hypothesis: true correlation is not equal to 0
95 percent confidence interval:
 0.4243912 0.9506933
sample estimates:
      cor 
0.8164205 
cor.test(anscombe$x2, anscombe$y2)

    Pearson's product-moment correlation

data:  anscombe$x2 and anscombe$y2
t = 4.2386, df = 9, p-value = 0.002179
alternative hypothesis: true correlation is not equal to 0
95 percent confidence interval:
 0.4239389 0.9506402
sample estimates:
      cor 
0.8162365 
cor.test(anscombe$x3, anscombe$y3)

    Pearson's product-moment correlation

data:  anscombe$x3 and anscombe$y3
t = 4.2394, df = 9, p-value = 0.002176
alternative hypothesis: true correlation is not equal to 0
95 percent confidence interval:
 0.4240623 0.9506547
sample estimates:
      cor 
0.8162867 
cor.test(anscombe$x4, anscombe$y4)

    Pearson's product-moment correlation

data:  anscombe$x4 and anscombe$y4
t = 4.243, df = 9, p-value = 0.002165
alternative hypothesis: true correlation is not equal to 0
95 percent confidence interval:
 0.4246394 0.9507224
sample estimates:
      cor 
0.8165214 
# the parameters for a linear model y~x are the same
m1 <- lm(anscombe$y1 ~ anscombe$x1)
m1

Call:
lm(formula = anscombe$y1 ~ anscombe$x1)

Coefficients:
(Intercept)  anscombe$x1  
     3.0001       0.5001  
m2 <- lm(anscombe$y2 ~ anscombe$x2)
m2

Call:
lm(formula = anscombe$y2 ~ anscombe$x2)

Coefficients:
(Intercept)  anscombe$x2  
      3.001        0.500  
m3 <- lm(anscombe$y3 ~ anscombe$x3)
m3

Call:
lm(formula = anscombe$y3 ~ anscombe$x3)

Coefficients:
(Intercept)  anscombe$x3  
     3.0025       0.4997  
m4 <- lm(anscombe$y4 ~ anscombe$x4)
m4

Call:
lm(formula = anscombe$y4 ~ anscombe$x4)

Coefficients:
(Intercept)  anscombe$x4  
     3.0017       0.4999  
# Plot the relation between y and x for the four datasets (y1~x1; y2~x2; y3~x3 and y4~x4)
# What do you observe ?
par(mfrow = c(2, 2))
plot(anscombe$y1 ~ anscombe$x1, pch = 19)
abline(m1)

plot(anscombe$y2 ~ anscombe$x2, pch = 19)
abline(m2)

plot(anscombe$y3 ~ anscombe$x3, pch = 19)
abline(m3)

plot(anscombe$y4 ~ anscombe$x4, pch = 19)
abline(m4)

par(mfrow = c(1, 1))

###  Checking assumptions TODO look at the diagnostic plots for m1, m2, m3, m4 Do you see
### problems ?

# Begin solution linearity (there should not be huge residuals) and homoscedasticity (there
# should not be a trend for residuals as we increase the fitted values)
par(mfrow = c(2, 2))
plot(m1, which = 1, main = "Model 1")  # 
plot(m2, which = 1, main = "Model 2")  # Looks like there is a pattern
plot(m3, which = 1, main = "Model 3")  # There is a trend downwards, the error becomes larger and more negative as we predict larger values. This means we surestimate more and more the predicted values.  
plot(m4, which = 1, main = "Model 4")

par(mfrow = c(1, 1))

# Assumption: normality of residuals.
par(mfrow = c(2, 2))
plot(m1, which = 2, main = "Model 1")  # 
plot(m2, which = 2, main = "Model 2")  # Looks like there is a pattern
plot(m3, which = 2, main = "Model 3")  # There is an outlier (observation 3)  
plot(m4, which = 2, main = "Model 4")

par(mfrow = c(1, 1))

# Fitted agains standardised residuals. there shoudl be no trend in the data
par(mfrow = c(2, 2))
plot(m1, which = 3, main = "Model 1")  # 
plot(m2, which = 3, main = "Model 2")  # Looks like there is a pattern
plot(m3, which = 3, main = "Model 3")  # There is an outlier (observation 3)  
plot(m4, which = 3, main = "Model 4")

par(mfrow = c(1, 1))

# Cook's distance. Measures the influence of one particular observation on the parameters
# of the regression. Values close and above 1 have a too large influence and should be
# observed.
par(mfrow = c(2, 2))
plot(m1, which = 4, main = "Model 1")  # 
plot(m2, which = 4, main = "Model 2")  # Looks like observations 6 and 8 are quite large
plot(m3, which = 4, main = "Model 3")  # There is an outlier (observation 3)  
plot(m4, which = 4, main = "Model 4")

par(mfrow = c(1, 1))
# end solution


# if we want to remove the third obsevation in the data anscombe = anscombe[-3,]

Categorical variables

Do more people with high prior ability solve problems ? The chi-square test allows to test whether the observations in two categorical variables can be considered independent.

# table() and chisq.test()

tt <- table(moocs$prior, moocs$solved_problems)
tt
    
     NO.PROBLEM SOME.PROBLEM
  LO       3712          606
  HI       3307          970
# The chi-square tests the independence of X and Y.  H0: the observed and expected number
# of observations are the same

ct <- chisq.test(tt)
ct

    Pearson's Chi-squared test with Yates' continuity correction

data:  tt
X-squared = 106.67, df = 1, p-value < 2.2e-16
ct$observed
    
     NO.PROBLEM SOME.PROBLEM
  LO       3712          606
  HI       3307          970
ct$expected  # expected values if the variables were independent
    
     NO.PROBLEM SOME.PROBLEM
  LO   3526.241     791.7589
  HI   3492.759     784.2411
# Let's plot the observed and expected tallies
par(mfrow = c(1, 2))
plot(as.table(ct$observed), main = "Observed")
plot(as.table(ct$expected), main = "Expected")

ct$residuals  # pearson residuals larger than 1.96 are significantly different from 0 with p<.05
    
     NO.PROBLEM SOME.PROBLEM
  LO   3.128194    -6.601661
  HI  -3.143152     6.633228
# the Pearson residuals, (observed - expected) / sqrt(expected).

# TODO: Do more people with high prior ability watch videos ?

tt <- table(moocs$prior, moocs$watched_videos)
ct <- chisq.test(tt)
ct

    Pearson's Chi-squared test with Yates' continuity correction

data:  tt
X-squared = 91.004, df = 1, p-value < 2.2e-16
ct$residuals
    
      NO.VIDEO SOME.VIDEO
  LO  3.660087  -5.666913
  HI -3.677588   5.694010