{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "# Example Analyses CS-411" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "import pandas as pd\n", "import numpy as np\n", "import scipy as sp\n", "import statsmodels.api as sm\n", "from statsmodels.formula.api import ols\n", "from scipy.stats.stats import pearsonr\n", "import matplotlib.pyplot as plt\n", "import seaborn as sns\n", "\n", "%matplotlib inline" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Loading data\n", "Let’s load data. The first step is to read a CSV file into a dataframe, which is the data structure used by R to store data." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# TODO: Adapt the path to your own installation\n", "# Set the working directory\n", "datadir = \"~/STATS_CS411/\"\n", "\n", "# Read a csv file\n", "# df is a dataframe, similar to an excel worksheet.\n", "df = pd.read_csv(datadir + \"moocs.basic.csv\", sep=\",\")\n", "\n", "# Prints the first 5 rows of the dataframe. Useful to check if the import has worked. \n", "# We see that there are many missing values signaled by \"NA\" == Not Available, namely \n", "# for variables starting with MOOC_ ... this is because these students never logged in \n", "# the MOOC.\n", "df.head()" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# List the names of the variables\n", "df.columns.tolist()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "# Descriptive Statistics" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "df.describe()" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# To get information about only one variable, use the `.` sign to chose a variable then call describe on it\n", "df.MOOC_NVideosWatched.describe()" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "df.MOOC_PercentageVideosWatched.mean()" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "df.MOOC_PercentageVideosWatched.std()" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "df.MOOC_PercentageVideosWatched.median()" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "df.MOOC_PercentageVideosWatched.max()" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "df.MOOC_PercentageVideosWatched.min()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Computing new categorical variables " ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "conditions = [\n", " (df['MOOC_NVideosWatched'] > 0),\n", " (df['MOOC_NProblemsSolved'] > 0)\n", "]\n", "choices = ['MOOC', 'MOOC']\n", "\n", "df['didMOOC'] = np.select(conditions, choices, default='NO.MOOC')\n", "df['didMOOC'].value_counts()" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "conditions = [\n", " (df['MOOC_NVideosWatched'] > 0)\n", "]\n", "choices = ['SOME.VIDEO']\n", "\n", "df['watchedVideos'] = np.select(conditions, choices, default='NO.VIDEO')\n", "df['watchedVideos'].value_counts()" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "conditions = [\n", " (df['MOOC_NProblemsSolved'] > 0)\n", "]\n", "choices = ['SOME.PROBLEM']\n", "\n", "df['solvedProblems'] = np.select(conditions, choices, default='NO.PROBLEM')\n", "df['solvedProblems'].value_counts()" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "pd.crosstab(df.solvedProblems, df.watchedVideos)" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# Let's also replace the missing values with 0 for the percentage of videos watched and the percentage of \n", "# problems solved\n", "df.MOOC_PercentageVideosWatched = df.MOOC_PercentageVideosWatched.replace(np.NaN,0)\n", "df.MOOC_PercentageProblemsSolved = df.MOOC_PercentageProblemsSolved.replace(np.NaN,0)\n", "df.MOOC_AverageMaxVideosWatched = df.MOOC_AverageMaxVideosWatched.replace(np.NaN,0)\n", "df.MOOC_Grade[df.MOOC_Grade<0] = 0" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# TODO:\n", "# Define a categorical variable name 'MOOC' that has 3 levels with:\n", "# 1) NONE = \"did not use the mooc at all\"\n", "# 2) WATCH = \"used the mooc only to watch videos\"\n", "# 3) DO = \"did some assignments in the mooc\"\n", "\n", "# Begin Solution\n", "\n", "# End Solution \n", "\n", "df['MOOC'].value_counts()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "# Median split and quartile splits\n" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "df.EPFL_BacGrade.hist(bins=16)\n", "_median = df.EPFL_BacGrade.median()\n", "plt.plot([_median,_median],[0,2700], 'r')" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# TODO: Define two categorical variables named 'prior' and 'prior4' that distinguishes\n", "# students given their grade at the baccalaureat.\n", "\n", "# 1) prior. \n", "# The values of the variable are \"HI\"\" for strong students and \"LO\"\" for weakers students.\n", "# => use median() to define high and low levels of ability\n", "\n", "# 2) prior4 \n", "# Levels are called XLO, LO, HI, XHI\n", "# => use quantile() and cut() to definelevels \n", "\n" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Subsetting the data\n", "\n", "Since we will analyse BacLevel later on, we select a subset of the dataset that includes only observations for which we have the Baccalaureat level.\n" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "n_missing = df['EPFL_BacLevel'].notnull().value_counts()[False]\n", "msg = \"We have : {} missing values \\n\"\n", "print( msg.format( n_missing ))\n", "\n", "# Let's subset the data and keep only observations for which EPFL_BacLevel is not null\n", "df = df[df['EPFL_BacLevel'].notnull()]\n", "print(\"After subsetting, we now have %s valid observations\" % df.shape[0])" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "# Basic plotting" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "plt.figure(figsize=(16,10))\n", "plt.subplot(2,2,1)\n", "sns.distplot(df[\"EPFL_CourseGrade\"], kde=False)\n", "plt.xlabel('Grade')\n", "plt.title('Course Grades')\n", "\n", "plt.subplot(2,2,2)\n", "sns.distplot(df[\"MOOC_PercentageVideosWatched\"], kde=False)\n", "plt.xlabel('Videos')\n", "plt.title('Percentage of videos watched')\n", "\n", "plt.subplot(2,2,3)\n", "sns.distplot(df[\"MOOC_Grade\"], kde=False)\n", "plt.xlabel('Grade')\n", "plt.title('MOOC Grade')\n", "\n", "plt.subplot(2,2,4)\n", "sns.distplot(df[\"MOOC_AverageMaxVideosWatched\"], kde=False)\n", "plt.xlabel('Videos')\n", "plt.title('Max number of videos')\n", "\n", "plt.show()" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "plt.figure(figsize=(16,4))\n", "plt.subplot(1,2,1)\n", "sns.boxplot(x=df[\"MOOC_PercentageVideosWatched\"])\n", "plt.subplot(1,2,2)\n", "sns.boxplot(x=\"MOOC\", y=\"EPFL_CourseGrade\", data=df)\n", "plt.show()" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "plt.scatter(x=df.EPFL_CourseGrade, y=df.MOOC_PercentageVideosWatched)\n", "plt.xlabel('CourseGrade')\n", "plt.ylabel('PercentageVideosWatched')\n", "plt.show()" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# TODO Create a plot with seaborn regplot that checks whether the course grade is linked with the \n", "# grade obtained at the baccalaureat.\n", "# Plot this separately for eah EPFL_Topic\n" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "# Basic Statistics" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Measuring correlation" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "correlation, pvalue = pearsonr(df.EPFL_CourseGrade, df.MOOC_PercentageVideosWatched)\n", "\n", "print('correlation:', correlation)\n", "print('p-value :', pvalue)" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# TODO: find whether the CourseGrade is correlated with the BacGrade" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The output contains the Pearson correlation coefficiant (cor) as well as a p-value that is testing whether this correlation is statistically different from 0. In our case, p << .05 which indicates that the correlation is positive.\n", "\n", "The same package can also be used to compute the Spearman rank correlation which is used whenever the distributions of the variables are not normal (which seems to be the case here for moocs$MOOC_PercentageVideosWatched). In such a case the Spearman rank correlation might be more appropriate (see http://geoinfo.amu.edu.pl/qg/archives/2011/QG302_087-093.pdf).\n", "\n", "The difference bewteen a rank correlation and pearson correlation is well illustrated here: https://en.wikipedia.org/wiki/Spearman%27s_rank_correlation_coefficient " ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "sp.stats.spearmanr(df.EPFL_CourseGrade,df.MOOC_PercentageVideosWatched)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "# Normality Assumption\n", "Testing normality visually with quantile plots. If the variable is normally distributed, the observations would follow the line drawn by qqline. We see from the plots below that the distribution of the variable “MOOC_PercentageVideosWatched” departs very strongly from a normal distribution." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "f, (ax1, ax2) = plt.subplots(1,2)\n", "df.MOOC_PercentageVideosWatched.hist(ax=ax1)\n", "sm.qqplot(df.MOOC_PercentageVideosWatched, line='s', ax=ax2)\n", "plt.show()" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "f, (ax1, ax2) = plt.subplots(1,2)\n", "df.EPFL_CourseGrade.hist(ax=ax1)\n", "sm.qqplot(df.EPFL_CourseGrade, line='s', ax=ax2)\n", "plt.show()" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# TODO: Is Bac Grade normally distributed ?\n", "# Plot a historgram and a qq plot" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Normality can also be tested statistically, however the normality tests are rather sensitive to small deviations.\n", "\n", "For example, the Shapiro-Wilks test, is testing the null hypothesis that the variable is normally distributed. In this case the adequate p.value is .1. Hence, with p <<.1 we have to reject the null hypothesis and conclude that the variable is not normally distributed. This test work with samples smaller than 5000 observations.\n", "\n", "An alternative to shapiro is the Kolmogorov-Smirnov test which allows to test whether two samples were drawn from the same distribution. This allows to compare our observations with a sample that follows a normal distibution. The test for EPFL_BacGrade (D=0.04, p<.00) shows that the distribution of sample is unlikely to stem from a normal distribution." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Chi^2 test\n", "Simply looking at the number of obsevations in the contingency table does not allow to judge whether there is a strong imbalance of strong/weak students in the MOOC use categies. A chi-square test will tell us whether these two variables can be considered independent.\n", "\n", "The assumptions to conduct a chi-square test are:\n", "* observations are independent (i.e. each person appears only in one cell of the contingency table)\n", "* expected counts are larger than 5 in each cell" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "observed = pd.crosstab(df['EPFL_BacLevel'], df['MOOC'])\n", "observed" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "from scipy.stats import chisquare, chi2_contingency\n", "from scipy.stats.contingency import margins\n", "\n", "# Running the test\n", "chi2, p, ddof, expected = chi2_contingency(observed)\n", "\n", "msg = \"Test Statistic: {}\\np-value: {}\\nDegrees of Freedom: {}\\n\"\n", "print( msg.format( chi2, p, ddof ) )" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# Residuals are given by the formula below and \n", "# indicate how much our observations depart from the expected number of observations\n", "def pearson_residuals(observed, expected):\n", " return (observed - expected) / np.sqrt(expected)\n", "\n", "res = pearson_residuals(observed,expected)\n", "res" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# Standardised residuals \n", "# Residuals with an absolute value larger than 1.96 are considered to contribute significantly \n", "# to the imbalance of observations and hence to the significance of the chi-square test)\n", "# See some explanations here https://pdfs.semanticscholar.org/1ca1/2da97e5068cc2d783ed8fdf8f3c5fce3b136.pdf\n", "\n", "def adjusted_residuals(observed, expected):\n", " n = observed.sum()\n", " rsum, csum = margins(observed)\n", " v = csum * rsum * (n - rsum) * (n - csum) / n**3\n", " return (observed - expected) / np.sqrt(v)\n", "\n", "adjusted_residuals(np.array(observed),np.array(expected))" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "We can also visualise the deviance by plotting the residuals. It becomes clear that LO ability students tend to DO less than HI students and that the inverse is true for not using the MOOC (NONE). There is no large difference with regards to VIEWing the MOOC." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "res.plot(kind='bar')\n", "plt.show()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "# ANOVA - Analysis of Variance\n", "Comparing means across categories. We use the Analysis of Variance (ANOVA) to compare the mean of a variable across several categorical variables. We start by looking at the means of the grade for these different groups, using a pivot table." ] }, { "cell_type": "code", "execution_count": null, "metadata": { "scrolled": true }, "outputs": [], "source": [ "pd.pivot_table(df,\n", " values=['EPFL_CourseGrade'],\n", " index=['EPFL_BacLevel'],\n", " columns=['MOOC'],\n", " aggfunc=np.mean)" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# Doing pairwise t-test to compare each variable\n", "print(sm.stats.ttest_ind(df.EPFL_CourseGrade[df.MOOC == 'NONE'], df.EPFL_CourseGrade[df.MOOC == 'DO']))\n", "print(sm.stats.ttest_ind(df.EPFL_CourseGrade[df.MOOC == 'DO'], df.EPFL_CourseGrade[df.MOOC == 'WATCH']))\n", "print(sm.stats.ttest_ind(df.EPFL_CourseGrade[df.MOOC == 'WATCH'], df.EPFL_CourseGrade[df.MOOC == 'NONE']))" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "grade_lm=ols('EPFL_CourseGrade ~ MOOC', data=df).fit()\n", "print(sm.stats.anova_lm(grade_lm, typ=1))" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "# Assumptions to be met before conducting an ANOVA\n", "\n", "## 1) Data is normally distributed in each of the groups\n", "\n", "Let's start with the normality of the dependent in all subgroups. The QQ plost below do not show huge deviations, except maybe for the \"DO\" category. \n" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "f, (ax1, ax2, ax3) = plt.subplots(1,3)\n", "f.set_size_inches(12,6)\n", "\n", "sm.qqplot(df['EPFL_CourseGrade'][df['MOOC'] == 'NONE'], line='s', ax=ax1)\n", "ax1.set_title('NONE')\n", "\n", "sm.qqplot(df['EPFL_CourseGrade'][df['MOOC'] == 'WATCH'], line='s', ax=ax2)\n", "ax2.set_title('WATCH')\n", "\n", "sm.qqplot(df['EPFL_CourseGrade'][df['MOOC'] == 'DO'], line='s', ax=ax3)\n", "ax3.set_title('DO')\n", "\n", "plt.show()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## 2) Variances are equal in both groups\n", "\n", "The bartlett test allows to compare the variance across groups. \n", "\n", "* H0: var1= var2 ... = varn\n", "* H1: var1 != var2 ... != varn\n", "\n", "We see from the results that the test is significant for both MOOC and EPFL_BacLevel which indicates an unequality of variances. The use of tests to check assumptions should encourage you to investigate more and try to understand the deviations in the data. " ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "sp.stats.bartlett(df['EPFL_CourseGrade'][df['MOOC'] == 'NONE'],\n", " df['EPFL_CourseGrade'][df['MOOC'] == 'WATCH'],\n", " df['EPFL_CourseGrade'][df['MOOC'] == 'DO'])" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "(\n", " df['EPFL_CourseGrade'][df['MOOC'] == 'NONE'].var(),\n", " df['EPFL_CourseGrade'][df['MOOC'] == 'WATCH'].var(),\n", " df['EPFL_CourseGrade'][df['MOOC'] == 'DO'].var()\n", ")" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "sns.boxplot(x=\"EPFL_CourseGrade\", y=\"MOOC\", data=df)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "So what now ? We have seen some minor deviations from normality, and some unequal variances. \n", "* We can decide that these are minor and report the analysis of variance results.\n", "* As an alternative we can use a more robust statistical method, e.g. a non-parametric equivalent for ANOVA, e.g the Kruskall-Wallis test for one independent variable. But this test would not allow to test for interactions." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# A Kruskall test on the MOOC variable\n", "sp.stats.kruskal(df['EPFL_CourseGrade'][df['MOOC'] == 'NONE'],\n", " df['EPFL_CourseGrade'][df['MOOC'] == 'WATCH'],\n", " df['EPFL_CourseGrade'][df['MOOC'] == 'DO'])" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "# ANOVA - Exercise" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# TODO\n", "# make an anova / or another test to find out high ability students (prior or prior4) get better grades (EPFL_CourseGrade) \n", "# check the assumptions" ] } ], "metadata": { "kernelspec": { "display_name": "Python [conda env:anaconda]", "language": "python", "name": "conda-env-anaconda-py" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.6.5" } }, "nbformat": 4, "nbformat_minor": 2 }