MATH-493: Lab 2a, Cloud Seeding Data, question about summary of clouds.lm

Hello,

As the title says, I have a question about the Cloud Seeding Data exercise in Lab 2a.

In this exercise, we define a model formula: clouds.formula <- rainfall ~ seeding + seeding:(sne + cloudcover + prewetness + echomotion) + time, and then fit a linear regression with it: clouds.lm <- lm(clouds.formula, data = clouds)

I have questions about the output of summary(clouds.lm). Why is there a coefficient seedingyes but no coefficient seedingno, and why not simply a single coefficient seeding? Also, why are there two coefficients seedingno:echomotionstationary and seedingyes:echomotionstationary, but no seedingno:echomotionmoving or seedingyes:echomotionmoving?

If you would rather not explain it here, feel free to point me to documentation that covers this. I am having difficulty finding documentation that explains this behaviour.

Thank you very much in advance!

Re: Lab 2a, Cloud Seeding Data, question about summary of clouds.lm

by Hsin-Ping Wu - Tuesday, 7 March 2023, 02:38

Hi,

Nice observation.
A very short answer: they are the reference levels, and their contribution is absorbed into the intercept term.
Let's focus on seedingno and seedingyes and ignore the echomotionmoving part for a moment.
In effect we are fitting two regression lines, one for each level of seeding: seedingno and seedingyes. (If you did the exercise all the way to the last part, you will have seen two lines on the graph.)

Say we assign 0 to seedingno and 1 to seedingyes. This is called "dummy coding" and is a standard way to handle categorical predictors in a linear model.
So seeding = 0 for seedingno and seeding = 1 for seedingyes.
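
To see the dummy coding directly, here is a small sketch on a made-up factor (toy values, not the clouds data). It only assumes base R; model.matrix() shows the columns that lm() would build for the predictor.

```r
# Toy factor with the same two levels as in the lab (values are made up).
seeding <- factor(c("no", "yes", "yes", "no"), levels = c("no", "yes"))

# Design matrix that lm() would use for a model with only this predictor.
X <- model.matrix(~ seeding)
print(X)

# The first level ("no") is the reference level and gets no column of its own.
print(colnames(X))  # "(Intercept)" "seedingyes"
```

The seedingyes column is 1 for the "yes" rows and 0 for the "no" rows, which is the 0/1 coding described above.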

Part of your model is as follows:
rainfall = beta0 + beta1*seeding + beta2*time + beta3*(1-seeding)*sne + beta4*seeding*sne + ...

beta0 is the intercept term, beta1 is the coefficient for seeding, beta2 is the coefficient for time, etc. When seeding = 0 (seedingno), the slope for sne is beta3, while when seeding = 1 (seedingyes), the slope for sne is beta4, and so on.

When seeding = 0 (seedingno), beta1*0 = 0, so
the predicted rainfall = -0.34624 - 0.04497*time + 0.41981*sne + ...

In contrast, when seeding = 1 (seedingyes),
the predicted rainfall = -0.34624 + 15.68293 - 0.04497*time - 2.77738*sne + ...

So you don't need another coefficient for seedingno as it is coded as 0 here.
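
If you want to check this by hand, here is a minimal sketch on simulated data (not the clouds data, so the numbers will differ from the lab). The data frame toy and its values are invented for illustration; the formula keeps only the seeding, sne and time parts of the lab model.

```r
set.seed(1)
n <- 24

# Simulated stand-in for the lab data (all values are made up).
toy <- data.frame(
  seeding = factor(rep(c("no", "yes"), each = n / 2), levels = c("no", "yes")),
  sne     = runif(n, 1, 5),
  time    = sample(0:80, n)
)
toy$rainfall <- 1 + 2 * (toy$seeding == "yes") + 0.5 * toy$sne + rnorm(n)

fit <- lm(rainfall ~ seeding + seeding:sne + time, data = toy)
b <- coef(fit)
print(names(b))

# One new observation per seeding level, same sne and time.
new_no  <- data.frame(seeding = factor("no",  levels = c("no", "yes")), sne = 3, time = 10)
new_yes <- data.frame(seeding = factor("yes", levels = c("no", "yes")), sne = 3, time = 10)

# seedingno: no seeding term is added, and the sne slope is seedingno:sne.
by_hand_no <- b[["(Intercept)"]] + b[["time"]] * 10 + b[["seedingno:sne"]] * 3

# seedingyes: add the seedingyes shift, and the sne slope is seedingyes:sne.
by_hand_yes <- b[["(Intercept)"]] + b[["seedingyes"]] +
  b[["time"]] * 10 + b[["seedingyes:sne"]] * 3

# Both hand computations should agree with predict().
print(all.equal(by_hand_no,  unname(predict(fit, new_no))))
print(all.equal(by_hand_yes, unname(predict(fit, new_yes))))
```

The coefficient names have the same pattern as in summary(clouds.lm): a seedingyes term, but nothing for seedingno.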

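The same applies to echomotionmoving: moving is the reference level of echomotion, so within each seeding group it is coded as 0 and gets no coefficient of its own.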
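
A small sketch of which columns R builds for the seeding:echomotion part, using a made-up grid with one row per combination of the two factors (the name cells is just for illustration). It also shows that the choice of reference level is a convention: relevel() makes stationary the reference instead.

```r
# One row for each combination of the two factors (toy grid, no response needed).
cells <- expand.grid(
  seeding    = c("no", "yes"),
  echomotion = c("moving", "stationary")
)

# Columns for the seeding and seeding:echomotion terms with R's default coding.
X <- model.matrix(~ seeding + seeding:echomotion, data = cells)
print(colnames(X))
# "(Intercept)" "seedingyes"
# "seedingno:echomotionstationary" "seedingyes:echomotionstationary"

# Make "stationary" the reference level instead of "moving".
cells$echomotion <- relevel(cells$echomotion, ref = "stationary")
X2 <- model.matrix(~ seeding + seeding:echomotion, data = cells)
print(colnames(X2))
```

After relevel(), the columns are named after echomotionmoving instead, but there are still only two of them.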

Try writing down the model. It should help you understand the table from the summary(clouds.lm).

If you still have questions, feel free to ask us TAs in the lab session or ask Darlene during her office hours.

Re: Lab 2a, Cloud Seeding Data, question about summary of clouds.lm

by Darlene Goldstein - Tuesday, 7 March 2023, 21:42

Hello - for factor variables (here cloud seeding yes/no), the number of terms in the model is the number of factor levels (here 2) minus 1 (so 2 - 1 = 1). This is because the design matrix X for the model becomes rank deficient (i.e., has linearly dependent columns) if you include a column for every level in addition to the intercept. Why is this important? Because to get the least squares estimator we need to compute
(X^T X)^(-1) X^T y, and if X is rank deficient then X^T X cannot be inverted, so the least squares estimator is not uniquely defined.
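
The rank deficiency can be checked numerically. This is a sketch on a made-up factor; X_full is a hand-built design matrix with an intercept plus one indicator column per level, which is not what lm() builds by default.

```r
seeding <- factor(c("no", "yes", "yes", "no"), levels = c("no", "yes"))

# Intercept plus an indicator column for EVERY level (hand-built on purpose).
X_full <- cbind(
  intercept = 1,
  no  = as.numeric(seeding == "no"),
  yes = as.numeric(seeding == "yes")
)

# The two indicator columns add up to the intercept column.
print(all(X_full[, "no"] + X_full[, "yes"] == X_full[, "intercept"]))

# Three columns, but the rank is smaller than that.
rank_full <- qr(X_full)$rank
print(c(rank = rank_full, columns = ncol(X_full)))

# R's default coding drops one level, so the matrix has full column rank.
X_ref <- model.matrix(~ seeding)
rank_ref <- qr(X_ref)$rank
print(c(rank = rank_ref, columns = ncol(X_ref)))

# Trying to invert X^T X for the rank-deficient matrix fails with an error.
inversion <- tryCatch(solve(crossprod(X_full)), error = function(e) e)
print(inherits(inversion, "error"))
```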

It is similar for the interaction terms. They do not have to be chosen the way R chooses them by default, but the important point is that the number of estimable interaction coefficients equals the number of interaction degrees of freedom. That is, there are constraints on the interactions, so they cannot all be estimated freely.
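
As an illustration that the coding is a choice, here is a sketch on simulated data (made up, not the clouds data) that fits the same two-factor model with R's default treatment contrasts and with sum-to-zero contrasts. The individual coefficients differ, but the number of coefficients and the fitted values should be the same.

```r
set.seed(2)

# Three simulated observations per combination of the two factors.
toy <- expand.grid(
  seeding    = c("no", "yes"),
  echomotion = c("moving", "stationary"),
  replicate  = 1:3
)
toy$rainfall <- rnorm(nrow(toy))

# Default coding (treatment contrasts, first level as reference).
fit_treatment <- lm(rainfall ~ seeding * echomotion, data = toy)

# Same model with sum-to-zero contrasts for both factors.
fit_sum <- lm(rainfall ~ seeding * echomotion, data = toy,
              contrasts = list(seeding = "contr.sum", echomotion = "contr.sum"))

print(coef(fit_treatment))
print(coef(fit_sum))

# Same number of coefficients, same fitted values.
print(length(coef(fit_treatment)) == length(coef(fit_sum)))
print(all.equal(fitted(fit_treatment), fitted(fit_sum)))
```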

Does this make sense?

Best regards,
Darlene