## Correlation doesn't imply Causation

But it may, if there is a good independent reason to think that X influences Y.

## Regression implies X causes Y

$Y = \text{slope} \times X + \text{intercept}$

• Both variables are continuous
• TONS of data are suitable for this kind of analysis.
• Examples?

## Simple Linear Model

Simplest way two variables can be modeled as related to one another

$Y_i = \beta_0 + \beta_1X_i + \epsilon_i$

• $$\beta_0$$ is the intercept (the value of $$Y$$ where $$X = 0$$)
• $$\beta_1$$ is the slope, expressing $$\Delta Y / \Delta X$$
• $$\epsilon_i$$ is the error term
  • a normal random variable with mean 0 and variance $$\sigma^2$$
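One way to see how this model works is to simulate data from it. This is a minimal sketch; the parameter values for $$\beta_0$$, $$\beta_1$$, and $$\sigma$$ are illustrative assumptions, not from the text:

```python
import numpy as np

rng = np.random.default_rng(42)

# Assumed parameter values, chosen only for illustration
beta0, beta1, sigma = 2.0, 0.5, 1.0
x = np.linspace(0, 10, 100)

# epsilon_i: a normal random variable with mean 0 and variance sigma^2
eps = rng.normal(0, sigma, size=x.size)

# Y_i = beta_0 + beta_1 * X_i + epsilon_i
y = beta0 + beta1 * x + eps
```

Each simulated $$Y_i$$ is the value on the line plus one random draw from the error term.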

## This equation should make sense

$Y_i = \beta_0 + \beta_1X_i + \epsilon_i$

Once you decide on a line, the value of Y equals:

• the value predicted by the line, plus
• a random error from our error term

## Finding the best line

• Once we decide on linear regression, we have to find the best line
• There is unexplained variation in Y, so the points don't fall on a straight line (why?)
• Many lines pass through $$(\bar{X},\bar{Y})$$. How do we pick the best one?

## Finding the best line - residuals

A residual is the distance between the actual value and the value predicted by the regression line.

Another way to think about the residual: a single value from the normally distributed error term.

The squared residual is calculated as:

$d_i^2=(Y_i - \hat{Y}_i)^2$
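The residual calculation can be sketched numerically. The data points, intercept, and slope below are hypothetical values chosen only for illustration:

```python
import numpy as np

# Hypothetical observations and a candidate fitted line (values are illustrative)
x = np.array([1.0, 2.0, 3.0, 4.0])
y = np.array([2.1, 3.9, 6.2, 7.8])
beta0, beta1 = 0.1, 1.95            # assumed intercept and slope

y_hat = beta0 + beta1 * x           # predicted values from the line
residuals = y - y_hat               # d_i = Y_i - Y_hat_i
squared_residuals = residuals ** 2  # d_i^2
```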

## Finding the best line

The best line minimizes the residual sum of squares

$RSS=\sum\limits_{i=1}^n(Y_i - \hat{Y}_i)^2$

We could do a Monte Carlo approach: try a bunch of slopes passing through $$(\bar{X},\bar{Y})$$ and calculate the RSS, then pick the smallest, but math offers a simpler solution.
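The Monte Carlo idea above can be sketched directly: try many candidate slopes through $$(\bar{X},\bar{Y})$$, keep the one with the smallest RSS, and compare it to the closed-form least-squares slope. The simulated data and true parameter values are assumptions for illustration:

```python
import numpy as np

rng = np.random.default_rng(0)

# Simulated data: true intercept 1, true slope 2 (illustrative values)
x = rng.uniform(0, 10, 50)
y = 1 + 2 * x + rng.normal(0, 1, 50)

xbar, ybar = x.mean(), y.mean()

def rss(slope):
    """RSS for a line with the given slope passing through (xbar, ybar)."""
    y_hat = ybar + slope * (x - xbar)
    return np.sum((y - y_hat) ** 2)

# Monte Carlo-style search: try many candidate slopes, keep the smallest RSS
candidates = np.linspace(0, 4, 2001)
best_slope = min(candidates, key=rss)

# Closed-form least-squares slope, for comparison
ls_slope = np.sum((x - xbar) * (y - ybar)) / np.sum((x - xbar) ** 2)
```

The brute-force search lands essentially on the closed-form answer, which is the "simpler solution" the math provides.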

## Variances and Covariances

Recall the sum of squares: $\ SS_Y = \sum\limits_{i=1}^n(Y_i - \bar{Y})^2$

Sample variance: $\ s^2_Y=\frac{\sum\limits_{i=1}^n(Y_i - \bar{Y})^2}{n-1}$

SS equivalent to: $\ SS_Y = \sum\limits_{i=1}^n(Y_i - \bar{Y})(Y_i - \bar{Y})$
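These two quantities are straightforward to check numerically; the sample below is hypothetical:

```python
import numpy as np

y = np.array([3.0, 5.0, 4.0, 8.0])  # a hypothetical sample

ss_y = np.sum((y - y.mean()) ** 2)  # sum of squares SS_Y
s2_y = ss_y / (len(y) - 1)          # sample variance, n - 1 in the denominator
```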

## Variances and Covariances

With 2 variables, we can define sum of cross products $\ SS_{XY} = \sum\limits_{i=1}^n(X_i - \bar{X})(Y_i - \bar{Y})$

By analogy to the sample variance, we define sample covariance $\ s_{XY} = \frac{\sum\limits_{i=1}^n(X_i - \bar{X})(Y_i - \bar{Y})}{(n-1)}$
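A quick numerical sketch of the sample covariance, using hypothetical paired samples, which can be checked against numpy's built-in `np.cov`:

```python
import numpy as np

# Hypothetical paired samples
x = np.array([1.0, 2.0, 3.0, 4.0])
y = np.array([2.0, 4.0, 5.0, 9.0])

# Sum of cross products SS_XY
ss_xy = np.sum((x - x.mean()) * (y - y.mean()))

# Sample covariance, n - 1 in the denominator
s_xy = ss_xy / (len(x) - 1)
```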

## Sample Covariance - Negative

Unlike variance, covariance can be negative: it ranges from $$-\infty$$ to $$\infty$$.

$\ s_{XY} = \frac{\sum\limits_{i=1}^n(X_i - \bar{X})(Y_i - \bar{Y})}{(n-1)}$