House Price Regression
1. **Problem Statement:**
We want to build a linear regression model to predict house prices based on size (sq ft) and age (years) using the given data.
2. **Model Form:**
The multiple linear regression model is:
$$\text{Price} = \beta_0 + \beta_1 \times \text{Size} + \beta_2 \times \text{Age} + \epsilon$$
where $\beta_0$ is the intercept, $\beta_1$ and $\beta_2$ are coefficients for size and age respectively, and $\epsilon$ is the error term.
3. **Data:**
| Size (sq ft) | Age (years) | Price (\$1000s) |
|------|-----|-------|
| 1500 | 5 | 300 |
| 1600 | 10 | 280 |
| 1700 | 15 | 260 |
| 1800 | 20 | 240 |
| 1900 | 25 | 220 |
4. **Step: Calculate means**
$$\bar{x}_1 = \frac{1500+1600+1700+1800+1900}{5} = 1700$$
$$\bar{x}_2 = \frac{5+10+15+20+25}{5} = 15$$
$$\bar{y} = \frac{300+280+260+240+220}{5} = 260$$
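The means can be checked with a few lines of Python. This is a minimal sketch using only the standard library; the variable names (`size`, `age`, `price`, `x1_bar`, and so on) are illustrative choices, not part of the original problem.

```python
from statistics import fmean

# Data copied from the table above.
size = [1500, 1600, 1700, 1800, 1900]   # sq ft
age = [5, 10, 15, 20, 25]               # years
price = [300, 280, 260, 240, 220]       # price column as given

# Sample means of each column.
x1_bar = fmean(size)
x2_bar = fmean(age)
y_bar = fmean(price)

print(x1_bar, x2_bar, y_bar)
```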
5. **Step: Calculate coefficients using least squares formulas**
Calculate sums:
$$S_{x_1x_1} = \sum (x_1 - \bar{x}_1)^2 = 100000$$
$$S_{x_2x_2} = \sum (x_2 - \bar{x}_2)^2 = 250$$
$$S_{x_1x_2} = \sum (x_1 - \bar{x}_1)(x_2 - \bar{x}_2) = (-200)(-10) + (-100)(-5) + 0 + (100)(5) + (200)(10) = 5000$$
$$S_{x_1y} = \sum (x_1 - \bar{x}_1)(y - \bar{y}) = (-200)(40) + (-100)(20) + 0 + (100)(-20) + (200)(-40) = -20000$$
$$S_{x_2y} = \sum (x_2 - \bar{x}_2)(y - \bar{y}) = (-10)(40) + (-5)(20) + 0 + (5)(-20) + (10)(-40) = -1000$$
6. **Step: Solve for coefficients $\beta_1$ and $\beta_2$**
Using matrix form:
$$\begin{bmatrix} S_{x_1x_1} & S_{x_1x_2} \\ S_{x_1x_2} & S_{x_2x_2} \end{bmatrix} \begin{bmatrix} \beta_1 \\ \beta_2 \end{bmatrix} = \begin{bmatrix} S_{x_1y} \\ S_{x_2y} \end{bmatrix}$$
$$\begin{bmatrix} 100000 & 5000 \\ 5000 & 250 \end{bmatrix} \begin{bmatrix} \beta_1 \\ \beta_2 \end{bmatrix} = \begin{bmatrix} -20000 \\ -1000 \end{bmatrix}$$
Calculate determinant:
$$D = 100000 \times 250 - 5000 \times 5000 = 25000000 - 25000000 = 0$$
Since the determinant is zero, the matrix is singular, which indicates perfect multicollinearity: size and age are exactly linearly dependent in this dataset.
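The centered sums and the determinant can be recomputed directly from the table. The sketch below uses plain Python; the helper `centered_sum` and the short names `s11`, `s12`, and so on are hypothetical names introduced only for this illustration.

```python
# Data copied from the table above.
size = [1500, 1600, 1700, 1800, 1900]   # sq ft
age = [5, 10, 15, 20, 25]               # years
price = [300, 280, 260, 240, 220]       # price column as given


def mean(values):
    return sum(values) / len(values)


def centered_sum(u, v):
    """Return the sum of (u_i - mean(u)) * (v_i - mean(v))."""
    u_bar, v_bar = mean(u), mean(v)
    return sum((a - u_bar) * (b - v_bar) for a, b in zip(u, v))


s11 = centered_sum(size, size)    # S_{x1 x1}
s22 = centered_sum(age, age)      # S_{x2 x2}
s12 = centered_sum(size, age)     # S_{x1 x2}
s1y = centered_sum(size, price)   # S_{x1 y}
s2y = centered_sum(age, price)    # S_{x2 y}

# Determinant of the 2x2 normal-equation matrix [[s11, s12], [s12, s22]].
det = s11 * s22 - s12 ** 2

print(s11, s22, s12, s1y, s2y)
print("determinant:", det)
```

A determinant of zero here means the 2x2 system has no unique solution, which is the singularity described above.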
7. **Interpretation:**
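The size and age variables are perfectly linearly related in this data: age increases by exactly 5 years for every 100 sq ft increase in size, i.e. $\text{Age} = 0.05 \times \text{Size} - 70$. Because of this, the data cannot separate the two effects, and we cannot estimate unique coefficients for both.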
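To make the non-uniqueness concrete, the sketch below checks that age is an exact linear function of size and then shows several different coefficient triples that all reproduce the observed prices. The parametrization in `coefficients(t)` is an illustrative construction derived from the table, not a standard formula, and the chosen values of `t` are arbitrary.

```python
# Data copied from the table above.
size = [1500, 1600, 1700, 1800, 1900]   # sq ft
age = [5, 10, 15, 20, 25]               # years
price = [300, 280, 260, 240, 220]       # price column as given

# Age = 0.05 * Size - 70, checked in integer arithmetic (20 * Age = Size - 1400).
exact_dependence = all(20 * a == s - 1400 for s, a in zip(size, age))


def coefficients(t):
    """One-parameter family (b0, b1, b2); t is the weight placed on age."""
    return 600 + 70 * t, -0.2 - 0.05 * t, t


def max_abs_error(b0, b1, b2):
    """Largest absolute gap between predicted and observed price."""
    return max(abs(b0 + b1 * s + b2 * a - y)
               for s, a, y in zip(size, age, price))


errors = []
for t in (0.0, -2.0, -4.0):
    b0, b1, b2 = coefficients(t)
    err = max_abs_error(b0, b1, b2)
    errors.append(err)
    print(f"b0={b0:.1f}, b1={b1:.2f}, b2={b2:.1f}, max error={err:.2e}")

print("age is an exact linear function of size:", exact_dependence)
```

Every triple in this family fits the five rows equally well, so least squares has no basis for preferring one over another.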
8. **Alternative approach:**
Use simple linear regression with one variable, for example size:
Calculate slope:
$$\beta_1 = \frac{S_{x_1y}}{S_{x_1x_1}} = \frac{-20000}{100000} = -0.2$$
Intercept:
$$\beta_0 = \bar{y} - \beta_1 \bar{x}_1 = 260 - (-0.2)(1700) = 260 + 340 = 600$$
9. **Final model:**
$$\text{Price} = 600 - 0.2 \times \text{Size}$$
As a check, this line reproduces the observed prices, for example $600 - 0.2 \times 1500 = 300$ and $600 - 0.2 \times 1900 = 220$.
10. **Interpretation of coefficients:**
- Intercept $600$: predicted price when size is zero (not meaningful practically but part of the model).
- Slope $-0.2$: for each additional square foot, predicted price decreases by 0.2 (in \$1000s), which is counterintuitive and likely due to the confounding effect of age.
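The simple regression can be reproduced with a short helper. This is a sketch under the same least-squares formulas used above; `fit_simple` is a hypothetical helper name. Fitting on age alone is included only to illustrate the confounding: in this dataset either predictor explains the prices equally well.

```python
# Data copied from the table above.
size = [1500, 1600, 1700, 1800, 1900]   # sq ft
age = [5, 10, 15, 20, 25]               # years
price = [300, 280, 260, 240, 220]       # price column as given


def fit_simple(x, y):
    """Least-squares fit of y = intercept + slope * x."""
    n = len(x)
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    sxy = sum((a - x_bar) * (b - y_bar) for a, b in zip(x, y))
    sxx = sum((a - x_bar) ** 2 for a in x)
    slope = sxy / sxx
    intercept = y_bar - slope * x_bar
    return intercept, slope


# Regression of price on size, as in the steps above.
b0, b1 = fit_simple(size, price)
residuals = [y - (b0 + b1 * x) for x, y in zip(size, price)]

# Regression of price on age alone, for comparison.
a0, a1 = fit_simple(age, price)

print(f"price ~ size: intercept={b0:.1f}, slope={b1:.2f}")
print(f"price ~ age:  intercept={a0:.1f}, slope={a1:.2f}")
print("largest residual (size model):", max(abs(r) for r in residuals))
```

Both one-variable fits describe the same five points, so the negative slope on size cannot be read as the effect of size alone.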
**Summary:** Because size and age are perfectly linearly dependent in this dataset, a multiple regression model cannot be fit uniquely. A simple regression on size shows a negative relationship, but this is misleading: size and age move in lockstep here, so the effect of each cannot be separated using this data alone.