Linear Regressions
1. **Problem Statement:**
We need to fit three linear regressions on a dataset of countries. Surface Area is the independent variable in every regression; Population and Sex Ratio are the dependent variables. For each regression we calculate the slope, intercept, correlation coefficient $r$, and coefficient of determination $r^2$.
2. **Formulas and Important Rules:**
- The linear regression line is given by $$y = mx + b$$ where $m$ is the slope and $b$ is the intercept.
- Slope formula: $$m = \frac{n\sum xy - \sum x \sum y}{n\sum x^2 - (\sum x)^2}$$
- Intercept formula: $$b = \frac{\sum y - m \sum x}{n}$$
- Correlation coefficient $r$: $$r = \frac{n\sum xy - \sum x \sum y}{\sqrt{(n\sum x^2 - (\sum x)^2)(n\sum y^2 - (\sum y)^2)}}$$
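- Coefficient of determination: $$r^2 = r \times r$$ For a simple linear regression, $r^2$ is the fraction of the variance in $y$ that is explained by the linear relationship with $x$.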
- Important: $x$ is the independent variable (Surface Area) and $y$ is the dependent variable (Population or Sex Ratio).
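The formulas above translate directly into code. The following is a minimal sketch, not a definitive implementation: the function name `linear_regression` and the toy points are assumptions made for illustration, and the toy points are not taken from the country dataset.

```python
from math import sqrt


def linear_regression(xs, ys):
    """Return (slope, intercept, r, r_squared) using the sum formulas."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("xs and ys must have the same length (at least 2)")

    # The sums named in the formulas
    sum_x = sum(xs)
    sum_y = sum(ys)
    sum_xy = sum(x * y for x, y in zip(xs, ys))
    sum_x2 = sum(x * x for x in xs)
    sum_y2 = sum(y * y for y in ys)

    sxy = n * sum_xy - sum_x * sum_y   # numerator shared by m and r
    sxx = n * sum_x2 - sum_x ** 2      # denominator of m
    syy = n * sum_y2 - sum_y ** 2
    if sxx == 0 or syy == 0:
        raise ValueError("x and y must each take at least two distinct values")

    m = sxy / sxx
    b = (sum_y - m * sum_x) / n
    r = sxy / sqrt(sxx * syy)
    return m, b, r, r * r


# Made-up toy points lying exactly on y = 2x + 1 (not real country data)
xs = [1.0, 2.0, 3.0, 4.0]
ys = [3.0, 5.0, 7.0, 9.0]
m, b, r, r2 = linear_regression(xs, ys)
```

For the three regressions, `xs` would hold the Surface Area values and `ys` the Population or Sex Ratio values.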
3. **Linear Regression 1: Surface Area vs Population**
- Using all countries, calculate sums: $\sum x$, $\sum y$, $\sum xy$, $\sum x^2$, $\sum y^2$, and $n$ (number of countries).
- Compute slope $m_1$, intercept $b_1$, correlation $r_1$, and $r_1^2$.
- Equation of line: $$Population = m_1 \times SurfaceArea + b_1$$
4. **Linear Regression 2: Surface Area vs Sex Ratio**
- Using all countries, repeat the same calculations with $y$ as Sex Ratio.
- Compute slope $m_2$, intercept $b_2$, correlation $r_2$, and $r_2^2$.
- Equation of line: $$SexRatio = m_2 \times SurfaceArea + b_2$$
5. **Linear Regression 3: Surface Area vs Sex Ratio (excluding Bahrain, Kuwait, Saudi Arabia, Qatar, Oman, UAE)**
- Remove these six countries from the dataset.
- Repeat calculations for slope $m_3$, intercept $b_3$, correlation $r_3$, and $r_3^2$.
- Equation of line: $$SexRatio = m_3 \times SurfaceArea + b_3$$
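The exclusion step can be sketched as a simple filter applied before fitting. This is only an illustration under stated assumptions: the row layout `(country, surface_area, sex_ratio)`, the helper `fit`, and every numeric value are placeholders rather than real statistics, and the country names must match the spelling used in the actual dataset (for example, "United Arab Emirates" instead of "UAE"). The helper uses the centered form of the slope and correlation formulas, which is algebraically equivalent to the sum form.

```python
from math import sqrt

# Countries dropped for Regression 3 (names as written in the problem)
EXCLUDED = {"Bahrain", "Kuwait", "Saudi Arabia", "Qatar", "Oman", "UAE"}


def fit(points):
    """Least-squares fit for (x, y) pairs; returns (slope, intercept, r)."""
    n = len(points)
    mean_x = sum(x for x, _ in points) / n
    mean_y = sum(y for _, y in points) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in points)
    sxx = sum((x - mean_x) ** 2 for x, _ in points)
    syy = sum((y - mean_y) ** 2 for _, y in points)
    m = sxy / sxx
    return m, mean_y - m * mean_x, sxy / sqrt(sxx * syy)


# Placeholder rows: (country, surface area, sex ratio). NOT real statistics.
rows = [
    ("Country A", 100.0, 98.0),
    ("Country B", 200.0, 101.0),
    ("Country C", 300.0, 99.0),
    ("Country D", 400.0, 102.0),
    ("Qatar", 50.0, 250.0),
    ("Kuwait", 60.0, 140.0),
]

all_points = [(x, y) for _, x, y in rows]
kept_points = [(x, y) for name, x, y in rows if name not in EXCLUDED]

m_all, b_all, r_all = fit(all_points)   # corresponds to Regression 2
m_3, b_3, r_3 = fit(kept_points)        # corresponds to Regression 3
```

With these placeholder values, the two excluded rows are enough to flip the sign of the slope, which is why the fit is worth repeating without them.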
6. **Graphs:**
- For each regression, plot a scatterplot of the data points.
- Draw the line of best fit using the calculated $m$ and $b$.
- Label axes: "Surface Area (1000 km^2)" on x-axis, "Population (millions)" or "Sex Ratio (males per 100 females)" on y-axis.
- Display the equation of the line on the graph.
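One possible way to produce such a graph is sketched below. It assumes matplotlib is available as the plotting library, which the problem does not specify; the helper names `equation_label`, `line_endpoints`, and `plot_regression` are hypothetical. The slope and intercept are taken as already computed.

```python
def equation_label(m, b, y_name, x_name="SurfaceArea"):
    """Format the fitted line as text for display on the graph."""
    sign = "+" if b >= 0 else "-"
    return f"{y_name} = {m:.4g} * {x_name} {sign} {abs(b):.4g}"


def line_endpoints(xs, m, b):
    """Two points on the line of best fit spanning the observed x range."""
    lo, hi = min(xs), max(xs)
    return (lo, hi), (m * lo + b, m * hi + b)


def plot_regression(xs, ys, m, b, y_label, y_name, filename):
    """Scatterplot plus line of best fit, saved to an image file."""
    # Imported here so the helpers above work without matplotlib installed
    import matplotlib
    matplotlib.use("Agg")  # render to a file; no display window needed
    import matplotlib.pyplot as plt

    fig, ax = plt.subplots()
    ax.scatter(xs, ys, label="countries")
    line_x, line_y = line_endpoints(xs, m, b)
    ax.plot(line_x, line_y, color="red", label=equation_label(m, b, y_name))
    ax.set_xlabel("Surface Area (1000 km^2)")
    ax.set_ylabel(y_label)
    ax.legend()  # the legend entry shows the equation of the line
    fig.savefig(filename)
    plt.close(fig)
```

A call such as `plot_regression(xs, ys, m_1, b_1, "Population (millions)", "Population", "regression1.png")` would cover the first regression; the file name is arbitrary.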
7. **Error Variance Discussion:**
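- Linear regression assumes constant error variance (homoscedasticity): the residuals (differences between actual and predicted $y$) should have roughly the same spread across all values of $x$.
- For the population regression, populations range from very small to very large, so the spread of the residuals is unlikely to stay constant. This is heteroscedasticity (non-constant variance).
- Heteroscedasticity does not by itself bias the least-squares slope and intercept, but it makes the estimates less precise and makes the usual standard errors, confidence intervals, and significance tests unreliable.
- A few very large countries can produce very large residuals and pull the fitted line toward themselves.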
- Remedies include transforming variables (e.g., log scale) or using weighted regression.
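A rough way to see the effect of a log transformation is to compare the residual spread in the lower and upper halves of the $x$ range before and after transforming. The sketch below is a crude illustration, not a formal heteroscedasticity test: the data are synthetic (noise proportional to $y$), and the helper names `fit_line` and `residual_spread_ratio` are hypothetical.

```python
from math import log


def fit_line(xs, ys):
    """Least-squares slope and intercept (centered form of the formulas)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    m = sxy / sxx
    return m, mean_y - m * mean_x


def residual_spread_ratio(xs, ys):
    """Mean |residual| in the upper half of x divided by the lower half."""
    m, b = fit_line(xs, ys)
    pairs = sorted(zip(xs, ys))  # order points by x
    resid = [abs(y - (m * x + b)) for x, y in pairs]
    half = len(resid) // 2
    lower, upper = resid[:half], resid[half:]
    return (sum(upper) / len(upper)) / (sum(lower) / len(lower))


# Synthetic data: y is about 2x, alternately 10% above and 10% below,
# so the size of the error grows with x.
xs = [float(i) for i in range(1, 11)]
ys = [2.0 * x * (1.1 if i % 2 == 0 else 0.9) for i, x in enumerate(xs)]

raw_ratio = residual_spread_ratio(xs, ys)
log_ratio = residual_spread_ratio([log(x) for x in xs], [log(y) for y in ys])
```

A ratio well above 1 on the raw scale indicates residuals that grow with $x$; a ratio near 1 after taking logs indicates that the transformation has evened out the spread for this synthetic example.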
**Final answers:**
- Linear Regression 1: slope $m_1$, intercept $b_1$, $r_1$, $r_1^2$ with equation $$Population = m_1 \times SurfaceArea + b_1$$
- Linear Regression 2: slope $m_2$, intercept $b_2$, $r_2$, $r_2^2$ with equation $$SexRatio = m_2 \times SurfaceArea + b_2$$
- Linear Regression 3: slope $m_3$, intercept $b_3$, $r_3$, $r_3^2$ with equation $$SexRatio = m_3 \times SurfaceArea + b_3$$ (excluding 6 countries)
Because of the size of the dataset, the exact numeric values must be computed from the data with software or a statistical calculator.