Multiple Linear Regression
1. **Problem Statement:**
We are given the hypothesis function for multiple linear regression:
$$h_\theta(x) = \theta_0 + \theta_1 x_1 + \theta_2 x_2 + \cdots + \theta_n x_n$$
and the regularized cost function:
$$J(\theta) = \frac{1}{2m} \left[ \sum_{i=1}^m \left(h_\theta(x^{(i)}) - y^{(i)}\right)^2 + \lambda \sum_{j=1}^n \theta_j^2 \right]$$
Our goal is to derive the partial derivatives $\frac{\partial J}{\partial \theta_j}$ for $j=0$ and $j \geq 1$, and then show how these lead to the gradient descent update rules.
2. **Recall:**
- $m$ is the number of training examples and $n$ is the number of features.
- $\lambda$ is the regularization parameter; $\alpha$ (used in step 4) is the learning rate.
- Regularization does not apply to $\theta_0$.
3. **Deriving Partial Derivatives:** (a numerical check of these formulas appears in the first sketch after the summary)
- For $j=0$ (no regularization term):
$$\frac{\partial J}{\partial \theta_0} = \frac{1}{2m} \cdot 2 \sum_{i=1}^m \left(h_\theta(x^{(i)}) - y^{(i)}\right) \cdot \frac{\partial}{\partial \theta_0} \left(h_\theta(x^{(i)}) - y^{(i)}\right)$$
Since $h_\theta(x^{(i)}) = \theta_0 + \sum_{k=1}^n \theta_k x_k^{(i)}$, the derivative of $h_\theta(x^{(i)})$ with respect to $\theta_0$ is $1$, so:
$$= \frac{1}{m} \sum_{i=1}^m \left(h_\theta(x^{(i)}) - y^{(i)}\right)$$
- For $j \geq 1$ (the regularization term now contributes, and $\frac{\partial}{\partial \theta_j} h_\theta(x^{(i)}) = x_j^{(i)}$):
$$\frac{\partial J}{\partial \theta_j} = \frac{1}{2m} \cdot 2 \sum_{i=1}^m \left(h_\theta(x^{(i)}) - y^{(i)}\right) x_j^{(i)} + \frac{1}{2m} \cdot 2 \lambda \theta_j$$
Simplifying:
$$= \frac{1}{m} \sum_{i=1}^m \left(h_\theta(x^{(i)}) - y^{(i)}\right) x_j^{(i)} + \frac{\lambda}{m} \theta_j$$
4. **Gradient Descent Update Rules:**
- Component form:
- For $j=0$:
$$\theta_0 := \theta_0 - \alpha \frac{1}{m} \sum_{i=1}^m \left(h_\theta(x^{(i)}) - y^{(i)}\right)$$
- For $j \geq 1$:
$$\theta_j := \theta_j - \alpha \left( \frac{1}{m} \sum_{i=1}^m \left(h_\theta(x^{(i)}) - y^{(i)}\right) x_j^{(i)} + \frac{\lambda}{m} \theta_j \right)$$
- Vector form (implemented in the gradient descent sketch at the end of this section):
Let $X$ be the $m \times (n+1)$ design matrix with first column all ones, $\theta$ the parameter vector, and $y$ the output vector.
The hypothesis vector is:
$$h_\theta = X \theta$$
The gradient vector is:
$$\nabla J(\theta) = \frac{1}{m} X^T (X \theta - y) + \frac{\lambda}{m} \begin{bmatrix} 0 \\ \theta_1 \\ \vdots \\ \theta_n \end{bmatrix}$$
The update rule is:
$$\theta := \theta - \alpha \nabla J(\theta)$$
5. **Summary:**
- The partial derivative for $\theta_0$ excludes regularization.
- For $\theta_j$ with $j \geq 1$, regularization adds $\frac{\lambda}{m} \theta_j$.
- Gradient descent updates the parameters by stepping in the direction opposite to the gradient, scaled by the learning rate $\alpha$.
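The formulas from steps 3 and 4 can be sanity-checked numerically. Below is a minimal NumPy sketch on synthetic data that compares the derived gradient against central finite differences of $J(\theta)$; the helper names `cost` and `gradient` and all data values are illustrative assumptions, not part of the derivation.

```python
import numpy as np

def cost(theta, X, y, lam):
    """Regularized cost J(theta) from step 1; theta_0 (theta[0]) is not penalized."""
    m = len(y)
    residual = X @ theta - y                       # h_theta(x^(i)) - y^(i), all examples at once
    return (residual @ residual + lam * np.sum(theta[1:] ** 2)) / (2 * m)

def gradient(theta, X, y, lam):
    """Partial derivatives from step 3: no penalty for theta_0, plus (lambda/m)*theta_j for j >= 1."""
    m = len(y)
    grad = X.T @ (X @ theta - y) / m               # (1/m) * sum_i (h_theta(x^(i)) - y^(i)) * x_j^(i)
    grad[1:] += (lam / m) * theta[1:]
    return grad

# Synthetic check data (illustrative values only).
rng = np.random.default_rng(0)
m, n = 50, 3
X = np.hstack([np.ones((m, 1)), rng.normal(size=(m, n))])   # design matrix, first column all ones
y = rng.normal(size=m)
theta = rng.normal(size=n + 1)
lam = 0.7

# Central finite differences of J should match the derived partial derivatives.
eps = 1e-6
numeric = np.array([
    (cost(theta + eps * e, X, y, lam) - cost(theta - eps * e, X, y, lam)) / (2 * eps)
    for e in np.eye(n + 1)
])
assert np.allclose(numeric, gradient(theta, X, y, lam))
```

The agreement between the finite-difference approximation and the closed-form expressions confirms that the update rules in step 4 move along the true gradient of $J(\theta)$.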
This completes the derivation and explanation of the gradient descent update rules for regularized multiple linear regression.
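For completeness, here is a minimal gradient descent sketch that applies the vector-form update from step 4 to synthetic data. The function name `gradient_descent` and the choices of $\alpha$, $\lambda$, and the iteration count are illustrative assumptions, not prescribed values.

```python
import numpy as np

def gradient_descent(X, y, alpha=0.1, lam=1.0, num_iters=500):
    """Repeat theta := theta - alpha * grad J(theta), with theta_0 left unpenalized."""
    m, n_plus_1 = X.shape
    theta = np.zeros(n_plus_1)
    for _ in range(num_iters):
        grad = X.T @ (X @ theta - y) / m           # (1/m) X^T (X theta - y)
        grad[1:] += (lam / m) * theta[1:]          # add (lambda/m) theta_j for j >= 1 only
        theta -= alpha * grad
    return theta

# Synthetic data generated from known parameters (illustrative values only).
rng = np.random.default_rng(1)
m, n = 100, 2
X = np.hstack([np.ones((m, 1)), rng.normal(size=(m, n))])
true_theta = np.array([2.0, -1.0, 0.5])
y = X @ true_theta + 0.1 * rng.normal(size=m)

theta_hat = gradient_descent(X, y)
print(theta_hat)   # roughly [2.0, -1.0, 0.5]; theta_1 and theta_2 are shrunk slightly by the penalty
```

Because $\lambda/m$ is small in this example, the recovered parameters are only slightly shrunk relative to an unregularized fit; increasing $\lambda$ strengthens the shrinkage of $\theta_1, \ldots, \theta_n$ while leaving $\theta_0$ unpenalized.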