Multiple Linear Regression
1. **Problem Statement:**
We are given the hypothesis function for multiple linear regression:
$$h_\theta(x) = \theta_0 + \theta_1 x_1 + \theta_2 x_2 + \cdots + \theta_n x_n$$
and the regularized cost function:
$$J(\theta) = \frac{1}{2m} \left[ \sum_{i=1}^m \left(h_\theta(x^{(i)}) - y^{(i)}\right)^2 + \lambda \sum_{j=1}^n \theta_j^2 \right]$$
Our goal is to derive the partial derivatives $\frac{\partial J}{\partial \theta_j}$ for $j=0$ and $j \geq 1$, and then show how these lead to the gradient descent update rules.
2. **Recall:**
- $m$ is the number of training examples and $n$ is the number of features.
- $\lambda$ is the regularization parameter; $\alpha$ (used in step 4) is the learning rate.
- Regularization does not apply to $\theta_0$.
3. **Deriving Partial Derivatives:** (a numerical check of these formulas appears in the first sketch after the summary)
- For $j=0$ (no regularization term):
$$\frac{\partial J}{\partial \theta_0} = \frac{1}{2m} \cdot 2 \sum_{i=1}^m \left(h_\theta(x^{(i)}) - y^{(i)}\right) \cdot \frac{\partial}{\partial \theta_0} \left(h_\theta(x^{(i)}) - y^{(i)}\right)$$
Since $h_\theta(x^{(i)}) = \theta_0 + \sum_{k=1}^n \theta_k x_k^{(i)}$, the derivative of $h_\theta(x^{(i)})$ with respect to $\theta_0$ is $1$, so:
$$= \frac{1}{m} \sum_{i=1}^m \left(h_\theta(x^{(i)}) - y^{(i)}\right)$$
- For $j \geq 1$ (the regularization term now contributes, and $\frac{\partial}{\partial \theta_j} h_\theta(x^{(i)}) = x_j^{(i)}$):
$$\frac{\partial J}{\partial \theta_j} = \frac{1}{2m} \cdot 2 \sum_{i=1}^m \left(h_\theta(x^{(i)}) - y^{(i)}\right) x_j^{(i)} + \frac{1}{2m} \cdot 2 \lambda \theta_j$$
Simplifying:
$$= \frac{1}{m} \sum_{i=1}^m \left(h_\theta(x^{(i)}) - y^{(i)}\right) x_j^{(i)} + \frac{\lambda}{m} \theta_j$$
4. **Gradient Descent Update Rules:**
- Component form:
- For $j=0$:
$$\theta_0 := \theta_0 - \alpha \frac{1}{m} \sum_{i=1}^m \left(h_\theta(x^{(i)}) - y^{(i)}\right)$$
- For $j \geq 1$:
$$\theta_j := \theta_j - \alpha \left( \frac{1}{m} \sum_{i=1}^m \left(h_\theta(x^{(i)}) - y^{(i)}\right) x_j^{(i)} + \frac{\lambda}{m} \theta_j \right)$$
- Vector form (implemented in the gradient descent sketch at the end of this section):
Let $X$ be the $m \times (n+1)$ design matrix with first column all ones, $\theta$ the parameter vector, and $y$ the output vector.
The hypothesis vector is:
$$h_\theta = X \theta$$
The gradient vector is:
$$\nabla J(\theta) = \frac{1}{m} X^T (X \theta - y) + \frac{\lambda}{m} \begin{bmatrix} 0 \\ \theta_1 \\ \vdots \\ \theta_n \end{bmatrix}$$
The update rule is:
$$\theta := \theta - \alpha \nabla J(\theta)$$
5. **Summary:**
- The partial derivative for $\theta_0$ excludes regularization.
- For $\theta_j$ with $j \geq 1$, regularization adds $\frac{\lambda}{m} \theta_j$.
- Gradient descent updates the parameters by stepping in the direction opposite to the gradient, scaled by the learning rate $\alpha$.
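The formulas from steps 3 and 4 can be sanity-checked numerically. Below is a minimal NumPy sketch on synthetic data that compares the derived gradient against central finite differences of $J(\theta)$; the helper names `cost` and `gradient` and all data values are illustrative assumptions, not part of the derivation.

```python
import numpy as np

def cost(theta, X, y, lam):
    """Regularized cost J(theta) from step 1; theta_0 (theta[0]) is not penalized."""
    m = len(y)
    residual = X @ theta - y                       # h_theta(x^(i)) - y^(i), all examples at once
    return (residual @ residual + lam * np.sum(theta[1:] ** 2)) / (2 * m)

def gradient(theta, X, y, lam):
    """Partial derivatives from step 3: no penalty for theta_0, plus (lambda/m)*theta_j for j >= 1."""
    m = len(y)
    grad = X.T @ (X @ theta - y) / m               # (1/m) * sum_i (h_theta(x^(i)) - y^(i)) * x_j^(i)
    grad[1:] += (lam / m) * theta[1:]
    return grad

# Synthetic check data (illustrative values only).
rng = np.random.default_rng(0)
m, n = 50, 3
X = np.hstack([np.ones((m, 1)), rng.normal(size=(m, n))])   # design matrix, first column all ones
y = rng.normal(size=m)
theta = rng.normal(size=n + 1)
lam = 0.7

# Central finite differences of J should match the derived partial derivatives.
eps = 1e-6
numeric = np.array([
    (cost(theta + eps * e, X, y, lam) - cost(theta - eps * e, X, y, lam)) / (2 * eps)
    for e in np.eye(n + 1)
])
assert np.allclose(numeric, gradient(theta, X, y, lam))
```

The agreement between the finite-difference approximation and the closed-form expressions confirms that the update rules in step 4 move along the true gradient of $J(\theta)$.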
This completes the derivation and explanation of the gradient descent update rules for regularized multiple linear regression.
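For completeness, here is a minimal gradient descent sketch that applies the vector-form update from step 4 to synthetic data. The function name `gradient_descent` and the choices of $\alpha$, $\lambda$, and the iteration count are illustrative assumptions, not prescribed values.

```python
import numpy as np

def gradient_descent(X, y, alpha=0.1, lam=1.0, num_iters=500):
    """Repeat theta := theta - alpha * grad J(theta), with theta_0 left unpenalized."""
    m, n_plus_1 = X.shape
    theta = np.zeros(n_plus_1)
    for _ in range(num_iters):
        grad = X.T @ (X @ theta - y) / m           # (1/m) X^T (X theta - y)
        grad[1:] += (lam / m) * theta[1:]          # add (lambda/m) theta_j for j >= 1 only
        theta -= alpha * grad
    return theta

# Synthetic data generated from known parameters (illustrative values only).
rng = np.random.default_rng(1)
m, n = 100, 2
X = np.hstack([np.ones((m, 1)), rng.normal(size=(m, n))])
true_theta = np.array([2.0, -1.0, 0.5])
y = X @ true_theta + 0.1 * rng.normal(size=m)

theta_hat = gradient_descent(X, y)
print(theta_hat)   # roughly [2.0, -1.0, 0.5]; theta_1 and theta_2 are shrunk slightly by the penalty
```

Because $\lambda/m$ is small in this example, the recovered parameters are only slightly shrunk relative to an unregularized fit; increasing $\lambda$ strengthens the shrinkage of $\theta_1, \ldots, \theta_n$ while leaving $\theta_0$ unpenalized.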