"Give me six hours to chop down a tree and I will spend the first four sharpening the axe." Abraham Lincoln
Chain Rule. Let $\mathbf{f}:\mathbb{R}^n \to \mathbb{R}^m$ and $\mathbf{g}:\mathbb{R}^m \to \mathbb{R}^p$ be vector-valued functions, and assume that f is differentiable at $x \in \operatorname{dom}(\mathbf{f})$ and g is differentiable at $f(x) \in \operatorname{dom}(\mathbf{g})$. Then the composition $\mathbf{h}:\mathbb{R}^n \to \mathbb{R}^p$ defined by $\mathbf{h}(x) = \mathbf{g}(\mathbf{f}(x))$ is also differentiable at x, and its derivative can be expressed as $\mathbf{Dh(x) = Dg(f(x))Df(x)}$. In particular, if m = p = 1, h is a scalar-valued function from $\mathbb{R}^n \to \mathbb{R}$ that is differentiable at x, and its gradient is given by the expression $\nabla \mathbf{h}(x) = \mathbf{g}'(\mathbf{f}(x))\nabla \mathbf{f}(x)$, where $\mathbf{g}'(\mathbf{f}(x))$ is just the ordinary derivative of g (a single-variable function) evaluated at f(x).
$Df(x)$ is an m×n matrix (the Jacobian of f at x). $Dg(f(x))$ is a p×m matrix (the Jacobian of g at the point f(x)). Thus, their product is a p×n matrix, matching the dimension of $Dh(x)$.
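The dimension bookkeeping can be checked numerically. The following is a minimal sketch, not a definitive implementation: the functions `f` and `g` are hypothetical examples chosen only so that n = 3, m = 2 and p = 4, and `jacobian`, `matmul` and `shape` are illustrative helpers that approximate Jacobians by central differences.

```python
import math

def f(x):
    # Hypothetical f: R^3 -> R^2, so n = 3 and m = 2.
    x1, x2, x3 = x
    return [x1 * x2, math.sin(x3)]

def g(u):
    # Hypothetical g: R^2 -> R^4, so p = 4.
    u1, u2 = u
    return [u1 + u2, u1 * u2, u1 ** 2, math.exp(u2)]

def h(x):
    # The composition h = g o f, a map from R^3 to R^4.
    return g(f(x))

def jacobian(func, x, eps=1e-6):
    """Approximate the Jacobian by central differences (rows: outputs, columns: inputs)."""
    columns = []
    for j in range(len(x)):
        forward, backward = list(x), list(x)
        forward[j] += eps
        backward[j] -= eps
        columns.append([(a - b) / (2 * eps)
                        for a, b in zip(func(forward), func(backward))])
    return [list(row) for row in zip(*columns)]  # transpose columns into rows

def matmul(A, B):
    # Plain matrix product of two lists of rows.
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*B)] for row in A]

def shape(M):
    return (len(M), len(M[0]))

x = [0.5, -1.2, 0.3]
Df = jacobian(f, x)          # m x n
Dg = jacobian(g, f(x))       # p x m, evaluated at the point f(x)
Dh_chain = matmul(Dg, Df)    # p x n, via the Chain Rule
Dh_direct = jacobian(h, x)   # p x n, differentiating h directly
max_err = max(abs(a - b)
              for ra, rb in zip(Dh_chain, Dh_direct)
              for a, b in zip(ra, rb))

print(shape(Df), shape(Dg), shape(Dh_chain))
print(max_err)
```

Both routes to $Dh(x)$ should agree up to the finite-difference error, and the shapes follow the m×n, p×m, p×n pattern described above.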
The Chain Rule essentially states that the overall change in h is a combination of the changes in f and g, propagated through the composition. If m = p = 1, the small change in h for a small change in x is governed by how f changes in each direction (the gradient $\nabla \mathbf{f}(x)$), scaled by the rate of change of g at f(x).
Examples:
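Here, consistent with the derivatives below, $\mathbf{f}(x) = (\begin{smallmatrix}\sin(x)\\ \cos(x)\end{smallmatrix})$ and $g(u, v) = u^2 + v^2$, so that $h(x) = g(\mathbf{f}(x)) = \sin^2(x) + \cos^2(x) = 1$. Let’s compute $\mathbf{Dh(x) = Dg(f(x))Df(x)}$. The derivative $Df(x)$ with respect to x is an m×n = 2×1 matrix (a column vector): $Df(x) = (\begin{smallmatrix}\cos(x)\\ -\sin(x)\end{smallmatrix})$. The derivative $Dg(f(x))$ is a p×m = 1×2 matrix (a row vector): $Dg(f(x)) = (\begin{smallmatrix}2\sin(x) & 2\cos(x)\end{smallmatrix})$.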
Apply the Chain Rule: $\mathbf{Dh(x) = Dg(f(x))Df(x)} = (\begin{smallmatrix}2\sin(x) & 2\cos(x)\end{smallmatrix})(\begin{smallmatrix}\cos(x)\\ -\sin(x)\end{smallmatrix}) = 2\sin(x)\cos(x)+2\cos(x)(-\sin(x)) = 0$ (a p×n = 1×1 matrix, i.e. a real value). Indeed, since h(x) = 1 is constant, its derivative is zero.
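This result can be spot-checked with a few lines of Python. It is a sketch that assumes $f(x) = (\sin x, \cos x)$ and $g(u, v) = u^2 + v^2$, the functions that the Jacobians above correspond to. The helper names `Df`, `Dg` and `Dh` are illustrative only.

```python
import math

def Df(x):
    # 2x1 Jacobian of the assumed f(x) = (sin x, cos x).
    return [[math.cos(x)], [-math.sin(x)]]

def Dg(u, v):
    # 1x2 Jacobian of the assumed g(u, v) = u^2 + v^2.
    return [[2 * u, 2 * v]]

def Dh(x):
    # Chain Rule: (1x2 row) times (2x1 column) gives a single real number.
    row = Dg(math.sin(x), math.cos(x))[0]
    column = [r[0] for r in Df(x)]
    return sum(a * b for a, b in zip(row, column))

# Evaluate the product at a few arbitrary sample points.
values = [Dh(x) for x in (0.0, 0.7, 2.0, -3.1)]
print(values)
```

Every entry of `values` should be zero up to floating-point rounding, matching the fact that h is constant.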
Let $\mathbf{f}(x, y) = (f_1(x, y), f_2(x, y))$ with $f_1(x, y)=x^2+y^2, f_2(x, y) = x -y$. Then $\frac{\partial f_1}{\partial x} = 2x, \frac{\partial f_1}{\partial y} = 2y, \frac{\partial f_2}{\partial x} = 1, \frac{\partial f_2}{\partial y} = -1$, so $D\mathbf{f}(x, y) = (\begin{smallmatrix}2x & 2y\\ 1 & -1\end{smallmatrix})$, an m×n = 2×2 matrix whose rows are the gradients of $f_1$ and $f_2$, respectively.
Taking $g(u, v) = u + v$, which matches the Jacobian used here, $D\mathbf{g}(\mathbf{f}(x, y)) = (\begin{smallmatrix}1 & 1\end{smallmatrix})$ (a p×m = 1×2 row vector). Apply the Chain Rule: $\mathbf{Dh(x, y) = Dg(f(x, y))Df(x, y)} = (\begin{smallmatrix}1 & 1\end{smallmatrix})(\begin{smallmatrix}2x & 2y\\ 1 & -1\end{smallmatrix}) = (\begin{smallmatrix}2x + 1 & 2y -1\end{smallmatrix})$, a 1×2 row vector.
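As a sanity check, the row vector $(2x + 1,\ 2y - 1)$ can be compared with a finite-difference gradient of h. This sketch assumes $g(u, v) = u + v$, which is consistent with the Jacobian $(1\ 1)$ used above, so that $h(x, y) = x^2 + y^2 + x - y$. The function names and the sample point are illustrative.

```python
def h(x, y):
    # h = g(f(x, y)) with f = (x^2 + y^2, x - y) and the assumed g(u, v) = u + v.
    u, v = x * x + y * y, x - y
    return u + v

def grad_chain(x, y):
    # Result of the Chain Rule computation: (1 1) times the 2x2 Jacobian of f.
    return (2 * x + 1, 2 * y - 1)

def grad_numeric(x, y, eps=1e-6):
    # Central-difference approximation of the two partial derivatives of h.
    dx = (h(x + eps, y) - h(x - eps, y)) / (2 * eps)
    dy = (h(x, y + eps) - h(x, y - eps)) / (2 * eps)
    return (dx, dy)

point = (1.5, -0.5)
chain = grad_chain(*point)
numeric = grad_numeric(*point)
print(chain, numeric)
```

At the sample point the Chain Rule gives (4.0, -2.0), and the numerical gradient should agree with it up to rounding error.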
Let $f(\vec{x}) = \sum_{i=1}^{n} e^{x_i}$ and $g(y) = \ln(y)$. The composition function $\mathbf{h}:\mathbb{R}^n \to \mathbb{R}$ is defined as $\mathbf{h}(\vec{x}) = \mathbf{g}(\mathbf{f}(\vec{x})) = \ln(\sum_{i=1}^{n} e^{x_i})$. By the Chain Rule, $\nabla \mathbf{h}(\vec{x}) = \mathbf{g}'(\mathbf{f}(\vec{x}))\nabla \mathbf{f}(\vec{x}) = \frac{1}{\sum_{i=1}^{n} e^{x_i}}\biggl(\begin{smallmatrix}e^{x_1}\\ e^{x_2}\\ \vdots \\ e^{x_n}\end{smallmatrix}\biggr)$.
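The gradient formula for this log-sum-exp function can be verified numerically. The sketch below uses only the standard library. The names `logsumexp`, `grad_logsumexp` and `grad_numeric` are illustrative, and the sample vector is arbitrary. Since each component is $e^{x_i}$ divided by the sum of all the exponentials, the components are positive and add up to 1.

```python
import math

def logsumexp(x):
    # h(x) = ln(sum_i exp(x_i)); no overflow protection, fine for small inputs.
    return math.log(sum(math.exp(xi) for xi in x))

def grad_logsumexp(x):
    # Chain Rule result: g'(f(x)) * grad f(x) = exp(x_i) / sum_j exp(x_j).
    total = sum(math.exp(xi) for xi in x)
    return [math.exp(xi) / total for xi in x]

def grad_numeric(func, x, eps=1e-6):
    # Central-difference approximation of the gradient of a scalar function.
    grad = []
    for j in range(len(x)):
        forward, backward = list(x), list(x)
        forward[j] += eps
        backward[j] -= eps
        grad.append((func(forward) - func(backward)) / (2 * eps))
    return grad

x = [0.1, -1.0, 2.0]
analytic = grad_logsumexp(x)
numeric = grad_numeric(logsumexp, x)
max_err = max(abs(a - b) for a, b in zip(analytic, numeric))
print(analytic, sum(analytic), max_err)
```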
Let $f: \mathbf{R}^n \to \mathbf{R}, f(\vec{x}) = \sum_{i=1}^{n} x_i^{2} = ||\vec{x}||_2^{2}$ and $g: [0, \infty) \to \mathbf{R}, g(y) = \sqrt{y}.$ Then the composition $h(\vec{x}) = g(f(\vec{x})) = ||\vec{x}||_2$ is the l2 norm, also known as the Euclidean norm (it measures the length or magnitude of a vector in a Euclidean space). f is differentiable on ℝⁿ and g is differentiable on the open interval (0, ∞), so the Chain Rule applies wherever $f(\vec{x}) > 0$: $\nabla \mathbf{h}(\vec{x}) = \mathbf{g}'(\mathbf{f}(\vec{x}))\nabla \mathbf{f}(\vec{x}) = \frac{1}{2\sqrt{||\vec{x}||_2^{2}}}2\vec{x} = \frac{\vec{x}}{||\vec{x}||_2}$ for every non-zero vector $\vec{x} \ne \vec{0}$.
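The gradient of the Euclidean norm is therefore the unit vector pointing in the direction of $\vec{x}$. A minimal sketch of this, with illustrative helper names and an arbitrary sample vector, could look as follows. The zero vector is rejected explicitly because the formula does not apply there.

```python
import math

def norm(x):
    # Euclidean (l2) norm: sqrt of the sum of squares.
    return math.sqrt(sum(xi * xi for xi in x))

def grad_norm(x):
    # Chain Rule result x / ||x||, valid only for non-zero x.
    length = norm(x)
    if length == 0.0:
        raise ValueError("the Euclidean norm is not differentiable at the zero vector")
    return [xi / length for xi in x]

def grad_numeric(func, x, eps=1e-6):
    # Central-difference approximation of the gradient of a scalar function.
    grad = []
    for j in range(len(x)):
        forward, backward = list(x), list(x)
        forward[j] += eps
        backward[j] -= eps
        grad.append((func(forward) - func(backward)) / (2 * eps))
    return grad

x = [3.0, 4.0]
analytic = grad_norm(x)              # approximately [0.6, 0.8]
numeric = grad_numeric(norm, x)
max_err = max(abs(a - b) for a, b in zip(analytic, numeric))
print(analytic, norm(analytic), max_err)
```

For the vector (3, 4), whose norm is 5, the gradient is approximately (0.6, 0.8), and its own norm is 1.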