"Give me six hours to chop down a tree and I will spend the first four sharpening the axe." – Abraham Lincoln

Definition (differentiability at a point). Let $f : \mathbb{R}^n \to \mathbb{R}^m$ be a function and let $x$ be an interior point of the domain of $f$, $x \in \operatorname{int}(\operatorname{dom} f)$. The function $f$ is differentiable at $x$ if there exists a matrix $Df(x) \in \mathbb{R}^{m \times n}$ that satisfies $\lim_{\substack{z \in \operatorname{dom} f \\ z \neq x,\, z \to x}} \frac{\|f(z) - f(x) - Df(x)(z-x)\|_2}{\|z-x\|_2} = 0$
This matrix Df(x) is called the derivative or the Jacobian matrix of f at the point x.
Definition. A function $f$ is called differentiable if its domain $\operatorname{dom} f \subseteq \mathbb{R}^n$ is open and $f$ is differentiable at every point of its domain ($\forall x \in \operatorname{dom} f$).
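The limit in the definition can be checked numerically: the error of the linear approximation should shrink faster than $\|z - x\|_2$. A minimal sketch, using a test function $f(x, y) = (x^2 + y^2,\, x - y)$ and its hand-computed Jacobian (both chosen here for illustration):

```python
import numpy as np

def f(v):
    # f : R^2 -> R^2, f(x, y) = (x**2 + y**2, x - y)
    x, y = v
    return np.array([x**2 + y**2, x - y])

def jacobian(v):
    # Hand-computed Jacobian of f at v
    x, y = v
    return np.array([[2 * x, 2 * y], [1.0, -1.0]])

x = np.array([1.0, 2.0])
Df = jacobian(x)
# The ratio ||f(z) - f(x) - Df(x)(z - x)|| / ||z - x|| should shrink as z -> x
for eps in (1e-1, 1e-2, 1e-3):
    z = x + eps * np.array([1.0, 1.0]) / np.sqrt(2)
    ratio = np.linalg.norm(f(z) - f(x) - Df @ (z - x)) / np.linalg.norm(z - x)
    print(f"eps = {eps:.0e}: ratio = {ratio:.2e}")
```

For this quadratic $f$ the ratio decreases linearly with the step size, confirming that the candidate matrix really is the derivative.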

Figure. For f(x,y)=x²+y², the red plane at (1,1) is the Jacobian’s linear approximation.
The Chain Rule is one of the most fundamental and powerful tools in calculus. It governs how derivatives behave under composition of functions — whether in single-variable or multivariable settings. In higher dimensions, it elegantly expresses the derivative of a composite function as the matrix product of the Jacobians of the component functions.
Chain Rule. Let:
- $f : \mathbb{R}^n \to \mathbb{R}^m$ be differentiable at $x \in \operatorname{int}(\operatorname{dom} f)$,
- $g : \mathbb{R}^m \to \mathbb{R}^p$ be differentiable at $f(x) \in \operatorname{int}(\operatorname{dom} g)$.

Define the composite function $h : \mathbb{R}^n \to \mathbb{R}^p$ by $h(x) = g(f(x))$.
Then h is also differentiable at x, and its total derivative (Jacobian matrix) at x is given by $\mathbf{Dh(x) = Dg(f(x))Df(x)}$. This matrix equation is known as the multivariable chain rule.
Let’s verify the dimensions to ensure consistency: $Df(x)$ is an m×n matrix (the Jacobian of f at x), and $Dg(f(x))$ is a p×m matrix (the Jacobian of g at the point f(x)). Their product is therefore a p×n matrix, matching the dimension of $Dh(x)$.

The chain rule states that the overall change in h is a combination of the changes in f and g, propagated through the composition. Think of h = g∘f as a two-stage process: first f maps a small change in x to a change in the intermediate value f(x); then g maps that intermediate change to a change in the output. To first order each stage acts through its Jacobian, so the two linear maps compose as a matrix product.

In particular, if m = p = 1, then $f:\mathbb{R}^n \to \mathbb{R}, g:\mathbb{R} \to \mathbb{R}$, and the chain rule reduces to: $\nabla (g \circ f)(x) = g'(f(x))\nabla f(x)$, where $g'(f(x))$ is the ordinary derivative of g evaluated at f(x) and $\nabla f(x)$ is the n-vector of partial derivatives. In this case the small change in h for a small change in x is governed by how f changes in each direction (the gradient $\nabla f(x)$), scaled by the rate of change of g at f(x).
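The identity $Dh(x) = Dg(f(x))\,Df(x)$ can be checked numerically with finite differences. A minimal sketch, using made-up smooth test functions $f : \mathbb{R}^2 \to \mathbb{R}^3$ and $g : \mathbb{R}^3 \to \mathbb{R}^2$ (any differentiable choices would do):

```python
import numpy as np

def f(v):
    # Test function f : R^2 -> R^3
    x, y = v
    return np.array([x * y, np.sin(x), x + y**2])

def g(w):
    # Test function g : R^3 -> R^2
    a, b, c = w
    return np.array([a + b * c, np.exp(c)])

def num_jac(func, v, eps=1e-6):
    # Central-difference Jacobian: one column per input coordinate
    v = np.asarray(v, dtype=float)
    cols = []
    for i in range(v.size):
        e = np.zeros_like(v)
        e[i] = eps
        cols.append((func(v + e) - func(v - e)) / (2 * eps))
    return np.column_stack(cols)

x = np.array([0.7, -1.3])
h = lambda v: g(f(v))
lhs = num_jac(h, x)                      # Dh(x), a 2x2 matrix
rhs = num_jac(g, f(x)) @ num_jac(f, x)   # Dg(f(x)) Df(x), also 2x2
print(np.allclose(lhs, rhs, atol=1e-5))
```

The two 2×2 matrices agree up to finite-difference error, illustrating both the identity and the dimension count p×m times m×n equals p×n.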
Example. Let $f : \mathbb{R} \to \mathbb{R}^2$, $f(x) = (\sin(x), \cos(x))$, and $g : \mathbb{R}^2 \to \mathbb{R}$, $g(y_1, y_2) = y_1^2 + y_2^2$, so that $h(x) = g(f(x)) = \sin^2(x) + \cos^2(x)$. Let’s compute $\mathbf{Dh(x) = Dg(f(x))Df(x)}$. The derivative $Df(x)$ with respect to x is an m×n (here 2×1) matrix, a column vector: $Df(x) = (\begin{smallmatrix}\cos(x)\\\\ -\sin(x)\end{smallmatrix})$. The derivative $Dg(f(x))$ is a p×m (here 1×2) matrix, a row vector: $Dg(f(x)) = (\begin{smallmatrix}2\sin(x) & 2\cos(x)\end{smallmatrix})$.
Apply the Chain Rule: $\mathbf{Dh(x) = Dg(f(x))Df(x)} = (\begin{smallmatrix}2\sin(x) & 2\cos(x)\end{smallmatrix})(\begin{smallmatrix}\cos(x)\\\\ -\sin(x)\end{smallmatrix}) = 2\sin(x)\cos(x) + 2\cos(x)(-\sin(x)) = 0$, a p×n (here 1×1) matrix, i.e. a real number. Indeed, since $h(x) \equiv 1$ is constant, its derivative is zero.
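A quick numeric sanity check of this example, assuming $f(x) = (\sin x, \cos x)$ and $g(y_1, y_2) = y_1^2 + y_2^2$ as above, so $h(x) \equiv 1$ and $h'(x) = 0$:

```python
import numpy as np

# h(x) = g(f(x)) = sin(x)**2 + cos(x)**2, identically 1
def h(x):
    return np.sin(x)**2 + np.cos(x)**2

for x in (0.0, 1.0, 2.5):
    # Central-difference approximation of h'(x); should be ~0
    d = (h(x + 1e-6) - h(x - 1e-6)) / 2e-6
    print(f"h({x}) = {h(x):.6f}, h'({x}) ~ {d:.1e}")
```

Every value of h is 1 and every numerical derivative is zero up to floating-point noise, matching the chain-rule computation.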
Example. Let $f : \mathbb{R}^2 \to \mathbb{R}^2$ with components $f_1(x, y) = x^2 + y^2$ and $f_2(x, y) = x - y$, and let $g : \mathbb{R}^2 \to \mathbb{R}$, $g(u, v) = u + v$, so that $h(x, y) = g(f(x, y)) = x^2 + y^2 + x - y$. The partial derivatives are $\frac{\partial f_1}{\partial x} = 2x, \frac{\partial f_1}{\partial y} = 2y, \frac{\partial f_2}{\partial x} = 1, \frac{\partial f_2}{\partial y} = -1$, so $Df(x, y) = (\begin{smallmatrix}2x & 2y\\\\ 1 & -1\end{smallmatrix})$, an m×n (here 2×2) matrix; each row is the gradient of $f_1$ and $f_2$ respectively.
$Dg(f(x, y)) = (\begin{smallmatrix}1 & 1\end{smallmatrix})$ (a p×m, here 1×2, row vector). Apply the Chain Rule: $\mathbf{Dh(x) = Dg(f(x))Df(x)} = (\begin{smallmatrix}1 & 1\end{smallmatrix})(\begin{smallmatrix}2x & 2y\\\\ 1 & -1\end{smallmatrix}) = (\begin{smallmatrix}2x + 1 & 2y - 1\end{smallmatrix})$, a 1×2 row vector.
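A numeric check of this result, assuming $g(u, v) = u + v$ (consistent with the row vector $(1\ 1)$ above), so that $h(x, y) = x^2 + y^2 + x - y$ with gradient $(2x + 1,\ 2y - 1)$:

```python
import numpy as np

# h(x, y) = g(f(x, y)) = x**2 + y**2 + x - y from the example above
def h(v):
    x, y = v
    return x**2 + y**2 + x - y

v = np.array([0.5, -2.0])
grad = np.array([2 * v[0] + 1, 2 * v[1] - 1])   # chain-rule result
eps = 1e-6
# Central-difference gradient, one partial per coordinate direction
num = np.array([(h(v + eps * e) - h(v - eps * e)) / (2 * eps)
                for e in np.eye(2)])
print(np.allclose(grad, num, atol=1e-6))
```

The finite-difference gradient agrees with $(2x + 1,\ 2y - 1)$ at the test point.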
Example. Let $f : \mathbb{R} \to \mathbb{R}^2$, $f(t) = (\cos t, \sin t)$, and $g : \mathbb{R}^2 \to \mathbb{R}$, $g(x, y) = x^2 - y^2$, so that $h(t) = g(f(t)) = \cos^2 t - \sin^2 t = \cos(2t)$. Now compute h′(t) using the chain rule. $Df(t) = (\begin{smallmatrix}-\sin(t) \\\\ \cos(t)\end{smallmatrix})$ (a 2×1 column vector), $Dg(x, y) = (2x, -2y)$, so $Dg(f(t)) = (2\cos(t), -2\sin(t))$. Then, h′(t) = Dg(f(t))⋅Df(t) = $(2\cos(t), -2\sin(t)) \cdot (\begin{smallmatrix}-\sin(t) \\\\ \cos(t)\end{smallmatrix})$ = $-2\cos(t)\sin(t) - 2\sin(t)\cos(t) = -4\sin(t)\cos(t) = -2\sin(2t)$.
Direct differentiation of h(t) = cos(2t) gives h′(t) = −2sin(2t), matching perfectly. ✅
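The same agreement can be seen numerically, assuming $h(t) = \cos^2 t - \sin^2 t$ as in the example:

```python
import numpy as np

# h(t) = g(f(t)) = cos(t)**2 - sin(t)**2 = cos(2t)
h = lambda t: np.cos(t)**2 - np.sin(t)**2

t = 0.8
# Central-difference h'(t) vs the chain-rule answer -2 sin(2t)
num = (h(t + 1e-6) - h(t - 1e-6)) / 2e-6
print(abs(num - (-2 * np.sin(2 * t))) < 1e-8)
```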
$f(\vec{x}) = \sum_{i=1}^{n} e^{x_i}, g(y) = \ln(y).$ The composite function $h:\mathbb{R}^n \to \mathbb{R}$ is defined as $h(\vec{x}) = g(f(\vec{x})) = \ln\bigl(\sum_{i=1}^{n} e^{x_i}\bigr)$. By the Chain Rule, $\nabla h(\vec{x}) = g'(f(\vec{x}))\nabla f(\vec{x}) = \frac{1}{\sum_{i=1}^{n} e^{x_i}}\biggr(\begin{smallmatrix}e^{x_1}\\\\ e^{x_2}\\\\ \vdots \\\\ e^{x_n}\end{smallmatrix}\biggl)$
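This gradient of log-sum-exp is exactly the softmax vector. A small numeric sketch (subtracting max(x) before exponentiating is a standard stability trick, not part of the derivation above):

```python
import numpy as np

def logsumexp(x):
    # log(sum(exp(x))), computed stably by shifting by max(x)
    m = np.max(x)
    return m + np.log(np.sum(np.exp(x - m)))

def softmax(x):
    # exp(x_i) / sum(exp(x_j)), i.e. the gradient of logsumexp
    e = np.exp(x - np.max(x))
    return e / e.sum()

x = np.array([1.0, -0.5, 2.0])
eps = 1e-6
# Finite-difference gradient of logsumexp, one partial per coordinate
num = np.array([(logsumexp(x + eps * e) - logsumexp(x - eps * e)) / (2 * eps)
                for e in np.eye(3)])
print(np.allclose(num, softmax(x), atol=1e-6))
```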
Let f(x, y) = x² + y², g(t) = eᵗ, so h(x, y) = $e^{x^2+y^2}$. Then: ∇f(x, y) = (2x, 2y), g’(t) = eᵗ ⇒ g’(f(x, y)) = $e^{x^2+y^2}$.
Hence, ∇h(x, y) = $e^{x^2+y^2}·(2x, 2y) = (2xe^{x^2+y^2}, 2ye^{x^2+y^2})$.
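A finite-difference check of this gradient, using the $h(x, y) = e^{x^2 + y^2}$ from the example:

```python
import numpy as np

# h(x, y) = exp(x**2 + y**2); chain rule gives grad h = h * (2x, 2y)
h = lambda v: np.exp(v[0]**2 + v[1]**2)

v = np.array([0.3, -0.4])
grad = h(v) * 2 * v                      # chain-rule gradient
eps = 1e-6
num = np.array([(h(v + eps * e) - h(v - eps * e)) / (2 * eps)
                for e in np.eye(2)])
print(np.allclose(grad, num, atol=1e-5))
```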
Let $f(\vec{x}) = \|\vec{x}\|_2^2$ and $g(y) = \sqrt{y}$, so that $h(\vec{x}) = g(f(\vec{x})) = \|\vec{x}\|_2$. f is differentiable on $\mathbb{R}^n$, with gradient $\nabla f(\vec{x}) = 2\vec{x}$; g is differentiable for y > 0, with derivative $g'(y) = \frac{1}{2\sqrt{y}}$.
Applying the Chain Rule (valid wherever $f(\vec{x}) > 0$, i.e. on the open set $\mathbb{R}^n \setminus \{\vec{0}\}$): $\nabla h(\vec{x}) = g'(f(\vec{x}))\nabla f(\vec{x}) = \frac{1}{2\sqrt{\|\vec{x}\|_2^{2}}}2\vec{x} = \frac{\vec{x}}{\|\vec{x}\|_2}$ for every non-zero vector $\vec{x} \ne \vec{0}$. At $\vec{x} = \vec{0}$, the Euclidean norm is not differentiable due to a corner point.
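A numeric check of the formula $\nabla \|\vec{x}\|_2 = \vec{x} / \|\vec{x}\|_2$ away from the origin:

```python
import numpy as np

# Euclidean norm; its gradient at v != 0 is the unit vector v / ||v||
norm = lambda v: np.sqrt(np.sum(v**2))

v = np.array([3.0, 4.0])
grad = v / norm(v)                       # chain-rule result: (0.6, 0.8)
eps = 1e-6
num = np.array([(norm(v + eps * e) - norm(v - eps * e)) / (2 * eps)
                for e in np.eye(2)])
print(np.allclose(grad, num))
# At the origin the one-sided slopes along each axis disagree (+1 vs -1),
# so no single linear map fits: the norm is not differentiable at 0.
```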
Intuition: the total derivative of a composition is the product of the local linear approximations of its stages.