"Behind this mask there is more than just flesh. Beneath this mask there is an idea… and ideas are bulletproof." — Alan Moore, V for Vendetta
Definition. Differentiability at a point. Let $f : ℝ^n \to ℝ^m$ be a function and let x be an interior point of the domain of f, $x \in \text{interior dom } f$. The function f is differentiable at x if there exists a matrix $Df(x) \in ℝ^{m \times n}$ that satisfies $\lim_{z \in \text{dom } f,\, z \neq x,\, z \to x} \frac{||f(z) - f(x) - Df(x)(z-x)||_2}{||z-x||_2} = 0$
Such a matrix Df(x) is called the derivative (or Jacobian) of f at x.
Df(x) = $\Biggl (\begin{smallmatrix}\frac{∂f_1(x)}{∂x_1} & \frac{∂f_1(x)}{∂x_2} & \cdots & \frac{∂f_1(x)}{∂x_n}\\ \frac{∂f_2(x)}{∂x_1} & \frac{∂f_2(x)}{∂x_2} & \cdots & \frac{∂f_2(x)}{∂x_n}\\ \vdots & \vdots & \ddots & \vdots\\ \frac{∂f_m(x)}{∂x_1} & \frac{∂f_m(x)}{∂x_2} & \cdots & \frac{∂f_m(x)}{∂x_n}\end{smallmatrix}\Biggr )$. The Jacobian Df(x) is an m × n real matrix, and computing these partial derivatives entry by entry is the practical way to obtain it.
The definition of differentiability captures the idea that a function can be locally approximated by a linear transformation. The Jacobian matrix is the matrix representation of this linear transformation, and its entries are the partial derivatives of the component functions. The use of norms is crucial for making the definition rigorous in higher dimensions. The condition x ∈ interior dom f ensures that we can consider small perturbations around x within the function’s domain.
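As a quick sanity check on this definition, the Jacobian can be approximated numerically: if f is differentiable at x, central finite differences on each coordinate should recover the entries of Df(x). Below is a minimal Python/NumPy sketch (the helper name numerical_jacobian and the step size h are illustrative choices, not part of the definition); it is reused in the examples that follow.

```python
import numpy as np

def numerical_jacobian(f, x, h=1e-6):
    """Approximate the m x n Jacobian Df(x) by central finite differences.

    An illustrative sketch: f maps an n-vector to an m-vector (or a scalar),
    and h is an assumed step size.
    """
    x = np.asarray(x, dtype=float)
    m = np.atleast_1d(f(x)).size          # number of component functions
    J = np.zeros((m, x.size))
    for j in range(x.size):               # perturb one coordinate at a time
        e = np.zeros_like(x)
        e[j] = h
        J[:, j] = (np.atleast_1d(f(x + e)) - np.atleast_1d(f(x - e))) / (2 * h)
    return J
```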
Jacobian of the identity function. Let $f: ℝ^n \to ℝ^n$ be the function defined as $f(\vec{x}) = \vec{x}$. This function simply returns its input. Each component function is $f_i(\vec{x}) = x_i$. The partial derivative of $f_i(\vec{x})$ with respect to $x_j$ is: $Df(\vec{x})_{ij} = \frac{∂f_i(\vec{x})}{∂x_j} = \delta(i, j)$ where δ(i, j) is the Kronecker delta, defined as: δ(i, j) = 1 if i = j, δ(i, j) = 0 if i ≠ j. This is precisely the definition of the n × n identity matrix, $Df(\vec{x}) = I_n$.
Jacobian of a linear transformation. Let $f: ℝ^n \to ℝ^m$ be a function defined as $f(\vec{x}) = A\vec{x}$ where $A = (a_{ij}) \in ℝ^{m \times n}$ (an m × n real matrix). Then, the i-th component function of f is given by: $f_i(\vec{x}) = \sum_{j=1}^n a_{ij}x_j$. The partial derivative of $f_i(\vec{x})$ with respect to $x_j$ is: $Df(\vec{x})_{ij} = \frac{∂f_i(\vec{x})}{∂x_j} = a_{ij}$. Thus, the Jacobian matrix $Df(\vec{x})$ is an m × n matrix whose (i, j)-th entry is $a_{ij}$. This is exactly the matrix A, $Df(\vec{x}) = A$.
Example (m = 2, n = 3). If A = $(\begin{smallmatrix}a_{11} & a_{12} & a_{13}\\a_{21} & a_{22} & a_{23}\end{smallmatrix})$ and $f(\vec{x}) = A\vec{x}$, then $Df(\vec{x}) = A = (\begin{smallmatrix}a_{11} & a_{12} & a_{13}\\a_{21} & a_{22} & a_{23}\end{smallmatrix})$.
Jacobian of an affine transformation. Let $f: ℝ^n \to ℝ^m$ be a function defined as $f(\vec{x}) = A\vec{x} + b$ where $A = (a_{ij}) \in ℝ^{m \times n}$ (an m × n real matrix) and $b \in ℝ^m$ (a constant vector). Then, the i-th component function of f is: $f_i(\vec{x}) = \sum_{j=1}^n a_{ij}x_j + b_i$. The partial derivative of $f_i(\vec{x})$ with respect to $x_j$ is: $Df(\vec{x})_{ij} = \frac{∂f_i(\vec{x})}{∂x_j} = a_{ij}$ (the constant term $b_i$ disappears when we take the derivative). Thus, the Jacobian matrix $Df(\vec{x})$ is an m × n matrix whose (i, j)-th entry is $a_{ij}$. This is again the matrix A, $Df(\vec{x}) = A$. In other words, the vector b is just a constant offset and has no impact on the derivative.
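Reusing the numerical_jacobian sketch from above, we can check that the Jacobian of an affine map is the constant matrix A, independent of the evaluation point (the specific A, b, and test point below are arbitrary):

```python
A = np.array([[1., 2., 3.],
              [4., 5., 6.]])                 # m = 2, n = 3
b = np.array([7., 8.])
f = lambda x: A @ x + b                      # affine map f(x) = Ax + b
x0 = np.random.randn(3)                      # arbitrary test point
print(np.allclose(numerical_jacobian(f, x0), A, atol=1e-6))  # True: Df(x) = A
```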
Example. Let $f: ℝ^2 \to ℝ^2$ be the function defined as $f(x, y) = (x^2 + y, xy)$; then $Df(x, y) = (\begin{smallmatrix}2x & 1\\ y & x\end{smallmatrix})$.
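The same check works for this nonlinear example; here the Jacobian does depend on the evaluation point (the test point below is arbitrary):

```python
f = lambda v: np.array([v[0]**2 + v[1], v[0] * v[1]])   # f(x, y) = (x^2 + y, xy)
Df = lambda v: np.array([[2 * v[0], 1.0],                # the Jacobian derived above
                         [v[1],     v[0]]])
v0 = np.array([1.5, -2.0])                               # arbitrary test point
print(np.allclose(numerical_jacobian(f, v0), Df(v0), atol=1e-6))  # True
```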
Definition. A function f is called differentiable if its domain dom f is open and f is differentiable at every point of its domain.
Definition. The affine function $\tilde{f}(x) = f(a) + Df(a)(x -a)$ is called the first-order approximation or linearization of the function f at the point x = a where a lies in the interior of the domain of f (i.e. a ∈ int(dom(f))).
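A small numerical illustration of the linearization (again reusing the earlier helper; the base point a and the perturbation direction are arbitrary): as x approaches a, the approximation error shrinks quadratically, consistent with a first-order approximation.

```python
f = lambda v: np.array([v[0]**2 + v[1], v[0] * v[1]])
a = np.array([1.0, 1.0])
Ja = numerical_jacobian(f, a)                    # Df(a)
for t in (1e-1, 1e-2, 1e-3):                     # shrink the perturbation
    x = a + t * np.array([1.0, -2.0])
    err = np.linalg.norm(f(x) - (f(a) + Ja @ (x - a)))
    print(t, err)                                # err decreases like t**2
```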
Definition. Let $f: ℝ^n \to ℝ$ be a real-valued function. The derivative $Df(\vec{x})$ is a 1 × n matrix (a row vector). The gradient of f at $\vec{x}$, denoted by $\nabla f(\vec{x})$, is defined as the transpose of the derivative: $\nabla f(\vec{x}) = Df(\vec{x})^T$ where $\vec{x} \in$ int(dom(f)) (the interior of the domain of f). This results in a column vector. The i-th component of the gradient is given by the partial derivative with respect to the i-th variable, $\nabla f(\vec{x})_i = \frac{\partial f(\vec{x})}{\partial x_i}$, for i = 1, ···, n, provided f is differentiable at $\vec{x}$.
Gradient of a linear function. Let $f: ℝ^n \to ℝ$ be defined as $f(\vec{x}) = \vec{a}^T\vec{x}$ for some $\vec{a} \in ℝ^n$. The components of the gradient are given by the partial derivatives: $\nabla f(\vec{x})_i = \frac{\partial f(\vec{x})}{\partial x_i} = \frac{\partial}{\partial x_i}\Bigl(\sum_{j=1}^n a_jx_j\Bigr) = a_i \implies \nabla f(\vec{x}) = \vec{a}$.
Gradient of an affine function. Let $f: ℝ^n \to ℝ$ be defined as $f(\vec{x}) = \vec{a}^T\vec{x} + b$ with $\vec{a} \in ℝ^n$ and $b \in ℝ$. The components of the gradient are given by the partial derivatives: $\nabla f(\vec{x})_i = \frac{\partial f(\vec{x})}{\partial x_i} = \frac{\partial}{\partial x_i}\Bigl(\sum_{j=1}^n a_jx_j + b\Bigr) = a_i \implies \nabla f(\vec{x}) = \vec{a}$. As expected, the intercept b is just a constant term that doesn't affect the gradient.
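Numerically (with the earlier numerical_jacobian helper; a, b, and the test point are arbitrary), the 1 × n derivative of an affine scalar function transposes to the constant gradient $\vec{a}$:

```python
a = np.array([1.0, -2.0, 3.0])
f = lambda x: a @ x + 5.0                        # f(x) = a^T x + b with b = 5
x0 = np.random.randn(3)
grad = numerical_jacobian(f, x0).ravel()         # transpose the 1 x n derivative
print(np.allclose(grad, a, atol=1e-6))           # True: the gradient is a
```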
Gradient of a quadratic form. Let $f: ℝ^n \to ℝ$ be defined as $f(\vec{x}) = \vec{x}^TA\vec{x}$ where $A = (a_{ij}) \in ℝ^{n \times n}$. Expanding the product:
$f(\vec{x}) = \vec{x}^TA\vec{x} = (\begin{smallmatrix}x_1 & x_2 & \cdots & x_n \end{smallmatrix})\Biggl (\begin{smallmatrix}a_{11} & a_{12} & \cdots & a_{1n}\\ a_{21} & a_{22} & \cdots & a_{2n} \\ \vdots & \vdots & \ddots & \vdots \\ a_{n1} & a_{n2} & \cdots & a_{nn} \end{smallmatrix} \Biggr) \Biggl ( \begin{smallmatrix}x_1\\ x_2 \\ \vdots \\ x_n \end{smallmatrix} \Biggr ) = (\sum_{i=1}^n x_ia_{i1}, \sum_{i=1}^n x_ia_{i2}, \cdots, \sum_{i=1}^n x_ia_{in}) \Biggl ( \begin{smallmatrix}x_1\\ x_2 \\ \vdots \\ x_n \end{smallmatrix} \Biggr ) = \sum_{j=1}^n\Bigl(\sum_{i=1}^n x_ia_{ij}\Bigr)x_j = \sum_{i=1}^n \sum_{j=1}^n x_ia_{ij}x_j$
Taking the partial derivative with respect to $x_k$: the quadratic term $a_{kk}x_k^2$ (i = j = k) contributes $2a_{kk}x_k$, the linear terms with j = k and i ≠ k contribute $\sum_{i \neq k} x_ia_{ik}$, and the linear terms with i = k and j ≠ k contribute $\sum_{j \neq k} a_{kj}x_j$. Therefore $\frac{\partial f(\vec{x})}{\partial x_k} = 2a_{kk}x_k + \sum_{i \neq k} x_ia_{ik} + \sum_{j \neq k} a_{kj}x_j =[\text{moving one } a_{kk}x_k \text{ into each sum}] \sum_{i=1}^n x_ia_{ik} + \sum_{j=1}^n a_{kj}x_j$. In vector notation, this can be written as: $\nabla f(\vec{x}) = A^T\vec{x} + A\vec{x} = (A^T+A)\vec{x} =[\text{if A is symmetric, i.e., A =} A^T] 2A\vec{x}$.
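A quick numerical check of this result with the earlier helper (the random A and test point are arbitrary; note A need not be symmetric):

```python
n = 4
A = np.random.randn(n, n)                        # not necessarily symmetric
f = lambda x: x @ A @ x                          # f(x) = x^T A x
x0 = np.random.randn(n)
grad = numerical_jacobian(f, x0).ravel()
print(np.allclose(grad, (A.T + A) @ x0, atol=1e-4))  # True: grad = (A^T + A) x
```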
A particular case is the gradient of the squared $\ell_2$ norm. Let f be the quadratic form given by $f: ℝ^n \to ℝ$, $f(\vec{x}) = ||\vec{x}||_2^2 = \vec{x}^T\vec{x} =[\text{this can also be expressed as}] \vec{x}^TI\vec{x}$ where I is the n × n identity matrix. From the general result for the gradient of a quadratic form, $\nabla f(\vec{x}) = A^T\vec{x} + A\vec{x} = (A^T+A)\vec{x}$, we can substitute A = I. Therefore, the gradient of f is $\nabla f(\vec{x}) = (I^T+I)\vec{x} =[\text{since the identity matrix is symmetric, } I^T = I] 2I\vec{x} = 2\vec{x}$.
The $\ell_2$ or Euclidean norm is a measure of the magnitude of a vector in Euclidean space. It is calculated as the square root of the sum of the squares of the vector's components: $||\vec{x}||_2 = \sqrt{x_1^2 + x_2^2 + \cdots + x_n^2}$. Its square is therefore $||\vec{x}||_2^2 = x_1^2 + x_2^2 + \cdots + x_n^2 = \vec{x}^T\vec{x}$.
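Checking the special case A = I numerically (arbitrary test point, same helper as before):

```python
x0 = np.random.randn(5)
f = lambda x: x @ x                              # f(x) = ||x||_2^2
grad = numerical_jacobian(f, x0).ravel()
print(np.allclose(grad, 2 * x0, atol=1e-5))      # True: grad = 2x
```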
Gradient of a quadratic function. Let $f(\vec{x}) = \frac{1}{2}\vec{x}^TP\vec{x} + \vec{q}^T\vec{x} + r$ where $P \in ℝ^{n \times n}$ is a symmetric matrix, $\vec{q} \in ℝ^n$, and $r \in ℝ$. The gradient can be calculated as follows: $\nabla f(\vec{x}) = \nabla (\frac{1}{2}\vec{x}^TP\vec{x} + \vec{q}^T\vec{x} + r) =$ [we can apply the gradient operator to each term separately, as the gradient is linear] $\frac{1}{2} \nabla(\vec{x}^TP\vec{x}) + \nabla(\vec{q}^T\vec{x}) + \nabla(r) =$ [from the general result for the gradient of a quadratic form, $\nabla(\vec{x}^TA\vec{x}) = (A^T+A)\vec{x}$, we can substitute A = P; the gradient of the constant r is zero; and the gradient of the linear function $\vec{a}^T\vec{x}$ is $\vec{a}$] $\frac{1}{2}(P^T+P)\vec{x}+\vec{q}$ =[since P is symmetric] $P\vec{x}+\vec{q}$.
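And a check for the general quadratic function (P is symmetrized from a random matrix; q, r, and the test point are arbitrary):

```python
n = 3
M = np.random.randn(n, n)
P = M + M.T                                      # symmetric P
q = np.random.randn(n)
r = 1.7
f = lambda x: 0.5 * x @ P @ x + q @ x + r
x0 = np.random.randn(n)
grad = numerical_jacobian(f, x0).ravel()
print(np.allclose(grad, P @ x0 + q, atol=1e-4))  # True: grad = Px + q
```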
Definition. Let $f: ℝ^n \to ℝ$ be a real-valued function with domain S = dom(f), and let $U \subseteq S$ be an open set. If all the partial derivatives of f exist and are continuous at every point x ∈ U, then f is said to be continuously differentiable on U. If the domain S itself is an open set and f is continuously differentiable on S, then f is said to be continuously differentiable.