For every problem there is always at least one solution which seems quite plausible: simple and clean, direct, neat and nice, and yet very wrong. #Anawim, justtothepoint.com

In single-variable calculus, the derivative $f'(a)$ is the slope of the tangent line at $x = a$, satisfying $f(a + h) - f(a) = f'(a)h + o(|h|)$ as $h \to 0$.
In multivariable calculus ($F: \mathbb{R}^n \to \mathbb{R}^m$), the concept generalizes: the derivative is not a single number, but a linear transformation $DF(a): \mathbb{R}^n \to \mathbb{R}^m$ such that $F(a + h) - F(a) = DF(a)h + o(\|h\|)$ as $h \to 0$.
This linear map $DF(a)$ is the best linear approximation of F near a.
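As a concrete warm-up, here is a minimal Python sketch of the single-variable condition, using the illustrative choices f(x) = x² and a = 1.5 (not from the text): the remainder f(a + h) − f(a) − f′(a)h shrinks faster than |h|.

```python
# Minimal sketch: check f(a+h) - f(a) - f'(a)h = o(|h|) numerically.
# f(x) = x**2 and a = 1.5 are illustrative choices.
def f(x):
    return x**2

a = 1.5
fprime_a = 2 * a  # f'(x) = 2x, computed by hand

for h in [1e-1, 1e-2, 1e-3, 1e-4]:
    remainder = f(a + h) - f(a) - fprime_a * h
    print(f"h = {h:.0e}:  remainder/|h| = {remainder / abs(h):.2e}")
# remainder/|h| shrinks like h itself, confirming the o(|h|) behavior.
```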
Definition. The n-dimensional Euclidean space ℝⁿ is the set of all ordered n-tuples: $\mathbb{R}^n = \{(x_1, x_2, \ldots, x_n) : x_i \in \mathbb{R}\}$ equipped with:
Componentwise addition: $x + y = (x_1 + y_1, \ldots, x_n + y_n)$
Scalar multiplication: $\lambda x = (\lambda x_1, \ldots, \lambda x_n)$
The Euclidean norm: $\|x\| = \sqrt{x_1^2 + x_2^2 + \cdots + x_n^2}$
Definition. The standard basis for ℝⁿ consists of the vectors: $e_1 = (1, 0, 0, \ldots, 0), \quad e_2 = (0, 1, 0, \ldots, 0), \quad \ldots, \quad e_n = (0, 0, \ldots, 0, 1)$
More precisely: $(e_j)_i = \delta_{ij} = \begin{cases} 1 & \text{if } i = j \\ 0 & \text{if } i \neq j \end{cases}$
The standard basis $\{ e_j \}_{j=1}^n$ spans $\mathbb{R}^n$. Any vector can be written as: $x = (x_1, x_2, \ldots, x_n) = x_1 e_1 + x_2 e_2 + \cdots + x_n e_n = \sum_{j=1}^{n} x_j e_j$. x has coordinates $(x_1, x_2, \ldots, x_n)$ relative to this basis.
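A one-line NumPy sanity check of this decomposition, with an arbitrary illustrative vector x ∈ ℝ³:

```python
import numpy as np

x = np.array([2.0, -1.0, 5.0])         # illustrative vector in R^3
basis = np.eye(3)                      # rows are e_1, e_2, e_3

# x = sum_j x_j e_j
reconstruction = sum(x[j] * basis[j] for j in range(3))
print(np.allclose(x, reconstruction))  # True
```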
Definition. A function L: ℝⁿ → ℝᵐ is linear if:
$L(x + y) = L(x) + L(y)$ for all $x, y \in \mathbb{R}^n$ (additivity)
$L(\alpha x) = \alpha L(x)$ for all $x \in \mathbb{R}^n$ and scalars α (homogeneity)
Equivalently: $L(\alpha x + \beta y) = \alpha L(x) + \beta L(y)$ for all scalars α, β.
Key Property: A linear map is completely determined by its action on basis vectors: $L(x) = L\left(\sum_{j=1}^{n} x_j e_j\right) = \sum_{j=1}^{n} x_j L(e_j)$
Every linear map L: ℝⁿ → ℝᵐ can be represented by an m × n matrix A: $L(x) = Ax$
The columns of A are the images of the standard basis vectors:
$$A = \begin{pmatrix} | & | & & | \\ L(e_1) & L(e_2) & \cdots & L(e_n) \\ | & | & & | \end{pmatrix}$$

Example: The linear map L: ℝ² → ℝ³ with L(e₁) = (1, 2, 3)ᵀ and L(e₂) = (4, 5, 6)ᵀ has matrix:
$$A = \begin{pmatrix} 1 & 4 \\ 2 & 5 \\ 3 & 6 \end{pmatrix}$$
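Here is a small NumPy sketch of this construction for the example above; the test vector x = (2, −1) is an arbitrary illustrative choice:

```python
import numpy as np

# Columns of A are the images of the standard basis vectors.
L_e1 = np.array([1.0, 2.0, 3.0])       # L(e_1)
L_e2 = np.array([4.0, 5.0, 6.0])       # L(e_2)
A = np.column_stack([L_e1, L_e2])      # the 3 x 2 matrix from the example

x = np.array([2.0, -1.0])              # arbitrary test vector
# Linearity: L(x) = x_1 L(e_1) + x_2 L(e_2) = A @ x
print(np.allclose(A @ x, x[0] * L_e1 + x[1] * L_e2))  # True
```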
For a function F: U ⊆ ℝⁿ → ℝᵐ, we want to approximate F near a point a using the simplest possible function: a linear map.
The derivative is not a number or a vector; it is a linear transformation that captures the "first-order" behavior of F at a.
The approximation takes the form: $F(a + h) \approx F(a) + L(h)$ where L: $\mathbb{R}^n \to \mathbb{R}^m$ is linear and the error vanishes faster than ∥h∥. The derivative is the best such approximation.
Definition (Fréchet Derivative). Let F: U ⊆ ℝⁿ → ℝᵐ where U is open, and let a ∈ U. We say F is differentiable at a if there exists a linear map L: ℝⁿ → ℝᵐ such that:
$$\lim_{h \to 0} \frac{\|F(a + h) - F(a) - L(h)\|}{\|h\|} = 0$$

When such an L exists, it is unique (see below), written $L = DF_a$, and called the total derivative of F at a. The definition can be rewritten as: $F(a + h) = F(a) + L(h) + o(\|h\|) \quad \text{as } h \to 0$ where $o(\|h\|)$ denotes a term with $\lim_{h \to 0} \frac{o(\|h\|)}{\|h\|} = 0$.
Or more explicitly: $F(a + h) = F(a) + L(h) + \|h\| \cdot \varepsilon(h)$ where $\varepsilon(h) \to 0$ as $h \to 0$.
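Here is a hedged numerical sketch of this limit for an illustrative map F(x, y) = (x² + y, xy) (my own example, not from the text); the candidate linear map L is the matrix of partial derivatives, anticipating the Jacobian introduced below.

```python
import numpy as np

# Numerical sketch of the Frechet limit for F(x, y) = (x**2 + y, x*y)
# at a = (1, 2); L is the matrix of partial derivatives worked by hand.
def F(p):
    x, y = p
    return np.array([x**2 + y, x * y])

a = np.array([1.0, 2.0])
L = np.array([[2 * a[0], 1.0],
              [a[1],     a[0]]])       # [[2x, 1], [y, x]] evaluated at a

rng = np.random.default_rng(0)
u = rng.normal(size=2)
u /= np.linalg.norm(u)                 # a fixed, arbitrary direction
for t in [1e-1, 1e-2, 1e-3, 1e-4]:
    h = t * u
    ratio = np.linalg.norm(F(a + h) - F(a) - L @ h) / np.linalg.norm(h)
    print(f"||h|| = {t:.0e}  ratio = {ratio:.2e}")
# The ratio tends to 0 as h -> 0, exactly the defining condition.
```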
Theorem (Uniqueness). If F is differentiable at a, the derivative $DF_a$ is unique.
Proof. Suppose L₁ and L₂ both satisfy the definition. Then, $\frac{\|L_1(h) - L_2(h)\|}{\|h\|} = \frac{\|[F(a+h) - F(a) - L_2(h)] - [F(a+h) - F(a) - L_1(h)]\|}{\|h\|}$
By the triangle inequality, $\leq \frac{\|F(a+h) - F(a) - L_1(h)\|}{\|h\|} + \frac{\|F(a+h) - F(a) - L_2(h)\|}{\|h\|}$
Both terms on the right tend to 0 as $h \to 0$. Hence, $\lim_{h \to 0} \frac{\|L_1(h) - L_2(h)\|}{\|h\|} = 0 (\star).$
Because $L_1$ and $L_2$ are linear, for any fixed non-zero vector u and any real $t \ne 0$, $L_i(tu) = tL_i(u)$.
Next, take h = tu with $t \to 0$. Then, $\|h\| = |t|\|u\|$ and $\frac{\|L_1(tu) - L_2(tu)\|}{|t|\|u\|} = \frac{|t| \|L_1(u) - L_2(u)\|}{|t|\|u\|} = \frac{\|L_1(u) - L_2(u)\|}{\|u\|}$
This expression does not depend on t. Since the limit as $t \to 0$ must be 0 $(\star)$, we obtain $\frac{\|L_1(u) - L_2(u)\|}{\|u\|} = 0 \implies L_1(u) = L_2(u)$
The equality holds for every non‑zero vector u; for u = 0 it is trivial because linear maps send 0 to 0. Therefore, $L_1$ and $L_2$ agree on the whole space, i.e., $L_1 = L_2$ ∎
Why does this matter? Uniqueness guarantees that the derivative is well-defined; otherwise the notation $DF_a$ would be ambiguous. Furthermore, it justifies calling the derivative the best linear approximation of F at a.
While the total derivative $DF_a$ captures how F changes in all directions simultaneously, partial derivatives measure rates of change along coordinate axes only.
Definition. Let F: U ⊆ ℝⁿ → ℝᵐ be a function and a ∈ U. The partial derivative of F with respect to $x_j$ at a is: $\frac{\partial F}{\partial x_j}(a) = \lim_{t \to 0} \frac{F(a + te_j) - F(a)}{t}$ where $e_j$ is the j-th standard basis vector, provided the limit exists.
The partial derivative $\frac{\partial F}{\partial x_j}(a)$ is the rate of change of F when we move from a in the $e_j$ direction, keeping all other coordinates fixed.
Example: F: ℝ² → ℝ³ defined by F(s, t) = (s² + t³, 2st, s + 3t) where $F_1(s,t) = s^2 + t^3$, $F_2(s,t) = 2st$, and $F_3(s,t) = s + 3t$
Partial with respect to s: $\frac{\partial F}{\partial s} = \begin{pmatrix} \frac{\partial F_1}{\partial s} \\[6pt] \frac{\partial F_2}{\partial s} \\[6pt] \frac{\partial F_3}{\partial s} \end{pmatrix} = \begin{pmatrix} 2s \\ 2t \\ 1 \end{pmatrix}$
Partial with respect to t: $\frac{\partial F}{\partial t} = \begin{pmatrix} \frac{\partial F_1}{\partial t} \\[6pt] \frac{\partial F_2}{\partial t} \\[6pt] \frac{\partial F_3}{\partial t} \end{pmatrix} = \begin{pmatrix} 3t^2 \\ 2s \\ 3 \end{pmatrix}$
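These two hand computations can be double-checked symbolically; here is a minimal SymPy sketch (using SymPy is an incidental choice, any CAS would do):

```python
import sympy as sp

s, t = sp.symbols('s t')
F = sp.Matrix([s**2 + t**3, 2*s*t, s + 3*t])

print(F.diff(s))  # Matrix([[2*s], [2*t], [1]])    -> partial w.r.t. s
print(F.diff(t))  # Matrix([[3*t**2], [2*s], [3]]) -> partial w.r.t. t
```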
Since $DF_a$ is a linear map from $\mathbb{R}^n$ to $\mathbb{R}^m$, it can be represented by an $m \times n$ matrix. This is called the Jacobian Matrix.
Definition. Let F: U ⊆ ℝⁿ → ℝᵐ with component functions $F = (F_1, F_2, \ldots, F_m)$. The Jacobian matrix of F at a point a ∈ U is the m × n matrix formed by all partial derivatives evaluated at x = a:
$$J_F(a) = \begin{pmatrix} \frac{\partial F_1}{\partial x_1}(a) & \frac{\partial F_1}{\partial x_2}(a) & \cdots & \frac{\partial F_1}{\partial x_n}(a) \\[8pt] \frac{\partial F_2}{\partial x_1}(a) & \frac{\partial F_2}{\partial x_2}(a) & \cdots & \frac{\partial F_2}{\partial x_n}(a) \\[8pt] \vdots & \vdots & \ddots & \vdots \\[8pt] \frac{\partial F_m}{\partial x_1}(a) & \frac{\partial F_m}{\partial x_2}(a) & \cdots & \frac{\partial F_m}{\partial x_n}(a) \end{pmatrix}$$

Row View: The rows of the Jacobian are the transposes of the gradients of the component functions: $J_F(a) = \begin{pmatrix} — (\nabla F_1(a))^T — \\ — (\nabla F_2(a))^T — \\ \vdots \\ — (\nabla F_m(a))^T — \end{pmatrix}$
Column View: Each column is a partial derivative vector:
$$J_F = \begin{pmatrix} | & | & & | \\[4pt] \frac{\partial F}{\partial x_1} & \frac{\partial F}{\partial x_2} & \cdots & \frac{\partial F}{\partial x_n} \\[4pt] | & | & & | \end{pmatrix}$$

To find the actual change vector $DF_a(h)$ for a specific displacement $h$, we perform a matrix multiplication: $DF_a(h) = J_F(a) \cdot \begin{pmatrix} h_1 \\ \vdots \\ h_n \end{pmatrix}$
The derivative $DF_a$ is the best linear approximation of the change in F near the point a, and the Jacobian matrix $J_F(a)$ is the matrix that represents this linear transformation. Multiplying $J_F(a)$ by a vector h gives the approximate change in F corresponding to the small displacement h.
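Continuing the example F(s, t) = (s² + t³, 2st, s + 3t) from above, here is a short SymPy sketch that assembles $J_F$, evaluates it at an illustrative base point a = (1, 2), and compares the true change F(a + h) − F(a) with the linear prediction $J_F(a)h$:

```python
import sympy as sp

s, t = sp.symbols('s t')
F = sp.Matrix([s**2 + t**3, 2*s*t, s + 3*t])
J = F.jacobian([s, t])                  # [[2s, 3t^2], [2t, 2s], [1, 3]]

a = {s: 1, t: 2}                        # illustrative base point
h = sp.Matrix([0.01, -0.02])            # small displacement
actual = F.subs({s: 1.01, t: 1.98}) - F.subs(a)   # F(a + h) - F(a)
approx = J.subs(a) * h                  # J_F(a) h
print((actual - approx).norm())         # tiny residual: the error is o(||h||)
```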
Example: real versus complex differentiability.
We can think of a complex number z = x + iy as a point $(x,y) \in \mathbb{R}^2$. So any function $f:\mathbb{C}\rightarrow \mathbb{C}$ can also be seen as a function $F:\mathbb{R}^2\rightarrow \mathbb{R}^2,\quad F(x,y)=(u(x,y),v(x,y)),$ where f(x + iy) = u(x, y) + iv(x, y).
So there are two notions of differentiability:
Real differentiability of $F:\mathbb{R}^2\rightarrow \mathbb{R}^2$
Complex differentiability of $f:\mathbb{C}\rightarrow \mathbb{C}$
They are related, but not the same. Complex differentiability is much more restrictive.
Real differentiability: any linear map is allowed. For a function $F:\mathbb{R}^2\rightarrow \mathbb{R}^2$, real differentiability at a point $(x_0,y_0)$ means that there exists a real linear map $DF(x_0,y_0):\mathbb{R}^2\rightarrow \mathbb{R}^2$ (a $2\times 2$ matrix) such that
$F(x_0+\Delta x,y_0+\Delta y)=F(x_0,y_0)+DF(x_0,y_0)\left( \begin{matrix}\Delta x\\ \Delta y\end{matrix}\right) +\mathrm{error},$ where the error is small compared to $\sqrt{(\Delta x)^2+(\Delta y)^2}.$
So in the real sense, a function is differentiable if it can be locally approximated by a linear transformation (a matrix multiplication). This matrix can stretch, rotate, reflect, or skew space in any way.
Complex differentiability: only complex multiplication is allowed. Now look at $f:\mathbb{C}\rightarrow \mathbb{C}$.
Complex differentiability at $z_0$ means that there exists a complex number a such that $f(z_0+h)=f(z_0)+a\, h+\mathrm{error},$ where the error is small compared to |h| as $h\rightarrow 0$.
In the real case, the linear approximation can be any real-linear map $\mathbb{R}^2\rightarrow \mathbb{R}^2$. However, in the complex case, the linear approximation must be multiplication by a single complex number $a = re^{i\theta}$.
Multiplication by the complex number $a = re^{i\theta}$ results only in rotation by the angle $\theta$ and uniform scaling by $r = |a|$. It does not allow for reflection or skewing.
Every complex number $a=\alpha +i\beta$ defines a real-linear map $\mathbb{R}^2\rightarrow \mathbb{R}^2$ via multiplication: $a(x+iy)=(\alpha x-\beta y)+i(\beta x+\alpha y).$
In matrix form, this is: $\left( \begin{matrix}u\\ v\end{matrix}\right) =\left( \begin{matrix}\alpha &-\beta \\ \beta &\alpha \end{matrix}\right) \left( \begin{matrix}x\\ y\end{matrix}\right).$
So complex multiplication corresponds exactly to matrices of the form $\left( \begin{matrix}a&-b\\ b&a\end{matrix}\right)$.
These are precisely the matrices that represent rotation + uniform scaling (no reflection, no shear).
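A quick Python check that multiplying by a complex number a = α + iβ acts on (x, y) exactly as the matrix above; the values of α, β, and z are arbitrary illustrative choices:

```python
import numpy as np

alpha, beta = 0.6, 1.3                  # a = alpha + i*beta (arbitrary)
a = complex(alpha, beta)
M = np.array([[alpha, -beta],
              [beta,  alpha]])          # rotation + uniform scaling

z = complex(2.0, -0.5)                  # arbitrary input
w = a * z                               # complex multiplication
v = M @ np.array([z.real, z.imag])      # matrix acting on (x, y)
print(np.allclose([w.real, w.imag], v)) # True
```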
However, a general real derivative $DF(x_0,y_0)$ is an arbitrary matrix $\left( \begin{matrix}A&B\\ C&D\end{matrix}\right)$, with no relation between A, B, C, and D.
For complex differentiability, we require that this matrix comes from a single complex number, i.e. $\left( \begin{matrix}A&B\\ C&D\end{matrix}\right) =\left( \begin{matrix}a&-b\\ b&a\end{matrix}\right)$ for some real a, b. That forces: $A=D,\quad C=-B$.
These are exactly the Cauchy–Riemann equations in disguise.
Write f(z) = u(x, y) + iv(x, y), with z = x + iy. If f is complex differentiable at $z_0$ with $f'(z_0) = a + ib$, then the Jacobian of F = (u, v) must be the matrix of multiplication by $f'(z_0)$: $\begin{pmatrix} u_x & u_y \\ v_x & v_y \end{pmatrix} = \begin{pmatrix} a & -b \\ b & a \end{pmatrix}$
Matching entries gives: $u_x=v_y,\quad u_y=-v_x.$ These are the Cauchy–Riemann equations. They are exactly the condition that the real derivative is not just any linear map, but one that comes from complex multiplication.
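As a concrete check, this SymPy sketch verifies the Cauchy–Riemann equations for the illustrative holomorphic function f(z) = z², for which u = x² − y² and v = 2xy:

```python
import sympy as sp

x, y = sp.symbols('x y', real=True)
u = x**2 - y**2                         # real part of z**2
v = 2*x*y                               # imaginary part of z**2

print(sp.simplify(u.diff(x) - v.diff(y)))  # 0  ->  u_x = v_y
print(sp.simplify(u.diff(y) + v.diff(x)))  # 0  ->  u_y = -v_x
```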
Conclusion: a function f = u + iv is complex differentiable at $z_0$ if and only if it is real differentiable there and its Jacobian has the rotation-plus-scaling form, i.e., the Cauchy–Riemann equations $u_x = v_y$, $u_y = -v_x$ hold at $z_0$. Complex differentiability is real differentiability plus the extra rigidity that the local linear map is multiplication by a single complex number.