To err is human, to blame it on someone else is even more human, Jacob’s Law

Complex limits

The Power of Linear Approximation

In multivariable calculus, just as in single-variable calculus, differentiability is not merely about finding a “slope.” It is fundamentally about local linearization.

The central idea is remarkably simple: complicated functions can be locally approximated by simple linear functions. A function $f: \mathbb{R}^n \to \mathbb{R}^m$ is differentiable at a point a if, near a, the function behaves very much like a linear transformation. This linear transformation is the derivative.

Near any point where a function is “well-behaved,” the function looks almost like a straight line (in 1D), a plane (in 2D), or a hyperplane (in higher dimensions).

This insight transforms difficult nonlinear problems into tractable linear ones and provides the foundation for optimization (finding maxima and minima), numerical methods, error analysis and sensitivity, and understanding local behavior of complex systems.

Differentiability at a point

Definition. Differentiability at a point. Let $f: U \subseteq \mathbb{R}^n \to \mathbb{R}^m$ where U is an open set, and let $a \in U$. The function f is differentiable at a if there exists a matrix $Df(x) \in ℝ^{m \times n}$ that satisfies $\lim_{h \to 0} \frac{\|f(a + h) - f(a) - Df(a) \cdot h\|}{\|h\|} = 0$.

$h \in \mathbb{R}^n$ represents a small displacement vector. This matrix Df(x) is called the derivative, differential, or Jacobian matrix of f at a.
The requirement $a \in U \text{ or alternatively }a \in interior(\mathbb{dom(f)})$ ensures that for small h, the point a + h stays in the domain. This is necessary for the limit to be well-defined.
Equivalent Formulations: (i) Little-o notation: $f(a+h) = f(a) + Df(a) \cdot h + o(\|h\|)$; (ii) Error form: $f(a+h) = f(a) + Df(a) \cdot h + \|h\| \cdot \varepsilon(h)$ where $\varepsilon(h) \to 0$; (iii) Affine approximation: $f(a+h) \approx f(a) + Df(a) \cdot h$ with vanishing relative error.
Little-o notation breaks down the change in the function value into three parts:
Base Value: $f(a)$, the function value at the base point for approximation.
Linear Part: $Df(a)h$, the best linear prediction of the change.
Error Term, $o(\|h\|)$, the discrepancy, which vanishes faster than $\|h\|$. This means the linear term Df(a)·h captures the “first-order” (dominant) behavior of f near a.

Differentiable function

Definition. A function $f: U \subseteq \mathbb{R}^n \to \mathbb{R}^m$ is differentiable (on U) if U is an open set and f is differentiable at every point $a \in U$ (every point is an interior, so the limit definition makes sense everywhere and there’s “room” around each point to approach from all directions).

If the derivative exists, it is unique. There is only one matrix $A$ that satisfies the limit condition. This matrix $A$ is the Jacobian.

For a function $f: \mathbb{R}^n \to \mathbb{R}^m$ with component functions $f(x) = (f_1(x), \dots, f_m(x))$, the derivative $Df(x)$ is the $m \times n$ matrix of partial derivatives:

$$Df(x) = J_f(x) = \begin{pmatrix} \frac{\partial f_1}{\partial x_1} & \cdots & \frac{\partial f_1}{\partial x_n} \\ \vdots & \ddots & \vdots \\ \frac{\partial f_m}{\partial x_1} & \cdots & \frac{\partial f_m}{\partial x_n} \end{pmatrix}$$

The First‐Order Approximation

A first-order approximation (also called a linear approximation) provides a linear estimate of a function's value near a specific point using the function's value and its first derivative(s) at that point. Geometrically, it represents the tangent line (in 1D), tangent plane (in 2D), or tangent hyperplane (in higher dimensions) to the function’s graph. The remainder term, often expressed using Landau’s little-o notation (e.g., o(∥h∥)), quantifies the error of this approximation, indicating that the error shrinks faster than the displacement from the point of approximation.

The first-order approximation (or linearization) of $f$ near $a$ is the affine function: $\tilde{f}(x) =[\text{ or } L(x) =] f(a) + Df(a) \cdot (x - a)$.

This is the unique affine function that (i) matches f at a: $\tilde{f}(a) = f(a)$; (ii) has the same “rate of change” as f at a: $D\tilde{f}(a) = Df(a)$

$f(x) = \tilde{f}(x) + r(x)$ where the remainder r(x) (r(x) = o(∥x - a∥)) satisfies $\lim_{x \to a} \frac{\|r(x)\|}{\|x - a\|} = 0$. The error in the linear approximation vanishes (or shrinks) faster than the distance from a (displacement from the point of approximation).

Single-variable case

Case 1: In the single-variable case (f: ℝ → ℝ), the first-order approximation of a function f(x) at a point x = a is given by the equation of the tangent line to the curve y = f(x) at that point. Mathematically, this is expressed as: L(x) = f(a) + f′(a)(x − a).

This linear function L(x) provides an approximation to f(x) for values of x close enough to a. That tangent line approximation is the best linear fit locally. Geometrically, this means we are approximating the curve by a straight line that just touches the curve at (a, f(a)) and has the same slope as the curve at that point.

The tangent line “kisses” the curve. It touches at (a, f(a)) and shares the curve’s instantaneous slope f′(a), then diverges as the curve bends away. Among all lines through (a, f(a)), the tangent line (m = f′(a)) minimizes the error for x near a.

The error of this approximation, E(x) = f(x) − L(x), is what is left over after subtracting the linear approximation from the actual function value.

When f is twice differentiable, Taylor’s theorem (with Lagrange remainder) provides a way to analyze this error term, E(x) = $\frac{f''(\xi)}{2}(x - a)^2$, for some $\xi$ between a and x, indicating that the error shrinks quadratically as x approaches a. Halve the distance, quarter the error. The tangent line hugs the curve increasingly tightly as we zoom in: $|E'(x)| \le \frac{max|f''|}{2}(x -a)^2$.

Example: $f(x) = \sqrt{x}$ near a = 4. f(4) = 2, f’(x) = $\frac{1}{2\sqrt{x}}$, so f’(4) = 1/4.

$L(x) = 2 + \frac{1}{4}(x - 4) = 1 + \frac{x}{4}$.

Approximation: $\sqrt{4.1}$ ≈ L(4.1) = 1 + 4.1/4 = 2.025 (actual: ≈ 2.0248). The approximation is excellent near a = 4 and degrades gracefully as we move away.

Scalar-valued function of two variables

Case 2. For a scalar-valued function of two variables, f: U ⊆ ℝ² → ℝ, f(x, y), the first-order approximation at a point (x₀, y₀) is given by the equation of the tangent plane to the surface z = f(x, y) at that point. Mathematically, this is formulated as: L(x, y) = f(x₀, y₀) + f_x(x₀, y₀)(x - x₀) + f_y(x₀, y₀)(y - y₀)

f(x₀, y₀) is the function value (the “anchor” point on the surface) at the point (x₀, y₀).
f_x(x₀, y₀) and f_y(x₀, y₀) are the partial derivatives of f with respect to x and y, respectively, evaluated at (x₀, y₀). They represent the slopes of the tangent lines to the traces of the surface in planes parallel to the xz-plane and yz-plane, respectively.
f_x(x₀, y₀) is the slope in x-direction (holding y constant).
Tangent line to trace in plane $y = y_0$ (parallel to xz -plane), $z = f(x, y_0) \implies \text{slope} = \frac{\partial f}{\partial x}(x_0, y_0)$.
f_y(x₀, y₀) is the slope in y-direction (holding x constant).
Tangent line to trace in plane $x = x_0$ (parallel to yz -plane), $z = f(x_0, y) \implies \text{slope} = \frac{\partial f}{\partial y}(x_0, y_0)$.
The tangent plane is the unique plane that: (1) passes through $(x_0, y_0, f(x_0, y_0)$, and (2) contains both directional tangent lines.
(x - x₀) and (y - y₀) are the displacements in the x and y directions from the point (x₀, y₀).
Geometrically, the surface z = f(x, y) is approximated by the plane z = L(x, y) that is tangent to the surface at (x₀, y₀, f(x₀, y₀)).
The tangent plane to the surface can be written compactly using the gradient $\nabla f = (f_x, f_y), L(x, y) = f(x_0, y_0) + \nabla f(x_0, y_0) \cdot (\begin{smallmatrix}x - x_0\\\ y-y_0\end{smallmatrix})$

Example: f(x, y) = x² + y² near (1, 1)

Evaluate function and partials. f(1, 1) = 2, $f_x = 2x \implies f_x(1, 1) = 2, f_y = 2y \implies f_y(1, 1) = 2$

Construct tangent plane. $L(x, y) = 2 + 2(x - 1) + 2(y - 1) = 2x + 2y - 2$

Approximation: f(1.1, 0.9) ≈ L(1.1, 0.9) = 2(1.1) + 2(0.9) - 2 = 2. Actual: (1.1)² + (0.9)² = 1.21 + 0.81 = 2.02

Point	$L(x, y)$	$f(x, y)$	Error $E(x, y)$
$(1, 1)$	2	2	0
$(1.1, 0.9)$	2.0	2.02	0.02
$(1.1, 1.1)$	2.4	2.42	0.02
$(0.9, 0.9)$	1.6	1.62	0.02
$(1.5, 1.5)$	4	4.5	0.5

The error grows quadratically with distance from (1,1).

image info

Figure. For f(x, y) = x² + y², the red plane at (1,1) is the Jacobian’s linear approximation.

N Variables

This concept extends naturally to functions of n variables, f: U ⊆ ℝⁿ → ℝ, point a = (a₁, …, aₙ) ∈ U, f(x₁, x₂, ···, xₙ) where the first-order approximation at a point a = f(a₁, a₂, ···, aₙ) is $L(\vec{x}) = f(\vec{a}) + \sum_{i=1}^{n}f_{x_i}(\vec{a})(x_i-a_i) =[\text{Notation}] f(\vec{a}) + \sum_{i=1}^{n}\frac{\partial f}{\partial x_i}(\vec{a})(x_i-a_i)$ where $f_{x_i}(\vec{a}) = \frac{\partial f}{\partial x_i}(\vec{a})$ is the partial derivative of f with respect to x_i evaluated at $\vec{a}$.

Or using gradient notation: $L(\vec{x}) = f(\vec{a}) + \nabla f(\vec{a}) \cdot (\vec{x} - \vec{a})$

This approximation is a tangent hyperplane to the hypersurface in $\mathbb{R}^{n+1}$.

Vector-Valued Functions

For a vector-valued function f: U ⊆ ℝⁿ → ℝᵐ, where $f(\vec{x}) = (f_1(\vec{x}), f_2(\vec{x}), \cdots, f_m(\vec{x}))$, the first-order approximation at a point $\vec{a}$ ∈ ℝⁿ involves the Jacobian matrix of f at $\vec{a}$.

The Jacobian matrix, denoted $\mathbb{Df}(\vec{a})$, is an m×n matrix whose (i,j)-th entry is the partial derivative $\frac{\partial f_i}{\partial x_j}(\vec{a})$. The linear approximation $\mathbb{L}(\vec{x})$ of $\mathbb{f}(\vec{x}) \text{ for } \vec{x} \text{ near } \vec{a}$ is given by: $\mathbb{L}(\vec{x}) = \mathbb{f}(\vec{a}) + \mathbb{Df}(\vec{a})(\vec{x}-\vec{a})$ where:

$\mathbb{f}(\vec{a})$ is the function value (a vector in ℝᵐ) at $\vec{a}$.
$\mathbb{Df}(\vec{a})$ is the Jacobian matrix (an m×n matrix) evaluated at $\vec{a}$.
$\vec{x}-\vec{a}$ is the displacement vector (in ℝⁿ) from $\vec{a}$.
This provides a linear mapping from ℝⁿ to ℝᵐ that best approximates f near $\vec{a}$

The Gradient

Definition. Let f: U ⊆ ℝⁿ → ℝ be differentiable at a. The gradient of f at a is: $\nabla f(\vec{a}) = \begin{pmatrix} \frac{\partial f}{\partial x_1}(\vec{a}) \\[3pt] \frac{\partial f}{\partial x_2}(\vec{a}) \\[3pt] \vdots \\[3pt] \frac{\partial f}{\partial x_n}(\vec{a}) \end{pmatrix}$

For scalar-valued functions, the Jacobian is a 1 × n row vector, and $\nabla f(\vec{a}) = Df(\vec{a})^T$

The first-order approximation can be written as $f(\vec{a} + \vec{h}) \approx f(\vec{a}) + \nabla f(\vec{a}) \cdot \vec{h}$ where $f(\vec{a}) + \nabla f(\vec{a}) \cdot \vec{h}$ is the dot product (inner product). The gradient is the vector that best predicts how f changes near $\vec{a}$. The dot product with $\vec{h}$ tells us how much of $\vec{h}$ points in the “increasing” direction.

Key Properties of the Gradient

$\nabla f(\vec{a})$ points in the direction of steepest increase of f.
$\|\nabla f(\vec{a})\|$ equals the maximum rate of increase.
$\nabla f(\vec{a})$ is perpendicular to level sets of f passing through a.
$\boxed{D_{\vec{u}} f(\vec{a}) = \nabla f(\vec{a}) \cdot \vec{u}}$ for unit vector u.

Why the Gradient Points Uphill? For any unit vector $\vec{u}$, the directional derivative is: $D_{\vec{u}} f(\vec{a}) = \nabla f(\vec{a}) \cdot \vec{u} = \|\nabla f(\vec{a})\| \|\vec{u}\| \cos\theta = \|\nabla f(\vec{a})\| \cos\theta$ where θ is the angle between $\nabla f(\vec{a})$ and $\vec{u}$.

This expression is maximized when $cos(\theta) = 1 \implies$ θ = 0 ($\vec{u}$ parallel to $\nabla f(\vec{a}).$) Thus, $\max_{\|\vec{u}\|=1} D_{\vec{u}} f(a) = \|\nabla f(a)\|$ achieved when $\vec{u} = \frac{\nabla f(a)}{\|\nabla f(a)\|}$. So the gradient is literally the direction in which the function increases fastest.

A level set is: $L_c=\{ \vec {x}:f(\vec {x})=c\}$. If we move along a level set, the value of f doesn’t change. If we define a path $\vec{r}(t)$ that stays entirely within the level set $L_c$ then, $f(\vec{r}(t)) = c \implies \frac{d}{dt}(f(\vec{r}(t))) = 0 \implies \nabla f(\vec{r}(t)) \cdot \vec{r'}(t) = 0$.

Since $\vec{r'}(t)$ represent the tangent vector $\vec{v}$ to the curve at any point, the dot product confirms that the gradient $\nabla f(\vec{a})$ is orthogonal to every tangent direction. Therefore, it is normal to the level set.

Directional Derivatives

Definition. The directional derivative of a function f at a point a in the direction of a unit vector u (i.e., ∥u∥ = 1) is defined as the instantaneous rate of change of the function as one moves away from a in the direction of u.

It is denoted by $D_{\vec{u}} f(\vec{a})$ or $f'_u(a)$ and is given by the formula $D_{\vec{u}} f(\vec{a}) = \lim_{t \to 0} \frac{f(\vec{a} + t\vec{u}) - f(\vec{a})}{t}$ provided this limit exists.

Geometrically, it represents the slope of the tangent line to the curve obtained by intersecting the graph of f with the vertical plane passing through a in the direction of u.

If f is differentiable at a, the directional derivative is given by the dot product of the gradient vector ∇f(a) and the direction vector u: $D_{\vec{u}} f(\vec{a}) = \nabla f(\vec{a}) \cdot \vec{u}$. Crucially, if f is differentiable at a, this formula shows that the gradient vector ∇f(a) contains all the information needed to compute the rate of change of f in ANY direction at a.

The directional derivative is simply the projection of the gradient onto the desired direction u. The partial derivatives f_xᵢ =[Notation] $\frac{\partial f}{\partial x_i}$ are special cases of directional derivatives where the direction u is the standard unit vector (basic vector) in the xᵢ-th coordinate direction, e_i = (0, ···, 1, ···, 0) (a 1 in the i-th position and 0 elsewhere), $\frac{\partial f}{\partial x_i}(\vec{a}) = D_{\vec{e}_i} f(\vec{a}) = \nabla f(\vec{a}) \cdot \vec{e}_i$

The first-order approximation can be used to approximate the value of a function f(a + hu) when moving a small distance h from a point a in the direction of a unit vector u. Using the general first-order approximation formula $f(\vec{a}+\vec{h}) \approx f(\vec{a}) + \nabla f(\vec{a}) \cdot \vec{h}$, and letting $\vec{h} = h\vec{u}$, we get: $f(\vec{a}+\vec{h}) \approx f(\vec{a}) + \nabla f(\vec{a}) \cdot (h\vec{u}) = f(\vec{a}) + h(\nabla f(\vec{a}) \cdot \vec{u})$.

Recognizing that $\nabla f(\vec{a}) \cdot \vec{u}$ is the directional derivative $\mathbb{D_{\vec{u}}f(\vec{a})}$, the approximation becomes $f(\vec{a}+\vec{h}) \approx f(\vec{a}) + h \cdot \mathbb{D_{\vec{u}}f(\vec{a})}$, meaning that the change in the function value $f(\vec{a}+\vec{h})$ is approximately equal to the distance moved (h) times the rate of change of the function in that very direction ($\mathbb{D_{\vec{u}}f(\vec{a})}$).

The accuracy of this approximation depends on the magnitude of $\vec{h}$ and the behavior of the function. If f is differentiable, the error of this approximation will be o($\vec{h}$), meaning it becomes negligible compared to $\vec{h}$ for small $\vec{h}$.

Definition. The affine function $\tilde{f}(x) = f(a) + Df(a)(x -a)$ is called the first-order approximation or linearization of the function f at the point x = a where a must lie in the interior of the domain of f and $Df(a)$ represents the derivative (or Jacobian matrix for multivariable functions) at a. This formula highlights that the approximation is constructed using the function’s value at a and the rate of change of the function at a scaled by the displacement from a.

Geometric Interpretation: Near the point a, the graph of f is well modeled by the tangent hyperplane $\{(x, \tilde f(x)) : x \approx a \}$. Moving a small distance from a in any direction, the change in f is nearly the directional derivative along that direction.

$f(x) = \tilde f(x) + r(x)$, where the remainder r(x) satisfies $\lim_{x \to a} \frac{\|r(x)\|}{\|x - a\|} = 0.$ In Landau notation, r(x) = o(|x - a|). This condition captures the statement that any deviation of f from its tangent hyperplane is negligible compared to the distance $\|x - a\|$.

Definition. Let $f: \real^n \to \real$ be a differentiable scalar-valued function. The derivative $Df(\vec{a})$ is a 1 x n matrix (a row vector). The gradient of f at $\vec{a}$, denoted by $\nabla f(\vec{a})$, is defined as the transpose of the derivative or the vector of partial derivatives: $\nabla f(\vec{a}) = Df(\vec{a})^T$ where $\vec{a} \in Interior(dom(f))$ (the interior of the domain of f). It plays a central role in the first-order approximation and in understanding rates of change. The i-th component of the gradient is given by the partial derivative with respect to the i-th variable, $\nabla f(\vec{a})_i = \frac{\partial f}{\partial x_i}(\vec{a})$, for i = 1, ···, n provided f is differentiable at x, $\nabla f(\vec{a}) = (\frac{\partial f}{\partial x_1}(\vec{a}), \frac{\partial f}{\partial x_2}(\vec{a}), \cdots, \frac{\partial f}{\partial x_n}(\vec{a}))$.

Function Type	Formula $f(x)$	Jacobian $Df(x)$	Gradient $\nabla f(x)$
Linear	$a^T x$	$a^T$	$a$
Matrix-Linear	$Ax$	$A$	N/A (vector-valued)
Affine	$Ax + b$	$A$	N/A (vector-valued)
Squared Norm	$\\|x\\|^2$	$2x^T$	$2x$
Quadratic	$x^T A x$	$x^T(A+A^T)$	$(A+A^T)x$
Symmetric Quad.	$x^T P x$ ($P=P^T$)	$2x^T P$	$2Px$
Least Squares	$\\|Ax - b\\|^2$	$2(Ax-b)^T A$	$2A^T(Ax-b)$

The First-Order Approximation: Linearization and Gradients