"Logic will get you from A to B. Imagination will take you everywhere." (Albert Einstein)

The first-order approximation is one of the most powerful tools in mathematics and its applications. The fundamental insight is simple yet profound: near any point where a function is differentiable, the function behaves approximately like a linear function.
Definition. Let $f: U \subseteq \mathbb{R}^n \to \mathbb{R}$ be differentiable at a point $a \in \operatorname{int}(\operatorname{dom}(f))$. The affine function $\tilde{f}(x) = f(a) + \nabla f(a)^T(x-a)$ is called the first-order or linear approximation of $f$ at $a$.
Unpacking the definition:
The first-order approximation treats the function as linear near a. This is a good approximation when x is very close to a. As x moves further away from a, however, the curvature that the tangent hyperplane ignores typically becomes more significant, and the linear approximation becomes less accurate.
Theorem (Differentiability Characterization). A function $f: U \subseteq \mathbb{R}^n \rightarrow \mathbb{R}$ is differentiable at $a$ if and only if there exists a vector $g \in \mathbb{R}^n$ such that $f(x) = f(a) + g^T(x - a) + r(x)$, with the remainder $r(x)$ satisfying $\lim_{x \to a} \frac{|r(x)|}{\|x - a\|} = 0$.
Proof (sketch). Suppose the decomposition holds; we show that $g$ must be the gradient. Consider the directional derivative of f at a in direction v: $D_vf(a)=\lim _{t\rightarrow 0}\frac{f(a+tv)-f(a)}{t}.$
Using the decomposition: $f(a+tv)-f(a)=g^T(tv)+r(a+tv)=t\, g^Tv+r(a+tv).$
Next, we divide by t and get: $\frac{f(a+tv)-f(a)}{t}=g^Tv+\frac{r(a+tv)}{t}.$
However, $\frac{r(a+tv)}{t}=\frac{r(a+tv)}{\| tv\| }\, \| v\| \rightarrow 0$ because the remainder is little‑o of $\| x-a\|$. Thus, $D_vf(a)=g^Tv$. Now recall the definition of the gradient: $D_vf(a)=\nabla f(a)^Tv.$
Since this holds for every direction v, the only possibility is: $g=\nabla f(a).$
Definition. f is continuously differentiable on an open set U containing a, written $f \in C^1(U)$, if (1) all first-order partial derivatives of f exist on U and (2) these partial derivatives are continuous on U.
Theorem. If $f \in C^1(U)$, then for any $a \in U$: $f(x) = f(a) + \nabla f(a)^T(x - a) + o(\|x - a\|)$
C¹ is sufficient but not necessary for differentiability, as the following classical counterexample shows.
$f(x) = \begin{cases} x^2 \sin(1/x), &x \neq 0 \\\\ 0, &x = 0 \end{cases}$
Analysis:
f′(x) exists for all x; at x = 0, the difference quotient is $f(h)/h = h \sin(1/h) \to 0$, so f′(0) = 0.
For x ≠ 0, the product and chain rules give f′(x) = 2x sin(1/x) − cos(1/x).
Hence, $\lim_{x \to 0} f'(x)$ does not exist (the cos(1/x) term oscillates “wildly” as x approaches zero).
So f’ exists everywhere, but f’ is not continuous at 0 ($f \notin C^1$).
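A short numerical sketch of both claims (the helper names `f` and `fprime` are illustrative, not from the text): the difference quotient at 0 shrinks to 0, while f′(x) keeps oscillating arbitrarily close to 0.

```python
import math

# f(x) = x^2 sin(1/x) for x != 0, f(0) = 0
def f(x):
    return x * x * math.sin(1.0 / x) if x != 0 else 0.0

def fprime(x):
    # closed form of f'(x), valid only for x != 0
    return 2 * x * math.sin(1.0 / x) - math.cos(1.0 / x)

# The difference quotient at 0 is h*sin(1/h), which tends to f'(0) = 0:
for h in (1e-2, 1e-4, 1e-6):
    print(h, f(h) / h)        # magnitude bounded by h

# Yet f'(x) has no limit as x -> 0: the -cos(1/x) term keeps oscillating.
for x in (1e-2, 1e-3, 1e-4):
    print(x, fprime(x))       # values keep swinging within roughly [-1, 1]
```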
Theorem (First-Order Approximation Accuracy). Let $f: U \subseteq \mathbb{R}^n \to \mathbb{R}$ be a real-valued function defined on an open set $U = \operatorname{dom}(f)$ containing a. If f is differentiable on U, then the following statement holds: $\forall x \in U, \ \lim_{d \to 0} \frac{f(x + d) - f(x) - \nabla f(x)^T d}{\|d\|} = 0$
In words, this first-order approximation accuracy theorem states that for a function f differentiable at an arbitrary point x in its domain, the first-order linear approximation $f(x) + \nabla f(x)^T d$ provided by the gradient $\nabla f(x) = \begin{pmatrix}\frac{\partial f}{\partial x_1}\\[3pt] \frac{\partial f}{\partial x_2}\\[3pt] \vdots \\[3pt] \frac{\partial f}{\partial x_n}\end{pmatrix}$ becomes increasingly accurate as the displacement d from x approaches zero.
As the displacement $d$ becomes smaller, the difference between the actual function value $f(x+d)$ and its linear approximation $f(x) + \nabla f(x)^T d$ becomes negligible even compared to the magnitude of the displacement $\|d\|$. In other words, the linear approximation becomes increasingly accurate as we zoom in closer to the point x.
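The limit in the theorem can be probed numerically. The sketch below uses a hypothetical smooth test function $f(x_1, x_2) = x_1^2 + 3x_1x_2$ (chosen for illustration, not from the text) and shrinks the displacement along a fixed direction; the normalized error ratio vanishes linearly in the step size $t$.

```python
import numpy as np

# Hypothetical smooth test function f(x1, x2) = x1^2 + 3*x1*x2
def f(x):
    return x[0] ** 2 + 3 * x[0] * x[1]

def grad_f(x):
    return np.array([2 * x[0] + 3 * x[1], 3 * x[0]])

x = np.array([1.0, 2.0])
d = np.array([0.3, -0.4])                  # fixed direction, ||d|| = 0.5
for t in (1.0, 0.1, 0.01, 0.001):
    dt = t * d
    ratio = (f(x + dt) - f(x) - grad_f(x) @ dt) / np.linalg.norm(dt)
    print(t, ratio)                        # ratio shrinks like t (here -0.54*t)
```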
When we restrict to a function of one real variable, n = 1, f: ℝ → ℝ, the multivariable machinery collapses to familiar single-variable calculus. In this setting, the gradient ∇f(x) is just the ordinary derivative f′(x) and linearization becomes $\tilde{f}(x) = f(a) + f'(a)(x-a),$ the well-known tangent-line approximation from Calculus I.
Example. Let f(x) = sin(x), base point a = 0. True value at 0: f(0) = sin(0) = 0. Derivative at 0: f’(x) = cos(x), so f’(0) = 1.
Linearization about 0: $\tilde{f}(x) = f(a) + f'(a)(x-a) = f(0) + f'(0)(x-0) = 0 + 1 \cdot x = x.$ For small x, sin(x) ≈ x.
Error analysis: f′′(x) = −sin(x), so |f′′(x)| ≤ 1, and the Lagrange remainder gives $|\sin(x) - x| \leq \frac{1}{2}x^2$.
Numerical check at x = 0.1. Approximation: sin(0.1) ≈ 0.1. True value (calculator): sin(0.1) ≈ 0.09983341664. Actual error: ∣0.09983341664−0.1∣=0.00016658336. Error bound: ½(0.1)² = 0.005. The actual error ≈ 0.000167 is much less than the upper bound 0.005.
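The numerical check above can be reproduced in a few lines of Python (a sketch; the printed values match the calculator figures quoted):

```python
import math

x = 0.1
approx = x                        # linearization of sin about a = 0
true_val = math.sin(x)
err = abs(true_val - approx)
bound = 0.5 * x ** 2              # error bound from |f''| <= 1
print(true_val, err, bound)       # ~0.0998334, ~0.000167, 0.005
```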
Two variables ($n = 2$): $f: \mathbb{R}^2 \rightarrow \mathbb{R}$, base point $(a, b)$, gradient $\nabla f(a, b) = (f_x(a, b), f_y(a, b))$.
Linearization: $\tilde{f}(x, y) = f(a, b) + f_x(a, b)(x - a) + f_y(a, b)(y - b)$. This is the tangent plane to the surface z = f(x, y) at the base point (a, b).
Example 1: f(x, y) = x² + y² near (a, b) = (1, 2).
True value: f(1, 2) = 1² + 2² = 5. Gradient: ∇f(x, y) = (2x, 2y), so ∇f(1, 2) = (2, 4).
Linearization about (1, 2): $\tilde{f}(x, y) = f(a, b) + \nabla f(a, b)^T(x - a, y - b) = f(1, 2) + \nabla f(1, 2)^T(x-1, y-2) = 5 + 2(x-1) + 4(y-2)$.
Approximation at (1.1, 1.9): $\tilde{f}(1.1, 1.9) = 5 + 2(0.1) + 4(-0.1) = 5 + 0.2 - 0.4 = 4.8$.
Actual value: f(1.1, 1.9) = 1.1² + 1.9² = 1.21 + 3.61 = 4.82. Error: |4.82 − 4.8| = 0.02.
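A minimal Python sketch of this example (the names `f` and `f_tilde` are illustrative):

```python
# f(x, y) = x^2 + y^2 and its linearization at (1, 2)
def f(x, y):
    return x ** 2 + y ** 2

def f_tilde(x, y):
    # f(1, 2) = 5, gradient (2, 4)
    return 5 + 2 * (x - 1) + 4 * (y - 2)

print(f_tilde(1.1, 1.9))   # ~4.8
print(f(1.1, 1.9))         # ~4.82
```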
Example 2: f(x, y) = xy + eˣ near (0, 1)
f(0, 1) = 0 · 1 + e⁰ = 1. Compute partial derivatives: $f_x(x, y) = y + e^x$, so $f_x(0, 1) = 1 + 1 = 2$; $f_y(x, y) = x$, so $f_y(0, 1) = 0$.
Linearization: $\tilde{f}(x, y) = 1 + 2(x - 0) + 0(y - 1) = 1 + 2x$
Approximation at (0.1, 1.05): $\tilde{f}(0.1, 1.05) = 1 + 2(0.1) = 1.2$
Actual value: f(0.1, 1.05) = $(0.1)(1.05) + e^{0.1} \approx 0.105 + 1.105 = 1.210$. Error ≈ 0.01.
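The same check in Python (a sketch; `f` and `f_tilde` are illustrative names):

```python
import math

def f(x, y):
    return x * y + math.exp(x)

def f_tilde(x, y):
    # linearization at (0, 1): f(0, 1) = 1, f_x = 2, f_y = 0
    return 1 + 2 * x

approx = f_tilde(0.1, 1.05)       # 1.2
actual = f(0.1, 1.05)             # ~1.2102
print(approx, actual, abs(actual - approx))
```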
In general, for $f: \mathbb{R}^n \to \mathbb{R}$ differentiable at $a$, the linearization is: $\boxed{\tilde{f}(x) = f(a) + \sum_{i=1}^{n} \frac{\partial f}{\partial x_i}(a)(x_i - a_i)}$
Or in vector notation: $\tilde{f}(x) = f(a) + \nabla f(a)^T (x - a)$
For $F: \mathbb{R}^n \to \mathbb{R}^m$ with F = (F₁, F₂, …, Fₘ), the first-order approximation uses the Jacobian matrix instead of the gradient.
Jacobian: $DF(a) = J_F(a) = \begin{pmatrix} \frac{\partial F_1}{\partial x_1}(a) & \cdots & \frac{\partial F_1}{\partial x_n}(a) \\ \vdots & \ddots & \vdots \\ \frac{\partial F_m}{\partial x_1}(a) & \cdots & \frac{\partial F_m}{\partial x_n}(a) \end{pmatrix}$
Linearization: $\tilde{F}(x) = F(a) + DF(a)(x - a)$. This is a vector equation: the approximation is an m-vector.
Example: $F(x, y) = (x^2y, e^{x+y})$ near (0, 0)
Values: F(0, 0) = (0, e⁰) = (0, 1). Jacobian: $DF(x, y) = \begin{pmatrix} 2xy & x^2 \\ e^{x+y} & e^{x+y} \end{pmatrix}$. Then, $DF(0, 0) = \begin{pmatrix} 0 & 0 \\ 1 & 1 \end{pmatrix}$
Linearization: $\tilde{F}(x, y) = \begin{pmatrix} 0 \\ 1 \end{pmatrix} + \begin{pmatrix} 0 & 0 \\ 1 & 1 \end{pmatrix} \begin{pmatrix} x \\ y \end{pmatrix} = \begin{pmatrix} 0 \\ 1 + x + y \end{pmatrix}$
Approximation at (0.1, -0.05): $\tilde{F}(0.1, -0.05) = \begin{pmatrix} 0 \\ 1 + 0.1 - 0.05 \end{pmatrix} = \begin{pmatrix} 0 \\ 1.05 \end{pmatrix}$
Actual value: $F(0.1, -0.05) = \begin{pmatrix} (0.1)^2(-0.05) \\ e^{0.05} \end{pmatrix} = \begin{pmatrix} -0.0005 \\ 1.0513 \end{pmatrix}$. Therefore, the error is indeed small in both components.
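The same computation in code (a NumPy sketch; the names `F`, `DF`, and `F_tilde` are illustrative):

```python
import numpy as np

def F(x, y):
    return np.array([x ** 2 * y, np.exp(x + y)])

def DF(x, y):
    # Jacobian of F = (x^2 y, e^{x+y})
    return np.array([[2 * x * y, x ** 2],
                     [np.exp(x + y), np.exp(x + y)]])

a = (0.0, 0.0)
def F_tilde(x, y):
    d = np.array([x - a[0], y - a[1]])
    return F(*a) + DF(*a) @ d

print(F_tilde(0.1, -0.05))   # ~[0, 1.05]
print(F(0.1, -0.05))         # ~[-0.0005, 1.0513]
```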
Definition. The directional derivative of a function f at a point a in the direction of a unit vector u quantifies the instantaneous rate of change of f as we move from a along u. Mathematically, $D_u f(a) = \lim_{t \to 0} \frac{f(a + tu) - f(a)}{t}$
The unit-vector constraint ($\|u\| = 1$) ensures the derivative measures pure directional sensitivity, not scaled by the vector's magnitude.
Relation to gradient. When f is differentiable at a, the directional derivative is computed via the gradient $\nabla f(a)$, the vector of partial derivatives: $D_u f(a) = \nabla f(a) \cdot u = \|\nabla f(a)\| \, \|u\| \cos(\theta) = \|\nabla f(a)\| \cos(\theta)$, where $\theta$ is the angle between $\nabla f(a)$ and u. This dot product reveals that the directional derivative is the scalar projection of the gradient onto u, capturing how steeply f rises or falls in that direction.
For small t, the function’s change is well-approximated by: $f(a + tu) \approx f(a) + \nabla f(a)^T (tu) = f(a) + t \cdot D_u f(a)$. In words, moving a distance t in the u direction changes f by approximately t times the rate of change in the u direction.
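The gradient formula for $D_u f(a)$ can be sanity-checked against the limit definition. The sketch below uses a hypothetical test function $f(x) = x_1^2 + x_2^3$ (chosen for illustration):

```python
import numpy as np

# Hypothetical test function f(x) = x1^2 + x2^3
def f(x):
    return x[0] ** 2 + x[1] ** 3

def grad_f(x):
    return np.array([2 * x[0], 3 * x[1] ** 2])

a = np.array([1.0, 1.0])
u = np.array([3.0, 4.0]) / 5.0        # unit vector: ||u|| = 1

exact = grad_f(a) @ u                 # D_u f(a) = grad f(a) . u = 3.6
t = 1e-6
numeric = (f(a + t * u) - f(a)) / t   # difference quotient from the definition
print(exact, numeric)
```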
The gradient encodes ALL directional derivatives: