"Logic will get you from A to B. Imagination will take you everywhere." (Albert Einstein)

The first-order approximation is one of the most powerful tools in mathematics and its applications. The fundamental insight is simple yet profound: near any point where a function is differentiable, the function behaves approximately like a linear function.
Definition. Let $f: U \subseteq \mathbb{R}^n \to \mathbb{R}$ be differentiable at a point $a \in \operatorname{int}(\operatorname{dom}(f))$. The affine function $\tilde{f}(x) = f(a) + \nabla f(a)^T(x-a)$ is called the first-order or linear approximation of $f$ at $a$.
Unpacking the definition:
The first-order approximation treats the function as linear near a. This is a good approximation when x is very close to a. As x moves further away from a, however, the curvature that the tangent hyperplane ignores typically becomes more significant, and the linear approximation becomes less accurate.
Theorem (Differentiability Characterization). A function $f: U \subseteq \mathbb{R}^n \rightarrow \mathbb{R}$ is differentiable at $a$ if and only if there exists a vector $g \in \mathbb{R}^n$ such that $f(x) = f(a) + g^T(x - a) + r(x)$, with the remainder $r(x)$ satisfying $\lim_{x \to a} \frac{|r(x)|}{\|x - a\|} = 0$.
Proof (sketch). Suppose the decomposition holds; we show that $g$ must be the gradient. Consider the directional derivative of f at a in direction v: $D_vf(a)=\lim _{t\rightarrow 0}\frac{f(a+tv)-f(a)}{t}.$
Using the decomposition: $f(a+tv)-f(a)=g^T(tv)+r(a+tv)=t\, g^Tv+r(a+tv).$
Next, we divide by t and get: $\frac{f(a+tv)-f(a)}{t}=g^Tv+\frac{r(a+tv)}{t}.$
However, $\frac{r(a+tv)}{t}=\frac{r(a+tv)}{\| tv\| }\, \| v\| \rightarrow 0$ because the remainder is little‑o of $\| x-a\|$. Thus, $D_vf(a)=g^Tv$. Now recall the definition of the gradient: $D_vf(a)=\nabla f(a)^Tv.$
Since this holds for every direction v, the only possibility is: $g=\nabla f(a).$
Definition. f is continuously differentiable on an open set U containing a, written $f \in C^1(U)$, if (1) all first-order partial derivatives of f exist on U and (2) these partial derivatives are continuous on U.
Theorem. If $f \in C^1(U)$, then for any $a \in U$: $f(x) = f(a) + \nabla f(a)^T(x - a) + o(\|x - a\|)$
C¹ is sufficient but not necessary for differentiability, as the following classical counterexample shows.
$f(x) = \begin{cases} x^2 \sin(1/x), &x \neq 0 \\\\ 0, &x = 0 \end{cases}$
Analysis:
f′(x) exists for all x; at x = 0, the difference quotient is $f(h)/h = h \sin(1/h) \to 0$, so f′(0) = 0.
For x ≠ 0, the product and chain rules give f′(x) = 2x sin(1/x) − cos(1/x).
Hence, $\lim_{x \to 0} f'(x)$ does not exist (the cos(1/x) term oscillates “wildly” as x approaches zero).
So f’ exists everywhere, but f’ is not continuous at 0 ($f \notin C^1$).
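A short numerical sketch of both claims (the helper names `f` and `fprime` are illustrative, not from the text): the difference quotient at 0 shrinks to 0, while f′(x) keeps oscillating arbitrarily close to 0.

```python
import math

# f(x) = x^2 sin(1/x) for x != 0, f(0) = 0
def f(x):
    return x * x * math.sin(1.0 / x) if x != 0 else 0.0

def fprime(x):
    # closed form of f'(x), valid only for x != 0
    return 2 * x * math.sin(1.0 / x) - math.cos(1.0 / x)

# The difference quotient at 0 is h*sin(1/h), which tends to f'(0) = 0:
for h in (1e-2, 1e-4, 1e-6):
    print(h, f(h) / h)        # magnitude bounded by h

# Yet f'(x) has no limit as x -> 0: the -cos(1/x) term keeps oscillating.
for x in (1e-2, 1e-3, 1e-4):
    print(x, fprime(x))       # values keep swinging within roughly [-1, 1]
```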
Theorem (First-Order Approximation Accuracy). Let $f: U \subseteq \mathbb{R}^n \to \mathbb{R}$ be a real-valued function defined on an open set $U = \operatorname{dom}(f)$ containing a. If f is differentiable on U, then the following statement holds: $\forall x \in U, \ \lim_{d \to 0} \frac{f(x + d) - f(x) - \nabla f(x)^T d}{\|d\|} = 0$
In words, this first-order approximation accuracy theorem states that for a function f differentiable at an arbitrary point x in its domain, the first-order linear approximation $f(x) + \nabla f(x)^T d$ provided by the gradient $\nabla f(x) = \begin{pmatrix}\frac{\partial f}{\partial x_1}\\[3pt] \frac{\partial f}{\partial x_2}\\[3pt] \vdots \\[3pt] \frac{\partial f}{\partial x_n}\end{pmatrix}$ becomes increasingly accurate as the displacement d from x approaches zero.
As the displacement $d$ becomes smaller, the difference between the actual function value $f(x+d)$ and its linear approximation $f(x) + \nabla f(x)^T d$ becomes negligible even compared to the magnitude of the displacement $\|d\|$. In other words, the linear approximation becomes increasingly accurate as we zoom in closer to the point x.
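The limit in the theorem can be probed numerically. The sketch below uses a hypothetical smooth test function $f(x_1, x_2) = x_1^2 + 3x_1x_2$ (chosen for illustration, not from the text) and shrinks the displacement along a fixed direction; the normalized error ratio vanishes linearly in the step size $t$.

```python
import numpy as np

# Hypothetical smooth test function f(x1, x2) = x1^2 + 3*x1*x2
def f(x):
    return x[0] ** 2 + 3 * x[0] * x[1]

def grad_f(x):
    return np.array([2 * x[0] + 3 * x[1], 3 * x[0]])

x = np.array([1.0, 2.0])
d = np.array([0.3, -0.4])                  # fixed direction, ||d|| = 0.5
for t in (1.0, 0.1, 0.01, 0.001):
    dt = t * d
    ratio = (f(x + dt) - f(x) - grad_f(x) @ dt) / np.linalg.norm(dt)
    print(t, ratio)                        # ratio shrinks like t (here -0.54*t)
```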
When we restrict to a function of one real variable, n = 1, f: ℝ → ℝ, the multivariable machinery collapses to familiar single-variable calculus. In this setting, the gradient ∇f(x) is just the ordinary derivative f′(x) and linearization becomes $\tilde{f}(x) = f(a) + f'(a)(x-a),$ the well-known tangent-line approximation from Calculus I.
Example. Let f(x) = sin(x), base point a = 0. True value at 0: f(0) = sin(0) = 0. Derivative at 0: f’(x) = cos(x), so f’(0) = 1.
Linearization about 0: $\tilde{f}(x) = f(a) + f'(a)(x-a) = f(0) + f'(0)(x-0) = 0 + 1 \cdot x = x.$ For small x, sin(x) ≈ x.
Error analysis: f′′(x) = −sin(x), so |f′′(x)| ≤ 1, and the Lagrange remainder gives $|\sin(x) - x| \leq \frac{1}{2}x^2$.
Numerical check at x = 0.1. Approximation: sin(0.1) ≈ 0.1. True value (calculator): sin(0.1) ≈ 0.09983341664. Actual error: ∣0.09983341664−0.1∣=0.00016658336. Error bound: ½(0.1)² = 0.005. The actual error ≈ 0.000167 is much less than the upper bound 0.005.
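The numerical check above can be reproduced in a few lines of Python (a sketch; the printed values match the calculator figures quoted):

```python
import math

x = 0.1
approx = x                        # linearization of sin about a = 0
true_val = math.sin(x)
err = abs(true_val - approx)
bound = 0.5 * x ** 2              # error bound from |f''| <= 1
print(true_val, err, bound)       # ~0.0998334, ~0.000167, 0.005
```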
Two variables ($n = 2$): $f: \mathbb{R}^2 \rightarrow \mathbb{R}$, base point $(a, b)$, gradient $\nabla f(a, b) = (f_x(a, b), f_y(a, b))$.
Linearization: $\tilde{f}(x, y) = f(a, b) + f_x(a, b)(x - a) + f_y(a, b)(y - b)$. This is the tangent plane to the surface z = f(x, y) at the base point (a, b).
Example 1: f(x, y) = x² + y² near (a, b) = (1, 2).
True value: f(1, 2) = 1² + 2² = 5. Gradient: ∇f(x, y) = (2x, 2y), so ∇f(1, 2) = (2, 4).
Linearization about (1, 2): $\tilde{f}(x, y) = f(a, b) + \nabla f(a, b)^T(x - a, y - b) = f(1, 2) + \nabla f(1, 2)^T(x-1, y-2) = 5 + 2(x-1) + 4(y-2)$.
Approximation at (1.1, 1.9): $\tilde{f}(1.1, 1.9) = 5 + 2(0.1) + 4(-0.1) = 5 + 0.2 - 0.4 = 4.8$.
Actual value: f(1.1, 1.9) = 1.1² + 1.9² = 1.21 + 3.61 = 4.82. Error: |4.82 − 4.8| = 0.02.
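A minimal Python sketch of this example (the names `f` and `f_tilde` are illustrative):

```python
# f(x, y) = x^2 + y^2 and its linearization at (1, 2)
def f(x, y):
    return x ** 2 + y ** 2

def f_tilde(x, y):
    # f(1, 2) = 5, gradient (2, 4)
    return 5 + 2 * (x - 1) + 4 * (y - 2)

print(f_tilde(1.1, 1.9))   # ~4.8
print(f(1.1, 1.9))         # ~4.82
```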
Example 2: f(x, y) = xy + eˣ near (0, 1)
f(0, 1) = 0 · 1 + e⁰ = 1. Compute partial derivatives: $f_x(x, y) = y + e^x$, so $f_x(0, 1) = 1 + 1 = 2$; $f_y(x, y) = x$, so $f_y(0, 1) = 0$.
Linearization: $\tilde{f}(x, y) = 1 + 2(x - 0) + 0(y - 1) = 1 + 2x$
Approximation at (0.1, 1.05): $\tilde{f}(0.1, 1.05) = 1 + 2(0.1) = 1.2$
Actual value: f(0.1, 1.05) = $(0.1)(1.05) + e^{0.1} \approx 0.105 + 1.105 = 1.210$. Error ≈ 0.01.
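The same check in Python (a sketch; `f` and `f_tilde` are illustrative names):

```python
import math

def f(x, y):
    return x * y + math.exp(x)

def f_tilde(x, y):
    # linearization at (0, 1): f(0, 1) = 1, f_x = 2, f_y = 0
    return 1 + 2 * x

approx = f_tilde(0.1, 1.05)       # 1.2
actual = f(0.1, 1.05)             # ~1.2102
print(approx, actual, abs(actual - approx))
```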
In general, for $f: \mathbb{R}^n \to \mathbb{R}$ differentiable at $a$, the linearization is: $\boxed{\tilde{f}(x) = f(a) + \sum_{i=1}^{n} \frac{\partial f}{\partial x_i}(a)(x_i - a_i)}$
Or in vector notation: $\tilde{f}(x) = f(a) + \nabla f(a)^T (x - a)$
For $F: \mathbb{R}^n \to \mathbb{R}^m$ with F = (F₁, F₂, …, Fₘ), the first-order approximation uses the Jacobian matrix instead of the gradient.
Jacobian: $DF(a) = J_F(a) = \begin{pmatrix} \frac{\partial F_1}{\partial x_1}(a) & \cdots & \frac{\partial F_1}{\partial x_n}(a) \\ \vdots & \ddots & \vdots \\ \frac{\partial F_m}{\partial x_1}(a) & \cdots & \frac{\partial F_m}{\partial x_n}(a) \end{pmatrix}$
Linearization: $\tilde{F}(x) = F(a) + DF(a)(x - a)$. This is a vector equation: the approximation is an m-vector.
Example: $F(x, y) = (x^2y, e^{x+y})$ near (0, 0)
Values: F(0, 0) = (0, e⁰) = (0, 1). Jacobian: $DF(x, y) = \begin{pmatrix} 2xy & x^2 \\ e^{x+y} & e^{x+y} \end{pmatrix}$. Then, $DF(0, 0) = \begin{pmatrix} 0 & 0 \\ 1 & 1 \end{pmatrix}$
Linearization: $\tilde{F}(x, y) = \begin{pmatrix} 0 \\ 1 \end{pmatrix} + \begin{pmatrix} 0 & 0 \\ 1 & 1 \end{pmatrix} \begin{pmatrix} x \\ y \end{pmatrix} = \begin{pmatrix} 0 \\ 1 + x + y \end{pmatrix}$
Approximation at (0.1, -0.05): $\tilde{F}(0.1, -0.05) = \begin{pmatrix} 0 \\ 1 + 0.1 - 0.05 \end{pmatrix} = \begin{pmatrix} 0 \\ 1.05 \end{pmatrix}$
Actual value: $F(0.1, -0.05) = \begin{pmatrix} (0.1)^2(-0.05) \\ e^{0.05} \end{pmatrix} = \begin{pmatrix} -0.0005 \\ 1.0513 \end{pmatrix}$. Therefore, the error is indeed small in both components.
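The same computation in code (a NumPy sketch; the names `F`, `DF`, and `F_tilde` are illustrative):

```python
import numpy as np

def F(x, y):
    return np.array([x ** 2 * y, np.exp(x + y)])

def DF(x, y):
    # Jacobian of F = (x^2 y, e^{x+y})
    return np.array([[2 * x * y, x ** 2],
                     [np.exp(x + y), np.exp(x + y)]])

a = (0.0, 0.0)
def F_tilde(x, y):
    d = np.array([x - a[0], y - a[1]])
    return F(*a) + DF(*a) @ d

print(F_tilde(0.1, -0.05))   # ~[0, 1.05]
print(F(0.1, -0.05))         # ~[-0.0005, 1.0513]
```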
Definition. The directional derivative of a function f at a point a in the direction of a unit vector u quantifies the instantaneous rate of change of f as we move from a along u. Mathematically, $D_u f(a) = \lim_{t \to 0} \frac{f(a + tu) - f(a)}{t}$
The unit-vector constraint ($\|u\| = 1$) ensures the derivative measures pure directional sensitivity, not scaled by the vector's magnitude.
Relation to gradient. When f is differentiable at a, the directional derivative is computed via the gradient $\nabla f(a)$, the vector of partial derivatives: $D_u f(a) = \nabla f(a) \cdot u = \|\nabla f(a)\| \, \|u\| \cos(\theta) = \|\nabla f(a)\| \cos(\theta)$, where $\theta$ is the angle between $\nabla f(a)$ and u. This dot product reveals that the directional derivative is the scalar projection of the gradient onto u, capturing how steeply f rises or falls in that direction.
For small t, the function’s change is well-approximated by: $f(a + tu) \approx f(a) + \nabla f(a)^T (tu) = f(a) + t \cdot D_u f(a)$. In words, moving a distance t in the u direction changes f by approximately t times the rate of change in the u direction.
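The gradient formula for $D_u f(a)$ can be sanity-checked against the limit definition. The sketch below uses a hypothetical test function $f(x) = x_1^2 + x_2^3$ (chosen for illustration):

```python
import numpy as np

# Hypothetical test function f(x) = x1^2 + x2^3
def f(x):
    return x[0] ** 2 + x[1] ** 3

def grad_f(x):
    return np.array([2 * x[0], 3 * x[1] ** 2])

a = np.array([1.0, 1.0])
u = np.array([3.0, 4.0]) / 5.0        # unit vector: ||u|| = 1

exact = grad_f(a) @ u                 # D_u f(a) = grad f(a) . u = 3.6
t = 1e-6
numeric = (f(a + t * u) - f(a)) / t   # difference quotient from the definition
print(exact, numeric)
```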
The gradient encodes ALL directional derivatives: