
The First-Order Approximation 2: Linearization and Gradients

"Logic will get you from A to B. Imagination will take you everywhere." (Albert Einstein)


Motivation

The first-order approximation is one of the most powerful tools in mathematics and its applications. The fundamental insight is simple yet profound: near any point where a function is differentiable, the function behaves approximately like a linear function.

First-Order Approximation

Definition. Let $f: U \subseteq \mathbb{R}^n \to \mathbb{R}$ be differentiable at a point $a \in int(dom(f))$. The affine function $\tilde{f}(x) = f(a) + \nabla f(a)^T(x-a)$ is called the first-order or linear approximation of f at a.

Unpacking the definition: $f(a)$ anchors the approximation at the base point, and $\nabla f(a)^T(x-a)$ adds the linear correction determined by the gradient. Strictly speaking, $\tilde{f}$ is affine (a linear map plus a constant), but "linear approximation" is the standard name.

The Approximation Property

Theorem (Differentiability Characterization). A function $f:U\subseteq \mathbb{R}^n\rightarrow \mathbb{R}$ is differentiable at a if and only if there exists a vector $g\in \mathbb{R}^n$ such that $f(x) = f(a) + g^T(x - a) + r(x)$ with the remainder r(x) satisfying $\lim_{x \to a} \frac{|r(x)|}{\|x - a\|} = 0.$

  1. This is the multivariable analogue of the single‑variable idea: $f(x)=f(a)+f'(a)(x-a)+o(|x-a|).$
  2. Differentiability at a means: (i) Near a, the function behaves almost exactly like a linear function. (ii) The vector g is the best linear approximation to f at a. (iii) The error term r(x) becomes negligible faster than $\| x-a\|$.
  3. In other words, the function is differentiable at a precisely when you can write: $f(x)=\mathrm{(value\ at\ }a)+\mathrm{(linear\ part)}+\mathrm{(tiny\ error)}.$
  4. We write r(x) = o(∥x - a∥) to mean: $\lim_{x \to a} \frac{|r(x)|}{\|x - a\|} = 0$.
    The remainder r(x) shrinks faster (it becomes negligible compared to the distance) than the distance ∥x - a∥.
    Using this notation: $f(x) = f(a) + \nabla f(a)^T(x - a) + o(\|x - a\|)$
    Or with h = x - a: $f(a + h) = f(a) + \nabla f(a)^T h + o(\|h\|)$
  5. When this holds, g = ∇f(a).
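The little-o property in item 4 can be checked numerically. The sketch below (not from the original text; the test function $f(x,y)=e^x\sin y$ and the base point are my own choices) shrinks a displacement $h$ along a fixed direction and watches the ratio $|r(h)|/\|h\|$ go to zero:

```python
import math

# A smooth test function and its hand-computed gradient; any C^1 f works.
def f(x, y):
    return math.exp(x) * math.sin(y)

def grad_f(x, y):
    return (math.exp(x) * math.sin(y), math.exp(x) * math.cos(y))

a = (0.3, 0.7)
fa = f(*a)
gx, gy = grad_f(*a)

# Shrink the displacement h and record |r(h)| / ||h||.
ratios = []
for k in range(1, 6):
    h = (10**-k, 2 * 10**-k)              # fixed direction, scaled down
    linear = fa + gx * h[0] + gy * h[1]   # first-order approximation
    r = f(a[0] + h[0], a[1] + h[1]) - linear   # remainder r(h)
    ratios.append(abs(r) / math.hypot(*h))

print(ratios)  # decreases toward 0, confirming r(h) = o(||h||)
```

Each tenfold shrink of $h$ shrinks the ratio by roughly a factor of ten, consistent with the remainder being of second order in $\|h\|$.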

Why g must equal ∇f(a): suppose the decomposition holds. Consider the directional derivative of f at a in direction v: $D_vf(a)=\lim _{t\rightarrow 0}\frac{f(a+tv)-f(a)}{t}.$

Using the decomposition: $f(a+tv)-f(a)=g^T(tv)+r(a+tv)=t\, g^Tv+r(a+tv).$

Next, we divide by t and get: $\frac{f(a+tv)-f(a)}{t}=g^Tv+\frac{r(a+tv)}{t}.$

However, $\frac{r(a+tv)}{t}=\frac{r(a+tv)}{\| tv\| }\, \| v\| \rightarrow 0$ because the remainder is little‑o of $\| x-a\|$. Thus, $D_vf(a)=g^Tv$. Now recall the definition of the gradient: $D_vf(a)=\nabla f(a)^Tv.$

Since this holds for every direction v, the only possibility is: $g=\nabla f(a).$

The C¹ Condition

Definition. f is continuously differentiable on an open set U containing a, written $f \in C^1(U)$, if (1) all first-order partial derivatives of f exist on U and (2) these partial derivatives are continuous on U.

Theorem. If $f \in C^1(U)$, then for any $a \in U$: $f(x) = f(a) + \nabla f(a)^T(x - a) + o(\|x - a\|)$

C¹ is sufficient but not necessary for differentiability. The classic counterexample:

$f(x) = \begin{cases} x^2 \sin(1/x), &x \neq 0 \\\\ 0, &x = 0 \end{cases}$

Analysis:
f’(x) exists for all x (including x = 0, where f’(0) = 0).
However, f’(x) = 2x sin(1/x) - cos(1/x) for x ≠ 0.
Hence, $\lim_{x \to 0} f'(x)$ does not exist (the cos(1/x) term oscillates “wildly” as x approaches zero).
So f’ exists everywhere, but f’ is not continuous at 0 ($f \notin C^1$).
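The oscillation of f’ near 0 is easy to exhibit numerically. This sketch (my own illustration, not from the original text) samples f’ at the points $x_k = 1/(k\pi)$, where $\cos(1/x_k) = (-1)^k$, so f’ keeps alternating between values near −1 and +1 however close we get to 0:

```python
import math

def fprime(x):
    # f'(x) = 2x sin(1/x) - cos(1/x) for x != 0; f'(0) = 0 by the limit definition
    if x == 0.0:
        return 0.0
    return 2 * x * math.sin(1 / x) - math.cos(1 / x)

# At x_k = 1/(k*pi) we have cos(1/x_k) = (-1)^k, so f'(x_k) alternates
# near -1 and +1 no matter how small x_k becomes.
samples = [fprime(1 / (k * math.pi)) for k in range(1000, 1006)]
print(samples)
```

Since the samples do not settle toward a single value as $x_k \to 0$, $\lim_{x \to 0} f'(x)$ cannot exist, confirming that f’ is discontinuous at 0.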

Equivalence to the Derivative-Definition Limit

Theorem (First-Order Approximation Accuracy). Let $f: U \subseteq \mathbb{R}^n \to \mathbb{R}$ be a real-valued function defined on an open set $U = dom(f)$ containing a. If f is differentiable on U, then the following statement holds true: $\forall x \in U, \lim_{d \to 0} \frac{f(x + d) - f(x) - \nabla f(x)^T d}{\|d\|} = 0$

In words, this first-order approximation accuracy theorem states that for a function f differentiable at an arbitrary point x of its domain, the first-order linear approximation $f(x) + \nabla f(x)^Td$, built from the gradient $\nabla f(x) = \begin{pmatrix}\frac{\partial f}{\partial x_1}\\[3pt] \frac{\partial f}{\partial x_2}\\[3pt] \vdots \\[3pt] \frac{\partial f}{\partial x_n}\end{pmatrix}$, becomes increasingly accurate as the displacement d from x approaches zero.

As the displacement d shrinks, the difference between the actual function value $f(x+d)$ and its linear approximation $f(x) + \nabla f(x)^Td$ becomes negligible compared to the magnitude of the displacement ∣∣d∣∣. In other words, the linear approximation becomes increasingly accurate as we zoom in closer to the point x.

Single‐Variable Special Case

When we restrict to a function of one real variable, n = 1, f: ℝ → ℝ, the multivariable machinery collapses to familiar single-variable calculus. In this setting, the gradient ∇f(x) is just the ordinary derivative f′(x) and linearization becomes $\tilde{f}(x) = f(a) + f'(a)(x-a),$ the well-known tangent-line approximation from Calculus I.


Example. Let f(x) = sin(x), base point a = 0. True value at 0: f(0) = sin(0) = 0. Derivative at 0: f’(x) = cos(x), so f’(0) = 1.

Linearization about 0: $\tilde{f}(x) = f(a) + f'(a)(x-a) =[a = 0] f(0) + f'(0)(x-0) = 0 + 1·x = x.$ For small x, sin(x) ≈ x.

Error analysis: f’’(x) = -sin(x), so |f’’(x)| ≤ 1. $|\sin(x) - x| \leq \frac{1}{2}x^2$

Numerical check at x = 0.1. Approximation: sin(0.1) ≈ 0.1. True value (calculator): sin(0.1) ≈ 0.09983341664. Actual error: ∣0.09983341664−0.1∣=0.00016658336. Error bound: ½(0.1)² = 0.005. The actual error ≈ 0.000167 is much less than the upper bound 0.005.
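The numerical check above can be reproduced in a few lines of code (a sanity-check sketch; the numbers are exactly those in the text):

```python
import math

x = 0.1
approx = x                      # linearization of sin at 0: sin(x) ~ x
true_val = math.sin(x)
actual_error = abs(true_val - approx)
bound = 0.5 * x**2              # from |f''| <= 1: |sin x - x| <= x^2 / 2

print(approx, true_val, actual_error, bound)
```

The printed actual error (about 0.000167) sits well inside the guaranteed bound 0.005, as claimed.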

Multi-Variable Case

Two Variables: f(x, y)

$f: \mathbb{R}^2 \rightarrow \mathbb{R}$. Base point: (a, b) and gradient: $\nabla f(a, b) = (f_x(a, b), f_y(a, b))$

Linearization: $\tilde{f}(x, y) = f(a, b) + f_x(a, b)(x - a) + f_y(a, b)(y - b)$. This is the tangent plane to the surface z = f(x, y) at the base point (a, b).

Example 1: f(x, y) = x² + y² near (a, b) = (1, 2).

True value: f(1, 2) = 1² + 2² = 5. Gradient: ∇f(x, y) = (2x, 2y), so ∇f(1, 2) = (2, 4).

Linearization about (1, 2): $\tilde{f}(x, y) = f(a)+ \nabla f(a)^T(x-a) = f(1, 2) + \nabla f(1, 2)^T \begin{pmatrix} x-1 \\ y-2 \end{pmatrix} = $ 5 + 2(x−1) + 4(y−2).

Approximation at (1.1, 1.9): $\tilde{f}(1.1, 1.9)$ = 5 + 2(0.1) + 4(−0.1) = 5 + 0.2 − 0.4 = 4.8.

Actual value: f(1.1, 1.9) = 1.1² + 1.9² = 1.21 + 3.61 = 4.82. Error: |4.82 − 4.8| = 0.02.
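As a quick verification of Example 1 (a sketch mirroring the numbers above, not from the original text):

```python
def f(x, y):
    return x**2 + y**2

def f_tilde(x, y):
    # tangent plane at (1, 2): f(1, 2) = 5, grad f(1, 2) = (2, 4)
    return 5 + 2 * (x - 1) + 4 * (y - 2)

approx = f_tilde(1.1, 1.9)
actual = f(1.1, 1.9)
print(approx, actual, abs(actual - approx))  # 4.8, 4.82, error 0.02
```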

Example 2: f(x, y) = xy + eˣ near (0, 1)

f(0, 1) = 0 · 1 + e⁰ = 1. Compute partial derivatives: $f_x(x, y) = y + e^x$, so $f_x(0, 1) = 1 + 1 = 2$; $f_y(x, y) = x$, so $f_y(0, 1) = 0$.

Linearization: $\tilde{f}(x, y) = 1 + 2(x - 0) + 0(y - 1) = 1 + 2x$

Approximation at (0.1, 1.05): $\tilde{f}(0.1, 1.05) = 1 + 2(0.1) = 1.2$

Actual value: f(0.1, 1.05) = $(0.1)(1.05) + e^{0.1} \approx 0.105 + 1.105 = 1.210$. Error ≈ 0.01.
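Example 2 checks out the same way (again a verification sketch of the numbers above):

```python
import math

def f(x, y):
    return x * y + math.exp(x)

def f_tilde(x, y):
    # linearization at (0, 1): f(0, 1) = 1, grad f(0, 1) = (2, 0)
    return 1 + 2 * x

approx = f_tilde(0.1, 1.05)
actual = f(0.1, 1.05)
print(approx, actual, abs(actual - approx))  # error about 0.01
```

Note that the y-displacement contributes nothing to the approximation here, because $f_y(0, 1) = 0$.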

General n Variables

Linearization: $\boxed{\tilde{f}(x) = f(a) + \sum_{i=1}^{n} \frac{\partial f}{\partial x_i}(a)(x_i - a_i)}$

Or in vector notation: $\tilde{f}(x) = f(a) + \nabla f(a)^T (x - a)$

Vector-Valued Functions

For $F: \mathbb{R}^n \to \mathbb{R}^m$ with F = (F₁, F₂, …, Fₘ), the first-order approximation uses the Jacobian matrix instead of the gradient.

Jacobian: $DF(a) = J_F(a) = \begin{pmatrix} \frac{\partial F_1}{\partial x_1}(a) & \cdots & \frac{\partial F_1}{\partial x_n}(a) \\ \vdots & \ddots & \vdots \\ \frac{\partial F_m}{\partial x_1}(a) & \cdots & \frac{\partial F_m}{\partial x_n}(a) \end{pmatrix}$

Linearization: $\tilde{F}(x) = F(a) + DF(a)(x - a)$. This is a vector equation: the approximation is an m-vector.

Example: $F(x, y) = (x^2y, e^{x+y})$ near (0, 0)

Values: F(0, 0) = (0, e⁰) = (0, 1). Jacobian: $DF(x, y) = \begin{pmatrix} 2xy & x^2 \\ e^{x+y} & e^{x+y} \end{pmatrix}$. Then, $DF(0, 0) = \begin{pmatrix} 0 & 0 \\ 1 & 1 \end{pmatrix}$

Linearization: $\tilde{F}(x, y) = \begin{pmatrix} 0 \\ 1 \end{pmatrix} + \begin{pmatrix} 0 & 0 \\ 1 & 1 \end{pmatrix} \begin{pmatrix} x \\ y \end{pmatrix} = \begin{pmatrix} 0 \\ 1 + x + y \end{pmatrix}$

Approximation at (0.1, -0.05): $\tilde{F}(0.1, -0.05) = \begin{pmatrix} 0 \\ 1 + 0.1 - 0.05 \end{pmatrix} = \begin{pmatrix} 0 \\ 1.05 \end{pmatrix}$

Actual value: $F(0.1, -0.05) = \begin{pmatrix} (0.1)^2(-0.05) \\ e^{0.05} \end{pmatrix} = \begin{pmatrix} -0.0005 \\ 1.0513 \end{pmatrix}$. Therefore, the error is indeed small in both components.
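The Jacobian example can also be checked componentwise. In this sketch (my own illustration of the computation above, kept dependency-free) the rows of $DF(0,0)$ are written out explicitly in the linearization:

```python
import math

def F(x, y):
    # F(x, y) = (x^2 y, e^{x+y})
    return (x**2 * y, math.exp(x + y))

def F_tilde(x, y):
    # linearization at (0, 0): F(0, 0) = (0, 1), DF(0, 0) = [[0, 0], [1, 1]]
    return (0.0 + 0.0 * x + 0.0 * y,      # first row of DF(0, 0)
            1.0 + 1.0 * x + 1.0 * y)      # second row of DF(0, 0)

approx = F_tilde(0.1, -0.05)
actual = F(0.1, -0.05)
errors = tuple(abs(a - b) for a, b in zip(actual, approx))
print(approx, actual, errors)   # small error in both components
```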

Directional Derivatives and the Gradient

Definition. The directional derivative of a function f at a point a in the direction of a unit vector u quantifies the instantaneous rate of change of f as we move from a along u. Mathematically, $D_u f(a) = \lim_{t \to 0} \frac{f(a + tu) - f(a)}{t}$

The unit vector constraint ($\|u\| = 1$) ensures the derivative measures pure directional sensitivity, not scaled by the vector’s magnitude.

Relation to gradient. When f is differentiable at a, the directional derivative simplifies via the gradient $\nabla f(a)$, a vector of partial derivatives: $D_u f(a) = \nabla f(a) \cdot u = \|\nabla f(a)\| \|u\| \cos(\theta) = \|\nabla f(a)\|\cos(\theta)$ where $\theta$ is the angle between $\nabla f(a)$ and u. This dot product reveals that the directional derivative is the projection of the gradient onto u, capturing how steeply f rises or falls in that direction.

For small t, the function’s change is well-approximated by: $f(a + tu) \approx f(a) + \nabla f(a)^T (tu) = f(a) + t \cdot D_u f(a)$. In words, moving a distance t in the u direction changes f by approximately t times the rate of change in the u direction.

The gradient encodes ALL directional derivatives: knowing the single vector $\nabla f(a)$ determines $D_u f(a)$ for every direction u at once.
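The identity $D_u f(a) = \nabla f(a) \cdot u$ can be confirmed against the limit definition by a finite-difference quotient. This sketch (my own choice of function, point, and direction, not from the original text) does both for $f(x, y) = x^2 + y^2$:

```python
def f(x, y):
    return x**2 + y**2

def grad_f(x, y):
    return (2 * x, 2 * y)

a = (1.0, 2.0)
u = (3 / 5, 4 / 5)                    # a unit vector: ||u|| = 1

# Directional derivative via the gradient: D_u f(a) = grad f(a) . u
gx, gy = grad_f(*a)
D_u = gx * u[0] + gy * u[1]

# Finite-difference check of the limit definition with a small t
t = 1e-6
fd = (f(a[0] + t * u[0], a[1] + t * u[1]) - f(*a)) / t

print(D_u, fd)  # both close to 2*(3/5) + 4*(4/5) = 4.4
```

The two values agree to within roughly $t$, which is exactly the first-order accuracy the approximation $f(a + tu) \approx f(a) + t\, D_u f(a)$ predicts.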
