"To err is human, to blame it on someone else is even more human," Jacob's Law.

In one variable, limits and continuity follow the familiar ε–δ pattern. In ℂ we use disks (or balls) instead of intervals, but the ideas carry over verbatim. Then, in ℝⁿ→ℝᵐ, “differentiability” becomes the existence of a unique linear map (the Jacobian) giving the best first‐order approximation. This article builds that bridge between different levels of abstraction in Calculus step by step.
A complex number is specified by an ordered pair of real numbers (a, b) ∈ ℝ² and written in the form z = a + bi, where a and b are real numbers and i is the imaginary unit, defined by the property i² = −1 (often written informally as i = $\sqrt{-1}$), e.g., 2 + 5i, $7\pi + i\sqrt{2}$. Thus, ℂ = { a + bi ∣ a, b ∈ ℝ }.
Definition. Let D ⊆ ℂ be a set of complex numbers. A complex-valued function f of a complex variable, defined on D, is a rule that assigns to each complex number z belonging to the set D a unique complex number w; we write f: D ➞ ℂ.
We often call the elements of D points. If z = x + iy ∈ D, then f(z) is called the image of the point z under f. The notation f: D ➞ ℂ means that f is a complex function with domain D. We often write f(z) = u(x, y) + iv(x, y), where u, v: ℝ² → ℝ are the real and imaginary parts of f.
Definition. Let D ⊆ ℂ, let f: D → ℂ be a function, and let z₀ be a limit point of D (points of D lie arbitrarily close to z₀, though possibly z₀ ∉ D). A complex number L is said to be a limit of the function f as z approaches z₀, written $\lim_{z \to z_0} f(z)=L$, if for every ε > 0 there exists a corresponding δ > 0 such that |f(z) - L| < ε whenever z ∈ D and 0 < |z - z₀| < δ.
Why 0 < |z - z₀|? We exclude z = z₀ itself because the limit concerns values of f near z₀, not at z₀. When z₀ ∉ D, f(z₀) cannot even be evaluated, so only the approach matters. When z₀ ∈ D, we still only care about the function's nearby behavior; this is what separates "limit" from "value."
Equivalently: ∀ε > 0, ∃δ > 0 such that whenever z ∈ D ∩ B′(z₀; δ) we have f(z) ∈ B(L; ε), i.e., f(D ∩ B′(z₀; δ)) ⊆ B(L; ε), where B′(z₀; δ) = B(z₀; δ) ∖ {z₀} denotes the punctured disk.
If no such L exists, we say that f(z) does not have a limit as z approaches z₀. This is exactly the ε–δ formulation we know from real calculus, except that z and L now live in the complex plane ℂ, and neighborhoods are round disks rather than intervals.
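To make the ε–δ statement concrete, here is a minimal numerical sketch in Python (the sample function f(z) = (z² - 1)/(z - 1) and the sampling scheme are illustrative choices, not part of the definition). The function is undefined at z₀ = 1, yet its limit there is L = 2, and shrinking δ visibly shrinks the worst |f(z) - L| over the punctured disk.

```python
import numpy as np

# Probe the limit of f(z) = (z^2 - 1)/(z - 1) as z -> 1. The function is
# undefined at z = 1, but f(z) = z + 1 elsewhere, so the limit is L = 2.
f = lambda z: (z**2 - 1) / (z - 1)
z0, L = 1.0, 2.0

for delta in [1e-1, 1e-2, 1e-3]:
    # Sample points on a circle inside the punctured disk 0 < |z - z0| < delta.
    angles = np.linspace(0, 2 * np.pi, 8, endpoint=False)
    zs = z0 + 0.5 * delta * np.exp(1j * angles)
    worst = max(abs(f(z) - L) for z in zs)
    print(f"delta = {delta:.0e}: max |f(z) - L| = {worst:.2e}")
```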
Definition. Let D ⊆ ℂ. A function f: D → ℂ is said to be continuous at a point z₀ ∈ D if for any arbitrarily small ε > 0 there is a corresponding δ > 0 such that |f(z) - f(z₀)| < ε whenever z ∈ D and |z - z₀| < δ.
In words, arbitrarily small output changes ε can be guaranteed by restricting z to lie in a sufficiently small disk of radius δ around z₀.
Alternative (Sequential) Definition. Let D ⊆ ℂ. A function f: D → ℂ is said to be continuous at a point z₀ ∈ D if for every sequence $\{z_n\}_{n=1}^{\infty}$ with $z_n \in D$ for all n ∈ ℕ and $z_n \to z_0$, we have $\lim_{n \to \infty} f(z_n) = f(z_0)$.
Definition. A function f: D → ℂ is said to be continuous if it is continuous at every point of its domain (∀z₀ ∈ D).
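A quick check of the sequential definition in the same spirit (the function f(z) = z² and the geometric sequence below are my illustrative choices):

```python
# Sequential continuity: for zn -> z0 we need f(zn) -> f(z0).
# Sample: f(z) = z^2 at z0 = 1 + 1j, with zn = z0 + (0.5 + 0.5j)/2^n.
f = lambda z: z**2
z0 = 1 + 1j

for n in (1, 8, 15, 22):
    zn = z0 + (0.5 + 0.5j) / 2**n
    print(f"|zn - z0| = {abs(zn - z0):.2e}, |f(zn) - f(z0)| = {abs(f(zn) - f(z0)):.2e}")
```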
Definition. Differentiability at a point. Let $f : ℝ^n \to ℝ^m$ be a function and let x be an interior point of the domain of f, $x \in \text{int}(\text{dom} f)$. The function f is differentiable at x if there exists a matrix $Df(x) \in ℝ^{m \times n}$ that satisfies $\lim_{\substack{z \in \text{dom} f,\ z \neq x \\ z \to x}} \frac{\|f(z) - f(x) - Df(x)(z-x)\|_2}{\|z-x\|_2} = 0$ [*]
This matrix Df(x) is called the derivative or the Jacobian matrix of f at the point x.
Definition. A function f is called differentiable if its domain dom(f) ⊆ ℝⁿ is open and f is differentiable at every point of its domain (∀x ∈ dom(f)).
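The defining limit [*] can be probed numerically. A minimal sketch, assuming the sample map f(x, y) = (x² + y², xy) and its hand-computed Jacobian at x = (1, 1); the ratio should tend to 0 as z → x:

```python
import numpy as np

# f(x, y) = (x^2 + y^2, x*y); its Jacobian has rows of partials
# [[2x, 2y], [y, x]], which at (1, 1) is [[2, 2], [1, 1]].
def f(v):
    x, y = v
    return np.array([x**2 + y**2, x * y])

x = np.array([1.0, 1.0])
Df = np.array([[2.0, 2.0],
               [1.0, 1.0]])

rng = np.random.default_rng(0)
for t in [1e-1, 1e-3, 1e-5]:
    z = x + t * rng.standard_normal(2)          # a point near x
    ratio = np.linalg.norm(f(z) - f(x) - Df @ (z - x)) / np.linalg.norm(z - x)
    print(f"|z - x| ~ {t:.0e}: ratio = {ratio:.2e}")
```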

Figure. For f(x,y)=x²+y², the red plane at (1,1) is the Jacobian’s linear approximation.
A first-order approximation (also called a linear approximation) provides a linear estimate of a function's value near a specific point using the function's value and its first derivative(s) at that point. Geometrically, it represents the tangent line (in 1D), tangent plane (in 2D), or tangent hyperplane (in higher dimensions) to the function’s graph. The remainder term, often expressed using Landau’s little-o notation (e.g., o(∥h∥)), quantifies the error of this approximation, indicating that the error shrinks faster than the displacement from the point of approximation.
1️⃣ In the single-variable case, the first-order approximation of a function f(x) at a point x = a is given by the equation of the tangent line to the curve y = f(x) at that point: L(x) = f(a) + f′(a)(x−a). This linear function L(x) approximates f(x) for values of x close enough to a; the tangent line is the best linear fit locally. Geometrically, we are approximating the curve by a straight line that touches the curve at (a, f(a)) and has the same slope as the curve at that point. The error of this approximation, E(x) = f(x) − L(x), is what is left over after subtracting the linear approximation from the actual function value. When f is twice differentiable, Taylor's theorem provides a way to analyze this error term, showing it to be proportional to (x−a)² for small x−a, i.e., the error shrinks quadratically as x approaches a.
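A short numerical check of that quadratic shrinkage (f = sin and a = 0.5 are arbitrary sample choices): halving the displacement should roughly quarter the error.

```python
import math

# Tangent-line approximation L(x) = f(a) + f'(a)(x - a) for f = sin at a = 0.5.
a = 0.5
f, fprime = math.sin, math.cos
L = lambda x: f(a) + fprime(a) * (x - a)

for h in [0.1, 0.05, 0.025]:
    err = abs(f(a + h) - L(a + h))
    print(f"h = {h:.3f}: |f(a+h) - L(a+h)| = {err:.2e}")
```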
2️⃣ For a scalar-valued function of two variables, f(x, y), the first-order approximation at a point (x₀, y₀) is given by the equation of the tangent plane to the surface z = f(x, y) at that point: $L(x, y) = f(x_0, y_0) + f_x(x_0, y_0)(x - x_0) + f_y(x_0, y_0)(y - y_0)$, where $f_x$ and $f_y$ are the partial derivatives of f with respect to x and y, evaluated at (x₀, y₀).
3️⃣ This concept extends naturally to functions of n variables, f(x₁, x₂, ···, xₙ), where the first-order approximation at a point a = (a₁, a₂, ···, aₙ) is $L(\vec{x}) = f(a) + \sum_{i=1}^{n}f_{x_i}(a)(x_i-a_i) =[\text{Notation}] f(a) + \sum_{i=1}^{n}\frac{\partial f}{\partial x_i}(a)(x_i-a_i)$ where $f_{x_i}(a) = \frac{\partial f}{\partial x_i}(a)$ is the partial derivative of f with respect to xᵢ evaluated at a. The graph of this approximation is a hyperplane in (n+1)-dimensional space.
4️⃣ For a vector-valued function f: ℝⁿ → ℝᵐ, where $f(\vec{p}) = (f_1(\vec{p}), f_2(\vec{p}), \cdots, f_m(\vec{p}))$, the first-order approximation at a point $\vec{p} \in ℝ^n$ involves the Jacobian matrix of f at $\vec{p}$. The Jacobian matrix, denoted $Df(\vec{p})$, is an m×n matrix whose (i, j)-th entry is the partial derivative $\frac{\partial f_i}{\partial x_j}(\vec{p})$. The linear approximation $L(\vec{v})$ of $f(\vec{v})$ for $\vec{v}$ near $\vec{p}$ is given by $L(\vec{v}) = f(\vec{p}) + Df(\vec{p})(\vec{v}-\vec{p})$, where $Df(\vec{p})(\vec{v}-\vec{p})$ is a matrix–vector product.
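A minimal sketch of this formula with a sample map f: ℝ² → ℝ³ and its hand-computed Jacobian (both chosen purely for illustration):

```python
import numpy as np

# f(x, y) = (x*y, x + y, x^2); Jacobian rows are the gradients of f1, f2, f3.
def f(v):
    x, y = v
    return np.array([x * y, x + y, x**2])

p = np.array([1.0, 2.0])
Df = np.array([[2.0, 1.0],   # d(xy)/dx = y = 2,   d(xy)/dy = x = 1
               [1.0, 1.0],   # d(x+y)/dx = 1,      d(x+y)/dy = 1
               [2.0, 0.0]])  # d(x^2)/dx = 2x = 2, d(x^2)/dy = 0

v = p + np.array([0.01, -0.02])   # a point near p
L = f(p) + Df @ (v - p)           # first-order approximation
print("f(v) =", f(v))
print("L(v) =", L)
print("error =", np.linalg.norm(f(v) - L))
```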
The directional derivative of a function f: ℝⁿ → ℝ at a point a in the direction of a unit vector u (i.e., ∥u∥ = 1) is defined as the instantaneous rate of change of the function as one moves away from a in the direction of u. It is denoted by $D_uf(a)$ or $f'_u(a)$ and is given by the limit $D_uf(a) = \lim_{h \to 0} \frac{f(a + hu) - f(a)}{h}$, if this limit exists. Geometrically, it represents the slope of the tangent line to the curve obtained by intersecting the graph of f with the vertical plane passing through a in the direction of u.
If f is differentiable at a, the directional derivative is given by the dot product of the gradient vector ∇f(a) and the direction vector u: $D_uf(a) = \nabla f(a) \cdot u$. Crucially, this formula shows that the gradient vector ∇f(a) contains all the information needed to compute the rate of change of f in ANY direction at a: the directional derivative is simply the projection of the gradient onto the desired direction u. The partial derivatives $f_{x_i} =[\text{Notation}] \frac{\partial f}{\partial x_i}$ are special cases of directional derivatives where the direction u is the standard unit vector (basis vector) in the i-th coordinate direction, eᵢ = (0, ···, 1, ···, 0) (a 1 in the i-th position and 0 elsewhere).
The first-order approximation can be used to estimate the value of a function f(a + hu) when moving a small distance h from a point a in the direction of a unit vector u. Using the general first-order approximation formula f(a+h) ≈ f(a) + ∇f(a)⋅h with displacement vector h = hu, we get f(a + hu) ≈ f(a) + ∇f(a)⋅(hu) = f(a) + h(∇f(a)⋅u). Recognizing that ∇f(a)⋅u is the directional derivative $D_uf(a)$, the approximation becomes f(a + hu) ≈ f(a) + h·$D_uf(a)$: the change in function value f(a + hu) - f(a) is approximately the distance moved, h, times the rate of change of the function in that direction, $D_uf(a)$.
The accuracy of this approximation depends on the magnitude of h and the behavior of the function. If f is differentiable, the error of this approximation will be o(h), meaning it becomes negligible compared to h for small h.
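The dot-product formula $D_uf(a) = \nabla f(a) \cdot u$ can be compared directly against the limit definition. A sketch, assuming the sample function f(x, y) = x²y, whose gradient is (2xy, x²):

```python
import numpy as np

# Difference quotient (f(a + h*u) - f(a))/h should approach grad f(a) . u.
f = lambda v: v[0]**2 * v[1]
grad = lambda v: np.array([2 * v[0] * v[1], v[0]**2])

a = np.array([1.0, 2.0])
u = np.array([3.0, 4.0]) / 5.0   # a unit vector
Du = grad(a) @ u                 # directional derivative via the gradient

for h in [1e-1, 1e-2, 1e-3]:
    quotient = (f(a + h * u) - f(a)) / h
    print(f"h = {h:.0e}: quotient = {quotient:.6f} vs grad.u = {Du:.6f}")
```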
Definition. The affine function $\tilde{f}(x) = f(a) + Df(a)(x -a)$ is called the first-order approximation or linearization of the function f at the point x = a where a must lie in the interior of the domain of f (i.e. a ∈ int(dom(f))) and Df(a) represents the derivative (or Jacobian matrix for multivariable functions) at a. This formula highlights that the approximation is constructed using the function’s value at a and the rate of change of the function at a scaled by the displacement from a.
Geometric Interpretation: Near the point a, the graph of f is well modeled by the tangent hyperplane {$(x, \tilde f(x)) : x \approx a$}. Moving a small distance from a in any direction, the change in f is nearly the directional derivative along that direction.
$f(x) = \tilde f(x) + r(x)$, where the remainder r(x) satisfies $\lim_{x \to a} \frac{\|r(x)\|}{\|x - a\|} = 0.$ In Landau notation, $r(x) = o(\|x - a\|)$. This condition captures the statement that any deviation of f from its tangent hyperplane is negligible compared to the distance ∥x - a∥.
Definition. Let $f: \mathbb{R}^n \to \mathbb{R}$ be a differentiable scalar-valued function. The derivative $Df(\vec{a})$ is a 1 × n matrix (a row vector). The gradient of f at $\vec{a}$, denoted by $\nabla f(\vec{a})$, is defined as the transpose of the derivative, i.e., the vector of partial derivatives: $\nabla f(\vec{a}) = Df(\vec{a})^T$, where $\vec{a} \in \text{int}(\text{dom}(f))$ (the interior of the domain of f). It plays a central role in the first-order approximation and in understanding rates of change. The i-th component of the gradient is the partial derivative with respect to the i-th variable, $\nabla f(\vec{a})_i = \frac{\partial f}{\partial x_i}(\vec{a})$, for i = 1, ···, n, so $\nabla f(\vec{a}) = (\frac{\partial f}{\partial x_1}(\vec{a}), \frac{\partial f}{\partial x_2}(\vec{a}), \cdots, \frac{\partial f}{\partial x_n}(\vec{a}))$.
The first-order approximation f(a + h) ≈ f(a) + ∇f(a)⋅h shows that the gradient provides the coefficients of the linear terms in the approximation. As previously mentioned, for any unit vector u, the directional derivative $D_uf(\vec{a})$ is given by ∇f(a)⋅u. This dot product is maximized when u points in the same direction as, or "aligns with," ∇f(a) (i.e., u = $\frac{\nabla f(a)}{\parallel \nabla f(a) \parallel}$). This means the gradient points in the direction of steepest ascent of the function at a, and its magnitude ∥∇f(a)∥ is the rate of steepest ascent.
Mathematically, ∇f(a)⋅u = ∥∇f(a)∥cos(θ), where θ is the angle between ∇f(a) and u. This expression is maximized when cos(θ) =1 (i.e., θ = 0), meaning u points in the same direction as ∇f(a). The gradient is perpendicular to level curves/surfaces of f and points “uphill” most steeply.
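A brute-force check of the steepest-ascent claim: scan many unit directions and confirm that the maximizing u aligns with ∇f(a), with maximal rate ∥∇f(a)∥ (the sample function f(x, y) = x²y is again an arbitrary choice):

```python
import numpy as np

# Among unit vectors u, D_u f(a) = grad f(a) . u is maximized at u = grad/|grad|.
grad = lambda v: np.array([2 * v[0] * v[1], v[0]**2])   # gradient of x^2 * y
a = np.array([1.0, 2.0])
g = grad(a)

thetas = np.linspace(0, 2 * np.pi, 360, endpoint=False)
dirs = np.column_stack([np.cos(thetas), np.sin(thetas)])   # 360 unit vectors
rates = dirs @ g                                           # D_u f(a) for each u

print("best direction    :", dirs[np.argmax(rates)])
print("gradient direction:", g / np.linalg.norm(g))
print("max rate:", rates.max(), " |grad|:", np.linalg.norm(g))
```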
Example. Linear function: let $f: \mathbb{R}^n \to \mathbb{R}$, $f(\vec{x}) = \vec{a}^T\vec{x} = \sum_{j=1}^n a_jx_j$. The components of the gradient are given by the partial derivatives: $\nabla f(\vec{x})_i = \frac{\partial f(\vec{x})}{\partial x_i} = \frac{\partial}{\partial x_i}\Big(\sum_{j=1}^n a_jx_j\Big) = a_i \implies \nabla f(\vec{x}) = \vec{a}$.
Example. Affine function: let $f(\vec{x}) = \vec{a}^T\vec{x} + b$. The components of the gradient are given by the partial derivatives: $\nabla f(\vec{x})_i = \frac{\partial f(\vec{x})}{\partial x_i} = \frac{\partial}{\partial x_i}\Big(\sum_{j=1}^n a_jx_j + b\Big) = a_i \implies \nabla f(\vec{x}) = \vec{a}$. As expected, the intercept b is just a constant term which doesn't affect the gradient.
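Both results can be verified with central finite differences; a sketch with randomly generated a, b, and base point (illustrative values only):

```python
import numpy as np

# Numerically confirm grad(a^T x + b) = a via central differences.
rng = np.random.default_rng(1)
a, b = rng.standard_normal(4), 2.5
f = lambda x: a @ x + b

x0, eps = rng.standard_normal(4), 1e-6
num_grad = np.array([(f(x0 + eps * e) - f(x0 - eps * e)) / (2 * eps)
                     for e in np.eye(4)])
print("numerical gradient:", num_grad)
print("a:                 ", a)
```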
Example. Quadratic form: let $f(\vec{x}) = \vec{x}^TA\vec{x}$, where A is an n × n matrix. Expanding the product,

$f(\vec{x}) = \vec{x}^TA\vec{x} = \begin{pmatrix}x_1 & x_2 & \cdots & x_n \end{pmatrix}\begin{pmatrix}a_{11} & a_{12} & \cdots & a_{1n}\\ a_{21} & a_{22} & \cdots & a_{2n}\\ \vdots & \vdots & \ddots & \vdots\\ a_{n1} & a_{n2} & \cdots & a_{nn} \end{pmatrix}\begin{pmatrix}x_1\\ x_2\\ \vdots\\ x_n \end{pmatrix} = \sum_{j=1}^n\Big(\sum_{i=1}^n x_ia_{ij}\Big)x_j = \sum_{i=1}^n \sum_{j=1}^n x_ia_{ij}x_j$

Differentiating with respect to $x_k$: the term $a_{kk}x_k^2$ contributes $2a_{kk}x_k$, the terms with j = k (i ≠ k) contribute $\sum_{i \neq k} x_ia_{ik}$, and the terms with i = k (j ≠ k) contribute $\sum_{j \neq k} a_{kj}x_j$. Hence $\frac{\partial f}{\partial x_k} = 2a_{kk}x_k + \sum_{i \neq k} x_ia_{ik} + \sum_{j \neq k} a_{kj}x_j =[\text{Moving one } a_{kk}x_k \text{ into each sum}] \sum_{i=1}^n x_ia_{ik} + \sum_{j=1}^n a_{kj}x_j$. In vector notation, this can be written as $\nabla f(\vec{x}) = A^T\vec{x} + A\vec{x} = (A^T+A)\vec{x} =[\text{if A is symmetric, i.e., A =} A^T] 2A\vec{x}$.
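The identity $\nabla(\vec{x}^TA\vec{x}) = (A^T + A)\vec{x}$ admits the same finite-difference check; here A is random and deliberately non-symmetric (sample data):

```python
import numpy as np

# Numerically confirm grad(x^T A x) = (A^T + A) x.
rng = np.random.default_rng(2)
A = rng.standard_normal((3, 3))          # not symmetric in general
f = lambda x: x @ A @ x

x0, eps = rng.standard_normal(3), 1e-6
num_grad = np.array([(f(x0 + eps * e) - f(x0 - eps * e)) / (2 * eps)
                     for e in np.eye(3)])
print("numerical:  ", num_grad)
print("(A^T + A)x: ", (A.T + A) @ x0)
```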
A particular case is the gradient of the squared $\ell_2$ norm. Let $f: \mathbb{R}^n \to \mathbb{R}$, $f(\vec{x}) = \|\vec{x}\|_2^2 = \vec{x}^T\vec{x} =[\text{This can also be expressed as}] \vec{x}^TI\vec{x}$, where I is the n × n identity matrix. From the general result for the gradient of a quadratic form, $\nabla f(\vec{x}) = (A^T+A)\vec{x}$, we can substitute A = I. Therefore, the gradient of f(x) is $\nabla f(\vec{x}) = (I^T+I)\vec{x} =[\text{Since the identity matrix is symmetric, } I^T = I] 2I\vec{x} = 2\vec{x}$.
The $\ell_2$ or Euclidean norm is a measure of the magnitude of a vector in Euclidean space, calculated as the square root of the sum of the squares of the vector's components: $\|\vec{x}\|_2 = \sqrt{x_1^2 + x_2^2 + \cdots + x_n^2}$. Its square is therefore $\|\vec{x}\|_2^2 = x_1^2 + \cdots + x_n^2 = \vec{x}^T\vec{x}$.
Example. General quadratic function: let $f(\vec{x}) = \frac{1}{2}\vec{x}^TP\vec{x} + \vec{q}^T\vec{x} + r$, where P is a symmetric n × n matrix, $\vec{q} \in \mathbb{R}^n$, and r ∈ ℝ. The gradient can be calculated as follows: $\nabla f(\vec{x}) = \nabla (\frac{1}{2}\vec{x}^TP\vec{x} + \vec{q}^T\vec{x} + r) =$ [We can apply the gradient operator to each term separately, as the gradient is linear] $\frac{1}{2} \nabla(\vec{x}^TP\vec{x}) + \nabla(\vec{q}^T\vec{x}) + \nabla(r) =$ [From the general result for the gradient of a quadratic form, $\nabla(\vec{x}^TA\vec{x}) = (A^T+A)\vec{x}$, we can substitute A = P; the gradient of the constant r is zero; the gradient of a linear function $f(\vec{x}) = \vec{a}^T\vec{x}$ is $\vec{a}$; putting it all together] $\frac{1}{2}(P^T+P)\vec{x}+\vec{q}$ =[Since P is symmetric] $P\vec{x}+\vec{q}$.
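And once more for the general quadratic, with a randomly generated symmetric P (sample data): the numerical gradient should match Px + q.

```python
import numpy as np

# Numerically confirm grad(0.5 x^T P x + q^T x + r) = P x + q for symmetric P.
rng = np.random.default_rng(3)
M = rng.standard_normal((3, 3))
P = M + M.T                              # force symmetry
q, r = rng.standard_normal(3), 1.0
f = lambda x: 0.5 * x @ P @ x + q @ x + r

x0, eps = rng.standard_normal(3), 1e-6
num_grad = np.array([(f(x0 + eps * e) - f(x0 - eps * e)) / (2 * eps)
                     for e in np.eye(3)])
print("numerical:", num_grad)
print("P x + q:  ", P @ x0 + q)
```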
| Function Type | Derivative (Jacobian or gradient) |
|---|---|
| Linear: f(x) = Ax | Df(x) = A |
| Affine: f(x) = Ax + b | Df(x) = A |
| Quadratic form: f(x) = xᵀAx | ∇f(x) = (Aᵀ + A)x (= 2Ax if A = Aᵀ) |
| Squared norm: f(x) = ∥x∥₂² | ∇f(x) = 2x |
| General quadratic: f(x) = ½xᵀPx + qᵀx + r, P symmetric | ∇f(x) = Px + q |