"To err is human, to blame it on someone else is even more human," Jacob's Law.

In one variable, limits and continuity follow the familiar ε–δ pattern. In ℂ we use disks (or balls) instead of intervals, but the ideas carry over verbatim. Then, in ℝⁿ→ℝᵐ, “differentiability” becomes the existence of a unique linear map (the Jacobian) giving the best first‐order approximation. This article builds that bridge between different levels of abstraction in Calculus step by step.
A complex number is specified by an ordered pair of real numbers (a, b) ∈ ℝ² and written in the form z = a + bi, where a and b are real numbers and i is the imaginary unit, defined by the property i² = −1 (often written informally as i = $\sqrt{-1}$), e.g., 2 + 5i, $7\pi + i\sqrt{2}$. Thus, ℂ = { a + bi ∣ a, b ∈ ℝ }.
Definition. Let D ⊆ ℂ be a set of complex numbers. A complex-valued function f of a complex variable, defined on D, is a rule that assigns to each complex number z belonging to the set D a unique complex number w; we write f: D ➞ ℂ.
We often call the elements of D points. If z = x + iy ∈ D, then f(z) is called the image of the point z under f. The notation f: D ➞ ℂ means that f is a complex function with domain D. We often write f(z) = u(x, y) + iv(x, y), where u, v: ℝ² → ℝ are the real and imaginary parts of f.
Definition. Let D ⊆ ℂ, let f: D → ℂ be a function, and let z₀ be a limit point of D (points of D lie arbitrarily close to z₀, though possibly z₀ ∉ D). A complex number L is said to be a limit of the function f as z approaches z₀, written $\lim_{z \to z_0} f(z)=L$, if for every ε > 0 there exists a corresponding δ > 0 such that |f(z) - L| < ε whenever z ∈ D and 0 < |z - z₀| < δ.
Why 0 < |z - z₀|? We exclude z = z₀ itself because the limit concerns values of f near z₀, not at z₀. When z₀ ∉ D, f(z₀) cannot even be evaluated, so only the approach matters. When z₀ ∈ D, we still only care about the function's nearby behavior; this is what separates "limit" from "value."
Equivalently: ∀ε > 0, ∃δ > 0 such that whenever z ∈ D ∩ B′(z₀; δ) we have f(z) ∈ B(L; ε), i.e., f(D ∩ B′(z₀; δ)) ⊆ B(L; ε), where B′(z₀; δ) = B(z₀; δ) ∖ {z₀} denotes the punctured disk.
If no such L exists, we say that f(z) does not have a limit as z approaches z₀. This is exactly the ε–δ formulation we know from real calculus, except that z and L now live in the complex plane ℂ, and neighborhoods are round disks rather than intervals.
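To make the ε–δ statement concrete, here is a minimal numerical sketch in Python (the sample function f(z) = (z² - 1)/(z - 1) and the sampling scheme are illustrative choices, not part of the definition). The function is undefined at z₀ = 1, yet its limit there is L = 2, and shrinking δ visibly shrinks the worst |f(z) - L| over the punctured disk.

```python
import numpy as np

# Probe the limit of f(z) = (z^2 - 1)/(z - 1) as z -> 1. The function is
# undefined at z = 1, but f(z) = z + 1 elsewhere, so the limit is L = 2.
f = lambda z: (z**2 - 1) / (z - 1)
z0, L = 1.0, 2.0

for delta in [1e-1, 1e-2, 1e-3]:
    # Sample points on a circle inside the punctured disk 0 < |z - z0| < delta.
    angles = np.linspace(0, 2 * np.pi, 8, endpoint=False)
    zs = z0 + 0.5 * delta * np.exp(1j * angles)
    worst = max(abs(f(z) - L) for z in zs)
    print(f"delta = {delta:.0e}: max |f(z) - L| = {worst:.2e}")
```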
Definition. Let D ⊆ ℂ. A function f: D → ℂ is said to be continuous at a point z₀ ∈ D if for any arbitrarily small ε > 0 there is a corresponding δ > 0 such that |f(z) - f(z₀)| < ε whenever z ∈ D and |z - z₀| < δ.
In words, arbitrarily small output changes ε can be guaranteed by restricting z to lie in a sufficiently small disk of radius δ around z₀.
Alternative (Sequential) Definition. Let D ⊆ ℂ. A function f: D → ℂ is said to be continuous at a point z₀ ∈ D if for every sequence $\{z_n\}_{n=1}^{\infty}$ with $z_n \in D$ for all n ∈ ℕ and $z_n \to z_0$, we have $\lim_{n \to \infty} f(z_n) = f(z_0)$.
Definition. A function f: D → ℂ is said to be continuous if it is continuous at every point of its domain (∀z₀ ∈ D).
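A quick check of the sequential definition in the same spirit (the function f(z) = z² and the geometric sequence below are my illustrative choices):

```python
# Sequential continuity: for zn -> z0 we need f(zn) -> f(z0).
# Sample: f(z) = z^2 at z0 = 1 + 1j, with zn = z0 + (0.5 + 0.5j)/2^n.
f = lambda z: z**2
z0 = 1 + 1j

for n in (1, 8, 15, 22):
    zn = z0 + (0.5 + 0.5j) / 2**n
    print(f"|zn - z0| = {abs(zn - z0):.2e}, |f(zn) - f(z0)| = {abs(f(zn) - f(z0)):.2e}")
```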
Definition. Differentiability at a point. Let $f : ℝ^n \to ℝ^m$ be a function and let x be an interior point of the domain of f, $x \in \text{int}(\text{dom} f)$. The function f is differentiable at x if there exists a matrix $Df(x) \in ℝ^{m \times n}$ that satisfies $\lim_{\substack{z \in \text{dom} f,\ z \neq x \\ z \to x}} \frac{\|f(z) - f(x) - Df(x)(z-x)\|_2}{\|z-x\|_2} = 0$ [*]
This matrix Df(x) is called the derivative or the Jacobian matrix of f at the point x.
Definition. A function f is called differentiable if its domain dom(f) ⊆ ℝⁿ is open and f is differentiable at every point of its domain (∀x ∈ dom(f)).
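The defining limit [*] can be probed numerically. A minimal sketch, assuming the sample map f(x, y) = (x² + y², xy) and its hand-computed Jacobian at x = (1, 1); the ratio should tend to 0 as z → x:

```python
import numpy as np

# f(x, y) = (x^2 + y^2, x*y); its Jacobian has rows of partials
# [[2x, 2y], [y, x]], which at (1, 1) is [[2, 2], [1, 1]].
def f(v):
    x, y = v
    return np.array([x**2 + y**2, x * y])

x = np.array([1.0, 1.0])
Df = np.array([[2.0, 2.0],
               [1.0, 1.0]])

rng = np.random.default_rng(0)
for t in [1e-1, 1e-3, 1e-5]:
    z = x + t * rng.standard_normal(2)          # a point near x
    ratio = np.linalg.norm(f(z) - f(x) - Df @ (z - x)) / np.linalg.norm(z - x)
    print(f"|z - x| ~ {t:.0e}: ratio = {ratio:.2e}")
```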

Figure. For f(x,y)=x²+y², the red plane at (1,1) is the Jacobian’s linear approximation.
A first-order approximation (also called a linear approximation) provides a linear estimate of a function's value near a specific point using the function's value and its first derivative(s) at that point. Geometrically, it represents the tangent line (in 1D), tangent plane (in 2D), or tangent hyperplane (in higher dimensions) to the function’s graph. The remainder term, often expressed using Landau’s little-o notation (e.g., o(∥h∥)), quantifies the error of this approximation, indicating that the error shrinks faster than the displacement from the point of approximation.
1️⃣ In the single-variable case, the first-order approximation of a function f(x) at a point x = a is given by the equation of the tangent line to the curve y = f(x) at that point: L(x) = f(a) + f′(a)(x−a). This linear function L(x) approximates f(x) for values of x close enough to a; the tangent line is the best linear fit locally. Geometrically, we are approximating the curve by a straight line that touches the curve at (a, f(a)) and has the same slope as the curve at that point. The error of this approximation, E(x) = f(x) − L(x), is what is left over after subtracting the linear approximation from the actual function value. When f is twice differentiable, Taylor's theorem provides a way to analyze this error term, showing it to be proportional to (x−a)² for small x−a, i.e., the error shrinks quadratically as x approaches a.
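A short numerical check of that quadratic shrinkage (f = sin and a = 0.5 are arbitrary sample choices): halving the displacement should roughly quarter the error.

```python
import math

# Tangent-line approximation L(x) = f(a) + f'(a)(x - a) for f = sin at a = 0.5.
a = 0.5
f, fprime = math.sin, math.cos
L = lambda x: f(a) + fprime(a) * (x - a)

for h in [0.1, 0.05, 0.025]:
    err = abs(f(a + h) - L(a + h))
    print(f"h = {h:.3f}: |f(a+h) - L(a+h)| = {err:.2e}")
```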
2️⃣ For a scalar-valued function of two variables, f(x, y), the first-order approximation at a point (x₀, y₀) is given by the equation of the tangent plane to the surface z = f(x, y) at that point: $L(x, y) = f(x_0, y_0) + f_x(x_0, y_0)(x - x_0) + f_y(x_0, y_0)(y - y_0)$, where $f_x$ and $f_y$ are the partial derivatives of f with respect to x and y, evaluated at (x₀, y₀).
3️⃣ This concept extends naturally to functions of n variables, f(x₁, x₂, ···, xₙ), where the first-order approximation at a point a = (a₁, a₂, ···, aₙ) is $L(\vec{x}) = f(a) + \sum_{i=1}^{n}f_{x_i}(a)(x_i-a_i) =[\text{Notation}] f(a) + \sum_{i=1}^{n}\frac{\partial f}{\partial x_i}(a)(x_i-a_i)$ where $f_{x_i}(a) = \frac{\partial f}{\partial x_i}(a)$ is the partial derivative of f with respect to xᵢ evaluated at a. The graph of this approximation is a hyperplane in (n+1)-dimensional space.
4️⃣ For a vector-valued function f: ℝⁿ → ℝᵐ, where $f(\vec{p}) = (f_1(\vec{p}), f_2(\vec{p}), \cdots, f_m(\vec{p}))$, the first-order approximation at a point $\vec{p} \in ℝ^n$ involves the Jacobian matrix of f at $\vec{p}$. The Jacobian matrix, denoted $Df(\vec{p})$, is an m×n matrix whose (i, j)-th entry is the partial derivative $\frac{\partial f_i}{\partial x_j}(\vec{p})$. The linear approximation $L(\vec{v})$ of $f(\vec{v})$ for $\vec{v}$ near $\vec{p}$ is given by $L(\vec{v}) = f(\vec{p}) + Df(\vec{p})(\vec{v}-\vec{p})$, where $Df(\vec{p})(\vec{v}-\vec{p})$ is a matrix–vector product.
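A minimal sketch of this formula with a sample map f: ℝ² → ℝ³ and its hand-computed Jacobian (both chosen purely for illustration):

```python
import numpy as np

# f(x, y) = (x*y, x + y, x^2); Jacobian rows are the gradients of f1, f2, f3.
def f(v):
    x, y = v
    return np.array([x * y, x + y, x**2])

p = np.array([1.0, 2.0])
Df = np.array([[2.0, 1.0],   # d(xy)/dx = y = 2,   d(xy)/dy = x = 1
               [1.0, 1.0],   # d(x+y)/dx = 1,      d(x+y)/dy = 1
               [2.0, 0.0]])  # d(x^2)/dx = 2x = 2, d(x^2)/dy = 0

v = p + np.array([0.01, -0.02])   # a point near p
L = f(p) + Df @ (v - p)           # first-order approximation
print("f(v) =", f(v))
print("L(v) =", L)
print("error =", np.linalg.norm(f(v) - L))
```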
The directional derivative of a function f: ℝⁿ → ℝ at a point a in the direction of a unit vector u (i.e., ∥u∥ = 1) is defined as the instantaneous rate of change of the function as one moves away from a in the direction of u. It is denoted by $D_uf(a)$ or $f'_u(a)$ and is given by the limit $D_uf(a) = \lim_{h \to 0} \frac{f(a + hu) - f(a)}{h}$, if this limit exists. Geometrically, it represents the slope of the tangent line to the curve obtained by intersecting the graph of f with the vertical plane passing through a in the direction of u.
If f is differentiable at a, the directional derivative is given by the dot product of the gradient vector ∇f(a) and the direction vector u: $D_uf(a) = \nabla f(a) \cdot u$. Crucially, this formula shows that the gradient vector ∇f(a) contains all the information needed to compute the rate of change of f in ANY direction at a: the directional derivative is simply the projection of the gradient onto the desired direction u. The partial derivatives $f_{x_i} =[\text{Notation}] \frac{\partial f}{\partial x_i}$ are special cases of directional derivatives where the direction u is the standard unit vector (basis vector) in the i-th coordinate direction, eᵢ = (0, ···, 1, ···, 0) (a 1 in the i-th position and 0 elsewhere).
The first-order approximation can be used to estimate the value of a function f(a + hu) when moving a small distance h from a point a in the direction of a unit vector u. Using the general first-order approximation formula f(a+h) ≈ f(a) + ∇f(a)⋅h with displacement vector h = hu, we get f(a + hu) ≈ f(a) + ∇f(a)⋅(hu) = f(a) + h(∇f(a)⋅u). Recognizing that ∇f(a)⋅u is the directional derivative $D_uf(a)$, the approximation becomes f(a + hu) ≈ f(a) + h·$D_uf(a)$: the change in function value f(a + hu) - f(a) is approximately the distance moved, h, times the rate of change of the function in that direction, $D_uf(a)$.
The accuracy of this approximation depends on the magnitude of h and the behavior of the function. If f is differentiable, the error of this approximation will be o(h), meaning it becomes negligible compared to h for small h.
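The dot-product formula $D_uf(a) = \nabla f(a) \cdot u$ can be compared directly against the limit definition. A sketch, assuming the sample function f(x, y) = x²y, whose gradient is (2xy, x²):

```python
import numpy as np

# Difference quotient (f(a + h*u) - f(a))/h should approach grad f(a) . u.
f = lambda v: v[0]**2 * v[1]
grad = lambda v: np.array([2 * v[0] * v[1], v[0]**2])

a = np.array([1.0, 2.0])
u = np.array([3.0, 4.0]) / 5.0   # a unit vector
Du = grad(a) @ u                 # directional derivative via the gradient

for h in [1e-1, 1e-2, 1e-3]:
    quotient = (f(a + h * u) - f(a)) / h
    print(f"h = {h:.0e}: quotient = {quotient:.6f} vs grad.u = {Du:.6f}")
```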
Definition. The affine function $\tilde{f}(x) = f(a) + Df(a)(x -a)$ is called the first-order approximation or linearization of the function f at the point x = a where a must lie in the interior of the domain of f (i.e. a ∈ int(dom(f))) and Df(a) represents the derivative (or Jacobian matrix for multivariable functions) at a. This formula highlights that the approximation is constructed using the function’s value at a and the rate of change of the function at a scaled by the displacement from a.
Geometric Interpretation: Near the point a, the graph of f is well modeled by the tangent hyperplane {$(x, \tilde f(x)) : x \approx a$}. Moving a small distance from a in any direction, the change in f is nearly the directional derivative along that direction.
$f(x) = \tilde f(x) + r(x)$, where the remainder r(x) satisfies $\lim_{x \to a} \frac{\|r(x)\|}{\|x - a\|} = 0.$ In Landau notation, $r(x) = o(\|x - a\|)$. This condition captures the statement that any deviation of f from its tangent hyperplane is negligible compared to the distance ∥x - a∥.
Definition. Let $f: \mathbb{R}^n \to \mathbb{R}$ be a differentiable scalar-valued function. The derivative $Df(\vec{a})$ is a 1 × n matrix (a row vector). The gradient of f at $\vec{a}$, denoted by $\nabla f(\vec{a})$, is defined as the transpose of the derivative, i.e., the vector of partial derivatives: $\nabla f(\vec{a}) = Df(\vec{a})^T$, where $\vec{a} \in \text{int}(\text{dom}(f))$ (the interior of the domain of f). It plays a central role in the first-order approximation and in understanding rates of change. The i-th component of the gradient is the partial derivative with respect to the i-th variable, $\nabla f(\vec{a})_i = \frac{\partial f}{\partial x_i}(\vec{a})$, for i = 1, ···, n, so $\nabla f(\vec{a}) = (\frac{\partial f}{\partial x_1}(\vec{a}), \frac{\partial f}{\partial x_2}(\vec{a}), \cdots, \frac{\partial f}{\partial x_n}(\vec{a}))$.
The first-order approximation f(a + h) ≈ f(a) + ∇f(a)⋅h shows that the gradient provides the coefficients of the linear terms in the approximation. As previously mentioned, for any unit vector u, the directional derivative $D_uf(\vec{a})$ is given by ∇f(a)⋅u. This dot product is maximized when u points in the same direction as, or "aligns with," ∇f(a) (i.e., u = $\frac{\nabla f(a)}{\parallel \nabla f(a) \parallel}$). This means the gradient points in the direction of steepest ascent of the function at a, and its magnitude ∥∇f(a)∥ is the rate of steepest ascent.
Mathematically, ∇f(a)⋅u = ∥∇f(a)∥cos(θ), where θ is the angle between ∇f(a) and u. This expression is maximized when cos(θ) =1 (i.e., θ = 0), meaning u points in the same direction as ∇f(a). The gradient is perpendicular to level curves/surfaces of f and points “uphill” most steeply.
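A brute-force check of the steepest-ascent claim: scan many unit directions and confirm that the maximizing u aligns with ∇f(a), with maximal rate ∥∇f(a)∥ (the sample function f(x, y) = x²y is again an arbitrary choice):

```python
import numpy as np

# Among unit vectors u, D_u f(a) = grad f(a) . u is maximized at u = grad/|grad|.
grad = lambda v: np.array([2 * v[0] * v[1], v[0]**2])   # gradient of x^2 * y
a = np.array([1.0, 2.0])
g = grad(a)

thetas = np.linspace(0, 2 * np.pi, 360, endpoint=False)
dirs = np.column_stack([np.cos(thetas), np.sin(thetas)])   # 360 unit vectors
rates = dirs @ g                                           # D_u f(a) for each u

print("best direction    :", dirs[np.argmax(rates)])
print("gradient direction:", g / np.linalg.norm(g))
print("max rate:", rates.max(), " |grad|:", np.linalg.norm(g))
```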
Example. Linear function: let $f: \mathbb{R}^n \to \mathbb{R}$, $f(\vec{x}) = \vec{a}^T\vec{x} = \sum_{j=1}^n a_jx_j$. The components of the gradient are given by the partial derivatives: $\nabla f(\vec{x})_i = \frac{\partial f(\vec{x})}{\partial x_i} = \frac{\partial}{\partial x_i}\Big(\sum_{j=1}^n a_jx_j\Big) = a_i \implies \nabla f(\vec{x}) = \vec{a}$.
Example. Affine function: let $f(\vec{x}) = \vec{a}^T\vec{x} + b$. The components of the gradient are given by the partial derivatives: $\nabla f(\vec{x})_i = \frac{\partial f(\vec{x})}{\partial x_i} = \frac{\partial}{\partial x_i}\Big(\sum_{j=1}^n a_jx_j + b\Big) = a_i \implies \nabla f(\vec{x}) = \vec{a}$. As expected, the intercept b is just a constant term which doesn't affect the gradient.
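Both results can be verified with central finite differences; a sketch with randomly generated a, b, and base point (illustrative values only):

```python
import numpy as np

# Numerically confirm grad(a^T x + b) = a via central differences.
rng = np.random.default_rng(1)
a, b = rng.standard_normal(4), 2.5
f = lambda x: a @ x + b

x0, eps = rng.standard_normal(4), 1e-6
num_grad = np.array([(f(x0 + eps * e) - f(x0 - eps * e)) / (2 * eps)
                     for e in np.eye(4)])
print("numerical gradient:", num_grad)
print("a:                 ", a)
```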
Example. Quadratic form: let $f(\vec{x}) = \vec{x}^TA\vec{x}$, where A is an n × n matrix. Expanding the product,

$f(\vec{x}) = \vec{x}^TA\vec{x} = \begin{pmatrix}x_1 & x_2 & \cdots & x_n \end{pmatrix}\begin{pmatrix}a_{11} & a_{12} & \cdots & a_{1n}\\ a_{21} & a_{22} & \cdots & a_{2n}\\ \vdots & \vdots & \ddots & \vdots\\ a_{n1} & a_{n2} & \cdots & a_{nn} \end{pmatrix}\begin{pmatrix}x_1\\ x_2\\ \vdots\\ x_n \end{pmatrix} = \sum_{j=1}^n\Big(\sum_{i=1}^n x_ia_{ij}\Big)x_j = \sum_{i=1}^n \sum_{j=1}^n x_ia_{ij}x_j$

Differentiating with respect to $x_k$: the term $a_{kk}x_k^2$ contributes $2a_{kk}x_k$, the terms with j = k (i ≠ k) contribute $\sum_{i \neq k} x_ia_{ik}$, and the terms with i = k (j ≠ k) contribute $\sum_{j \neq k} a_{kj}x_j$. Hence $\frac{\partial f}{\partial x_k} = 2a_{kk}x_k + \sum_{i \neq k} x_ia_{ik} + \sum_{j \neq k} a_{kj}x_j =[\text{Moving one } a_{kk}x_k \text{ into each sum}] \sum_{i=1}^n x_ia_{ik} + \sum_{j=1}^n a_{kj}x_j$. In vector notation, this can be written as $\nabla f(\vec{x}) = A^T\vec{x} + A\vec{x} = (A^T+A)\vec{x} =[\text{if A is symmetric, i.e., A =} A^T] 2A\vec{x}$.
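The identity $\nabla(\vec{x}^TA\vec{x}) = (A^T + A)\vec{x}$ admits the same finite-difference check; here A is random and deliberately non-symmetric (sample data):

```python
import numpy as np

# Numerically confirm grad(x^T A x) = (A^T + A) x.
rng = np.random.default_rng(2)
A = rng.standard_normal((3, 3))          # not symmetric in general
f = lambda x: x @ A @ x

x0, eps = rng.standard_normal(3), 1e-6
num_grad = np.array([(f(x0 + eps * e) - f(x0 - eps * e)) / (2 * eps)
                     for e in np.eye(3)])
print("numerical:  ", num_grad)
print("(A^T + A)x: ", (A.T + A) @ x0)
```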
A particular case is the gradient of the squared $\ell_2$ norm. Let $f: \mathbb{R}^n \to \mathbb{R}$, $f(\vec{x}) = \|\vec{x}\|_2^2 = \vec{x}^T\vec{x} =[\text{This can also be expressed as}] \vec{x}^TI\vec{x}$, where I is the n × n identity matrix. From the general result for the gradient of a quadratic form, $\nabla f(\vec{x}) = (A^T+A)\vec{x}$, we can substitute A = I. Therefore, the gradient of f(x) is $\nabla f(\vec{x}) = (I^T+I)\vec{x} =[\text{Since the identity matrix is symmetric, } I^T = I] 2I\vec{x} = 2\vec{x}$.
The $\ell_2$ or Euclidean norm is a measure of the magnitude of a vector in Euclidean space, calculated as the square root of the sum of the squares of the vector's components: $\|\vec{x}\|_2 = \sqrt{x_1^2 + x_2^2 + \cdots + x_n^2}$. Its square is therefore $\|\vec{x}\|_2^2 = x_1^2 + \cdots + x_n^2 = \vec{x}^T\vec{x}$.
Example. General quadratic function: let $f(\vec{x}) = \frac{1}{2}\vec{x}^TP\vec{x} + \vec{q}^T\vec{x} + r$, where P is a symmetric n × n matrix, $\vec{q} \in \mathbb{R}^n$, and r ∈ ℝ. The gradient can be calculated as follows: $\nabla f(\vec{x}) = \nabla (\frac{1}{2}\vec{x}^TP\vec{x} + \vec{q}^T\vec{x} + r) =$ [We can apply the gradient operator to each term separately, as the gradient is linear] $\frac{1}{2} \nabla(\vec{x}^TP\vec{x}) + \nabla(\vec{q}^T\vec{x}) + \nabla(r) =$ [From the general result for the gradient of a quadratic form, $\nabla(\vec{x}^TA\vec{x}) = (A^T+A)\vec{x}$, we can substitute A = P; the gradient of the constant r is zero; the gradient of a linear function $f(\vec{x}) = \vec{a}^T\vec{x}$ is $\vec{a}$; putting it all together] $\frac{1}{2}(P^T+P)\vec{x}+\vec{q}$ =[Since P is symmetric] $P\vec{x}+\vec{q}$.
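And once more for the general quadratic, with a randomly generated symmetric P (sample data): the numerical gradient should match Px + q.

```python
import numpy as np

# Numerically confirm grad(0.5 x^T P x + q^T x + r) = P x + q for symmetric P.
rng = np.random.default_rng(3)
M = rng.standard_normal((3, 3))
P = M + M.T                              # force symmetry
q, r = rng.standard_normal(3), 1.0
f = lambda x: 0.5 * x @ P @ x + q @ x + r

x0, eps = rng.standard_normal(3), 1e-6
num_grad = np.array([(f(x0 + eps * e) - f(x0 - eps * e)) / (2 * eps)
                     for e in np.eye(3)])
print("numerical:", num_grad)
print("P x + q:  ", P @ x0 + q)
```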
| Function Type | Derivative (Jacobian or gradient) |
|---|---|
| Linear: f(x) = Ax | Df(x) = A |
| Affine: f(x) = Ax + b | Df(x) = A |
| Quadratic form: f(x) = xᵀAx | ∇f(x) = (Aᵀ + A)x (= 2Ax if A = Aᵀ) |
| Squared norm: f(x) = ∥x∥₂² | ∇f(x) = 2x |
| General quadratic: f(x) = ½xᵀPx + qᵀx + r, P symmetric | ∇f(x) = Px + q |