Journal of Indian Acad. Math.  
ISSN: 0970-5120  
Vol. 48, No. 1 (2026) pp. 1–9.  
MATHEMATICAL PROPERTIES OF ACTIVATION  
FUNCTIONS IN ARTIFICIAL INTELLIGENCE  
DEVELOPMENTS  
Analysis and Implications for Deep Neural Architectures  
Massimiliano Ferrara1, and Celeste Ciccia2  
Abstract. Activation functions govern the expressive power and training dynamics of  
deep neural networks through their analytical properties. This paper provides a rigorous  
mathematical analysis of six fundamental activation functions – Linear, Sigmoid, Hyper-  
bolic Tangent, ReLU, Parametric ReLU, and Exponential Linear Unit – examining how  
regularity, gradient structure, and spectral properties influence representational capac-  
ity, gradient flow stability, and convergence behavior in deep architectures. We establish  
formal results on the representational collapse of linear activations, derive sharp gradient  
decay bounds for saturating functions, prove gradient preservation theorems for piecewise-  
linear activations, and characterize the convergence advantages of smooth non-saturating  
units. Our analysis yields a unified mathematical framework connecting activation func-  
tion properties to network trainability, with direct implications for the design of deep  
learning architectures in sequential decision-making, continuous control, and safety-critical  
applications.  
Keywords: Activation functions, deep neural networks, gradient flow, vanishing gradi-  
ents, convergence analysis, ReLU, ELU, representational capacity.  
2010 AMS Subject Classification: 68T07, 65K10, 90C26, 41A25, 60H35.  
1. Introduction  
Deep neural networks derive their approximation power from the composition of pa-  
rameterized ane maps with nonlinear activation functions. While the universal ap-  
proximation theorem [1] establishes existence results for shallow networks, the practical  
trainability and generalization of deep architectures depend critically on the analytical  
properties of the chosen activation. Despite extensive empirical work surveying activation  
function performance in supervised learning [2, 3], a unified mathematical treatment con-  
necting regularity, gradient structure, and convergence guarantees in deep architectures  
remains incomplete.  
This paper addresses the gap by providing rigorous analysis of six canonical activation  
functions that represent the major paradigms in neural network design: the Linear func-  
tion, the Sigmoid, the Hyperbolic Tangent (TanH), the Rectified Linear Unit (ReLU) [4],  
the Parametric ReLU (PReLU) [5], and the Exponential Linear Unit (ELU) [6]. We fo-  
cus on four mathematical dimensions: (i) representational capacity through composition,  
(ii) gradient magnitude propagation across depth, (iii) regularity and Lipschitz properties,  
1
2
Massimiliano Ferrara, and Celeste Ciccia  
and (iv) convergence rate estimates under stochastic optimization. Our analysis is moti-  
vated by, and directly applicable to, the design of deep architectures for complex tasks in-  
cluding reinforcement learning, continuous control, and sequential decision-making, where  
gradient flow across both network depth and temporal horizons is essential.  
The remainder of the paper is organized as follows. Section 2 establishes notation and  
the formal framework. Section 3 treats the linear case and its representational collapse.  
Sections 4 and 5 analyze saturating and non-saturating activations respectively, establish-  
ing gradient bounds and convergence results. Section 6 presents a comparative synthesis  
with quantitative metrics. Section 7 oers architecture design implications, and Section 8  
concludes.  
2. Preliminaries and Notation  
n0  
nL  
Consider a feedforward neural network fθ : R  
R  
of depth L parameterized by  
L
n×n1  
nℓ  
θ = {(W, b)}=1, where WR  
and bR . The forward computation is defined  
recursively:  
h0 = x,  
z= Wh1 + b,  
h= σ(z),  
= 1, . . . , L,  
(1)  
nℓ  
where σ : R R is applied component-wise and hR  
denotes the activation vector  
at layer .  
Definition 2.1 (Activation function properties). Let σ : R R be an activation function.  
We define:  
(i) Saturation: σ is saturating if lim|x|→∞ |σ(x)| = 0.  
(ii) Gradient bound: σ has gradient bound (g, g) if g |σ(x)| g for all x in the  
support of the pre-activation distribution.  
|σ(x)σ(y)|  
(iii) Lipschitz constant: Lip(σ) = supx=y  
.
|xy|  
min(|σ(x)|,1)  
(iv) Gradient consistency: GC(σ) = ExD  
[0, 1].  
max(|σ(x)|,1)  
For gradient analysis, we adopt the standard backpropagation formalism. The gradient  
of a loss L with respect to parameters θat layer satisfies:  
L1  
L  
L  
hℓ  
=
Wk+1Dk+1  
·
,
(2)  
∂θℓ  
hL  
∂θℓ  
k=ℓ  
where Dk = diag(σ(zk1), . . . , σ(znk )) is the Jacobian of the activation at layer k.  
k
3. Linear Activations: Representational Collapse  
n0  
nL  
Theorem 3.1 (Depth collapse). Let fθ : R  
R  
be a network of depth L with linear  
nL×n0  
nL  
activations σ(x) = ax + c, a, cR. Then there exist A R  
and b R  
such  
that fθ(x) = Ax + b for all x.  
Proof. We proceed by induction on L. For L = 1:  
fθ(x) = σ1(W1x + b1) = a1(W1x + b1) + c11 = (a1W1)x + (a1b1 + c11),  
MATHEMATICAL PROPERTIES OF ACTIVATION FUNCTIONS  
3
which is ane. Suppose the result holds for depth L1, so that f(L1)(x) = AL1x+bL1  
.
Then:  
f
(L)(x) = σL WLf(L1)(x) + bL  
bL  
= aL WLAL1x + WLbL1 + bL + cL1  
= (aLWLAL1) x + aL(WLbL1 + bL) + cL1 .  
(3)  
ꢆꢇ  
ꢆꢇ  
AL  
By induction, fθ is ane regardless of depth.  
Corollary 3.2. The hypothesis class of networks with linear activations has the same  
Vapnik–Chervonenkis dimension as a single-layer linear model. Consequently, depth pro-  
vides no additional representational capacity, and any nonlinear decision boundary is  
unattainable.  
This result eliminates linear activations from consideration in deep architectures de-  
signed for complex function approximation, motivating the study of nonlinear alterna-  
tives.  
4. Saturating Activations: Gradient Decay Analysis  
4.1. Sigmoid function. The sigmoid σ(x) = (1+ex)1 maps R to (0, 1) with derivative  
σ(x) = σ(x)(1 σ(x)). The maximum derivative supx σ(x) = 1/4 is attained at x = 0.  
Proposition 4.1 (Exponential gradient decay). For an L-layer network with sigmoid  
activations, if Wkꢀ ≤ wmax for all k, then:  
ꢉ ꢉ  
ꢉ ꢉ  
L1  
L  
wmax  
L  
h1  
ꢉ ꢉ  
·
.
(4)  
ꢉ ꢉ  
∂θ1  
4
hL  
∂θ1  
In particular, when wmax < 4 (which holds under standard initialization schemes), the  
gradient decays exponentially as O (wmax/4)L .  
Proof. From (2), each Jacobian factor satisfies Wk+1Dk+1ꢀ ≤ ꢀWk+1· Dk+1, and  
Dk= maxi |σ(zi )| 1/4. Applying submultiplicativity across L 1 layers yields the  
k
bound.  
For a network with L = 10 and wmax 1 (typical under Xavier initialization), the  
gradient magnitude at layer 1 scales as (1/4)9 3.8×106, rendering early-layer learning  
negligible.  
4.2. Hyperbolic tangent. The function σ(x) = tanh(x) maps to (1, 1) with σ(x) =  
1 tanh2(x) and supx σ(x) = 1, achieved at x = 0.  
Proposition 4.2 (TanH gradient bound). Under the same hypotheses as Proposition 4.1,  
a TanH network satisfies:  
ꢉ ꢉ  
ꢉ ꢉ  
L  
L  
h1  
L1  
ꢉ ꢉ  
·
wmax  
.
(5)  
ꢉ ꢉ  
∂θ1  
hL  
∂θ1  
4
Massimiliano Ferrara, and Celeste Ciccia  
However, for |zki | > 2, the local derivative satisfies |σ(zki )| < 0.07, and eective gradient  
decay in saturation regions scales as O(0.07L).  
Although TanH exhibits a favorable maximum gradient of 1 and zero-centered outputs  
(reducing internal covariate shift [2]), it shares the fundamental saturation defect with  
sigmoid: for pre-activation magnitudes exceeding approximately 2, gradient flow degrades  
exponentially. The zero-centered property yields symmetric gradient updates, beneficial  
for advantage estimation in actor-critic architectures where A(s, a) = Q(s, a) V (s)  
is naturally centered around zero. Nonetheless, this advantage is contingent on pre-  
activations remaining near the origin — a condition that becomes increasingly dicult to  
maintain in deep networks without explicit normalization.  
5. Non-Saturating Activations: Gradient Preservation and Convergence  
5.1. ReLU: Piecewise-linear gradient structure. The Rectified Linear Unit σ(x) =  
max(0, x) has derivative σ(x) = I[x > 0], where I[·] denotes the indicator function. This  
piecewise-linear structure eliminates saturation for positive inputs.  
Theorem 5.1 (ReLU gradient preservation). In a ReLU network, define the binary mask  
Mk = diag I[z1 > 0], . . . , I[znk > 0] . Then for any layer < L:  
k
k
L1  
L  
L  
hℓ  
=
Wk+1Mk+1  
·
.
(6)  
∂θℓ  
hL  
∂θℓ  
k=ℓ  
Each mask Mk has entries in {0, 1}, so the activation derivative contributes no scaling  
factors other than 0 or 1 along each pathway. Gradient magnitude through active pathways  
scales as:  
ꢉ ꢉ  
ꢉ ꢉ  
L  
L  
hℓ  
ꢉ ꢉ  
·
L1 Wk+1·  
.
(7)  
ꢉ ꢉ  
∂θℓ  
hL  
∂θℓ  
k=ℓ  
With He initialization [5] ensuring Wkꢀ ≈ 1, gradients are approximately preserved with-  
out exponential attenuation.  
Proof. From (2), the Jacobian at each ReLU layer is Dk = Mk with Mkꢀ ≤ 1. Therefore  
L1  
L1  
Wk+1Dk+1ꢀ ≤ ꢀWk+1, and the product  
Wk+1Mk+1ꢀ ≤  
Wk+1, which  
k=ℓ  
k=ℓ  
depends solely on weight norms, independent of activation derivatives.  
Remark 5.2 (Dying neurons). A ReLU neuron i at layer k becomes permanently inactive  
if zki 0 for all inputs in the training distribution, yielding Mkii 0. The probability of  
neuron death under gradient updates with learning rate α and weight variance σw2 grows  
as:  
α2T  
2σw2  
Pdeath(T) 1 exp −  
,
(8)  
where T is the number of training steps. This can reduce eective network width by 10–  
20% in practice, motivating the parametric extensions below.  
MATHEMATICAL PROPERTIES OF ACTIVATION FUNCTIONS  
5
5.2. PReLU: Learnable negative slopes. The Parametric ReLU σ(x) = max(αx, x)  
with learnable α > 0 (typically initialized at 0.01 or 0.25) has derivative:  
α, x < 0,  
σ(x) =  
(9)  
1,  
x 0.  
Proposition 5.3 (PReLU gradient bounds). For a PReLU network with parameter bounds  
0 < αmin αk αmax < 1, the gradient satisfies:  
α
L1 Wk+1· C≤  
L1 Wk+1· C,  
(10)  
L  
L1  
min  
∂θℓ  
k=ℓ  
k=ℓ  
where C= L/hL· h/∂θ. Thus PReLU gradients are bounded both above and  
below, precluding both vanishing and explosion along any pathway, with the minimum  
scaling controlled by αmin  
.
The key consequence is that PReLU eliminates dead neurons: since σ(x) = α > 0  
for x < 0, every neuron maintains a nonzero gradient pathway. Empirically, this reduces  
dead neuron prevalence from approximately 15% to below 2% [5]. The additional per-  
layer parameter α introduces negligible overhead — one scalar per layer, or per channel  
in convolutional architectures.  
5.3. ELU: Smooth non-saturating activation. The Exponential Linear Unit is de-  
fined as:  
x,  
x 0,  
σ(x) =  
(11)  
(12)  
α(ex 1), x < 0,  
with α > 0 (commonly α = 1). The derivative is:  
1,  
x 0,  
σ(x) =  
αex, x < 0,  
which is continuous at x = 0, unlike ReLU.  
Lemma 5.4 (ELU regularity). With α = 1, the ELU function satisfies:  
(i) σ C1(R) with Lip(σ) = 1;  
(ii) limx→−∞ σ(x) = α (bounded negative saturation);  
(iii) E[σ(X)] 0 for X N(0, 1), providing near zero-mean activations;  
(iv) σ(x) > 0 for all x R (strictly positive gradients everywhere).  
Properties (i) and (iv) together guarantee that no neuron can become permanently  
inactive, while the C1 regularity ensures stable gradient flow near the origin — the region  
where ReLU exhibits a discontinuous derivative.  
Theorem 5.5 (ELU convergence advantage). Consider a parameterized value function  
Vθ(s) trained via temporal dierence learning with step size µ > 0 and discount factor  
γ [0, 1). Let δt = rt + γVθ(st+1) Vθ(st) be the Bellman error. Then:  
6
Massimiliano Ferrara, and Celeste Ciccia  
(a) For ELU activations with zero-mean property |E[h]| c1 for small c1 > 0:  
log(1/ε)  
µ(1 γ)2  
TELU(ε) = O  
.
(13)  
(14)  
(b) For ReLU activations with positive-biased mean E[h] c2 > 0:  
log(1/ε)  
TReLU(ε) = O  
.
µ(1 γ)3/2  
Proof sketch. The variance of the Bellman error decomposes as:  
Var[δt] Var[rt] + γ2Var[Vθ(st+1)] + Var[Vθ(st)].  
ELU’s zero-mean activations yield tighter variance bounds on Vθ(s) via reduced internal  
covariate shift, since |E[h]| c1 propagates through layers without systematic bias accu-  
mulation. In contrast, ReLU’s positive mean E[h] c2 introduces additive bias at each  
layer, inflating Var[Vθ]. By standard stochastic approximation results [8], lower update  
variance improves the convergence rate from (1γ)3/2 to (1γ)2 in the discount factor  
dependence.  
Remark 5.6 (Computational cost). The exponential computation in ELU’s negative  
branch requires substantially more floating-point operations than ReLU’s comparison. On  
modern GPU architectures, ELU is approximately 50× slower per element. This creates  
a fundamental trade-o: superior convergence and regularity properties versus compu-  
tational overhead, whose optimal resolution depends on application-specific latency con-  
straints.  
6. Comparative Synthesis  
We now synthesize the mathematical properties analyzed above into a unified compar-  
ison. Table 1 summarizes key metrics.  
Table 1. Mathematical properties of activation functions.  
Property  
Sigmoid TanH  
ReLU  
PReLU  
ELU  
Range  
(0, 1)  
0.25  
0  
Yes  
C∞  
0.25  
No  
(1, 1)  
1
0  
Yes  
C∞  
1
Yes  
No  
0.41  
[0, )  
(−∞, )  
[α, )  
sup |σ|  
1
1
α
1
0+  
Soft (x < 0)  
C1  
inf |σ| (eective)  
Saturating  
Ck regularity  
Lipschitz constant  
Zero-centered  
Dead neurons  
GC (gradient consistency)  
0
No (x > 0)  
No  
C0  
1
No  
No  
0.85  
C0  
1
1
No  
Yes  
0.78  
Yes  
No  
0.12  
No  
0.91  
The gradient consistency metric GC(σ) (Definition 2.1) provides a scalar summary of  
trainability: values near 1 indicate stable gradient flow across depth, while values near 0  
signal pathological gradient attenuation. The ranking GCELU > GCPReLU > GCReLU  
MATHEMATICAL PROPERTIES OF ACTIVATION FUNCTIONS  
7
GCTanH > GCSigmoid reflects the theoretical analysis: non-saturating activations with pos-  
itive gradients everywhere achieve highest consistency, followed by ReLU which sacrifices  
consistency in the negative region, and saturating functions which degrade rapidly.  
Proposition 6.1 (Gradient decay ordering). For an L-layer network with weight norms  
bounded by wmax, the gradient magnitude at layer 1 satisfies the ordering:  
ꢀ∇θ Lꢀ  
≪ ꢀ∇θ Lꢀ  
< ꢀ∇θ Lꢀ  
≤ ꢀ∇θ Lꢀ  
≤ ꢀ∇θ Lꢀ  
,
(15)  
1
1
1
1
1
Sig  
TanH  
ReLU  
PReLU  
ELU  
where the first inequality is exponentially strict (ratio scales as (4/wmax)L), the second  
reflects saturation-region losses in TanH, and the final inequalities follow from the gradient  
lower bounds in Propositions 5.3 and Lemma 5.4(iv).  
The Lipschitz properties merit particular attention for robustness analysis. Both ReLU  
and ELU satisfy Lip(σ) = 1, but ELU’s C1 regularity provides stronger stability guar-  
antees. For a network fθ with Lipschitz-1 activation and bounded weight norms, the  
L
=1  
end-to-end Lipschitz constant satisfies Lip(fθ) ≤  
W. However, the C1 regularity  
of ELU additionally ensures that local sensitivity varies smoothly with the input, enabling  
tighter perturbation analysis in safety-critical settings where worst-case output deviations  
must be bounded [2].  
7. Implications for Architecture Design  
The mathematical analysis developed in the preceding sections yields principled criteria  
for activation function selection in deep architectures. Rather than prescribing a single  
universal choice, these criteria delineate a trade-osurface whose optimal operating point  
depends on the dominant design constraint of the application at hand.  
The first and arguably most decisive criterion concerns gradient flow. When network  
depth exceeds approximately 10 layers, or when temporal credit assignment must span  
long horizons as in reinforcement learning with delayed rewards, non-saturating activa-  
tions become essential. The exponential gradient decay O(0.25L) established for sigmoid  
in Proposition 4.1 renders it fundamentally unsuitable for deep architectures, and while  
TanH oers improvement through its unit maximum derivative, it remains vulnerable  
in saturation regions. Among non-saturating alternatives, PReLU and ELU provide the  
strongest gradient flow guarantees (Proposition 5.3 and Lemma 5.4), making them the  
natural candidates for architectures where gradient propagation is the binding concern.  
A second important dimension is regularity. For tasks requiring smooth output map-  
pings — such as continuous control, where policy smoothness translates directly to phys-  
ical stability — the C1 regularity of ELU is mathematically preferred over the C0 alter-  
natives ReLU and PReLU, whose derivative discontinuity at zero propagates through the  
network and manifests as non-smooth gradient landscapes and less stable optimization  
trajectories near decision boundaries. This regularity advantage interacts with conver-  
gence eciency: as established in Theorem 5.5, ELU’s zero-mean activations improve the  
8
Massimiliano Ferrara, and Celeste Ciccia  
discount-factor dependence in convergence rates from (1 γ)3/2 to (1 γ)2, a substan-  
tial gain when the discount factor γ is close to 1, which justifies ELU’s computational  
overhead in sample-limited settings.  
These theoretical advantages must, however, be weighed against computational con-  
straints. When inference latency is the binding requirement, the analysis reduces to a  
straightforward calculus: ReLU’s comparison operation is approximately 50× faster than  
ELU’s exponential, and PReLU — requiring one additional multiplication per negative  
activation — oers an intermediate point. For architectures operating under hard real-  
time constraints at sub-millisecond timescales, this eciency gap dominates all other  
considerations regardless of the theoretical merits of smoother alternatives. A promising  
resolution to this tension lies in hybrid strategies: since the mathematical requirements dif-  
fer across network components, architectures employing ReLU in early feature-extraction  
layers and ELU in terminal policy-generation layers can simultaneously satisfy eciency  
and smoothness constraints, exploiting the compositional structure of deep networks to  
achieve near-optimal performance along multiple criteria.  
8. Conclusion  
We have established a rigorous mathematical framework characterizing six fundamen-  
tal activation functions along the dimensions of representational capacity, gradient flow,  
regularity, and convergence. The key results are: (i) Theorem 3.1 demonstrates the repre-  
sentational collapse of linear activations through depth, motivating nonlinear alternatives;  
(ii) Propositions 4.1–4.2 quantify the exponential gradient decay that renders saturating  
activations unsuitable for deep architectures; (iii) Theorem 5.1 establishes that ReLU’s  
binary gradient structure eliminates depth-dependent attenuation along active pathways;  
(iv) Proposition 5.3 shows that PReLU achieves bounded gradient flow in both positive  
and negative regions; and (v) Theorem 5.5 demonstrates ELU’s convergence advantage  
stemming from its C1 regularity and zero-mean property.  
No single activation function dominates across all mathematical criteria. The anal-  
ysis reveals a fundamental trade-osurface: computational eciency (favoring ReLU),  
gradient completeness (favoring PReLU), and regularity with convergence optimality (fa-  
voring ELU). Optimal architecture design requires selecting the appropriate operating  
point on this surface based on domain-specific constraints. Future research directions  
include adaptive activation selection via meta-learning, formal sample complexity bounds  
parameterized by activation properties, and the analysis of emerging activation paradigms  
including those inspired by quantum computing and neuromorphic hardware.  
9. Acknowledgement  
The authors thank the Decisions LAB team at University Mediterranea of Reggio Cal-  
abria for computational resources and valuable discussions.  
MATHEMATICAL PROPERTIES OF ACTIVATION FUNCTIONS  
9
REFERENCES  
[1] K. Hornik, M. Stinchcombe, H. White, Multilayer feedforward networks are universal approximators, Neural  
Networks 2(5) (1989) 359–366.  
[2] I. Goodfellow, Y. Bengio, A. Courville, Deep Learning, MIT Press, 2016.  
[3] Y. LeCun, Y. Bengio, G. Hinton, Deep learning, Nature 521(7553) (2015) 436–444.  
[4] X. Glorot, A. Bordes, Y. Bengio, Deep sparse rectifier neural networks, in: Proc. AISTATS, JMLR W&CP  
15, 2011, pp. 315–323.  
[5] K. He, X. Zhang, S. Ren, J. Sun, Delving deep into rectifiers: Surpassing human-level performance on  
ImageNet classification, in: Proc. IEEE ICCV, 2015, pp. 1026–1034.  
[6] D.-A. Clevert, T. Unterthiner, S. Hochreiter, Fast and accurate deep network learning by exponential linear  
units (ELUs), arXiv preprint arXiv:1511.07289, 2015.  
[7] Y. Bengio, P. Simard, P. Frasconi, Learning long-term dependencies with gradient descent is dicult, IEEE  
Trans. Neural Netw. 5(2) (1994) 157–166.  
[8] R. S. Sutton, A. G. Barto, Reinforcement Learning: An Introduction, 2nd ed., MIT Press, 2018.  
[9] X. Glorot, Y. Bengio, Understanding the diculty of training deep feedforward neural networks, in: Proc.  
AISTATS, JMLR W&CP 9, 2010, pp. 249–256.  
[10] Z. Allen-Zhu, Y. Li, Z. Song, A convergence theory for deep learning via over-parameterization, in: Proc.  
ICML, PMLR 97, 2019, pp. 242–252.  
[11] P. Ramachandran, B. Zoph, Q. V. Le, Searching for activation functions, arXiv preprint arXiv:1710.05941,  
2017.  
[12] V. Mnih et al., Human-level control through deep reinforcement learning, Nature 518(7540) (2015) 529–533.  
(Received, November 12, 2025)  
(Revised, February 13, 2025)  
1,2Decisions LAB,  
University Mediterranea of Reggio Calabria, Italy.  
Email1 massimiliano.ferrara@unirc.it  
Email2 celeste.ciccia@unirc.it