Journal of Indian Acad. Math.

ISSN: 0970-5120

Vol. 48, No. 1 (2026) pp. 1–9.

MATHEMATICAL PROPERTIES OF ACTIVATION

FUNCTIONS IN ARTIFICIAL INTELLIGENCE

DEVELOPMENTS

Analysis and Implications for Deep Neural Architectures

Massimiliano Ferrara¹, and Celeste Ciccia²

Abstract. Activation functions govern the expressive power and training dynamics of

deep neural networks through their analytical properties. This paper provides a rigorous

mathematical analysis of six fundamental activation functions – Linear, Sigmoid, Hyper-

bolic Tangent, ReLU, Parametric ReLU, and Exponential Linear Unit – examining how

regularity, gradient structure, and spectral properties inﬂuence representational capac-

ity, gradient ﬂow stability, and convergence behavior in deep architectures. We establish

formal results on the representational collapse of linear activations, derive sharp gradient

decay bounds for saturating functions, prove gradient preservation theorems for piecewise-

linear activations, and characterize the convergence advantages of smooth non-saturating

units. Our analysis yields a uniﬁed mathematical framework connecting activation func-

tion properties to network trainability, with direct implications for the design of deep

learning architectures in sequential decision-making, continuous control, and safety-critical

applications.

Keywords: Activation functions, deep neural networks, gradient ﬂow, vanishing gradi-

ents, convergence analysis, ReLU, ELU, representational capacity.

2010 AMS Subject Classiﬁcation: 68T07, 65K10, 90C26, 41A25, 60H35.

1. Introduction

Deep neural networks derive their approximation power from the composition of pa-

rameterized aﬃne maps with nonlinear activation functions. While the universal ap-

proximation theorem [1] establishes existence results for shallow networks, the practical

trainability and generalization of deep architectures depend critically on the analytical

properties of the chosen activation. Despite extensive empirical work surveying activation

function performance in supervised learning [2, 3], a uniﬁed mathematical treatment con-

necting regularity, gradient structure, and convergence guarantees in deep architectures

remains incomplete.

This paper addresses the gap by providing rigorous analysis of six canonical activation

functions that represent the major paradigms in neural network design: the Linear func-

tion, the Sigmoid, the Hyperbolic Tangent (TanH), the Rectiﬁed Linear Unit (ReLU) [4],

the Parametric ReLU (PReLU) [5], and the Exponential Linear Unit (ELU) [6]. We fo-

cus on four mathematical dimensions: (i) representational capacity through composition,

(ii) gradient magnitude propagation across depth, (iii) regularity and Lipschitz properties,

1

2

Massimiliano Ferrara, and Celeste Ciccia

and (iv) convergence rate estimates under stochastic optimization. Our analysis is moti-

vated by, and directly applicable to, the design of deep architectures for complex tasks in-

cluding reinforcement learning, continuous control, and sequential decision-making, where

gradient ﬂow across both network depth and temporal horizons is essential.

The remainder of the paper is organized as follows. Section 2 establishes notation and

the formal framework. Section 3 treats the linear case and its representational collapse.

Sections 4 and 5 analyze saturating and non-saturating activations respectively, establish-

ing gradient bounds and convergence results. Section 6 presents a comparative synthesis

with quantitative metrics. Section 7 oﬀers architecture design implications, and Section 8

concludes.

2. Preliminaries and Notation

n₀

n_L

Consider a feedforward neural network f_θ: R

→ R

of depth L parameterized by

L

n_ℓ×n_ℓ−1

n_ℓ

θ = {(W_ℓ, b_ℓ)}_ℓ=1, where W_ℓ∈ R

and b_ℓ∈ R . The forward computation is deﬁned

recursively:

h₀= x,

z_ℓ= W_ℓh_ℓ−1+ b_ℓ,

h_ℓ= σ(z_ℓ),

ℓ = 1, . . . , L,

(1)

n_ℓ

where σ : R → R is applied component-wise and h_ℓ∈ R

denotes the activation vector

at layer ℓ.

Deﬁnition 2.1 (Activation function properties). Let σ : R → R be an activation function.

We deﬁne:

(i) Saturation: σ is saturating if lim_|x|→∞|σ^′(x)| = 0.

(ii) Gradient bound: σ has gradient bound (g, g) if g ≤ |σ^′(x)| ≤ g for all x in the

support of the pre-activation distribution.

|σ(x)−σ(y)|

(iii) Lipschitz constant: Lip(σ) = sup_x=y

.

|x−y|

ꢀ

ꢁ

min(|σ^′(x)|,1)

(iv) Gradient consistency: GC(σ) = E_x∼D

∈ [0, 1].

max(|σ^′(x)|,1)

For gradient analysis, we adopt the standard backpropagation formalism. The gradient

of a loss L with respect to parameters θ_ℓat layer ℓ satisﬁes:

L−1

ꢂ

∂L

∂h_ℓ

=

W_k^⊤₊₁D_k+1

·

,

(2)

∂θ_ℓ

∂h_L

∂θ_ℓ

k=ℓ

where D_k= diag(σ^′(z_k¹), . . . , σ^′(zⁿ^k)) is the Jacobian of the activation at layer k.

k

3. Linear Activations: Representational Collapse

n₀

n_L

Theorem 3.1 (Depth collapse). Let f_θ: R

→ R

be a network of depth L with linear

n_L×n₀

n_L

activations σ_ℓ(x) = a_ℓx + c_ℓ, a_ℓ, c_ℓ∈ R. Then there exist A ∈ R

and b ∈ R

such

that f_θ(x) = Ax + b for all x.

Proof. We proceed by induction on L. For L = 1:

f_θ(x) = σ₁(W₁x + b₁) = a₁(W₁x + b₁) + c₁1 = (a₁W₁)x + (a₁b₁+ c₁1),

MATHEMATICAL PROPERTIES OF ACTIVATION FUNCTIONS

3

which is aﬃne. Suppose the result holds for depth L−1, so that f^(L−1)(x) = A_L−1x+b_L−1

.

Then:

ꢃ

ꢄ

f

^(L)(x) = σ_LW_Lf^(L−1)(x) + b_L

ꢃ

ꢄ

b_L

= a_LW_LA_L−1x + W_Lb_L−1+ b_L+ c_L1

= (a_LW_LA_L−1) x + a_L(W_Lb_L−1+ b_L) + c_L1 .

(3)

ꢅ

ꢆꢇ

ꢈ

ꢅ

ꢆꢇ

ꢈ

A_L

By induction, f_θis aﬃne regardless of depth.

□

Corollary 3.2. The hypothesis class of networks with linear activations has the same

Vapnik–Chervonenkis dimension as a single-layer linear model. Consequently, depth pro-

vides no additional representational capacity, and any nonlinear decision boundary is

unattainable.

This result eliminates linear activations from consideration in deep architectures de-

signed for complex function approximation, motivating the study of nonlinear alterna-

tives.

4. Saturating Activations: Gradient Decay Analysis

4.1. Sigmoid function. The sigmoid σ(x) = (1+e^−x)⁻¹maps R to (0, 1) with derivative

σ^′(x) = σ(x)(1 − σ(x)). The maximum derivative sup_xσ^′(x) = 1/4 is attained at x = 0.

Proposition 4.1 (Exponential gradient decay). For an L-layer network with sigmoid

activations, if ꢀW_kꢀ ≤ w_maxfor all k, then:

ꢉ

ꢉ ꢉ

ꢉ

ꢊ

ꢋ

L−1

∂L

w_max

∂L

∂h₁

ꢉ ꢉ

·

≤

.

(4)

ꢉ ꢉ

∂θ₁

4

∂h_L

∂θ₁

In particular, when w_max< 4 (which holds under standard initialization schemes), the

ꢃ

ꢄ

gradient decays exponentially as O (w_max/4)^L.

Proof. From (2), each Jacobian factor satisﬁes ꢀW_k^⊤₊₁D_k+1ꢀ ≤ ꢀW_k+1ꢀ · ꢀD_k+1ꢀ, and

ꢀD_kꢀ = max_i|σ^′(zⁱ)| ≤ 1/4. Applying submultiplicativity across L − 1 layers yields the

k

bound.

□

For a network with L = 10 and w_max≈ 1 (typical under Xavier initialization), the

gradient magnitude at layer 1 scales as (1/4)⁹≈ 3.8×10⁻⁶, rendering early-layer learning

negligible.

4.2. Hyperbolic tangent. The function σ(x) = tanh(x) maps to (−1, 1) with σ^′(x) =

1 − tanh²(x) and sup_xσ^′(x) = 1, achieved at x = 0.

Proposition 4.2 (TanH gradient bound). Under the same hypotheses as Proposition 4.1,

a TanH network satisﬁes:

ꢉ

ꢉ ꢉ

ꢉ

∂L

∂h₁

L−1

ꢉ ꢉ

·

≤ w_max

.

(5)

ꢉ ꢉ

∂θ₁

∂h_L

∂θ₁

4

Massimiliano Ferrara, and Celeste Ciccia

However, for |z_kⁱ| > 2, the local derivative satisﬁes |σ^′(z_kⁱ)| < 0.07, and eﬀective gradient

decay in saturation regions scales as O(0.07^L).

Although TanH exhibits a favorable maximum gradient of 1 and zero-centered outputs

(reducing internal covariate shift [2]), it shares the fundamental saturation defect with

sigmoid: for pre-activation magnitudes exceeding approximately 2, gradient ﬂow degrades

exponentially. The zero-centered property yields symmetric gradient updates, beneﬁcial

for advantage estimation in actor-critic architectures where A(s, a) = Q(s, a) − V (s)

is naturally centered around zero. Nonetheless, this advantage is contingent on pre-

activations remaining near the origin — a condition that becomes increasingly diﬃcult to

maintain in deep networks without explicit normalization.

5. Non-Saturating Activations: Gradient Preservation and Convergence

5.1. ReLU: Piecewise-linear gradient structure. The Rectiﬁed Linear Unit σ(x) =

max(0, x) has derivative σ^′(x) = I[x > 0], where I[·] denotes the indicator function. This

piecewise-linear structure eliminates saturation for positive inputs.

Theorem 5.1 (ReLU gradient preservation). In a ReLU network, deﬁne the binary mask

ꢃ

ꢄ

M_k= diag I[z¹> 0], . . . , I[zⁿ^k> 0] . Then for any layer ℓ < L:

k

L−1

ꢂ

ꢃ

ꢄ

∂L

∂h_ℓ

=

W_k^⊤₊₁M_k+1

·

.

(6)

∂θ_ℓ

∂h_L

∂θ_ℓ

k=ℓ

Each mask M_khas entries in {0, 1}, so the activation derivative contributes no scaling

factors other than 0 or 1 along each pathway. Gradient magnitude through active pathways

scales as:

ꢉ

ꢉ ꢉ

ꢉ

ꢂ

∂L

∂h_ℓ

ꢉ ꢉ

·

≤

^L−1ꢀW_k+1ꢀ ·

.

(7)

ꢉ ꢉ

∂θ_ℓ

∂h_L

∂θ_ℓ

k=ℓ

With He initialization [5] ensuring ꢀW_kꢀ ≈ 1, gradients are approximately preserved with-

out exponential attenuation.

Proof. From (2), the Jacobian at each ReLU layer is D_k= M_kwith ꢀM_kꢀ ≤ 1. Therefore

ꢌ

L−1

ꢀW_k^⊤₊₁D_k+1ꢀ ≤ ꢀW_k+1ꢀ, and the product

ꢀW_k^⊤₊₁M_k+1ꢀ ≤

ꢀW_k+1ꢀ, which

k=ℓ

depends solely on weight norms, independent of activation derivatives.

□

Remark 5.2 (Dying neurons). A ReLU neuron i at layer k becomes permanently inactive

if z_kⁱ≤ 0 for all inputs in the training distribution, yielding M_kⁱⁱ≡ 0. The probability of

neuron death under gradient updates with learning rate α and weight variance σ_w²grows

as:

ꢍ

ꢎ

α²T

2σ_w²

P_death(T) ≈ 1 − exp −

,

(8)

where T is the number of training steps. This can reduce eﬀective network width by 10–

20% in practice, motivating the parametric extensions below.

MATHEMATICAL PROPERTIES OF ACTIVATION FUNCTIONS

5

5.2. PReLU: Learnable negative slopes. The Parametric ReLU σ(x) = max(αx, x)

with learnable α > 0 (typically initialized at 0.01 or 0.25) has derivative:

ꢏ

α, x < 0,

σ^′(x) =

(9)

1,

x ≥ 0.

Proposition 5.3 (PReLU gradient bounds). For a PReLU network with parameter bounds

0 < α_min≤ α_k≤ α_max< 1, the gradient satisﬁes:

ꢉ

α

^L−1ꢀW_k+1ꢀ · C_ℓ≤

≤

^L−1ꢀW_k+1ꢀ · C_ℓ,

(10)

ꢂ

∂L

L−1

min

∂θ_ℓ

k=ℓ

where C_ℓ= ꢀ∂L/∂h_Lꢀ · ꢀ∂h_ℓ/∂θ_ℓꢀ. Thus PReLU gradients are bounded both above and

below, precluding both vanishing and explosion along any pathway, with the minimum

scaling controlled by α_min

.

The key consequence is that PReLU eliminates dead neurons: since σ^′(x) = α > 0

for x < 0, every neuron maintains a nonzero gradient pathway. Empirically, this reduces

dead neuron prevalence from approximately 15% to below 2% [5]. The additional per-

layer parameter α introduces negligible overhead — one scalar per layer, or per channel

in convolutional architectures.

5.3. ELU: Smooth non-saturating activation. The Exponential Linear Unit is de-

ﬁned as:

ꢏ

x,

x ≥ 0,

σ(x) =

(11)

(12)

α(e^x− 1), x < 0,

with α > 0 (commonly α = 1). The derivative is:

ꢏ

1,

x ≥ 0,

σ^′(x) =

αe^x, x < 0,

which is continuous at x = 0, unlike ReLU.

Lemma 5.4 (ELU regularity). With α = 1, the ELU function satisﬁes:

(i) σ ∈ C¹(R) with Lip(σ) = 1;

(ii) lim_x→−∞σ(x) = −α (bounded negative saturation);

(iii) E[σ(X)] ≈ 0 for X ∼ N(0, 1), providing near zero-mean activations;

(iv) σ^′(x) > 0 for all x ∈ R (strictly positive gradients everywhere).

Properties (i) and (iv) together guarantee that no neuron can become permanently

inactive, while the C¹regularity ensures stable gradient ﬂow near the origin — the region

where ReLU exhibits a discontinuous derivative.

Theorem 5.5 (ELU convergence advantage). Consider a parameterized value function

V_θ(s) trained via temporal diﬀerence learning with step size µ > 0 and discount factor

γ ∈ [0, 1). Let δ_t= r_t+ γV_θ(s_t+1) − V_θ(s_t) be the Bellman error. Then:

6

Massimiliano Ferrara, and Celeste Ciccia

(a) For ELU activations with zero-mean property |E[h_ℓ]| ≤ c₁for small c₁> 0:

ꢍ

ꢎ

log(1/ε)

µ(1 − γ)²

T_ELU(ε) = O

.

(13)

(14)

(b) For ReLU activations with positive-biased mean E[h_ℓ] ≥ c₂> 0:

ꢍ

ꢎ

log(1/ε)

T_ReLU(ε) = O

.

µ(1 − γ)^3/2

Proof sketch. The variance of the Bellman error decomposes as:

Var[δ_t] ≤ Var[r_t] + γ²Var[V_θ(s_t+1)] + Var[V_θ(s_t)].

ELU’s zero-mean activations yield tighter variance bounds on V_θ(s) via reduced internal

covariate shift, since |E[h_ℓ]| ≤ c₁propagates through layers without systematic bias accu-

mulation. In contrast, ReLU’s positive mean E[h_ℓ] ≥ c₂introduces additive bias at each

layer, inﬂating Var[V_θ]. By standard stochastic approximation results [8], lower update

variance improves the convergence rate from (1−γ)^−3/2to (1−γ)⁻²in the discount factor

dependence.

□

Remark 5.6 (Computational cost). The exponential computation in ELU’s negative

branch requires substantially more ﬂoating-point operations than ReLU’s comparison. On

modern GPU architectures, ELU is approximately 50× slower per element. This creates

a fundamental trade-oﬀ: superior convergence and regularity properties versus compu-

tational overhead, whose optimal resolution depends on application-speciﬁc latency con-

straints.

6. Comparative Synthesis

We now synthesize the mathematical properties analyzed above into a uniﬁed compar-

ison. Table 1 summarizes key metrics.

Table 1. Mathematical properties of activation functions.

Property

Sigmoid TanH

ReLU

PReLU

ELU

Range

(0, 1)

0.25

→ 0

Yes

C^∞

0.25

No

(−1, 1)

1

→ 0

Yes

C^∞

1

Yes

No

0.41

[0, ∞)

(−∞, ∞)

[−α, ∞)

sup |σ^′|

1

α

1

→ 0⁺

Soft (x < 0)

C¹

inf |σ^′| (eﬀective)

Saturating

C^kregularity

Lipschitz constant

Zero-centered

Dead neurons

GC (gradient consistency)

0

No (x > 0)

No

C⁰

1

No

0.85

C⁰

1

No

Yes

0.78

≈ Yes

No

0.12

No

0.91

The gradient consistency metric GC(σ) (Deﬁnition 2.1) provides a scalar summary of

trainability: values near 1 indicate stable gradient ﬂow across depth, while values near 0

signal pathological gradient attenuation. The ranking GC_ELU> GC_PReLU> GC_ReLU

≫

MATHEMATICAL PROPERTIES OF ACTIVATION FUNCTIONS

7

GC_TanH> GC_Sigmoidreﬂects the theoretical analysis: non-saturating activations with pos-

itive gradients everywhere achieve highest consistency, followed by ReLU which sacriﬁces

consistency in the negative region, and saturating functions which degrade rapidly.

Proposition 6.1 (Gradient decay ordering). For an L-layer network with weight norms

bounded by w_max, the gradient magnitude at layer 1 satisﬁes the ordering:

ꢀ∇_θLꢀ

≪ ꢀ∇_θLꢀ

< ꢀ∇_θLꢀ

≤ ꢀ∇_θLꢀ

,

(15)

1

Sig

TanH

ReLU

PReLU

ELU

where the ﬁrst inequality is exponentially strict (ratio scales as (4/w_max)^L), the second

reﬂects saturation-region losses in TanH, and the ﬁnal inequalities follow from the gradient

lower bounds in Propositions 5.3 and Lemma 5.4(iv).

The Lipschitz properties merit particular attention for robustness analysis. Both ReLU

and ELU satisfy Lip(σ) = 1, but ELU’s C¹regularity provides stronger stability guar-

antees. For a network f_θwith Lipschitz-1 activation and bounded weight norms, the

ꢌ

L

ℓ=1

end-to-end Lipschitz constant satisﬁes Lip(f_θ) ≤

ꢀW_ℓꢀ. However, the C¹regularity

of ELU additionally ensures that local sensitivity varies smoothly with the input, enabling

tighter perturbation analysis in safety-critical settings where worst-case output deviations

must be bounded [2].

7. Implications for Architecture Design

The mathematical analysis developed in the preceding sections yields principled criteria

for activation function selection in deep architectures. Rather than prescribing a single

universal choice, these criteria delineate a trade-oﬀ surface whose optimal operating point

depends on the dominant design constraint of the application at hand.

The ﬁrst and arguably most decisive criterion concerns gradient ﬂow. When network

depth exceeds approximately 10 layers, or when temporal credit assignment must span

long horizons as in reinforcement learning with delayed rewards, non-saturating activa-

tions become essential. The exponential gradient decay O(0.25^L) established for sigmoid

in Proposition 4.1 renders it fundamentally unsuitable for deep architectures, and while

TanH oﬀers improvement through its unit maximum derivative, it remains vulnerable

in saturation regions. Among non-saturating alternatives, PReLU and ELU provide the

strongest gradient ﬂow guarantees (Proposition 5.3 and Lemma 5.4), making them the

natural candidates for architectures where gradient propagation is the binding concern.

A second important dimension is regularity. For tasks requiring smooth output map-

pings — such as continuous control, where policy smoothness translates directly to phys-

ical stability — the C¹regularity of ELU is mathematically preferred over the C⁰alter-

natives ReLU and PReLU, whose derivative discontinuity at zero propagates through the

network and manifests as non-smooth gradient landscapes and less stable optimization

trajectories near decision boundaries. This regularity advantage interacts with conver-

gence eﬃciency: as established in Theorem 5.5, ELU’s zero-mean activations improve the

8

Massimiliano Ferrara, and Celeste Ciccia

discount-factor dependence in convergence rates from (1 − γ)^−3/2to (1 − γ)⁻², a substan-

tial gain when the discount factor γ is close to 1, which justiﬁes ELU’s computational

overhead in sample-limited settings.

These theoretical advantages must, however, be weighed against computational con-

straints. When inference latency is the binding requirement, the analysis reduces to a

straightforward calculus: ReLU’s comparison operation is approximately 50× faster than

ELU’s exponential, and PReLU — requiring one additional multiplication per negative

activation — oﬀers an intermediate point. For architectures operating under hard real-

time constraints at sub-millisecond timescales, this eﬃciency gap dominates all other

considerations regardless of the theoretical merits of smoother alternatives. A promising

resolution to this tension lies in hybrid strategies: since the mathematical requirements dif-

fer across network components, architectures employing ReLU in early feature-extraction

layers and ELU in terminal policy-generation layers can simultaneously satisfy eﬃciency

and smoothness constraints, exploiting the compositional structure of deep networks to

achieve near-optimal performance along multiple criteria.

8. Conclusion

We have established a rigorous mathematical framework characterizing six fundamen-

tal activation functions along the dimensions of representational capacity, gradient ﬂow,

regularity, and convergence. The key results are: (i) Theorem 3.1 demonstrates the repre-

sentational collapse of linear activations through depth, motivating nonlinear alternatives;

(ii) Propositions 4.1–4.2 quantify the exponential gradient decay that renders saturating

activations unsuitable for deep architectures; (iii) Theorem 5.1 establishes that ReLU’s

binary gradient structure eliminates depth-dependent attenuation along active pathways;

(iv) Proposition 5.3 shows that PReLU achieves bounded gradient ﬂow in both positive

and negative regions; and (v) Theorem 5.5 demonstrates ELU’s convergence advantage

stemming from its C¹regularity and zero-mean property.

No single activation function dominates across all mathematical criteria. The anal-

ysis reveals a fundamental trade-oﬀ surface: computational eﬃciency (favoring ReLU),

gradient completeness (favoring PReLU), and regularity with convergence optimality (fa-

voring ELU). Optimal architecture design requires selecting the appropriate operating

point on this surface based on domain-speciﬁc constraints. Future research directions

include adaptive activation selection via meta-learning, formal sample complexity bounds

parameterized by activation properties, and the analysis of emerging activation paradigms

including those inspired by quantum computing and neuromorphic hardware.

9. Acknowledgement

The authors thank the Decisions LAB team at University Mediterranea of Reggio Cal-

abria for computational resources and valuable discussions.

MATHEMATICAL PROPERTIES OF ACTIVATION FUNCTIONS

9

REFERENCES

[1] K. Hornik, M. Stinchcombe, H. White, Multilayer feedforward networks are universal approximators, Neural

Networks 2(5) (1989) 359–366.

[2] I. Goodfellow, Y. Bengio, A. Courville, Deep Learning, MIT Press, 2016.

[3] Y. LeCun, Y. Bengio, G. Hinton, Deep learning, Nature 521(7553) (2015) 436–444.

[4] X. Glorot, A. Bordes, Y. Bengio, Deep sparse rectiﬁer neural networks, in: Proc. AISTATS, JMLR W&CP

15, 2011, pp. 315–323.

[5] K. He, X. Zhang, S. Ren, J. Sun, Delving deep into rectiﬁers: Surpassing human-level performance on

ImageNet classiﬁcation, in: Proc. IEEE ICCV, 2015, pp. 1026–1034.

[6] D.-A. Clevert, T. Unterthiner, S. Hochreiter, Fast and accurate deep network learning by exponential linear

units (ELUs), arXiv preprint arXiv:1511.07289, 2015.

[7] Y. Bengio, P. Simard, P. Frasconi, Learning long-term dependencies with gradient descent is diﬃcult, IEEE

Trans. Neural Netw. 5(2) (1994) 157–166.

[8] R. S. Sutton, A. G. Barto, Reinforcement Learning: An Introduction, 2nd ed., MIT Press, 2018.

[9] X. Glorot, Y. Bengio, Understanding the diﬃculty of training deep feedforward neural networks, in: Proc.

AISTATS, JMLR W&CP 9, 2010, pp. 249–256.

[10] Z. Allen-Zhu, Y. Li, Z. Song, A convergence theory for deep learning via over-parameterization, in: Proc.

ICML, PMLR 97, 2019, pp. 242–252.

[11] P. Ramachandran, B. Zoph, Q. V. Le, Searching for activation functions, arXiv preprint arXiv:1710.05941,

2017.

[12] V. Mnih et al., Human-level control through deep reinforcement learning, Nature 518(7540) (2015) 529–533.

(Received, November 12, 2025)

(Revised, February 13, 2025)

^1,2Decisions LAB,

University Mediterranea of Reggio Calabria, Italy.

Email¹massimiliano.ferrara@unirc.it

Email²celeste.ciccia@unirc.it