摘要
我们提出了双曲正切指数线性单元(TeLU),这是一种神经网络隐藏激活函数,定义为 $TeLU(x)= x\cdot \tanh(e^{x})$。TeLU的设计基于关键激活函数的核心原则,通过在其活跃区域紧密逼近恒等函数来实现强收敛,同时有效缓解其饱和区域中的梯度消失问题。其简单的公式提高了计算效率,从而改善了可扩展性和收敛速度。与许多现代激活函数不同,TeLU无缝结合了ReLU的简单性和有效性与深度神经网络中学习稳定性所必需的平滑性和解析性。TeLU能够模仿ReLU的行为和最优超参数设置,同时引入了平滑性和曲率的优点,使其成为理想的直接替代品。其解析性质使TeLU成为一个强大的通用逼近器,在众多实验中增强了鲁棒性和泛化能力。我们通过理论分析和实验验证严格验证了这些主张,展示了TeLU在具有挑战性的基准测试中的表现:包括在ImageNet上的ResNet18、在Text8上的动态池化Transformer,以及在Penn TreeBank数据集上的循环神经网络(RNN)。这些结果突出了TeLU在激活函数中设定新标准的潜力,推动深度神经网络中更高效、更稳定的学习,从而加速各领域的科学发现。
1 引言
深度学习的快速发展显著提升了机器学习的能力,使机器能够以惊人的准确度执行从图像识别到自然语言处理的各种任务 [1, 2, 3]。这些进步的核心是神经网络,这是一种受人类大脑启发的计算模型,由多层相互连接的神经元组成。每个神经元对其输入应用数学运算,并将结果传递给下一层,从而使网络能够从数据中学习和泛化 [4]。随着神经网络在复杂性和深度上的增加,它们的成功越来越依赖于在架构上做出的细微选择,包括激活函数的选择 [5]。
激活函数通过向模型中引入非线性,使神经网络能够学习复杂模式和表示,这对于纯线性模型来说是不可能的 [6]。传统的激活函数,如逻辑斯蒂S型函数 [4]、双曲正切函数(tanh)和整流线性单元(ReLU),在深度学习的发展中发挥了基础性作用。然而,尽管这些函数被广泛使用,但它们也存在显著的局限性。S型函数和Tanh函数容易出现梯度消失问题,尤其是在深层网络中,这会导致学习过程停滞 [8, 9]。ReLU虽然由于其线性激活区域而表现出改进的收敛性,但容易出现死神经元问题;即由于负输入被归零,单元停止对学习过程做出贡献 [10, 11]。
现代激活函数,如指数线性单元(ELU)[12],通过接近原点的奇对称性,减轻了神经元死亡问题,并减少了正半定非线性中存在的前向偏移效应,如ReLU。负激活的使用使预期激活更接近于0,从而产生了类似于批量归一化的效果,进而加速了学习 [13]。高斯误差线性单元(GELU)[14] 利用其非单调增长特性,在原点附近提供可比的对称性,并具有饱和到零的额外益处。这种逐渐去激活的功能类似于标准神经元丢弃的确定性高斯形式 [15],其中神经元根据特征的存在与否而选择性参与,增强了模型的鲁棒性 [16]。
尽管有这些创新的发展,ReLU仍然是最流行的通用非线性函数,这主要归功于其简单性和快速收敛性 [17, 18]。我们认为,ReLU在前馈神经网络中的广泛应用使其成为评估新型激活函数的标准基准。然而,这些新型函数通常缺乏ReLU的计算效率和有效的梯度传播 [19]。我们还观察到,最近提出的激活函数通常在与ReLU相同的超参数设置下无法达到最佳性能 [20]。因此,这些现代非线性函数可能显得效率低下,与ReLU相比并没有提供一致的提升。
基于这一假设,我们着手开发一种激活函数,它不仅保留了ReLU的持久优势,还引入了超越现代激活函数所提供益处的创新。认识到平滑函数的鲁棒性和优化器兼容性 [16, 21],我们专注于发现一种解析非线性 [22]。在本文中,我们提出了双曲正切指数线性单元(TeLU),这是一种激活函数,它保留了ReLU的快速收敛性,同时解决了梯度消失问题和学习不稳定性等关键挑战。TeLU还增强了鲁棒性,并在学习过程中提供了更大的稳定性,使其成为推进深度神经网络的有力且通用的工具。
我们通过在各种基准数据集和架构上进行广泛的实证评估来验证这些主张。我们证明,TeLU在训练效率和最终模型准确性方面都优于传统激活函数。此外,我们的分析表明,这种新的激活函数提供了一个更鲁棒和稳定的学习过程,特别是在深层和复杂网络中。这些发现表明,TeLU可能是深度学习从业者工具包中的一个宝贵补充,为更强大和有效的神经网络模型铺平了道路。
本文的贡献如下:
- 双曲正切指数线性单元(TeLU)的提出: 一种人工神经元激活函数,它整合并优化了先前非线性函数的有益特性。
- 全面的理论分析: 对流行激活函数在活跃区域的近线性行为、饱和区域的持续梯度、运行时效率、通用逼近性和稳定性属性进行了详细的检查和比较,从理论上证明了TeLU的优势。
- 广泛的实验评估: 在多层感知机(MLP)、卷积神经网络(CNN)、动态池化Transformer、循环神经网络(RNN)和变分自编码器(VAE)等多种架构上,使用ImageNet和Text8等具有竞争力的数据集,对TeLU的独特理论优势进行了一系列广泛的实验验证。
第2节探讨了相关的激活函数,详细阐述了激活函数的创新序列。第3节讨论了现有激活函数的局限性,强调了替代解决方案的必要性。第4节概述了我们的设计过程,并介绍了TeLU的公式。第5节深入探讨了TeLU优势的理论基础。5.1小节研究了流行激活函数对于越来越负的输入的梯度消失问题,并描述了TeLU如何缓解这一问题。5.2小节解释了TeLU如何逼近正输入的恒等函数,与其他激活函数相比,从而实现了更快的收敛。5.3小节演示了TeLU简单公式的计算效率。5.4小节讨论了TeLU与ReLU在操作上的相似性,使其能够与现有的ReLU项目兼容。5.5小节证明了TeLU架构是解析通用逼近器,概述了相应的稳定性和鲁棒性优势。最后,5.6小节讨论了学习稳定性,借鉴文献中的启发式方法,说明TeLU如何增强深度神经网络训练的稳定性。
第6节对第5节讨论的理论结果进行了实验验证。6.1小节表明,TeLU缓解梯度消失问题对于从强梯度消失条件中恢复以及提高卷积神经网络的准确性至关重要。6.2小节使用ImageNet数据集和ResNet34架构以及Text8数据集和动态池化Transformer架构,为TeLU增强的收敛特性提供了广泛证据。6.3小节对流行激活函数的运行时间进行了基准测试,说明了TeLU在各种系统上的计算效率。6.4小节强调了TeLU和ReLU之间的调优相似性,表明为ReLU优化的配置应用于TeLU时会产生更高的准确性。6.5小节展示了TeLU的解析通用逼近在变分自编码器(VAE)、循环神经网络(RNN)和鲁棒性基准测试中的优势。最后,6.6小节展示了TeLU在不同架构变化、初始化方法和优化器中无与伦比的学习稳定性。
第7节详细讨论了TeLU激活函数的价值,结合理论见解和实验结果,说明了其在神经网络训练中的优势。本节深入探讨了TeLU如何应对改进收敛性、缓解梯度消失、提高计算效率以及与现有训练配置兼容等挑战,并强调了其在各种架构中的实际意义。在此基础上,第8节总结了本研究,回顾了主要目标和贡献,并强调了TeLU作为ReLU有效替代品的作用。第9节展望了未来的研究方向,建议扩大理论保证范围并进行进一步的实验验证。它建议探索TeLU在更多样化架构中的性能,并完善其数学性质,为深度学习领域的持续进步奠定基础。附录A包含了与我们的理论分析和实验设置相关的补充表格。附录B提供了额外的定理,对TeLU的逼近、稳定性和收敛性保证提供了更深入的见解。
2 相关工作
本节简述了激活函数的发展历史,从早期的模型如阶跃函数和S型函数开始,为更复杂的方法铺平了道路。它探讨了整流线性单元(ReLU)等创新如何克服先前模型的关键局限性,提高了深度网络的训练效率和收敛性。随后,我们将考察更近期的进展,如指数线性单元(ELU)、S型线性单元(SiLU)和高斯误差线性单元(GELU),它们通过平滑的非线性进一步解决了死神经元和输出偏差等问题。每一次创新都反映了人们对激活函数如何增强神经网络学习动力的理解上的进步。通过评估这些历史发展及其贡献,我们获得了对激活函数更广泛领域的宝贵见解,以及它们在推动深度学习持续进步中的关键作用。
2.1 早期激活函数
在神经科学中,人们曾经认为生物神经元是根据亥维赛单位阶跃函数激活的,该函数在输出为1时处于激活状态,在输出为0时处于非激活状态。作为亥维赛函数的可微近似,逻辑斯蒂S型函数 σ ( x ) = 1 1 + e − x \sigma(x)=\frac{1}{1 + e^{-x}} σ(x)=1+e−x1在早期神经网络中变得普遍,它提供了一种能够进行通用逼近的实际激活函数[24, 25, 26]。然而,当预激活值趋近于 ± ∞ \pm\infty ±∞且激活值饱和于0或1时,逻辑斯蒂S型函数会遭受梯度消失问题。梯度消失问题会严重阻碍深度架构的训练。在反向传播过程中,由于微积分中的链式法则,来自上游神经元的小梯度会缩小下游激活的梯度[27]。因此,当足够多的上游神经元具有接近零的梯度时,较早层的参数更新变得微不足道。
双曲正切(tanh)激活函数,定义为 tanh ( x ) = e x − e − x e x + e − x \tanh(x)=\frac{e^{x}-e^{-x}}{e^{x}+e^{-x}} tanh(x)=ex+e−xex−e−x,通过为预激活值在0附近的区域提供比逻辑斯蒂函数更大的梯度,有助于缓解梯度消失问题。在实际应用中,当使用适当的L1或L2权重正则化时,这种效果几乎可以得到保证。此外,双曲正切函数关于原点对称,导致平均激活值更接近于0。与逻辑斯蒂S型函数等正函数相比,这种对称性提高了收敛效率[28, 29, 30, 31]。
2.2 整流线性单元
整流线性单元(ReLU),定义为 $ReLU(x)=\max(0,x)$,如图1所示,最初由Fukushima[7]发现。后来,人们认识到它由于计算效率高、接近线性和梯度消失问题减少等优点,能够加速收敛[18]。ReLU在计算和求导方面的简单性导致了更快的训练时间和神经架构的更好可扩展性。它对正输入的线性行为提供了强大的梯度和更快的收敛速度,尽管这是以显著的输出偏差为代价的[28]。
在负域中,ReLU神经元处于非激活状态,导致稀疏激活,这有助于减少过拟合。这实际上是一种内置确定性丢弃正则化的形式[15]。然而,这种非激活状态也可能导致网络表示能力的永久性下降,这是一种被称为“死亡ReLU”问题的现象。Lu等人[32]的研究表明,ReLU网络出现神经元非激活状态的可能性随着网络深度的增加而增加,随着网络宽度的增加而减少。这使得深度ReLU架构特别难以训练,除非相应地增加宽度,而这反过来又增加了计算成本[33]。此外,ReLU的分段线性性质带来了鲁棒性挑战[34]:输入扰动可能使关键神经元失活或激活之前未参与训练的神经元,从而可能损害模型的稳定性和性能。
2.3 ReLU变体
尽管存在缺点,但ReLU仍然是隐藏激活函数的流行默认选择,并启发了众多变体。泄漏ReLU(LReLU)[10],定义为 $LReLU(x)=\max(0.01x, x)$,通过为负输入分配一个小的常数斜率来解决神经元非激活的问题。类似地,参数化ReLU(PReLU)[35],$PReLU(x)=\max(\alpha x, x)$,将负输入的斜率视为可学习参数 $\alpha$。虽然在PReLU中引入额外的可学习参数可能会增加过拟合的风险,但可以通过积极的正则化和数据增强技术来管理这种风险。
随机漏失修正线性单元(Random Leaky ReLU,RReLU)[36],$RReLU(x) = \max(\beta x, x)$,其中 $\beta \sim \mathcal{U}(l, u)$ 且 $l, u \in [0, 1)$,它通过在小正值的均匀分布上随机设置负区域的斜率,避免了额外防止过拟合的开销。这种随机性引入了一种隐式的正则化形式,使得RReLU在训练准确率上低于ReLU、LReLU和PReLU。这些变体将ReLU的失活区域替换为无界行为,试图避免神经元死亡问题。然而,像这样的分段线性函数被发现具有较差的鲁棒性[16],使它们容易受到对抗性噪声的影响。因此,ReLU的平滑近似,如 $Softplus(x) = \ln(1 + e^x)$,可能更适合构建安全的应用程序。
2.4 指数线性单元
平滑非线性在它们与二阶优化算法(如自然梯度下降[37,38])的兼容性方面也提供了显著优势。与仅依赖于误差函数梯度的一阶优化器不同,二阶优化器利用海森矩阵,提供误差表面的曲率信息。这种额外的信息允许对模型参数进行更精确的更新,从而在训练期间实现更快、更高效的收敛[27]。
指数线性单元(Exponential Linear Unit,ELU)[12]试图在不使用二阶优化器的情况下,通过减少常见激活函数中普遍存在的输出偏差,来接近自然梯度[38]进行学习。这种输出偏差的减少导致了一个具有较小非对角线项的费舍尔信息矩阵[39][30]。因此,ELU有助于使神经网络更接近费舍尔最优学习[13],其中学习由自然梯度引导,以改善收敛[31][29][40]。ELU对于正输入定义为恒等函数,保留了修正线性单元(ReLU)的快速收敛性。对于负输入,ELU定义为 E L U ( x ) = e x − 1 ELU(x) = e^x - 1 ELU(x)=ex−1,产生一个小的负值而不是将激活置零。因此,随着预激活变为负,ELU饱和于-1而不是0,这意味着只有当ELU神经元的预激活接近0时,它们才会被排除在推理之外。这种设计牺牲了ReLU的稀疏性和失活特性,转而有利于捕捉负特征。尽管ELU具有连续的一阶导数,但其二阶导数不连续,因此不适合用于二阶优化方法。
2.5 Sigmoid线性单元
与ELU不同,像SiLU这样的平滑非单调ReLU近似保留了密集的失活区域,同时提供了连续、非零可微性的优势。Sigmoid线性单元(Sigmoid Linear Unit,SiLU)[41],也称为Swish激活函数,是通过神经架构搜索(Neural Architecture Search,NAS)[42]发现的。数学上,SiLU定义为 Si L U ( x ) = x ⋅ σ ( x ) \operatorname{Si} L U(x)=x \cdot \sigma(x) SiLU(x)=x⋅σ(x),其中 σ ( x ) = 1 1 + e − x \sigma(x)=\frac{1}{1+e^{-x}} σ(x)=1+e−x1。它的发现涉及从单变量和双变量函数构建树,重点是平衡计算效率和表达能力。在搜索过程中,观察到可以推广为线性缩放累积分布函数(LSCDFs)的更简单函数往往更受青睐。SiLU的平滑、非单调特性导致更平滑的优化景观,通过避免尖锐过渡和局部最小值等问题,有助于更有效训练更深层次的网络。SiLU的开创性研究[41]表明,SiLU在各种任务中优于传统激活函数,如ReLU和Leaky ReLU。这些发现引发了深度学习中平滑、非单调激活函数日益增长的认可[14, 43]。
2.6 高斯误差线性单元
Hendrycks和Gimpel认识到了LSCDFs(如ReLU和SiLU)非活跃区域与dropout正则化效果之间的数学相似性[14]。GELU的公式是对SiLU激活函数的重新解释,其中使用了缩放误差函数来替代SiLU中的逻辑累积分布函数。结果 x 2 ⋅ [ 1 + er f ( x / 2 ) ] \frac{x}{2} \cdot[1+\operatorname{er} f(x / \sqrt{2})] 2x⋅[1+erf(x/2)]理论上允许GELU利用软伪集成[44],其中负激活被忽略,正激活被保留。伪集成指的是神经网络的子集暂时变得不活跃,使网络子图能够像传统集成方法一样独立学习执行任务[45]。这种方法通过受益于多顾问原则(在集成机器学习模型中很常见)来增强模型的鲁棒性。支持这一策略,非单调GELU激活函数在MNIST[46]、CIFAR-10和CIFAR-100[47]数据集中表现出优于ReLU和ELU的性能。
2.7 Mish激活函数
Mish激活函数[43]定义为 Mish ( x ) = x ⋅ tanh ( ln ( 1 + e x ) ) \operatorname{Mish}(x)=x \cdot \tanh \left(\ln \left(1+e^{x}\right)\right) Mish(x)=x⋅tanh(ln(1+ex)),引入了一种新的LSCDF,类似于SiLU和GELU。Mish是在观察SiLU有益特性的原因时发现的,并通过初步实验与其他竞争公式进行了验证。使用的累积分布函数 tanh ( ln ( 1 + e x ) ) \tanh \left(\ln \left(1+e^{x}\right)\right) tanh(ln(1+ex))使得Mish非线性表现出自正则化特性,因此不需要其他外部正则化手段。这种设计选择使Mish在稳定性方面相对于ReLU和SiLU表现出改进,特别是在处理网络深度和权重初始化策略时。该函数在介绍性实验中还表现出比ReLU和SiLU对高斯噪声更强的鲁棒性[43]。
2.8 Logish和Smish激活函数
与Mish类似,Logish函数 $\operatorname{Logish}(x)=x \cdot \ln\left(1+\frac{1}{1+e^{-x}}\right)$ [48]通过自正则化效果实现了非单调非线性,随着输入的增加,其斜率独特地趋近于 $\ln(2)$。Smish函数 $\operatorname{Smish}(x)=x \cdot \tanh\left(\ln\left(1+\frac{1}{1+e^{-x}}\right)\right)$ [20]通过对其平滑二元函数应用双曲正切进一步发展了Logish,创造了一种新的非线性,随着输入的增加,其斜率趋向于0.6,尽管计算成本更高。Logish和Smish中斜率降低的影响仍是一个活跃的研究领域,但它们各自基础研究的早期实验结果表明,与广泛使用的激活函数(如ReLU、SiLU和Mish)相比,这两个函数都表现出更大的稳定性。这在CIFAR10数据集上尤为明显,即使在应用最小L2正则化的条件下也是如此。
值得注意的是,展示Smish增强稳定性和性能的实验是在没有使用权重衰减正则化的情况下进行的。这种缺乏外部正则化方法的情况表明,Smish具有内在的自正则化特性,能够在不依赖显式正则化技术的情况下实现高精度。因此,在稳定性和自正则化至关重要但计算效率可以忽略不计的场景中,Smish可能特别有利。
3 动机
尽管深度学习取得了重大进展,但激活函数的选择相对保持不变,ReLU由于其计算效率高和在活跃区域中有效的梯度传播特性,继续占据主导地位[2, 5, 49, 18]。虽然ReLU的简单公式在浅层网络中导致快速收敛,但在更深层次的架构中,其局限性变得明显。众所周知的“ReLU死亡”问题可能导致显著的能力损失,阻碍优化并降低整体模型表达能力[11]。随着网络在深度和复杂性上的增加,这一问题变得更加突出,在这种情况下,保持稳定的梯度流对于有效的训练至关重要[32]。
为了解决这些问题,提出了各种替代方案,如ELU [12]、SiLU [41]、GELU [14]和Mish [43],每种方案都引入了修改以平滑激活转换并减少死亡神经元。例如,ELU的负饱和鼓励零中心激活,提高收敛稳定性。类似地,像SiLU和GELU这样的非单调函数提供了更平滑的梯度,防止神经元饱和并改善梯度流[50]。然而,这些设计通常增加了计算复杂性,并且实证评估表明,它们的优势在各种架构和数据集上并不普遍一致[49, 51, 19, 5, 52]。
最近的研究表明,理想的激活函数应平衡强大的梯度传播、零均值激活、平滑转换和计算效率[53, 41, 18]。通过引入负激活,非单调函数通常导致激活在零附近更对称的分布,这有助于加速收敛[12, 19]。然而,像GELU和Mish这样的函数虽然具有这些理想特性,但计算需求高且产生的梯度较弱,减慢了学习过程[5]。这种理论优势与实际性能之间的脱节表明,仍然没有单一的激活函数能够有效统一现有方法的优势。
设计空间的这种碎片化促使需要一种激活函数,该函数结合了ReLU的学习效率[18]与平滑函数的稳定性和泛化能力[50]。我们提出的激活函数通过整合平滑转换、接近零均值的激活和稳健的梯度动态来解决这些挑战,使其能够在广泛的任务中实现一致的性能。在强大的理论和实证结果的支持下,我们的函数旨在克服当前激活的局限性,带来更好的收敛特性、增强的稳定性和在浅层和深层架构上的改进泛化。
通过在其前身的优势基础上并解决其各自的弱点,我们的激活函数在激活设计中建立了新的基准。它不仅加速了训练并减少了计算开销,还为大规模模型提供了更稳定的优化环境。这种方法提供了一个统一的解决方案,可以为激活函数设定新的标准,使复杂深度学习场景中的学习更加可靠和高效。
4 公式
4.1 设计目标
我们对于隐藏激活函数的主要设计目标是提供足够的表达能力,使神经网络架构能够逼近由当前任务定义的任何未知目标函数。假设这些目标函数是连续的,我们可以从理论上将这一目标简化为实现一个通用逼近器[24]。除了表达能力,我们还寻求一种在正向和反向传播中都计算效率高的激活函数,减少硬件需求和每个训练周期的持续时间。此外,我们旨在设计一种促进快速收敛的激活函数,降低模型在隐藏数据上取得成功所需的周期数。综合这些要求,通过最小化训练数据上损失的必要时间和周期数,确保训练效率。
除了训练效率,保持训练稳定性至关重要[54]。这意味着使用我们激活函数的模型应在训练和未见数据上始终获得良好的评估,即使在模型和训练配置略有变化的情况下也是如此。模型配置包括架构选择,如深度、层宽度和所用层的类型,而训练配置涵盖与优化相关的决策,如算法选择、学习率和权重衰减系数。一种特别有害的不稳定形式发生在模型的参数导致数值不稳定的输出时[55]。当神经元的输出溢出或下溢时,可能导致整个模型推断错误,导致性能几乎不超过随机猜测。因此,我们旨在设计一种最小化这些数值不稳定实例的激活函数,确保有利且可靠的训练结果。
我们的最终目标集聚焦于确保模型在训练数据上的成功有效地转化为未见数据上的成功。为了实现这一点,我们需要设计一种促进强大泛化的激活函数,使神经网络能够在训练集之外保持其性能。传统上,过拟合通过在模型和优化级别减少模型容量[56]或使用正则化技术(如dropout [15]、权重衰减[4]和批量归一化[57])来管理。众所周知,结合各种形式的正则化通常会导致更好的泛化[58, 59, 15]。考虑到这一点,我们旨在在我们的激活函数内直接结合适度但有意义的自正则化水平,基于这样的假设,即它可以减少对传统方法的依赖。然而,我们必须谨慎;引入过多的自正则化可能会无意中限制模型的容量。此外,我们寻求开发一种稳健的激活函数,增强模型的弹性,使其即使在暴露于具有小扰动的数据时也能可靠地执行。
4.2 策略
Leshno等人[60]指出,一个函数可以作为通用逼近器当且仅当它是连续且非多项式的。因此,为了实现我们创建通用逼近器的目标,我们必须专注于开发一个非多项式函数。为了计算效率,激活函数还必须具有简单的公式和易于可微的一阶导数。正如ReLU所展示的,通过一个在活跃区域(对应于正输出)保持强梯度的简单函数可以实现快速收敛[18]。为了满足这一快速收敛要求,我们可以采用一个对正输入表现为恒等函数的线性单元。然而,我们应该避免超过恒等函数的增长,因为这可能导致梯度爆炸并在模型中引入数值不稳定性。通过对正输入保持这种类似恒等的行为,我们降低了活跃区域中梯度消失的风险。此外,在我们的线性单元的饱和区域中维持非零梯度至关重要,这有助于防止神经网络中的子图停滞学习——这是ReLU架构中的常见问题[32]。
为了增强神经网络中的学习稳定性,我们首先分析ELU公式背后的策略[12]。ELU在原点附近使用近似奇对称性,对所有负输入产生负输出。这种设计使饱和区域能够平衡线性区域,使预期输出对于均值零输入更接近零[28]。ELU通过避免对负输入的失活来实现这一点,采用分段定义,当输入接近 − ∞ -\infty −∞时饱和到-1。我们假设,具有单一定义和逐渐失活的激活函数可以进一步提高稳定性。非单调激活通常在零附近表现出奇对称性,减少输入分布在小标准差时的输出偏差,如标准归一化技术所示[61, 62]。此外,最小化非线性项可以通过限制下溢或溢出的机会来减少数值不稳定性。因此,我们旨在设计一个符合我们计算效率目标的简单公式。
认识到通过结合各种正则化技术可以增强泛化[58, 59, 15],我们旨在开发一种自正则化激活函数。像GELU [14]、Mish [43]、Logish [48]和Smish [20]这样的激活函数以其固有的自正则化特性而闻名。例如,Hendrycks和Gimpel [14]将GELU设计为随机正则化器的期望值,而Mish、Logish和Smish中的自正则化源于其公式中的附加项。这一观察引导我们假设,具有隐式正则化和增强泛化的新激活函数可能属于平滑、非单调非线性的类别[43, 48, 20]。此外,Rosca等人[50]强调了平滑激活函数在改善泛化和不确定性估计中的关键作用。通过追求平滑、非单调激活函数,我们旨在利用平滑性和隐式自正则化的综合优势。
除了传统泛化,我们还旨在使使用我们激活函数的模型在受到小扰动破坏的未见数据上也能表现良好[63]。Xie等人[16]已经证明,像GELU和SiLU这样的平滑激活函数在准确性和鲁棒性方面比分段线性激活函数具有优势。这种固有的平滑性,以在每一点的可微性为特征,是解析函数的定义特征[22]。解析函数可以表示为收敛的幂级数,涉及无限多个导数的表示。解析函数还与高阶优化技术兼容,如自然梯度下降[38],这些技术利用损失函数的曲率而不仅仅是其值,导致在更少的步骤中实现更稳定的学习,直到收敛[64, 21, 65, 27]。然而,正如ELU所展示的,通过结合表现出一定程度的原点附近奇对称性的非线性,可以接近二阶优化技术的好处而不直接使用它们。这一特性有助于减少前向输出偏差,从而通过接近费舍尔最优学习来增强学习效率[28, 31, 29, 40, 13]。
4.3 设计要求
基于我们开发的策略,寻找激活函数的数学要求如下(列表之后附有一段数值检查的代码示意):
- 当输入接近 $\infty$ 时应近似线性
- 必须是解析的,或在所有点无限可微,而不坍缩为零
- 应在原点附近沿 $y=-x$ 表现出接近奇对称性
- 必须包括一个密集的非活跃区域,向零饱和
- 在输出向零饱和时应表现出最小的梯度衰减
- 应具有低计算复杂性
- 必须是非多项式
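作为对上述要求的直观补充,下面给出一段简短的 NumPy 脚本(我们为说明而编写的示意,并非论文附带代码;其中的取样点与阈值均为演示用的假设),用数值方式粗略检查 TeLU 是否符合这些性质:

```python
import numpy as np

def telu(x):
    """TeLU(x) = x * tanh(exp(x))"""
    return x * np.tanh(np.exp(x))

# 1) 输入趋于正无穷时近似线性:大输入处与恒等函数的相对偏差应趋于 0
print("x=30 处与恒等函数的相对偏差:", abs(telu(30.0) - 30.0) / 30.0)

# 2) 原点附近的近似奇对称性:|TeLU(-x) + TeLU(x)| 在小输入处应当很小
s = np.linspace(0.0, 0.2, 5)
print("小输入处 |TeLU(-x) + TeLU(x)| 的最大值:", np.max(np.abs(telu(-s) + telu(s))))

# 3) 密集的非活跃区域且向零饱和:大负输入处取值接近 0
print("TeLU(-30) =", telu(-30.0))

# 4) 饱和时梯度衰减缓慢:x = -10 处的数值导数仍明显非零
eps = 1e-6
print("x=-10 处的数值导数:", (telu(-10.0 + eps) - telu(-10.0 - eps)) / (2 * eps))

# 注:解析性与非多项式性无法由有限采样验证,它们由 x、exp、tanh 的解析复合保证。
```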
5 理论框架
本节深入探讨了TeLU激活函数的理论优势,扩展了激活函数设计的关键原则及其对神经网络性能的影响。它探讨了诸如梯度行为、收敛速度、计算效率和整体稳定性等基本属性,为TeLU的潜在优势奠定了理论基础。该分析展示了TeLU如何解决常见的深度学习挑战,包括低效学习、梯度消失问题和计算复杂性。
5.1 饱和区域的持续梯度
我们观察到,在深度学习中,具有下界的激活函数通常是有利的。除了某些ReLU变体之外,我们讨论的所有激活函数都具有下界。这与生物神经元的行为一致,生物神经元要么保持不活跃,要么以特定频率激发。有趣的是,即使没有对具有下界的函数的初始偏好,Nader和Azar的进化神经架构搜索(NAS)在所有测试的数据集上都表现出了对它们的偏好。从多个顾问的角度来看,似乎大多数任务本质上都倾向于具有下界的函数。此外,我们遵循这样一个启发式原则:当神经元变得不活跃时,激活函数应饱和到输出0,这使得神经元的不活跃类似于推理过程中的dropout。然而,这带来了一个挑战:如果非活跃区域的梯度衰减过快到零,神经元在训练期间可能会被不可逆地dropout。
为了解决这一问题,我们的非线性设计为在神经元不活跃时饱和到0,同时确保非活跃区域的梯度尽可能长时间地保持在零以上。这种方法最大限度地增加了优化器在需要时重新激活神经元的机会。理想情况下,正如我们希望在活跃区域有强梯度一样,我们也希望在饱和的非活跃区域有相对强的梯度。为了评估每个非线性的曲率在这方面的情况,我们首先检查其一阶导数的图像。图11展示了每个激活函数一阶导数的绝对值。该图在原点附近的负值上展示,以了解每个函数的导数接近0的相对速率。虽然一些梯度显然比其他梯度更快饱和,但我们寻求对其衰减率的更清晰比较。
图12通过展示TeLU的绝对导数与每个导数的比率来解决这一问题。这一比较表明,大多数非单调非线性比其单调对应项保持更慢的梯度衰减,尽管在零点有一个奇异点的权衡。这种权衡可以看作是梯度持久性的必要妥协。在实践中,这些解决方案主要由于低精度浮点变量的下溢而出现,因为它们可能是无理数。此外,图12表明,与其他函数相比,TeLU在不活跃开始时表现出更强的梯度。为了清楚地了解梯度衰减的速率,我们将渐近增长类的概念扩展到也包括渐近衰减。因此,我们定义 O O O, Ω \Omega Ω和 Θ \Theta Θ渐近类如下:
定义 5.1 $f(n) \in O(g(n))$ 当且仅当存在正常数 $c$ 和 $n_{0}$,使得对于所有 $n \geq n_{0}$ 有 $f(n) \leq c \cdot g(n)$;其中 $n \in \mathbb{Z}$。
定义 5.2 $f(n) \in \Omega(g(n))$ 当且仅当存在正常数 $c$ 和 $n_{0}$,使得对于所有 $n \geq n_{0}$ 有 $c \cdot g(n) \leq f(n)$;其中 $n \in \mathbb{Z}$。
定义 5.3 $f(n) \in \Theta(g(n))$ 当且仅当存在正常数 $c_{1}, c_{2}$ 和 $n_{0}$,使得对于所有 $n \geq n_{0}$ 有 $c_{1} \cdot g(n) \leq f(n) \leq c_{2} \cdot g(n)$;其中 $n \in \mathbb{Z}$。
在传统术语中,如果函数 $f(x)$ 在某一点之后被 $g(x)$ 的缩放版本上界,则它属于类 $O(g(x))$。类似地,如果 $g(x)$ 的缩放版本在固定输入值之后作为 $f(x)$ 的下界,则 $f(x)$ 属于类 $\Omega(g(x))$。函数 $f(x)$ 被分类为 $\Theta(g(x))$ 当且仅当它同时属于 $O(g(x))$ 和 $\Omega(g(x))$。常见的渐近增长类包括 $O(1), O(x), O(x^{c}), O(c^{x}), O(x!)$,其中 $c$ 是正常数。$a!$ 表示对变量 $a$ 的阶乘运算,即当 $a \in \mathbb{Z}^{+}$ 时为 $\prod_{i=1}^{a} i$。为了将阶乘运算扩展到实数,我们利用Gamma函数 $\Gamma(z)=\int_{0}^{\infty} t^{z-1} e^{-t}\, dt$。
为了定义渐近衰减类,我们反转这些增长类,得到像 $\Theta(1)$、$\Theta(1/x)$、$\Theta(1/x^{2})$、$\Theta(1/e^{x})$ 和 $\Theta(1/x!)$ 这样的分类。当我们应用这个框架到非单调激活函数时,我们发现某些函数除以相应 $g(x)$ 函数的极限不收敛到非零常数。因此,我们用 $\Theta(x/e^{x})$ 和 $\Theta(1/(x^{2})!)$ 扩展衰减类,以更好地描述TeLU、SiLU、Mish、Logish和Smish等函数的渐近衰减。有趣的是,我们发现GELU不属于 $\Theta(1/x!)$,因为其衰减率更大,导致我们将GELU分类为 $O(1/x!)$ 和 $\Omega(1/(x^{2})!)$ 的元素。我们在表1中总结这些渐近衰减分类,每个函数关于 $x=0$ 镜像以更清楚地展示数学符号。
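下面的 NumPy 片段是我们自行编写的示意(并非论文代码),在 float64 精度下检查 TeLU 一阶导数在负方向的衰减速率:若 $|TeLU'(-t)|/(t e^{-t})$ 随 $t$ 增大趋于常数,则与上文把 TeLU 归入 $\Theta(x/e^{x})$ 衰减类(关于 $x=0$ 镜像)的做法相一致。

```python
import numpy as np

def dtelu(x):
    """TeLU 的一阶导数: tanh(e^x) + x * e^x * (1 - tanh(e^x)**2)"""
    t = np.tanh(np.exp(x))
    return t + x * np.exp(x) * (1.0 - t ** 2)

# 关于 x=0 镜像:考察负输入 -t 处导数绝对值与 t*e^{-t} 的比值
for t in [5.0, 10.0, 20.0, 40.0, 80.0]:
    ratio = abs(dtelu(-t)) / (t * np.exp(-t))
    print(f"t={t:>5}:  |TeLU'(-t)| / (t e^-t) = {ratio:.4f}")
# 比值趋于 1,支持负方向衰减属于 Θ(t/e^t);float64 下约在 t≈745 之前不会下溢。
```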
我们通过确定在使用Pytorch的float32精度时每个非线性的梯度下溢的大致区域,进一步展示了这些渐近衰减分类的相关性。为了确保公平比较,我们使用其典型子函数定义每个函数,如表4所示。对于每个非线性的一阶导数,我们检查了域子集 [ − 200 , 0 ] [-200,0] [−200,0],记录梯度达到零的实例,步长为0.0001。我们以两个浮点小数位的精度记录结果。
有趣的是,未检测到非单调零梯度奇异点,表明它们在训练期间不太可能常见。此外,显然TeLU、SiLU、Mish、Logish和Smish由于其较慢的渐近衰减而延迟了下溢区域。相比之下,GELU和ELU由于数值下溢而表现出更广的零梯度区域。另一方面,ReLU在其整个不活跃区域始终表现出零梯度,这是“dying ReLU”问题的特征。然而,在TeLU、SiLU、Mish、Logish和Smish中观察到的共同下溢区域源于其指数子函数的数值下溢,这导致它们默认为0。因此,这些函数的sigmoid组件评估为0,当乘以恒等函数时导致乘积为0。类似地,由于误差函数实现的行为,GELU激活函数被驱动到0。
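下面给出一个简化的 PyTorch 示意来说明这种下溢区域的测定方法(我们的演示实现,非论文原始脚本;扫描步长取 0.01 而非正文中的 0.0001,激活函数集合也只取了一部分):

```python
import torch

acts = {
    "ReLU": lambda x: torch.relu(x),
    "ELU":  lambda x: torch.nn.functional.elu(x),
    "GELU": lambda x: torch.nn.functional.gelu(x),
    "SiLU": lambda x: x * torch.sigmoid(x),
    "Mish": lambda x: x * torch.tanh(torch.nn.functional.softplus(x)),
    "TeLU": lambda x: x * torch.tanh(torch.exp(x)),
}

# 在 [-200, 0) 上以 0.01 为步长扫描,利用自动求导记录梯度为 0 的位置
xs = torch.arange(-200.0, 0.0, 0.01, dtype=torch.float32, requires_grad=True)
for name, f in acts.items():
    (grad,) = torch.autograd.grad(f(xs).sum(), xs)
    zero = grad == 0
    if zero.any():
        print(f"{name:5s}: 零梯度区域的右端点约为 x ≈ {xs[zero].max().item():.2f}")
    else:
        print(f"{name:5s}: 在扫描区间内梯度始终非零")
```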
为了更好地理解这些非线性的比较数值衰减,我们检查了其导数在不活跃区域内的各个点的值。尽管我们的完整实验考虑了100个不同的点,但为了简单起见,我们展示了在输入-100和-10时的梯度值,我们认为这足以捕捉整体行为。这些值证实,TeLU和SiLU在其奇点后不活跃区域的开始时保持更强的梯度。相比之下,Mish、Logish和Smish最初表现出较弱的梯度,但最终导致稍强的梯度。这种关系类似于双曲正切和Sigmoid非线性之间的比较:虽然双曲正切在不活跃区域的早期保持更强的梯度,但Logistic Sigmoid在不活跃状态的更深层次最终表现出稍强的梯度。
5.2 活跃区域的近线性
在其无界区域内模仿恒等函数的激活函数,受益于其活跃区域内的强梯度,从而实现更高效的训练 [2, 5]。这些强的单位梯度防止了可学习参数更新在反向传播过程中被饱和函数的上游偏导数缩小,使模型能够更快地在损失景观中达到局部最小值。在这个阶段,学习率调度器可以降低有效学习率以促进精确收敛。相比之下,在活跃区域内具有次恒等增长的激活函数通常需要专门的学习率调度器来抵消其较弱的梯度。然而,这种增加的复杂性增加了训练超参数之间的相互依赖性,导致设计模块化程度较低。
为了在平滑激活函数中实现类似的收敛速度和模块化,我们旨在在活跃区域内近似线性。在此背景下,$\operatorname{TeLU}(x)=x \cdot \tanh(e^{x})$ 可以看作是 $\operatorname{Mish}(x)=x \cdot \tanh(\ln(1+e^{x}))$ 的一个变体,但去除了对数项,使双曲正切更快地饱和到1。恒等函数然后与这个饱和的双曲正切相乘,导致在活跃区域内接近线性。如表3所示,对于正输入,TeLU的激活比所考虑的任何其他平滑函数都更接近恒等函数,这是通过L1和L2度量来衡量的。与线性的L1距离度量对每个激活函数 $f(x)$ 计算为 $\int_{0}^{\infty}|f(x)-m x|\, dx$,其中 $m$ 表示当输入增大时 $f(x)$ 所趋近的斜率。类似地,L2距离度量评估为 $\int_{0}^{\infty}(f(x)-m x)^{2}\, dx$。
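按照上式,可以用数值积分粗略复现这一比较。下面是我们编写的 SciPy 示意(非论文代码;积分上限截断为 50,仅选取渐近斜率 $m=1$ 的 TeLU 与 SiLU 作对比):

```python
import numpy as np
from scipy.integrate import quad

def telu(x): return x * np.tanh(np.exp(x))
def silu(x): return x / (1.0 + np.exp(-x))

# 两者在活跃区的渐近斜率 m 均为 1,因此直接与恒等函数比较
for name, f in [("TeLU", telu), ("SiLU", silu)]:
    l1, _ = quad(lambda x: abs(f(x) - x), 0.0, 50.0)    # L1 距离(截断到 [0, 50])
    l2, _ = quad(lambda x: (f(x) - x) ** 2, 0.0, 50.0)  # L2 距离
    print(f"{name}:  L1 ≈ {l1:.4f},  L2 ≈ {l2:.4f}")
```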
在所列的激活函数中,TeLU脱颖而出,其活跃区域最有效且快速地近似恒等函数。这很重要,因为理论上,更强的梯度增强了反向传播过程中的梯度传播,从而最小化了梯度消失问题。然而,重要的是要避免梯度持续超过1,因为这可能导致梯度爆炸。以梯度为1线性增长的函数有效地缓解了梯度消失和梯度爆炸问题。分段函数如ReLU和ELU在正输入时定义为恒等函数,但在负区域缺乏持续的梯度。这种缺陷可能导致神经元变得不活跃或“死亡”,这是在基于ReLU的网络中当负输入梯度变为零时的常见问题。相比之下,TeLU在其整个定义域内提供持续的非零梯度,降低了神经元不活跃的可能性。如第5.1小节所述,TeLU提供的持续梯度促进了更稳定和一致的权重更新,导致更平滑的训练动态。在非活跃区域的持续梯度和活跃区域的强单位梯度的结合,使得使用TeLU的神经网络比使用其他激活函数的神经网络收敛更快。这使得TeLU成为实现高效和稳健训练结果的优越选择。
5.3 运行时效率
具有简单数学公式的非线性可以显著提高神经元在前向和后向传递中的效率。存在许多强大的优化方法,可以在数值计算中实现主要的速度提升。动态规划 [70]、线性插值 [71]、低级语言实现 [72] 和机器学习硬件加速器 [73, 74] 等方法都可以为计算提供巨大的速度提升。这些优化方法一起可以为激活函数计算提供幅度不定但可观的性能提升。然而,这些提升是以工程、硬件或功耗成本为代价的。在本小节中,我们抽象掉这些优化,专注于每个非线性的基本定义。我们将这些表达式限定为仅包含常见子表达式,即:$x$、$e^{x}$、$\ln(x)$、$\tanh(x)$、最大值、分段定义,以及误差函数。我们在表4中以它们的常见形式表达每个非线性,这反映了它们在最初工作 [7, 12, 41, 14, 43, 48, 20] 中的定义。
为了评估我们研究中激活函数的数学复杂性,我们尝试量化它们的复杂性。我们假设函数计算复杂性的主要贡献者是它包含的分段操作数量。我们将分段操作与典型的非线性操作区分开来,因为它们通常通过条件控制流而不是单一计算来实现。接下来是每个激活函数包含的非分段非线性操作。为此,我们计算每个非线性中存在的对数、指数、三角函数和误差函数的实例。为了进一步评估计算复杂性,我们还统计了每个非线性所需乘法(乘法或除法)和加法(加法或减法)操作的数量,这些操作发生在之前未计算的非线性函数之外。这些操作通常涉及神经网络中的长向量,使得它们计算成本高昂。最后,我们考虑需要分配用于计算非线性的常数值向量的数量。尽管这些分配可以通过各种方式进行优化或摊销,但它们仍然会产生现有的内存成本。我们在表5中总结了这些量化结果。由于有几种合理的方式来表示这些激活函数的一阶导数,我们只在附录中的表40中进行了这种启发式评估。
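作为上述操作计数的补充,下面给出一个极简的 PyTorch 前向计时示意(我们的演示脚本,非论文正式基准;张量规模、迭代次数以及具体耗时都依硬件与后端实现而变化):

```python
import time
import torch

x = torch.randn(4096, 4096)

acts = {
    "ReLU":     lambda t: torch.relu(t),
    "TeLU":     lambda t: t * torch.tanh(torch.exp(t)),
    "GELU":     lambda t: torch.nn.functional.gelu(t),
    "Mish":     lambda t: t * torch.tanh(torch.nn.functional.softplus(t)),
    "Softplus": lambda t: torch.nn.functional.softplus(t),
}

for name, f in acts.items():
    f(x)                               # 预热,排除首次分配的开销
    start = time.perf_counter()
    for _ in range(50):
        f(x)
    elapsed = time.perf_counter() - start
    print(f"{name:9s}: {elapsed * 1000 / 50:.3f} ms / 次前向")
```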
5.4 ReLU兼容性
修正线性单元(ReLU)激活函数在无状态架构(如卷积神经网络(CNNs)、Transformer、自动编码器和扩散模型)中的广泛成功,使其牢固地确立为深度学习研究和开发中的默认选择。其简单性、计算效率以及缓解梯度消失问题的能力,使其成为训练各种深度架构不可或缺的。因此,一个紧密模仿ReLU行为同时提供潜在改进的激活函数更有可能在社区中获得关注。这样的函数不仅应保留ReLU的核心特性,还应增强网络稳定性、梯度流动性和在各种架构中的表达能力。
为了系统地识别满足这些标准的激活函数,我们首先定义了一个严格的评估指标,用于衡量候选函数与ReLU行为的接近程度。设 $r(x)=\max(0, x)$ 表示ReLU函数,$f(x)$ 表示候选激活函数。总近似误差定义为:
$$\mathcal{L}_{\text{total}}=\int_{-\infty}^{\infty}|r(x)-f(x)|\, dx$$
该积分量化了 $f(x)$ 在整个输入域上与ReLU的总体偏差。然而,由于ReLU在两个不同的区域(非活跃区域 $(x<0)$ 和活跃区域 $(x \geq 0)$)内操作,我们进一步将误差分解为来自每个区域的贡献:
$$\mathcal{L}_{\text{inactive}}=\int_{-\infty}^{0}|r(x)-f(x)|\, dx, \quad \mathcal{L}_{\text{active}}=\int_{0}^{\infty}|r(x)-f(x)|\, dx$$
这种分解使我们能够分析近似误差是集中在负值的抑制上,还是集中在正值的缩放上,这可以表明 f ( x ) f(x) f(x)的特定属性,这些属性可能对特定模型类有益或有害。我们将这些指标应用于一组常用的激活函数,包括TeLU、ELU、SiLU、GELU、Mish、Logish、Smish、Leaky ReLU( LReLU ( x ) = max ( 0.01 x , x ) \operatorname{LReLU}(x)=\max (0.01 x, x) LReLU(x)=max(0.01x,x))和Softplus( Softplus ( x ) = ln ( 1 + e x ) \text{Softplus}(x)=\ln \left(1+e^{x}\right) Softplus(x)=ln(1+ex))。表6总结了计算出的接近值,提供了对每个函数在不同区域中复制ReLU行为的稳健比较。这种系统的评估框架不仅有助于选择可以作为ReLU直接替代品的激活函数,还指导了新函数的开发,这些新函数在保持ReLU有利特性的同时,可能缓解已知问题,如“dying ReLU”问题。最终,我们的方法支持发现能够无缝集成到各种深度学习架构中的激活函数,从CNNs和自动编码器到Transformer和新兴的生成模型(如扩散模型),确保兼容性并增强整体模型鲁棒性。
为了开始识别与ReLU学习方式相似的激活函数,我们首先假设对该激活函数的评估与ReLU非常接近。我们通过确定 ∫ − ∞ ∞ ∣ r ( x ) − f ( x ) ∣ d x \int_{-\infty}^{\infty}|r(x)-f(x)| d x ∫−∞∞∣r(x)−f(x)∣dx来量化这种近似的损失,其中 r ( x ) = ReLU ( x ) r(x)=\operatorname{ReLU}(x) r(x)=ReLU(x), f ( x ) f(x) f(x)表示我们正在评估的每个候选函数在ReLU近似方面的表现。为了进一步深入了解这种近似,我们还评估了 ∫ − ∞ 0 ∣ r ( x ) − f ( x ) ∣ d x \int_{-\infty}^{0}|r(x)-f(x)| d x ∫−∞0∣r(x)−f(x)∣dx和 ∫ 0 ∞ ∣ r ( x ) − f ( x ) ∣ d x \int_{0}^{\infty}|r(x)-f(x)| d x ∫0∞∣r(x)−f(x)∣dx,以量化ReLU非活跃区域和活跃区域的近似损失。我们在表6中对TeLU、ELU、SiLU、GELU、Mish、Logish、Smish、 LReLU = max ( 0.01 x , x ) \text{LReLU}=\max (0.01 x, x) LReLU=max(0.01x,x)和 Softplus ( x ) = ln ( 1 + e x ) \text{Softplus}(x)=\ln \left(1+e^{x}\right) Softplus(x)=ln(1+ex)进行了这些计算。
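这些积分同样可以用数值方法近似。下面的 SciPy 片段是我们提供的示意实现(非论文代码;无穷积分限截断为 ±50,候选函数只列出三个):

```python
import numpy as np
from scipy.integrate import quad
from scipy.special import erf

relu = lambda x: max(x, 0.0)
cands = {
    "TeLU": lambda x: x * np.tanh(np.exp(x)),
    "SiLU": lambda x: x / (1.0 + np.exp(-x)),
    "GELU": lambda x: 0.5 * x * (1.0 + erf(x / np.sqrt(2.0))),
}

for name, f in cands.items():
    # 用 [-50, 0] 与 [0, 50] 近似 (-inf, 0] 与 [0, +inf):两端的偏差都以指数速度消失
    inactive, _ = quad(lambda x: abs(relu(x) - f(x)), -50.0, 0.0)
    active, _   = quad(lambda x: abs(relu(x) - f(x)),   0.0, 50.0)
    print(f"{name}: 非活跃区 L1 ≈ {inactive:.4f}, 活跃区 L1 ≈ {active:.4f}, 总计 ≈ {inactive + active:.4f}")
```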
综上所述,我们的研究结果表明,在候选函数池中,只有非单调激活函数能够在有界的 L 1 L_{1} L1范数误差下近似ReLU。具体而言,这些非单调函数具有与ReLU行为密切相关的关键数学特性:
- $f(0)=0$
- $\lim_{x \rightarrow -\infty} f(x)=0$
- $\lim_{x \rightarrow \infty} f(x)=\infty$
- $\lim_{x \rightarrow \infty} \frac{f(x)}{x}=1$
这些特性并非偶然,而是任何旨在复制ReLU在深度学习架构中独特作用的函数所必须满足的基本要求。首先, f ( 0 ) = 0 f(0)=0 f(0)=0确保当预激活输入为零时,使用该函数的神经元将保持不活跃状态,从而保留了基于ReLU网络中至关重要的稀疏性和抑制性。任何偏离这一特性的情况都会改变激活模式,进而影响网络的整体推理动态。其次,当 x → − ∞ x \rightarrow-\infty x→−∞时的渐近行为保证了对于大负输入有一个密集的非激活区域,这有助于通过使大部分神经元保持不活跃状态来诱导稀疏性。如果没有这一特性,激活函数将产生虚假的激活,从而破坏稀疏表示的好处。
此外,当 x → ∞ x \rightarrow \infty x→∞时的线性增长是ReLU有效性的标志,使得大正输入具有强梯度。如果增长率低于线性,函数将遭受梯度消失问题,导致训练过程中收敛不良。相反,如果函数增长速度快于 x x x,则会加剧梯度爆炸问题,导致数值不稳定。因此,成功的ReLU近似必须满足这些增长条件,以确保与ReLU优化的架构兼容。
观察到只有非单调函数满足这些标准,这表明非单调性不仅仅是人为产物,而是准确模拟ReLU行为的必要特征。当我们检查ReLU和其他广泛采用的激活函数(如TeLU、SiLU、GELU和Mish)之间的结构相似性时,这一点变得更加清晰。这些函数中的每一个都可以表示为形式为 x ⋅ σ ( x ) x \cdot \sigma(x) x⋅σ(x)的缩放累积分布函数,其中 σ ( x ) \sigma(x) σ(x)是一个范围在 ( 0 , 1 ) (0,1) (0,1)内的平滑、单调递增函数。对于ReLU, σ ( x ) \sigma(x) σ(x)是Heaviside阶跃函数:
$$\sigma(x)=\begin{cases}0 & \text{当 } x<0 \\ 1 & \text{当 } x \geq 0\end{cases}$$
在TeLU中,这个分布是 $\sigma(x)=\tanh(e^{x})$,而其他激活函数则有各自独特但相似的累积分布。从这个角度来看,我们可以将这些函数中的非单调性解释为对ReLU在 $x=0$ 处突变的平滑近似。ReLU可以看作是广义函数类 $x \cdot \sigma(x)$ 的一个退化情况,这表明任何ReLU的平滑近似都必须固有地引入非单调行为,以弥合活跃区域和非活跃区域之间的差距。
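下面用一小段 Python 直观展示这种“缩放累积分布”的视角(我们的示意,非论文代码):将 ReLU、TeLU 与 SiLU 都写成 $x \cdot \sigma(x)$ 的形式,并比较各自的 $\sigma$ 分量。

```python
import numpy as np

# 将若干激活函数写成 x * sigma(x) 的形式,比较其中的"类 CDF"分量 sigma(x)
heaviside = lambda x: np.heaviside(x, 1.0)        # ReLU 对应的阶跃分布
telu_cdf  = lambda x: np.tanh(np.exp(x))          # TeLU 的平滑分量
silu_cdf  = lambda x: 1.0 / (1.0 + np.exp(-x))    # SiLU 的逻辑斯蒂分量

for x in [-4.0, -1.0, 0.0, 1.0, 4.0]:
    print(f"x={x:+.1f}  Heaviside={heaviside(x):.3f}  tanh(e^x)={telu_cdf(x):.3f}  sigmoid={silu_cdf(x):.3f}")
# TeLU 与 SiLU 的分量单调递增且取值在 (0,1) 内,阶跃函数则是其退化情形;
# 乘以恒等函数后,ReLU 在 x=0 处的突变被平滑化,由此产生非单调性。
```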
事实上,我们可以证明非单调性是唯一能够近似ReLU的单一定义函数类,同时满足5.4中列出的基本属性。为了形式化这一点,我们采用反证法。假设一个纯单调函数满足5.4中的所有条件。那么,根据构造,它必须在非激活区域 ( x < 0 ) (x<0) (x<0)中无法充分失活,或者在活跃区域 ( x > 0 ) (x>0) (x>0)中无法线性增长。这一矛盾表明,单调性无法同时实现复制ReLU所需的稀疏失活和线性激活。因此,非单调性不仅仅是ReLU的一个特征——它是任何ReLU平滑近似的先决条件。
这一见解将我们的关注点从单纯寻找ReLU的替代品,转向设计非单调激活函数,这些函数在保留ReLU基本特性的同时,还提供了增强的数值稳定性和梯度行为。通过这种方式构建问题,我们为开发下一代激活函数奠定了理论基础,这些函数既在数学上有原则,又在经验上有效。
定理5.1 设 $r(x)=\operatorname{ReLU}(x)=\max(0, x)$,定义为:
$$r(x)=\begin{cases}0, & \text{如果 } x \leq 0 \\ x, & \text{如果 } x>0\end{cases}$$
该函数具有以下特性:
- $r(0)=0$,
- $\lim_{x \rightarrow -\infty} r(x)=0$,
- $\lim_{x \rightarrow \infty} r(x)=\infty$。
设 $f(x)$ 是在 $\mathbb{R}$ 上定义的一个连续且可微的单一定义函数,对于所有 $x \in \mathbb{R}$,它满足与 $r(x)$ 相同的特性。那么,任何近似 $r(x)$ 的 $f(x)$ 都必须是非单调的。换句话说,$f(x)$ 必须在其定义域内既有增加的区段又有减少的区段。
证明:第一步:ReLU及其导数的特性
ReLU函数 $r(x)=\max(0, x)$ 具有分段线性形式:
$$r(x)=\begin{cases}0, & \text{如果 } x \leq 0 \\ x, & \text{如果 } x>0\end{cases}$$
其导数 $r^{\prime}(x)$ 为:
$$r^{\prime}(x)=\begin{cases}0, & \text{如果 } x<0 \\ 1, & \text{如果 } x>0\end{cases}$$
ReLU的导数在 $x=0$ 处突变,使得 $r(x)$ 在该点不可微。这种突变是任何平滑近似函数 $f(x)$ 为紧密模仿 $r(x)$ 所必须捕捉的。
第二步: f ( x ) f(x) f(x)的平滑近似条件
假设 $f(x)$ 是一个在所有 $x \in \mathbb{R}$ 上定义的平滑、连续且可微的函数。此外,假设 $f(x)$ 是单调函数,即它不会在 $\mathbb{R}$ 上既有增加又有减少的区段。为了近似 $r(x)$,$f(x)$ 必须满足:
- $f(0)=0$,
- $\lim_{x \rightarrow -\infty} f(x)=0$,
- $\lim_{x \rightarrow \infty} f(x)=\infty$。
由于 $f(x)$ 在所有点上都是连续且可微的,因此只有三种可能的情况描述了 $f(x)$ 在所有 $x \in(-\infty, 0]$ 上的增长:
- $f(x)$ 在所有 $x \in(-\infty, 0]$ 上取常数值,既不增加也不减少。
- $f(x)$ 在 $a<x<b$ 上增加,其中 $a, b \in(-\infty, 0]$,增加量为 $d \in \mathbb{R}^{+}$。
- $f(x)$ 在 $a<x<b$ 上减少,其中 $a, b \in(-\infty, 0]$,减少量为 $d \in \mathbb{R}^{+}$。
我们接下来分别考虑每种情况。
情况1:假设 f ( x ) f(x) f(x)在所有 x ∈ ( − ∞ , 0 ] x \in(-\infty, 0] x∈(−∞,0]上为常数。为了使 f ( x ) f(x) f(x)在该区间上保持常数,它必须被定义为常数,比如 f ( x ) = 0 f(x)=0 f(x)=0,对于所有 x ∈ ( − ∞ , 0 ] x \in(-\infty, 0] x∈(−∞,0]。现在,如果 f ( x ) f(x) f(x)要在 x > 0 x>0 x>0时增加,那么在某个点 c > 0 c>0 c>0处, f ( x ) f(x) f(x)必须表现出增长,即 f ( x ) f(x) f(x)不再保持常数。这意味着 f ( x ) f(x) f(x)必须在 x = 0 x=0 x=0处改变其行为,从在 x ∈ ( − ∞ , 0 ] x \in(-\infty, 0] x∈(−∞,0]上保持常数转变为在 x > 0 x>0 x>0时增加。例如,我们可以将 f ( x ) f(x) f(x)重新定义为在 x > 0 x>0 x>0时为增加函数,比如 g ( x ) g(x) g(x)。然而,这种构造会导致矛盾,因为函数 f ( x ) f(x) f(x)不再由在所有 x ∈ R x \in \mathbb{R} x∈R上的单一表达式定义。相反,它将需要两个不同的定义:一个用于 x ∈ ( − ∞ , 0 ] x \in(-\infty, 0] x∈(−∞,0],其中 f ( x ) = 0 f(x)=0 f(x)=0;另一个用于 x > 0 x>0 x>0,其中 f ( x ) f(x) f(x)遵循某个增加函数。这种分段定义与假设 f ( x ) f(x) f(x)是在 R \mathbb{R} R上的单一连续函数相矛盾。
总之,为了使 f ( x ) f(x) f(x)在 x ∈ ( − ∞ , 0 ] x \in(-\infty, 0] x∈(−∞,0]上既为常数又在 x > 0 x>0 x>0时增加,它必须由两个不同的表达式定义,这与 f ( x ) f(x) f(x)是在所有 x ∈ R x \in \mathbb{R} x∈R上的单一函数的前提相矛盾。因此, f ( x ) f(x) f(x)无法在不连续或分段定义的情况下同时满足这两个条件。
情况2:设 f ( x ) f(x) f(x)在区间 ( a , b ) (a, b) (a,b)上为连续且单调递增的函数。根据定义,这意味着对于任何 x 1 , x 2 ∈ ( a , b ) x_{1}, x_{2} \in(a, b) x1,x2∈(a,b)且 x 1 < x 2 x_{1}<x_{2} x1<x2,我们有 f ( x 1 ) ≤ f ( x 2 ) f\left(x_{1}\right) \leq f\left(x_{2}\right) f(x1)≤f(x2)。因此, f ( x ) f(x) f(x)在 ( a , b ) (a, b) (a,b)上不表现出任何减少行为。进一步假设 lim x → − ∞ f ( x ) = 0 \lim _{x \rightarrow-\infty} f(x)=0 limx→−∞f(x)=0,且 f f f在区间 ( a , b ) (a, b) (a,b)上增加量为 d d d,即 f ( b ) = f ( a ) + d f(b)=f(a)+d f(b)=f(a)+d。因此, f ( a ) f(a) f(a)和 f ( b ) f(b) f(b)之间的关系为:
$$f(b)=f(a)+d$$
其中 d > 0 d>0 d>0,因为 f ( x ) f(x) f(x)是增加的。根据介值定理(IVT),由于 f ( x ) f(x) f(x)在 ( a , b ) (a, b) (a,b)上连续,对于任何值 c ∈ ( f ( a ) , f ( b ) ) c \in(f(a), f(b)) c∈(f(a),f(b)),都存在某个 x 0 ∈ ( a , b ) x_{0} \in(a, b) x0∈(a,b)使得 f ( x 0 ) = c f\left(x_{0}\right)=c f(x0)=c。特别是,对于任何小的 ϵ > 0 \epsilon>0 ϵ>0,我们可以选择 c = f ( b ) − ϵ = d − ϵ c=f(b)-\epsilon=d-\epsilon c=f(b)−ϵ=d−ϵ。因此,存在某个 x 0 ∈ ( a , b ) x_{0} \in(a, b) x0∈(a,b)使得:
$$f(x_{0})=d-\epsilon$$
这并不意味着 f ( x ) f(x) f(x)在 ( a , b ) (a, b) (a,b)上的任何地方都减少。相反,因为 f ( x ) f(x) f(x)是单调递增的,它穿过值 d − ϵ d-\epsilon d−ϵ,同时继续从 f ( a ) f(a) f(a)增加到 f ( b ) f(b) f(b)。
总结来说,根据介值定理(IVT), f ( x ) f(x) f(x)可以在 f ( a ) f(a) f(a)和 f ( b ) f(b) f(b)之间取任何值,并且由于 f ( x ) f(x) f(x)是增加的,它在不减少的情况下达到这些值。因此,对于某个 x 0 ∈ ( a , b ) x_{0} \in(a, b) x0∈(a,b), f ( x ) f(x) f(x)等于 d − ϵ d-\epsilon d−ϵ这一事实,与 f ( x ) f(x) f(x)是单调递增函数的前提完全一致。因此,我们得出结论,函数 f ( x ) f(x) f(x)从 f ( a ) f(a) f(a)增加到 f ( b ) f(b) f(b),增量为 d d d,并且介值定理保证了在区间 ( a , b ) (a, b) (a,b)内, f ( a ) f(a) f(a)和 f ( b ) f(b) f(b)之间的所有值都会被取到,而无需 f ( x ) f(x) f(x)在任何地方减少。
情况3:设 f ( x ) f(x) f(x)在区间 ( a , b ) (a, b) (a,b)上为连续且单调递减的函数。根据定义,对于任何 x 1 , x 2 ∈ ( a , b ) x_{1}, x_{2} \in(a, b) x1,x2∈(a,b)且 x 1 < x 2 x_{1}<x_{2} x1<x2,我们必须有 f ( x 1 ) ≥ f ( x 2 ) f\left(x_{1}\right) \geq f\left(x_{2}\right) f(x1)≥f(x2)。因此, f ( x ) f(x) f(x)在 ( a , b ) (a, b) (a,b)上不表现出任何增加行为。
现在,假设 lim x → − ∞ f ( x ) = 0 \lim _{x \rightarrow-\infty} f(x)=0 limx→−∞f(x)=0,且 f ( x ) f(x) f(x)在区间 a < x < b a<x<b a<x<b上减少量为 d d d。这意味着:
$$f(b)=f(a)-d$$
其中 d > 0 d>0 d>0,因为 f ( x ) f(x) f(x)是递减的。因此, f ( b ) f(b) f(b)小于 f ( a ) f(a) f(a)。此外,假设 f ( 0 ) = 0 f(0)=0 f(0)=0且 d < f ( 0 ) = 0 d<f(0)=0 d<f(0)=0,即对于某个负值 d d d,有 f ( b ) = d f(b)=d f(b)=d,即 f ( b ) = d < 0 f(b)=d<0 f(b)=d<0。根据介值定理(IVT),由于 f ( x ) f(x) f(x)是连续且递减的,对于任何值 c ∈ ( f ( b ) , f ( a ) ) c \in(f(b), f(a)) c∈(f(b),f(a)),必须存在某个 x 0 ∈ ( a , b ) x_{0} \in(a, b) x0∈(a,b)使得 f ( x 0 ) = c f\left(x_{0}\right)=c f(x0)=c。特别是,对于任何小的 ϵ > 0 \epsilon>0 ϵ>0,我们可以选择 c = d + ϵ c=d+\epsilon c=d+ϵ,因此存在某个 x 0 ∈ ( a , b ) x_{0} \in(a, b) x0∈(a,b)使得:
$$f(x_{0})=d+\epsilon$$
其中 d + ϵ ∈ ( d , 0 ) d+\epsilon \in(d, 0) d+ϵ∈(d,0)。
然而,这并不意味着 f ( x ) f(x) f(x)在 ( a , b ) (a, b) (a,b)上的任何地方都增加。由于 f ( x ) f(x) f(x)是单调递减的,它在从 f ( a ) f(a) f(a)到 f ( b ) f(b) f(b)的递减过程中达到值 d + ϵ d+\epsilon d+ϵ。因此,函数无需在任何点增加,且介值定理不会导致任何矛盾。认为 f ( x ) f(x) f(x)必须在 ( − ∞ , 0 ] (-\infty, 0] (−∞,0]内的某点增加的断言是不正确的。单调函数仍然可以在保持严格递减的同时取到像 d + ϵ d+\epsilon d+ϵ这样的值。因此,认为 f ( x ) f(x) f(x)必须在 ( − ∞ , 0 ] (-\infty, 0] (−∞,0]内既减少又增加的想法是错误的,并且不遵循给定的假设。
总之,假设 f ( x ) f(x) f(x)是单调的并不会导致任何矛盾。因此,声称 f ( x ) f(x) f(x)必须是非单调的是没有根据的。 f ( x ) f(x) f(x)可以保持单调递减,而不违反介值定理或给定的条件。
在证明了必须紧密近似ReLU的函数是非单调的之后,我们接下来根据ReLU近似来比较我们的非单调候选函数。我们回顾一下,在表6中,TeLU在激活子域 x ∈ [ 0 , ∞ ] x \in[0, \infty] x∈[0,∞]上实现了对ReLU的最佳近似。另一方面,GELU在非激活子域 x ∈ ( − ∞ , 0 ] x \in(-\infty, 0] x∈(−∞,0]上最接近ReLU。我们认为,在激活区域ReLU表现出强梯度时对其进行紧密近似,比在接近零的非激活区域对其进行紧密近似更为重要。激活区域表现出更强的梯度,因此对神经网络的学习方式有更大的影响。因此,我们假设ReLU应该更类似于TeLU而不是GELU。我们在下面的引理中详细阐述了我们的推理:
引理5.1 设 $r(x)=\operatorname{ReLU}(x)=\max(0, x)$,定义为:
$$r(x)=\begin{cases}0, & \text{如果 } x \leq 0 \\ x, & \text{如果 } x>0\end{cases}$$
定义 $r(x)$ 的激活子域为 $\mathcal{A}=[0, \infty)$,非激活子域为 $\mathcal{I}=(-\infty, 0)$。
设 $t(x)=\operatorname{TeLU}(x)$ 和 $g(x)=\operatorname{GELU}(x)$ 为两个平滑、连续且可微的激活函数,满足:
- 对于 $x \in \mathcal{A}$,$t(x) \approx r(x)$,
- 对于 $x \in \mathcal{I}$,$g(x) \approx r(x)$。
此外,设 $f(x)$ 为在整个域 $x \in \mathbb{R}$ 上近似 $r(x)$ 的任意连续且可微函数。
定义 $f(x)$ 在激活子域 $\mathcal{A}$ 和非激活子域 $\mathcal{I}$ 上的近似误差为:
$$E_{\mathcal{A}}(f)=\int_{0}^{\infty}|f(x)-r(x)|^{2}\, dx, \qquad E_{\mathcal{I}}(f)=\int_{-\infty}^{0}|f(x)-r(x)|^{2}\, dx$$
如果 $E_{\mathcal{A}}(t)<E_{\mathcal{A}}(g)$ 且 $E_{\mathcal{I}}(g)<E_{\mathcal{I}}(t)$,则 $t(x)$ 在激活区域提供了比 $g(x)$ 更接近 $r(x)$ 的近似,而 $g(x)$ 在非激活区域提供了比 $t(x)$ 更接近 $r(x)$ 的近似。
此外,$f(x)$ 对神经网络训练的影响由每个子域中的梯度决定:
$$I_{\mathcal{A}}(f)=\int_{0}^{\infty}\left|f^{\prime}(x)\right|^{2}\, dx, \qquad I_{\mathcal{I}}(f)=\int_{-\infty}^{0}\left|f^{\prime}(x)\right|^{2}\, dx$$
如果 $I_{\mathcal{A}}(t)>I_{\mathcal{I}}(g)$,则 $t(x)$ 在激活区域对训练的影响比 $g(x)$ 在非激活区域的影响更大。
如果使用 $t(x)$ 和 $g(x)$ 来近似 $r(x)$,则由于在激活子域 $\mathcal{A}$ 中具有更强的梯度和更小的近似误差,$t(x)$ 是神经网络训练中ReLU的更好替代品。因此,对于激活区域对学习影响更大的任务,$t(x)$ 应该比 $g(x)$ 更类似于 $r(x)$。
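引理中的误差量可以数值化地估计。下面的 SciPy 示意由我们编写(非论文代码;积分限截断为 ±50),按 $E_{\mathcal{A}}$、$E_{\mathcal{I}}$ 的定义比较 TeLU 与 GELU:

```python
import numpy as np
from scipy.integrate import quad
from scipy.special import erf

relu = lambda x: max(x, 0.0)
telu = lambda x: x * np.tanh(np.exp(x))
gelu = lambda x: 0.5 * x * (1.0 + erf(x / np.sqrt(2.0)))

def errors(f, hi=50.0):
    e_active, _   = quad(lambda x: (f(x) - relu(x)) ** 2, 0.0, hi)
    e_inactive, _ = quad(lambda x: (f(x) - relu(x)) ** 2, -hi, 0.0)
    return e_active, e_inactive

for name, f in [("TeLU", telu), ("GELU", gelu)]:
    ea, ei = errors(f)
    print(f"{name}: E_A ≈ {ea:.5f}, E_I ≈ {ei:.5f}")
# 预期 E_A(TeLU) < E_A(GELU) 而 E_I(GELU) < E_I(TeLU),
# 即 TeLU 在梯度更强的活跃区更贴近 ReLU,GELU 则在非活跃区更贴近。
```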
TeLU表现出与ReLU激活函数中观察到的强恒等增长密切相似的梯度行为。具体来说,当输入为正时,TeLU和ReLU都产生大约等于一的激活梯度,导致两者在训练期间的收敛速度相当。
This rapid convergence arises because strong gradients enable more significant parameter updates with each training iteration, helping the model learn faster. However, this property also implies that TeLU, like ReLU, does not impose much implicit regularization on the model. Implicit regularization refers to the tendency of some activation functions to naturally limit the magnitude of updates, thereby acting as a form of regulation on model complexity. In the case of ReLU and TeLU, the strong gradients reduce this implicit regularization effect, allowing the network to fit the data more efficiently.
In contrast, other activation functions, such as Logish or Smish, have weaker active gradients-especially as inputs move away from zero. This results in a “damping” effect where gradient updates are smaller, slowing down convergence. This behavior effectively regularizes the learning process by preventing overly large parameter updates, which can help in scenarios where controlling model complexity and preventing overfitting are desired. Thus, the distinct behavior of TeLU, with its strong gradients and rapid convergence, sets it apart from these other functions that inherently moderate the learning speed through gradient damping.
This characteristic leads to a more modular design of the neural network training procedure, as other components, such as the optimization algorithm and learning rate scheduler, can take on a more direct role in managing learning steps. With the activation function playing a lesser role in regularization, the optimization algorithm can be fine-tuned to control learning aggressiveness through parameters like momentum and decay rates, while the learning rate scheduler dynamically adjusts based on training progress. This approach allows for more precise tailoring of the training dynamics using external techniques, resulting in more efficient and effective training. Consequently, network regularization and convergence behavior become more configurable, with the activation function primarily introducing non-linearity rather than acting as a built-in regularization tool.
In summary, non-monotonic functions are well-suited to smoothly approximate the \operatorname{ReLU}=\max (0, x) nonlinearity while mimicking ReLU’s saturating behavior as x \rightarrow-\infty , its deactivation at x=0 , and its unbounded linear growth as x \rightarrow \infty . The TeLU activation function, in particular, closely replicates the inference and learning dynamics of models using ReLU when substituted in its place. Both TeLU and ReLU exhibit strong gradients that help reduce the implicit regularization imposed by the activation function, enabling a more modular training design. These similarities make TeLU an excellent drop-in replacement for ReLU in deep neural networks.
5.5 Analytic Universal Approximation
Definition 5.4 A function g: \mathbb{X} \rightarrow \mathbb{R}^{n} with domain \mathbb{X} \subset \mathbb{R}^{n} is a universal approximator over \mathbb{X} if and only if, for any continuous function f: \mathbb{X} \rightarrow \mathbb{R}^{n} and any arbitrarily small positive error \epsilon , there exist constant vectors \alpha, w , and b \in \mathbb{R}^{n} such that \left|\sum_{j=1}^{n} \alpha_{j} \cdot g\left(w_{j}^{\top} x+b_{j}\right)-f(x)\right|<\epsilon for all x \in \mathbb{X} .
In other words, a univariate function g(x) is considered a universal approximator if it can be used in a finite linear combination of compositions, each with an affine transformation of the input, to approximate any multivariate continuous function over a bounded domain. In the context of deep learning, this concept guarantees that there exists a neural network with a single hidden layer and a universal approximation activation function that can effectively perform any task, provided the target function to be approximated is continuous and the input range is bounded. This principle underpins the capability of neural networks to model a vast range of complex functions with sufficient precision.
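As a concrete, toy-scale illustration of this statement (our own sketch rather than an experiment from the paper; the width, learning rate, target function, and epoch count are arbitrary demonstration choices), the following PyTorch snippet fits a single-hidden-layer network with TeLU activations to a continuous target on a bounded interval:

```python
import torch
import torch.nn as nn

class TeLU(nn.Module):
    def forward(self, x):
        return x * torch.tanh(torch.exp(x))

# One hidden layer: a finite linear combination of TeLU(w^T x + b) terms
model = nn.Sequential(nn.Linear(1, 64), TeLU(), nn.Linear(64, 1))
opt = torch.optim.Adam(model.parameters(), lr=1e-2)

x = torch.linspace(-3.0, 3.0, 512).unsqueeze(1)     # bounded domain
target = torch.sin(2.0 * x) + 0.3 * x ** 2          # an arbitrary continuous target

for step in range(2000):
    opt.zero_grad()
    loss = nn.functional.mse_loss(model(x), target)
    loss.backward()
    opt.step()

print(f"final max |error| on the training grid: {(model(x) - target).abs().max().item():.4f}")
```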
A pivotal breakthrough that underpins the use of sigmoidal activation functions in neural networks is the universal approximation theorem, first established independently by Cybenko (1989), Hornik et al. (1989), and Funahashi (1989). This foundational theorem asserts that a standard neural network with a single hidden layer of activation \sigma: \mathbb{X} \rightarrow \mathbb{R}^{n} can uniformly approximate any continuous function over a bounded domain \mathbb{X} \subset \mathbb{R}^{n} provided that \sigma is sigmoidal. A function \sigma is sigmoidal if and only if \lim _{x \rightarrow-\infty} \sigma(x)=0 and \lim _{x \rightarrow \infty} \sigma(x)=1 . The compact domain \mathbb{X} \subset \mathbb{R}^{n} constraint is a given within digital computer systems that use a finite number of bits to represent numbers. In essence, this means that such networks, despite their shallow structure, possess the remarkable ability to approximate any continuous functions with arbitrary precision.
Building on this foundation, Hornik (1991) expanded the universal approximation theorem to include any continuous, bounded, and non-constant activation function, thereby broadening the potential of neural networks as universal approximators. Further refinement came from the work of Leshno et al. (1993), who demonstrated that for continuous activation functions, a standard neural network with one hidden layer can uniformly approximate any continuous function if the activation function employed by the hidden layer is non-polynomial. This highlights the importance of non-polynomial activation functions in ensuring the universal approximation capabilities of neural networks. Since TeLU cannot be expressed as a polynomial, it is a universal approximator. However, we will demonstrate that TeLU is a universal approximator ourselves with support from Cybenko’s (1989) approach to gain a mathematical handle on the concept. This will allow us to be able to prove further properties regarding the quality of universal approximation of T e L U . Now, we will state and prove that T e L U(x)=x \cdot \tanh \left(e^{x}\right) is a universal approximator:
Lemma 5.2 Let I_{n}=[a, b]^{n} be an n -dimensional hypercube, and let C\left(I_{n}\right) denote the space of continuous functions on I_{n} , where a, b \in \mathbb{R} and a<b . Suppose f \in C\left(I_{n}\right) is a continuous function. Then, there exists a finite linear combination of the form
g(x)=\sum_{i=1}^{M} c_{i} \cdot \operatorname{TeLU}\left(w_{i}^{\top} x+b_{i}\right)
where \operatorname{Te} L U(x)=x \cdot \tanh \left(e^{x}\right) , such that for any \epsilon>0 ,
|f(x)-g(x)|<\epsilon \quad \forall x \in I_{n},
where w_{i} \in \mathbb{R}^{n} , and c_{i}, b_{i} \in \mathbb{R} .
Proof: We first analyze the asymptotic behavior of the function \operatorname{Te} L U(x)=x \cdot \tanh \left(e^{x}\right) :
- As x \rightarrow-\infty , we have e^{x} \rightarrow 0 and therefore \tanh \left(e^{x}\right) \rightarrow 0 , so
\operatorname{TeLU}(x)=x \cdot \tanh \left(e^{x}\right) \rightarrow 0
- As x \rightarrow \infty, e^{x} \rightarrow \infty and \tanh \left(e^{x}\right) \rightarrow 1 , thus
\operatorname{TeLU}(x)=x \cdot \tanh \left(e^{x}\right) \rightarrow x .
Now define the function \sigma(x)=\operatorname{Te} L U(x)-\operatorname{Te} L U(x-1) . We claim that \sigma(x) behaves as a sigmoidal function:
- As x \rightarrow-\infty , both \operatorname{TeLU}(x) \rightarrow 0 and \operatorname{TeLU}(x-1) \rightarrow 0 , so \sigma(x) \rightarrow 0 .
- As x \rightarrow \infty, \operatorname{TeLU}(x) \rightarrow x and \operatorname{TeLU}(x-1) \rightarrow x-1 , thus
\sigma(x)=\operatorname{Te} L U(x)-\operatorname{Te} L U(x-1) \rightarrow 1
Thus, \sigma(x) has the asymptotic properties of a sigmoidal function: \sigma(x) \rightarrow 0 as x \rightarrow-\infty and \sigma(x) \rightarrow 1 as x \rightarrow \infty .
By Cybenko’s universal approximation theorem, for any f \in C\left(I_{n}\right) and any \epsilon>0 , there exist coefficients \alpha_{j} \in \mathbb{R} , vectors y_{j} \in \mathbb{R}^{n} , and biases b_{j} \in \mathbb{R} such that
f(x) \approx \sum_{j=1}^{N} \alpha_{j} \sigma\left(y_{j}^{\top} x+b_{j}\right),
where \sigma(x) is a sigmoidal function. Substituting \sigma(x)=T e L U(x)-T e L U(x-1) , this becomes:
f(x) \approx \sum_{j=1}^{N} \alpha_{j}\left[\operatorname{TeLU}\left(y_{j}^{\top} x+b_{j}\right)-\operatorname{TeLU}\left(y_{j}^{\top} x+b_{j}-1\right)\right]
Now, we can rewrite the sum as a linear combination of TeLU-based terms. Defining M=2 N and \oplus as a concatenating operator, we represent the following vector concatenations:
\hat{y}=y \oplus y, \quad \hat{\alpha}=\alpha \oplus-\alpha, \quad \hat{b}=b \oplus(b-1)
which leads to the expression:
g(x)=\sum_{j=1}^{M} \hat{\alpha}_{j} \operatorname{TeLU}\left(\hat{y}_{j}^{\top} x+\hat{b}_{j}\right)
By Cybenko’s theorem, there exists a sum g(x) such that
|f(x)-g(x)|<\epsilon \quad \forall x \in I_{n}
Thus, \operatorname{Te} L U(x)=x \cdot \tanh \left(e^{x}\right) is a universal approximator for continuous functions.
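A quick numerical sanity check of the sigmoidal construction used in this proof (our own sketch, using the shift of 1 assumed above):

```python
import numpy as np

telu = lambda x: x * np.tanh(np.exp(x))
sigma = lambda x: telu(x) - telu(x - 1.0)   # the difference used in the proof

for x in [-30.0, -5.0, 0.0, 5.0, 30.0]:
    print(f"x={x:+6.1f}  sigma(x)={sigma(x): .6f}")
# sigma(x) tends to 0 as x -> -inf and to 1 as x -> +inf, matching the
# asymptotic behaviour required of a sigmoidal function in Cybenko's theorem.
```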
Having proven that T e L U is a universal approximator, we have shown that, in theory, a neural network with a single hidden layer that employs the T e L U activation function will approximate any unknown target function. An architecture that is capable of universal approximation enjoys the practical benefits of flexible modeling, being able to be trained to reach adequate performance in any task that shares its input and output dimensions. Additionally, with a universal approximator, deep learning engineers do not need to worry about if their chosen activation functions are capable of fitting to a task. Instead, they can focus their design efforts to scaling the model’s parameterization to better fit the desired task to prevent overfitting. Therefore, engineers that apply continuous non-polynomial functions to their fully connected layers may significantly reduce their search to tuning other hyperparameters such as network size, learning rate, and optimizer choice.
A universal approximator ensures that a network can approximate any unknown target function, but not that it will approximate it well enough without overfitting. Some universal approximators like R e L U are only capable of providing piecewise linear representations of tasks, and therefore lack representation power. The limited expressiveness of these ReLU approximations amounts to linear interpolation, which is prone to instability when attempting to model smooth relationships between input and output. Other non-linearities such as E L U offer a continuous first derivative, but a discontinuous second derivative, making it incompatible with second order optimization procedures that improve convergence speed and stability during training. Therefore, we require an additional property beyond standard universal approximator to concisely state this quality of universal approximation. We turn our attention towards the property of a function being analytic:
Definition 5.5 A function f(x) is said to be analytic at a point x_{0} if there exists a neighborhood around x_{0} such that f(x) can be represented as a convergent power series:
f(x)=\sum_{n=0}^{\infty} c_{n}\left(x-x_{0}\right)^{n}
for some constants c_{n} , where the series converges to f(x) for all x in this neighborhood.
When a function is analytic, it is implied that it is indefinitely differentiable at all points. This property makes them highly compatible with second-order optimization strategies that utilize the Hessian matrix of the cost function. By incorporating the Hessian matrix, optimization algorithms can account for the curvature of the loss function, leading to more stable and efficient convergence. Moreover, analytic functions facilitate the smooth transfer of information through the network, resulting in more dispersed neuron activations, which contribute to more robust and generalized representations (CITE). Additionally, analytic functions are compatible with a broader range of mathematical analysis techniques, such as spectral analysis and perturbation methods, which can further aid in understanding and improving the behavior of neural networks during training and inference. Since T e L U is a smooth function composed of analytic functions, we claim and demonstrate that it is also an analytic function. In the following lemma, we demonstrate that TeLU is indeed an analytic function:
Lemma 5.3 TeLU (x)=x \cdot \tanh \left(e^{x}\right) is an analytic function, where x \in \mathbb{R} .
Proof: We begin by noting that \operatorname{Te} L U(x)=x \cdot \tanh \left(e^{x}\right) is the product of the identity function and the composition of the hyperbolic tangent and the exponential functions.
First, consider the exponential function e^{x} , which has the well-known Maclaurin series expansion:
e^{x}=\sum_{n=0}^{\infty} \frac{x^{n}}{n!}
This series converges for all x \in \mathbb{R} , showing that e^{x} is analytic.
Next, we recall the Maclaurin series expansion for the hyperbolic tangent function \tanh (x) :
\tanh (x)=\sum_{n=1}^{\infty} \frac{2^{2 n}\left(2^{2 n}-1\right) B_{2 n}}{(2 n)!} x^{2 n-1}
where B_{2 n} are the Bernoulli numbers. This series also converges for all |x|<\frac{\pi}{2} , establishing that \tanh (x) is analytic within this domain.
Since \tanh \left(e^{x}\right) is the composition of the analytic functions \tanh (x) and e^{x}, \tanh \left(e^{x}\right) is analytic for all x \in \mathbb{R} .
Finally, \operatorname{Te} L U(x)=x \cdot \tanh \left(e^{x}\right) is the product of the analytic function \tanh \left(e^{x}\right) and the identity function x , which is itself analytic. By the closure properties of analytic functions, the product of two analytic functions is also analytic.
Thus, \operatorname{Te} L U(x)=x \cdot \tanh \left(e^{x}\right) is an analytic function, as it can be expressed as a power series:
\operatorname{TeLU}(x)=\sum_{n=0}^{\infty} b_{n} x^{n+1}
where b_{n} \in \mathbb{R} are constants derived from the power series of \tanh \left(e^{x}\right) . This completes the proof.
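The leading coefficients of this power series can be inspected symbolically. The following SymPy sketch (ours, not part of the paper) computes the first few Maclaurin coefficients of TeLU, which come out finite and well defined, as expected of an analytic function:

```python
import sympy as sp

x = sp.symbols('x')
telu = x * sp.tanh(sp.exp(x))

# Maclaurin coefficients c_k = TeLU^(k)(0) / k!; every derivative exists at 0
coeffs = [sp.simplify(sp.diff(telu, x, k).subs(x, 0) / sp.factorial(k)) for k in range(5)]
for k, c in enumerate(coeffs):
    print(f"c_{k} =", c)
```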
Having established that T e L U is analytic, it is important to revisit the practical benefits this property brings to neural network architectures. The smoothness and infinite differentiability of TeLU enhance its compatibility with second order optimization methods such as Natural Gradient Descent (NGD) [75]. This leads to more stable convergence and improved training efficiency, especially in complex models. Additionally, the smooth transfer of information throughout the network helps promote balanced neuron activations, allowing for robust feature representations and better generalization to unseen data. The analytic nature of TeLU also opens up opportunities for utilizing sophisticated mathematical tools, offering deeper insights and potential for further optimization during training and inference. We now show that T e L U , being a universal approximator that is also analytic, leads to analytic universal approximations:
Theorem 5.2 The function \operatorname{Te} L U(x)=x \cdot \tanh \left(e^{x}\right) is an analytic universal approximator. In other words, any continuous function f(x) can be approximated by a linear combination of TeLU nonlinearities, and the resulting approximation will be analytic.
Proof: We have already established that \operatorname{Te} L U(x)=x \cdot \tanh \left(e^{x}\right) is analytic. Let f(x) \in C\left(\mathbb{R}^{n}\right) be an arbitrary continuous function, and let g(x) be a linear combination of TeLU nonlinearities that approximates f(x) . Thus, for any \epsilon>0 , there exists a function g(x) of the form
g(x)=\sum_{j=1}^{M} \alpha_{j} \cdot \operatorname{TeLU}\left(w_{j}^{\top} x+b_{j}\right)
such that the approximation error satisfies
|g(x)-f(x)|<\epsilon \quad \forall x \in \mathbb{R}^{n} .
Since \operatorname{Te} L U(x) is analytic, we can express each \operatorname{Te} L U\left(w_{j}^{\top} x+b_{j}\right) as a convergent power series. Therefore, the function g(x) becomes a summation of power series expansions for each term:
g(x)=\sum_{j=1}^{M} \alpha_{j} \cdot \sum_{n=0}^{\infty} c_{n}\left(w_{j}^{\top} x+b_{j}\right)^{n}
where c_{n} are the coefficients of the power series expansion. Rewriting the expression, we have:
g(x)=\sum_{j=1}^{M} \sum_{n=0}^{\infty} \alpha_{j} c_{n}\left(w_{j}^{\top} x+b_{j}\right)^{n}=\sum_{j=1}^{M} h_{j}(x),
where h_{j}(x) represents the power series expansion of each individual term \operatorname{Te} L U\left(w_{j}^{\top} x+b_{j}\right) . Since the sum of analytic functions is also analytic, the approximation g(x) is a sum of M analytic functions h_{j}(x) , and thus g(x) is analytic.
Consequently, the linear combination of \operatorname{TeLU}(x) nonlinearities that approximates f(x) is itself an analytic function. Therefore, we conclude that \operatorname{Te} L U(x)=x \cdot \tanh \left(e^{x}\right) is an analytic universal approximator.
With the proof that an architecture employing TeLU as its hidden activation function is capable of analytic universal approximations, the model achieves several theoretical and practical advantages. Theoretically, this capability ensures that the architecture can approximate a wide range of continuous and differentiable functions with arbitrary precision, which is essential in modeling complex, nonlinear systems. Analytic universal approximation guarantees that not only can the model represent any function, but it can do so with a smooth, well-behaved structure, preserving differentiability across all layers. This property is crucial in tasks requiring optimization through gradient-based methods, as it mitigates issues related to non-smoothness and sharp transitions in the loss landscape. The analytic nature also allows the model to generalize better by capturing underlying relationships in data, rather than fitting to noise or outliers.
Practically, having an architecture with analytic universal approximation capabilities means that the model can be applied to a broader set of problems across domains such as physics, engineering, and finance, where smooth and continuous approximations are needed for high-fidelity simulations and predictions. For instance, in control systems or physical simulations, TeLU-enabled architectures can model complex dynamics with high precision, leading to better predictive power and stability. Additionally, this property often results in more efficient training, as the smoothness in activation and loss functions can lead to faster convergence and more reliable gradient flow, addressing problems like vanishing gradients.
Having established that \operatorname{Te} L U(x)=x \cdot \tanh \left(e^{x}\right) is an analytic universal approximator, we now explore the theoretical and practical benefits of architectures capable of representing problems with analytic universal approximations.
5.5.1 Theoretical Benefits
Property 5.1 Guaranteed Approximability of Continuous Functions: An architecture using analytic universal approximators can approximate any continuous function f(x) \in C\left(\mathbb{R}^{n}\right) to an arbitrary degree of accuracy. This is ensured by the universal approximation theorem.
Since \operatorname{Te} L U(x) is analytic, it guarantees that smooth, continuous functions can be approximated with a high degree of precision. The smoothness of analytic functions, which are infinitely differentiable within their domain, enables the architecture to handle complex functional relationships.
Property 5.2 Convergence Guarantees: Due to the well-behaved nature of analytic functions, architectures using TeLU (x) exhibit stable and efficient convergence properties when optimized using gradient-based methods.
Since analytic functions have continuous derivatives, the resulting loss surface is smooth, allowing gradientbased optimization methods to converge faster and more reliably to optimal solutions without getting stuck in local minima.
Property 5.3 Higher-Order Information: Analytic functions are infinitely differentiable, meaning that architectures with analytic approximators can capture not only the function’s value but also higher-order derivatives, enabling more precise modeling of the underlying process.
This property is crucial in tasks that require higher-order information, such as physics simulations, where capturing the curvature of the function is essential. It also allows architectures to model complex dynamical systems that require higher-level functional representations.
Property 5.4 Avoidance of Pathological Behavior: Non-analytic functions may exhibit discontinuities or oscillations, which are not desirable in many practical applications. Analytic approximators avoid these issues by ensuring smooth and well-behaved representations.
Non-analytic functions can introduce undesirable behaviors, such as discontinuities or oscillations, which may complicate the learning process and reduce the model’s effectiveness in practical applications. By contrast, architectures using analytic approximators, like TeLU, ensure that the approximated functions are smooth and continuous. This smoothness prevents pathological behaviors, enabling the model to produce well-behaved representations that are more reliable and predictable, particularly in real-world scenarios where stability and smooth transitions are critical.
5.5.2 Practical Benefits
Property 5.5 Improved Generalization: An analytic universal approximator tends to generalize better to unseen data, as it avoids overfitting to noise or specific artifacts in the training data by leveraging smooth approximations.
In practice, smooth approximations help models capture the underlying patterns of the data without overfitting, making them more robust to noisy or incomplete datasets. This leads to improved performance on real-world tasks, particularly in domains with complex, non-linear relationships.
Property 5.6 Efficiency in Optimization: Architectures using analytic functions provide smooth loss landscapes, allowing for more efficient optimization. This reduces the likelihood of getting trapped in local minima and speeds up convergence during training.
Since the gradients of analytic functions are stable and predictable, optimization techniques such as gradient descent perform more effectively. In practical terms, this means that training models with analytic approximators can be faster and require fewer iterations, making them suitable for large-scale machine learning tasks.
Property 5.7 Better Interpretability and Differentiability: Analytic universal approximators offer a smooth, interpretable functional form, which makes them easier to understand and analyze. They are also well-suited for tasks requiring sensitivity analysis or gradient-based algorithms.
Because analytic functions are smooth and continuous, they are more interpretable. Additionally, their differentiability allows for the use of gradient-based optimization methods in tasks such as reinforcement learning or control systems, where differentiability is critical.
Property 5.8 Robustness to Numerical Precision: Analytic functions, due to their smoothness, are less prone to numerical instability, making them more reliable when implemented on hardware with limited floating-point precision.
This is particularly important for edge computing or embedded systems, where precision is limited. Analytic universal approximators like \operatorname{TeLU}(x) provide stable computations, reducing the risk of errors caused by floating-point approximations.
Property 5.9 Applicability in Physics-Informed Models: Many real-world problems, such as those in physics or engineering, are governed by smooth, differentiable equations. Analytic approximators are well-suited for these domains because they can represent smooth transitions and capture physical laws more accurately.
In practical applications like fluid dynamics, electromagnetism, or mechanical systems, analytic functions provide a natural fit due to their inherent smoothness and ability to model physical systems with high precision.
Property 5.10 Smoothness in Adversarial Settings: Analytic functions are naturally resistant to adversarial perturbations because small changes in input lead to small, predictable changes in output, making it harder for adversarial examples to exploit vulnerabilities in the model.
In adversarial machine learning, smoothness helps protect the model from being tricked by small, carefully crafted perturbations. This makes architectures based on analytic approximators more robust against adversarial attacks, enhancing security and reliability in sensitive applications.
Thus, architectures that utilize analytic universal approximators, such as \operatorname{TeLU}(x)=x \cdot \tanh\left(e^{x}\right) , offer significant theoretical and practical advantages. They guarantee smooth, continuous, and well-behaved approximations, provide better generalization, and are computationally efficient. These properties make analytic universal approximators highly valuable for a wide range of applications, from scientific computing and optimization to adversarial robustness and interpretability.
5.6 Stability
In deep neural networks, stability refers to the network’s ability to produce consistent and reliable outputs when exposed to small perturbations or changes in input data, weights, or training conditions. Theoretical stability is crucial for ensuring that the network generalizes well to unseen data and avoids issues like exploding or vanishing gradients, which can lead to erratic behavior during training. Without proper attention to stability, networks may suffer from poor convergence, unstable training dynamics, or a lack of reliability in deployment.
Activation functions play a crucial role in influencing the stability of deep neural networks by determining how neurons respond to input signals and propagate gradients during training. Functions like ReLU [7] mitigate the vanishing gradient problem, enabling better gradient flow in deep networks, but can lead to dead neurons, causing instability. Conversely, sigmoid and tanh activations are bounded functions that may limit instability, but they suffer from vanishing gradients, which leads to slow convergence and unstable learning. More recent activation functions, such as Leaky ReLU [10] and ELU [12], are designed to balance gradient flow and prevent instability by reducing output bias, thereby approaching Fisher ideal learning. Smooth activation functions have also been observed to promote the learning stability of a model, as demonstrated by Zheng et al. [76] and Ramachandran et al. [41].
The vanishing gradient and exploding gradient problems directly impact the stability of deep neural networks by making weight updates during training inefficient or unstable. In the vanishing gradient problem, small gradients in deep layers prevent effective weight updates, causing the network to learn slowly or stop learning altogether. On the other hand, exploding gradients cause large weight updates, leading to oscillations and divergent training. By addressing these issues with activation functions that mitigate both underflow and overflow, deep networks maintain stable learning, ensuring consistent weight updates and convergence.
Activation functions that exhibit near-zero expected output contribute to neural network stability by reducing output bias, ensuring that neurons do not consistently produce large positive or negative values. This mitigates the risk of saturating activation functions like sigmoid or tanh, where extreme outputs lead to vanishing gradients. By centering activations around zero, these functions allow for more balanced weight updates during backpropagation. This behavior aligns with Fisher’s Ideal Learning, which suggests that stable learning arises when gradients provide unbiased information about the parameter space, facilitating efficient and consistent training.
Activation functions that are smooth and analytic contribute to learning stability in neural networks by ensuring that gradients change gradually and predictably during training. The smoothness of these functions avoids sudden jumps in gradient values, which can destabilize learning by causing erratic weight updates. Their analytic properties enable continuous differentiability, which is compatible with second-order optimization techniques that utilize the curvature of the loss landscape to promote stability of learning. These characteristics allow for more consistent convergence, especially in deep networks, by preserving valuable gradient information throughout the layers through the dispersion of activations.
Activation functions that are simple to calculate, such as ReLU and Leaky ReLU, contribute to learning stability by reducing the computational complexity of training deep neural networks. Simpler calculations minimize the risk of numerical errors like underflow or overflow, which can occur when nonlinear operations are repeatedly applied for the activation of a single neuron. Once underflow or overflow emerges within a term of a complex function, outer functions are likely to be driven towards additional underflow or overflow. Subsequent layers may also propagate the numerical instability to all the outputs, directly impacting inference. By avoiding these issues, simple activation functions help maintain numerical precision and prevent cascading errors in gradient computations, ensuring more stable learning, especially in deep architectures.
We present a detailed summary of the heuristics that we believe most significantly influence the learning stability of a model, focusing on specific properties of activation functions. First, we consider the ‘Nesting’ column, which reflects the complexity introduced by the number of nested non-linearities within an activation function. Higher levels of nested non-linearities can adversely affect the numerical stability of a model, particularly as the network depth increases. Next, the ‘Vanishing’ column assesses the tendency of an activation function to approach saturation and deactivation. If a function’s derivative exhibits exponential decay as inputs grow towards -\infty, it can result in a greater occurrence of vanishing gradients, hindering effective backpropagation [27] and slowing down learning in deep networks. In ReLU, this effect is exaggerated, with derivatives of 0 for any negative input, in what is known as the dying ReLU problem. Lu et al. [32] discuss the strong effects of the dying ReLU problem in deep and narrow architectures, which are often necessary for efficient learning of complex behavior. The asymptotic decay of each activation function as inputs grow negative can be viewed in Table 1 within Subsection 5.1.
Our ‘Smooth’ column refers to the non-zero differentiability of the activation function across its entire input domain. A function that is continuously differentiable at all points promotes smoother gradient flow and reduces potential disruptions during the training process. ReLU and ELU are not smooth functions because their piecewise definitions are not infinitely differentiable at the origin, where the linear positive segment meets the negative segment. Meanwhile, the non-monotonic functions are all smooth, as they can be differentiated indefinitely at all points without degenerating to zero. Finally, the ‘Output Bias’ column evaluates the expected output of the activation function when the input follows a standard Gaussian distribution. This is evaluated as \int_{-\infty}^{\infty} p(x) \cdot f(x) \, dx, where p(x) is the unit Gaussian distribution \mathcal{N}(0,1), or \frac{1}{\sqrt{2\pi}} \cdot e^{-\frac{x^{2}}{2}}, and f(x) is the mathematical definition of each activation function. This measure helps us understand whether the function introduces an inherent bias in its outputs, which can affect weight initialization and, ultimately, learning dynamics. The closer a function’s output bias is to zero, the more it enhances the model’s convergence efficiency [28].
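To illustrate how the ‘Output Bias’ column can be estimated in practice, the hedged sketch below approximates E[f(X)] for X \sim \mathcal{N}(0,1) by Monte Carlo sampling; the function names and sample count are illustrative, and exact table values would come from numerical quadrature of the integral above.

```python
import torch

def output_bias(f, n=1_000_000, seed=0):
    """Monte Carlo estimate of E[f(X)] with X ~ N(0, 1)."""
    g = torch.Generator().manual_seed(seed)
    x = torch.randn(n, generator=g)
    return f(x).mean().item()

telu = lambda x: x * torch.tanh(torch.exp(x))
relu = lambda x: torch.clamp(x, min=0.0)

print("TeLU output bias:", output_bias(telu))  # expected to lie closer to zero
print("ReLU output bias:", output_bias(relu))  # E[ReLU(X)] = 1/sqrt(2*pi) ≈ 0.399
```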
We do not explicitly list exploding gradients in the Table, as their behavior requires a more nuanced, detailed analysis. While the activation functions in our Table are primarily linear units, behaving linearly as inputs grow large, the real advantage of our smooth non-monotonic functions (TeLU, SiLU, GELU, Mish, Logish, and Smish) emerges at small positive inputs. These functions exhibit sub-linear growth, meaning their outputs increase more slowly than their inputs under these conditions. This sub-linearity is critical in controlling gradient behavior, as it naturally prevents the derivatives from becoming excessively large during backpropagation. By reducing the likelihood of gradients exploding, non-monotonic activation functions contribute significantly to greater learning stability. This stability is essential for deep networks, particularly when training with complex data or architectures that are prone to issues with gradient propagation. As exploding gradients can severely hinder convergence and result in unstable weight updates, choosing activation functions that naturally limit gradient magnitudes offers a practical solution to this widespread problem.
Mitigating exploding gradients is not only about achieving smoother training but also about improving the network’s overall learning efficiency. When gradients are kept within a controlled range, the model can train without sudden fluctuations in the loss function, resulting in more stable and predictable weight updates. This stability helps the optimization process stay on course, leading to faster convergence and better generalization to new, unseen data. Activation functions like TeLU and its non-monotonic counterparts play a crucial role in promoting this stability by naturally preventing the gradients from becoming excessively large. By employing smooth, analytic functions like TeLU, these architectures inherently avoid issues such as abrupt changes or sharp boundaries in the model’s output. This makes them particularly well-suited for tasks where continuity and smoothness are critical, such as in control systems, robotics, or financial modeling, where precise, stable responses are essential for success.
This can be theoretically shown as follows:
Theorem 5.3 Let f(x) \in C\left(\mathbb{R}^{n}\right) be a continuous function. Suppose we approximate f(x) using two architectures: one based on the analytic universal approximator \operatorname{TeLU}(x)=x \cdot \tanh\left(e^{x}\right), and the other based on the non-analytic activation function \operatorname{ReLU}(x)=\max(0, x). The analytic universal approximator provides the following mathematically provable advantages:
- Smoothness and Continuity of Derivatives: The analytic universal approximator is infinitely differentiable, whereas ReLU is piecewise linear and not differentiable at x=0 .
- Gradient Propagation Stability: In deep networks, the gradient propagation using analytic functions remains bounded and stable, whereas ReLU introduces points where gradients vanish or explode, leading to poor training dynamics.
- Error Propagation in Deep Networks: Analytic functions provide smoother error propagation, leading to better control of gradient norms, while ReLU propagates discontinuous gradients, leading to unstable optimization.
Proof: We will prove each point in turn.
1. Smoothness and Continuity of Derivatives:
As demonstrated in Subsection 5.5, \operatorname{TeLU}(x)=x \cdot \tanh\left(e^{x}\right) is analytic. Being analytic, TeLU can be expressed as a convergent power series and is therefore infinitely differentiable at all points.
2. Gradient Propagation Stability:
Consider a deep network with L layers. Let g^{(l)}(x) represent the gradient of the output with respect to the input at layer l , defined as:
g^{(l)}(x)=\frac{d}{d x}\left(W^{(l)} \cdot a^{(l-1)}(x)\right),
where W^{(l)} is the weight matrix and a^{(l-1)}(x) is the activation of layer l-1 .
Using an analytic activation function like \operatorname{Te} L U(x) , the gradient propagation remains continuous and smooth:
g^{(l)}(x)=W^{(l)} \cdot \frac{d}{dx} \operatorname{TeLU}(x)=W^{(l)} \cdot\left(\tanh\left(e^{x}\right)+x \cdot e^{x} \cdot \operatorname{sech}^{2}\left(e^{x}\right)\right),
which is bounded and continuous for all x \in \mathbb{R}. The gradient neither vanishes nor explodes, as the derivative of \operatorname{TeLU}(x) is smooth for all x.
In contrast, for \operatorname{ReLU}(x) , the gradient at layer l is:
g^{(l)}(x)=W^{(l)} \cdot \frac{d}{dx} \operatorname{ReLU}(x)=W^{(l)} \cdot \mathbb{1}(x>0).
Here, \mathbb{1}(x>0) evaluates to 1 if x>0 and 0 if x \leq 0. At x=0, the gradient is discontinuous or zero, leading to two major issues: (1) vanishing gradients when a large number of neurons have negative input, causing the gradient to become zero and halting learning; and (2) exploding gradients in deep layers, especially when the weights W^{(l)} are large, causing sharp changes in the gradient due to the piecewise linear nature of ReLU.
Thus, analytic universal approximators offer more stable gradient propagation across deep networks.
3. Error Propagation in Deep Networks:
Let \mathcal{L}(x) represent the loss function of the network, and let the gradient of the loss with respect to the input at the final layer be \frac{\partial \mathcal{L}}{\partial x} .
For the analytic activation \operatorname{TeLU}(x), the gradient propagation follows from the smooth nature of \operatorname{TeLU}:
\frac{\partial \mathcal{L}}{\partial x}=\frac{\partial \mathcal{L}}{\partial a^{(L)}} \cdot \frac{d}{dx} \operatorname{TeLU}(x)=\frac{\partial \mathcal{L}}{\partial a^{(L)}} \cdot\left(\tanh\left(e^{x}\right)+x \cdot e^{x} \cdot \operatorname{sech}^{2}\left(e^{x}\right)\right).
The gradient is well-controlled due to the smooth behavior of \operatorname{TeLU}(x), ensuring that error propagation through layers remains stable and predictable.
For \operatorname{ReLU}(x) , the error propagation becomes problematic:
\frac{\partial \mathcal{L}}{\partial x}=\frac{\partial \mathcal{L}}{\partial a^{(L)}} \cdot \frac{d}{dx} \operatorname{ReLU}(x)=\frac{\partial \mathcal{L}}{\partial a^{(L)}} \cdot \mathbb{1}(x>0).
This creates sharp, discontinuous changes in the error gradient when x crosses zero, leading to unpredictable error propagation. Furthermore, the gradient can become zero when large numbers of neurons are inactive (i.e., in the dead zone), effectively halting the learning process for those weights.
Thus, analytic activations offer more stable and predictable error propagation in deep networks.
Smoothness and continuity of derivatives, gradient propagation stability, and error propagation stability are crucial for ensuring the effective training of deep neural networks. Smooth activation functions with continuous derivatives allow for stable and predictable gradient updates, preventing erratic changes during backpropagation. This, in turn, promotes gradient propagation stability, which is essential for maintaining useful gradient values across many layers, avoiding the problems of vanishing or exploding gradients. Stable error propagation further ensures that the network can consistently learn from its mistakes, leading to more reliable convergence and better generalization to unseen data. Together, these properties form the foundation of stable and efficient learning in deep networks.
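The bounded derivative used in the proof, \tanh\left(e^{x}\right)+x \cdot e^{x} \cdot \operatorname{sech}^{2}\left(e^{x}\right), can be cross-checked against automatic differentiation; the sketch below is an illustrative verification rather than part of the formal argument.

```python
import torch

x = torch.linspace(-20.0, 5.0, steps=2000, dtype=torch.float64, requires_grad=True)
y = x * torch.tanh(torch.exp(x))
(grad,) = torch.autograd.grad(y.sum(), x)

# Closed-form derivative from the proof: tanh(e^x) + x * e^x * sech^2(e^x).
with torch.no_grad():
    ex = torch.exp(x)
    closed = torch.tanh(ex) + x * ex / torch.cosh(ex) ** 2

print(torch.allclose(grad, closed))          # autograd agrees with the closed form
print(grad.min().item(), grad.max().item())  # the derivative stays bounded on this range
```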
6 Experimental Validation
This section presents a detailed experimental validation of the TeLU activation function, building on the theoretical foundation established in the previous section. Here, we provide empirical results that support and confirm TeLU’s proposed benefits, such as enhanced gradient behavior, improved convergence speed, computational efficiency, and stability. By testing TeLU across various neural network architectures and datasets, we aim to demonstrate its effectiveness in addressing challenges like inefficient learning, the vanishing gradient problem, and computational complexity. These experiments serve to bridge the gap between theory and practice, offering concrete evidence of TeLU’s potential as a superior alternative to traditional activation functions.
6.1 Persistent Gradients of the Saturation Region
In Subsection 5.1, we examined the rate at which the derivatives of various activation functions vanish as inputs approach x \rightarrow-\infty . By categorizing each function’s vanishing gradient behavior using asymptotic decay classes, we identified TeLU as belonging to the class with the most persistent gradients, \Theta\left(\frac{x}{e^{x}}\right) . Additionally, we noted that TeLU maintains stronger derivatives compared to other nonlinearities, particularly at the onset of its deactivation region.
6.1.1 FashionMNIST Dataset on MLP with Negative Biases
We demonstrate the benefit of these characteristics experimentally by attempting to train Multi-Layer Perceptrons (MLPs) whose neurons have been initialized with a bias of -10. The large negative bias drives the activation of each neuron into the saturation region of lower-bounded nonlinearities. By introducing this large negative bias, we evaluate the ability of the architecture to recover from a point where backpropagation yields vanishing gradients. This extreme case lets us focus on the direct effects of each activation function’s asymptotic saturation rate. We utilize two hidden layers and train on the FashionMNIST dataset [46] for 200 epochs. We optimize with the SGD optimizer using a momentum of 0.9 and a weight decay of 0.0005. This reduced hidden layer count effectively reduces the number of epochs needed to witness activation function recovery. We outline our configuration in Table 8.
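A hedged sketch of this bias-initialization setup is given below; the hidden width and learning rate are placeholders, since the exact values come from Table 8, and the TeLU module is redefined inline for self-containment.

```python
import torch
import torch.nn as nn

class TeLU(nn.Module):
    def forward(self, x):
        return x * torch.tanh(torch.exp(x))

def make_mlp(act, width=128, bias_init=-10.0):
    """Two-hidden-layer MLP whose hidden biases start deep in the saturation region."""
    model = nn.Sequential(
        nn.Flatten(),
        nn.Linear(28 * 28, width), act(),
        nn.Linear(width, width), act(),
        nn.Linear(width, 10),
    )
    for layer in model:
        if isinstance(layer, nn.Linear) and layer.out_features != 10:
            nn.init.constant_(layer.bias, bias_init)  # push pre-activations negative
    return model

model = make_mlp(TeLU)
optimizer = torch.optim.SGD(model.parameters(), lr=0.01,  # placeholder learning rate
                            momentum=0.9, weight_decay=0.0005)
```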
The presence of weight decay does not significantly impact recovery in practice, as the bias is what is keeping the neuron activations in their saturation regions. We display the validation accuracies over the 200 epochs in Figure 13 and testing accuracies in Table 9, with each data point averaged over 10 trials. We observe that TeLU tends to begin its successful stride towards recovery before any other nonlinearity, due to having stronger inactive gradients near the origin where expected inputs are likely to fall. We observe Mish, SiLU, Smish, and Logish begin their recovery sooner than ELU, GELU, and ReLU, which is what we would expect to see from our analysis. Furthermore, TeLU’s accuracy approaches a competitive level at a faster rate than do the competing activation functions.
We demonstrate the consistency of this experiment by initializing the learned bias parameters to -20, pushing neurons further into their saturation regions. We plot the validation accuracies for each epoch, averaged over 9 trials, in Figure 14. We notice that TeLU continues to be the first to recover from vanishing gradients, giving further empirical evidence that it is less prone to stalling learning when TeLU neurons are pushed into their inactive regions. We also note that the architectures begin their recovery in the same order as before, adding to the relevance of the asymptotic decay classifications: Mish, SiLU, Logish, and Smish, which belong to the same asymptotic decay class as TeLU but have weaker gradients near the origin, begin their recovery before ELU, GELU, and ReLU. We summarize our test accuracies, averaged over 10 trials per data point, for the 200-epoch experiments in Table 9. We recognize that within the 200 epochs, TeLU architectures generalize to a superior testing accuracy in both cases.
For biases initialized to -20, we notice that 200 epochs may not be enough time for GELU to begin its recovery. Therefore, we increase the number of epochs to 300 to gain further insight into the remainder of each architecture’s recovery. Figure 15 plots the 300-epoch journey of each architecture, averaged over 5 trials. Again, we see nonlinearities begin their recovery in the order of TeLU, Mish, SiLU, Logish, Smish, ELU, and GELU. It is worth noting that although the order of recovery follows our asymptotic classifications, the rate of recovery once begun depends on other factors we have not yet accounted for. We hypothesize that TeLU’s near-linearity and ELU’s linearity throughout positive inputs help both shoot up to their concluding validation accuracies faster than other nonlinearities. Throughout all trials, ReLU exhibits no ability to recover, as is expected from its zero-gradient off state, which characterizes its dying neuron problem.
In practice, the Hyperbolic Tangent is generally preferred over the Logistic Sigmoid for preventing vanishing gradient problems when properly regularized. Similarly, we believe that TeLU and SiLU offer more persistent gradients with proper regularization, which may explain why Mish, Logish, and Smish are often considered to have a stronger self-regularizing effect. We capitalize on the practical implications of slow gradient decay by initializing neuron biases in a multi-layer perceptron (MLP) to varying negative values. We then train on the MNIST dataset to test the ability of neural networks employing different hidden nonlinearities to reactivate their inactive neurons. To simplify results, we select nonlinearities that have shown a difference in their decay rates, numerical underflow regions, or inactive gradient calculations. Therefore, we use the ReLU, ELU, GELU, Mish, and TeLU nonlinearities.
6.1.2 CIFAR-10 Dataset with DenseNet CNN
We hypothesize that the pronounced vanishing gradient effects associated with GELU and ReLU activation functions may be particularly harmful in CNN architectures lacking bypass connections. CNNs process local information from input images through convolution, where small square filters are applied around each pixel. This local information is passed through multiple convolution layers, often combined with pooling operations to reduce the signal’s dimensionality. In architectures like DenseNet [77], fully connected neural blocks are interleaved between convolutional steps. Unlike Residual Networks (ResNet) [78], DenseNets do not have bypass connections to facilitate proper gradient flow to all layers. Consequently, vanishing gradients can significantly impede the learning process of such architectures. In extreme cases, this stalling can have lasting effects, ultimately reducing the model’s ability to fit the data. Therefore, we anticipate that activation functions prone to “dying neurons,” such as ReLU and GELU, may severely limit the performance of DenseNets.
To evaluate our hypothesis, we use a CNN without residual connections to compare the testing accuracies of architectures that employ TeLU, GELU, and ReLU as their hidden layer activations. Specifically, we implement TeLU, GELU, and ReLU versions of the DenseNet121 architecture [77] and train them on the CIFAR10 dataset [47] for 200 epochs each. The dataset is partitioned into 40,000 training images, 10,000 validation images, and 10,000 testing images. We use an SGD optimizer with a base learning rate of 0.1 , along with a learning rate scheduler that reduces the learning rate by a factor of 0.2 every 60 epochs. Additionally, we apply L2 regularization with a weight decay coefficient of 0.0005 and perform mini-batch gradient descent [79] using a batch size of 128 . Further experimental configurations are detailed in Table 10.
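One way to construct the TeLU variant is to recursively swap the nn.ReLU modules of torchvision’s DenseNet121 for a TeLU module, as sketched below; this is an illustrative approach, and any ReLU applied functionally inside the model’s forward pass would not be covered by module replacement.

```python
import torch
import torch.nn as nn
from torchvision.models import densenet121

class TeLU(nn.Module):
    def forward(self, x):
        return x * torch.tanh(torch.exp(x))

def swap_relu(module: nn.Module, make_act):
    """Recursively replace every nn.ReLU submodule with a freshly built activation."""
    for name, child in module.named_children():
        if isinstance(child, nn.ReLU):
            setattr(module, name, make_act())
        else:
            swap_relu(child, make_act)

model = densenet121(num_classes=10)  # CIFAR-10 has 10 output classes
swap_relu(model, TeLU)
```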
Table 11 presents the test accuracies of each architecture, averaged over 10 trials, while Figure 16 shows the progression of validation accuracy for the TeLU, GELU, and ReLU architectures, also averaged over 10 trials. Notably, TeLU outperforms both GELU and ReLU, highlighting the advantage of its persistent gradients in the deactivation region, which more effectively mitigate the vanishing gradient problem. This persistent gradient flow allows for more robust learning, especially in architectures lacking residual connections. These results are consistent with the findings of Dubey et al. [19], which highlight the necessity of residual connections for the success of architectures using GELU or ReLU activation functions. In contrast, TeLU effectively addresses the vanishing gradient and numerical underflow issues related to the dying neuron problem, eliminating the need for residual connections to maintain high performance. As a result, TeLU offers greater architectural flexibility by reducing dependencies on specific model configurations.
6.2 Near-Linearity of Active Region
In Subsection 5.2, we investigated how effectively various linear units approximate a linear function within their active regions. This was done by calculating the integral difference between each activation function and the line representing its slope as x \rightarrow \infty . Among the smooth linear units, TeLU was found to minimize this integral difference. While both ReLU and ELU are defined as the identity for positive inputs, we argue that TeLU offers superior learning efficiency due to its persistent gradients, which help mitigate the vanishing gradient problem and prevent learning slowdowns.
6.2.1 ImageNet Dataset with ResNet18 Architecture
We first test the practical applicability of the theoretical improvement in convergence rate on the ImageNet dataset [80]. We directly implement PyTorch’s example ImageNet training on the ResNet18 architecture [78]. Residual Neural Networks (ResNets) are a type of Convolutional Neural Network (CNN) [81] that define bypass connections between hidden convolutional layers to treat effective model depth as a learnable parameter and provide strong gradients to earlier neural layers during backpropagation. We use the predefined base learning rate of 0.1, weight decay coefficient of 0.0001, and minibatch size of 256. We optimize with Stochastic Gradient Descent (SGD) using a momentum of 0.9 over 90 epochs, dividing our learning rate by 10 every 30 epochs. Due to the computational requirements and the size of the ImageNet dataset, we only ran experiments with the ReLU and TeLU nonlinearities. This configuration is detailed in Table 12. Our first trial was run with a seed of 1111, but the following two trials were unseeded due to the associated performance drop. We plot the averaged validation accuracy metrics over the trials in Figure 17. Additionally, we show the test accuracy in Table 13. We observe that TeLU outperforms ReLU with baseline hyperparameters when training for 90 epochs.
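The optimizer and step schedule described above correspond to a standard PyTorch setup; the minimal sketch below (data loading and the training loop body omitted) mirrors those hyperparameters.

```python
import torch
from torchvision.models import resnet18

model = resnet18(num_classes=1000)
optimizer = torch.optim.SGD(model.parameters(), lr=0.1,
                            momentum=0.9, weight_decay=1e-4)
# Divide the learning rate by 10 every 30 epochs over the 90-epoch schedule.
scheduler = torch.optim.lr_scheduler.StepLR(optimizer, step_size=30, gamma=0.1)

for epoch in range(90):
    # ... one epoch of ImageNet training and validation would run here ...
    scheduler.step()
```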
Given that the unseeded trials still required 14 days to train on 4 GTX 1080 Ti Graphics Processing Units (GPUs), we wish to take advantage of TeLU’s theoretical improvements in convergence speed. Therefore, we lower the number of training epochs from 90 to 50 and expect to see comparable results. To keep the learning rate scheduler roughly proportional, we lowered the learning rate decay period from 30 to 20 epochs. We show the average validation accuracy metrics over the 50 epochs in Figure 18. We also note a comparable performance of both activation function architectures, with TeLU's accuracy remaining largely unchanged by the shortened schedule, as shown in Table 13. We attribute this to TeLU’s distant saturation region helping its neurons stay responsive across forward passes, as well as TeLU’s strong gradients for positive inputs.
6.2.2 Text8 Dataset with Dynamic Pooling Transformer
We then proceeded to test TeLU’s baseline convergence rate on a transformer [3] architecture. For performance improvements, we utilized a dynamic pooling transformer architecture [82] with 8 heads of attention. We set the fully connected and attention neurons to drop out at a rate of 0.12 during training. We utilized a base learning rate of 0.0002 that updated according to a 200,000 step cosine scheduler with 4,000 initial warm-up steps. We used the ADAM optimizer [83], given that we did not employ weight decay. We detail our configuration in Table 14. With this configuration, we ran 3 trials and recorded the averaged progression of validation error metrics over the 20 evaluation steps in Figure 20.
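The 200,000-step cosine schedule with 4,000 warm-up steps can be expressed as a per-step LambdaLR; the sketch below shows one common formulation and is not the exact scheduler of the dynamic pooling Transformer codebase.

```python
import math
import torch

def cosine_with_warmup(warmup_steps=4_000, total_steps=200_000):
    def lr_lambda(step):
        if step < warmup_steps:
            return step / max(1, warmup_steps)       # linear warm-up to the base LR
        progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
        return 0.5 * (1.0 + math.cos(math.pi * min(1.0, progress)))  # cosine decay
    return lr_lambda

params = [torch.nn.Parameter(torch.zeros(1))]        # stand-in for model parameters
optimizer = torch.optim.Adam(params, lr=2e-4)        # base learning rate of 0.0002
scheduler = torch.optim.lr_scheduler.LambdaLR(optimizer, lr_lambda=cosine_with_warmup())
```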
As shown in Figure 20, GELU initially achieves lower validation loss during the early evaluation steps. However, TeLU’s faster convergence allows its loss to drop below GELU’s over time. Notably, TeLU’s validation loss stabilizes by evaluation step 15, whereas GELU begins to experience increasing validation error beyond step 14. These validation improvements with TeLU translate into superior test loss, as seen in Table 15. Furthermore, TeLU’s stability at favorable validation accuracy results in a lower standard deviation in test loss compared to GELU.
To demonstrate TeLU’s improved convergence rate further, we reduced the training steps from 200,000 to 100,000 and updated the cosine scheduler to a period of 150,000 . This was empirically chosen to allow the architectures to continue applying meaningful updates during the last few training steps. With a shorter experiment, our budget now allows for a ReLU architecture to be included. We view the progression of the validation error in Figure 21 and the test error summary in Table 15, with data averaged over 9 trials.
The shortening of the cosine scheduler period [84] enables TeLU to outperform GELU at earlier evaluation steps, as anticipated. ReLU, like GELU, displays strong gradients in the active region but converges more slowly. This slower convergence can likely be attributed to the dying neuron problem, which inhibits learning in ReLU architectures. As a result, both ReLU’s dying neuron issue and GELU’s sub-linearity in the active region contribute to the slower and less consistent convergence observed.
6.3 Runtime Efficiency
Using the function definitions provided in Table 4, we measured the computational efficiency of forward and backward passes for a neuron employing the baseline implementation of each nonlinearity. To ensure accurate delay measurements, we first minimized operating system scheduler disturbances [85] [86] by running only essential Windows operating system programs on the system in addition to the experiment itself. Our host system had 12 Intel i7 processors, 32 GB of RAM, and an NVIDIA RTX-2070 Graphics Processing Unit (GPU) [87]. We defined an input vector of 10^{6} randomly initialized 32-bit floating point values and performed a sequence of forward and backward passes over 10^{6} iterations. This approach helped evenly distribute any remaining operating system interrupts across each nonlinearity’s measurement. We repeated the experiment after its completion to ensure that the sequential execution of each function’s forward and backward passes did not bias our measurements. This repeat experiment confirmed that system performance differences between the times of each nonlinearity’s calculations were negligible. The delay of each activation function is recorded in Table 16, where we observed that TeLU’s performance was second only to ReLU’s.
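A hedged sketch of the timing procedure is shown below with a reduced iteration count for brevity; the function names and benchmarking harness are illustrative, not the exact scripts used for Tables 16 through 18.

```python
import time
import torch

def bench(fn, x, iters=1_000):
    """Average wall-clock time of a fused forward + backward pass over `iters` runs."""
    x = x.clone().requires_grad_(True)
    if x.is_cuda:
        torch.cuda.synchronize()
    start = time.perf_counter()
    for _ in range(iters):
        y = fn(x)
        y.sum().backward()
        x.grad = None  # reset the accumulated gradient between iterations
    if x.is_cuda:
        torch.cuda.synchronize()
    return (time.perf_counter() - start) / iters

x = torch.randn(10**6, dtype=torch.float32)
print("ReLU:", bench(torch.relu, x))
print("TeLU:", bench(lambda t: t * torch.tanh(torch.exp(t)), x))
```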
We repeated the baseline definition experiment on two Linux systems, each equipped with 64 GB of RAM. One system featured an A100 GPU, while the other used a 1080 Ti GPU. This time, we used input vectors of lengths 10^{6} and 10^{7} to ensure that each GPU would require multiple operations to process the larger 32-bit floating point input vector. The timing comparisons between these experiments were proportional, as seen in Tables 17 and 18, indicating minimal variance in nonlinearity delays between the two systems. The concise formulation of TeLU allows for significant computational efficiency improvements over other smooth functions, making TeLU-based architectures more scalable and leading to reduced training times. This speedup in training suggests not only an improvement in TeLU’s convergence speed but also enhanced energy efficiency during operation.
This experiment was repeated for 10^{7} iterations of input vectors of size 10^{6}, with results shown in Table 18. Across all cases, TeLU demonstrates substantial performance improvements over all nonlinearities except ReLU. Although ReLU is faster to compute, TeLU compensates with faster convergence, achieving high performance with fewer training iterations. These findings are particularly meaningful, as they highlight TeLU’s superior computational efficiency compared to other smooth activation functions. This efficiency plays a critical role in deep neural networks because activation functions heavily influence the overall training speed. Faster training times not only enhance productivity but also enable the use of larger models, increased data throughput, and more frequent experimentation. Moreover, computationally efficient activation functions can reduce hardware demands, lower energy consumption, and facilitate real-time applications, making them vital for scalable and sustainable AI solutions.
6.4 ReLU Compatibility
Optimizing hyperparameter selection for training deep learning models is a time-intensive process. The time required can span weeks or even months, depending on the number of hyperparameters involved, the complexity of the architecture, and the duration of the training. The challenge arises from the fact that hyperparameter values can impact model accuracy, overfitting, convergence speed, and stability in nonlinear and interdependent ways. Furthermore, when designing experiments from scratch, each new setup demands its own lengthy tuning process, as we encountered in our initial experiments.
To enhance our productivity and streamline the process of finding competitive configurations, we began utilizing hyperparameter settings sourced from GitHub repositories or published papers. These configurations were primarily designed for ReLU-based architectures. Fortunately, we discovered that TeLU remained competitive without requiring any modifications to these ReLU configurations. In practice, adapting ReLU architectures to TeLU architectures proved to be straightforward: simply define the TeLU nonlinearity and substitute it for the ReLU nonlinearity in the hidden layers. This observed compatibility suggests that ReLU and TeLU perform well within similar hyperparameter subspaces, which could facilitate the adoption of TeLU. Given that ReLU is arguably the most popular activation function for hidden neurons, it is likely that data scientists will initially test a novel nonlinearity within an existing ReLU architecture using ReLU configurations. Therefore, TeLU’s similarity to ReLU across different architectures could help it make a positive first impression.
6.4.1 FashionMNIST Dataset with MLP with Varied Weight Decay
To experimentally demonstrate this compatibility, we designed MLP architectures utilizing various activation functions, including ReLU, ELU, GELU, SiLU, Mish, Logish, Smish, and TeLU. Each architecture was defined with 8 hidden layers, each with a width of 128 neurons. The batch size was set to 128, and the learning rate was fixed at 0.005. We used the FashionMNIST dataset, dividing it into training, validation, and testing partitions with 50,000; 10,000; and 10,000 samples, respectively. Each configuration was tested over 6 trials across 6 different weight decay values: 0.0001, 0.0003, 0.0005, 0.001, 0.003, and 0.005. We chose to focus on weight decay because these activation functions seemed to differ most in their degree of self-regularization. The complete experimental configuration can be viewed in Table 20. The resulting accuracies for each configuration are presented in Figure 22.
It is observed that TeLU exhibits superior accuracy when using a weight decay coefficient that is optimal for ReLU. ReLU and TeLU nonlinearities generally benefit from higher levels of L2 regularization, while Logish and Smish require minimal or no weight decay to maintain stability. ReLU likely benefits more from increased weight decay due to its tendency to cause neurons to become permanently inactive during training. A larger weight decay helps mitigate the dying ReLU problem by keeping preactivations close to zero, which allows neurons to stay near the threshold needed to activate or deactivate based on different input signals. In contrast, nonlinearities with complex nested subfunctions, such as Smish, experience instability at weight decay levels that are beneficial for ReLU. When external regularization pushes the activation of preceding Smish neurons towards zero, subsequent layers receive inputs close to zero. This can lead to numerical underflow as the nested nonlinearities are calculated from the inside out, resulting in dying neurons that cause subsequent layers to produce zero activation, further propagating numerical instability. Overall, we observe a strong correlation between the mathematical complexity of a nonlinearity and its ideal level of L2 regularization, with the exception of SiLU and ELU.
We repeat this experiment for weight decay coefficient values 0.0004, 0.0006, 0.0008, 0.001, 0.0012, 0.0014, and 0.0016. As shown in Figure 23, the same key observation persists: where ReLU performs optimally, TeLU outperforms it. This is significant because ReLU is widely adopted in deep neural networks, meaning many existing projects rely on its configuration. With TeLU acting as a straightforward substitute for ReLU, researchers and developers can seamlessly switch to TeLU and immediately benefit from faster convergence and enhanced stability without requiring additional tuning, leading to greater productivity and minimal adjustment effort.
6.4.2 CIFAR-10 Dataset with SqueezeNext with Varied Optimizers
To further validate the compatibility between ReLU and TeLU configurations, we evaluated the performance of various nonlinearities when trained with hyperparameters optimized for ReLU. We utilized a SqueezeNext CNN architecture with ReLU as the activation function in its hidden layers. We then fine-tuned the learning rate, weight decay, and learning rate scheduler scaling hyperparameters for each of four optimization methods. We utilized Minibatch Stochastic Gradient Descent (SGD) [88, 89], SGD with momentum [90], AdamW [91], and RMSprop [92]. The resulting learning rate and weight decay hyperparameters for each optimizer are detailed in Table 22. The portion of the hyperparameters that stays constant across optimizer choices is detailed in Table 21. The learning rate scheduler for each optimizer scaled the learning rate according to the learning decay value once every 60 epochs. For the experiment, we set aside 10,000 of the 60,000 CIFAR-10 images to define a validation partition, trained on the remaining 50,000 images, and tested on the standard 10,000-image testing split. We then substituted ReLU with the other nonlinearities and summarized the testing accuracies for each architecture over 10 trials in Table 23.
We observe that in the momentum, AdamW, and RMSprop optimizer-oriented ReLU configurations, TeLU presents a significant improvement over ReLU. This reinforces our ongoing observation that TeLU tends to be superior to ReLU in configurations that optimize ReLU performance. The SGD-oriented ReLU configuration presented the only case where ReLU performed best out of all activation functions. We note that the SGD optimizer was the most challenging to tune due to its smaller viable hyperparameter subspace, yet it yielded the highest accuracies for the architecture being optimized once an adequate tuning was found. This behavior regarding SGD is common [94], and may help explain why TeLU did not outperform ReLU in this context. Since the success of SGD is particularly inconsistent between similar configurations, it is expected that the substitution of activation functions within an architecture adds some additional tuning costs. TeLU, however, retains comparable performance with ReLU in the SGD case and shows significant improvement over all other activation functions.
To further visualize the empirical similarity between TeLU and ReLU, we plot the progression of validation accuracies of each architecture when optimizing with the SGD and momentum-accelerated SGD optimizers in Figures 24 and 25. We find that architectures employing each nonlinearity learn in consistent patterns in both cases. ReLU, having little to no empirical self-regularization according to our observations so far, benefits from large weight decays. Therefore, we see that more self-regularized activation functions cause their architecture’s validation accuracy to saturate to a lower value in this high-regularization context. We also notice that TeLU and ELU architectures initially converge at a faster rate than their competition, similar to the persistent gradient experiments. We hypothesize that this occurs for the same reason of having strong gradients in the unbounded region of activation without suffering from dying neuron problems. Eventually, the ReLU, TeLU, and ELU architectures saturate to a similar validation accuracy, and TeLU generalizes to a superior test accuracy.
We emphasize that all of these ReLU-similarity experiments are done with configurations that are meant to favor the learning of ReLU architectures. This way, an activation function’s empirical similarity to ReLU in terms of viable configurations can be highlighted. Based on the resulting observations, we conclude that TeLU and ReLU share similar ideal hyperparameter subspaces, allowing them to perform competitively after training with the same configurations. This suggests that architectures and tuning strategies optimized for ReLU can also be effectively applied to TeLU, minimizing the cost of substituting the hidden nonlinearities. In practice, we found that TeLU performs well with ReLU hyperparameters but may achieve even better results with a 20% reduction in weight decay and learning rate. Conversely, ReLU appears to perform optimally with a 25% increase in weight decay and learning rate when applied to a TeLU configuration.
6.5 Analytic Universal Approximation
6.5.1 MNIST Dataset with Variational AutoEncoder
To highlight the practical relevance of TeLU being an analytic universal approximator, we compare its training and testing loss to that of ReLU on the data compression and expansion of MNIST samples. A Variational AutoEncoder (VAE) [95] architecture is used with encoder layers of sizes 784, 512, and 32. Unlike the standard autoencoder [96], the encoder of a VAE concludes with a probability distribution over the latent space rather than the latent space itself. From this, the forward pass of a VAE’s decoder samples from the distribution to generate a latent space representation before continuing to reconstruct the image. We apply the standard KL Divergence Loss [97] to the sampling at the start of decoding, as well as the mean squared error (MSE) [98] reconstruction loss.
We use a batch size of 100 and train ReLU and TeLU networks over 50 epochs. Table 24 summarizes these hyperparameters. In Table 25, we observe the training and testing errors of each model across varying learning rates to show that the results are consistent. ReLU, a piecewise-linear universal approximator, is not analytic and therefore does not form smooth approximations to the target representation of the task. Accordingly, we observe that TeLU reconstructions outperform ReLU reconstructions.
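The combined objective, MSE reconstruction plus the KL divergence of the approximate posterior from a unit Gaussian, is standard for a diagonal-Gaussian VAE and is sketched below; the tensor names are illustrative.

```python
import torch
import torch.nn.functional as F

def vae_loss(recon_x, x, mu, logvar):
    """MSE reconstruction loss plus the analytic KL(q(z|x) || N(0, I)) term."""
    recon = F.mse_loss(recon_x, x, reduction="sum")
    kl = -0.5 * torch.sum(1 + logvar - mu.pow(2) - logvar.exp())
    return recon + kl

def reparameterize(mu, logvar):
    """Sample z = mu + sigma * eps so that gradients flow through the encoder."""
    std = torch.exp(0.5 * logvar)
    return mu + std * torch.randn_like(std)
```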
6.5.2 Penn TreeBank Dataset with RNN Architectures
Additionally, we show that TeLU performs competitively when used as the hidden nonlinearity in Recurrent Neural Network (RNN) [99] architectures on the Penn TreeBank (PTB) [100] dataset. First, we define an Elman architecture [101], the simplest form of a recurrent neural network. In an Elman recurrent neural network, hidden neurons perform their forward pass by taking their previous activation as input, allowing for the introduction of neuron states. We set an input embedding size of 450, a hidden layer width of 650, and 2 hidden layers. We train over 40 epochs with a batch size of 20, a learning rate of 0.2, and a hidden neuron dropout rate of 0.2. We truncate the backpropagation through time (BPTT) [99] steps to 25 and perform gradient clipping to limit the absolute value of each computed gradient during the backward pass to 0.1. We compare the Perplexity (PPL) [102] of TeLU, ReLU, and Tanh architectures in Table 26 over 10 seeded trials and observe that TeLU networks offer significant improvements over networks employing ReLU and Tanh nonlinearities. These 10 trials were seeded with the first 10 whole numbers, 0 through 9. The hyperparameter values used for this Elman network experiment are detailed in Table 27.
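The per-gradient clipping described above corresponds to value clipping rather than norm clipping; a hedged sketch of one training step follows, with the embedding and decoder layers omitted and the built-in tanh cell used as a stand-in, since a TeLU Elman cell would require a custom implementation.

```python
import torch
import torch.nn as nn

# Stand-in Elman RNN; nn.RNN only offers 'tanh' or 'relu', so TeLU needs a custom cell.
rnn = nn.RNN(input_size=450, hidden_size=650, num_layers=2,
             nonlinearity="tanh", dropout=0.2)
optimizer = torch.optim.SGD(rnn.parameters(), lr=0.2)

def train_step(inputs, targets, hidden, criterion):
    optimizer.zero_grad()
    hidden = hidden.detach()               # truncate BPTT at the segment boundary
    output, hidden = rnn(inputs, hidden)
    loss = criterion(output, targets)
    loss.backward()
    # Limit the absolute value of every gradient element to 0.1, as in the experiment.
    torch.nn.utils.clip_grad_value_(rnn.parameters(), clip_value=0.1)
    optimizer.step()
    return loss.item(), hidden
```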
We follow this success by demonstrating similar improvements in Long Short-Term Memory (LSTM) RNNs. We utilize the same batch size, number of hidden layers, and number of epochs as in the Elman experiments. Since LSTM networks benefit from additional memory and gating over Elman networks, we increase the number of backpropagation steps to 35. We also empirically determine that an embedding size of 550 and a hidden layer size of 750 are preferred by LSTMs. To balance this modification, we increase our hidden neuron dropout rate to 0.65 and reduce our gradient clipping to 0.25 to reduce overfitting of the model. We maintain the gating activation functions as logistic sigmoids and hyperbolic tangents, as in standard LSTMs, so that they may continue functioning as gates. Since these sigmoid gates suffer from vanishing gradients, we increase our learning rate to 20. Lastly, we modify only the LSTM cells’ output activation function to employ either TeLU, ReLU, or Tanh. We observe significant improvements of TeLU over ReLU, and slight improvements over Tanh, in Table 26. For reproducibility, we detail this seeded LSTM experiment in Table 28.
6.5.3 FashionMNIST Dataset with MLP Robustness Experiment
To observe the robustness benefits of a smooth universal approximator, we designed an experiment with an MLP on the Fashion-MNIST dataset [46]. We performed ten trials, seeded with integers 0 to 9, in which architectures employing the TeLU, ReLU, and ELU nonlinearities as their hidden activation function train over 100 epochs. We partition FashionMNIST into 50,000 training samples; 10,000 validation samples; and 10,000 testing samples. In order to reach robust solutions, we use an SGD optimizer [88] with a large learning rate of 0.05 and a large weight decay coefficient of 0.001. We set our momentum coefficient to 0, upon noticing that momentum tends to produce less robust models in this simple experiment. We define a batch size of 256 with the intention of performing more regularized weight parameter update steps, resulting in minibatch gradient descent [79]. After each epoch of training, we evaluate our model on the validation partition and save a checkpoint if validation accuracy has improved. After 100 epochs, we evaluate the best validation accuracy checkpoint on 11 instances of our test dataset, each with an increasing degree of Gaussian noise standard deviation added to the inputs. We detail our experimental setup in Table 29.
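A hedged sketch of the noise-robustness evaluation loop is given below; the 0.0 to 1.0 noise grid is an assumption, since the exact standard deviations come from Table 30.

```python
import torch

@torch.no_grad()
def accuracy_under_noise(model, test_loader, noise_stds, device="cpu"):
    """Top-1 accuracy as zero-mean Gaussian noise of increasing std is added to inputs."""
    model.eval()
    results = {}
    for std in noise_stds:
        correct, total = 0, 0
        for images, labels in test_loader:
            images, labels = images.to(device), labels.to(device)
            noisy = images + std * torch.randn_like(images)
            preds = model(noisy).argmax(dim=1)
            correct += (preds == labels).sum().item()
            total += labels.size(0)
        results[std] = correct / total
    return results

# 11 noise levels; the exact grid used in Table 30 is an assumption here.
noise_levels = [round(0.1 * i, 1) for i in range(11)]
```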
We record our testing accuracies over the 11 noise standard deviations in Table 30. We note that while tuning the hyperparameters of the experiment to optimize robustness without reducing accuracy, ELU experienced numerical instability at favorable configurations. To focus on the robustness of the activation functions, rather than their stability, we separate ELU’s testing accuracies into two columns. The first column depicts the average ELU test accuracy for each noise level, while the second “ELU Stable” column calculates the average accuracies for stable ELU trials only. We also visualize the stable behavior of all three architectures in Figure 26 across the gradually increasing noise distributions. We did not include the overall testing average of ELU architectures in this Figure, as this would significantly reduce the resolution needed to valuably compare the robustness exhibited by each activation function. We note that TeLU, an analytic universal approximator, significantly outperforms the piecewise ReLU and ELU activation functions when employed as the hidden layer activation of an MLP.
6.6 Stability
In Subsection 5.6, we explored how an activation function’s properties; such as simple formulation, mitigation of vanishing and exploding gradients, smoothness, and reduced output bias; contribute to the overall stability of a neural network. With these foundational concepts established, we now shift from theory to practice by validating these benefits through experimental analysis. In this subsection, we will demonstrate how these theoretical advantages translate into real-world improvements in training stability, convergence speed, and model performance. By conducting empirical tests across various architectures and datasets, we aim to showcase the tangible impact of these activation functions on deep learning tasks, reinforcing the connection between theory and practical application.
6.6.1 FashionMNIST Dataset with MLP with Varied Network Depths
First, we demonstrate that TeLU architectures offer stable learning across a wide range of model depths. We define MLP architectures that employ hidden activations of TeLU, ReLU, and Mish, starting with 16 hidden layers of width 128. We then train models with momentum-accelerated Mini-batch Gradient Descent [79] [90] on the FashionMNIST dataset [46] with a weight decay coefficient of 0.0005, setting aside 10,000 validation images and 10,000 testing images. We repeat this process for incrementally increasing hidden layer counts until reaching a hidden depth of 44. We detail our precise experimental configuration in Table 31 and show the average testing accuracies of each architecture in Figure 27. We observe that ReLU, being a piecewise linear function with greater output bias than TeLU and Mish, succumbs to instability as depth increases. TeLU and Mish, both being smooth non-monotonic functions with less biased outputs, maintain their stability and perform predictably across all depths.
We then repeat the experiment for hidden layer counts between 34 and 44 with an increased weight decay coefficient of 0.001. We observe in Figure 28 how ReLU, being less self-regularized, stabilizes with greater L2 regularization. Mish, however, exhibits drastic instability with this minor increase in weight decay. We understand that Mish is a self-regularized non-monotonic function [43] with a more complex formulation than TeLU and ReLU. In this context, \operatorname{Mish}(x)=x \cdot \tanh\left(\ln\left(1+e^{x}\right)\right) may be interpreted as a version of \operatorname{TeLU}(x)=x \cdot \tanh\left(e^{x}\right) with the addition of a nested natural logarithm nonlinearity. This extra intermediate computation exposes each Mish activation to numerical underflow. Once underflow occurs, it may propagate along the remaining layers until it dictates a particular choice of inference. In the worst case, at a depth of 44, this underflow is more likely to occur somewhere in the network, and the average accuracy of the model becomes comparable to that of a random guess.
6.6.2 FashionMNIST Dataset with MLP with Varied Initialization Methods
We perform a similar experiment with the MLP architecture on the FashionMNIST dataset [46]. This time, we treat our weight initialization method as an independent variable instead of our model’s depth. We define TeLU, ReLU, ELU, SiLU, GELU, Mish, Logish, and Smish MLP architectures by changing only the activation function of the hidden layers. We initialize each architecture according to Xavier Uniform, Xavier Normal, He Uniform, and He Normal in different experiments. To be able to observe the impact of the different initializations, we limit the number of epochs to prevent eventual convergence on similar accuracies across initializers. We configure our experiment hyperparameters according to Table 32. We train each combination of initializer and architecture over 10 trials. We view the averaged testing accuracies for each architecture on each initializer in Figure 29. We observe that TeLU maintains superior performance over other architectures across all initializers. We also note that the ReLU architectures exhibit considerable accuracy, but all other activation architectures seem to vary in success across initializers.
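The four initialization schemes can be applied uniformly to the linear layers of each MLP with torch.nn.init; the sketch below is illustrative, and zeroing the biases is an assumption of the sketch rather than a detail from Table 32.

```python
import torch.nn as nn
import torch.nn.init as init

INITIALIZERS = {
    "xavier_uniform": init.xavier_uniform_,
    "xavier_normal":  init.xavier_normal_,
    "he_uniform":     init.kaiming_uniform_,
    "he_normal":      init.kaiming_normal_,
}

def apply_initializer(model: nn.Module, scheme: str):
    """Re-initialize every Linear layer's weights with the chosen scheme."""
    fn = INITIALIZERS[scheme]
    for m in model.modules():
        if isinstance(m, nn.Linear):
            fn(m.weight)
            init.zeros_(m.bias)  # bias handling is an assumption of this sketch
```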
6.6.3 CIFAR-100 Dataset with SqueezeNext with Varied Optimizers
We go on to showcase the comparative stability and performance of the TeLU, ReLU, and GELU activation functions when utilized as the hidden activation function in a SqueezeNext CNN [93]. We train the SqueezeNext architecture on the CIFAR-100 dataset [47] for 200 epochs with the Mini-batch SGD [79] [88], momentum-accelerated SGD [90], AdamW [91], and RMSprop [92] optimizers. We utilize a learning rate scheduler that updates the base learning rate every 60 epochs. We detail the training hyperparameter values that stay constant across optimizer settings in Table 33, as well as the hyperparameters that change in Table 34. After each training epoch, we check to see if our validation accuracy has improved. If it has, we overwrite our checkpoint with the current weight and bias values of our model. After 5 trials of each optimizer and architecture combination’s training are complete, we evaluate the best validation accuracy checkpoint on the testing partition of 10,000 CIFAR-100 images. We summarize the testing accuracies and standard deviations in Table 35. Additionally, we compare the progression of the validation accuracies of each architecture for the Mini-batch SGD and momentum-accelerated SGD optimizers in Figures 30 and 31.
We observe from Table 35 that TeLU exhibits greater accuracy than both ReLU and GELU in all cases. Additionally, we notice that TeLU exhibits greater learning stability in most cases, as determined by its minimal standard deviation of test accuracy across the Mini-batch SGD, momentum-accelerated SGD, and AdamW optimizers. This means that the network is less sensitive to fluctuations during training, leading to more predictable and stable outcomes, which are crucial in real-world applications where reliability is paramount.
6.6.4 Tiny ImageNet Dataset with ResNet34 with Varied Optimizers
We extend our experimentation to include a ResNet34 architecture on TinyImageNet to ensure that this trend persists for larger architectures and datasets. We utilize the configuration detailed in Table 36 for both the Minibatch SGD and momentum-accelerated SGD optimizers. Across optimizers, we vary only the base learning rate, weight decay coefficient, and learning rate scheduler decay hyperparameters, as detailed in Table 37. We perform each trial by training the respective ResNet34 architecture on the TinyImageNet training partition, consisting of 100,000 images, over the course of 200 epochs. After each epoch, we check whether the updated model improves upon the best validation accuracy observed so far. If the previous epoch of training has improved validation accuracy, we save a new checkpoint of our model. After training concludes, we evaluate our best checkpoint on the unseen testing partition to determine our top-1 and top-5 testing accuracies.
We record our test results across 5 trials for the Mini-batch SGD optimizer in Table 38 and observe that TeLU offers greater average test accuracy and less variance in top-1 test scores. For another perspective on these accuracy improvements, we plot the progression of validation accuracy across the 200 epochs of training in Figure 32. Here, we observe that the TeLU architecture exhibits faster convergence speeds after the learning rate updates at epochs 60 and 100. After this, both the ReLU and TeLU architectures experience a saturation in validation accuracies, with TeLU architectures ending up at a greater validation accuracy.
We also summarize the test results across 10 trials for the momentum-accelerated Mini-batch SGD optimizer in Table 39. It is evident that the ReLU-based ResNet34 architecture experiences significant instability in both top-1 and top-5 accuracy metrics due to numerical issues. Notably, 3 out of the 10 trials resulted in ‘not a number’ (NaN) training accuracies across all 200 epochs, indicating that numerical instability arises as early as the first epoch. When training is compromised by such instability, the affected neurons dictate the output, leading to random validation accuracies, where the network effectively guesses output values. This behavior is further illustrated in Figure 33, where we observe that the ReLU ResNet34 architecture underperforms by approximately 20% in average validation accuracy compared to the TeLU model. These findings confirm that TeLU-based architectures maintain stability across different optimizers.
6.6.5 Summary of Stability Experimental Validation
Throughout previous subsections of this section, we have observed similar cases of instability with other activation functions. Figure 22 shows that SiLU, GELU, Mish, Logish, and Smish experience significantly more instability than other nonlinearities when employed within MLP architectures once the weight decay coefficient is increased beyond 0.001. Figure 23 adds further evidence to this observation, with SiLU, GELU, Mish, Logish, and Smish architectures resulting in inferior testing accuracies after training concludes. Furthermore, we observe in Figure 24 that SiLU, Logish, and Smish SqueezeNext architectures result in reduced test accuracies when trained on the CIFAR-10 dataset with a Mini-batch SGD optimizer. This drop in accuracy is amplified for Logish and Smish when momentum is introduced in Figure 25. Across all experiments, TeLU shows the most stable and consistent results.
The reasons for TeLU’s unique degree of learning stability lie in its ability to mitigate both vanishing and exploding gradients while maintaining smoothness and minimal output bias. This combination ensures that gradients propagate effectively through deep networks, allowing for steady and controlled learning, even in complex architectures. TeLU’s ability to deliver high accuracy while ensuring stable learning makes it particularly well-suited for tasks requiring long-term reliability, such as in autonomous systems, financial forecasting, or critical healthcare diagnostics, where instability in learning could result in severe consequences. By delivering consistently better accuracy and reduced variability, TeLU provides a foundation for more resilient, generalizable models across a wide range of applications.
7 Discussion
We have proposed the Hyperbolic Tangent Exponential Linear Unit (TeLU), an activation function for deep neural networks defined as \operatorname{TeLU}(x)=x \cdot \tanh \left(e^{x}\right) . TeLU is an analytic, non-monotonic non-linearity that belongs to the class of linearly-scaled cumulative distribution functions, which includes linear units such as \operatorname{ReLU}(x)=\max (0, x) and \operatorname{GELU}(x)=\frac{x}{2} \cdot\left(1+\operatorname{erf}\left(\frac{x}{\sqrt{2}}\right)\right) . Like other linear units, TeLU may be interpreted as the identity function x with an implicit activation dropout as the input becomes negative. As a result, TeLU comprises a dense deactivation region followed by an active region exhibiting asymptotic identity growth. Throughout this paper, we have demonstrated the persistent gradients of TeLU's deactivation region, the near-linearity of its active region, its computational efficiency, its compatibility as a substitute for ReLU, its analytic universal approximation, and its inherent stability properties. Together, these properties allow TeLU to uniquely overcome the challenges of vanishing gradients, convergence delay, computational overhead, configuration tuning delay, community adoption, and learning instability.
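For reference, a minimal PyTorch sketch of TeLU as a drop-in module is shown below; the class name and the surrounding block are illustrative and do not correspond to a released implementation.

```python
import torch
import torch.nn as nn

class TeLU(nn.Module):
    """TeLU(x) = x * tanh(exp(x)); usable wherever nn.ReLU() is used."""
    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # Note: for very large positive inputs, exp() may overflow in float32;
        # the forward pass still returns x (tanh saturates to 1), but clamping
        # the exponent may be preferable for backward-pass numerical safety.
        return x * torch.tanh(torch.exp(x))

# Example: swapping TeLU into a small feed-forward block in place of ReLU.
block = nn.Sequential(nn.Linear(128, 128), TeLU(), nn.Linear(128, 10))
out = block(torch.randn(4, 128))
```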
7.1 Persistent Gradients of Deactivation Region of TeLU
In Subsection 5.1, we noticed that our lower-bounded activation functions had derivatives that saturate towards deactivation at different rates. To better describe these varying vanishing rates, we extended the asymptotic growth classes to encompass asymptotic decay with classes \Theta\left(\frac{x}{e^{x}}\right), \Theta\left(\frac{1}{e^{x}}\right), O\left(\frac{1}{x!}\right) , and \Omega\left(\frac{1}{\left(x^{2}\right)!}\right) . We discovered that TeLU's gradient vanishes according to \Theta\left(\frac{x}{e^{x}}\right) , giving it the mildest vanishing-gradient behavior among the competing functions. We showcased the effect of these varying vanishing gradients numerically by computing, for each activation function, the input domain over which its gradient avoids numerical underflow.
In Subsection 6.1, we showed the direct impact of these asymptotic decay rates: we designed an experiment in which MLP architectures employing different hidden non-linearities have their biases initialized to -10 and -20 . We trained each architecture on the FashionMNIST dataset and observed each model's ability to recover from the vanishing-gradient conditions found in their saturation regions. In each case, we observed that the TeLU architecture recovers first. The remaining architectures lag behind TeLU, in an order that is expected given their asymptotic decay classes and in a manner that reflects their gradient strengths.
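A minimal sketch of this bias-initialization setup is shown below: the hidden-layer biases of an MLP are set to a large negative constant so that the chosen non-linearity starts in its saturation region. The layer widths and the generic activation argument are illustrative assumptions, not the exact FashionMNIST configuration.

```python
import torch.nn as nn

def make_biased_mlp(act_cls=nn.ReLU, hidden_bias: float = -10.0) -> nn.Sequential:
    # MLP whose hidden biases start deep in the activation's saturation region.
    mlp = nn.Sequential(
        nn.Flatten(),
        nn.Linear(28 * 28, 256), act_cls(),
        nn.Linear(256, 256), act_cls(),
        nn.Linear(256, 10),
    )
    for layer in mlp[:-1]:          # leave the output layer's bias at its default
        if isinstance(layer, nn.Linear):
            nn.init.constant_(layer.bias, hidden_bias)
    return mlp

model = make_biased_mlp(act_cls=nn.ReLU, hidden_bias=-20.0)
```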
We proceed to provide further experimental validation of the impact of persistent gradients by comparing TeLU, ReLU, and GELU DenseNet architectures on CIFAR-10. Our DenseNet configuration does not feature residual bypass connections, so vanishing or dying gradients in GELU and ReLU architectures are especially effective in halting learning prematurely, resulting in models that exhibit poor accuracy. Meanwhile, the TeLU architecture trains and generalizes to improved accuracy in all experiments, showing that TeLU mitigates the vanishing gradient problem.
7.2 Near-Linearity of Active Region of TeLU
In Subsection 5.2, we focused on TeLU's near-linear behavior in the active region, featuring strong gradients that are essential for meaningful updates to learnable parameters. This undiminished learning allows for rapid, stable convergence towards superior training accuracy that generalizes to superior test accuracy, aided by TeLU's implicit dropout. In Subsection 6.2, we provide experimental validation for the rapid, stable convergence of TeLU. We started by comparing TeLU and ReLU ResNet34 architectures on the ImageNet dataset. While TeLU smoothly and rapidly approaches linear growth, ReLU exhibits linear growth immediately once its input becomes positive. With both TeLU and ReLU exhibiting strong active gradients, we compare the speed and stability of both architectures.
First, we utilized PyTorch's ResNet34 default training configuration, which encompasses 90 epochs of training. We observed that the TeLU activation allowed for faster and more consistent convergence than ReLU. To accentuate the benefits of this improved convergence, we limited the number of training epochs to 50 and to 20 in two separate follow-up experiments. In both cases, we observe that TeLU architectures provide improved convergence and superior generalization over ReLU architectures on ImageNet.
We further demonstrate TeLU’s convergence advantages over ReLU and GELU within a dynamic-pooling transformer architecture trained on the Text8 dataset. Initially, we highlighted TeLU’s superior convergence speed and stability compared to GELU in terms of validation loss, with this improvement translating to lower test loss and reduced standard deviation. To examine the impact of TeLU’s faster convergence, we conducted a shorter experiment with a reduced learning rate scheduler period. As expected, TeLU outpaced both ReLU and GELU in terms of convergence speed and stability under these conditions.
7.3 Computational Efficiency of TeLU
In Subsection 5.3, we quantified the computational complexities of baseline definitions of activation functions. We counted the number of piecewise segmentations, nonlinear functions, arithmetic operations, and constant terms present within each activation function. We discovered that TeLU exhibited minimal computational complexity that was second only to ReLU. Table 5 summarizes these counts for our focused group of non-linearities.
In Subsection 6.3, we witness the direct computational benefits of TeLU over other smooth activation functions within custom PyTorch [104] benchmarks run on various systems. Each benchmark consisted of a number of forward and backward passes on neurons activated according to the TeLU, ReLU, ELU, SiLU, GELU, Mish, Logish, and Smish nonlinearities. The first of these benchmarks was run on a Windows 11 operating system with an NVIDIA RTX 2070 GPU and consisted of 10^{6} activations on an input size of 10^{6} , each followed by the corresponding gradient calculation. The total time spent on forward and backward passes was recorded for each non-linearity.
TeLU and ReLU were found to offer the greatest computational efficiency, as expected from the computational complexity heuristics. The next benchmarks were run on a Linux batch server and utilized A100 [105] and 1080Ti GPUs, respectively. Tables 17, 19, and 18 summarize the resulting runtimes across varying input sizes and numbers of iterations. Across each operating system, device, and experimental configuration, we observe TeLU offering runtimes that are second only to ReLU.
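A simplified version of this timing benchmark is sketched below; the iteration count and activation list are scaled down from the settings reported above, so absolute timings will differ from those in Tables 17, 18, and 19.

```python
import time
import torch
import torch.nn.functional as F

def telu(x: torch.Tensor) -> torch.Tensor:
    return x * torch.tanh(torch.exp(x))

def time_activation(fn, size=10**6, iters=100,
                    device="cuda" if torch.cuda.is_available() else "cpu") -> float:
    x = torch.randn(size, device=device, requires_grad=True)
    if device == "cuda":
        torch.cuda.synchronize()
    start = time.perf_counter()
    for _ in range(iters):
        fn(x).sum().backward()   # one forward and one backward pass per iteration
        x.grad = None
    if device == "cuda":
        torch.cuda.synchronize()
    return time.perf_counter() - start

for name, fn in [("TeLU", telu), ("ReLU", torch.relu), ("GELU", F.gelu)]:
    print(f"{name}: {time_activation(fn):.3f} s")
```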
7.4 Configuration Compatibility of ReLU and TeLU
In Subsection 5.4, we describe how non-monotonic non-linearities are the most suitable for approximating the ReLU activation function. We begin by calculating the area between different activation functions and ReLU, showing an initial heuristic for the closeness of approximation each function provides. This is performed for both negative and positive inputs, to indicate the quality of approximation in both the inactive and active regions. From our candidate pool, we demonstrated that non-monotonic functions are uniquely capable of approximating ReLU as input x \rightarrow \pm \infty and at x=0 . Additionally, we found that TeLU serves as the most suitable substitute for ReLU in deep learning applications. TeLU’s active gradients closely approximate ReLU’s strong identity growth, resulting in a comparable convergence speed and low implicit regularization. In contrast, other activation functions with weaker active gradients experience implicit regularization in the form of gradient damping, a behavior that distinguishes them from ReLU and TeLU.
In Subsection 6.4, we experimentally validated the operational similarities between TeLU and ReLU by testing MLP architectures with different non-linearities as their hidden activation functions. The results showed that the same weight decay coefficient values led to optimal performance in both TeLU and ReLU architectures, whereas other architectures with different activation functions did not achieve similar results. Furthermore, we show that TeLU provides accuracy improvements over ReLU on the same configuration that optimizes ReLU convergence and generalization. We offer additional experimental evidence of this similarity by demonstrating that TeLU improves accuracy in training configurations optimized to minimize the loss in ReLU CNN architectures. For each optimizer, including Mini-batch SGD [79, 88], momentum accelerated SGD [90], AdamW [91], and RMSprop [92], we tune the training hyperparameters that influence the learning of a SqueezeNext architecture using the ReLU nonlinearity. The hyperparameters are optimized to achieve the highest possible validation accuracy for each optimizer. Again, we observe that TeLU architectures are optimized by configurations designed for ReLU. In addition, the results demonstrate that TeLU outperforms ReLU when using Mini-batch momentum-accelerated SGD, AdamW, and RMSprop optimizers.
7.5 Analytic Universal Approximation of TeLU
In Subsection 5.5, we establish that TeLU is an analytic universal approximator. Consequently, architectures utilizing TeLU as their activation function exhibit stable and efficient convergence when optimized with gradient-based methods. The smooth nature of analytic approximators, like TeLU, also enhances generalization to unseen data by minimizing overfitting to noise or irrelevant artifacts [50]. This improved robustness leads to more reliable model performance. Moreover, the smooth approximations provided by analytic functions are easier to interpret and analyze, a quality highly valued in mathematical research [22]. Additionally, as an analytic function, TeLU ensures compatibility with second-order optimization techniques that leverage the curvature of the loss landscape, resulting in more stable and efficient learning [27, 21, 64, 65].
In Subsection 6.5, we perform practical demonstrations of the benefits of utilizing analytic nonlinearities with VAE [95], RNN [101, 103], and MLP robustness [63] experiments. Results show that VAE architectures that utilize the TeLU activation function lead to MNIST sample reconstructions with minimal loss and improved consistency over that of ReLU VAEs. For our RNN experiments, we define Elman [101] and LSTM [103] architectures that employ either TeLU, ReLU, or Tanh activation functions. Again, we find that TeLU architectures lead to minimal perplexity [106] on the Penn TreeBank dataset [100]. Lastly, we demonstrate the improved robustness of analytic universal approximations by comparing the robustness of TeLU MLPs with that of ReLU and ELU models. As expected, we observe that TeLU consistently outperforms its piecewise competitors [16, 50]. Across all experiments, we notice that TeLU offers a strong advantage over non-analytic activation functions.
7.6 Learning and Numerical Stability of TeLU Networks
In Subsection 5.6, we examined the properties of activation functions that contribute to the learning stability of deep neural network models. Table 7 provides a summary of key stability factors: the depth of embedded nonlinear computations, smoothness, susceptibility to the vanishing gradient problem, and output bias for each activation function studied. We demonstrate that smooth linear units like TeLU effectively mitigate the exploding gradient issue through their sub-linear growth in the positive domain. Our findings show that TeLU achieves the best overall balance across these criteria, resulting in notable stability improvements compared to existing activation functions.
In Subsection 6.6, we provide experimental validation of our heuristic comparison by demonstrating that TeLU MLP architectures maintain stable performance as model depth increases across multiple configurations. Furthermore, we show that the learning stability of TeLU persists across various weight initialization methods. Specifically, using Xavier Uniform, Xavier Normal [5], Kaiming Uniform, and Kaiming Normal [107] initialization techniques, TeLU consistently outperforms similar MLPs employing other activation functions. This consistent performance extends to our CNN experiments as well. TeLU-based SqueezeNext architectures achieve superior test accuracies across all tested optimizers when trained on the CIFAR-100 dataset. Across all experiments, we find that TeLU architectures consistently exhibit faster convergence and greater learning stability, highlighting the unique advantages of TeLU in deep learning applications.
7.7 Independent Rediscovery
After submitting our work on the Hyperbolic Tangent Exponential Linear Unit (TeLU) activation function, we discovered that a similar formulation had been introduced in the literature under the name TanhExp [67]. Our independent discovery, stemming from a theoretical study of activation functions initiated in late 2022, underscores the natural emergence of TeLU as an innovative design. By systematically analyzing activation functions and building on insights from prior work, including Mish [43], we identified TeLU’s distinctive properties. This parallel rediscovery highlights the intuitive appeal of TeLU’s design and its potential to address key challenges in machine learning applications. While both studies demonstrate the efficacy of this activation function, they approach the problem from complementary perspectives. The TanhExp study emphasizes empirical results, particularly on small-scale vision benchmarks like CIFAR-10 and Fashion-MNIST, providing valuable insights into its practical utility. In contrast, our work integrates theoretical rigor with extensive empirical validation, offering a deeper understanding of TeLU’s behavior and properties.
Our study expands on the existing literature by deriving and validating theoretical bounds, which explain why TeLU works effectively and establish it as a reliable drop-in replacement for ReLU. Additionally, we significantly broaden the scope of experimentation, testing TeLU on large-scale benchmarks such as ImageNet and Text8 to evaluate its versatility across diverse tasks. This comprehensive evaluation demonstrates TeLU’s applicability beyond vision tasks and bridges the gap between theory and practice, showcasing its robustness in real-world scenarios. To ensure reliability and reproducibility, our methodology incorporates multiple trials, reports standard deviations, and employs separate testing and validation sets, adhering to best practices in machine learning experimentation. By providing detailed experimental settings and making our code publicly available, we aim to promote transparency and encourage further exploration of TeLU. These contributions complement prior work by addressing aspects such as statistical significance and scalability, reinforcing the utility of this activation function. Through systematic comparisons against a wide range of activation functions across diverse datasets and experimental setups, we offer a balanced and thorough analysis. This combination of theoretical insights and empirical breadth underscores TeLU’s potential as a lightweight, efficient, and effective alternative to ReLU, paving the way for broader adoption and future advancements in activation function design.
8 Conclusion
In conclusion, the Hyperbolic Tangent Exponential Linear Unit (TeLU) stands out as a highly effective activation function that addresses several critical challenges in neural network training. One of its primary advantages is the presence of persistent gradients in its inactive region, which mitigates the vanishing gradient problem that can slow down learning in deep networks. By maintaining non-zero gradients even for negative input values, TeLU ensures that all neurons continue to learn, enhancing overall network performance.
Moreover, TeLU closely approximates the identity function for positive input values. This characteristic strikes an ideal balance, preventing both vanishing and exploding gradient issues. The linear approximation allows for consistent and efficient gradient propagation, leading to faster and more stable convergence during training. By minimizing unintended damping of gradients, TeLU ensures that learnable parameter updates are not inadvertently downscaled. This allows optimizers and learning rate schedulers to effectively manage the magnitude of learning steps, maximizing convergence rates and enabling a more modular and flexible training configuration.
Unlike more complex activation functions, TeLU’s simple formulation reduces computational overhead, resulting in significant efficiency improvements. TeLU’s straightforward design also offers seamless compatibility as a drop-in substitute for the widely used ReLU activation function. This ease of integration encourages adoption within the deep learning community, allowing practitioners to leverage TeLU’s benefits without overhauling existing architectures. Unlike ReLU, TeLU is an analytic function, which enhances robustness and stability during training. Its analytic nature not only improves convergence but also makes it compatible with second-order optimization methods. By utilizing the curvature of the loss function, these methods can further enhance convergence efficiency and stability.
These combined properties enable TeLU to exhibit a unique level of learning stability across a wide range of experimental settings. By effectively addressing issues like vanishing gradients, computational inefficiency, and training instability, TeLU presents a compelling advancement in activation functions. Its ability to simplify training configurations while enhancing performance makes it a valuable tool for advancing deep learning models across various applications. The TeLU activation function, therefore, holds significant promise for facilitating more efficient, stable, and robust neural network training in the field of deep learning.
9 Future Work
For future work, we aim to delve into the unique properties of analytic universal approximators by thoroughly examining the convergence guarantees provided by the TeLU activation function. A deeper theoretical investigation could identify the precise conditions under which TeLU guarantees faster and more reliable convergence, broadening its utility across different neural network architectures and aiding researchers in optimizing model performance more efficiently.
TeLU’s role as an analytic universal approximator makes it a promising candidate for higher-order optimization methods. In future work, we will explore its performance with techniques like Newton’s Method [108, 109, 110], Hessian-Free Optimization [21], Natural Gradient Descent [13, 38], and Trust Region methods [111, 112, 113]. These methods offer improved stability and convergence efficiency, so we will investigate how TeLU complements their properties.
In addition, we will extend our research by integrating TeLU with bio-inspired learning algorithms, such as predictive coding networks [114], which leverage higher-order curvature information through quasi-Newton approximations [115]. By incorporating TeLU, we anticipate enhanced stability and faster convergence compared to traditional backpropagation, and our future studies will carefully analyze its performance within these innovative frameworks.
TeLU, as an analytic universal approximator, has a smooth curvature often associated with improved robustness in neural networks [50, 21, 65]. Initial testing has demonstrated promising robustness benefits. To further validate these results, we plan to conduct extensive experiments across challenging datasets and standard benchmarks [116], where the smoothness of activation functions like TeLU is expected to confer benefits against adversarial attacks [16].
Beyond providing resistance to adversarial attacks, we hypothesize that smooth non-monotonic activation functions enhance security by making it more difficult to reconstruct training images, addressing significant privacy concerns [117]. Their unpredictability disrupts neural network processing patterns, hindering attackers from reverse-engineering or inferring sensitive data. Attack methods like side-channel analysis and bitstream reverse engineering, which rely on predictable behaviors and data correlations, become less effective. This unpredictability strengthens neural network security by reducing the success of traditional attack vectors. Integrating non-monotonic activation functions thus enhances privacy and bolsters resistance against specific attacks.
Inspired by the success of TeLU, we also intend to experiment with variations of TeLU of the form x \cdot \tanh \left(a^{x}\right) , where a \in[2,3] . If a \in \mathbb{Z}^{+} , computational efficiency could be enhanced. When a+\epsilon=e , for some small positive \epsilon , gradients would decay at an asymptotic rate of \Theta\left(\frac{x}{a^{x}}\right) , which may help further address the vanishing gradient problem and improve convergence. Conversely, when a=e+\epsilon , the function might approximate the identity more closely, potentially enhancing convergence.
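A sketch of this hypothetical variant is given below; the function name, parameter name, and default value of a are illustrative only.

```python
import torch

def telu_variant(x: torch.Tensor, a: float = 2.0) -> torch.Tensor:
    # Hypothetical generalization x * tanh(a**x); a = e recovers TeLU(x).
    base = torch.as_tensor(a, dtype=x.dtype, device=x.device)
    return x * torch.tanh(base ** x)

y = telu_variant(torch.linspace(-4.0, 4.0, 9), a=2.0)
```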
To further investigate the computational efficiency of TeLU, we plan to conduct additional experiments using lower-level programming languages like C and C++. Implementing TeLU in these languages may allow for a more precise evaluation of its computational efficiency. We also intend to explore optimizations of TeLU itself, refining its algorithms to improve speed and resource utilization. Additionally, by involving distinct hardware accelerators such as TPUs [118] and FPGAs [119], we aim to assess TeLU's performance across various architectures. These efforts are motivated by our desire to maximize TeLU's efficiency and scalability, ensuring it can be effectively utilized in a wide range of real-world applications.
A Additional Results
This Appendix section provides supplementary tables that offer additional insights into our theoretical analysis and experimentation. These tables serve as an additional resource for readers seeking a deeper understanding of the theoretical analysis and methodologies discussed in the main body of this work.
A. 1 Computational Efficiency of Derivatives
Table 40 presents a breakdown of the various mathematical components involved in computing the first derivative of each activation function examined in this study. Specifically, it quantifies the number of piece-wise segments, nonlinear operations, arithmetic computations, and constant terms required. This analysis serves as a practical measure of computational complexity, providing insights into how each activation function’s derivative might impact processing time and resource utilization during model training. By comparing these counts, we can better understand the efficiency trade-offs between different activation functions, highlighting the advantages of the proposed function in terms of simplicity and speed.
A. 2 Dataset Normalization Statistics
Table 41 outlines the standard score normalization applied in experiments with the CIFAR-10, CIFAR-100, and TinyImageNet datasets. For each dataset, the mean and standard deviation are calculated and utilized to normalize each input sample, ensuring consistency and comparability across the experiments. This information is provided to assist anyone attempting to recreate the experiments. By including the computed means and standard deviations for each dataset, we ensure that the standard score normalization process can be accurately replicated, maintaining the integrity and consistency of the experimental conditions. This level of detail is crucial for achieving comparable results and validating the findings of this study.
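For readers reproducing this preprocessing, the sketch below shows how such per-channel standard score normalization is typically applied with torchvision; the mean and standard deviation values here are approximate placeholders, and the exact statistics to use are those reported in Table 41.

```python
from torchvision import transforms

# Approximate CIFAR-10 channel statistics, used only as placeholders.
cifar10_mean, cifar10_std = (0.49, 0.48, 0.45), (0.25, 0.24, 0.26)

train_transform = transforms.Compose([
    transforms.ToTensor(),                             # scales pixel values to [0, 1]
    transforms.Normalize(cifar10_mean, cifar10_std),   # (x - mean) / std per channel
])
```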
B Additional Supporting Theoretical Results
This appendix section presents supplementary theorems that provide a more detailed and rigorous foundation for the theories proposed in this study. These additional theorems serve to strengthen the theoretical framework, offering deeper insights and comprehensive support for our arguments. By including these formal mathematical statements and proofs, we aim to enhance the clarity and robustness of our theoretical analysis, giving readers a more complete understanding of the underlying principles that drive our findings.
B. 1 Close Approximation to ReLU Non-Linearity with TeLU
Lemma B. 1 Let r(x)=\operatorname{ReLU}(x)=\max (0, x) , and define the active subdomain as \mathcal{A}=[0, \infty) and the inactive subdomain as \mathcal{I}=(-\infty, 0) . Consider the following activation functions:
- TeLU: t(x)=x \cdot \tanh \left(e^{x}\right) ,
- GELU: g(x)=x \cdot \Phi(x) , where \Phi(x) is the Gaussian CDF.
Define the gradient magnitudes over the active and inactive subdomains as:
G_{\mathcal{A}}(f)=\int_{0}^{\infty}\left|f^{\prime}(x)\right| d x, \quad G_{\mathcal{I}}(f)=\int_{-\infty}^{0}\left|f^{\prime}(x)\right| d x
If G_{\mathcal{A}}(t)>G_{\mathcal{I}}(g) , then TeLU has a stronger impact on training dynamics in the positive subdomain than GELU has in the negative subdomain. Consequently, neural networks utilizing TeLU more closely resemble ReLU and exhibit stronger positive-side sensitivity.
Proof: To quantify the gradient behavior of t(x) and g(x) in their respective subdomains, we first compute the derivatives:
1. Derivative of TeLU in the Active Subdomain. The TeLU function is defined as:
t(x)=x \cdot \tanh \left(e^{x}\right)
Taking the derivative:
t^{\prime}(x)=\tanh \left(e^{x}\right)+x \cdot e^{x} \cdot \operatorname{sech}^{2}\left(e^{x}\right) .
As x \rightarrow \infty, \tanh \left(e^{x}\right) \approx 1 and \operatorname{sech}^{2}\left(e^{x}\right) \approx 0 . Thus, for large x , we have:
t^{\prime}(x) \approx 1 .
This shows that TeLU’s derivative asymptotically approaches ReLU’s derivative for x>0 .
The gradient magnitude in the active subdomain is:
G_{\mathcal{A}}(t)=\int_{0}^{\infty}\left|\tanh \left(e^{x}\right)+x \cdot e^{x} \cdot \operatorname{sech}^{2}\left(e^{x}\right)\right| d x
Since \tanh \left(e^{x}\right) \approx 1 and the second term vanishes as x increases, the integrand remains close to 1 and the integral grows without bound, indicating strong gradient influence in the active subdomain.
2. Derivative of GELU in the Inactive Subdomain The GELU function is defined as:
g(x)=x \cdot \Phi(x),
where \Phi(x)=\frac{1}{2}\left(1+\operatorname{erf}\left(\frac{x}{\sqrt{2}}\right)\right) is the Gaussian CDF. Taking the derivative:
g^{\prime}(x)=\Phi(x)+x \cdot \phi(x)
where \phi(x)=\frac{1}{\sqrt{2 \pi}} e^{-x^{2} / 2} is the Gaussian PDF.
For x<0, \Phi(x) \approx 0 and \phi(x) rapidly decays to 0 as x \rightarrow-\infty . Thus, for large negative x , we have:
g^{\prime}(x) \approx 0
The gradient magnitude in the inactive subdomain is:
G_{\mathcal{I}}(g)=\int_{-\infty}^{0}|\Phi(x)+x \cdot \phi(x)| d x
Since \Phi(x) and x \cdot \phi(x) decay rapidly as x \rightarrow-\infty , this integral converges to a small value, indicating weak gradient influence in the inactive subdomain.
3. Gradient Magnitude Comparison Comparing G_{\mathcal{A}}(t) and G_{\mathcal{I}}(g) :
G_{\mathcal{A}}(t) \approx \int_{0}^{\infty} 1 d x=\infty, \quad G_{\mathcal{I}}(g) \approx \int_{-\infty}^{0} 0 d x=0
This shows that TeLU has a significantly larger gradient magnitude in the active subdomain compared to GELU’s gradient in the inactive subdomain.
Conclusion Since the gradient strength of TeLU in \mathcal{A} far exceeds that of GELU in \mathcal{I} , TeLU will have a stronger impact on the learning dynamics of a neural network. Therefore, neural networks using TeLU will be more sensitive and responsive in the positive region, closely mimicking ReLU’s behavior in practice.
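The following minimal numerical check, relying only on the definitions above, illustrates the contrast used in this lemma: the TeLU derivative approaches 1 over the active subdomain, while the GELU derivative decays towards 0 over the inactive subdomain.

```python
import math

def telu_grad(x: float) -> float:
    # t'(x) = tanh(e^x) + x * e^x * sech^2(e^x), with sech^2 = 1 - tanh^2.
    th = math.tanh(math.exp(x))
    return th + x * math.exp(x) * (1.0 - th * th)

def gelu_grad(x: float) -> float:
    # g'(x) = Phi(x) + x * phi(x) with the Gaussian CDF Phi and PDF phi.
    phi = math.exp(-0.5 * x * x) / math.sqrt(2.0 * math.pi)
    Phi = 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))
    return Phi + x * phi

for x in (1.0, 5.0, 10.0):
    print(f"t'({x}) = {telu_grad(x):.4f}")    # approaches 1 in the active subdomain
for x in (-1.0, -5.0, -10.0):
    print(f"g'({x}) = {gelu_grad(x):.2e}")    # decays toward 0 in the inactive subdomain
```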
B. 2 Zero-Centering of TeLU
Before proving that TeLU has better zero-centering capability than ReLU, we will provide the following lemma to support our construction.
Lemma B.2 If f(x) \geq g(x) for all x \in \mathbb{R} and f(x)>g(x) for some defined interval [a, b] , where a, b \in \mathbb{R} and a<b , then \int_{-\infty}^{\infty} f(x) d x>\int_{-\infty}^{\infty} g(x) d x .
Proof: We can expand \int_{-\infty}^{\infty} f(x) d x as follows:
\int_{-\infty}^{\infty} f(x) d x=\int_{-\infty}^{a} f(x) d x+\int_{a}^{b} f(x) d x+\int_{b}^{\infty} f(x) d x
Similarly, we can also expand \int_{-\infty}^{\infty} g(x) d x as follows:
\int_{-\infty}^{\infty} g(x) d x=\int_{-\infty}^{a} g(x) d x+\int_{a}^{b} g(x) d x+\int_{b}^{\infty} g(x) d x
Next, we state the condition that we need to prove:
\begin{array}{c}
\int_{-\infty}^{\infty} f(x) d x>\int_{-\infty}^{\infty} g(x) d x \\
\int_{-\infty}^{\infty} f(x) d x-\int_{-\infty}^{\infty} g(x) d x>0 \\
\left(\int_{-\infty}^{a} f(x) d x-\int_{-\infty}^{a} g(x) d x\right)+\left(\int_{a}^{b} f(x) d x-\int_{a}^{b} g(x) d x\right)+\left(\int_{b}^{\infty} f(x) d x-\int_{b}^{\infty} g(x) d x\right)>0
\end{array}
Let us analyze these three components:
For all x \in \mathbb{R} we get,
\begin{array}{c}
f(x) \geq g(x) \\
\int_{-\infty}^{a} f(x) d x \geq \int_{-\infty}^{a} g(x) d x \\
\int_{-\infty}^{a} f(x) d x-\int_{-\infty}^{a} g(x) d x \geq 0
\end{array}
Similarly, for all x \in \mathbb{R} ,
\begin{array}{c}
f(x) \geq g(x) \\
\int_{b}^{\infty} f(x) d x \geq \int_{b}^{\infty} g(x) d x \\
\int_{b}^{\infty} f(x) d x-\int_{b}^{\infty} g(x) d x \geq 0
\end{array}
Finally, for all x \in[a, b] we get,
\begin{array}{c}
f(x)>g(x) \\
\int_{a}^{b} f(x) d x>\int_{a}^{b} g(x) d x \\
\int_{a}^{b} f(x) d x-\int_{a}^{b} g(x) d x>0 \\
k>0
\end{array}
where k=\int_{a}^{b} f(x) d x-\int_{a}^{b} g(x) d x \in \mathbb{R}^{+} .
These 3 components can be re-written as follows:
\begin{array}{c}
\left(\int_{-\infty}^{a} f(x) d x-\int_{-\infty}^{a} g(x) d x\right)+\left(\int_{a}^{b} f(x) d x-\int_{a}^{b} g(x) d x\right)+\left(\int_{b}^{\infty} f(x) d x-\int_{b}^{\infty} g(x) d x\right) \\
\geq 0+\left(\int_{a}^{b} f(x) d x-\int_{a}^{b} g(x) d x\right)+0=\left(\int_{a}^{b} f(x) d x-\int_{a}^{b} g(x) d x\right) \\
=k>0
\end{array}
because k \in \mathbb{R}^{+}
Therefore, we have shown that
\int_{-\infty}^{\infty} f(x) d x>\int_{-\infty}^{\infty} g(x) d x
for any f(x) and g(x) such that f(x) \geq g(x) for any x \in \mathbb{R} and f(x)>g(x) for any x \in[a, b] , which concludes our proof.
Next, we show that TeLU exhibits greater zero-centering of activation than ReLU.
Theorem B. 1 If x is a random variable following a Gaussian probability distribution about zero with standard deviation \sigma \in \mathbb{R}^{+} expressed as PDF p(x)=\frac{1}{\sigma \sqrt{2 \pi}} \cdot \exp \left(\frac{-x^{2}}{2 \sigma^{2}}\right), \operatorname{TeLU}(x)=x \cdot \tanh \left(e^{x}\right), \operatorname{ReLU}(x)=\max (0, x) , E[\operatorname{TeLU}(x)]=\int_{-\infty}^{\infty} p(x) \cdot \operatorname{TeLU}(x) d x , and E[\operatorname{ReLU}(x)]=\int_{-\infty}^{\infty} p(x) \cdot \operatorname{ReLU}(x) d x ; then |E[\operatorname{TeLU}(x)]| <|E[\operatorname{ReLU}(x)]| .
Proof: We now analyze \operatorname{PDF} p(x)=\frac{1}{\sigma \sqrt{2 \pi}} \cdot \exp \left(\frac{-x^{2}}{2 \sigma^{2}}\right) to show that it is positive for all \sigma \in \mathbb{R}^{+}, x \in \mathbb{R} :
- \frac{1}{\sigma \sqrt{2 \pi}} is positive for all \sigma \in \mathbb{R}^{+}
- \exp \left(\frac{-x^{2}}{2 \sigma^{2}}\right) is positive for all \sigma \in \mathbb{R}^{+}, x \in \mathbb{R} ; because the range of \exp (x) is (0, \infty)
\therefore p(x)=\frac{1}{\sigma \sqrt{2 \pi}} \cdot \exp \left(\frac{-x^{2}}{2 \sigma^{2}}\right) is positive, as it is the product of two positive terms
We now show that E[\operatorname{TeLU}(x)]<E[\operatorname{ReLU}(x)] for all \sigma \in \mathbb{R}^{+} :
We begin by showing that p(x) \cdot \operatorname{TeLU}(x)<p(x) \cdot \operatorname{ReLU}(x) for all \sigma \in \mathbb{R}^{+} and x \in \mathbb{R}^{+} . Since p(x)>0 and \operatorname{ReLU}(x)=x for x>0 , this reduces to:
\begin{array}{r}
p(x) \cdot \operatorname{TeLU}(x)<p(x) \cdot \operatorname{ReLU}(x) \\
\operatorname{TeLU}(x)<\operatorname{ReLU}(x) \\
x \cdot \tanh \left(e^{x}\right)<x \\
\tanh \left(e^{x}\right)<1 ,
\end{array}
which holds because \tanh (w)<1 for all w \in \mathbb{R} .
Next, we show that p(x) \cdot \operatorname{TeLU}(x)=p(x) \cdot \operatorname{ReLU}(x) when x=0 for all \sigma \in \mathbb{R}^{+} :
\begin{array}{r}
p(0) \cdot \operatorname{TeLU}(0)=p(0) \cdot \operatorname{ReLU}(0) \\
\operatorname{TeLU}(0)=\operatorname{ReLU}(0) \\
0 \cdot \tanh \left(e^{0}\right)=\max (0,0) \\
0=0 .
\end{array}
Finally, we show that p(x) \cdot \operatorname{TeLU}(x)<p(x) \cdot \operatorname{ReLU}(x) for all \sigma \in \mathbb{R}^{+} and x \in \mathbb{R}^{-} . Since p(x)>0 and \operatorname{ReLU}(x)=0 for x<0 , this reduces to:
\begin{array}{r}
p(x) \cdot \operatorname{TeLU}(x)<p(x) \cdot \operatorname{ReLU}(x) \\
\operatorname{TeLU}(x)<\operatorname{ReLU}(x) \\
x \cdot \tanh \left(e^{x}\right)<0 \\
\tanh \left(e^{x}\right)>0 ,
\end{array}
where the final inequality follows from dividing both sides by the negative x and holds because \tanh \left(e^{x}\right)>0 for all x \in \mathbb{R} .
Since we have shown that p(x) \cdot \operatorname{TeLU}(x) \leq p(x) \cdot \operatorname{ReLU}(x) for all x \in \mathbb{R} , with strict inequality for all x \neq 0 , Lemma B.2 tells us that \int_{-\infty}^{\infty} p(x) \cdot \operatorname{TeLU}(x) d x<\int_{-\infty}^{\infty} p(x) \cdot \operatorname{ReLU}(x) d x . \therefore E[\operatorname{TeLU}(x)]<E[\operatorname{ReLU}(x)] .
Now, we show that E[\operatorname{TeLU}(x)] is positive for all \sigma \in \mathbb{R}^{+} :
We analyze TeLU(x):
- e^{x} is positive for all x \in \mathbb{R}
- since \tanh (w) is positive for all w>0 and e^{x}>0 , \tanh \left(e^{x}\right) is positive for all x \in \mathbb{R}
- x multiplies the positive factor \tanh \left(e^{x}\right) , so \operatorname{sign}(x)=\operatorname{sign}\left(x \cdot \tanh \left(e^{x}\right)\right)
- p(x) \cdot \operatorname{TeLU}(x) is positive for all x \in \mathbb{R}^{+} , since p(x) is always positive
- p(x) \cdot \operatorname{TeLU}(x) is 0 when x=0 , since multiplying by x=0 yields 0
- p(x) \cdot \operatorname{TeLU}(x) is negative for all x \in \mathbb{R}^{-} , since p(x) is always positive
For E[\operatorname{TeLU}(x)]=\int_{-\infty}^{\infty} p(x) \cdot \operatorname{TeLU}(x) d x=\int_{-\infty}^{\infty} \frac{1}{\sigma \sqrt{2 \pi}} \cdot \exp \left(\frac{-x^{2}}{2 \sigma^{2}}\right) \cdot \operatorname{TeLU}(x) d x to be positive for all \sigma \in \mathbb{R}^{+} , we must show that |p(-\epsilon) \cdot \operatorname{TeLU}(-\epsilon)|<p(\epsilon) \cdot \operatorname{TeLU}(\epsilon) for all \epsilon \in \mathbb{R}^{+} . In other words, we must show that the positive component of p(x) \cdot \operatorname{TeLU}(x) is always a strict upper bound on the absolute value of its negative component, evaluated symmetrically about x=0 . The case \epsilon=0 can be disregarded, as p(0) \cdot \operatorname{TeLU}(0)=0 .
Hence, we show that |p(-\epsilon) \cdot \operatorname{TeLU}(-\epsilon)|<p(\epsilon) \cdot \operatorname{TeLU}(\epsilon) for all \epsilon \in \mathbb{R}^{+} :
\begin{array}{r}
|p(-\epsilon) \cdot \operatorname{TeLU}(-\epsilon)|<p(\epsilon) \cdot \operatorname{TeLU}(\epsilon) \\
-p(-\epsilon) \cdot \operatorname{TeLU}(-\epsilon)<p(\epsilon) \cdot \operatorname{TeLU}(\epsilon) \\
-\operatorname{TeLU}(-\epsilon)<\operatorname{TeLU}(\epsilon) \\
-\left(-\epsilon \cdot \tanh \left(e^{-\epsilon}\right)\right)<\epsilon \cdot \tanh \left(e^{\epsilon}\right) \\
\epsilon \cdot \tanh \left(e^{-\epsilon}\right)<\epsilon \cdot \tanh \left(e^{\epsilon}\right) \\
\tanh \left(e^{-\epsilon}\right)<\tanh \left(e^{\epsilon}\right) \\
e^{-\epsilon}<e^{\epsilon} \\
-\epsilon<\epsilon
\end{array}
This is true for all \epsilon \in \mathbb{R}^{+} ; the cancellation of p in the third step uses the symmetry p(-\epsilon)=p(\epsilon)>0 , and the final steps use the strict monotonicity of \tanh and \exp .
In summary, we have shown that 0<E[\operatorname{TeLU}(x)]<E[\operatorname{ReLU}(x)] for any \sigma \in \mathbb{R}^{+} . This implies that |E[\operatorname{TeLU}(x)]|<|E[\operatorname{ReLU}(x)]| , showing that \operatorname{TeLU}(x) exhibits better zero-centering of activation than \operatorname{ReLU}(x) for any standard deviation \sigma \in \mathbb{R}^{+} , given that the input x follows a Gaussian distribution with mean \mu=0 .
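A quick Monte Carlo check of this conclusion, under the zero-mean Gaussian input assumption of Theorem B.1 (here with an arbitrarily chosen \sigma=2 ), can be run as follows.

```python
import torch

torch.manual_seed(0)
x = torch.randn(1_000_000) * 2.0            # Gaussian input, mean 0, sigma = 2
telu_mean = (x * torch.tanh(torch.exp(x))).mean()
relu_mean = torch.clamp(x, min=0).mean()
print(f"|E[TeLU(x)]| = {telu_mean.abs():.4f}  <  |E[ReLU(x)]| = {relu_mean.abs():.4f}")
```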
B. 3 Isolated Zero of TeLU
Let \sigma be an activation function given as y=\sigma(x) , where x is the input and y is the output. Let \mathcal{F}(\Theta) be the family of functions parameterized by \Theta that use the \sigma non-linearity. If \mathcal{F} is optimized with respect to the objective function \mathcal{L}(\Theta) using standard backpropagation of error, we show that \sigma applied to any function f avoids vanishing-gradient issues in the neural network.
Theorem B. 2 Let f: \mathbb{R} \rightarrow \mathbb{R} be a function defined by f(x)=x \cdot \tanh \left(e^{x}\right) . The derivative of f, f^{\prime}(x) , is given by
f^{\prime}(x)=x \cdot\left(1-\tanh ^{2}\left(e^{x}\right)\right) \cdot e^{x}+\tanh \left(e^{x}\right)
Then, the set \left\{x \in \mathbb{R} \mid f^{\prime}(x) \neq 0\right\} is dense in \mathbb{R} . Moreover, there exists a countable set \left\{x_{i}\right\}_{i \in \mathbb{N}} \subset \mathbb{R} where f^{\prime}\left(x_{i}\right)=0 for each i . Each point x_{i} is isolated, in the sense that for each x_{i} , there exists an \epsilon_{i}>0 such that if x \in\left(x_{i}-\epsilon_{i}, x_{i}+\epsilon_{i}\right) and x \neq x_{i} , then f^{\prime}(x) \neq 0 .
Proof: The derivative of f(x) with respect to x is given by:
f^{\prime}(x)=\frac{d}{d x}\left(x \cdot \tanh \left(e^{x}\right)\right)
Applying the product rule and the chain rule, we find:
f^{\prime}(x)=\tanh \left(e^{x}\right)+x \cdot\left(1-\tanh ^{2}\left(e^{x}\right)\right) \cdot e^{x}
We analyze this derivative in two parts:
- \tanh \left(e^{x}\right) is always positive, since e^{x} is positive for all x and \tanh (z) is bounded between 0 and 1 for all positive z .
- 1-\tanh ^{2}\left(e^{x}\right) is always positive since |\tanh (z)|<1 for all z , and e^{x} is positive for all real x .
Thus, the second term x \cdot\left(1-\tanh ^{2}\left(e^{x}\right)\right) \cdot e^{x} is non-zero unless x=0 , while at x=0 the first term \tanh \left(e^{x}\right) remains non-zero. For negative x the two terms have opposite signs and can cancel; as argued below, such cancellations occur only at isolated points x_{i} , so f^{\prime}(x) is non-zero for all x \neq x_{i} .
Isolated Zeros: For f^{\prime}(x)=0 , we must have x \cdot e^{x} \cdot \operatorname{sech}^{2}\left(e^{x}\right)=-\tanh \left(e^{x}\right) . Given the properties of \tanh and \exp , the solutions to this equation are isolated because both sides represent continuous, differentiable functions with fundamentally different growth rates, ensuring that any intersections are isolated points. The above construction is based on functional analysis; however, it is important to note that f^{\prime}(x)=0 has no closed-form solution, so the location of x_{i} must be found numerically and its effective bound depends on system precision. For instance, using the Newton-Raphson method we find that at x \approx-1.07886 the derivative evaluates to f^{\prime}(x) \approx 4.6 \times 10^{-48} . The majority of systems cannot represent such small magnitudes distinctly from zero and will round this value to zero. In other words, depending on precision, f^{\prime}(x) will be treated as reaching zero, so the effective bound or range changes with the precision of the system.
Formal Statement
\forall x \in \mathbb{R}, \exists \epsilon>0, \text { such that if }\left|f^{\prime}(x)\right|<\epsilon, \text { then } \epsilon \approx 0, \text { but } \epsilon \neq 0
This means for all real numbers x , there exists an \epsilon greater than zero (indicating an extremely small magnitude) such that if the absolute value of f^{\prime}(x) is less than \epsilon , then \epsilon is approximately zero. This indicates that while f^{\prime}(x) may approach very close to zero for some values of x , it does not strictly equal zero except possibly under conditions that are negligible for practical purposes.
Density of Non-Zero Derivative: The non-zero values of f^{\prime}(x) constitute a dense subset of \mathbb{R} since the conditions for f^{\prime}(x)=0 require a specific balance that is only met at isolated points, as shown above. Between these points, f^{\prime}(x) maintains non-zero values, ensuring the gradient does not vanish across these intervals.
Hence, we conclude :
f^{\prime}(x) \neq 0 \text { for all } x \neq x_{i} \in \mathbb{R}
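As a minimal numerical complement to the argument above, the sketch below locates the isolated zero of f^{\prime}(x) by simple bisection (chosen here for brevity in place of the Newton-Raphson iteration mentioned in the proof); the bracketing interval is an assumption justified by the sign change of f^{\prime} on it.

```python
import math

def telu_grad(x: float) -> float:
    # f'(x) = tanh(e^x) + x * e^x * (1 - tanh^2(e^x))
    th = math.tanh(math.exp(x))
    return th + x * math.exp(x) * (1.0 - th * th)

# f' changes sign on [-2, -0.5], so bisection converges to the isolated zero.
lo, hi = -2.0, -0.5
for _ in range(60):
    mid = 0.5 * (lo + hi)
    if telu_grad(lo) * telu_grad(mid) <= 0.0:
        hi = mid
    else:
        lo = mid
print(f"isolated zero of f'(x) near x = {0.5 * (lo + hi):.5f}")   # approximately -1.07886
```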
B. 4 Bounded Saturation and Growth of TeLU
Next, we prove TeLU's saturating decay for negative inputs and bounded growth for positive inputs, which contribute to stability during training by helping to avoid the exploding gradient issue.
Theorem B. 3 The function f(x)=x \cdot \tanh \left(e^{x}\right) exhibits stable behavior for any neural network.
Proof: Bounded Output: The hyperbolic tangent function \tanh (z) has outputs bounded between -1 and 1 . Therefore, for any real number x , the magnitude of the product x \cdot \tanh \left(e^{x}\right) never exceeds |x| , so the activation cannot amplify its input, contributing to stability. Mathematically, this can be expressed as:
-|x| \leq f(x) \leq|x|
- Non-zero Gradient: The derivative of f(x) , given by
f^{\prime}(x)=\tanh \left(e^{x}\right)+x \cdot\left(1-\tanh ^{2}\left(e^{x}\right)\right) \cdot e^{x}
is always non-zero for all real x besides our isolated zero when x \approx-1.07886 . This ensures that the gradients do not vanish during backpropagation, which is crucial for stable learning in deep networks.
- Controlled Growth for Positive x : As x \rightarrow \infty , the function grows linearly since \tanh \left(e^{x}\right) approaches 1 . After being scaled by the identity x , TeLU approaches linear growth as x \rightarrow \infty . This linear growth is more stable than exponential growth, which could lead to exploding gradients.
- Saturating Behavior for Negative x : As x \rightarrow-\infty , x grows linearly in magnitude in the negative direction, while e^{x} decays exponentially towards 0 . Since \tanh (z) approaches the identity as z \rightarrow 0 , \tanh \left(e^{x}\right) inherits this exponential decay towards 0 . The product therefore satisfies \lim _{x \rightarrow-\infty} x \cdot \tanh \left(e^{x}\right)=\lim _{x \rightarrow-\infty} \frac{x}{e^{-x}}=0 . The resulting saturation towards 0 helps prevent the function from contributing to exploding gradients during training.
Therefore, due to its bounded output, non-zero gradient besides an isolated zero, controlled growth for positive values, and saturating behavior for negative values, the function f(x)=x \cdot \tanh \left(e^{x}\right) is shown to be stable in the context of neural network activations.
B. 5 Robustness of TeLU
Next, we show TeLU is more robust to small noise and perturbations compared to ReLU, which is an important property for designing adversarial-resistant neural networks.
Theorem B. 4 The function f(x)=x \cdot \tanh \left(e^{x}\right) is more robust than \operatorname{ReLU}, g(x)=\max (0, x) , against small perturbations or noise in the input.
Proof: We analyze the derivative of f(x) to show robustness to small perturbations. The derivative gives the rate of change of the function with respect to changes in the input; a small derivative magnitude indicates robustness to small changes or noise in the input. The derivative of g(x)=\operatorname{ReLU}(x) is:
g^{\prime}(x)=\left\{\begin{array}{ll}
0 & \text { if } x<0 \\
1 & \text { if } x>0 \\
\text { undefined } & \text { if } x=0
\end{array}\right.
This derivative shows that for x>0 , the function is sensitive to changes, as even small positive changes in x will result in a change in output. The function is insensitive to changes for x<0 , as the output remains zero. The derivative is undefined at x=0 , indicating a discontinuity, which can be problematic for stability.
The derivative of f(x)=\operatorname{TeLU}(x) is given by:
f^{\prime}(x)=\tanh \left(e^{x}\right)+x \cdot\left(1-\tanh ^{2}\left(e^{x}\right)\right) \cdot e^{x}
Consider the behavior of f^{\prime}(x) for different ranges of x :
For large negative x : As x becomes very negative, e^{x} approaches 0 , making \tanh \left(e^{x}\right) and its derivative small. Thus, f^{\prime}(x) becomes small, indicating that f(x) is not highly sensitive to small changes in x .
For small x around 0 : Here, e^{x} is close to 1 , so \tanh \left(e^{x}\right) \approx \tanh (1) \approx 0.76 . The term x \cdot\left(1-\tanh ^{2}\left(e^{x}\right)\right) \cdot e^{x} is also small in magnitude. Hence, f^{\prime}(x) remains moderate, suggesting that f(x) does not change drastically for small perturbations around 0 .
For large positive x : Although e^{x} grows, the term \tanh \left(e^{x}\right) approaches 1 , limiting the growth of f(x) . The term x \cdot\left(1-\tanh ^{2}\left(e^{x}\right)\right) \cdot e^{x} becomes small as x increases, due to the saturation of \tanh \left(e^{x}\right) . Thus, f^{\prime}(x) remains bounded.
Since f^{\prime}(x) does not exhibit large values across the range of x , it indicates that f(x) does not change disproportionately for small changes in x , thereby demonstrating robustness to small perturbations or noise.
B. 6 Lipschitz Continuity of TeLU
Next, we establish a stronger property: TeLU is Lipschitz continuous, which implies uniform continuity of the function.
Theorem B. 5 The function f: \mathbb{R} \rightarrow \mathbb{R} , defined by f(x)=x \cdot \tanh \left(e^{x}\right) , is Lipschitz continuous on the real line \mathbb{R}
Proof: To demonstrate that f is Lipschitz continuous, we seek a constant L such that for all x, y \in \mathbb{R} , the inequality
|f(x)-f(y)| \leq L|x-y|
is satisfied. A sufficient condition for this is that the derivative of f, f^{\prime}(x) , is bounded on \mathbb{R} .
The derivative of f is given by
f^{\prime}(x)=\tanh \left(e^{x}\right)+x \cdot \frac{e^{x}}{\cosh ^{2}\left(e^{x}\right)}
We analyze the boundedness of f^{\prime}(x) in two parts:
- The function \tanh \left(e^{x}\right) is bounded on \mathbb{R} as tanh outputs values in (-1,1) .
- For the term x \cdot \frac{e^{x}}{\cosh {2}\left(e{x}\right)} , we consider its behavior as x approaches infinity and negative infinity:
\begin{array}{l}
\lim _{x \rightarrow \infty}\left|x \cdot \frac{e^{x}}{\cosh ^{2}\left(e^{x}\right)}\right|=0 \\
\lim _{x \rightarrow-\infty}\left|x \cdot \frac{e^{x}}{\cosh ^{2}\left(e^{x}\right)}\right|=0
\end{array}
Since both limits are 0 and the term is continuous, the term x \cdot \frac{e^{x}}{\cosh ^{2}\left(e^{x}\right)} is bounded on \mathbb{R} .
Combining these findings, we conclude that \left|f^{\prime}(x)\right| is bounded on \mathbb{R} ; numerically, its supremum is only slightly above 1 , attained at a small positive x . We may therefore take L=\sup _{x \in \mathbb{R}}\left|f^{\prime}(x)\right| as the Lipschitz constant.
Hence, f(x)=x \cdot \tanh \left(e^{x}\right) is Lipschitz continuous on \mathbb{R} with Lipschitz constant L=\sup _{x \in \mathbb{R}}\left|f^{\prime}(x)\right| .
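As a quick numerical sanity check of the boundedness argument above, the grid search below estimates \sup _{x}\left|f^{\prime}(x)\right| ; the grid range and resolution are arbitrary choices.

```python
import numpy as np

x = np.linspace(-20.0, 20.0, 400_001)
th = np.tanh(np.exp(x))
grad = th + x * np.exp(x) * (1.0 - th ** 2)     # f'(x) in closed form
idx = np.abs(grad).argmax()
print(f"max |f'(x)| on the grid: {np.abs(grad)[idx]:.4f} at x = {x[idx]:.3f}")
```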
B. 7 Smooth Loss Landscape of TeLU
Next, we show that TeLU induces a smoother loss landscape, which leads to faster convergence.
Theorem B. 6 Given a neural network \mathcal{N} with activation function f(x)=x \cdot \tanh \left(e^{x}\right) , parameters \theta , and a differentiable loss function \mathcal{L}(\theta) , the Fisher Information Matrix I(\theta) defined as
I(\theta)=\mathbb{E}_{(x, y) \sim \mathcal{D}}\left[\nabla_{\theta} \log p(y \mid x ; \theta) \nabla_{\theta} \log p(y \mid x ; \theta)^{\top}\right]
leads to a smoother optimization landscape during training of \mathcal{N} .
Proof: Continuity and Differentiability of f(x)
The activation function f(x)=x \cdot \tanh \left(e^{x}\right) and its derivative are analyzed:
\begin{aligned}
f(x) & =x \cdot \tanh \left(e^{x}\right), \\
\text { where } \tanh (u) & =\frac{e^{2 u}-1}{e^{2 u}+1} . \\
\text { Thus, } f^{\prime}(x) & =\frac{d}{d x}\left(x \cdot \tanh \left(e^{x}\right)\right) \\
& =\tanh \left(e^{x}\right)+x \cdot \frac{d}{d x} \tanh \left(e^{x}\right) \\
& =\tanh \left(e^{x}\right)+x \cdot e^{x} \cdot\left(1-\tanh ^{2}\left(e^{x}\right)\right) .
\end{aligned}
Since \tanh (u) and e^{x} are continuously differentiable, f(x) and f^{\prime}(x) are also continuously differentiable.
Impact on Fisher Information Matrix
Applying the chain rule to compute the gradient of the log-likelihood:
\begin{aligned}
\nabla_{\theta} \log p(y \mid x ; \theta) & =\frac{\partial \log p(y \mid x ; \theta)}{\partial \mathcal{N}} \cdot \frac{\partial \mathcal{N}}{\partial \theta} \\
& =\text { gradient of the output with respect to the network's parameters. }
\end{aligned}
The gradient involves terms from f^{\prime}(x) due to the activation function in each layer:
f^{\prime}(x)=\tanh \left(e^{x}\right)+x \cdot e^{x} \cdot\left(1-\tanh ^{2}\left(e^{x}\right)\right)
Thus, I(\theta) becomes a matrix of expectations of outer products of these gradients:
I(\theta)=\mathbb{E}_{(x, y) \sim \mathcal{D}}\left[\nabla_{\theta} \log p(y \mid x ; \theta) \nabla_{\theta} \log p(y \mid x ; \theta)^{\top}\right]
The smoothness of f^{\prime}(x) translates to a smoother I(\theta) .
Smoother Optimization Landscape
In gradient descent, parameter updates are governed by:
\theta^{(t+1)}=\theta^{(t)}-\eta \cdot \nabla_{\theta} \mathcal{L}\left(\theta^{(t)}\right),
where \eta is the learning rate. The gradient of the loss function \nabla_{\theta} \mathcal{L}(\theta) is influenced by I(\theta) . A smoother I(\theta) results in more stable and consistent gradient updates, avoiding erratic steps often observed in rougher optimization landscapes. This leads to enhanced stability in finding the minima of \mathcal{L}(\theta) .
Hence, the continuously differentiable nature of f(x)=x \cdot \tanh \left(e^{x}\right) and its derivative ensures that the Fisher Information Matrix I(\theta) of the neural network \mathcal{N} promotes a smoother optimization landscape, facilitating more effective training dynamics.
B. 8 Global Convergence of TeLU
Based on the properties of TeLU shown in Theorem B.6, we can prove global convergence of training under certain conditions.
Theorem B.7 Let \mathcal{N} be a neural network employing the activation function f(x)=x \cdot \tanh \left(e^{x}\right) in its architecture. Assume the network parameters are denoted by \theta and the network is trained using a differentiable loss function \mathcal{L}(\theta) . If \mathcal{L}(\theta) satisfies the Polyak-Lojasiewicz (PL) condition, then the gradient descent optimization on \mathcal{N} converges to a global minimum, significantly influenced by the properties of f(x) and its derivative f^{\prime}(x) .
Proof: Smoothness and Boundedness of f(x) and f^{\prime}(x) :
The function f(x)=x \cdot \tanh \left(e^{x}\right) is continuously differentiable. Its derivative, given by
f^{\prime}(x)=\tanh \left(e^{x}\right)+x \cdot e^{x} \cdot\left(1-\tanh ^{2}\left(e^{x}\right)\right),
is also continuously differentiable and bounded due to the inherent properties of the tanh function and the exponential function. These properties ensure smooth and well-conditioned gradient computations throughout the optimization process.
Influence on Gradient Descent under PL Condition:
Given the PL condition, for a global minimum \theta^{*} , there exists \mu>0 such that
2 \mu\left(\mathcal{L}(\theta)-\mathcal{L}\left(\theta^{*}\right)\right) \leq\left\|\nabla_{\theta} \mathcal{L}(\theta)\right\|^{2} \text { for all } \theta .
The gradient descent update rule is
\theta^{(t+1)}=\theta^{(t)}-\eta \cdot \nabla_{\theta} \mathcal{L}\left(\theta^{(t)}\right),
where \eta is the learning rate.
Convergence Analysis:
Utilizing the smoothness and boundedness of f^{\prime}(x) , along with the PL condition, it can be shown that
\mathcal{L}\left(\theta^{(t+1)}\right) \leq \mathcal{L}\left(\theta^{(t)}\right)-\eta \cdot\left\|\nabla_{\theta} \mathcal{L}\left(\theta^{(t)}\right)\right\|^{2},
which implies
\mathcal{L}\left(\theta^{(t)}\right)-\mathcal{L}\left(\theta^{*}\right) \leq(1-2 \mu \eta)^{t}\left(\mathcal{L}\left(\theta^{(0)}\right)-\mathcal{L}\left(\theta^{*}\right)\right) .
Therefore, \mathcal{L}\left(\theta^{(t)}\right) converges to \mathcal{L}\left(\theta^{*}\right) as t \rightarrow \infty .