Adaptivity and Non-stationarity: Problem-dependent Dynamic Regret for Online Convex Optimization

Peng Zhao        zhaop@lamda.nju.edu.cn
Yu-Jie Zhang     zhangyj@lamda.nju.edu.cn
Lijun Zhang      zhanglj@lamda.nju.edu.cn
Zhi-Hua Zhou     zhouzh@lamda.nju.edu.cn

National Key Laboratory for Novel Software Technology, Nanjing University, Nanjing 210023, China

©2021 Peng Zhao, Yu-Jie Zhang, Lijun Zhang, and Zhi-Hua Zhou. License: CC-BY 4.0, see https://creativecommons.org/licenses/by/4.0/

Abstract

We investigate online convex optimization in non-stationary environments and choose the dynamic regret as the performance measure, defined as the difference between the cumulative loss incurred by the online algorithm and that of any feasible comparator sequence. Let $T$ be the time horizon and $P_T$ be the path length that essentially reflects the non-stationarity of the environments; the state-of-the-art dynamic regret is $\mathcal{O}(\sqrt{T(1 + P_T)})$. Although this bound is proved to be minimax optimal for convex functions, in this paper we demonstrate that it is possible to further enhance the guarantee for some easy problem instances, particularly when the online functions are smooth. Specifically, we introduce novel online algorithms that can exploit smoothness and replace the dependence on $T$ in the dynamic regret with problem-dependent quantities: the variation in gradients of loss functions, the cumulative loss of the comparator sequence, and the minimum of these two terms. These quantities are at most $\mathcal{O}(T)$ while they could be much smaller in benign environments. Therefore, our results are adaptive to the intrinsic difficulty of the problem, since the bounds are tighter than existing results for easy problems and meanwhile guarantee the same rate in the worst case. Notably, our proposed algorithms can achieve favorable dynamic regret with only one gradient per iteration, sharing the same gradient query complexity as the static regret minimization methods. To accomplish this, we introduce the framework of collaborative online ensemble. The proposed framework employs a two-layer online ensemble to handle non-stationarity, uses optimistic online learning, and further introduces crucial correction terms to enable effective collaboration within the meta-base two layers, thereby attaining adaptivity. We believe the framework can be useful for broader problems.

Keywords: Online Learning, Online Convex Optimization, Dynamic Regret, Problem-dependent Bounds, Gradient Variation, Optimistic Mirror Descent, Online Ensemble

1. Introduction

In many real-world applications, data are inherently accumulated over time, and thus it is of great importance to develop a learning system that updates in an online fashion. Online Convex Optimization (OCO) is a powerful paradigm for learning in such scenarios, which can be regarded as an iterative game between a player and an adversary. At iteration $t$, the player chooses a decision vector $x_t$ from a convex set $\mathcal{X} \subseteq \mathbb{R}^d$. Subsequently, the adversary discloses a convex function $f_t: \mathcal{X} \mapsto \mathbb{R}$, and the player incurs a loss denoted by $f_t(x_t)$.
The standard performance measure is the (static) regret (Zinkevich, 2003),

\[
\text{S-Regret}_T = \sum_{t=1}^{T} f_t(x_t) - \min_{x \in \mathcal{X}} \sum_{t=1}^{T} f_t(x), \qquad (1)
\]

which is the difference between the cumulative loss incurred by the online algorithm and that of the best decision in hindsight. The rationale behind such a metric is that the best fixed decision in hindsight is reasonably good over all the iterations. However, this might be too optimistic and may not hold in changing environments, where data are evolving and the optimal decision is drifting over time. To address this limitation, dynamic regret is proposed to compete with changing comparators $u_1, \dots, u_T \in \mathcal{X}$,

\[
\text{D-Regret}_T(u_1, \dots, u_T) = \sum_{t=1}^{T} f_t(x_t) - \sum_{t=1}^{T} f_t(u_t), \qquad (2)
\]

which has drawn considerable attention recently (Zhang et al., 2018a; Zhao et al., 2020b; Cutkosky, 2020; Zhao et al., 2021a; Baby and Wang, 2021). The measure is also called the universal dynamic regret (or general dynamic regret), in the sense that it gives a universal guarantee that holds against any comparator sequence. Note that the static regret (1) can be viewed as its special form by choosing the comparators as the fixed best decision in hindsight. Moreover, a variant that appears frequently in the literature is called the worst-case dynamic regret (Besbes et al., 2015; Jadbabaie et al., 2015; Mokhtari et al., 2016; Yang et al., 2016; Wei et al., 2016; Zhang et al., 2017; Baby and Wang, 2019; Yuan and Lamperski, 2020; Zhao et al., 2020a; Zhang et al., 2020a,b; Zhao and Zhang, 2021), defined as

\[
\text{D-Regret}_T(x_1^*, \dots, x_T^*) = \sum_{t=1}^{T} f_t(x_t) - \sum_{t=1}^{T} f_t(x_t^*), \qquad (3)
\]

which specializes the general form (2) with comparators $u_t = x_t^* \in \arg\min_{x \in \mathcal{X}} f_t(x)$. Therefore, universal dynamic regret is very general and can include the static regret (1) and the worst-case dynamic regret (3) as special cases by different instantiations of the comparators.

We further remark that the worst-case dynamic regret is often too pessimistic, whereas the universal one is more adaptive to non-stationary environments. Actually, the changes of online functions usually come from two aspects: one is the sampling randomness and the other is the environmental non-stationarity, and clearly the latter is the main focus of non-stationary online learning. Optimizing the worst-case dynamic regret can be problematic in some cases. For example, consider the stochastic optimization task where the $f_t$'s are independently sampled from the same distribution; minimizing the worst-case dynamic regret is then evidently inappropriate and will eventually lead to overfitting (Zhang et al., 2018a), because the minimizer of an online loss function could be dramatically different from the minimizer of the expected loss function due to the sampling randomness.

There are many studies on the worst-case dynamic regret (Besbes et al., 2015; Jadbabaie et al., 2015; Mokhtari et al., 2016; Yang et al., 2016; Zhang et al., 2017, 2018b; Baby and Wang, 2019; Zhang et al., 2020b; Zhao and Zhang, 2021), but only a few results are known for the universal dynamic regret. Zinkevich (2003) shows that online gradient descent (OGD) with a step size $\eta > 0$ achieves an $\mathcal{O}((1 + P_T)/\eta + \eta T)$ universal dynamic regret, where
$P_T = \sum_{t=2}^{T} \|u_{t-1} - u_t\|_2$ is the path length of the comparators $u_1, \dots, u_T$ and thus reflects the non-stationarity of the environments. If the path length $P_T$ were known, one could choose the optimal step size $\eta_* = \Theta(\sqrt{(1 + P_T)/T})$ and attain an $\mathcal{O}(\sqrt{T(1 + P_T)})$ dynamic regret. However, this path length quantity is hard to know since the universal dynamic regret aims to provide guarantees against any feasible comparator sequence. The step size $\eta = \Theta(1/\sqrt{T})$ commonly used in static regret would lead to an inferior $\mathcal{O}(\sqrt{T}(1 + P_T))$ bound, which exhibits a large gap from the favorable bound with an oracle step size tuning. Zhang et al. (2018a) resolve the issue by proposing a novel online algorithm to search for the optimal step size $\eta_*$, attaining an $\mathcal{O}(\sqrt{T(1 + P_T)})$ universal dynamic regret, and they also establish an $\Omega(\sqrt{T(1 + P_T)})$ lower bound to show the minimax optimality.

Although the rate is minimax optimal for convex functions, we would like to design algorithms with more adaptive bounds. Specifically, we aim to enhance the guarantee for some easy problem instances, particularly when the online functions are smooth, by replacing the dependence on $T$ with certain problem-dependent quantities that are $\mathcal{O}(T)$ in the worst case while could be much smaller in benign environments. In the study of static regret, one can attain such results as small-loss bounds (Srebro et al., 2010) and gradient-variation bounds (Chiang et al., 2012). Thus, a natural question arises: is it possible to achieve similar problem-dependent guarantees for the universal dynamic regret?

Our results. In this paper, extending our preliminary work (Zhao et al., 2020b), we provide an affirmative answer by designing online algorithms with problem-dependent dynamic regret bounds. Specifically, we focus on the following two adaptive quantities: the gradient variation of online functions $V_T$, and the cumulative loss of the comparator sequence $F_T$, defined as

\[
V_T = \sum_{t=2}^{T} \sup_{x \in \mathcal{X}} \|\nabla f_t(x) - \nabla f_{t-1}(x)\|_2^2, \quad \text{and} \quad F_T = \sum_{t=1}^{T} f_t(u_t). \qquad (4)
\]

The two problem-dependent quantities are both at most $\mathcal{O}(T)$ under standard assumptions of online learning, while they could be much smaller in easier problem instances (a small illustrative snippet for evaluating these quantities is given below).

We propose two novel online algorithms called Sword and Sword++ ("Sword" is short for Smoothness-aware online learning with dynamic regret) that are suitable for different feedback models. Our algorithms are online ensemble methods (Zhou, 2012; Zhao, 2021), which admit a two-layer structure with a meta-algorithm running over a group of base-learners. We prove that they enjoy an $\mathcal{O}(\sqrt{(1 + P_T + \min\{V_T, F_T\})(1 + P_T)})$ dynamic regret, achieving gradient-variation and small-loss bounds simultaneously. Compared to the $\mathcal{O}(\sqrt{T(1 + P_T)})$ minimax rate, our result replaces the dependence on $T$ with the problem-dependent quantity $P_T + \min\{V_T, F_T\}$. Our bounds become much tighter when the problem is easy (for example, when $P_T$ and $V_T/F_T$ are sublinear in $T$), and meanwhile safeguard the same guarantee in the worst case. Hence, our results are adaptive to the intrinsic difficulty of problem instances as well as the non-stationarity of environments.

Our first algorithm, Sword, achieves the favorable problem-dependent guarantees under the multi-gradient feedback model, namely, the player can query gradient information multiple times at each round. This algorithm is conceptually simple, but the gradient query complexity is $\mathcal{O}(\log T)$ at each round.
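To make the problem-dependent quantities in (4), together with the path length $P_T$, concrete, the following is a small illustrative helper and not part of our algorithms: it evaluates $P_T$ and $F_T$ exactly, and approximates the supremum inside $V_T$ by a maximum over a finite set of sample points from the domain, since that supremum is generally not available in closed form. The function names and the sampling scheme are hypothetical choices made only for illustration.

```python
import numpy as np

def problem_dependent_quantities(losses, grads, comparators, domain_samples):
    """Evaluate P_T and F_T, and approximate V_T from (4).

    losses[t](x)  : value of f_t at x (callables, t = 0, ..., T-1)
    grads[t](x)   : gradient of f_t at x (callables)
    comparators   : array of shape (T, d) holding u_1, ..., u_T
    domain_samples: iterable of points from X used to approximate sup_{x in X}
    """
    T = len(losses)
    # Path length P_T = sum_{t=2}^T ||u_{t-1} - u_t||_2
    P_T = sum(np.linalg.norm(comparators[t - 1] - comparators[t]) for t in range(1, T))
    # Gradient variation V_T, with the sup over X replaced by a max over samples
    V_T = sum(
        max(np.linalg.norm(grads[t](x) - grads[t - 1](x)) ** 2 for x in domain_samples)
        for t in range(1, T)
    )
    # Cumulative comparator loss F_T = sum_{t=1}^T f_t(u_t)
    F_T = sum(losses[t](comparators[t]) for t in range(T))
    return P_T, V_T, F_T
```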
Our second algorithm, Sword++, is an improved version that requires only one gradient per iteration, despite using a two-layer online ensemble structure. As a result, Sword++ is not only computationally efficient but also more
attractive due to its reduced feedback requirements — Sword++ can be applied to the more challenging one-gradient feedback model, where the player only receives the gradient $\nabla f_t(x_t)$ as the feedback after submitting the decision $x_t$. Furthermore, it is worth mentioning that Sword++ has the potential to be extended to more constrained feedback models, such as two-point bandit convex optimization, by further leveraging the technique for gradient-variation static regret minimization presented in (Chiang et al., 2013).

Technical contributions. Note that there exist studies showing that the worst-case dynamic regret can benefit from smoothness (Yang et al., 2016; Zhang et al., 2017; Zhao and Zhang, 2021). However, their analyses do not apply to our universal dynamic regret, as we cannot exploit the optimality condition of the comparators $u_1, \dots, u_T$, in stark contrast with the worst-case dynamic regret analysis. As a result, we propose an adaptive online ensemble method to hedge non-stationarity while extracting adaptivity. Specifically, we employ the meta-base two-layer ensemble to hedge the non-stationarity and utilize optimistic online learning to reuse the historical gradient information adaptively. Two crucial novel ingredients are designed in order to achieve favorable problem-dependent guarantees.

• We introduce optimistic mirror descent (OMD) as a unified building block for the algorithm design of dynamic regret minimization in both the meta and base levels. We present a generic and completely modular analysis for the dynamic regret of OMD, where the negative term is essential for achieving problem-dependent dynamic regret.

• We propose the collaborative online ensemble framework. In addition to employing optimistic online learning to attain adaptivity and a meta-base structure to hedge non-stationarity, we incorporate a novel decision-deviation correction term, which facilitates effective collaboration within the two layers and is crucial for achieving the desired problem-dependent bound while requiring only one gradient per iteration.

We emphasize that these ingredients are particularly important for achieving gradient-variation dynamic regret, which we will demonstrate to be more fundamental than the small-loss bound. In particular, our overall solution (especially Sword++) effectively utilizes negative terms and introduces correction terms to ensure successful collaboration within the two-layer online ensemble. The overall framework of collaborative online ensemble is summarized in Section 5, and we believe that the proposed framework has the potential for broader online learning problems. A minimal sketch of the generic meta-base structure is given at the end of this section.

Organization. The rest is structured as follows. Section 2 briefly reviews the related work. In Section 3, we introduce the problem setup and the algorithmic framework, where a generic dynamic regret analysis of optimistic mirror descent is provided. Section 4 presents our main results, in which the gradient-variation dynamic regret bounds are established. Section 5 illustrates a generic framework called collaborative online ensemble that is highly useful for attaining problem-dependent dynamic regret. Section 6 provides some additional results. The major proofs are presented in Section 7. Furthermore, Section 8 reports the experiments that empirically support our theoretical findings. Finally, we conclude the paper in Section 9. Some omitted details and proofs are provided in the appendix.
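The following is a minimal, simplified sketch of the generic meta-base two-layer structure discussed above, in the spirit of Zhang et al. (2018a): a Hedge-style meta-algorithm aggregates OGD base-learners that run over a geometric grid of step sizes, and all learners share a single gradient query per round through the linearized surrogate loss. It is only an illustration of the ensemble skeleton; it omits the optimism and the correction terms that Sword and Sword++ build in, and the ball domain, step-size grid, and learning rates below are illustrative assumptions rather than the tuning used in our analysis.

```python
import numpy as np

def project_ball(x, radius=1.0):
    """Euclidean projection onto {x : ||x||_2 <= radius}, a stand-in for the domain X."""
    norm = np.linalg.norm(x)
    return x if norm <= radius else x * (radius / norm)

def meta_base_ensemble(grad, T, d, D=2.0, G=1.0):
    """Simplified two-layer online ensemble (no optimism, no correction terms).

    grad(t, x) should return the gradient of f_t at x; D and G play the role of
    the diameter and gradient bounds used to form the step-size grid.
    """
    N = int(np.ceil(np.log2(T))) + 1                            # number of base-learners
    etas = [(D / (G * np.sqrt(T))) * 2.0**i for i in range(N)]  # geometric step-size grid
    base_xs = np.zeros((N, d))                                  # base-learners' decisions
    weights = np.ones(N) / N                                    # meta-algorithm's weights
    eps = np.sqrt(np.log(N) / T)                                # meta learning rate
    decisions = []
    for t in range(1, T + 1):
        x_t = weights @ base_xs                                 # combined (final) decision
        g_t = grad(t, x_t)                                      # one gradient query per round
        # Meta update: exponential weights on the linearized losses <g_t, x_{t,i}>.
        weights = weights * np.exp(-eps * (base_xs @ g_t))
        weights = weights / weights.sum()
        # Base updates: each base-learner runs OGD with its own step size
        # on the shared surrogate gradient g_t.
        for i in range(N):
            base_xs[i] = project_ball(base_xs[i] - etas[i] * g_t)
        decisions.append(x_t)
    return decisions
```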
2. Related Work

In this section, we present a brief review of static regret and dynamic regret minimization for online convex optimization.

2.1 Static Regret

Static regret has been extensively studied in online convex optimization. Let $T$ be the time horizon and $d$ be the dimension; there exist online algorithms with static regret bounded by $\mathcal{O}(\sqrt{T})$, $\mathcal{O}(d \log T)$, and $\mathcal{O}(\log T)$ for convex, exponentially concave, and strongly convex functions, respectively (Zinkevich, 2003; Hazan et al., 2007). These results are proved to be minimax optimal (Abernethy et al., 2008). More results can be found in the seminal books (Shalev-Shwartz, 2012; Hazan, 2016) and the references therein.

In addition to exploiting the convexity of functions, there are studies improving the static regret by incorporating smoothness, whose main proposal is to replace the dependence on $T$ with problem-dependent quantities. Such problem-dependent bounds enjoy many benign properties; in particular, they can safeguard the worst-case minimax rate yet can be much tighter in easier problem instances. There are usually two kinds of such bounds: small-loss bounds (Srebro et al., 2010) and gradient-variation bounds (Chiang et al., 2012).

Small-loss bounds are first introduced in the context of prediction with expert advice (Littlestone and Warmuth, 1994; Freund and Schapire, 1997), which replace the dependence on $T$ with the cumulative loss of the best expert. Later, Srebro et al. (2010) show that in the online convex optimization setting, OGD with a certain step size scheme can achieve an $\mathcal{O}(\sqrt{1 + F_T^*})$ small-loss regret bound when the online convex functions are smooth and non-negative, where $F_T^*$ is the cumulative loss of the best decision in hindsight, namely, $F_T^* = \sum_{t=1}^{T} f_t(x^*)$ with $x^*$ chosen as the offline minimizer. The key technical ingredient in the analysis is to exploit the self-bounding properties of smooth functions.

Gradient-variation bounds are introduced by Chiang et al. (2012), rooted in the development of second-order bounds for prediction with expert advice (Cesa-Bianchi et al., 2005) and online convex optimization (Hazan and Kale, 2008). For convex and smooth functions, Chiang et al. (2012) establish an $\mathcal{O}(\sqrt{1 + V_T})$ gradient-variation regret bound, where $V_T = \sum_{t=2}^{T} \sup_{x \in \mathcal{X}} \|\nabla f_t(x) - \nabla f_{t-1}(x)\|_2^2$ measures the cumulative gradient variation. Gradient-variation bounds are particularly favored in slowly changing environments where online functions evolve gradually.

In addition, problem-dependent static regret bounds are also studied in the bandit online learning setting, including gradient-variation bounds for two-point bandit convex optimization (Chiang et al., 2013), as well as small-loss bounds for multi-armed bandits (Allenberg et al., 2006; Wei and Luo, 2018; Lee et al., 2020b), linear bandits (Lee et al., 2020b), semi-bandits (Neu, 2015), graph bandits (Lykouris et al., 2018; Lee et al., 2020a), and contextual bandits (Allen-Zhu et al., 2018; Foster and Krishnamurthy, 2021), etc.

2.2 Dynamic Regret

Dynamic regret requires the player to compete with time-varying comparators and thus is favored in online learning in open and non-stationary environments (Sugiyama and Kawanabe, 2012; Zhao et al., 2021b; Zhou, 2022). The notion of dynamic regret is sometimes referred
to as the tracking regret or shifting regret in the prediction with expert advice setting (Herbster and Warmuth, 1998, 2001; Bousquet and Warmuth, 2002; Cesa-Bianchi et al., 2012; György and Szepesvári, 2016). It is known that in the worst case, a sublinear dynamic regret is not attainable unless imposing certain regularities on the comparator sequence or the function sequence (Besbes et al., 2015; Jadbabaie et al., 2015). This paper focuses on the most common regularity, called the path length, introduced by Zinkevich (2003), which measures the fluctuation of the comparators and is defined by

\[
P_T = \sum_{t=2}^{T} \|u_{t-1} - u_t\|_2. \qquad (5)
\]

Note that we simply focus on the Euclidean norm throughout this paper, and it is straightforward to extend the notions and results to general primal-dual norms. Other regularities include the squared path length introduced by Zhang et al. (2017), which is defined as $S_T = \sum_{t=2}^{T} \|u_{t-1} - u_t\|_2^2$, and the function variation introduced by Besbes et al. (2015) that measures the cumulative variation with respect to the function values and is defined as $V_T^f = \sum_{t=2}^{T} \sup_{x \in \mathcal{X}} |f_{t-1}(x) - f_t(x)|$.

There are two kinds of dynamic regret notions in previous studies. The universal dynamic regret, as defined in (2), aims to compare with any feasible comparator sequence, while the worst-case dynamic regret defined in (3) specifies the comparator sequence to be the sequence of minimizers of the online functions. In the following, we present the related works respectively. Notice that we will use the notations $P_T$ and $S_T$ for the path length and squared path length of the comparator sequence $\{u_t\}_{t=1,\dots,T}$, while adopting the notations $P_T^*$ and $S_T^*$ for those of the sequence $\{x_t^*\}_{t=1,\dots,T}$, where $x_t^*$ is the minimizer of the online function $f_t$, namely, $P_T^* = \sum_{t=2}^{T} \|x_{t-1}^* - x_t^*\|_2$ and $S_T^* = \sum_{t=2}^{T} \|x_{t-1}^* - x_t^*\|_2^2$.

Universal dynamic regret. The seminal work of Zinkevich (2003) demonstrates that online gradient descent (OGD) enjoys an $\mathcal{O}(\sqrt{T}(1 + P_T))$ universal dynamic regret, which holds against any feasible comparator sequence. Nevertheless, the result is far from the $\Omega(\sqrt{T(1 + P_T)})$ lower bound established by Zhang et al. (2018a), who further close the gap by proposing a novel online algorithm that attains an optimal rate of $\mathcal{O}(\sqrt{T(1 + P_T)})$ for convex functions (Zhang et al., 2018a). Our work further exploits the easiness of the problem instances and achieves problem-dependent regret guarantees, which are hence better than the minimax rate. Zhao et al. (2021a) study the universal dynamic regret for bandit convex optimization under both one-point and two-point feedback models. The universal dynamic regret is also studied for variants of the standard OCO model, such as OCO with memory (Zhao et al., 2022) and OCO with switching cost (Zhang et al., 2021). We note that the aforementioned works and ours all build on the two-layer meta-base structure.

Concurrent to our conference version paper (Zhao et al., 2020b), Cutkosky (2020) proposes a novel online algorithm that achieves the same minimax optimal dynamic regret for convex functions as (Zhang et al., 2018a), without relying on meta-base aggregation. Instead, their method employs the combination strategy developed in parameter-free online learning (Cutkosky and Orabona, 2018; Cutkosky, 2019). We note that it may be possible to modify the algorithm of Cutkosky (2020) to achieve small-loss bounds; however, it is generally challenging to attain gradient-variation bounds, especially under the one-gradient feedback model.
More specifically, it is not hard to modify their framework to be compatible
with optimistic online learning, but one usually needs to exploit additional negative terms to convert the optimistic quantity $\|\nabla f_t(x_t) - \nabla f_{t-1}(x_{t-1})\|_2^2$ to the gradient variation $\sup_{x \in \mathcal{X}} \|\nabla f_t(x) - \nabla f_{t-1}(x)\|_2^2$, in order to eliminate the difference between the decisions $x_t$ and $x_{t-1}$. Our proposed algorithms are built upon the meta-base two-layer framework, which involves a careful exploitation of negative terms in the regret analysis of both the meta and base algorithms, as well as the introduction of additional correction terms. However, as far as we can see, with only one gradient feedback per round, it is challenging for the framework of Cutkosky (2020) to achieve the gradient-variation bound due to the lack of negative terms in the regret analysis.

Worst-case dynamic regret. There are many efforts devoted to studying the worst-case dynamic regret. Yang et al. (2016) prove that OGD enjoys an $\mathcal{O}(\sqrt{T(1 + P_T^*)})$ worst-case dynamic regret for convex functions when the path length $P_T^*$ is known. For strongly convex and smooth functions, Mokhtari et al. (2016) show that an $\mathcal{O}(P_T^*)$ dynamic regret is achievable, and Zhang et al. (2017) further propose the online multiple gradient descent algorithm with an $\mathcal{O}(\min\{P_T^*, S_T^*\})$ guarantee. Yang et al. (2016) show that the $\mathcal{O}(P_T^*)$ rate is attainable for convex and smooth functions, provided that all the minimizers $x_t^*$'s lie in the interior of the domain $\mathcal{X}$. The above results mainly use the (squared) path length as the non-stationarity measure, which measures the cumulative variation of the comparator sequence. In another line of research, researchers use the variation with respect to the function values as the measure. Besbes et al. (2015) show that OGD with a restarting strategy attains an $\mathcal{O}(T^{2/3} (V_T^f)^{1/3})$ regret for convex functions when the function variation $V_T^f$ is available, which is improved to $\mathcal{O}(T^{1/3} (V_T^f)^{2/3})$ for one-dimensional square loss (Baby and Wang, 2019). Chen et al. (2019) extend the results of Besbes et al. (2015) to more general function-variation measures capable of capturing local temporal and spatial changes. To take advantage of variations in both comparator sequences and function values, Zhao and Zhang (2021) provide an improved analysis for online multiple gradient descent and prove an $\mathcal{O}(\min\{P_T^*, S_T^*, V_T^f\})$ worst-case dynamic regret for strongly convex and smooth functions. For convex and smooth functions, they also demonstrate that the simple greedy strategy (i.e., $x_{t+1} = x_t^* \in \arg\min_{x \in \mathcal{X}} f_t(x)$) can effectively optimize the worst-case dynamic regret (Zhao and Zhang, 2021, Section 4.2).

3. Problem Setup and Algorithmic Framework

In this section, we first formally state the problem setup, then introduce the foundational algorithmic framework for dynamic regret minimization, and finally list several assumptions that might be used in the theoretical analysis.

3.1 Problem Setup

Online Convex Optimization (OCO) can be modeled as an iterated game between the player and the environments. At iteration $t \in [T]$, the player first chooses the decision $x_t$ from a convex feasible set $\mathcal{X} \subseteq \mathbb{R}^d$; then the environments reveal the loss function $f_t: \mathcal{X} \mapsto \mathbb{R}$, and the player suffers the loss $f_t(x_t)$ and observes certain information about the function $f_t(\cdot)$. According to the revealed information, online learning problems are typically
classified into full-information online learning and partial-information online learning (also called bandit online learning). In this paper, we focus on the full-information setting, which can be further categorized into the following two setups:

(i) multi-gradient feedback: the player can access the entire gradient function $\nabla f_t(\cdot)$ and thus can evaluate the gradient multiple times;

(ii) one-gradient feedback: the player can observe the gradient information $\nabla f_t(x_t)$ after submitting the decision $x_t$.

In Section 4.2, we develop an online algorithm called Sword with gradient-variation dynamic regret bounds under the multi-gradient feedback model. In Section 4.3, we present an improved algorithm called Sword++ that can achieve the same guarantee under the more challenging one-gradient feedback model.

The typical performance measure is the static regret, which benchmarks the algorithm against a fixed comparator. To handle non-stationary environments, we focus on the strengthened measure called dynamic regret, which compares the online algorithm to a sequence of time-varying comparators $u_1, \dots, u_T \in \mathcal{X}$, as defined in (2). An upper bound of the dynamic regret should be a function of the comparators, and typically the bound depends on some regularities that measure the fluctuation of the comparator sequence, such as the path length $P_T = \sum_{t=2}^{T} \|u_t - u_{t-1}\|_2$. Throughout the paper, we focus on the Euclidean norm for simplicity, and it is straightforward to extend our results to general primal-dual norms.

In addition to the regret measure, we further consider the gradient query complexity. Note that algorithms designed for the multi-gradient feedback model may query the gradients multiple times at each round. However, most algorithms designed for static regret minimization only require one gradient per iteration, namely, using $\nabla f_t(x_t)$ for the next update only. Therefore, it is more desirable to achieve the favorable regret guarantees under the one-gradient feedback model. In other words, our aim is to develop first-order methods for dynamic regret minimization that require only one gradient query per iteration.

3.2 Optimistic Mirror Descent

We employ the algorithmic framework of Optimistic Mirror Descent (OMD) (Chiang et al., 2012; Rakhlin and Sridharan, 2013) as a generic building block for designing algorithms for non-stationary online learning. OMD is a methodology for optimistic online learning. Compared to the standard online learning setup, the player now additionally receives an optimistic vector $M_t \in \mathbb{R}^d$ at each round, which serves as a hint of the future gradient. OMD starts from the initial point $\hat{x}_1 \in \mathcal{X}$ and performs the following two-step update:

\[
x_t = \arg\min_{x \in \mathcal{X}} \ \eta_t \langle M_t, x \rangle + \mathcal{D}_{\psi}(x, \hat{x}_t), \qquad \hat{x}_{t+1} = \arg\min_{x \in \mathcal{X}} \ \eta_t \langle \nabla f_t(x_t), x \rangle + \mathcal{D}_{\psi}(x, \hat{x}_t), \qquad (6)
\]

which first updates by the optimism and then updates by the received gradient. In the above, $\eta_t > 0$ is a (potentially) time-varying step size, and $\mathcal{D}_{\psi}(\cdot, \cdot)$ denotes the Bregman divergence associated with the regularizer $\psi$, defined as $\mathcal{D}_{\psi}(x, y) = \psi(x) - \psi(y) - \langle \nabla \psi(y), x - y \rangle$. We have the following general result regarding the dynamic regret of optimistic mirror descent.
Theorem 1. Suppose that the regularizer $\psi: \mathcal{X} \mapsto \mathbb{R}$ is 1-strongly convex with respect to the norm $\|\cdot\|$, and let $\|\cdot\|_*$ be the dual norm of $\|\cdot\|$. The dynamic regret of Optimistic Mirror Descent, whose update rule is specified in (6), is bounded as follows:

\[
\sum_{t=1}^{T} f_t(x_t) - \sum_{t=1}^{T} f_t(u_t) \le \sum_{t=1}^{T} \eta_t \|\nabla f_t(x_t) - M_t\|_*^2 + \sum_{t=1}^{T} \frac{1}{\eta_t} \big( \mathcal{D}_{\psi}(u_t, \hat{x}_t) - \mathcal{D}_{\psi}(u_t, \hat{x}_{t+1}) \big) - \sum_{t=1}^{T} \frac{1}{\eta_t} \big( \mathcal{D}_{\psi}(\hat{x}_{t+1}, x_t) + \mathcal{D}_{\psi}(x_t, \hat{x}_t) \big), \qquad (7)
\]

which holds for any comparator sequence $u_1, \dots, u_T \in \mathcal{X}$.

Remark 1. The dynamic regret upper bound in Theorem 1 consists of three terms:

(i) the first term $\sum_{t=1}^{T} \eta_t \|\nabla f_t(x_t) - M_t\|_*^2$ is the adaptivity term that measures the deviation between the gradient and the optimism;

(ii) the second term $\sum_{t=1}^{T} \frac{1}{\eta_t} \big( \mathcal{D}_{\psi}(u_t, \hat{x}_t) - \mathcal{D}_{\psi}(u_t, \hat{x}_{t+1}) \big)$ reflects the non-stationarity of the environments and will be converted to the path length of the comparators;

(iii) the last negative term $- \sum_{t=1}^{T} \frac{1}{\eta_t} \big( \mathcal{D}_{\psi}(\hat{x}_{t+1}, x_t) + \mathcal{D}_{\psi}(x_t, \hat{x}_t) \big)$ is crucial and will be greatly useful for problem-dependent bounds, particularly the gradient-variation one.

Moreover, we emphasize that the above regret guarantee is very general due to the flexibility in choosing the regularizer $\psi$ and the comparators $\{u_t\}_{t=1,\dots,T}$. For example, by choosing the negative-entropy regularizer and competing with the best fixed prediction, the result recovers the static regret bound of Optimistic Hedge (Syrgkanis et al., 2015); by choosing the Euclidean regularizer and competing with time-varying comparators, it recovers the dynamic regret bound of Online Gradient Descent (Zinkevich, 2003). The versatility of this optimistic mirror descent framework motivates us to use it as a unified building block for both algorithm design and theoretical analysis.

3.3 Assumptions

In this part, we list several common assumptions that might be used in the theorems.

Assumption 1. The norm of the gradients of the online functions over the domain $\mathcal{X}$ is bounded by $G$, i.e., $\|\nabla f_t(x)\|_2 \le G$ for all $x \in \mathcal{X}$ and $t \in [T]$.

Assumption 2. The domain $\mathcal{X} \subseteq \mathbb{R}^d$ contains the origin $\mathbf{0}$, and the diameter of the domain $\mathcal{X}$ is at most $D$, i.e., $\|x - x'\|_2 \le D$ for any $x, x' \in \mathcal{X}$.

Assumption 3. All the online functions are $L$-smooth, i.e., $\|\nabla f_t(x) - \nabla f_t(x')\|_2 \le L \|x - x'\|_2$ for any $x, x' \in \mathbb{R}^d$ and $t \in [T]$.

Assumption 4. All the online functions are non-negative over $\mathbb{R}^d$.

We have the following remarks regarding the assumptions. The generic dynamic regret of OMD (Theorem 1) does not require the smoothness assumption (Assumption 3). Nevertheless, it is required in attaining the problem-dependent dynamic regret bounds. In fact,
even in the static regret analysis, smoothness is demonstrated to be necessary for first-order methods to achieve gradient-variation bounds (cf. Lemma 9 of Chiang et al. (2012) and Theorem 1 of Yang et al. (2014)). Therefore, throughout the paper we focus on the problem-dependent dynamic regret of convex and smooth functions. Moreover, note that in Assumption 4 we require the online functions to be non-negative outside the domain $\mathcal{X}$, which is a precondition for establishing the self-bounding property of smooth functions and is commonly used to establish small-loss bounds (Srebro et al., 2010; Zhang et al., 2019; Zhang and Zhou, 2019). Meanwhile, we treat double logarithmic factors in $T$ as a constant, following previous studies (Adamskiy et al., 2012; Luo and Schapire, 2015).

4. Gradient-Variation Dynamic Regret

Our paper aims to develop online algorithms that can simultaneously achieve problem-dependent dynamic regret bounds, which scale with two problem-dependent quantities: the gradient-variation term $V_T$ and the small-loss term $F_T$, as defined in (4). As we will demonstrate in the next section, the gradient-variation bound is more fundamental than the small-loss bound. Consequently, we start by focusing on the gradient-variation dynamic regret in this section. In Section 6, we will then present the small-loss bound and the best-of-both-worlds bound as implications of the results obtained in this section.

4.1 A Gentle Start

In the study of static regret, Chiang et al. (2012) propose the online extra-gradient descent (OEGD) algorithm and prove that the algorithm enjoys gradient-variation static regret. Specifically, OEGD starts from $\hat{x}_1, x_1 \in \mathcal{X}$ and then updates by

\[
x_t = \Pi_{\mathcal{X}}\big[\hat{x}_t - \eta \nabla f_{t-1}(x_{t-1})\big], \qquad \hat{x}_{t+1} = \Pi_{\mathcal{X}}\big[\hat{x}_t - \eta \nabla f_t(x_t)\big]. \qquad (8)
\]

Here we consider the algorithm with a fixed step size $\eta > 0$ for simplicity, and $\Pi_{\mathcal{X}}[\cdot]$ denotes the Euclidean projection onto the nearest point in $\mathcal{X}$. For convex and smooth functions, Chiang et al. (2012) prove that OEGD can achieve an $\mathcal{O}(\sqrt{1 + V_T})$ static regret bound, where $V_T = \sum_{t=2}^{T} \sup_{x \in \mathcal{X}} \|\nabla f_t(x) - \nabla f_{t-1}(x)\|_2^2$ is the gradient variation.

In the following, we further demonstrate that OEGD also enjoys a gradient-variation dynamic regret guarantee. Actually, OEGD can be viewed as a specialization of the optimistic mirror descent (6) presented in Section 3.2, obtained by choosing the regularizer $\psi(x) = \frac{1}{2}\|x\|_2^2$ and the optimism $M_t = \nabla f_{t-1}(x_{t-1})$ as well as a fixed step size $\eta > 0$. Then, the general result of Theorem 1 directly implies the following dynamic regret bound for OEGD; the proof can be found in Appendix A.

Lemma 1. Under Assumptions 1, 2, and 3, by choosing $\eta \le \frac{1}{4L}$, the dynamic regret of OMD with the regularizer $\psi(x) = \frac{1}{2}\|x\|_2^2$ and optimism $M_t = \nabla f_{t-1}(x_{t-1})$ is bounded as

\[
\sum_{t=1}^{T} f_t(x_t) - \sum_{t=1}^{T} f_t(u_t) \le \eta (G^2 + 2 V_T) + \frac{1}{2\eta} (D^2 + 2 D P_T), \qquad (9)
\]

where $V_T = \sum_{t=2}^{T} \sup_{x \in \mathcal{X}} \|\nabla f_t(x) - \nabla f_{t-1}(x)\|_2^2$ is the gradient variation and $P_T = \sum_{t=2}^{T} \|u_{t-1} - u_t\|_2$ is the path length. The result holds for any comparator sequence $u_1, \dots, u_T \in \mathcal{X}$.
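To make the update rule (8) concrete, the following is a minimal sketch of OEGD, i.e., OMD (6) with the Euclidean regularizer and optimism $M_t = \nabla f_{t-1}(x_{t-1})$. The Euclidean ball is used only as a stand-in for a generic convex domain $\mathcal{X}$, and the step size and the toy drifting quadratic losses at the end are illustrative choices rather than the tuning prescribed by the analysis.

```python
import numpy as np

def project_ball(x, radius=1.0):
    """Euclidean projection onto {x : ||x||_2 <= radius}, playing the role of Pi_X."""
    norm = np.linalg.norm(x)
    return x if norm <= radius else x * (radius / norm)

def oegd(grads, d, eta=0.1, radius=1.0):
    """Online extra-gradient descent (8): optimistic step, then gradient step.

    grads[t](x) should return the gradient of f_t at x; only one new gradient,
    evaluated at the submitted decision x_t, is used per round.
    """
    x_hat = np.zeros(d)        # \hat{x}_1
    g_prev = np.zeros(d)       # stands for grad of f_{t-1} at x_{t-1}; zero optimism at t = 1
    decisions = []
    for grad in grads:
        x_t = project_ball(x_hat - eta * g_prev, radius)    # optimistic step with M_t
        g_t = grad(x_t)                                     # one gradient query
        x_hat = project_ball(x_hat - eta * g_t, radius)     # update of \hat{x}_{t+1}
        g_prev = g_t
        decisions.append(x_t)
    return decisions

# Toy usage: slowly drifting quadratic losses f_t(x) = ||x - c_t||_2^2.
rng = np.random.default_rng(0)
centers = np.cumsum(0.01 * rng.standard_normal((100, 2)), axis=0)
grads = [lambda x, c=c: 2.0 * (x - c) for c in centers]
decisions = oegd(grads, d=2)
```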