BASGD: Buffered Asynchronous SGD for Byzantine Learning

Yi-Rui Yang 1  Wu-Jun Li 1

1 National Key Laboratory for Novel Software Technology, Department of Computer Science and Technology, Nanjing University, China. Correspondence to: Wu-Jun Li.

Proceedings of the 38th International Conference on Machine Learning, PMLR 139, 2021. Copyright 2021 by the author(s).

Abstract

Distributed learning has become a hot research topic due to its wide application in cluster-based large-scale learning, federated learning, edge computing and so on. Most traditional distributed learning methods typically assume no failure or attack. However, many unexpected cases, such as communication failure and even malicious attack, may happen in real applications. Hence, Byzantine learning (BL), which refers to distributed learning with failure or attack, has recently attracted much attention. Most existing BL methods are synchronous, which are impractical in some applications due to heterogeneous or offline workers. In these cases, asynchronous BL (ABL) is usually preferred. In this paper, we propose a novel method, called buffered asynchronous stochastic gradient descent (BASGD), for ABL. To the best of our knowledge, BASGD is the first ABL method that can resist malicious attack without storing any instances on the server. Compared with those methods which need to store instances on the server, BASGD has a wider scope of application. BASGD is proved to be convergent and to be able to resist failure or attack. Empirical results show that BASGD significantly outperforms vanilla asynchronous stochastic gradient descent (ASGD) and other ABL baselines when there exists failure or attack on workers.

1. Introduction

Due to the wide application in cluster-based large-scale learning, federated learning (Konečný et al., 2016; Kairouz et al., 2019), edge computing (Shi et al., 2016) and so on, distributed learning has recently become a hot research topic (Zinkevich et al., 2010; Yang, 2013; Jaggi et al., 2014; Shamir et al., 2014; Zhang & Kwok, 2014; Ma et al., 2015; Lee et al., 2017; Lian et al., 2017; Zhao et al., 2017; Sun et al., 2018; Wangni et al., 2018; Zhao et al., 2018; Zhou et al., 2018; Yu et al., 2019a;b; Haddadpour et al., 2019). Most traditional distributed learning methods are based on stochastic gradient descent (SGD) and its variants (Bottou, 2010; Xiao, 2010; Duchi et al., 2011; Johnson & Zhang, 2013; Shalev-Shwartz & Zhang, 2013; Zhang et al., 2013; Lin et al., 2014; Schmidt et al., 2017; Zheng et al., 2017; Zhao et al., 2018), and typically assume no failure or attack.

However, in real distributed learning applications with multiple networked machines (nodes), different kinds of hardware or software failure may happen. Representative failure includes bit-flipping in the communication media and the memory of some workers (Xie et al., 2019). In this case, small failure on some machines (workers) might cause a distributed learning method to fail. In addition, malicious attack should not be neglected in an open network where the manager (or server) generally has not much control over the workers, such as the cases of edge computing and federated learning. Malicious workers may behave arbitrarily or even adversarially. Hence, Byzantine learning (BL), which refers to distributed learning with failure or attack, has attracted much attention (Diakonikolas et al., 2017; Chen et al., 2017; Blanchard et al., 2017; Damaskinos et al., 2018; Baruch et al., 2019; Diakonikolas & Kane, 2019).
Existing BL methods can be divided into two main categories: synchronous BL (SBL) methods and asynchronous BL (ABL) methods. In SBL methods, the learning information, such as the gradient in SGD, of all workers is aggregated in a synchronous way. In contrast, in ABL methods the learning information of workers is aggregated in an asynchronous way. Existing SBL methods mainly take two different approaches to achieve resilience against Byzantine workers, which refer to those workers with failure or attack. One approach is to replace the simple averaging aggregation operation with a more robust aggregation operation, such as median or trimmed-mean (Yin et al., 2018). Krum (Blanchard et al., 2017) and ByzantinePGD (Yin et al., 2019) take this approach. The other approach is to filter out the suspicious learning information (gradients) before averaging. Representative examples include ByzantineSGD (Alistarh et al., 2018) and Zeno (Xie et al., 2019). The advantage of SBL methods is that they are relatively simple and easy to implement. But SBL methods will result in slow
convergence when there exist heterogeneous workers. Furthermore, in some applications like federated learning and edge computing, synchronization cannot even be performed most of the time due to offline workers (clients or edge servers). Hence, ABL is preferred in these cases.

To the best of our knowledge, there exist only two ABL methods: Kardam (Damaskinos et al., 2018) and Zeno++ (Xie et al., 2020). Kardam introduces two filters to drop out suspicious learning information (gradients), and it can still achieve good performance when the communication delay is heavy. However, in the face of malicious attack, some work (Xie et al., 2020) finds that Kardam also drops out most correct gradients in order to filter all faulty (failure) gradients. Hence, Kardam cannot resist malicious attack. Zeno++ needs to store some training instances on the server for scoring. In some practical applications like federated learning (Kairouz et al., 2019), storing data on the server will increase the risk of privacy leakage or even face legal risk. Therefore, under the general setting where the server has no access to any training instances, there exists no ABL method which can resist malicious attack.

In this paper, we propose a novel method, called buffered asynchronous stochastic gradient descent (BASGD), for ABL. The main contributions are listed as follows:

• To the best of our knowledge, BASGD is the first ABL method that can resist malicious attack without storing any instances on the server. Compared with those methods which need to store instances on the server, BASGD has a wider scope of application.

• BASGD is theoretically proved to be convergent and to be able to resist failure or attack.

• Empirical results show that BASGD significantly outperforms vanilla asynchronous stochastic gradient descent (ASGD) and other ABL baselines when there exist failure or malicious attack on workers. In particular, BASGD can still converge under malicious attack, when ASGD and other ABL methods fail.

2. Preliminary

In this section, we present the preliminary of this paper, including the distributed learning framework used in this paper and the definition of Byzantine worker.

2.1. Distributed Learning Framework

Many machine learning models, such as logistic regression and deep neural networks, can be formulated as the following finite-sum optimization problem:

$$\min_{w \in \mathbb{R}^d} F(w) = \frac{1}{n} \sum_{i=1}^{n} f(w; z_i), \qquad (1)$$

where $w$ is the parameter to learn, $d$ is the dimension of the parameter, $n$ is the number of training instances, and $f(w; z_i)$ is the empirical loss on the instance $z_i$. The goal of distributed learning is to solve the problem in (1) by designing learning algorithms based on multiple networked machines.

Although there have appeared many distributed learning frameworks, in this paper we focus on the widely used Parameter Server (PS) framework (Li et al., 2014). In a PS framework, there are several workers and one or more servers. Each worker can only communicate with the server(s). There may exist more than one server in a PS framework, but for the problem of this paper the servers can be logically conceived as a single unit. Without loss of generality, we will assume there is only one server in this paper. Training instances are disjointly distributed across $m$ workers. Let $D_k$ denote the index set of training instances on worker $k$; we have $\cup_{k=1}^{m} D_k = \{1, 2, \dots, n\}$ and $D_k \cap D_{k'} = \emptyset$ if $k \neq k'$. In this paper, we assume that the server has no access to any training instances. If two instances have the same value, they are still deemed as two distinct instances. Namely, $z_i$ may equal $z_{i'}$ ($i \neq i'$).
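As a small illustration of this data layout (our own sketch, not part of the paper; the helper name `partition_indices` is hypothetical), the following Python snippet splits the instance indices disjointly across m workers, so that every index belongs to exactly one local set $D_k$ and the server holds none of them.

```python
import random

def partition_indices(n, m, seed=0):
    """Disjointly partition instance indices {0, ..., n-1} across m workers.

    Returns a list D where D[k] is the index set held by worker k, so that
    the union of all D[k] covers every index and no index is shared.
    """
    rng = random.Random(seed)
    indices = list(range(n))
    rng.shuffle(indices)                      # random, roughly equal split
    return [indices[k::m] for k in range(m)]  # worker k takes every m-th index

if __name__ == "__main__":
    D = partition_indices(n=10, m=3)
    assert sorted(i for Dk in D for i in Dk) == list(range(10))   # union = {0,...,9}
    assert all(set(D[a]).isdisjoint(D[b])
               for a in range(3) for b in range(3) if a != b)      # pairwise disjoint
    print(D)
```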
One popular asynchronous method to solve the problem in (1) under the PS framework is ASGD (Dean et al., 2012) (see Appendix A of the supplementary materials for details). In this paper, we assume each worker samples one instance for gradient computation each time. The analysis of the mini-batch case is similar.

In PS based ASGD, the server is responsible for updating and maintaining the latest parameter. The number of iterations that the server has already executed is used as the global logical clock of the server. At the beginning, the iteration number $t = 0$. Each time an SGD step is executed, $t$ will increase by 1 immediately. The parameter after $t$ iterations is denoted as $w^t$. If the server sends the parameter to worker $k$ at iteration $t'$, some SGD steps may have been executed before the server receives a gradient from worker $k$ next time at iteration $t$. Thus, we define the delay of worker $k$ at iteration $t$ as $\tau_k^t = t - t'$. Worker $k$ is heavily delayed at iteration $t$ if $\tau_k^t > \tau_{\max}$, where $\tau_{\max}$ is a pre-defined non-negative constant.

2.2. Byzantine Worker

For workers that have sent gradients (one or more) to the server at iteration $t$, we call worker $k$ a loyal worker if it has finished all the tasks without any fault and each sent gradient is correctly received by the server. Otherwise, worker $k$ is called a Byzantine worker. If worker $k$ is a Byzantine worker, it means the received gradient from worker $k$ is not credible, and it can be an arbitrary value. In ASGD, there is one received gradient at a time. Formally, we denote the gradient received from worker $k$ at iteration $t$ as $g_k^t$. Then we have:

$$g_k^t = \begin{cases} \nabla f(w^{t'}; z_i), & \text{if worker } k \text{ is loyal at iteration } t; \\ \ast, & \text{if worker } k \text{ is Byzantine at iteration } t, \end{cases}$$

where $0 \le t' \le t$, and $i$ is randomly sampled from $D_k$. '∗' represents an arbitrary value. Our definition of Byzantine worker is consistent with most previous works (Blanchard et al., 2017; Xie et al., 2019; 2020). Either accidental failure or malicious attack will result in Byzantine workers.
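To make the ASGD protocol and the Byzantine gradient model above concrete, here is a minimal, self-contained Python simulation (our own illustration, not the authors' implementation; `grad_fn`, `attack_fn` and the other names are hypothetical). It tracks, for every worker, the iteration at which it last received the parameter, so the delay $\tau_k^t = t - t'$ can be read off, and it lets a Byzantine worker replace its gradient by an arbitrary vector.

```python
import random

def asgd_server(grad_fn, w0, num_workers, num_iters, lr=0.1,
                byzantine=(), attack_fn=None, seed=0):
    """Simulate PS-based ASGD on a list-of-floats parameter vector.

    grad_fn(w, k) -> stochastic gradient computed by worker k at parameter w
    attack_fn(g)  -> arbitrary vector sent instead of g by a Byzantine worker
    Workers respond in a random order to mimic asynchrony; t_sent[k] records
    the iteration t' at which worker k last received the parameter, so the
    delay of worker k at iteration t is tau = t - t_sent[k].
    """
    rng = random.Random(seed)
    w = list(w0)
    w_local = {k: list(w0) for k in range(num_workers)}  # stale copies on workers
    t_sent = {k: 0 for k in range(num_workers)}          # iteration of last send
    for t in range(num_iters):
        k = rng.randrange(num_workers)                   # next worker to respond
        tau = t - t_sent[k]                              # delay tau_k^t (tracked only,
                                                         # vanilla ASGD does not use it)
        g = grad_fn(w_local[k], k)
        if k in byzantine and attack_fn is not None:
            g = attack_fn(g)                             # '*': arbitrary value
        w = [wj - lr * gj for wj, gj in zip(w, g)]       # SGD step on the server
        w_local[k], t_sent[k] = list(w), t + 1           # send back latest parameter
    return w
```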
3. Buffered Asynchronous SGD

In synchronous BL, gradients from all workers are received at each iteration. We can compare the gradients with each other and then filter suspicious ones, or use more robust aggregation rules such as median and trimmed-mean for updating. However, in asynchronous BL, only one gradient is received at a time. Without any training instances stored on the server, it is difficult for the server to identify whether a received gradient is credible or not.

In order to deal with this problem in asynchronous BL, we propose a novel method called buffered asynchronous SGD (BASGD). BASGD introduces B buffers ($0 < B \le m$) on the server, and the gradient used for updating parameters will be aggregated from these buffers. The detail of the learning procedure of BASGD is presented in Algorithm 1. In this section, we will introduce the three key components of BASGD: buffer, aggregation function, and mapping table.

Algorithm 1 Buffered Asynchronous SGD (BASGD)

Server:
  Input: learning rate $\eta$, reassignment interval $\Delta$, buffer number $B$, aggregation function Aggr(·);
  Initialization: initial parameter $w^0$, learning rate $\eta$;
  Set buffers: $h_b \leftarrow 0$, $N_b^t \leftarrow 0$;
  Initialize mapping table: $\beta_s \leftarrow s$ ($s = 0, 1, \dots, m-1$);
  Send initial $w^0$ to all workers;
  Set $t \leftarrow 0$, and start the timer;
  repeat
    Wait until receiving $g$ from some worker $s$;
    Choose buffer: $b \leftarrow \beta_s \bmod B$;
    Let $N_b^t \leftarrow N_b^t + 1$, and $h_b \leftarrow \frac{(N_b^t - 1) h_b + g}{N_b^t}$;
    if $N_b^t > 0$ for each $b \in [B]$ then
      Aggregate: $G^t = \mathrm{Aggr}([h_1, \dots, h_B])$;
      Execute SGD step: $w^{t+1} \leftarrow w^t - \eta \cdot G^t$;
      Zero out buffers: $h_b \leftarrow 0$, $N_b^t \leftarrow 0$ ($b = 1, \dots, B$);
      Set $t \leftarrow t + 1$, and restart the timer;
    end if
    if the timer has exceeded $\Delta$ seconds then
      Zero out buffers: $h_b \leftarrow 0$, $N_b^t \leftarrow 0$ ($b = 1, \dots, B$);
      Modify the mapping table $\{\beta_s\}_{s=0}^{m-1}$ for buffer reassignment, and restart the timer;
    end if
    Send the latest parameter back to worker $s$, no matter whether an SGD step is executed or not;
  until stop criterion is satisfied
  Notify all workers to stop;

Worker k: ($k = 0, 1, \dots, m-1$)
  repeat
    Wait until receiving the latest parameter $w$ from the server;
    Randomly sample an index $i$ from $D_k$;
    Compute $\nabla f(w; z_i)$;
    Send $\nabla f(w; z_i)$ to the server;
  until receiving the server's notification to stop

3.1. Buffer

In BASGD, the $m$ workers do the same job as that in ASGD, while the updating rule on the server is modified. More specifically, there are B buffers ($0 < B \le m$) on the server. When a gradient $g$ from worker $s$ is received, it will be temporarily stored in buffer $b$, where $b = s \bmod B$, as illustrated in Figure 1. Only when each buffer has stored at least one gradient will a new SGD step be executed. Please note that no matter whether an SGD step is executed or not, the server will immediately send the latest parameter back to the worker after receiving a gradient. Hence, BASGD introduces no barrier, and is an asynchronous algorithm.

Figure 1. An example of buffers. Each circle represents a worker, and the number is the worker ID. There are 15 workers and 5 buffers. The gradient received from worker s is stored in buffer {s mod 5}.
For each buffer $b$, more than one gradient may have been received at iteration $t$. We will store the average of these gradients (denoted by $h_b$) in buffer $b$. Assume that there are already $(N-1)$ gradients $g_1, g_2, \dots, g_{N-1}$ which should be stored in buffer $b$, and $h_b(\text{old}) = \frac{1}{N-1}\sum_{i=1}^{N-1} g_i$. When the $N$-th gradient $g_N$ is received, the new average value is:

$$h_b(\text{new}) = \frac{1}{N}\sum_{i=1}^{N} g_i = \frac{N-1}{N} \cdot h_b(\text{old}) + \frac{1}{N} \cdot g_N.$$

This is the updating rule for each buffer $b$ when a gradient is received. We use $N_b^t$ to denote the total number of gradients stored in buffer $b$ at the $t$-th iteration. After the parameter $w$ is updated, all buffers will be zeroed out at once. With the benefit of buffers, the server has access to B candidate gradients when updating the parameter. Thus, a more reliable (robust) gradient can be aggregated from the B gradients of the buffers, if a proper aggregation function Aggr(·) is chosen.
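As a concrete illustration of the buffer bookkeeping in Algorithm 1 and the incremental-mean update above, here is a minimal Python sketch of the server-side buffer logic (our own code, not the authors'; class and method names such as `BufferedServer.receive` are hypothetical). For simplicity it omits the ∆-timer and the mapping-table reassignment described in Section 3.3, and `aggr` can be any aggregation function over the B buffer averages, e.g. the coordinate-wise median or trimmed-mean defined in Section 3.2.

```python
class BufferedServer:
    """Server-side buffer logic of BASGD (illustrative sketch).

    `aggr` maps a list of B buffer averages to a single aggregated gradient.
    """

    def __init__(self, w0, num_buffers, lr, aggr):
        self.w = list(w0)
        self.B = num_buffers
        self.lr = lr
        self.aggr = aggr
        self.h = [None] * num_buffers   # running mean of gradients per buffer
        self.N = [0] * num_buffers      # number of gradients stored per buffer

    def receive(self, worker_id, g):
        """Store gradient g from a worker; take an SGD step if all buffers are filled."""
        b = worker_id % self.B
        self.N[b] += 1
        if self.h[b] is None:
            self.h[b] = list(g)
        else:                            # incremental mean: h <- ((N-1)h + g) / N
            n = self.N[b]
            self.h[b] = [((n - 1) * hj + gj) / n for hj, gj in zip(self.h[b], g)]
        if all(n > 0 for n in self.N):
            G = self.aggr(self.h)                             # aggregate B buffers
            self.w = [wj - self.lr * Gj for wj, Gj in zip(self.w, G)]
            self.h = [None] * self.B                          # zero out buffers
            self.N = [0] * self.B
        return list(self.w)              # latest parameter is always sent back
```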
Please note that from the perspective of workers, BASGD is fully asynchronous, since a worker will immediately receive the latest parameter from the server after sending a gradient to the server, without waiting for other workers. Meanwhile, from the perspective of the server, BASGD is semi-asynchronous because the server will not update the model until all buffers are filled. However, it is a necessity to limit the updating frequency in ABL when the server has no instances. If the server always updated the model when receiving a gradient, it would be easily foiled when Byzantine workers send gradients much more frequently than others. A similar conclusion has been proved in previous works (Damaskinos et al., 2018).

3.2. Aggregation Function

When an SGD step is ready to be executed, there are B buffers providing candidate gradients. An aggregation function is needed to get the final gradient for updating. A naive way is to take the mean of all candidate gradients. However, the mean value is sensitive to outliers, which are common in BL. For designing proper aggregation functions, we first define the q-Byzantine Robust (q-BR) condition to quantitatively describe the Byzantine resilience ability of an aggregation function.

Definition 1 (q-Byzantine Robust). For an aggregation function Aggr(·): $\mathrm{Aggr}([h_1, \dots, h_B]) = G$, where $G = [G_1, \dots, G_d]^T$ and $h_b = [h_{b1}, \dots, h_{bd}]^T$, $\forall b \in [B]$, we call Aggr(·) q-Byzantine Robust ($q \in \mathbb{Z}$, $0 < q < B$) if, for every coordinate $j \in [d]$ and every subset $S \subseteq [B]$ with $|S| = B - q$, the aggregated value satisfies $\min_{b \in S} h_{bj} \le G_j \le \max_{b \in S} h_{bj}$. In other words, up to $q$ arbitrarily corrupted buffers cannot push any coordinate of $G$ outside the range of the remaining buffered values.

For instance, when $B > 1$, the mean function is not q-Byzantine Robust for any $q > 0$. We illustrate this by a one-dimensional example: $h_1, \dots, h_{B-1} \in [0, 1]$, and $h_B = 10 \times B$. Then $\frac{1}{B}\sum_{b=1}^{B} h_b \ge \frac{h_B}{B} = 10 \notin [0, 1]$. Namely, the mean is larger than any of the first $B - 1$ values.

The following two aggregation functions are both q-BR.

Definition 2 (Coordinate-wise median (Yin et al., 2018)). For candidate gradients $h_1, h_2, \dots, h_B \in \mathbb{R}^d$ with $h_b = [h_{b1}, h_{b2}, \dots, h_{bd}]^T$, $\forall b = 1, 2, \dots, B$, the coordinate-wise median is defined as
$$\mathrm{Med}([h_1, \dots, h_B]) = [\mathrm{Med}(h_{\cdot 1}), \dots, \mathrm{Med}(h_{\cdot d})]^T,$$
where $\mathrm{Med}(h_{\cdot j})$ is the scalar median of the $j$-th coordinates, $\forall j = 1, 2, \dots, d$.

Definition 3 (Coordinate-wise q-trimmed-mean (Yin et al., 2018)). For any positive integer $q < B/2$ and candidate gradients $h_1, h_2, \dots, h_B \in \mathbb{R}^d$ with $h_b = [h_{b1}, h_{b2}, \dots, h_{bd}]^T$, $\forall b = 1, 2, \dots, B$, the coordinate-wise q-trimmed-mean is defined as
$$\mathrm{Trm}([h_1, \dots, h_B]) = [\mathrm{Trm}(h_{\cdot 1}), \dots, \mathrm{Trm}(h_{\cdot d})]^T,$$
where $\mathrm{Trm}(h_{\cdot j}) = \frac{1}{B - 2q}\sum_{b \in M_j} h_{bj}$ is the scalar q-trimmed-mean, and $M_j$ is the subset of $\{h_{bj}\}_{b=1}^{B}$ obtained by removing the $q$ largest elements and the $q$ smallest elements.

In the following content, coordinate-wise median and coordinate-wise q-trimmed-mean are also called median and trmean, respectively. Proposition 1 shows the q-BR property of these two functions.

Proposition 1. Coordinate-wise q-trimmed-mean is q-BR, and coordinate-wise median is $\lfloor \frac{B-1}{2} \rfloor$-BR. Here, $\lfloor x \rfloor$ represents the maximum integer that is not larger than $x$.

According to Proposition 1, both median and trmean are proper choices for the aggregation function in BASGD. The proof can be found in Appendix B of the supplementary materials. Now we define another class of aggregation functions, which is also important in the analysis in Section 4.
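As an illustration (not the authors' code), the two aggregation functions in Definitions 2 and 3 can be written in a few lines of NumPy. `coordinate_median` and `coordinate_trimmed_mean` below operate column-wise on a B × d array whose rows are the buffer averages $h_1, \dots, h_B$.

```python
import numpy as np

def coordinate_median(H):
    """Coordinate-wise median of B candidate gradients, H with shape (B, d)."""
    return np.median(H, axis=0)

def coordinate_trimmed_mean(H, q):
    """Coordinate-wise q-trimmed-mean: drop the q largest and q smallest
    entries in every coordinate, then average the remaining B - 2q entries."""
    B = H.shape[0]
    assert 0 < q < B / 2, "q must be a positive integer smaller than B/2"
    H_sorted = np.sort(H, axis=0)          # sort each coordinate independently
    return H_sorted[q:B - q].mean(axis=0)  # average the middle B - 2q values

if __name__ == "__main__":
    # A cluster of honest gradients plus one outlier (e.g. a Byzantine buffer).
    H = np.array([[0.1, 0.2], [0.2, 0.1], [0.15, 0.25], [100.0, -100.0]])
    print(coordinate_median(H))            # stays close to the honest gradients
    print(coordinate_trimmed_mean(H, q=1)) # likewise robust to the single outlier
```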
Definition 4 (Stable aggregation function). Aggregation function Aggr(·) is called stable provided that $\forall h_1, \dots, h_B, \tilde{h}_1, \dots, \tilde{h}_B \in \mathbb{R}^d$, letting $\delta = \big(\sum_{b=1}^{B} \|h_b - \tilde{h}_b\|^2\big)^{\frac{1}{2}}$, we have:
$$\|\mathrm{Aggr}(h_1, \dots, h_B) - \mathrm{Aggr}(\tilde{h}_1, \dots, \tilde{h}_B)\| \le \delta.$$

If Aggr(·) is a stable aggregation function, it means that when there is a disturbance with $L_2$-norm $\delta$ on the buffers, the disturbance of the aggregated result will not be larger than $\delta$.

Definition 5 (Effective aggregation function). When there are at most $r$ Byzantine workers, a stable aggregation function Aggr(·) is called an $(A_1, A_2)$-effective aggregation function, provided that it satisfies the following two properties for all $w^t \in \mathbb{R}^d$ in cases without delay ($\tau_k^t = 0$, $\forall t = 0, 1, \dots, T-1$):
(i) $\mathbb{E}[\nabla F(w^t)^T G_{\text{syn}}^t \mid w^t] \ge \|\nabla F(w^t)\|^2 - A_1$;

(ii) $\mathbb{E}[\|G_{\text{syn}}^t\|^2 \mid w^t] \le (A_2)^2$;

where $A_1, A_2 \in \mathbb{R}^+$ are two non-negative constants, and $G_{\text{syn}}^t$ is the gradient aggregated by Aggr(·) at the $t$-th iteration in cases without delay.

For different aggregation functions, the constants $A_1$ and $A_2$ may differ. $A_1$ and $A_2$ are related to the loss function $F(\cdot)$, the distribution of instances, the buffer number B, and the maximum Byzantine worker number $r$. Inequalities (i) and (ii) in Definition 5 are two important properties in the convergence proof of synchronous Byzantine learning methods. As revealed in (Yang et al., 2020), there are many existing aggregation rules for Byzantine learning. We find that most of them satisfy Definition 5. For example, Krum, median, and trimmed-mean have already been proved to satisfy these two properties (Blanchard et al., 2017; Yin et al., 2018). SignSGD (Bernstein et al., 2019) can be seen as a combination of 1-bit quantization and median aggregation, while median satisfies the properties in Definition 5.

Compared to Definition 1, Definition 5 can be used to obtain a tighter bound with respect to $A_1$ and $A_2$. However, it usually requires more effort to check the two properties in Definition 5 than those in Definition 1.

Please note that a too large B could lower the updating frequency and damage the performance, while a too small B may harm the Byzantine resilience. Thus, a moderate B is usually preferred. In some practical applications, we could estimate the maximum number of Byzantine workers $r$, and set B to make the aggregation function resilient to up to $r$ Byzantine workers. In particular, B is suggested to be $(2r + 1)$ for median, since median is $\lfloor \frac{B-1}{2} \rfloor$-BR.

3.3. Mapping Table

At each iteration of BASGD, buffer $b$ needs at least one gradient for aggregation. In the worst case, all the workers corresponding to buffer $b$ may be unresponsive. In this case, buffer $b$ will become a straggler and slow down the whole learning process. To deal with this problem, we introduce the mapping table technique for buffer reassignment.

We call a worker an active worker if it has responded at the current iteration. If an SGD step has not been executed for $\Delta$ seconds, the server immediately zeroes out the stored gradients in all buffers, equally reassigns active workers to each buffer, and then continues the learning procedure. Hyper-parameter $\Delta$ is called the reassignment interval. Figure 2 illustrates an example of reassignment. The grey circles represent unresponsive workers. After reassignment, there is at least one active worker corresponding to each buffer.

Figure 2. An example of buffer reassignment. White circles represent active workers, and grey circles represent unresponsive workers. Before reassignment, buffer 0 is a straggler. After reassignment, there is at least one active worker corresponding to each buffer.

Specifically, we introduce a mapping table $\{\beta_s\}_{s=0}^{m-1}$ for buffer reassignment. Initially, $\beta_s = s$ ($\forall s = 0, 1, \dots, m-1$). When reassigning buffers, the server only needs to modify the mapping table $\{\beta_s\}_{s=0}^{m-1}$, and then it stores worker $s$'s gradients in buffer $\{\beta_s \bmod B\}$ instead of buffer $\{s \bmod B\}$.
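A minimal sketch of the mapping-table bookkeeping (our own illustration with hypothetical function names, assuming a round-robin style of even reassignment as described above): gradients are routed through $\beta$, and when reassignment is triggered, only the currently active workers are evenly remapped onto the B buffers, so that no buffer is left without an active worker.

```python
def init_mapping(m):
    """Initially beta_s = s, so worker s feeds buffer s mod B."""
    return list(range(m))

def buffer_of(beta, worker_id, B):
    """Route a gradient from worker_id to its buffer via the mapping table."""
    return beta[worker_id] % B

def reassign(beta, active_workers, B):
    """Evenly remap the active workers over the B buffers (round-robin),
    leaving the entries of inactive workers unchanged."""
    for slot, worker_id in enumerate(sorted(active_workers)):
        beta[worker_id] = slot          # slot % B cycles through all buffers
    return beta

if __name__ == "__main__":
    m, B = 15, 5
    beta = init_mapping(m)
    active = [1, 3, 4, 7, 8, 9, 11, 14]             # e.g. the responsive workers
    beta = reassign(beta, active, B)
    used = {buffer_of(beta, w, B) for w in active}
    assert used == set(range(B))                     # every buffer has an active worker
```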
Please note that the server only needs to modify the mapping table for buffer reassignment; there is no need to notify the workers.

Besides, a timer is used on the server for indicating when to reassign buffers. The timer is started at the beginning of BASGD, and is restarted immediately after each SGD step or buffer reassignment. When the timer exceeds $\Delta$ seconds, the buffers will be zeroed out and reassignment will be executed. Hyper-parameter $\Delta$ should be set properly. If $\Delta$ is too small, buffers will be zeroed out too frequently, which may slow down the learning process. If $\Delta$ is too large, straggler buffers cannot be eliminated in time.

4. Convergence

In this section, we theoretically prove the convergence and resilience of BASGD against failure or attack. There are two main theorems. The first theorem presents a relatively loose but general bound for all q-BR aggregation functions. The second one presents a relatively tight bound for each distinct $(A_1, A_2)$-effective aggregation function. Since the definition of an $(A_1, A_2)$-effective aggregation function is usually more difficult to verify than the q-BR property, the general bound is also useful. Here we only present the results. Proof details are in Appendix B of the supplementary materials.
We first make the following assumptions, which have been widely used in stochastic optimization.

Assumption 1 (Lower bound). The global loss function $F(w)$ is bounded below: $\exists F^* \in \mathbb{R}$, $F(w) \ge F^*$, $\forall w \in \mathbb{R}^d$.

Assumption 2 (Bounded bias). For any loyal worker, it can use locally stored training instances to estimate the global gradient with bounded bias $\kappa$: $\|\mathbb{E}[\nabla f(w; z_i)] - \nabla F(w)\| \le \kappa$, $\forall w \in \mathbb{R}^d$.

Assumption 3 (Bounded gradient). $\nabla F(w)$ is bounded: $\exists D \in \mathbb{R}^+$, $\|\nabla F(w)\| \le D$, $\forall w \in \mathbb{R}^d$.

Assumption 4 (Bounded variance). $\mathbb{E}[\|\nabla f(w; z_i) - \mathbb{E}[\nabla f(w; z_i) \mid w]\|^2 \mid w] \le \sigma^2$, $\forall w \in \mathbb{R}^d$.

Assumption 5 (L-smoothness). The global loss function $F(w)$ is differentiable and $L$-smooth: $\|\nabla F(w) - \nabla F(w')\| \le L\|w - w'\|$, $\forall w, w' \in \mathbb{R}^d$.

Let $N_{(t)}$ be the $(q+1)$-th smallest value in $\{N_b^t\}_{b \in [B]}$, where $N_b^t$ is the number of gradients stored in buffer $b$ at the $t$-th iteration. We define the constant
$$\Lambda_{B,q,r} = \frac{(B-r)\sqrt{B-r+1}}{\sqrt{(B-q-1)(q-r+1)}},$$
which will appear in Lemma 1 and Lemma 2.

Lemma 1. If Aggr(·) is q-BR, and there are at most $r$ Byzantine workers ($r \le q$), we have:
$$\mathbb{E}[\|G^t\|^2 \mid w^t] \le \Lambda_{B,q,r}\, d \cdot (D^2 + \sigma^2 / N_{(t)}).$$

Lemma 2. If Aggr(·) is q-BR, and the total number of heavily delayed workers and Byzantine workers is not larger than $r$ ($r \le q$), we have:
$$\|\mathbb{E}[G^t - \nabla F(w^t) \mid w^t]\| \le \Lambda_{B,q,r}\, d\, \big(\tau_{\max} L \cdot [\Lambda_{B,q,r}\, d\, (D^2 + \sigma^2 / N_{(t)})]^{\frac{1}{2}} + \sigma + \kappa\big).$$

Theorem 1. Let $\tilde{D} = \frac{1}{T}\sum_{t=0}^{T-1} (D^2 + \sigma^2 / N_{(t)})^{\frac{1}{2}}$. If Aggr(·) is q-BR, $B = O(r)$, and the total number of heavily delayed workers and Byzantine workers is not larger than $r$ ($r \le q$), then with learning rate $\eta = O(\frac{1}{L\sqrt{T}})$ we have:
$$\frac{\sum_{t=0}^{T-1}\mathbb{E}[\|\nabla F(w^t)\|^2]}{T} \le O\Big(\frac{L[F(w^0) - F^*]}{T^{\frac{1}{2}}}\Big) + O\Big(\frac{r d \tilde{D}}{T^{\frac{1}{2}}(q-r+1)^{\frac{1}{2}}}\Big) + O\Big(\frac{r D d \sigma}{(q-r+1)^{\frac{1}{2}}}\Big) + O\Big(\frac{r D d \kappa}{(q-r+1)^{\frac{1}{2}}}\Big) + O\Big(\frac{r^{\frac{3}{2}} L D \tilde{D} d^{\frac{3}{2}} \tau_{\max}}{(q-r+1)^{\frac{3}{4}}}\Big).$$

Please note that the convergence rate of vanilla ASGD is $O(T^{-\frac{1}{2}})$. Hence, Theorem 1 indicates that BASGD has a theoretical convergence rate as fast as vanilla ASGD, with an extra constant variance. The term $O(rDd\sigma(q-r+1)^{-\frac{1}{2}})$ is caused by the aggregation function, which can be deemed as a sacrifice for Byzantine resilience. The term $O(rDd\kappa(q-r+1)^{-\frac{1}{2}})$ is caused by the differences of training instances among different workers. In independent and identically distributed (i.i.d.) cases, $\kappa = 0$ and the term vanishes. The term $O(r^{\frac{3}{2}} L D \tilde{D} d^{\frac{3}{2}} \tau_{\max}(q-r+1)^{-\frac{3}{4}})$ is caused by the delay, and is related to the parameter $\tau_{\max}$. This term is also related to the buffer size. When $N_b^t$ increases, $N_{(t)}$ may increase, and thus $\tilde{D}$ will decrease. Namely, a larger buffer size will result in a smaller $\tilde{D}$. Besides, the factor $(q-r+1)^{-\frac{1}{2}}$ or $(q-r+1)^{-\frac{3}{4}}$ decreases as $q$ increases, and increases as $r$ increases.

Although general, the bound presented in Theorem 1 is relatively loose in high-dimensional cases, since $d$ appears in all the three extra terms. To obtain a tighter bound, we introduce Theorem 2 for BASGD with an $(A_1, A_2)$-effective aggregation function (Definition 5).

Theorem 2. Assume that the total number of heavily delayed workers and Byzantine workers is not larger than $r$, $B = O(r)$, and Aggr(·) is an $(A_1, A_2)$-effective aggregation function in this case. With learning rate $\eta = O(\frac{1}{\sqrt{LT}})$, in general asynchronous cases we have:
$$\frac{\sum_{t=0}^{T-1}\mathbb{E}[\|\nabla F(w^t)\|^2]}{T} \le O\Big(\frac{L^{\frac{1}{2}}[F(w^0) - F^*]}{T^{\frac{1}{2}}}\Big) + O\Big(\frac{L^{\frac{1}{2}} \tau_{\max} D A_2 r^{\frac{1}{2}}}{T^{\frac{1}{2}}}\Big) + O\Big(\frac{L^{\frac{1}{2}} (A_2)^2}{T^{\frac{1}{2}}}\Big) + O\Big(\frac{L^{\frac{5}{2}} (A_2)^2 \tau_{\max}^2 r}{T^{\frac{3}{2}}}\Big) + A_1.$$
Theorem 2 indicates that if Aggr(·) makes a synchronous BL method converge (i.e., satisfies Definition 5), then BASGD converges when using Aggr(·) as the aggregation function. Hence, BASGD can also be seen as a technique of asynchronization. That is to say, new asynchronous methods can be obtained from synchronous ones when using BASGD. The extra constant term $A_1$ is caused by gradient bias. When there are no Byzantine workers ($r = 0$) and instances are i.i.d. across workers, letting $B = 1$ and $\mathrm{Aggr}(h_1, \dots, h_B) = \mathrm{Aggr}(h_1) = h_1$, BASGD degenerates to vanilla ASGD. In this case, there is no gradient bias ($A_1 = 0$), and BASGD has a convergence rate of $O(1/\sqrt{T})$, which is the same as that of vanilla ASGD (Liu & Zhang, 2021).

Meanwhile, it remains uncertain whether the dependence on the staleness parameter $\tau_{\max}$ is tight enough. Theorem 2 illustrates that BASGD has a convergence rate of $O(\tau_{\max}/T^{\frac{1}{2}})$, while the convergence rate of vanilla ASGD can reach $O(\tau_{\max}/T)$. To the best of our knowledge, there exist few works revealing the tightness of $\tau_{\max}$ in asynchronous BL, and we will leave this for future work.
In general cases, Theorem 2 guarantees that BASGD finds a point such that the squared $L_2$-norm of its gradient is not larger than $A_1$ (but not necessarily around a stationary point), in expectation. Please note that Assumption 3 already guarantees that the gradient's squared $L_2$-norm is not larger than $D^2$. We introduce Proposition 2 to show that $A_1$ is guaranteed to be smaller than $D^2$ under a mild condition.

Proposition 2. Suppose Aggr(·) is an $(A_1, A_2)$-effective aggregation function, and $G_{\text{syn}}^t$ is aggregated by Aggr(·) in the synchronous setting. If $\mathbb{E}[\|G_{\text{syn}}^t - \nabla F(w^t)\| \mid w^t] \le D$, $\forall w^t \in \mathbb{R}^d$, then we have $A_1 \le D^2$.

$G_{\text{syn}}^t$ is the aggregated result of Aggr(·), and is a robust estimator of $\nabla F(w^t)$ used for updating. Since $\|\nabla F(w^t)\| \le D$, $\nabla F(w^t)$ lies in a ball with radius $D$. The condition $\mathbb{E}[\|G_{\text{syn}}^t - \nabla F(w^t)\| \mid w^t] \le D$ means that the bias of $G_{\text{syn}}^t$ is not larger than the radius $D$, which is a mild condition for Aggr(·).

As many existing works have shown (Assran et al., 2020; Nokleby et al., 2020), speed-up is also an important aspect of distributed learning methods. In BASGD, different workers can compute gradients concurrently, which makes each buffer fill more quickly and thus speeds up the model updating. However, we mainly focus on Byzantine resilience in this work. Speed-up will be thoroughly studied in future work.

5. Experiment

In this section, we empirically evaluate the performance of BASGD and baselines in both image classification (IC) and natural language processing (NLP) applications. Our experiments are conducted on a distributed platform with dockers. Each docker is bound to an NVIDIA Tesla V100 (32G) GPU (in IC) or an NVIDIA Tesla K80 GPU (in NLP). Please note that different GPU cards do not affect the reported metrics in the experiment. We choose 30 dockers as workers in IC, and 8 dockers in NLP. An extra docker is chosen as the server. All algorithms are implemented with PyTorch 1.3.

5.1. Experimental Setting

We compare the performance of different methods under two types of attack: negative gradient attack (NG-attack) and random disturbance attack (RD-attack). Byzantine workers with NG-attack send $\tilde{g}_{NG} = -k_{atk} \cdot g$ to the server, where $g$ is the true gradient and $k_{atk} \in \mathbb{R}^+$ is a parameter. Byzantine workers with RD-attack send $\tilde{g}_{RD} = g + g_{rnd}$ to the server, where $g_{rnd}$ is a random vector sampled from the normal distribution $\mathcal{N}(0, \|\sigma_{atk}\, g\|^2 \cdot I)$. Here, $\sigma_{atk}$ is a parameter and $I$ is an identity matrix. NG-attack is a typical kind of malicious attack, while RD-attack can be seen as an accidental failure with expectation 0. Besides, each worker is manually set to have a delay, which is $k_{del}$ times the computing time. The training set is randomly and equally distributed to different workers. We use the average top-1 test accuracy (in IC) or average perplexity (in NLP) on all workers w.r.t. epochs as the final metrics. For BASGD, we use median and trimmed-mean as aggregation functions. The reassignment interval is set to 1 second. Top-1 test accuracy (in IC) w.r.t. wall-clock time of BASGD with more aggregation functions is reported in Appendix C of the supplementary material.

Because BASGD is an ABL method, SBL methods cannot be directly compared with BASGD. The ABL method Zeno++ cannot be directly compared with BASGD either, because Zeno++ needs to store some instances on the server, and the number of instances stored on the server will affect the performance of Zeno++ (Xie et al., 2020). Hence, we compare BASGD with ASGD and Kardam in our experiments. We set the dampening function $\Lambda(\tau) = \frac{1}{1+\tau}$ for Kardam, as suggested in (Damaskinos et al., 2018).
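For concreteness, the two attacks can be simulated as follows (an illustrative sketch, not the authors' code; the parameter names mirror $k_{atk}$ and $\sigma_{atk}$ above).

```python
import numpy as np

def ng_attack(g, k_atk=10.0):
    """Negative gradient attack: send -k_atk times the true gradient."""
    return -k_atk * np.asarray(g)

def rd_attack(g, sigma_atk=0.2, rng=None):
    """Random disturbance attack: add zero-mean Gaussian noise whose per-coordinate
    standard deviation is sigma_atk * ||g|| (covariance ||sigma_atk * g||^2 * I)."""
    if rng is None:
        rng = np.random.default_rng()
    g = np.asarray(g, dtype=float)
    scale = sigma_atk * np.linalg.norm(g)
    return g + rng.normal(0.0, scale, size=g.shape)
```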
Besides, each worker is manually set to have a delay, which is $k_{del}$ times its computing time. The training set is randomly and equally distributed to different workers. We use the average top-1 test accuracy (in IC) or the average perplexity (in NLP) on all workers w.r.t. epochs as the final metrics. For BASGD, we use median and trimmed-mean as aggregation functions. The reassignment interval is set to 1 second. Top-1 test accuracy (in IC) w.r.t. wall-clock time of BASGD with more aggregation functions is reported in Appendix C of the supplementary material.

Because BASGD is an ABL method, SBL methods cannot be directly compared with BASGD. The ABL method Zeno++ cannot be directly compared with BASGD either, because Zeno++ needs to store some instances on the server, and the number of instances stored on the server affects the performance of Zeno++ (Xie et al., 2020). Hence, we compare BASGD with ASGD and Kardam in our experiments. We set the dampening function $\Lambda(\tau) = \frac{1}{1+\tau}$ for Kardam, as suggested in (Damaskinos et al., 2018).

5.2. Image Classification Experiment

In the IC experiment, algorithms are evaluated on CIFAR-10 (Krizhevsky et al., 2009) with the deep learning model ResNet-20 (He et al., 2016). Cross-entropy is used as the loss function. We set $k_{atk} = 10$ for NG-attack, and $\sigma_{atk} = 0.2$ for RD-attack. $k_{del}$ is randomly sampled from the standard normal distribution truncated to $[0, +\infty)$. As suggested in (He et al., 2016), the learning rate $\eta$ is set to 0.1 initially for each algorithm, and multiplied by 0.1 at the 80-th epoch and the 120-th epoch, respectively. The weight decay is set to $10^{-4}$. We run each algorithm for 160 epochs. Batch size is set to 25.

Firstly, we compare the performance of different methods when there are no Byzantine workers. Experimental results with the median and trmean aggregation functions are illustrated in Figure 3(a) and Figure 3(d), respectively. ASGD achieves the best performance. BASGD ($B > 1$) and Kardam have a similar convergence rate to ASGD, but both sacrifice a little accuracy. Besides, the performance of BASGD gets worse when the buffer number $B$ increases, which is consistent with the theoretical results. Please note that ASGD is a degenerate case of BASGD when $B = 1$ and $\mathrm{Aggr}(h_1) = h_1$. Hence, BASGD can achieve the same performance as ASGD when there is no failure or attack.

Then, for each type of attack, we conduct two experiments in which there are 3 and 6 Byzantine workers, respectively. We set 10 and 15 buffers for BASGD in these two experiments, respectively. To save space, we only present the average top-1 test accuracy in Figure 3(b) and Figure 3(e) (3 Byzantine workers), and in Figure 3(c) and Figure 3(f) (6 Byzantine workers). Results on training loss are in Appendix C. We can find that BASGD significantly outperforms ASGD and
Kardam under both RD-attack (accidental failure) and NG-attack (malicious attack). Under the less harmful RD-attack, although ASGD and Kardam still converge, they both suffer a significant loss in accuracy. Under NG-attack, both ASGD and Kardam cannot converge, even though we have tried different values of the assumed Byzantine worker number for Kardam, which is denoted by the hyper-parameter $\gamma$ in this paper. Hence, neither ASGD nor Kardam can resist malicious attack. On the contrary, BASGD still has a relatively good performance under both types of attack.

[Figure 3. Average top-1 test accuracy w.r.t. epochs when there are no Byzantine workers (left column), 3 Byzantine workers (middle column) and 6 Byzantine workers (right column), respectively. Subfigures (b) and (c) are for RD-attack, while (e) and (f) are for NG-attack.]

Moreover, we count the ratio of filtered gradients in Kardam, which is shown in Table 1. We can find that, in order to filter Byzantine gradients, Kardam also filters an approximately equal ratio of loyal gradients. This explains why Kardam performs poorly under malicious attack.

Table 1. Filtered ratio of received gradients in Kardam under NG-attack in the IC task (3 Byzantine workers)
TERM                        BY FREQUENCY FILTER       BY LIPSCHITZ FILTER       IN TOTAL
LOYAL GRADS (γ = 3)         10.15% (31202/307530)     40.97% (126000/307530)    51.12%
BYZANTINE GRADS (γ = 3)     10.77% (3681/34170)       40.31% (13773/34170)      51.08%
LOYAL GRADS (γ = 8)         28.28% (86957/307530)     28.26% (86893/307530)     56.53%
BYZANTINE GRADS (γ = 8)     28.38% (9699/34170)       28.06% (9588/34170)       56.44%
LOYAL GRADS (γ = 14)        85.13% (261789/307530)    3.94% (12117/307530)      89.07%
BYZANTINE GRADS (γ = 14)    84.83% (28985/34170)      4.26% (1455/34170)        89.08%
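For reference, the two aggregation functions used by BASGD in these experiments, coordinate-wise median and q-trimmed-mean (trmean), can be sketched as follows. This is a hedged illustration of the usual coordinate-wise definitions applied to the $B$ buffer vectors $h_1, \ldots, h_B$ on the server, not the authors' implementation; the function names and tensor layout are our own assumptions.

```python
import torch

def median_aggr(buffers: torch.Tensor) -> torch.Tensor:
    # buffers: a (B, d) tensor holding one averaged gradient per buffer.
    # Coordinate-wise median across the B buffers.
    return buffers.median(dim=0).values

def trmean_aggr(buffers: torch.Tensor, q: int) -> torch.Tensor:
    # Coordinate-wise q-trimmed mean: for each coordinate, discard the q
    # largest and q smallest values among the B buffers and average the rest.
    B = buffers.size(0)
    sorted_vals, _ = torch.sort(buffers, dim=0)
    return sorted_vals[q:B - q].mean(dim=0)
```

Under this reading, the "BASGD with trmean (B=15, q=6)" curves in Figure 3 would correspond to calling trmean_aggr on a (15, d) tensor with q = 6, so each coordinate of the update is averaged over the middle 3 buffer values.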
5.3. Natural Language Processing Experiment

In the NLP experiment, the algorithms are evaluated on the WikiText-2 dataset with LSTM (Hochreiter & Schmidhuber, 1997) networks. We only use the training set and the test set; the validation set is not used in our experiment. For the LSTM, we adopt 2 layers with 100 units each. The word embedding size is set to 100, and the sequence length is set to 35. The gradient clipping size is set to 0.25. Cross-entropy is used as the loss function. We run each algorithm for 40 epochs. The initial learning rate $\eta$ is chosen from {1, 2, 5, 10, 20}, and is divided by 4 every 10 epochs. The best test result is adopted as the final one.

The performance of ASGD under no attack is used as the gold standard. We set $k_{atk} = 10$ and $\sigma_{atk} = 0.1$. One of the eight workers is Byzantine. $k_{del}$ is randomly sampled from an exponential distribution with parameter $\lambda = 1$.
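The per-worker delay model described above ($k_{del}$ times the computing time, with $k_{del}$ drawn from a truncated standard normal in the IC task and from an exponential distribution with $\lambda = 1$ in the NLP task) can be simulated with a short sketch such as the following; the helper name and the absolute-value shortcut for the truncated normal are our own assumptions.

```python
import numpy as np

def sample_kdel(task: str, rng: np.random.Generator) -> float:
    # Each worker sleeps for k_del times its gradient-computation time
    # before sending the gradient, simulating heterogeneous workers.
    if task == "IC":
        # Standard normal truncated to [0, +inf); by symmetry this equals |N(0, 1)|.
        return abs(rng.standard_normal())
    if task == "NLP":
        # Exponential distribution with rate lambda = 1 (scale = 1 / lambda).
        return rng.exponential(scale=1.0)
    raise ValueError(f"unknown task: {task}")
```

For example, a worker whose gradient computation takes t seconds would wait an extra sample_kdel(task, rng) * t seconds before pushing its gradient to the server.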
[Figure 4. Average perplexity w.r.t. epochs with 1 Byzantine worker. Subfigures (a) and (b) are for RD-attack, while Subfigures (c) and (d) are for NG-attack. Due to the differences in magnitude of perplexity, the y-axes of Subfigures (a) and (c) are in log-scale. In addition, Subfigures (b) and (d) illustrate that BASGD converges with only a little loss in perplexity compared to the gold standard.]

Each experiment is carried out 3 times, and the average perplexity is reported in Figure 4. We can find that BASGD converges under each kind of attack, with only a little loss in perplexity compared to the gold standard (ASGD without attack). On the other hand, ASGD and Kardam both fail, even when we set the largest $\gamma$ ($\gamma = 3$) for Kardam.

6. Conclusion

In this paper, we propose a novel method called BASGD for asynchronous Byzantine learning. To the best of our knowledge, BASGD is the first ABL method that can resist malicious attack without storing any instances on the server. Compared with those methods which need to store instances on the server, BASGD has a wider scope of application. BASGD is proved to be convergent, and to be able to resist failure or attack. Empirical results show that BASGD significantly outperforms vanilla ASGD and other ABL baselines when there exists failure or attack on workers.

Acknowledgements

This work is supported by National Key R&D Program of China (No. 2020YFA0713900), NSFC-NRF Joint Research Project (No. 61861146001) and NSFC Project (No. 61921006).

References

Alistarh, D., Allen-Zhu, Z., and Li, J. Byzantine stochastic gradient descent. In Advances in Neural Information Processing Systems, pp. 4613–4623, 2018.

Assran, B. M., Aytekin, A., Feyzmahdavian, H. R., Johansson, M., and Rabbat, M. G. Advances in asynchronous parallel and distributed optimization. Proceedings of the IEEE, 108(11):2013–2031, 2020.

Baruch, G., Baruch, M., and Goldberg, Y. A little is enough: Circumventing defenses for distributed learning. In Advances in Neural Information Processing Systems, pp. 8635–8645, 2019.

Bernstein, J., Zhao, J., Azizzadenesheli, K., and Anandkumar, A. signSGD with majority vote is communication efficient and fault tolerant. In Proceedings of the International Conference on Learning Representations, 2019.

Blanchard, P., Guerraoui, R., Stainer, J., et al. Machine learning with adversaries: Byzantine tolerant gradient
descent. In Advances in Neural Information Processing Systems, pp. 119–129, 2017.

Bottou, L. Large-scale machine learning with stochastic gradient descent. In Proceedings of the International Conference on Computational Statistics, pp. 177–186. Springer, 2010.

Chen, Y., Su, L., and Xu, J. Distributed statistical machine learning in adversarial settings: Byzantine gradient descent. Proceedings of the ACM on Measurement and Analysis of Computing Systems, 1(2):1–25, 2017.

Damaskinos, G., Guerraoui, R., Patra, R., Taziki, M., et al. Asynchronous Byzantine machine learning (the case of SGD). In Proceedings of the International Conference on Machine Learning, pp. 1145–1154, 2018.

Dean, J., Corrado, G., Monga, R., Chen, K., Devin, M., Mao, M., Ranzato, M., Senior, A., Tucker, P., Yang, K., et al. Large scale distributed deep networks. In Advances in Neural Information Processing Systems, pp. 1223–1231, 2012.

Diakonikolas, I. and Kane, D. M. Recent advances in algorithmic high-dimensional robust statistics. arXiv preprint arXiv:1911.05911, 2019.

Diakonikolas, I., Kamath, G., Kane, D. M., Li, J., Moitra, A., and Stewart, A. Being robust (in high dimensions) can be practical. In Proceedings of the International Conference on Machine Learning, pp. 999–1008, 2017.

Duchi, J., Hazan, E., and Singer, Y. Adaptive subgradient methods for online learning and stochastic optimization. Journal of Machine Learning Research, 12(Jul):2121–2159, 2011.

Haddadpour, F., Kamani, M. M., Mahdavi, M., and Cadambe, V. Trading redundancy for communication: Speeding up distributed SGD for non-convex optimization. In Proceedings of the International Conference on Machine Learning, pp. 2545–2554, 2019.

He, K., Zhang, X., Ren, S., and Sun, J. Deep residual learning for image recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 770–778, 2016.

Hochreiter, S. and Schmidhuber, J. Long short-term memory. Neural Computation, 9(8):1735–1780, 1997.

Jaggi, M., Smith, V., Takáč, M., Terhorst, J., Krishnan, S., Hofmann, T., and Jordan, M. I. Communication-efficient distributed dual coordinate ascent. In Advances in Neural Information Processing Systems, pp. 3068–3076, 2014.

Johnson, R. and Zhang, T. Accelerating stochastic gradient descent using predictive variance reduction. In Advances in Neural Information Processing Systems, pp. 315–323, 2013.

Kairouz, P., McMahan, H. B., Avent, B., Bellet, A., Bennis, M., Bhagoji, A. N., Bonawitz, K., Charles, Z., Cormode, G., Cummings, R., et al. Advances and open problems in federated learning. arXiv:1912.04977, 2019.

Konečný, J., McMahan, H. B., Yu, F. X., Richtárik, P., Suresh, A. T., and Bacon, D. Federated learning: Strategies for improving communication efficiency. arXiv:1610.05492, 2016.

Krizhevsky, A., Hinton, G., et al. Learning multiple layers of features from tiny images. Technical report, 2009.

Lee, J. D., Lin, Q., Ma, T., and Yang, T. Distributed stochastic variance reduced gradient methods by sampling extra data with replacement. The Journal of Machine Learning Research, 18(1):4404–4446, 2017.

Li, M., Andersen, D. G., Smola, A. J., and Yu, K. Communication efficient distributed machine learning with the parameter server. In Advances in Neural Information Processing Systems, pp. 19–27, 2014.

Lian, X., Zhang, C., Zhang, H., Hsieh, C.-J., Zhang, W., and Liu, J. Can decentralized algorithms outperform centralized algorithms?
a case study for decentralized parallel stochastic gradient descent. In Advances in Neural Information Processing Systems, pp. 5330–5340, 2017.

Lin, Q., Lu, Z., and Xiao, L. An accelerated proximal coordinate gradient method. In Advances in Neural Information Processing Systems, pp. 3059–3067, 2014.

Liu, J. and Zhang, C. Distributed learning systems with first-order methods. arXiv preprint arXiv:2104.05245, 2021.

Ma, C., Smith, V., Jaggi, M., Jordan, M., Richtárik, P., and Takáč, M. Adding vs. averaging in distributed primal-dual optimization. In Proceedings of the International Conference on Machine Learning, pp. 1973–1982, 2015.

Nokleby, M., Raja, H., and Bajwa, W. U. Scaling-up distributed processing of data streams for machine learning. arXiv preprint arXiv:2005.08854, 2020.

Schmidt, M., Le Roux, N., and Bach, F. Minimizing finite sums with the stochastic average gradient. Mathematical Programming, 162(1-2):83–112, 2017.

Shalev-Shwartz, S. and Zhang, T. Stochastic dual coordinate ascent methods for regularized loss minimization. Journal of Machine Learning Research, 14(Feb):567–599, 2013