FINE: A Framework for Distributed Learning on Incomplete Observations for Heterogeneous Crowdsensing Networks

Luoyi Fu1, Songjun Ma2, Lingkun Kong1, Shiyu Liang2, Xinbing Wang1,2
Dept. of {Computer Science and Engineering1, Electronic Engineering2}, Shanghai Jiao Tong University, Shanghai, China
Email: {yiluofu,masongjun,klk316980786,lsy18602808513,xwang8}@sjtu.edu.cn

Abstract—In recent years there has been a wide range of applications of crowdsensing in mobile social networks and vehicle networks. As centralized learning methods lead to unreliability of data collection, high cost of central servers, and privacy concerns, one important problem is how to carry out an accurate distributed learning process to estimate parameters of an unknown model in crowdsensing. Motivated by this, we present the design, analysis and evaluation of FINE, a distributed learning Framework for Incomplete-data and Non-smooth Estimation. Our design, devoted to developing a feasible framework that efficiently and accurately learns the parameters in crowdsensing networks, well generalizes the previous learning methods in that it supports heterogeneous dimensions of data records observed by different nodes, as well as minimization based on non-smooth error functions. In particular, FINE uses a novel Distributed Record Completion algorithm that allows each node to obtain the global consensus through efficient communication with neighbors, and a Distributed Dual Average algorithm that achieves the efficiency of minimizing non-smooth error functions. Our analysis shows that all these algorithms converge, and their convergence rates are derived to confirm their efficiency. We evaluate the performance of our framework with experiments on synthetic and real-world networks.

I. INTRODUCTION

Recently, massive applications of crowdsensing/participatory sensing have emerged in mobile social networks and vehicle networks [1]–[7]. The crowd acquires some (potentially high dimensional) data from the environment, and each user in the crowd can exploit the cooperatively acquired data to perform a learning process for an accurate estimation of the parameters of some specific models. This, in turn, leads to accurate prediction of future events and correct decisions on the following actions.

In this paper, we aim to address the issue of accurate learning in undirected-static-random crowdsensing networks. To solve this problem, various approaches have been proposed [8]–[10], whose learning processes are usually formulated as optimizations of the total training error, the likelihood function, etc. [11] (e.g., linear regression, support vector machines or expectation-maximization [12]). However, these methods usually employ centralized learning algorithms, which leads to three major problems. First, in real-world crowdsensing settings, mobile devices are likely to be located over an enormous space, which makes it both energy consuming and prone to error for the central server to collect data from all mobile devices, especially those that are distributed far away from the server. Second, dealing with a large volume of data by centralized algorithms requires an expensive high-configuration data center that possesses huge memory for data storage and processing. Third, managing data by central servers makes the private information of users more likely to be exposed to the adversary [13]–[15], which might cause severe information leakage.
The three problems above imply the necessity of a distributed realization of parameter learning in crowdsensing environments. However, when applying existing distributed learning methods to our scenarios, restrictions of two characteristics in the common distributed framework spawn additional problems. To illustrate, first, for mathematical tractability, the error function is usually assumed to be smooth and convex for the design of efficient algorithms; however, since emerging crowdsensing applications may incorporate different properties, the training error functions may be non-smooth in nature [16], [17]. For instance, in distributed detection [18], source intensity functions may be non-smooth, resulting in non-smoothness of the training error as well. Second, the common framework requires each terminal to acquire a set of complete records, i.e., each record with data elements in all dimensions, to ensure the accuracy of the learning process. This implicitly assumes that the terminals are homogeneous in functionality so that each of them records the same types of signals (e.g., each mobile phone can record traveling speed, waiting time as well as ambient noise at every position). In contrast, it is impossible for each terminal to record full-dimension data in crowdsensing applications. For example, one mobile phone can only be responsible for data acquisition at its own position, leaving the observation of elements in other positions (i.e., dimensions) to other mobile phones. Moreover, mobile phones may hold different types of sensors, and are therefore unable to acquire all kinds of signals.

We are thus motivated to propose a distributed learning Framework of Incomplete-data and Non-smooth Estimation (FINE), which aims to exhibit high compatibility with learning applications in crowdsensing environments. There are two major challenges in the design: 1) It is difficult for each node to supplement the unknown dimensions of the observed vector,
especially when users hesitate to upload the acquired data to the central server for privacy concerns.1 2) Due to the huge amount of data collected and the high dimensionality of each record, it is often required that the learning process be operated by each terminal in a distributed fashion. The efficiency of minimizing a single non-smooth function is notoriously low [19], yet a distributed processing is even more challenging due to the intricate interdependency among the multiple non-smooth optimizations processed by each respective terminal.

1 The privacy concern of information leakage is not new in crowdsensing networks, and often stems from the centralized management of crowdsensing systems [13]–[15]. Therefore, users might be reluctant to upload the acquired data to the central server.

We overcome the difficulties above by designing two algorithms in FINE. First, we design a Distributed Record Completion (DRC) algorithm to allow each node to obtain global consensus. Specifically, each terminal consistently obtains incomplete records and completes the missing elements from its neighbors. The combined successive-observation and consensus design ensures that each node can acquire unbiased and accurate multidimensional global parameters in spite of the originally fragmentary inputs. Second, we design a novel Distributed Dual Average (DDA) algorithm to solve the non-smooth convex optimization problems with efficiency. To sum up, our major contributions are listed as follows:

• We propose FINE, a novel framework addressing a class of distributed learning problems in heterogeneous crowdsensing networks. FINE is robust to observation noise, capable of handling fragmentary data inputs as well as non-smooth objective functions, and efficient in solving distributed learning problems with a convergence rate2 of O(log √|V| / (1 − C)).
• We design two important algorithms in FINE: a DRC algorithm to ensure that each node acquires complete information based on its incomplete data acquisition, and a DDA algorithm to solve non-smooth convex optimizations with efficiency.
• We formally prove the convergence of the above two algorithms, and further derive their convergence rates. We provide insights on the relationship between the convergence rate and the network topology, and reveal important design principles for such networks.

2 |V| is the number of agents and 1 − C represents the spectral gap of the network.

The rest of the paper is organized as follows. In Section II, we provide the problem description and the design insights of our framework FINE. In Section III, we propose two important algorithms and demonstrate the implementation of such a framework. Section IV presents a detailed analysis of the proposed algorithms. Section V provides detailed proofs of lemmas and theorems. We give the performance evaluation of our framework in Section VI and the literature review in Section VII. Section VIII is devoted to the concluding remarks.

II. SYSTEM MODEL AND PROBLEM FORMULATION

In this section, we first present the system model and then formulate our problems, followed by giving insights into our algorithm design.

A. System Model

We consider a heterogeneous sensor network described by an undirected graph G = (V, E) consisting of |V| nodes. We use V and E to denote the set of nodes and edges, respectively. The undirected graph implies that each pair of connected sensors in the network can communicate bidirectionally.
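For concreteness, the following minimal Python sketch (using a hypothetical 5-node topology that is not part of FINE itself) builds the adjacency and Laplacian matrices of such an undirected graph and checks connectivity via the second-smallest Laplacian eigenvalue, the spectral quantity the paper relies on later (Section III notation and assumption A.2).

```python
import numpy as np

# Hypothetical 5-node crowdsensing network given as an undirected edge list.
edges = [(0, 1), (1, 2), (2, 3), (3, 4), (4, 0), (1, 3)]
n = 5

# Adjacency and degree matrices of the undirected graph G = (V, E).
A = np.zeros((n, n))
for u, v in edges:
    A[u, v] = A[v, u] = 1.0          # each connected pair communicates bidirectionally
D = np.diag(A.sum(axis=1))
L = D - A                            # graph Laplacian L = D - A

# The graph is connected iff the second-smallest eigenvalue of L is positive.
lambda2 = np.linalg.eigvalsh(L)[1]   # eigvalsh returns eigenvalues in ascending order
print("connected:", lambda2 > 1e-9)  # True for this topology
```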
The heterogeneity refers to the fact that terminals observe different dimensions of data. We let yk(i) denote the incomplete data vector observed by terminal k at time i. For practical consideration, we allow the observation to include noise εk(i). Assuming that y∗ is the M-dimensional complete and correct data vector, the incomplete observation of the k-th agent, yk(i), can be expressed via a linear mapping Hk of the global data vector y∗:

yk(i) = Hk y∗ + εk(i) ∈ R^M,   (1)

where Hk is a matrix in R^{M×M}.

Each terminal requires the full-dimensional data to conduct the learning process. Assume that the terminals have various error functions fk. We allow each local error function to be non-smooth. The local training error ϵk at node k is a function of the global tunable parameter vector θ and the global data y, i.e., ϵk = fk(θ, y), which is convex and non-smooth in θ.

B. Problem Formulation

In consideration of the above settings, we should first guarantee that each node can obtain an accurate estimation ŷ of the global data vector. This encourages us to design the DRC algorithm. In short, the goal of DRC is to make each node k hold a complete estimation which satisfies:

lim_{i→∞} ŷk(i) = y∗.

Then, with the estimations and the settings of the error functions, we find the optimal parameter θ∗ that minimizes the total training error:

θ∗ := arg min_θ ∑_{k=1}^{|V|} fk(θ, ŷ),  s.t.  θ ∈ Θ, ŷ ∈ Y,

where Θ ⊂ R^Q and Y ⊂ R^M are both convex sets. This encourages us to design the DDA algorithm. Note that the solution needs information exchange between a node and its neighbors, which involves consideration of the network topology.

To sum up, our formulation deals with non-smooth learning. We extend the previous dual averaging approach [20], [21] by taking the incomplete observations and observation noise into account.

Remark: As a contrast, the traditional formulation [8]–[10], [22], [23] considers the smooth optimization problem based on a homogeneous sensor network. The homogeneity implies that each agent k should record the same types of signals and
full-dimensional observations yk. The learning problem is to find θ∗ which minimizes the summation of the local training errors:

θ∗ := arg min_θ ∑_{k=1}^{|V|} f(θ, yk),  s.t.  θ ∈ Θ, yk ∈ Y,

where Θ and Y are both convex sets, θ is a globally tunable parameter vector, and the function f is assumed to be convex and smooth in the parameter θ.

C. Algorithm Design Insights

In this section, we briefly describe the insights of the algorithm design and explain how FINE can efficiently and accurately deal with the challenging distributed learning problems in heterogeneous crowdsensing networks. FINE uses a Distributed Record Completion (DRC) algorithm (Section III-A) to ensure that each node obtains global consensus in spite of the incomplete local observations, and a Distributed Dual Average (DDA) algorithm (Section III-B) to efficiently solve the non-smooth optimization problems.

1) DRC Algorithm: Our DRC allows the nodes to communicate with each other to complete the data and achieve consensus, under the condition that successive observations are available at each node. This combined successive-observation and consensus design ensures that each node can acquire unbiased and accurate multidimensional global parameters y∗. If nodes cooperate but only process the initial noisy observations (i.e., only {yk(0)} are employed), as in traditional consensus [24]–[26], they all converge to the average of the initial estimates, which might result in severe bias. Therefore, the requirement of both successive observations and consensus in DRC ensures that each node converges to an unbiased and accurate estimate of the global vector y∗.

Remark: The communication overhead, which quantifies the amount of communication among sensors or mobile phone users while reaching a consensus, is an important performance metric of consensus-based learning algorithms [27]–[30]. However, in this paper we are interested in measuring the time it takes for the whole network to reach convergence of the optimization, i.e., the optimal parameter θ∗. Here the time refers to the communication complexity in reaching convergence of the optimization, while neglecting the local computation complexity inside mobile devices, as it is highly determined by the sensors required in a crowdsensing task and is considerably small (it can be completed within clock ticks).

2) DDA Algorithm: A well-known sub-gradient method [31] used for solving a single non-smooth optimization has a time efficiency of O(1/ε²), where ε is the tolerable error. For solving a distributed non-smooth optimization problem consisting of multiple interdependent functions, the efficiency of the sub-gradient method would be O(|V|³/ε²) [21], [32], [33], where |V| is the network size. This order does not reflect the explicit influence of the network topology, but only the worst case of efficiency. References [23], [34], [35] provide the convergence rate of the optimization error under some special network topologies. Aiming at the general network topology
Towards this end, we build our algorithm on an efficient non-smooth optimization methodology called the dual averaging method introduced in [20]. The dual averaging in nature is an efficient non-smooth convex optimization method. By realizing the dual averaging method into a distributed manner, as will be prescribed in later sections, we will show that our algorithm can return a consistent estimate θ ∗ for each node. Additionally, as we will also provably demonstrate later, the convergence rate of our distributed algorithm can also achieve O( log √ |V| 1−C ). Remark: The DDA algorithm, as we will unfold in sequel, is designed by extending the centralized dual averaging method into a distributed form, requiring that each node performs local information exchange that follows a weighting process (where each edge is assigned a weight). Thus, intuitively the process is heavily correlated with the network topology. Compared to [36], the DDA algorithm provides consistent efficiency despite that it is applied to solve more complicated distributed learning problems that take practical factors like partial observation and observation noise into account. Up till now, we have presented our problem, and articulate all insights which inform FINE’s design. In the sequel, we will detail the design of DRC and DDA in FINE. III. ALGORITHMS In this section, we describe the details of the design of DRC and DDA algorithms used in FINE. We firstly introduce the notations. Let En denote n × n identity matrix; let 1n, 0n denote the column vectors of ones and zeros, defined in R n respectively; let ∥·∥ denote the standard Euclidean 2-norm for a vector and the induced 2-norm for matrixes, equavelent to the matrix spectral radius for symmetric matrixes. Besides, we use ∥·∥∗ to denote the dual norm to ∥·∥,defined by ∥u∥∗ := sup∥v∥ ⟨u, v⟩, which refer to the value of the linear function u ∈ X∗ at v ∈ X (X is a vector space and X∗ is its dual space). We further let P[·] and E[·] denote the probability and the expectation operators, respectively. We use the symbol ⊗ to denote the Kronecker product manipulation, commonly used in matrix manipulations. For instance, the Kronecker product of the n×n matrix L and Em is an nm×nm matrix, denoted by L ⊗ Em. In the undirected graph G we consider, we denote the neighbor of an arbitrary node v ∈ V by Nv = {u ∈ V|euv ∈ E}, the number of edges incident to v by the degree dv = |Nv| of node v, and the degree matrix by D = diag(d1, ..., d|V|). Then, we use an adjacent matrix, A = [Auv]|V|×|V|, to describe the network connectivity. We set Auv = 1, if euv ∈ E; or 0
otherwise. Further, we define the graph Laplacian matrix3 L as L = D − A, a positive semidefinite matrix with ordered eigenvalues 0 ≤ λ1(L) ≤ λ2(L) ≤ ... ≤ λ|V|(L). Moreover, an n × n matrix B has a series of ordered singular values σ1(B) ≥ σ2(B) ≥ ... ≥ σn(B) ≥ 0. Interested readers may refer to [37], [38] for more details on spectral graph theory.

3 Numerical natures of the graph, for example connectivity, expanding properties and diameter, can be investigated with the graph Laplacian matrix. In this paper, we define the network connectivity using the Laplacian spectrum, which will be illustrated in the following assumption A.2.

A. Distributed Record Completion

The Distributed Record Completion, or DRC, algorithm allows each node k to obtain an accurate and complete global data vector ŷk. The DRC algorithm is an iterative updating process, as described below.

Algorithm 1 Framework of DRC for Each Terminal.
Start: Initial non-random observation yk(0) ∈ R^M, and let uk(0) = yk(0).
Output: Estimation of the global data, ŷk.
1: for i = 1 to T do
2:   Receiving neighbors' estimations of the global data, i.e., {uv(i), v ∈ Nk};
3:   Comparing its own estimation with its incomplete observation, i.e., Hk uk(i) − yk(i);
4:   Comparing its own estimation with its neighbors' estimations, i.e., ∑_{v∈Nk} (uk(i) − uv(i));
5:   Updating the estimation uk(i+1) based on Eq. (2);
6: end for
7: return ŷk = uk(T);

Alg. 1 summarizes the outline of DRC. To illustrate, the sequence {uk(i)}i≥0, generated by each node k as follows, represents the estimated records on all dimensions of the data. Starting from the initial non-random observation yk(0) ∈ R^M, at each iteration i, after observing an incomplete data vector yk(i), each node k updates uk(i) by a distributed iterative algorithm. In this algorithm, each node compares its estimated record uk(i) with its neighbors' records, and also with the observation yk(i). Then it determines the estimated record of the next time slot from the difference between uk(i) and the deviations, as follows:

uk(i+1) = uk(i) − α(i) ∑_{v∈Nk} (uk(i) − uv(i)) − β(i) Hk^T (Hk uk(i) − yk(i)),   (2)

where α(i) ∑_{v∈Nk} (uk(i) − uv(i)) is the deviation of the record from the neighbors' records and β(i) Hk^T (Hk uk(i) − yk(i)) reflects the deviation from the observations. Since uk is node k's estimation of all dimensions, the linear mapping Hk is used for the comparison between the record and the observation. Both of the positive weight sequences {α(i)}i≥0 and {β(i)}i≥0 satisfy the persistence condition C.5 given in Appendix A.

For ease of notation, we rewrite the iteration in Equation (2) in a compact form, which describes the consensus process of all nodes. To begin with, we store the incomplete observations of all nodes at iteration i in a long vector y(i) = [y1^T(i), ..., y|V|^T(i)]^T, store the updates of the i-th iteration in another long vector u(i) = [u1^T(i), ..., u|V|^T(i)]^T, and define the following two matrices:

H̄ = diag[H1^T, ..., H|V|^T],   (3)
H̃ = diag[H1^T H1, ..., H|V|^T H|V|].   (4)

Then, using the Kronecker product symbol, we can rewrite Equation (2) in the compact form

u(i+1) = u(i) − α(i)(L ⊗ EM) u(i) − β(i) H̄ (H̄^T u(i) − y(i)).   (5)

Given the total number of iteration steps T, each node k obtains a data vector uk(T) in the end. Let ŷk = uk(T); in Section IV-A, we will show that ŷk is in fact an unbiased estimate of y∗. Now that we have the detailed updating process used in the DRC algorithm, we solve the distributed non-smooth minimization problem based on ŷk.
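To make the per-node update of Eq. (2) concrete, the following minimal Python sketch runs it on a toy instance. The selection matrix Hk, the noise level, the fixed neighbor estimates and the decaying weights α(i), β(i) are all illustrative assumptions (the paper only requires the persistence condition C.5), and in the full DRC every node would update its own estimate simultaneously rather than holding the neighbors fixed.

```python
import numpy as np

def drc_step(u_k, neighbor_us, y_k, H_k, alpha, beta):
    """One DRC update at node k, following Eq. (2):
    u_k(i+1) = u_k(i) - alpha * sum_{v in N_k} (u_k(i) - u_v(i))
                      - beta  * H_k^T (H_k u_k(i) - y_k(i))."""
    consensus = sum(u_k - u_v for u_v in neighbor_us)     # deviation from neighbors
    innovation = H_k.T @ (H_k @ u_k - y_k)                # deviation from own observation
    return u_k - alpha * consensus - beta * innovation

# Toy instance: M = 3 global dimensions, node k only senses dimension 0.
M = 3
y_star = np.array([1.0, -2.0, 0.5])
H_k = np.zeros((M, M))
H_k[0, 0] = 1.0                                           # illustrative selection mapping H_k
rng = np.random.default_rng(0)

u_k = H_k @ y_star                                        # initial observation, missing dims are 0
neighbor_us = [y_star + 0.05 * rng.standard_normal(M) for _ in range(2)]  # hypothetical neighbors
for i in range(200):
    y_k_i = H_k @ y_star + 0.01 * rng.standard_normal(M)  # noisy partial observation at step i
    alpha, beta = 1.0 / (i + 2), 1.0 / (i + 2) ** 0.75    # illustrative decaying weight sequences
    u_k = drc_step(u_k, neighbor_us, y_k_i, H_k, alpha, beta)
print(u_k)  # close to y_star: dimension 0 is pinned by its own observations, the rest by neighbors
```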
B. Distributed Dual Average

Based on ŷk, we now use the Distributed Dual Average, or DDA, algorithm to provide an accurate estimate θ̂∗k of the optimal parameter θ∗ for each node in a distributed style. Formally, we need to solve the following minimization:

min_θ ∑_{k=1}^{|V|} fk(θ, ŷk),  s.t.  θ ∈ Θ, ŷk ∈ Y.   (6)

Note that fk is a non-smooth function, where non-smoothness implies that the function does not have a continuous gradient, which makes minimizing such a function more difficult than a smooth one. To deal with the non-smooth function, the sub-gradient method should be employed, at the price of a slow convergence [39]. For example, solving a single non-smooth optimization has an efficiency estimate of O(1/ε²) [40], where ε is the desired accuracy of the approximate solution, while minimizing a single smooth function only requires an efficiency estimate of the order O(1/√ε) [31]. Furthermore, solving the distributed non-smooth optimization problem consisting of multiple interdependent non-smooth functions has even lower efficiency. Therefore, we propose the DDA algorithm to improve the efficiency of the distributed non-smooth optimization of Eq. (6) in crowdsensing networks.

Distributed Dual Averaging (DDA) Algorithm:

Alg. 2 summarizes the outline of DDA. It is designed by extending the centralized dual averaging method [20] into a distributed form. Now we provide the details of the algorithm. The DDA algorithm requires each node to exchange information with its neighbors, and the exchange follows a weighting process, where each edge is assigned a weight. Thus, the process is strongly correlated with the network topology.
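Algorithm 2 below consumes a local sub-gradient gk(t) at every iteration. The paper keeps fk abstract; purely as an illustration, one common non-smooth choice is an absolute-deviation (ℓ1) training error built on a hypothetical local design matrix Xk, whose sub-gradient has the closed form sketched below.

```python
import numpy as np

def l1_error(theta, y_hat_k, X_k):
    """Illustrative non-smooth local training error (not prescribed by the paper):
    f_k(theta, y_hat_k) = || X_k @ theta - y_hat_k ||_1."""
    return np.abs(X_k @ theta - y_hat_k).sum()

def l1_subgradient(theta, y_hat_k, X_k):
    """One valid sub-gradient of f_k at theta: g_k = X_k^T sign(X_k @ theta - y_hat_k).
    (np.sign returns 0 where the residual is 0, which is an admissible choice.)"""
    return X_k.T @ np.sign(X_k @ theta - y_hat_k)
```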
Algorithm 2 Framework of DDA for Each Terminal.
Start: Initial pair of vectors (θk(0), µk(0)) ∈ Θ × R^M, and let µk(0) = ŷk.
Output: Estimation of the optimal parameter θ∗.
1: for t = 1 to T do
2:   Computing the sub-gradient gk(t) ∈ ∇θ fk(θk(t), ŷk);
3:   Receiving estimated information from neighbors, i.e., {µj(t), j ∈ Nk};
4:   Updating (θk(t), µk(t)) with Eqs. (7) and (8);
5: end for
6: return θ̂k(T) with Eq. (9);

At each iteration t, each node k maintains a pair of vectors (θk(t), µk(t)) ∈ Θ × R^M, computes its own sub-gradient gk(t) ∈ ∇θ fk(θk(t), ŷk), and receives the sequences from its neighboring nodes, i.e., {µj(t), j ∈ Nk}. Next, at each iteration, each node updates its maintained vector (θk(t), µk(t)) by weighting the values from its neighbors. To model this weighting process, we use P ∈ R^{|V|×|V|} to denote the edge weight matrix of the graph G. Thus, Pkl > 0 if and only if ekl ∈ E and k ≠ l. This matrix represents the weight of each link, which can capture certain properties of the link. For instance, the value can represent the intimacy between two nodes; a higher value implies that the neighbor on this link contributes more to the information exchange. Since Pkl > 0 only if (k, l) ∈ E, the weight update is described by the following equations:

µk(t+1) = ∑_{l∈Nk} Pkl µl(t) + gk(t),   (7)
θk(t+1) = πω(t+1)(−µk(t+1)),   (8)

where the function πω(µ) is defined by

πω(µ) = arg min_{ξ∈Θ} {−⟨µ, ξ⟩ + ω ϕ(ξ)},

and {ω(t)} is a non-decreasing sequence of positive step-sizes. Specifically, we assume that the matrix P is doubly stochastic, so that ∑_{l=1}^{|V|} Pkl = ∑_{l∈Nk} Pkl = 1 for all k ∈ V, and ∑_{k=1}^{|V|} Pkl = ∑_{k∈Nl} Pkl = 1 for all l ∈ V.

To sum up, each node k computes its new dual sequence µk(t+1) by weighting both its own sub-gradient gk(t) and the sequences {µl(t), l ∈ Nk} stored in its neighborhood Nk, and the node also computes its next local primal parameter θk(t+1) by a projection defined by the proximal function ϕ and step-size ω(t) > 0.

The intuition behind this method is as follows: based on its current iterate (θk(t), µk(t)), each node k chooses its next iterate θk(t+1) so as to minimize an averaged first-order approximation to the function f = ∑_k fk, while the proximal function ϕ and step-size ω(t) > 0 ensure that {θk(t)} does not oscillate wildly across iterations.

At the end of iteration T, each node k has obtained a sequence {θk(t)}1≤t≤T. We run a local average for each node as follows:

θ̂k(T) = (1/T) ∑_{t=1}^{T} θk(t).   (9)

This means that if we let θ̂∗k = θ̂k(T) at the end of iteration T, we will have lim_{T→∞} f(θ̂∗k) = f(θ∗). Thus, with this iteration, each node k can obtain an estimate of the optimal parameter with any desired accuracy. We will prove the convergence of DDA in Section IV.

To sum up, in order to solve distributed non-smooth minimization problems in heterogeneous crowdsensing networks, we first present a DRC algorithm to allow each heterogeneous node to obtain an accurate estimate of the globally required data vector y∗. Based on this, we design a DDA algorithm to ensure that each node obtains an accurate estimate of the optimal parameter θ∗. In the next section, we present a formal analysis of the convergence and convergence rates of both algorithms.

IV. MAIN PROPERTIES OF DRC AND DDA

In this section, we present the main properties of the DRC and DDA algorithms. We defer the detailed proofs to Section V.
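Before turning to the analysis, the following Python sketch ties Section III-B together. The Metropolis rule used to build the doubly stochastic weight matrix P and the Euclidean prox-function ϕ(θ) = ½‖θ‖² with Θ = R^Q (under which the projection of Eq. (8) reduces to θk(t+1) = −µk(t+1)/ω(t+1)) are assumptions made here for illustration; the paper itself only requires P to be doubly stochastic and ϕ to satisfy (A.4).

```python
import numpy as np

def metropolis_weights(A):
    """One common way (assumed here, not prescribed by the paper) to build a
    doubly stochastic, symmetric P from the adjacency matrix A."""
    n = A.shape[0]
    deg = A.sum(axis=1)
    P = np.zeros((n, n))
    for k in range(n):
        for l in range(n):
            if A[k, l] > 0:
                P[k, l] = 1.0 / (1.0 + max(deg[k], deg[l]))
        P[k, k] = 1.0 - P[k].sum()        # self weight keeps every row summing to 1
    return P

def dda(P, subgrads, theta_dim, T, omega):
    """Sketch of Eqs. (7)-(9) under the Euclidean prox assumption.
    subgrads[k](theta) must return a sub-gradient of f_k(., y_hat_k) at theta."""
    n = P.shape[0]
    mu = np.zeros((n, theta_dim))
    theta = np.zeros((n, theta_dim))
    theta_avg = np.zeros((n, theta_dim))
    for t in range(1, T + 1):
        g = np.stack([subgrads[k](theta[k]) for k in range(n)])
        mu = P @ mu + g                   # Eq. (7): doubly stochastic mixing + local sub-gradient
        theta = -mu / omega(t)            # Eq. (8) with phi = 0.5*||.||^2 and Theta = R^Q
        theta_avg += theta / T            # Eq. (9): running average theta_hat_k(T)
    return theta_avg                      # row k holds node k's running average
```

For instance, combining this with the ℓ1 sub-gradients sketched in Section III-B and the step-size choice ω(t) = √t, each row of the returned array is the local running average θ̂k(T) of Eq. (9).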
A. Main Properties of DRC

To begin with, we present the main properties with regard to the asymptotic unbiasedness and the consistency of the DRC algorithm. Furthermore, we carry out a convergence rate analysis by studying the deviation characteristic of DRC. The results rely on the following three assumptions.

(A.1) Observation Noise: The noise sequence {ε(i) = [ε1^T(i), ..., ε|V|^T(i)]^T}i≥0 is i.i.d. with zero mean. Moreover, the per-node noise sequences {εk(i)}1≤k≤|V|, i≥0 are independent of each other, and the covariance of the observation noise, Sε, is independent of the time i, i.e.,

E[ε(i)ε(j)^T] = Sε δij, ∀i, j ≥ 0,   (10)

where δij = 1 if i = j and 0 otherwise.

(A.2) Network Connectivity: The second eigenvalue of the graph Laplacian L is non-negative, i.e., λ2(L) ≥ 0. We require the graph to be connected to allow communication among nodes; this can be guaranteed if λ2(L) > 0. See [37], [38] for details.

Before presenting the final assumption, we first give the following definition.

Definition 1. The observations formulated by Equation (1) are distributedly observable if the matrix H, defined by H = ∑_{k=1}^{|V|} Hk^T Hk, is of full rank.

Remark: This distributed observability is essentially an extension of the observability condition for the centralized observing system, which is designed to obtain a consistent and complete observation of the vector y.
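As a quick numerical check of Definition 1 (illustrative only; the selection matrices below are hypothetical), distributed observability can be verified by testing whether H = ∑k Hk^T Hk has full rank.

```python
import numpy as np

def distributed_observable(H_list, tol=1e-9):
    """Definition 1: the observation model (1) is distributedly observable
    iff H = sum_k H_k^T H_k is of full rank."""
    M = H_list[0].shape[1]
    H = sum(Hk.T @ Hk for Hk in H_list)
    return np.linalg.matrix_rank(H, tol=tol) == M

# Hypothetical example with M = 3: node k senses only dimension k.
M = 3
H_list = [np.diag(row) for row in np.eye(M)]
print(distributed_observable(H_list))   # True: jointly, the nodes cover all dimensions
```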
Now let us present the last assumption.

(A.3) Observability: The observations formulated by Equation (1) are distributedly observable, as defined in Definition 1.

1) Unbiasedness and consistency of DRC: In this part, we show the unbiasedness and the consistency of the DRC algorithm, and we provide two theorems to illustrate them respectively.

Theorem 1. Consider the DRC algorithm (Section III-A) under assumptions A.1-A.3. The record sequence {uk(i)}i≥0 at node k is asymptotically unbiased:

lim_{i→∞} E[uk(i)] = y∗,  ∀1 ≤ k ≤ |V|.   (11)

We defer the proof to Section V-A. Theorem 1 shows the unbiasedness of the algorithm. It indicates that each node's estimation of the global data is correct on average in the long run. The consistency of the DRC algorithm is guaranteed by the following theorem.

Theorem 2. Consider the DRC algorithm (Section III-A) under assumptions A.1-A.3. The record sequence {uk(i)}i≥0 at node k is consistent:

P[lim_{i→∞} uk(i) = y∗, ∀1 ≤ k ≤ |V|] = 1.

We provide the proof in Appendix B. Based on Theorem 2, the record sequence {uk(i)}i≥1 at every node converges, with probability 1, to the true vector y∗.

2) Convergence rate analysis: We now analyze the convergence rate of the DRC algorithm via its deviation characteristic. We first present a definition which is used to characterize the convergence rate of a sequential process.

Definition 2. A sequence of records {u(i)}i≥0 is asymptotically normal if a positive semidefinite matrix S(y) exists such that

lim_{i→∞} √i (uk(i) − y∗) ⇒ N(0M, Skk(y(i))),  ∀1 ≤ k ≤ |V|.

The matrix S(y(i)) is called the asymptotic variance of the observing sequence {y(i)}i≥0, and Skk(y) ∈ R^{M×M} denotes the k-th principal block of S(y(i)).

In the following, we analyze the asymptotic normality of the DRC algorithm. Let λmin(γL ⊗ EM + H̃) denote the smallest eigenvalue of [γL ⊗ EM + H̃]. Recalling the noise covariance Sε in (10), we present the following theorem to establish the asymptotic normality of the DRC algorithm.

Theorem 3. Consider the DRC algorithm (Section III-A) under assumptions A.1-A.3, with weight sequences {α(i)}i≥0 and {β(i)}i≥0 given by

α(i) = a/(i + 1),  lim_{i→∞} α(i)/β(i) = γ > 0,

for some a > 0. Let the record sequence {u(i)}i≥0 be the state sequence generated by (5). Then, for a > 1/(2λmin(γL ⊗ EM + H̃)), we obtain

√i (u(i) − 1|V| ⊗ y∗) ⇒ N(0, S(y(i))),

where

S(y(i)) = a² ∫_0^∞ e^{Σv} S0 e^{Σv} dv,   (12)
Σ = −a[γL ⊗ EM + H̃] + (1/2) E_{M|V|},   (13)

and

S0 = H̄ Sε H̄^T.   (14)

In particular, the record sequence {uk(i)}i≥0 at any node k is asymptotically normal:

√i (uk(i) − y∗) ⇒ N(0, Skk(y(i))).

We provide the proof in Appendix B. Therefore, the error sequence {uk(i) − y∗}i≥0 at each node can be regarded as converging to a normal distribution at a rate of 1/√i. Up to now, we have presented the asymptotic unbiasedness, the consistency and the asymptotic normality of the DRC algorithm. Next, we present the main properties of the DDA algorithm.

B. Main Properties of DDA

In this subsection, we prove the convergence of the running average θ̂k(T) to the optimal parameter θ∗ and derive the convergence rate of the DDA algorithm. We present the following theorem.

Theorem 4. Suppose the random families {θk(t)}∞t=0 and {µk(t)}∞t=0 are generated by iterations (8) and (7), with the positive non-decreasing step-size sequence {ω(t)}∞t=0, where ϕ is strongly convex with respect to the norm ‖·‖ with dual norm ‖·‖∗.
B. Main Properties of DDA

In this subsection, we prove the convergence of the running average θ̂_k(T) to the optimal parameter θ* and derive the convergence rate of the DDA algorithm. We now present the following theorem.

Theorem 4. Let the random families {θ_k(t)}_{t≥0} and {µ_k(t)}_{t≥0} be generated by iterations (8) and (7), with a positive non-decreasing step-size sequence {ω(t)}_{t≥0}, where ϕ is strongly convex with respect to the norm ‖·‖ with dual norm ‖·‖_*. Let the record error ‖E[ŷ_k] − 1_{|V|}⊗y*‖ be bounded by an arbitrarily small constant C_err. For any θ* ∈ Θ and each node k ∈ V, we have
\[
f\big(\hat\theta_k(T), \hat y_k\big) - f\big(\theta^*, y^*\big) \le \mathrm{OPT} + \mathrm{NET} + \mathrm{SAMP},
\]
where
\[
\mathrm{OPT} = \frac{\omega(T)}{T}\,\phi(\theta^*) + \frac{L^2}{2T\tau}\sum_{t=1}^{T}\frac{1}{\omega(t)},
\]
\[
\mathrm{NET} = \frac{L}{T}\sum_{t=1}^{T}\frac{1}{\omega(t)}\,
\mathbb{E}\Bigg[\frac{2}{|V|}\sum_{j=1}^{|V|}\big\|\bar\mu(t)-\mu_j(t)\big\|_* + \big\|\bar\mu(t)-\mu_k(t)\big\|\Bigg],
\]
\[
\mathrm{SAMP} = L\,C_{err}, \qquad \bar\mu(t) = \frac{1}{|V|}\sum_{k=1}^{|V|}\mu_k(t).
\]
Recall that τ is the strong-convexity parameter.

Theorem 4 bounds the gap between the estimated result and the true optimum by the sum of three error terms: (1) the OPT term is the optimization error; (2) the NET term is induced by the differing estimates across nodes; and (3) the SAMP term is incurred by the noisy input. The theorem thus characterizes how this gap depends on T, which helps us understand the convergence of the DDA algorithm. The detailed proof will be given in Section V.
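To make the roles of the three terms concrete, consider the square-root step size ω(t) = A√t that is also used in Theorem 5 below. This is only an unpacking of Theorem 4 under that particular (assumed) step-size choice, using the elementary bound on the sum of inverse square roots:

```latex
\sum_{t=1}^{T}\frac{1}{\omega(t)} \;=\; \frac{1}{A}\sum_{t=1}^{T}\frac{1}{\sqrt{t}} \;\le\; \frac{2\sqrt{T}}{A}
\qquad\Longrightarrow\qquad
\mathrm{OPT} \;\le\; \frac{A\,\phi(\theta^*)}{\sqrt{T}} \;+\; \frac{L^{2}}{A\tau\sqrt{T}},
\qquad
\mathrm{SAMP} \;=\; L\,C_{err}.
```

The remaining NET term is governed by how quickly the local dual variables µ_k(t) agree with their network average µ̄(t), which is exactly the quantity that the spectral analysis behind Theorem 5 controls.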
We next investigate the relationship between the convergence rate and the spectral properties of the network. For a given graph G, we assume that communication between nodes is governed by a doubly stochastic matrix P. In the following, we show that the spectral gap of the network, i.e., γ(P) = 1 − σ2(P), strongly influences the convergence rate of DDA, where σ2(P) is the second largest singular value of P.

Theorem 5. Following Theorem 4 and recalling that ϕ(θ*) ≤ A², if we choose the step size ω(t) and the record error bound C_err as
\[
\omega(t) = A\sqrt{t}, \qquad C_{err} = \frac{2L}{A\sqrt{T}}\cdot\frac{\ln\!\big(T\sqrt{|V|}\big)}{1-\sigma_2(P)},
\]
then
\[
f\big(\hat\theta_k(T), \hat y_k\big) - f\big(\theta^*, y^*\big) \le \frac{16L^2}{A\sqrt{T}}\cdot\frac{\ln\!\big(T\sqrt{|V|}\big)}{1-\sigma_2(P)}, \qquad \text{for all } k \in V.
\]

We defer the proof to Section V-C. Theorem 5 shows that the convergence rate of distributed subgradient methods relies heavily on the spectral properties of the graph. The dependence on the spectral quantity 1 − σ2(P) is natural, since many works have observed that the propagation of information depends strongly on the spectral properties of the underlying network.

Having presented the main properties of our algorithms, we now turn to the detailed proofs.

V. PROOF OF THEOREMS

A. Proof of Theorem 1

Proof: Taking the expectation of both sides of Eq. (5), it follows that
\[
\mathbb{E}[u(i+1)] = \mathbb{E}[u(i)] - \alpha(i)\,(L\otimes E_M)\,\mathbb{E}[u(i)] + \beta(i)\,\bar H\,\mathbb{E}[y(i)] - \beta(i)\,\bar H\bar H^{T}\,\mathbb{E}[u(i)]. \quad (15)
\]
Given that
\[
(L\otimes E_M)\,(1_{|V|}\otimes y^*) = 0_{|V|M}, \quad (16)
\]
\[
\tilde H\,(1_{|V|}\otimes y^*) = \bar H\,\mathbb{E}[y(i)], \quad (17)
\]
subtracting 1_{|V|}⊗y* from both sides of Eq. (15) yields
\[
\mathbb{E}[u(i+1)] - 1_{|V|}\otimes y^* = \big[E_{|V|M} - \alpha(i)\,L\otimes E_M - \beta(i)\,\tilde H\big]\,\big[\mathbb{E}[u(i)] - 1_{|V|}\otimes y^*\big]. \quad (18)
\]
Iterating (18), we have, for each i ≥ i0 = max{i1, i2},
\[
\big\|\mathbb{E}[u(i)] - 1_{|V|}\otimes y^*\big\| \le \Bigg(\prod_{j=i_0}^{i-1}\big\|E_{M|V|} - \alpha(j)\,L\otimes E_M - \beta(j)\,\tilde H\big\|\Bigg)\,\big\|\mathbb{E}[u(i_0)] - 1_{|V|}\otimes y^*\big\|. \quad (19)
\]
To further bound this expression, we use the following facts. First, since α(i)/β(i) → γ, we have
\[
\exists\, i_1: \quad \frac{\gamma}{2} \le \frac{\alpha(i)}{\beta(i)} \le 2\gamma, \qquad \forall\, i \ge i_1. \quad (20)
\]
Second, let λ_min(γL⊗E_M + H̃) be the smallest eigenvalue of the positive definite matrix [γL⊗E_M + H̃] (due to the page limit, we omit the proof of its positive definiteness). Since α(i) → 0, we have
\[
\exists\, i_2: \quad \alpha(i) \le \frac{1}{\lambda_{\min}\!\big(\gamma L\otimes E_M + \tilde H\big)}, \qquad \forall\, i \ge i_2. \quad (21)
\]
Third, two further facts are: 1) λ_min(A + B) ≥ λ_min(A) + λ_min(B) (Courant–Fischer minimax theorem [41]); 2) λ_min(L⊗E_M) = λ_min(L) ≥ 0.

Based on the above facts and (21), each factor in (19) satisfies, for j ≥ i0,
\[
\big\|E_{M|V|} - \alpha(j)\,L\otimes E_M - \beta(j)\,\tilde H\big\|
= \Big\|E_{M|V|} - \beta(j)\Big(\tfrac{\alpha(j)}{\beta(j)}\,L\otimes E_M + \tilde H\Big)\Big\|
= 1 - \beta(j)\,\lambda_{\min}\Big(\tfrac{\alpha(j)}{\beta(j)}\,L\otimes E_M + \tilde H\Big)
\]
\[
= 1 - \beta(j)\,\lambda_{\min}\Big(\big(\tfrac{\alpha(j)}{\beta(j)} - \tfrac{\gamma}{2}\big)L\otimes E_M + \tfrac{\gamma}{2}\,L\otimes E_M + \tilde H\Big)
\le 1 - \beta(j)\,\lambda_{\min}\Big(\tfrac{\gamma}{2}\,L\otimes E_M + \tilde H\Big). \quad (22)
\]
From (19) and (22), we have, for each i > i0,
\[
\big\|\mathbb{E}[u(i)] - 1_{|V|}\otimes y^*\big\| \le \Bigg(\prod_{j=i_0}^{i-1}\Big(1 - \beta(j)\,\lambda_{\min}\big(\tfrac{\gamma}{2}\,L\otimes E_M + \tilde H\big)\Big)\Bigg)\,\big\|\mathbb{E}[u(i_0)] - 1_{|V|}\otimes y^*\big\|. \quad (23)
\]
Finally, using the inequality 1 − a ≤ e^{−a} for 0 ≤ a ≤ 1, we get
\[
\big\|\mathbb{E}[u(i)] - 1_{|V|}\otimes y^*\big\| \le \exp\!\Big[-\lambda_{\min}\big(\tfrac{\gamma}{2}\,L\otimes E_M + \tilde H\big)\sum_{j=i_0}^{i-1}\beta(j)\Big]\,\big\|\mathbb{E}[u(i_0)] - 1_{|V|}\otimes y^*\big\|, \qquad i > i_0. \quad (24)
\]
Since λ_min(γL⊗E_M + H̃) > 0 and the sum of β(j) diverges, we conclude that
\[
\lim_{i\to\infty}\big\|\mathbb{E}[u(i)] - 1_{|V|}\otimes y^*\big\| = 0.
\]
Thus we complete the proof.
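As a quick numerical sanity check of step (22), one can verify on a small synthetic instance that, once α(j) is small, the contraction factor ‖E_{M|V|} − α(j)L⊗E_M − β(j)H̃‖ is indeed no larger than 1 − β(j)λ_min(γ/2·L⊗E_M + H̃). The graph, observation blocks and constants below are purely illustrative.

```python
import numpy as np
import networkx as nx

# Hypothetical small instance: 4 nodes on a cycle, M = 4, node k observes coordinate k,
# so the k-th diagonal block of H_tilde is e_k e_k^T.
n = M = 4
L = nx.laplacian_matrix(nx.cycle_graph(n)).toarray().astype(float)
LE = np.kron(L, np.eye(M))
H = np.zeros((n * M, n * M))
for k in range(n):
    H[k * M + k, k * M + k] = 1.0

gamma, a = 1.0, 1.0                                  # illustrative constants, not taken from the paper
lam = np.linalg.eigvalsh(gamma / 2 * LE + H).min()   # lambda_min(gamma/2 * L (x) E_M + H_tilde)

for j in [10, 100, 1000]:                            # j large enough that alpha(j) is small
    alpha = a / (j + 1)
    beta = alpha / gamma                             # alpha(j) / beta(j) = gamma
    lhs = np.linalg.norm(np.eye(n * M) - alpha * LE - beta * H, 2)   # spectral norm of the factor
    rhs = 1 - beta * lam
    print(j, lhs <= rhs + 1e-12)                     # expected: True for each j
```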
B. Proof of Theorem 4

Before proving the convergence of the algorithm, we present some basic assumptions and necessary lemmas.

(A.4) There exists a prox-function ϕ: Θ → R that is τ-strongly convex with respect to the norm ‖·‖, i.e.,
\[
\phi(\theta_1) \ge \phi(\theta_2) + \langle\nabla_\theta\phi(\theta_2),\,\theta_1-\theta_2\rangle + \frac{\tau}{2}\,\|\theta_1-\theta_2\|^2, \qquad \text{for } \theta_1, \theta_2 \in \Theta.
\]
The function ϕ is non-negative over Θ and ϕ(0) = 0. The prox-center of Θ is given by θ0 = arg min_{θ∈Θ} ϕ(θ). Moreover, we assume that the optimal parameter θ* satisfies ϕ(θ*) ≤ A².
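A standard example satisfying (A.4), given here only for illustration, is the Euclidean prox-function:

```latex
\phi(\theta) = \tfrac{1}{2}\,\|\theta\|_2^2,
\qquad
\phi(\theta_1) = \phi(\theta_2) + \langle \theta_2,\; \theta_1-\theta_2\rangle + \tfrac{1}{2}\,\|\theta_1-\theta_2\|_2^2,
```

so ϕ is 1-strongly convex (τ = 1), non-negative, and ϕ(0) = 0; when 0 ∈ Θ the prox-center is θ0 = 0, and if Θ lies in a Euclidean ball of radius R around the origin one may take A² = R²/2, since ϕ(θ*) ≤ R²/2.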
(A.5) The error function f_k at each node k is L-Lipschitz with respect to the norm ‖·‖, i.e., for θ1, θ2 ∈ Θ,
\[
\big|f_k(\theta_1, \hat y_k) - f_k(\theta_2, \hat y_k)\big| \le L\,\|\theta_1-\theta_2\|.
\]

Lemma 1. Define the function
\[
V_\omega(\theta) = \max_{\zeta\in\Theta}\big\{\langle\theta,\,\zeta-\theta_0\rangle - \omega\,\phi(\zeta)\big\}.
\]
Then V_ω(·) is convex and differentiable on Θ. Moreover, its gradient is Lipschitz continuous with respect to the norm ‖·‖, with constant 1/(ωτ):
\[
\|\nabla V_\omega(u) - \nabla V_\omega(v)\| \le \frac{1}{\omega\tau}\,\|u-v\|, \qquad \forall\, u, v \in \Theta,
\]
where the gradient is given by
\[
\nabla V_\omega(u) = \pi_\omega(u) - u_0, \qquad \pi_\omega(u) = \arg\min_{v\in\Theta}\big\{-\langle u, v\rangle + \omega\,\phi(v)\big\}. \quad (25)
\]
Note that u0 = π_ω(0).

Lemma 2. Let {g(t)}_{t≥1} be an arbitrary sequence of vectors, and consider the sequence {θ(t)}_{t≥1} generated by
\[
\theta(t+1) = \arg\min_{\theta\in\Theta}\Big\{\sum_{r=1}^{t}\langle g(r), \theta\rangle + \omega(t)\,\phi(\theta)\Big\} = \pi_{\omega(t)}\Big(-\sum_{r=1}^{t} g(r)\Big).
\]
For any non-decreasing positive step sizes {ω(t)}_{t≥0} and any θ̂ ∈ Θ_C, we have
\[
\sum_{t=1}^{T}\big\langle g(t),\, \theta(t)-\hat\theta\big\rangle \le \frac{1}{2\tau}\sum_{t=1}^{T}\frac{\|g(t)\|_*^2}{\omega(t)} + \omega(T)\,C.
\]
In particular, for any θ̂ ∈ Θ_C ⊂ Θ*, we have
\[
\sum_{t=1}^{T}\big\langle g(t),\, \theta(t)-\hat\theta\big\rangle \le \frac{1}{2\tau}\sum_{t=1}^{T}\frac{\|g(t)\|_*^2}{\omega(t)} + \omega(T)\,\phi(\theta^*).
\]

In addition, we establish the convergence of the algorithm via the auxiliary sequences
\[
\varphi(t+1) = \pi_{\omega(t)}\big(\bar\mu(t+1)\big) \quad (26)
\]
and φ̂(T) defined below, and present the following lemma.

Lemma 3. With the random families {θ_k(t)}_{t≥0}, {µ_k(t)}_{t≥0} and {φ(t)}_{t≥1} defined in Eqs. (7), (8) and (26), and with each f_k satisfying the L-Lipschitz condition, for each node k ∈ V we have
\[
\sum_{t=1}^{T}\big[f(\theta_k(t), y_k) - f(\theta^*, y^*)\big] \le \sum_{t=1}^{T}\big[f(\varphi(t), y_k) - f(\theta^*, y_k)\big] + \sum_{t=1}^{T}\big[L\,\|\theta_k(t)-\varphi(t)\| + L\,\|y_k - y^*\|\big].
\]
Similarly, defining φ̂(T) = (1/T)·Σ_{t=1}^{T} φ(t) and θ̂_k(T) = (1/T)·Σ_{t=1}^{T} θ_k(t), we have
\[
f\big(\hat\theta_k(T), y_k\big) - f\big(\theta^*, y^*\big) \le f\big(\hat\varphi(T), y_k\big) - f\big(\theta^*, y_k\big) + \frac{L}{T}\sum_{t=1}^{T}\|\theta_k(t)-\varphi(t)\| + L\,\|y_k - y^*\|.
\]

Based on the above lemmas, we now present the proof of Theorem 4.

Proof: We carry out the proof by analyzing the random family {φ(t)}_{t≥0}. Given an arbitrary θ* ∈ Θ, we have
\[
\sum_{t=1}^{T}\big[f(\varphi(t), \hat y_k) - f(\theta^*, \hat y_k)\big]
= \frac{1}{|V|}\sum_{t=1}^{T}\sum_{k=1}^{|V|}\big[f_k(\varphi(t), \hat y_k) - f_k(\theta^*, \hat y_k)\big]
\le \sum_{t=1}^{T}\frac{1}{|V|}\sum_{k=1}^{|V|}\big[L\,\|\varphi(t)-\theta_k(t)\| + f_k(\theta_k(t)) - f_k(\theta^*)\big],
\]
where the inequality follows from the L-Lipschitz condition on f_k.

Let g_k(t) ∈ ∂f_k(θ_k(t)); using the convexity of f_k, we obtain the bound
\[
\sum_{k=1}^{|V|}\big[f_k(\theta_k(t)) - f_k(\theta^*)\big]
\le \sum_{k=1}^{|V|}\langle g_k(t),\, \theta_k(t)-\theta^*\rangle
= \sum_{k=1}^{|V|}\langle \hat g_k(t),\, \varphi(t)-\theta^*\rangle
+ \sum_{k=1}^{|V|}\langle \hat g_k(t),\, \theta_k(t)-\varphi(t)\rangle
+ \sum_{k=1}^{|V|}\langle g_k(t)-\hat g_k(t),\, \theta_k(t)-\theta^*\rangle. \quad (27)
\]
For the first term on the right-hand side of Equation (27), Lemma 2 gives
\[
\frac{1}{|V|}\sum_{t=1}^{T}\Big\langle\sum_{k=1}^{|V|}\hat g_k(t),\, \varphi(t)-\theta^*\Big\rangle
\le \frac{1}{2\tau}\sum_{t=1}^{T}\frac{1}{\omega(t)}\Big\|\frac{1}{|V|}\sum_{k=1}^{|V|}\hat g_k(t)\Big\|_*^2 + \omega(T)\,\phi(\theta^*). \quad (28)
\]
Hölder's inequality implies that E[‖ĝ_l(t)‖_*‖ĝ_k(s)‖_*] ≤ L² and E[‖ĝ_k(t)‖_*²] ≤ L², since ‖ĝ_k(t)‖_* ≤ L for any k, l, s, t. Using these two inequalities to bound (28), we get
\[
\mathbb{E}\,\Big\|\frac{1}{|V|}\sum_{k=1}^{|V|}\hat g_k(t)\Big\|_*^2
\le \frac{1}{|V|^2}\sum_{k,l=1}^{|V|}\mathbb{E}\big[\|\hat g_k(t)\|_*\,\|\hat g_l(t)\|_*\big] \le L^2.
\]
For the second term on the right-hand side of Equation (27), θ_k(t) ∈ F_{t−1} and φ(t) ∈ F_{t−1} by assumption, so
\[
\mathbb{E}\,\langle\hat g_k(t),\, \theta_k(t)-\varphi(t)\rangle
\le \mathbb{E}\,\|\hat g_k(t)\|\,\|\theta_k(t)-\varphi(t)\|
= \mathbb{E}\big(\mathbb{E}[\|\hat g_k(t)\|\,|\,\mathcal{F}_{t-1}]\,\|\theta_k(t)-\varphi(t)\|\big)
\le L\,\mathbb{E}\,\|\theta_k(t)-\varphi(t)\|
\le \frac{L}{\omega(t)\tau}\,\mathbb{E}\,\|\bar\mu(t)-\mu_k(t)\|_*. \quad (29)
\]
Thus we have
\[
\mathbb{E}\Bigg[\sum_{t=1}^{T}\sum_{k=1}^{|V|}\|\varphi(t)-\theta_k(t)\| + \sum_{t=1}^{T}\sum_{k=1}^{|V|}\langle\hat g_k(t),\, \theta_k(t)-\varphi(t)\rangle\Bigg]
\le 2L\sum_{t=1}^{T}\sum_{k=1}^{|V|}\frac{\mathbb{E}\,\|\bar\mu(t)-\mu_k(t)\|_*}{\omega(t)}.
\]
For the third term on the right-hand side of Equation (27), recalling that θ_k(t) ∈ F_{t−1}, we get
\[
\mathbb{E}\big[\langle g_k(t)-\hat g_k(t),\, \theta_k(t)-\theta^*\rangle\big]
= \mathbb{E}\big[\big\langle \mathbb{E}[g_k(t)-\hat g_k(t)\,|\,\mathcal{F}_{t-1}],\, \theta_k(t)-\theta^*\big\rangle\big] = 0. \quad (30)
\]
Combining these bounds, we obtain the running-sum bound
\[
\sum_{t=1}^{T}\big[f(\varphi(t)) - f(\theta^*)\big]
\le \omega(T)\,\phi(\theta^*) + \frac{L^2}{2\tau}\sum_{t=1}^{T}\frac{1}{\omega(t)}
+ \frac{2L}{|V|}\sum_{t=1}^{T}\sum_{k=1}^{|V|}\frac{\mathbb{E}\,\|\bar\mu(t)-\mu_k(t)\|_*}{\omega(t)\tau}. \quad (31)
\]
Applying Lemma 3 to (31) gives
\[
\sum_{t=1}^{T}\big[f(\theta_k(t), \hat y_k) - f(\theta^*, y^*)\big]
\le \omega(T)\,\phi(\theta^*) + \frac{L^2}{2\tau}\sum_{t=1}^{T}\frac{1}{\omega(t)}
+ \frac{2L}{|V|}\sum_{t=1}^{T}\sum_{k=1}^{|V|}\frac{\mathbb{E}\,\|\bar\mu(t)-\mu_k(t)\|_*}{\omega(t)\tau}
+ L\sum_{t=1}^{T}\|\hat y_k - y^*\| + \sum_{t=1}^{T}\frac{\|\bar\mu(t)-\mu_k(t)\|}{\omega(t)\tau}.
\]
Dividing both sides of the inequality by T, we obtain the theorem from the convexity of f.
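For intuition about the node iterates (7)–(8) analyzed above, the following is a minimal sketch of one synchronized distributed dual-averaging round, assuming the Euclidean prox-function ϕ(θ) = ½‖θ‖₂² over a ball Θ (so that π_ω has a closed form) and a doubly stochastic mixing matrix P. It is an illustrative rendering of the update form suggested by Lemma 2 and Eq. (32), not the paper's exact pseudocode; the sign convention follows Lemma 2.

```python
import numpy as np

def pi_omega(u, omega, radius):
    """pi_omega(u) = argmin_{v in Theta} { -<u, v> + omega * 0.5 * ||v||^2 }
    for Theta an L2 ball: the unconstrained minimizer u / omega, projected onto the ball."""
    v = u / omega
    nrm = np.linalg.norm(v)
    return v if nrm <= radius else v * (radius / nrm)

def dda_round(mu, P, subgrads, t, A=1.0, radius=10.0):
    """One synchronized round over all nodes:
      mu_k(t+1)    = sum_j P[j, k] * mu_j(t) + g_k(t)   (mix dual variables, add local subgradient)
      theta_k(t+1) = pi_omega(-mu_k(t+1))               (primal retrieval, as in Lemma 2)
    mu and subgrads are (n, d) arrays; P is an (n, n) doubly stochastic matrix."""
    omega = A * np.sqrt(t + 1)                          # omega(t) = A * sqrt(t), shifted to avoid omega = 0
    mu_next = P.T @ mu + subgrads
    theta_next = np.stack([pi_omega(-m, omega, radius) for m in mu_next])
    return mu_next, theta_next
```

The only information exchanged per round is the dual variable of each neighbouring node (the nonzero entries of the corresponding column of P), which is what makes the scheme suitable for crowdsensing networks.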
C. Proof of Theorem 5

Proof: The proof concentrates on bounding the network error appearing in Theorem 4, Σ_{j=1}^{|V|} E‖µ̄(t) − µ_j(t)‖_* / ω(t). We first define the matrix P(t, l) = P^{t−l+1}, where [P(t, l)]_{ij} is the element in the i-th row and j-th column of P(t, l). From Equation (7), based on the record at time l, i.e., µ_j(l), we can obtain the record at time t + 1 as follows:
\[
\mu_j(t+1) = \sum_{i=1}^{|V|}[P(t,l)]_{ij}\,\mu_j(l) + \sum_{k=l+1}^{t}\sum_{i=1}^{|V|}[P(t,k)]_{ij}\,\hat g_i(k-1) + \hat g_j(t). \quad (32)
\]
If t = l, this iteration is terminated in our algorithm. From the definition of µ̄(t) in Lemma 2, it follows that
\[
\bar\mu(t) - \mu_j(t) = \sum_{k=1}^{t-1}\sum_{i=1}^{|V|}\Big(\frac{1}{|V|} - [P(t-1,k)]_{ij}\Big)\,\hat g_i(k-1) + \frac{1}{|V|}\sum_{i=1}^{|V|}\big[\hat g_i(t-1) - \hat g_j(t-1)\big], \quad (33)
\]
which further implies that
\[
\big\|\bar\mu(t) - \mu_j(t)\big\|_* \le \sum_{k=1}^{t-1}\sum_{i=1}^{|V|}\Big|\frac{1}{|V|} - [P(t-1,k)]_{ij}\Big|\,\|\hat g_i(k-1)\|_* + \frac{1}{|V|}\sum_{i=1}^{|V|}\big[\|\hat g_i(t-1)\|_* + \|\hat g_j(t-1)\|_*\big]. \quad (34)
\]
Taking the expectation on both sides of Inequality (34) and using the fact that E‖ĝ_i(t)‖_* ≤ L, we have
\[
\mathbb{E}\,\big\|\bar\mu(t) - \mu_j(t)\big\|_* \le \sum_{k=1}^{t-1} L\,\Big\|\frac{1_{|V|}}{|V|} - P(t-1,k)\,e_j\Big\|_1 + 2L, \quad (35)
\]
where e_j denotes the j-th standard basis vector of the |V|-dimensional Euclidean space.

To further bound E‖µ̄(t) − µ_j(t)‖_*, we break the sum in Inequality (35) into two parts at a cutoff point t̃. From the Perron–Frobenius theory presented in [42], we have
\[
\Big\|P(t,k)\,e_j - \frac{1_{|V|}}{|V|}\Big\|_1 \le \sqrt{|V|}\,\sigma_2^{\,t-k+1}(P).
\]
It follows that if 1 ≤ k ≤ t − ln δ / ln σ2(P) + 1 for a small constant δ > 0, then ‖P(t,k)e_j − 1_{|V|}/|V|‖₁ ≤ √|V|·δ. Specifically, setting δ = 1/(T√|V|), for all k with 1 ≤ k ≤ t − ln δ / ln σ2(P) + 1 we have
\[
\Big\|P(t,k)\,e_j - \frac{1_{|V|}}{|V|}\Big\|_1 \le \frac{1}{T}.
\]
For k > t − ln δ / ln σ2(P) + 1, we simply have
\[
\Big\|P(t,k)\,e_j - \frac{1_{|V|}}{|V|}\Big\|_1 \le \|P(t,k)\,e_j\|_1 + \Big\|\frac{1_{|V|}}{|V|}\Big\|_1 = 2. \quad (36)
\]
This suggests the cutoff point t̃ = ln δ / ln σ2(P). Since there are at most T terms in the summation, we have
\[
\mathbb{E}\,\big\|\bar\mu(t) - \mu_j(t)\big\|_*
\le L\sum_{k=t-\tilde t}^{t-1}\Big\|P(t-1,k)\,e_j - \frac{1_{|V|}}{|V|}\Big\|_1
+ L\sum_{k=1}^{t-\tilde t-1}\Big\|P(t-1,k)\,e_j - \frac{1_{|V|}}{|V|}\Big\|_1 + 2L
\]
\[
\le 2L\tilde t + \frac{L}{T}\,(t-\tilde t-1) + 2L
\le 2L\tilde t + L + 2L \quad (\text{by } t \le T)
= \frac{2L\ln\!\big(T\sqrt{|V|}\big)}{-\ln \sigma_2(P)} + 3L
= \frac{2L\ln\!\big(T\sqrt{|V|}\big)}{\ln \sigma_2^{-1}(P)} + 3L
\le \frac{2L\ln\!\big(T\sqrt{|V|}\big)}{1-\sigma_2(P)} + 3L,
\]
which yields the following upper bound on the network error term NET of Theorem 4:
\[
\mathrm{NET} \le \frac{3L}{T}\Bigg(\frac{2L\ln\!\big(T\sqrt{|V|}\big)}{1-\sigma_2(P)} + 3L\Bigg)\sum_{t=1}^{T}\frac{1}{\omega(t)}.
\]
Therefore, the learning error f(θ̂_i(T), ŷ_i) − f(θ*, y*) at the i-th node can be further bounded by
\[
f\big(\hat\theta_i(T), \hat y_i\big) - f\big(\theta^*, y^*\big)
\le \frac{\omega(T)}{T}\,\phi(\theta^*) + L\,C_{err}
+ \Bigg(\frac{6L^2}{T}\cdot\frac{\ln\!\big(T\sqrt{|V|}\big)}{1-\sigma_2(P)} + \frac{9L^2}{T} + \frac{L^2}{2T\tau}\Bigg)\sum_{t=1}^{T}\frac{1}{\omega(t)}.
\]
If we choose the weight sequence {ω(t)}_{t=1}^{T} and the arbitrarily small error C_err as
\[
\omega(t) = A\sqrt{t}, \qquad C_{err} = \frac{2L}{A\sqrt{T}}\cdot\frac{\ln\!\big(T\sqrt{|V|}\big)}{1-\sigma_2(P)},
\]
then we have
\[
f\big(\hat\theta_i(T), \hat y_i\big) - f\big(\theta^*, y^*\big)
\le \frac{A\,\phi(\theta^*)}{\sqrt{T}} + \frac{2L^2}{A\sqrt{T}}\cdot\frac{\ln\!\big(T\sqrt{|V|}\big)}{1-\sigma_2(P)}
+ \frac{2}{A}\Bigg(\frac{6L^2}{\sqrt{T}}\cdot\frac{\ln\!\big(T\sqrt{|V|}\big)}{1-\sigma_2(P)} + \frac{9L^2}{\sqrt{T}} + \frac{L^2}{2\sqrt{T}\,\tau}\Bigg)
\le \frac{16L^2}{A\sqrt{T}}\cdot\frac{\ln\!\big(T\sqrt{|V|}\big)}{1-\sigma_2(P)}.
\]
Therefore, we obtain
\[
f\big(\hat\theta_i(T), \hat y_i\big) - f\big(\theta^*, y^*\big) = O\Bigg(\frac{1}{\sqrt{T}}\cdot\frac{\ln\!\big(T\sqrt{|V|}\big)}{1-\sigma_2(P)}\Bigg),
\]
which completes the proof.
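The only network-dependent ingredient of this proof is the geometric mixing bound ‖P^s e_j − 1_{|V|}/|V|‖₁ ≤ √|V|·σ2^s(P) from [42]. It can be checked numerically on a small doubly stochastic matrix; the lazy Metropolis weights below are just one illustrative construction, not one prescribed by the paper.

```python
import numpy as np
import networkx as nx

G = nx.connected_watts_strogatz_graph(20, 4, 0.3, seed=1)
n = G.number_of_nodes()
deg = dict(G.degree())

# Lazy Metropolis weights: a simple symmetric, doubly stochastic choice of P.
P = np.zeros((n, n))
for i, j in G.edges():
    w = 1.0 / (2 * max(deg[i], deg[j]))
    P[i, j] = P[j, i] = w
np.fill_diagonal(P, 1.0 - P.sum(axis=1))

sigma2 = np.linalg.svd(P, compute_uv=False)[1]      # second largest singular value
print("spectral gap 1 - sigma2(P) =", 1.0 - sigma2)

j = 0
e_j = np.zeros(n); e_j[j] = 1.0
for s in [1, 5, 10, 20]:
    lhs = np.linalg.norm(np.linalg.matrix_power(P, s) @ e_j - np.ones(n) / n, 1)
    rhs = np.sqrt(n) * sigma2 ** s
    print(s, lhs <= rhs + 1e-12)                    # expected: True; the deviation decays geometrically
```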
[Fig. 1. The real-world test and the communication network.]

VI. PERFORMANCE EVALUATION

In this section, we present our test results on the convergence behaviour of both the DRC and DDA algorithms. In addition, we compare DDA with the methods in [33], [43] to show the efficiency of our algorithms.

We perform simulations on networks of |V| = K nodes with two different graph topologies, i.e., Random (RD) and Small-World (SW). These graphs are generated by NetworkX [44]. We also perform a real-world test with ten cellphones, as shown in Fig. 1. Each cellphone is held by one person in an office room, records the brightness of the natural light at its location, and communicates with the others over a network. The task is to learn the linear relationship between one particular cellphone's record and the others'.

A. Performance of DRC

We first evaluate the convergence behaviour of DRC. The evaluation compares the estimated set u_k against the global set of data y to measure how complete and accurate the estimate is. We use the step sizes α(i) and β(i) specified in Theorem 3. For the simulation, we preset an M-dimensional global data vector y and an M-dimensional vector θ* orthogonal to it, i.e., ⟨y, θ*⟩ = 0. Each node can only observe a single element of y, corrupted by additive Gaussian noise. Without loss of generality, we let the k-th node observe the k-th element of y. For example, node 1 observes a vector y_1(i) whose first element is y¹_1(i) = y¹ + ξ_1(i) and whose other elements are y^j_1(i) = 0 for j ≠ 1, where y¹ is the first element of y and ξ_1(i) is the observation error (local variation), following N(0, 1). We examine how the normalized error at each node evolves with the number of iterations. The normalized error of the k-th node is defined as ‖u_k(i) − y‖/K, i.e., the estimation error normalized by the dimension of the vector y (in the figures we plot the average of all nodes' estimation errors). As Fig. 2 demonstrates, the error first rises and then decreases sharply; after the inflection point of the curve, the error declines gradually and becomes small and close to zero, as proved in Theorem 1. The real-world test (Fig. 2-c) fluctuates more than the simulation, since ξ_k(i) in the real world is affected not only by optical and sensor noise but also by the participant's uncertainty in the direction in which the cellphone is held. The latter results in a larger variance than in the simulation setting. Nevertheless, the normalized error still approaches zero.
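A compact script that reproduces this simulation setup might look as follows; the graph parameters, noise level, step-size constants and iteration count are illustrative choices, not the exact values used for Fig. 2.

```python
import numpy as np
import networkx as nx

rng = np.random.default_rng(0)
K = M = 10                                     # K nodes; node k observes only element k of y
G = nx.connected_watts_strogatz_graph(K, 4, 0.3, seed=0)   # a Small-World communication graph
L = nx.laplacian_matrix(G).toarray().astype(float)
LE = np.kron(L, np.eye(M))

# Stacked observation operator: block k of Hbar picks out coordinate k of node k's estimate.
Hbar = np.zeros((K * M, K))
for k in range(K):
    Hbar[k * M + k, k] = 1.0

y = rng.normal(size=M)                         # preset global data vector
u = np.zeros(K * M)                            # stacked estimates u(i), initialized to zero

gamma, a = 0.1, 0.1                            # illustrative step-size constants
errors = []
for i in range(3000):
    alpha = a / (i + 1)
    beta = alpha / gamma
    y_obs = y + rng.normal(scale=0.5, size=K)  # node k records y^k + xi_k(i)
    u = u - alpha * LE @ u + beta * Hbar @ (y_obs - Hbar.T @ u)
    errors.append(np.mean(np.linalg.norm(u.reshape(K, M) - y, axis=1)) / K)

# Plotting `errors` against the iteration index shows the qualitative trend of Fig. 2:
# a transient phase followed by a gradual decline (convergence with a/(i+1) weights is slow).
print(errors[0], errors[-1])
```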
B. Performance of DDA

In this subsection, we evaluate the DDA algorithm based on the results obtained above. By comparing it with other algorithms, we show that our algorithm achieves higher efficiency.

In the simulation, we consider the distributed minimization of a sum of loss functions in an l1-regression based on the data generated by DRC. After I = 100 iterations, each node obtains an estimate u_k(I) of y. Here, we let ŷ_k = u_k. Thus, the problem becomes: given the K vectors ŷ_k, estimate the orthogonal vector θ*, i.e., find a vector θ minimizing
\[
f(\theta) := \frac{1}{K}\sum_{k=1}^{K}\big|\langle \hat y_k, \theta\rangle\big|. \quad (37)
\]
Note that f is L-Lipschitz and non-smooth at the points where ⟨ŷ_k, θ⟩ = 0. We perform simulations on the same two graph structures, Random and Small-World, as in the DRC simulation. In addition, the step size ω is chosen as specified in Theorem 5.

Fig. 3 plots the function error max_k [f(θ̂_k(T)) − f(θ*)] against the number of iterations T for Small-World graphs of size K ∈ {100, 300, 500}. We also show how the convergence time varies as a function of the graph size K. Next, we compare our theoretical results (Theorem 5 provides an upper bound on the required number of iterations) with the simulation results. We show how the