Physical-Layer Arithmetic for Federated Learning in Uplink MU-MIMO Enabled Wireless Networks

Tao Huang†, Baoliu Ye†, Zhihao Qu‡, Bin Tang†, Lei Xie†, Sanglu Lu†
†National Key Laboratory for Novel Software Technology, Nanjing University, China
‡College of Computer and Information, Hohai University, China
Email: schrodinger.huang@gmail.com, yebl@nju.edu.cn, quzhihao@hhu.edu.cn, {tb, lxie, sanglu}@nju.edu.cn

Abstract—Federated learning is a very promising machine learning paradigm where a large number of clients cooperatively train a global model using their respective local data. In this paper, we consider the application of federated learning in wireless networks featuring uplink multiuser multiple-input and multiple-output (MU-MIMO), and aim at optimizing the communication efficiency during the aggregation of client-side updates by exploiting the inherent superposition of radio frequency (RF) signals. We propose a novel approach named Physical-Layer Arithmetic (PhyArith), where the clients encode their local updates into aligned digital sequences which are converted into RF signals for sending to the server simultaneously, and the server directly recovers the exact summation of these updates as required from the superimposed RF signal by employing a customized sum-product algorithm. PhyArith is compatible with commodity devices due to the use of full digital operation in both the client-side encoding and the server-side decoding processes, and can also be integrated with other update-compression-based acceleration techniques. Simulation results show that PhyArith further improves the communication efficiency by 1.5 to 3 times for training LeNet-5, compared with solutions only applying updates compression.

I. INTRODUCTION

Federated learning and related decentralized learning are a kind of machine learning paradigm where the goal is to train a high-quality centralized model, while training data remains distributed over a large number of clients [1]–[4].
In each round of learning algorithms for this paradigm, each client independently computes an update to the current model based on its local data, and sends this update to a central server, where the client-side updates are aggregated to compute a new global model. Communicating the model updates in each round has been observed to be a significant performance bottleneck [5] [6] for this paradigm. It is particularly serious for federated learning, since its typical clients are smartphones and IoT devices, which have unreliable and relatively slow network connections.

This work was supported in part by the National Key R&D Program of China under Grant 2018YFB1004704, in part by the National Natural Science Foundation of China under Grant 61832005, Grant 61872171, and Grant 61872174, in part by the Key R&D Program of Jiangsu Province under Grant BE2017152, in part by the Natural Science Foundation of Jiangsu Province under Grant BK20190058, in part by a project funded by the China Postdoctoral Science Foundation under Grant 2019M661709, and in part by the Collaborative Innovation Center of Novel Software Technology and Industrialization. Baoliu Ye and Bin Tang are corresponding authors.

To improve the communication efficiency, methods based on reducing the size of updates needed to transmit have been widely studied [5]–[9], by performing techniques like quantization, sparsification, etc. In addition to simply compressing client-side updates, there is also work focusing on improving the communication efficiency of federated learning by utilizing the characteristics of networks.
Specifically, for wireless networks where uplink multiuser multiple-input and multiple-output (MU-MIMO) is enabled, i.e., clients can concurrently send their updates to the server in the same time-frequency resource, recent works like [10]–[13] study exploiting the superposition of radio frequency (RF) signals to effectively aggregate updates by averaging them, based on the technique called over-the-air computation (AirComp). These AirComp-based works use uncoded analog transmission to send aligned client-side updates in the form of vectors, each of which has the same number of elements placed in the same order. Clients in these works are assumed to be equipped with a special device featuring linear-analog-modulation and pre-channel-compensation (or are simply assumed to have no fading channel), such that the superimposed RF signal can be well utilized for fast update aggregation. However, these works are not dedicated to obtaining the exact average of updates. When the channel noise is not negligible and the pre-channel-compensation of each client is not perfect, the obtained average may contain significant aggregation error, which may seriously affect the convergence of the global model being trained.

In this paper, we study utilizing the superposition of RF signals to efficiently aggregate client-side updates in federated learning by using commodity devices dedicated to modern wireless networks featuring uplink MU-MIMO, e.g., 802.11ax [14], where neither linear-analog-modulation nor pre-channel-compensation is necessarily available. We aim to reliably obtain the exact average of updates, such that the convergence of the global model being trained is guaranteed under realistic channel conditions. Inspired by the fact that the amount of information of all updates is greater than that of their summation, our idea is that we can directly recover the exact summation based on the received superimposed RF signal, and then calculate their average.
It should have a lower outage probability than the conventional multiuser detection (MUD) based solution for dealing with such mutual interference in uplink MU-MIMO, i.e., separating these colliding data streams based
on the superimposed RF signal, recovering each update, and averaging them, as shown in Fig. 1. We denote our solution as Physical-Layer Arithmetic (PhyArith), where the clients encode their local updates into aligned digital sequences which are converted into RF signals for sending to the server simultaneously, and the server directly recovers the exact summation of these updates from the superimposed RF signal through a customized decoder.

[Fig. 1. Recovery of averaged update based on MUD and PhyArith: in the MUD-based solution, the server separately decodes each client's update from the superimposed RF signal and then averages them; in the PhyArith-based solution, the PhyArith decoder recovers the summation of updates directly, which is then averaged.]

For this purpose, three main challenges should be well addressed. The first challenge is how to encode each update at the client with commodity devices based on full digital operation, such that PhyArith can conveniently and reliably recover their exact summation, and effectively verify its correctness. To address this challenge, we encode updates into aligned digital sequences by taking two's-complement representation [15] as source codes, and systematic low-density parity check (LDPC) codes [16] as channel codes, such that the addition and subtraction operations during aggregation can be unified, and error correction capability can be provided. We also append the cyclic-redundancy-checksum (CRC) [17] to each update, such that its linear property can be exploited to verify the correctness of the directly recovered summation. The second challenge is how to directly recover the exact summation of updates at the server, which is equipped with multiple antennas, based on the superimposed RF signal of these encoded updates emanating from a large number of different clients.
To address this challenge, we propose a polynomial-time-complexity decoder derived from the sum-product algorithm [18], which directly obtains the marginal probability mass function (PMF) of the summation of updates from the superimposed RF signal. By maximizing the obtained PMF, we can recover the summation, verify its correctness, and update the global model. The third challenge is how to integrate PhyArith with those updates compression methods like sparsification, where the client-side updates are not necessarily aligned due to compression. To address this challenge, we adopt compressed sensing here by using a shared underdetermined sensing matrix to compress the sparse update of each client, such that they are aligned and can be efficiently aggregated by PhyArith.

The main contributions are summarized as follows: (1) We show that even if pre-channel-compensation is not applied, the superimposed RF signal can be exploited to efficiently aggregate updates, without incurring any aggregation error. (2) Based on full digital operation, we design the client-side updates encoding process of PhyArith, such that it can be applied on commodity 802.11 devices, and their summation can be reliably recovered and verified. We also design its server-side decoding process, where a customized sum-product algorithm is provided to directly recover the summation of a large number of updates. (3) PhyArith can be integrated with other updates compression methods. Simulation results show that PhyArith further improves the communication efficiency by 1.5 to 3 times for training LeNet-5, compared with the solutions which apply quantization/sparsification to compress updates and aggregate them via MUD.

II. RELATED WORK

A. Quantization and Sparsification

Updates quantization and sparsification are the two main approaches to overcome the communication bottleneck in decentralized learning.
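As a concrete illustration of these two compression ideas — a minimal sketch with hypothetical values, where the mean-magnitude scale and the choice of k are our own and not taken from any of the cited schemes:

```python
import numpy as np

rng = np.random.default_rng(0)
update = rng.normal(size=8).astype(np.float32)  # one client's update vector

# 1-bit quantization: elements >= 0 become bit 1, elements < 0 become bit 0;
# a single scale factor is kept so the server can de-quantize.
bits = (update >= 0).astype(np.uint8)
scale = float(np.abs(update).mean())
dequantized = scale * (2.0 * bits - 1.0)  # each element becomes +scale or -scale

# Sparsification: send only the k largest-magnitude elements with their indices.
k = 2
top_idx = np.argsort(np.abs(update))[-k:]
sparse_update = [(int(i), float(update[i])) for i in top_idx]
```

Either way, the payload sent to the server is much smaller than the raw floating-point vector: 1 bit per element plus one scale, or k index-value pairs.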
Updates quantization was first proposed in [5], where elements not less than zero are encoded as bit 1, and those less than zero are encoded as bit 0. Following such 1-bit quantization, subsequent works like [8] [9] introduce multilevel updates quantization, so as to balance the communication efficiency and updates accuracy. Updates sparsification was first proposed in [6], where only elements with large absolute values are encoded using 1-bit quantization, and sent out along with their associated indices. Following this approach, various sparsification methods, like sending a fixed proportion of large elements [7], are proposed.

B. AirComp

Recent AirComp-based works focus on leveraging the superposition of RF signals to aggregate client-side updates. Specifically, [12] [13] study performing power control and client selection to align the signal power received at the server and avoid incurring serious aggregation error. [10] studies aggregating updates in wireless networks with power and bandwidth limitations, based on coded digital transmission in different time-frequency resources, and uncoded analog transmission in the same time-frequency resource, i.e., performing AirComp. It is also worth noting that although recent AirComp-related works are all based on uncoded analog transmission, the original idea appeared in [19], which relies on lattice codes to achieve reliable over-the-air computation. However, it also assumes no fading channel, and thus cannot be directly applied on commodity devices at the client-side. And at the server-side, it lacks a practical and efficient decoding approach.

C. MUD

Different from AirComp, the conventional approach to deal with the superimposed RF signal is multiuser detection
(MUD), where updates are separately recovered and aggregated. State-of-the-art polynomial-time-complexity MUD approaches include Gaussian message based interference cancellation (GM-IC) [20] [21], linear minimum mean square error based interference cancellation (LMMSE-IC) [22], etc. Like our server-side decoder in PhyArith, these approaches are also derived from the sum-product algorithm, which operates on the factor graph [18] in a message-passing manner.

III. MOTIVATION

In this section, through a theoretical analysis, we show that even if the pre-channel-compensation of AirComp is not applied, the superimposed RF signal can still be exploited to efficiently aggregate updates, without incurring any aggregation error. Consider the following simple channel with two inputs and one output, denoted by

y = h0s0 + h1s1 + w,

where y ∈ R is the channel output symbol, h0, h1 ∈ R are the channel gains, s0, s1 are independent channel input symbols uniformly chosen in {−1, +1}, and the noise w ∼ N(0, σ²).

We study the average symbol time needed to accurately obtain the function f(s0, s1) = s0 + s1 for MUD and PhyArith in theory. To calculate the symbol time, we first study the entropy (i.e., the amount of information, in bits) [23] of s0, s1, and s0 + s1. It is easy to find that the entropy H(s0) = H(s1) = −0.5 log2 0.5 − 0.5 log2 0.5 = 1, and H(s0 + s1) = −0.5 log2 0.25 − 0.5 log2 0.5 = 1.5. We then study the capacity of the above channel, characterized by mutual information, i.e., the amount of information that can be retrieved from the channel, in bits per symbol time [23]. For conventional successive-interference-cancellation (SIC) based MUD, we have the mutual information between y and s0, i.e., I(y; s0) = H(y) − H(y|s0), and that between y and s1 conditioned on s0, i.e., I(y; s1|s0) = H(y|s0) − H(y|s0, s1). Therefore, the symbol time to obtain s0 + s1 for SIC based MUD is given by

max{H(s0)/I(y; s0), H(s1)/I(y; s1|s0)}.
(1)

For PhyArith, we calculate the mutual information between y and s0 + s1 instead, which is given by I(y; s0 + s1) = H(y) − H(y|s0 + s1). Therefore, the symbol time to obtain s0 + s1 is given by

H(s0 + s1)/I(y; s0 + s1). (2)

With (1) and (2), we plot Fig. 2 for h0 = h1, |h0| = +1 and h0 = −h1, |h0| = +1, where the calculation processes of the entropies H(y), H(y|s0, s1), H(y|s0), and H(y|s0 + s1) are provided in Appendix-A. We find that when h0 = h1, |h0| = +1, which can be viewed as the case where pre-channel-compensation is perfectly applied and the function distortion (i.e., f(s0, s1) − y) is small, the symbol time of PhyArith is always less than that of MUD. And when h0 = −h1, |h0| = +1, which can be viewed as the case where no pre-channel-compensation is applied and the function distortion is extremely serious, the symbol time of PhyArith is also less than that of MUD when the signal-to-noise ratio (SNR) is above some threshold. In conclusion, we can use PhyArith to improve the communication efficiency of federated learning without incurring any effect on its convergence, even if pre-channel-compensation is not applied.

[Fig. 2. Symbol time in theory for different decoders, plotted against SNR (dB); curves: PhyArith with h0 = h1, |h0| = +1; PhyArith with h0 = −h1, |h0| = +1; and MUD with |h0| = |h1| = +1.]

IV. SYSTEM MODEL

A. Model of Federated Learning

The goal of federated learning is to learn a model with parameters embodied in a vector M from data stored across a large number of clients. For simplicity, we consider synchronized algorithms for federated learning where quantization is taken to compress client-side updates and accelerate the communication. For cases where no compression methods are applied, i.e., directly sending floating-point updates, enabling PhyArith is similar to the quantization case, by treating the signed integer exponent of each floating-point number as the coefficient for de-quantization, and treating the significand as the quantized update element.
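This floating-point treatment can be sketched as follows — a minimal illustration using Python's math.frexp; the hypothetical value and the 53-bit significand width (IEEE-754 double precision) are our own assumptions, not choices made in the paper:

```python
import math

x = -1.8125  # one element of a floating-point update (hypothetical value)

# Decompose x = m * 2**e with 0.5 <= |m| < 1, then lift the significand to a
# signed integer, so that x = sig * 2**(e - 53).
m, e = math.frexp(x)
sig = int(m * (1 << 53))  # signed-integer significand: the "quantized update element"
coeff = e - 53            # signed-integer exponent: the de-quantization coefficient

assert sig * 2.0**coeff == x
```

The integer significand plays the role of the quantized element, and the exponent plays the role of the de-quantization coefficient, mirroring the quantization case.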
And for cases where other updates compression methods like sparsification are applied, we show how to enable PhyArith for them in Section VIII.

A typical round t of such synchronized algorithms consists of the following steps: (1) A set of clients, denoted by V = {v_0, · · · , v_{|V|−1}}, is selected. Each client v ∈ V downloads the current model M_t from the server. (2) Each v ∈ V computes an updated model M_t^v based on its local data. (3) The quantized update vector A_t^v = Q_v(M_t^v − M_t), v ∈ V (Q_v(·) is the quantization function), as well as the de-quantization function Q_v^{−1}(·), are sent from each client to the server. (4) The server aggregates these updates by averaging and constructs an improved global model M_{t+1} = M_t + η_t A_t, where η_t is the learning rate, and A_t = Σ_{v∈V} Q_v^{−1}(A_t^v)/|V|. Here we assume the quantization function Q_v(·) is like those proposed in [8] [9]. As a result, without loss of generality, A_t^v is a vector of signed integers, and Q_v^{−1}(·) can be formed as

Q_v^{−1}(A_t^v) = β_v A_t^v + γ_v 1,

where β_v, γ_v are floating-point de-quantization coefficients, β_v > 0, and 1 is a vector with all entries being 1.

B. Model of Uplink MU-MIMO Channel

Here we present the model of wireless networks like 802.11ax, where uplink MU-MIMO is enabled. Let U = {u_0, · · · , u_{|U|−1}} denote the set of |U| antennas of the server. Clients in V are each equipped with a single antenna, and simultaneously transmit their local updates to the server, during
the same time-frequency resource, e.g., the same subcarriers within the same orthogonal frequency-division multiplexing (OFDM) symbols in 802.11ax. For the j-th time-frequency resource unit (e.g., 1 subcarrier within 1 OFDM symbol), the uplink MU-MIMO channel can be presented as follows

[y_{u_0,j}, · · · , y_{u_{|U|−1},j}]^T = H_j [s_{v_0,j}, · · · , s_{v_{|V|−1},j}]^T + [w_{u_0,j}, · · · , w_{u_{|U|−1},j}]^T,

where w_{u,j} is an i.i.d. random variable following the circularly symmetric normal distribution CN(0, 2σ^2), and denotes the additive noise of antenna u for the j-th time-frequency resource unit. y_{u,j} ∈ C denotes the observation of u, s_{v,j} ∈ S (where S ⊂ C) denotes the baseband symbol of v obtained via modulation, and the channel gain matrix H_j = [h_{v_0}^j, · · · , h_{v_{|V|−1}}^j], where the vector h_v^j = [h_{v,u_0}^j, · · · , h_{v,u_{|U|−1}}^j]^T.

V. OVERVIEW OF PHYARITH

Here we provide an overview of PhyArith based on the model of federated learning and uplink MU-MIMO channel presented above. PhyArith consists of two phases, i.e., the client-side encoding phase and the server-side decoding phase. At the client side, each quantized update vector A_t^v is sliced into chunks, which are encoded into aligned digital sequences through the source codes and channel codes encoders by appending checksums and parity bits, and simultaneously transmitted through the uplink MU-MIMO channel. The source codes taken here are based on two's-complement representation [15], and the channel codes are based on systematic low-density parity check (LDPC) codes [16], such that the addition and subtraction operations during aggregation can be unified, and error correction capability can be provided. The appended checksum is the cyclic-redundancy-checksum (CRC) [17], such that its linear property can be exploited to verify the correctness of the directly recovered summation.
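The channel model of Section IV-B is easy to simulate directly; the superposition that PhyArith exploits is simply the matrix product H_j s_j plus noise. A minimal numpy sketch, where the dimensions (|U| = 8 antennas, |V| = 4 single-antenna clients), the BPSK symbol set, and the noise level are illustrative assumptions:

```python
import numpy as np

# Minimal simulation of one resource unit of the uplink MU-MIMO channel
# y_j = H_j s_j + w_j. Dimensions, symbol set, and noise level are
# illustrative assumptions, not values fixed by the model itself.
rng = np.random.default_rng(0)
num_antennas, num_clients, sigma = 8, 4, 0.1

# Rayleigh-like channel gains h_{v,u}; each column is one client's vector h_v.
H = (rng.standard_normal((num_antennas, num_clients))
     + 1j * rng.standard_normal((num_antennas, num_clients))) / np.sqrt(2)
s = rng.choice(np.array([-1.0, 1.0]), size=num_clients)   # BPSK baseband symbols
w = sigma * (rng.standard_normal(num_antennas)
             + 1j * rng.standard_normal(num_antennas))    # CN(0, 2*sigma^2) noise

y = H @ s + w    # superimposed observation across the |U| server antennas
```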
At the server side, from the superimposed RF signals [y_{u_0,j}, · · · , y_{u_{|U|−1},j}]^T, the summations of a large number of update chunks are directly recovered through a customized decoder derived from the sum-product algorithm. After verifying their correctness based on the linear property of CRC, these summations are averaged and assembled into Σ_{v∈V}(β_v A_t^v + γ_v 1)/|V|, which is then used to update the global model being trained.

VI. ENCODING PROCESS OF PHYARITH

In this section, we demonstrate the encoding process of PhyArith. As shown in Fig. 3, the quantized update vector A_t^v of client v is first sliced into fixed-size chunks, i.e., A_t^v = [· · · , a_v, · · · ]. Each chunk a_v = [· · · , a_{v,j}, a_{v,j+1}, · · · ], a_{v,j} ∈ Z, is encoded into a bit-sequence of the source codes. The obtained bit-sequence of source codes is then appended with its checksum (we use the cyclic-redundancy-checksum, i.e., CRC, in PhyArith), and forms the k-bit sourceword of the channel codes, denoted by the vector b_v = [b_{v,0}, · · · , b_{v,k−1}]. Each such sourceword is further encoded into the n-bit codeword of the channel codes by appending a sequence of parity bits, denoted by the vector x_v = [x_{v,0}, · · · , x_{v,n−1}]. These codewords of channel codes are then mapped into m baseband symbols via modulation, denoted by s_v = [s_{v,0}, · · · , s_{v,m−1}], and sent out.

[Fig. 3. Encoding process of each client: the quantized update vector A_t^v is sliced into chunks; each chunk a_v is encoded with the source codes and a CRC into the sourceword b_v; LDPC parity bits are appended to form the codeword x_v, which is modulated, interpolated, and up-converted into the RF signal.]

A.
Source Codes

Regarding the source codes, in the quantized update vector chunk a_v = [· · · , a_{v,j}, a_{v,j+1}, · · · ], each element a_{v,j} is a signed integer, and is encoded into c bits {b_{v,j_0}, · · · , b_{v,j_{c−1}}} within b_v = [b_{v,0}, · · · , b_{v,k−1}] using two's-complement representation [15], according to the following rule

b_{v,j_l} = ⌊a_{v,j}/2^{c−l−1}⌋ mod 2 if a_{v,j} ≥ 0, and b_{v,j_l} = ⌊(2^c + a_{v,j})/2^{c−l−1}⌋ mod 2 otherwise, (3)

where l = 0, · · · , c − 1, b_{v,j_0} is the most significant bit and b_{v,j_{c−1}} is the least significant bit. Here we should ensure 2^{c−1} ≥ max{|a_{v,j}| : v ∈ V, j ∈ N}. Compared to other systems for representing signed numbers (e.g., one's complement), the advantage of two's complement is that the fundamental arithmetic operations of addition, subtraction, and multiplication are identical to those for unsigned binary numbers, which makes it much more convenient for PhyArith to directly decode the summation of update vectors.

B. Checksum

The checksum appended to each bit-sequence of source codes is the cyclic-redundancy-checksum (CRC) [17]. Together with the bit-sequence encoded from a_v, it forms the sourceword b_v. We use CRC in PhyArith because it is a linear function with the following property

CRC(b_v ⊕ b_{v'}) = CRC(b_v) ⊕ CRC(b_{v'}) = 0 ⊕ 0 = 0, (4)

where ⊕ denotes bitwise-XOR, and CRC(·) returns the checksum of the input in the form of a sequence of bits. This property can help us verify the correctness of the decoded summation of update vectors. We present this verification process in detail in Section VII-C.

C. Channel Codes

In order to provide error correction capability in PhyArith, the k-bit sourceword b_v, consisting of the bit-sequence encoded from the quantized update vector chunk a_v and its CRC
is encoded into the n-bit codeword x_v = [x_{v,0}, · · · , x_{v,n−1}] of the channel codes. The channel codes adopted here are systematic low-density parity check (LDPC) codes [16]. Each k-bit sourceword b_v is appended with n − k parity bits provided by the LDPC encoder, forming the n-bit codeword x_v. To enable PhyArith, we require that the quantized update vector chunks transmitted simultaneously share the same LDPC codebook, i.e., the mapping from b_v to x_v should be the same for every client v. Likewise, for modulation, the mapping from x_v to s_v = [s_{v,0}, · · · , s_{v,m−1}] should be the same as well.

D. Compatibility with Commodity 802.11 Devices

In 802.11ax, similarly to PhyArith, systematic LDPC codes are used for error correction. For a fixed coding rate, the codebook mapping from the scrambled sourceword to the LDPC codeword is fixed. Besides, for a fixed modulation, the mapping from the codeword to the symbols transmitted in the same time-frequency resource is also fixed. Consequently, the encoding process of 802.11ax is fully compatible with PhyArith. That is, we can directly use commodity single-antenna 802.11ax devices to send the sourceword b_v of PhyArith via a raw socket, where the channel-codes-related processing is provided by the 802.11ax devices.

VII. DECODING PROCESS OF PHYARITH

In this section, we demonstrate the decoding process of PhyArith. Limited by the decoding complexity, the client set V in PhyArith is first randomly partitioned into subsets 𝒱 = {V_0, · · · , V_{|𝒱|−1}}, where ∀V_i ∈ 𝒱, |V_i| ≤ Δ and Δ is the subset size limit. Then, as shown in Fig. 4, based on the superimposed RF signal, PhyArith directly decodes the sourceword-sum of each subset V_i, which is denoted by

z_i = [z_{i,0}, · · · , z_{i,k−1}] = Σ_{v∈V_i} β̂_v b_v,

where β̂_v is an odd integer given by

β̂_v = β_v · 2^ζ + 1 with probability 1/2, and β̂_v = β_v · 2^ζ − 1 otherwise,

and ζ is a randomly chosen integer ensuring ∀v ∈ V, β_v · 2^ζ ∈ N and β_v · 2^ζ ≫ 1.
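The sourceword-sum can be sketched concretely: encode chunk elements with rule (3), then take the per-bit integer sum weighted by the β̂_v. Because every β̂_v is odd, the sourceword-sum reduced mod 2 collapses to the bitwise XOR of the sourcewords, which is exactly what the CRC verification of Section VII-C will rely on. The chunk values and coefficients below are illustrative assumptions:

```python
def twos_complement_bits(a, c):
    """Encode a signed integer a into c bits per rule (3); index 0 is the MSB."""
    u = a if a >= 0 else (1 << c) + a    # 2^c + a for negative a
    return [(u >> (c - l - 1)) & 1 for l in range(c)]

# Illustrative values: three clients' chunk elements and small odd coefficients
# standing in for the beta_hat_v of the text.
c = 4
a_vals = [-5, 3, -2]              # needs 2**(c-1) >= max |a_{v,j}|
beta_hat = [3, 5, 7]              # odd integers, as required
bits = [twos_complement_bits(a, c) for a in a_vals]

# Sourceword-sum z = sum_v beta_hat_v * b_v, taken per bit position.
z = [sum(bh * b[l] for bh, b in zip(beta_hat, bits)) for l in range(c)]

# Since every beta_hat_v is odd, z mod 2 equals the bitwise XOR of the
# sourcewords -- the property exploited by the CRC check in Section VII-C.
xor = [bits[0][l] ^ bits[1][l] ^ bits[2][l] for l in range(c)]
```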
For each obtained sourceword-sum z_i verified to be correct, the summation of the update vector chunks of V_i, i.e., Σ_{v∈V_i}(β_v a_v + γ_v 1), is recovered. For each z_i verified to be incorrect, retransmission is requested. Finally, all these recovered summations can be added up and averaged, then assembled into the averaged update vector A_t, given by

A_t = [· · · , Σ_{V_i∈𝒱} Σ_{v∈V_i}(β_v a_v + γ_v 1)/|V|, · · · ].

Note that here we let each β̂_v be a large odd integer to facilitate the verification of z_i in Section VII-C, and to make sure the recovered Σ_{v∈V_i}(β_v a_v + γ_v 1) is accurate in Section VII-D.

[Fig. 4. Decoding process of the server: from the superimposed RF signal, the sum-product derived decoder outputs the sourceword-sums z_0, · · · , z_{|𝒱|−1}; from each, Σ_{v∈V_i}(β_v a_v + γ_v 1) is recovered or a retransmission is requested; the results are added up, averaged, and assembled into the averaged update vector A_t.]

A. Building of the Factor-Graph

We first show how to directly decode each sourceword-sum z_i based on the superimposed RF signal. Let y_j = [y_{u_0,j}, · · · , y_{u_{|U|−1},j}]^T be the j-th observation vector. The decoding problem here can be formulated as the following optimization problem

z_0^*, · · · , z_{|𝒱|−1}^* = arg max_{z_0, · · · , z_{|𝒱|−1}} P(z_0, · · · , z_{|𝒱|−1} | y_0, · · · , y_{m−1}). (5)

Since jointly optimizing z_0, · · · , z_{|𝒱|−1} is too complicated to solve, we relax (5) as follows

z_{i,j}^* = arg max_{z_{i,j}} P(z_{i,j} | y_0, · · · , y_{m−1}),

where the marginal probability can be written in the following sum-product form via factorization

P(z_{i,j} | y_0, · · · , y_{m−1})
= Σ_{∼z_{i,j}} P(y_0, · · · , y_{m−1}, z_0, · · · , z_{|𝒱|−1}) / P(y_0, · · · , y_{m−1})
= A · Σ_{∼z_{i,j}} [ P(y_0, · · · , y_{m−1} | s_{v_0}, · · · , s_{v_{|V|−1}}) · Π_{i′=0}^{|𝒱|−1} [ Π_{v∈V_{i′}} I(s_v, x_v, b_v) ] δ(‖Σ_{v∈V_{i′}} β̂_v b_v − z_{i′}‖_1) ]. (6)

Here ∼z_{i,j} denotes all random variables except z_{i,j}.
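For intuition, the relaxed per-symbol problem can be solved by brute force in a tiny case: enumerate the symbol pairs, weight each by the channel likelihood, and marginalize the posterior directly onto the sum rather than onto the individual symbols. The sketch below uses a toy real-valued, single-antenna setting with BPSK symbols; all values are illustrative assumptions:

```python
import numpy as np

# Brute-force posterior over the sum z = s0 + s1 given a single observation
# y = h0*s0 + h1*s1 + w, with BPSK symbols and real Gaussian noise. This is
# the |V_i| = 2 toy analogue of marginalizing (6) onto the sourceword-sum.
def posterior_of_sum(y, h0, h1, sigma):
    weights = {}
    for s0 in (-1.0, 1.0):
        for s1 in (-1.0, 1.0):
            lik = np.exp(-(y - h0 * s0 - h1 * s1) ** 2 / (2 * sigma ** 2))
            weights[s0 + s1] = weights.get(s0 + s1, 0.0) + 0.25 * lik
    total = sum(weights.values())
    return {z: wt / total for z, wt in weights.items()}

post = posterior_of_sum(y=1.9, h0=1.0, h1=1.0, sigma=0.5)
z_map = max(post, key=post.get)    # MAP estimate of the sum s0 + s1
```

Note that the sum takes only three values here, so its posterior is cheaper to represent than the joint over symbol pairs; the factor-graph machinery below scales this idea to full sourcewords.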
I(s_v, x_v, b_v) is the indicator function ensuring that x_v is a valid LDPC codeword encoded from b_v, and that s_v is a valid symbol sequence modulated from x_v. And δ(·) denotes the Dirac delta function.

A general method to efficiently obtain the marginal probability of each z_{i,j} is the factor-graph based sum-product algorithm [18], which operates in a message-passing manner. Specifically, when |𝒱| = 2, the factorization (6) can be visualized by a factor-graph, a kind of bipartite graph like Fig. 5, which consists of two kinds of nodes. One is called the variable node (denoted by circles), representing random variables, and the other is called the check node (denoted by squares), representing the local functions of the variables. By exchanging messages in the form of probability mass function
(PMF) between adjacent nodes, the marginal probabilities of the random variables can be obtained. Generally, suppose the client subset V_i = {v_{i_0}, · · · , v_{i_{|V_i|−1}}}. For V_i, let the j-th baseband symbol vector s_{i,j} = [s_{v_{i_0},j}, · · · , s_{v_{i_{|V_i|−1}},j}]^T, and the codeword bit vector x_{i,j} = [x_{v_{i_0},j}, · · · , x_{v_{i_{|V_i|−1}},j}]^T. Based on the factorization (6), we build the factor-graph G for decoding the sourceword-sum in PhyArith. In G, as shown in Fig. 5, Y_j denotes the j-th observation check node. And for client subset V_i, we let C_{i,j} denote its j-th LDPC codes check node, X_{i,j} its LDPC codes variable node (corresponding to the variable x_{i,j}), M_{i,j} its modulation check node, S_{i,j} its modulation variable node (corresponding to the variable s_{i,j}), Z_{i,j} its sourceword-sum variable node (corresponding to the variable z_{i,j}), and Σ_{i,j} its sourceword-sum check node.

[Fig. 5. Example factor-graph for the sum-product algorithm derived decoder in PhyArith where |𝒱| = 2.]

B. Decoding of the Sourceword-Sum

The decoding process of the sourceword-sum on the factor-graph G is illustrated in Algorithm 1. Given the factor-graph G, the channel gain vectors h_v^j, and the observations y_j, where v ∈ V, j ∈ {0, · · · , m − 1}, Algorithm 1 returns the decoding results z_{i,j}^*, where i ∈ {0, · · · , |𝒱| − 1}, j ∈ {0, · · · , k − 1}. Algorithm 1 runs MaxIterationNum iterations. In each iteration, on the factor-graph G, it first propagates PMFs in parallel from all variable nodes to their adjacent check nodes, and then propagates PMFs from the check nodes to their adjacent variable nodes.
And after these iterations, it takes the following formula to obtain the final decoding result z_{i,j}^*

z_{i,j}^* = arg max_{z_{i,j}} P_{Σ_{i,j}→Z_{i,j}}(z_{i,j}), (7)

where P_{Σ_{i,j}→Z_{i,j}}(z_{i,j}) denotes the PMF of the random variable z_{i,j}, propagated from the check node Σ_{i,j} to the variable node Z_{i,j}, given by (21) in Appendix-C.

The most challenging part in Algorithm 1 is how to efficiently and accurately obtain the PMF of s_{i,j} propagated from Y_j to S_{i,j}, given by

P_{Y_j→S_{i,j}}(s_{i,j}) = A · Σ_{∼s_{i,j}} [ P(y_j | s_{0,j}, · · · , s_{|𝒱|−1,j}) Π_{i′≠i} P_{S_{i′,j}→Y_j}(s_{i′,j}) ].

This differs from MUD approaches like LMMSE-IC [22], where all random variables s_{v,j}, v ∈ V are approximated by independent circularly symmetric normal distributions, which loses the information about the relationships among these random variables. In order to efficiently utilize the superposition of RF signals to aggregate updates, here we treat each random variable pair (s_{v_l,j}, s_{v_{l′},j}), v_l, v_{l′} ∈ V_i as mutually dependent, and use the following function, derived from the probability density function of the multivariate normal distribution, to approximate the above PMF

P_{Y_j→S_{i,j}}(s_{i,j}) = A · exp(−Real(μ_{i,j} − H_j^i s_{i,j})^T Γ_{i,j}^{−1} Real(μ_{i,j} − H_j^i s_{i,j})/2), (8)

where H_j^i = [h_{v_{i_0}}^j, · · · , h_{v_{i_{|V_i|−1}}}^j] is the channel gain matrix of the client subset V_i. The covariance matrix Γ_{i,j} is obtained via (22) in Appendix-D. The vector μ_{i,j} is given by

μ_{i,j} = y_j − Σ_{v∈V∖V_i} h_v^j E(s_{v,j}),

where E(s_{v,j}) is the mean of the random variable s_{v,j}, v ∈ V_{i′}, obtained based on the PMF P_{S_{i′,j}→Y_j}(s_{i′,j}). And the operation Real(·) : C^{|U|} → R^{2|U|} is defined as Real([α_0, · · · , α_{|U|−1}]^T) ≜ [Re(α_0), Im(α_0), · · · , Re(α_{|U|−1}), Im(α_{|U|−1})]^T.
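The approximation (8) can be sketched directly: given the residual μ, the subset channel matrix, and a covariance Γ (taken here as a given input, since the paper derives it via (22) in Appendix-D, which is not reproduced), evaluate the quadratic form over a list of candidate symbol vectors and normalize. The channel values and the noise-free residual below are illustrative assumptions:

```python
import numpy as np

def real_stack(v):
    """The operation Real(.): C^{|U|} -> R^{2|U|}, interleaving Re and Im."""
    out = np.empty(2 * v.size)
    out[0::2], out[1::2] = v.real, v.imag
    return out

def message_y_to_s(mu, H_i, Gamma, candidates):
    """Evaluate the Gaussian approximation (8) over candidate symbol vectors.

    Gamma is an assumed input here; the paper computes it via (22) in
    Appendix-D, which this sketch does not reproduce.
    """
    Ginv = np.linalg.inv(Gamma)
    w = []
    for s in candidates:
        r = real_stack(mu - H_i @ s)
        w.append(np.exp(-0.5 * r @ Ginv @ r))
    w = np.array(w)
    return w / w.sum()

# Tiny deterministic example: |U| = 2 antennas, a subset of 2 BPSK clients,
# and a noise-free residual, so the PMF should concentrate on the true pair.
H_i = np.array([[1.0 + 0.5j, -0.7 + 0.2j],
                [0.3 - 1.0j,  0.8 + 0.6j]])
s_true = np.array([1.0, -1.0])
mu = H_i @ s_true
cands = [np.array([a, b]) for a in (-1.0, 1.0) for b in (-1.0, 1.0)]
pmf = message_y_to_s(mu, H_i, 0.01 * np.eye(4), cands)
```

Unlike an independent-Gaussian interference model, the quadratic form keeps the joint dependence between the symbols of the pair, which is the point of (8).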
Algorithm 1: Decoding process of the sourceword-sum
Input: G, h_v^j, y_j, where v ∈ V, j ∈ {0, · · · , m − 1}
Output: z_{i,j}^*, where i ∈ {0, · · · , |𝒱| − 1}, j ∈ {0, · · · , k − 1}
1: for iterator = 0 to MaxIterationNum − 1 do
2:   for all variable nodes S_{i,j}, X_{i,j}, Z_{i,j} ∈ G do
3:     propagate their PMFs, given by (11) (12) (13) (14) (15) (16), to their adjacent check nodes
4:   end for
5:   for all check nodes Y_j, M_{i,j}, C_{i,j}, Σ_{i,j} ∈ G do
6:     propagate their PMFs, given by (8) (17) (18) (19) (20) (21), to their adjacent variable nodes
7:   end for
8: end for
9: for i = 0 : |𝒱| − 1, j = 0 : k − 1 do
10:   take (7) to obtain z_{i,j}^*
11: end for

C. Verification of the Sourceword-Sum

Here we show how to verify the correctness of the obtained sourceword-sum z_i^*. Note that for each correct z_i = β̂_{v_{i_0}} b_{v_{i_0}} + · · · + β̂_{v_{i_{|V_i|−1}}} b_{v_{i_{|V_i|−1}}}, we have

z_i mod 2 = (β̂_{v_{i_0}} mod 2) b_{v_{i_0}} ⊕ · · · ⊕ (β̂_{v_{i_{|V_i|−1}}} mod 2) b_{v_{i_{|V_i|−1}}} = b_{v_{i_0}} ⊕ · · · ⊕ b_{v_{i_{|V_i|−1}}},

where z_i mod 2 ≜ [z_{i,0} mod 2, · · · , z_{i,k−1} mod 2]. According to (4), we have

CRC(z_i mod 2) = CRC(b_{v_{i_0}}) ⊕ · · · ⊕ CRC(b_{v_{i_{|V_i|−1}}}) = 0.
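The CRC linearity that this verification rests on is easy to reproduce with a bit-serial CRC whose initial register and final XOR are both zero, matching the configuration the paper uses. The generator polynomial below is an arbitrary 8-bit example rather than the CRC24 of the experiments:

```python
# Minimal bitwise CRC over GF(2) with zero initial value and zero final XOR,
# so that the checksum is a linear function of the message (property (4)).
def crc_bits(bits, poly_bits):
    r = len(poly_bits) - 1
    reg = list(bits) + [0] * r            # message followed by r zero bits
    for i in range(len(bits)):            # polynomial long division over GF(2)
        if reg[i]:
            for k, p in enumerate(poly_bits):
                reg[i + k] ^= p
    return reg[-r:]                       # remainder = checksum

poly = [1, 0, 0, 0, 0, 0, 1, 1, 1]        # x^8 + x^2 + x + 1, an example CRC-8
b1 = [1, 0, 1, 1, 0, 0, 1, 0, 1, 1, 0, 1]
b2 = [0, 1, 1, 0, 1, 0, 0, 1, 1, 0, 1, 0]
xor = [x ^ y for x, y in zip(b1, b2)]

# Linearity: CRC(b1 XOR b2) == CRC(b1) XOR CRC(b2).
lhs = crc_bits(xor, poly)
rhs = [x ^ y for x, y in zip(crc_bits(b1, poly), crc_bits(b2, poly))]

# A sourceword carrying its own CRC checks to zero, so a correct mod-2
# sourceword-sum (an XOR of such sourcewords) also checks to zero.
sw = b1 + crc_bits(b1, poly)
```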
That is, we can verify the correctness of z_i^* via the following rule:

z_i^* is correct if CRC(z_i^* mod 2) = 0, and z_i^* is incorrect otherwise. (9)

Note that by verifying the checksum with (9), we can guarantee the correctness of z_i^* with high probability. However, some z_i^* can still be erroneous even if it passes the checksum verification.

D. Recovery of the Summation of Update Vector Chunks

Based on a sourceword-sum z_i that is verified to be correct, here we show how to recover Σ_{v∈V_i}(β_v a_v + γ_v 1). Since each element a_{v,j} in a_v is encoded into {b_{v,j_0}, · · · , b_{v,j_{c−1}}} in b_v according to (3), we have

{Σ_{v∈V_i} β̂_v b_{v,j_0}, · · · , Σ_{v∈V_i} β̂_v b_{v,j_{c−1}}} = {z_{i,j_0}, · · · , z_{i,j_{c−1}}}.

That is, each element Σ_{v∈V_i}(β_v a_{v,j} + γ_v) in Σ_{v∈V_i}(β_v a_v + γ_v 1) can be recovered via

Σ_{v∈V_i}(β_v a_{v,j} + γ_v) ≐ α/2^ζ + Σ_{v∈V_i} γ_v if ⌊α/2^{ε+c−1}⌋ = 0, and (α − 2^{ε+c})/2^ζ + Σ_{v∈V_i} γ_v otherwise, (10)

where

α = (Σ_{l=0}^{c−1} z_{i,j_l} 2^{c−l−1} + Σ_{l=0}^{ε−1} z_{i,j_0} 2^{c+l}) mod 2^{ε+c},

and ε = ⌈log_2(Σ_{v∈V_i} β̂_v)⌉. The process presented in (10) can be viewed as transferring each c-bit signed integer in two's-complement representation (e.g., {b_{v,j_0}, · · · , b_{v,j_{c−1}}}) into an (ε+c)-bit signed integer in two's-complement representation (e.g., {b_{v,j_0}, b_{v,j_0}, · · · , b_{v,j_{c−1}}}, obtained by prepending ε extra copies of the most significant bit b_{v,j_0}, i.e., sign extension), such that the addition of |V_i| such signed integers will not overflow.

VIII. PHYARITH WITH SPARSIFICATION

In this section, we show how to enable PhyArith in federated learning where sparsification is taken to compress client-side updates and accelerate the communication.

A. Client-Side Process

In conventional sparsification-based update compression methods, each client simply sends the sparse vector A_t^v = SP(M_t^v − M_t) to the server, where SP(·) denotes the sparsification process.
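The sign-extension recovery of Section VII-D can be sketched end to end for one element position. The sketch recovers Σ_v β̂_v a_v exactly from the per-bit sourceword sums; the de-quantization step of (10) (division by 2^ζ and the γ_v offsets) is omitted for clarity, and the chunk values and coefficients are illustrative assumptions:

```python
import math

def twos_complement_bits(a, c):
    """Encode a signed integer a into c bits per rule (3); index 0 is the MSB."""
    u = a if a >= 0 else (1 << c) + a
    return [(u >> (c - l - 1)) & 1 for l in range(c)]

c = 4
a_vals = [-5, 3, -2]                         # illustrative chunk elements
beta_hat = [3, 5, 7]                         # odd coefficients beta_hat_v
eps = math.ceil(math.log2(sum(beta_hat)))    # epsilon = ceil(log2 sum beta_hat_v)

bits = [twos_complement_bits(a, c) for a in a_vals]
z = [sum(bh * b[l] for bh, b in zip(beta_hat, bits)) for l in range(c)]

# alpha per (10): per-bit sums plus eps replicas of the MSB sum (sign extension),
# reduced mod 2^(eps+c).
alpha = (sum(z[l] << (c - l - 1) for l in range(c))
         + sum(z[0] << (c + l) for l in range(eps))) % (1 << (eps + c))

# Interpret alpha as an (eps+c)-bit two's-complement value.
recovered = alpha if alpha < (1 << (eps + c - 1)) else alpha - (1 << (eps + c))
```

Here the widened word length guarantees |Σ_v β̂_v a_v| < 2^{ε+c−1}, so the modular value determines the signed sum uniquely.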
Different from these methods, in order to enable PhyArith, at the client side the update vector

$$A^v_t = Q_v\big(C\cdot \mathrm{SP}(M^v_t - M_t)\big)$$

is sent from each client to the server, where C is the random underdetermined sensing matrix shared across all clients and the server, used for compressing the sparse vector SP(M^v_t − M_t) based on compressed sensing. Such an A^v_t can then be sliced into chunks, i.e., A^v_t = [···, a_v, ···], and encoded according to the routine provided in Section VI.

B. Server-Side Process

At the server side, similar to the routine provided in Section VII, all the recovered summations of update vector chunks, i.e., $\sum_{v\in V_i}(\beta_v a_v+\gamma_v\mathbf{1})$, are assembled into the averaged compressed update vector C·A_t, given by

$$C\cdot A_t=\Big[\cdots,\ \sum_{V_i\in \mathcal V}\sum_{v\in V_i}(\beta_v a_v+\gamma_v\mathbf{1})/|V|,\ \cdots\Big]=\sum_{v\in V}Q_v^{-1}(A^v_t)/|V|=\sum_{v\in V}C\cdot \mathrm{SP}(M^v_t-M_t)/|V|.$$

By recovering the sparse vector A_t with algorithms such as convex programming [24] for compressed sensing, the server can construct an improved global model M_{t+1} = M_t + η_t A_t.

IX. PERFORMANCE EVALUATION

A. Experiment Setup

We conduct experiments using the federated learning algorithm FedSGD presented in [3] to train LeNet-5 [25] on the MNIST digit recognition task, where the training data is shuffled and equally distributed over all clients. During each round of FedSGD, all clients are selected; each computes its model update (i.e., the gradient) over all its local data (i.e., the batch size equals the size of its local dataset) and forwards the update to the server. In the experiments, consistent with 802.11ax, we let |U| = 8 (i.e., the server is equipped with 8 antennas) and study different total client sizes |V| in the following two subsections. Specifically, we study |V| = 4, 8, 12, 16, denoted by configurations 4×8, 8×8, 12×8, and 16×8, respectively. The channel gain matrix between clients and server is generated according to the multiple-antenna channel model provided in [26]. The modulation scheme is BPSK.
We use the standard LDPC codes specified in 802.11, with coding rate 1/2, k = 324, and n = 648, for error correction. The cyclic-redundancy checksum adopted in the experiments is CRC24, where the polynomial is 0x800063, the initial value of the 24-bit register is 0x0, and the final XOR value is 0x0. Each quantized update vector chunk is encoded into a sequence of 300 bits; these 300 bits, appended with the 24-bit CRC24, form each 324-bit LDPC sourceword.

B. Outage Ratio of Aggregating Client-Side Updates

We first show the outage ratio of aggregating client-side updates during training LeNet-5 using FedSGD. Specifically, we show the outage ratio of obtaining the exact summation of all update vector chunks transmitted simultaneously, based on PhyArith and the state-of-the-art polynomial-time-complexity MUD approaches GM-IC and LMMSE-IC [20]–[22]. From the experiment results shown in Fig. 6, we can see that 1) PhyArith always outperforms both MUD approaches; 2) PhyArith is especially better when |V| ≥ |U|. This is because when |V| ≥ |U|, the decoding process of MUD approaches like LMMSE-IC and GM-IC does not necessarily converge [21]. We also use the appended checksums to verify the correctness of all the erroneous summations directly recovered by PhyArith.
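The CRC24 described above (polynomial 0x800063, register initialized to 0x0, final XOR 0x0) admits a straightforward bitwise implementation; the exact bit ordering used in the experiments is an assumption in this sketch:

```python
POLY = 0x800063  # 24-bit truncated generator polynomial (implicit x^24 term)

def crc24(bits):
    """MSB-first bitwise CRC24 over a sequence of 0/1 values
    (initial register 0x0, final XOR 0x0)."""
    reg = 0
    for b in bits:
        reg ^= (b & 1) << 23          # fold the next message bit into the MSB
        if reg & 0x800000:            # top bit set: shift and reduce by POLY
            reg = ((reg << 1) ^ POLY) & 0xFFFFFF
        else:
            reg = (reg << 1) & 0xFFFFFF
    return reg

def with_checksum(bits):
    """Append the 24 CRC bits: 300 payload bits become a 324-bit sourceword."""
    c = crc24(bits)
    return bits + [(c >> (23 - i)) & 1 for i in range(24)]
```

Because both the initial value and the final XOR are zero, re-running the CRC over a sourceword with its checksum appended yields 0, which is exactly the acceptance condition used in rule (9).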
In Fig. 6 there are 55,479 erroneous summations in all configurations; the experiment results show that none of them passes the verification.

Fig. 6. Outage ratio of obtaining the exact summation of all update vector chunks transmitted simultaneously in the first round of FedSGD, versus SNR (dB), based on PhyArith (∆ = 2, 3), LMMSE-IC, and GM-IC, where FedSGD is with configuration (a) 4×8; (b) 8×8; (c) 12×8; (d) 16×8, learning rate η = 0.01, and updates are compressed using 10-bit quantization.

C. Test Accuracy of LeNet-5 Trained by FedSGD

We then show the test accuracy of LeNet-5 trained by FedSGD with configuration 16×8, where the SNR of the uplink MU-MIMO channel is set to 23.33 dB, and the client-side updates are aggregated based on PhyArith and the MUD approach LMMSE-IC. Recall from the results in Fig. 6 that the MUD approach cannot handle the 16×8 configuration where updates from all 16 clients are transmitted to the server simultaneously. To be fair to the MUD approach, we let its 16 updates be transmitted simultaneously in several separate slots, such that they can be efficiently aggregated. From the experiment results shown in Fig. 7 and Fig. 8, we can see that, to achieve the same test accuracy, our proposal PhyArith further improves the communication efficiency by 1.5 to 3 times compared with the solutions that apply quantization/sparsification to compress updates and aggregate them via MUD.

X.
CONCLUSION

In this paper, we propose PhyArith, which improves the communication efficiency of federated learning in wireless networks featuring uplink MU-MIMO by directly aggregating client-side updates from the superimposed RF signal, without affecting the convergence of the training procedure. In the future, we would like to make other server-side aggregation methods available in PhyArith, and further improve its decoding performance.

Fig. 7. Test accuracy of LeNet-5 (versus normalized communication time) trained by FedSGD with learning rate η = 0.01, where updates are compressed using (a) no compression; (b) 10-bit; (c) 2-bit; (d) 1-bit quantization, aggregated based on PhyArith (∆ = 3) and LMMSE-IC.

Fig. 8. Test accuracy of LeNet-5 (versus normalized communication time) trained by FedSGD with learning rate η = 0.01, where updates are compressed using (a) no compression; (b) top-30%; (c) top-6%; (d) top-3% sparsification, and 8-bit quantization, aggregated based on PhyArith (∆ = 2) and LMMSE-IC.

APPENDIX

A. Calculation of Entropy

While |h_0| = |h_1| = 1.0, entropy H(y) is given by

$$H(y)=H\Big\{\frac{1}{4\sqrt{2\pi}\sigma}\Big[2\exp\Big(-\frac{y^2}{2\sigma^2}\Big)+\exp\Big(-\frac{(y+2)^2}{2\sigma^2}\Big)+\exp\Big(-\frac{(y-2)^2}{2\sigma^2}\Big)\Big]\Big\},$$

and entropy H(y|s_0, s_1) is given by

$$H(y\mid s_0,s_1)=H\Big\{\frac{1}{\sqrt{2\pi}\sigma}\exp\Big(-\frac{y^2}{2\sigma^2}\Big)\Big\},$$
and entropy H(y|s_0) is given by

$$H(y\mid s_0)=H\Big\{\frac{1}{2\sqrt{2\pi}\sigma}\Big[\exp\Big(-\frac{(y+1)^2}{2\sigma^2}\Big)+\exp\Big(-\frac{(y-1)^2}{2\sigma^2}\Big)\Big]\Big\}.$$
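These differential entropies can be checked numerically. For instance, H(y|s_0, s_1) is the entropy of a single Gaussian, which has the closed form ½ ln(2πeσ²); a quick sketch (σ = 1 is an arbitrary choice for illustration):

```python
import math
import numpy as np

sigma = 1.0
y = np.linspace(-12, 12, 200001)   # fine symmetric grid; tails are negligible
dy = y[1] - y[0]

def H(p):
    """Differential entropy (in nats) of a density sampled on the grid y."""
    p = np.clip(p, 1e-300, None)   # avoid log(0)
    return float(-np.sum(p * np.log(p)) * dy)

def g(mu):
    """Gaussian density with mean mu and standard deviation sigma."""
    return np.exp(-(y - mu) ** 2 / (2 * sigma ** 2)) / (math.sqrt(2 * math.pi) * sigma)

h_cond = H(g(0))                                    # H(y | s0, s1)
h_mix = H(0.5 * g(0) + 0.25 * g(2) + 0.25 * g(-2))  # H(y): BPSK sum in {-2, 0, 2}

# Matches the closed form 0.5 * ln(2*pi*e*sigma^2) for the Gaussian case,
# and conditioning can only reduce entropy.
assert abs(h_cond - 0.5 * math.log(2 * math.pi * math.e * sigma ** 2)) < 1e-5
assert h_mix > h_cond
```

The gap h_mix − h_cond is the mutual information the superimposed symbol carries about the transmitted pair, which is the quantity these appendix formulas feed into.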
While h_0 = h_1, |h_0| = 1.0, entropy H(y|s_0+s_1) = H(y|s_0, s_1). And while h_0 = −h_1, |h_0| = 1.0, entropy H(y|s_0+s_1) is given by

$$H(y\mid s_0+s_1)=\frac{1}{2}H\Big\{\frac{1}{\sqrt{2\pi}\sigma}\exp\Big(-\frac{y^2}{2\sigma^2}\Big)\Big\}+\frac{1}{2}H\Big\{\frac{1}{2\sqrt{2\pi}\sigma}\Big[\exp\Big(-\frac{(y+2)^2}{2\sigma^2}\Big)+\exp\Big(-\frac{(y-2)^2}{2\sigma^2}\Big)\Big]\Big\}.$$

B. PMF from Variable Node to Check Node in Algorithm 1

The PMF propagated from S_{i,j} to Y_j is given by

$$P_{S_{i,j}\to Y_j}(s_{i,j})=\begin{cases}1/|S|^{|V_i|}, & \text{the 1st iteration},\\ P_{M_{i,j}\to S_{i,j}}(s_{i,j}), & \text{o.w.}\end{cases}\tag{11}$$

The PMF propagated from S_{i,j} to M_{i,j} is given by

$$P_{S_{i,j}\to M_{i,j}}(s_{i,j})=\begin{cases}1/|S|^{|V_i|}, & \text{the 1st iteration},\\ P_{Y_j\to S_{i,j}}(s_{i,j}), & \text{o.w.}\end{cases}\tag{12}$$

The PMF propagated from X_{i,j} to M_{i,l} is given by

$$P_{X_{i,j}\to M_{i,l}}(x_{i,j})=\begin{cases}1/2^{|V_i|}, & \text{the 1st iteration},\\ A\cdot\displaystyle\prod_{(X_{i,j},C_{i,l'})\in\mathcal G}P_{C_{i,l'}\to X_{i,j}}(x_{i,j}), & \text{o.w.}\end{cases}\tag{13}$$

The PMF propagated from X_{i,j} to C_{i,l} is given by

$$P_{X_{i,j}\to C_{i,l}}(x_{i,j})=\begin{cases}1/2^{|V_i|}, & \text{the 1st iteration},\\ A\cdot\displaystyle\prod_{l\neq l',(X_{i,j},C_{i,l'})\in\mathcal G}P_{C_{i,l'}\to X_{i,j}}(x_{i,j})\cdot\displaystyle\prod_{(X_{i,j},M_{i,l'})\in\mathcal G}P_{M_{i,l'}\to X_{i,j}}(x_{i,j}), & \text{o.w.}\end{cases}\tag{14}$$

The PMF propagated from X_{i,j} to Σ_{i,j} is given by

$$P_{X_{i,j}\to\Sigma_{i,j}}(x_{i,j})=A\cdot\prod_{(X_{i,j},C_{i,l})\in\mathcal G}P_{C_{i,l}\to X_{i,j}}(x_{i,j})\cdot\prod_{(X_{i,j},M_{i,l})\in\mathcal G}P_{M_{i,l}\to X_{i,j}}(x_{i,j}).\tag{15}$$

The PMF propagated from Z_{i,j} to Σ_{i,j} is given by

$$P_{Z_{i,j}\to\Sigma_{i,j}}(z_{i,j})=2^{-|V_i|}.\tag{16}$$

C. PMF from Check Node to Variable Node in Algorithm 1

The PMF propagated from M_{i,j} to S_{i,j} is given by

$$P_{M_{i,j}\to S_{i,j}}(s_{i,j})=A\cdot\sum_{\sim s_{i,j}}\Big[\prod_{(X_{i,j'},M_{i,j})\in\mathcal G}P_{X_{i,j'}\to M_{i,j}}(x_{i,j'})\cdot I(j',j,x_{i,j'},s_{i,j})\Big],\tag{17}$$

where I(j', j, x_{i,j'}, s_{i,j}) is the indicator function ensuring that the j'-th codeword bits vector x_{i,j'} is consistent with the j-th symbol vector s_{i,j}. The PMF propagated from M_{i,l} to X_{i,j} is given by

$$P_{M_{i,l}\to X_{i,j}}(x_{i,j})=A\cdot\sum_{\sim x_{i,j}}\Big[\prod_{j\neq j',(X_{i,j'},M_{i,l})\in\mathcal G}P_{X_{i,j'}\to M_{i,l}}(x_{i,j'})\cdot I(j',l,x_{i,j'},s_{i,l})\Big]P_{S_{i,l}\to M_{i,l}}(s_{i,l})\,I(j,l,x_{i,j},s_{i,l}).\tag{18}$$

The PMF propagated from C_{i,l} to X_{i,j} is given by

$$P_{C_{i,l}\to X_{i,j}}(x_{i,j})=A\cdot\sum_{\sim x_{i,j}}\Big[\prod_{j\neq j',(X_{i,j'},C_{i,l})\in\mathcal G}P_{X_{i,j'}\to C_{i,l}}(x_{i,j'})\cdot\delta\Big(\Big\|\bigoplus_{(X_{i,j'},C_{i,l})\in\mathcal G}x_{i,j'}\Big\|_1\Big)\Big].\tag{19}$$

For client subset V_i, let $B_i=[\hat\beta_{v_{i,0}},\cdots,\hat\beta_{v_{i,|V_i|-1}}]^T$.
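As a concrete instance of a check-to-variable message in the style of (19), consider a single parity check over binary variables: the message to one variable sums the product of the other incoming PMFs over all assignments with even XOR, which collapses to the well-known closed form ½(1 + ∏(1 − 2p)). A minimal sketch with hypothetical input PMFs:

```python
from itertools import product

def check_to_var(pmfs):
    """Message from a parity-check node to one variable, given the
    PMFs (p(x=0), p(x=1)) of the OTHER variables attached to the check.
    Returns (P(x=0), P(x=1))."""
    msg = [0.0, 0.0]
    for bits in product([0, 1], repeat=len(pmfs)):
        w = 1.0
        for p, b in zip(pmfs, bits):
            w *= p[b]
        # The parity of the other variables fixes the target bit.
        msg[sum(bits) % 2] += w
    return tuple(msg)

pmfs = [(0.9, 0.1), (0.6, 0.4)]
m = check_to_var(pmfs)
# Closed form for a parity check: P(x=0) = 1/2 * (1 + prod(1 - 2*p1))
closed = 0.5 * (1 + (1 - 2 * 0.1) * (1 - 2 * 0.4))
assert abs(m[0] - closed) < 1e-12
```

The brute-force sum over `~x` assignments mirrors the marginalization in (17)–(19); practical decoders replace it with the product-form identity for efficiency.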
The PMF propagated from Σ_{i,j} to X_{i,j} is given by

$$P_{\Sigma_{i,j}\to X_{i,j}}(x_{i,j})=A\cdot\sum_{\sim x_{i,j}}P_{Z_{i,j}\to\Sigma_{i,j}}(z_{i,j})\,\delta\big(B_i^T x_{i,j}-z_{i,j}\big).\tag{20}$$

The PMF propagated from Σ_{i,j} to Z_{i,j} is given by

$$P_{\Sigma_{i,j}\to Z_{i,j}}(z_{i,j})=A\cdot\sum_{\sim z_{i,j}}P_{X_{i,j}\to\Sigma_{i,j}}(x_{i,j})\,\delta\big(B_i^T x_{i,j}-z_{i,j}\big).\tag{21}$$

D. Calculation of Covariance Matrix Γ_{i,j}

Random variables in vector Real(y_j) follow a multivariate normal distribution. Conditioned on the random variable vector s_{i,j}, we can calculate the covariance matrix Γ_{i,j} of the random variables in vector Real(y_j) as follows:

$$\begin{aligned}\Gamma_{i,j}=&\sum_{v_l\in V\setminus V_i}\Big(\mathrm{Var}(\mathrm{Re}(s_{v_l,j}))\,\mathrm{Real}(h^j_{v_l})\mathrm{Real}(h^j_{v_l})^T+\mathrm{Var}(\mathrm{Im}(s_{v_l,j}))\,\mathrm{Imag}(h^j_{v_l})\mathrm{Imag}(h^j_{v_l})^T\\&\quad+\mathrm{Cov}(\mathrm{Re}(s_{v_l,j}),\mathrm{Im}(s_{v_l,j}))\big(\mathrm{Real}(h^j_{v_l})\mathrm{Imag}(h^j_{v_l})^T+\mathrm{Imag}(h^j_{v_l})\mathrm{Real}(h^j_{v_l})^T\big)\Big)\\&+\sum_{V_{i'}\in\mathcal V\setminus\{V_i\}}\ \sum_{v_l,v_{l'}\in V_{i'},\,l\neq l'}\Big(\mathrm{Cov}(\mathrm{Re}(s_{v_l,j}),\mathrm{Re}(s_{v_{l'},j}))\big(\mathrm{Real}(h^j_{v_l})\mathrm{Real}(h^j_{v_{l'}})^T+\mathrm{Real}(h^j_{v_{l'}})\mathrm{Real}(h^j_{v_l})^T\big)\\&\quad+\mathrm{Cov}(\mathrm{Im}(s_{v_l,j}),\mathrm{Im}(s_{v_{l'},j}))\big(\mathrm{Imag}(h^j_{v_l})\mathrm{Imag}(h^j_{v_{l'}})^T+\mathrm{Imag}(h^j_{v_{l'}})\mathrm{Imag}(h^j_{v_l})^T\big)\\&\quad+\mathrm{Cov}(\mathrm{Re}(s_{v_l,j}),\mathrm{Im}(s_{v_{l'},j}))\big(\mathrm{Real}(h^j_{v_l})\mathrm{Imag}(h^j_{v_{l'}})^T+\mathrm{Imag}(h^j_{v_{l'}})\mathrm{Real}(h^j_{v_l})^T\big)\\&\quad+\mathrm{Cov}(\mathrm{Im}(s_{v_l,j}),\mathrm{Re}(s_{v_{l'},j}))\big(\mathrm{Imag}(h^j_{v_l})\mathrm{Real}(h^j_{v_{l'}})^T+\mathrm{Real}(h^j_{v_{l'}})\mathrm{Imag}(h^j_{v_l})^T\big)\Big)\\&+\mathrm{diag}(\cdots,\sigma^2,\cdots),\end{aligned}\tag{22}$$

where Var(·) denotes the variance of a random variable and Cov(·) denotes the covariance of random variables, both of which are obtained from the PMF $P_{S_{i',j}\to Y_j}(s_{i',j})$; diag(·) denotes a diagonal matrix. The operation Imag(·): $\mathbb C^{|U|}\to\mathbb R^{2|U|}$ is defined as

$$\mathrm{Imag}([\alpha_0,\cdots,\alpha_{|U|-1}]^T)\triangleq[-\mathrm{Im}(\alpha_0),\mathrm{Re}(\alpha_0),\cdots,-\mathrm{Im}(\alpha_{|U|-1}),\mathrm{Re}(\alpha_{|U|-1})]^T.$$
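The interleaved Real(·) and Imag(·) stackings in (22) are the standard real embedding of complex vectors: writing Real(h) as the interleaved [Re(α_0), Im(α_0), ...] (this definition of Real(·) is inferred here from the paper's definition of Imag(·)), the real form of h·s decomposes as Re(s)·Real(h) + Im(s)·Imag(h), which is what makes the second-moment bookkeeping in Γ_{i,j} work. A quick numerical check:

```python
import numpy as np

def real2(h):
    """Interleaved real stacking: [Re(a0), Im(a0), ..., Re(a_{n-1}), Im(a_{n-1})].
    (Inferred counterpart of the paper's Imag(.); an assumption.)"""
    return np.column_stack([h.real, h.imag]).ravel()

def imag2(h):
    """Paper-style Imag(.): [-Im(a0), Re(a0), ..., -Im(a_{n-1}), Re(a_{n-1})]."""
    return np.column_stack([-h.imag, h.real]).ravel()

rng = np.random.default_rng(1)
h = rng.standard_normal(4) + 1j * rng.standard_normal(4)
s = 0.7 - 1.3j
# Complex scaling h*s decomposes linearly over the two real stackings.
assert np.allclose(real2(h * s), s.real * real2(h) + s.imag * imag2(h))
```

This linear decomposition is why the variances and covariances of Re(s) and Im(s) appear paired with Real(h)Real(h)^T, Imag(h)Imag(h)^T, and their cross terms in (22).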
REFERENCES

[1] H. B. McMahan, E. Moore, D. Ramage, S. Hampson et al., “Federated learning: Collaborative machine learning without centralized training data,” https://ai.googleblog.com/2017/04/federated-learning-collaborative.html, 2017.
[2] J. Konečný, H. B. McMahan, D. Ramage, and P. Richtárik, “Federated optimization: Distributed machine learning for on-device intelligence,” CoRR, vol. abs/1610.02527, 2016.
[3] B. McMahan, E. Moore, D. Ramage, S. Hampson, and B. A. y Arcas, “Communication-efficient learning of deep networks from decentralized data,” in Proceedings of the 20th International Conference on Artificial Intelligence and Statistics, AISTATS 2017, pp. 1273–1282.
[4] R. Shokri and V. Shmatikov, “Privacy-preserving deep learning,” in Proceedings of the 22nd ACM SIGSAC Conference on Computer and Communications Security, 2015, pp. 1310–1321.
[5] F. Seide, H. Fu, J. Droppo, G. Li, and D. Yu, “1-bit stochastic gradient descent and its application to data-parallel distributed training of speech DNNs,” in INTERSPEECH 2014, 15th Annual Conference of the International Speech Communication Association, pp. 1058–1062.
[6] N. Strom, “Scalable distributed DNN training using commodity GPU cloud computing,” in INTERSPEECH 2015, 16th Annual Conference of the International Speech Communication Association, pp. 1488–1492.
[7] N. Dryden, T. Moon, S. A. Jacobs, and B. V. Essen, “Communication quantization for data-parallel training of deep neural networks,” in 2nd Workshop on Machine Learning in HPC Environments, MLHPC 2016, pp. 1–8.
[8] J. Konečný, H. B. McMahan, F. X. Yu, P. Richtárik, A. T. Suresh, and D. Bacon, “Federated learning: Strategies for improving communication efficiency,” CoRR, vol. abs/1610.05492, 2016.
[9] D. Alistarh, D. Grubic, J. Li, R. Tomioka, and M.
Vojnovic, “QSGD: communication-efficient SGD via gradient quantization and encoding,” in Advances in Neural Information Processing Systems 30: Annual Conference on Neural Information Processing Systems, NIPS 2017, pp. 1709–1720.
[10] M. M. Amiri and D. Gündüz, “Machine learning at the wireless edge: Distributed stochastic gradient descent over-the-air,” in IEEE International Symposium on Information Theory, ISIT 2019, pp. 1432–1436.
[11] G. Zhu, D. Liu, Y. Du, C. You, J. Zhang, and K. Huang, “Towards an intelligent edge: Wireless communication meets machine learning,” CoRR, vol. abs/1809.00343, 2018.
[12] G. Zhu, Y. Wang, and K. Huang, “Low-latency broadband analog aggregation for federated edge learning,” CoRR, vol. abs/1812.11494, 2018.
[13] K. Yang, T. Jiang, Y. Shi, and Z. Ding, “Federated learning via over-the-air computation,” CoRR, vol. abs/1812.11750, 2018.
[14] E. M. Khorov, A. Kiryanov, A. I. Lyakhov, and G. Bianchi, “A tutorial on IEEE 802.11ax high efficiency WLANs,” IEEE Communications Surveys and Tutorials, vol. 21, no. 1, pp. 197–216, 2019.
[15] P. Guide, “Intel® 64 and IA-32 architectures software developer's manual,” Volume 3B: System Programming Guide, vol. 2, 2011.
[16] R. G. Gallager, “Low-density parity-check codes,” IRE Trans. Information Theory, vol. 8, no. 1, pp. 21–28, 1962.
[17] W. W. Peterson and D. T. Brown, “Cyclic codes for error detection,” Proceedings of the IRE, vol. 49, no. 1, pp. 228–235, 1961.
[18] F. R. Kschischang, B. J. Frey, and H. Loeliger, “Factor graphs and the sum-product algorithm,” IEEE Trans. Information Theory, vol. 47, no. 2, pp. 498–519, 2001.
[19] B. Nazer and M. Gastpar, “Computation over multiple-access channels,” IEEE Trans. Information Theory, vol. 53, no. 10, pp. 3498–3516, 2007.
[20] S. Wu, L. Kuang, Z. Ni, J. Lu, D. Huang, and Q. Guo, “Low-complexity iterative detection for large-scale multiuser MIMO-OFDM systems using approximate message passing,” J. Sel. Topics Signal Processing, vol. 8, no. 5, pp.
902–915, 2014.
[21] L. Liu, C. Yuen, Y. L. Guan, Y. Li, and Y. Su, “Convergence analysis and assurance for Gaussian message passing iterative detector in massive MU-MIMO systems,” IEEE Trans. Wireless Communications, vol. 15, no. 9, pp. 6487–6501, 2016.
[22] J. Boutros and G. Caire, “Iterative multiuser joint decoding: Unified framework and asymptotic analysis,” IEEE Trans. Information Theory, vol. 48, no. 7, pp. 1772–1773, 2002.
[23] D. Tse and P. Viswanath, Fundamentals of Wireless Communication. Cambridge University Press, 2005.
[24] E. Candès and J. Romberg, “l1-magic: Recovery of sparse signals via convex programming,” 2005.
[25] Y. LeCun, L. Bottou, Y. Bengio, P. Haffner et al., “Gradient-based learning applied to document recognition,” Proceedings of the IEEE, vol. 86, no. 11, pp. 2278–2324, 1998.
[26] A. M. Sayeed, “Deconstructing multiantenna fading channels,” IEEE Trans. Signal Processing, vol. 50, no. 10, pp. 2563–2579, 2002.