with T(0,,)=T(,0,)=T(,,0)=0 stantial overhead,a natural approach would be to serialize We will now show by induction on w+m+n that T(w,m,n)< the processing of incoming messages at the leaves using a w min(m,n).Clearly this holds for the base cases.Now parallel-prefix computation 15,taking advantage that the consider T(w,n,m)where w,n,and m are all nonzero,and effects of processing the various incoming messages on the suppose the bound holds for w'+n'+m'<w+n+m.Then states can be summarized as simple state updates plus pos- sible assignment to the variables waiting and obj.and such T(,n,m)≤1+T(w-1,no,mo)+T(w-1,n1,m1) operations are composable.However,this approach could 1+(w-1)min(no,mo)+(w-1)min(n1,m1) in the worst case require a parallel-prefix operation on the ≤1+(w-1)min(n,m) entire tree to collect and process a single incoming mes- sage,giving an O(n)worst-case blow-up in message traffic. ≤wmin(n,m) Instead,we will consider the structure of the pairing algo- 0 rithm carefully,and show that many incoming messages can be sent directly to the root of the tree without significantly 3.2.3 Merging Many Patricia trees increasing contention,while others can be processed using a To merge many Patricia trees,observe first that in the convergecast operations. pairwise merge procedure above a new root is determined The key observation is that at any time a supernode has immediately-it is not necessary to wait for the recursive at most one designated predecessor and at most one desig- merges of subtrees to complete.It is thus possible to start nated successor node,and that only these nodes can send a new merge between the combined tree (as represented by propose,pair,and no-pair messages.So these messages (and its new root)and another Patricia tree (which may also be their responses)can be sent directly between roots,and the the result of a recently-initiated merge)without waiting for roots can update the state of the simulated supernode lo- the subtree merges to complete.If we think of the first cally.(To enable this,we assume that all messages are ex- merge operation as a wave propagating down through the tended by including the identity of the root node of the send- trees,the second merge operation propagates as a wave just ing component,and all message-ids within.For reasons of a few steps behind it.Subsequent merge operations can space,we do not discuss processing of these messages fur- be similarly pipelined,so long as we have enough processes ther.However,probe messages and their responses occur in to handle the operations at the individual tree nodes.The much greater abundance,and thus require special handling. result is that a tree of merges of maximum depth k can be completed in O(k +W)time. 3.3.1 Consolidating Probes and Probe Responses We do not use pipelined merges,as limitations of the pair- Probe messages appear in the algorithm in two places:in ing algorithm and supernode simulation require merges to the main thread,the supernode sends probes to all neighbors be carried out sequentially.However,we can imagine an and waits for responses,and in the daemon thread handling improved pairing algorithm that chooses new pairings while received probes,the supernode must accept only the first merges are still in progress.The ability to pipeline merges probe.The main thread must also collect up to d responses may also have other applications,such as building a Patricia per leaf node and pair off those that accept. tree from a pre-existing but unsorted balanced tree of nodes The task of sending probes and collecting responses is obtained by some other means. handled by a modified convergecast procedure,initiated by the root.Pseudocode for each node's role in this proce- 3.3 Simulating Supernodes dure is given below.A wrinkle that does not appear in the Intuitively,the idea behind the tree-construction algo- simple pairing algorithm is the same component response; rithm is to use the merging procedure to join each pair of this allows a node to detect that its neighbor is in the same nodes selected by the pairing algorithm into a single compo- component and should not be troubled further. nent that acts like a single"supernode"in the next round of Upon receiving message (parent,u,initiate_probe)from pairing.If contention were not an issue,we could simulate node parent do: such a supernode easily by choosing a single representative of each component,and have it handle all the edges that If u is a leaf node: previously went into some member of the component. The problem with this approach is that we will quickly For each v ∈neighbors. send a message raise the degree of the representative nodes:toward the end (u,v,probe,root)to v and wait for all replies. of the algorithm we this accumulation of edges could produce For each node v that replies with both linear contention and a linear slowdown in the pairing (v,u,same-component),remove v from neighbors. procedure.To avoid these problems,we organize the mem- bers of a component in a binary tree,and leave the task of Let v,...vk be the nodes that reply with 'accept.'For communicating with other components to the leaves,with each odd i less than k,send a message (u,vi,pair,vi+1) the root acting as a global coordinator.In this construc- to vi and (u,vi+1,pair,vi)to vi+1.If k is odd,send tion,the neighbors set is distributed across the leaf nodes (u,parent,respond_probe,vk)to parent. while the other components of the supernode's state reside at the root.The resulting protocol is similar in many ways Else u is an internal node: to the classic distributed minimum spanning tree protocol For each child node c,send (u,c,initiate_probe)and of Gallager et al.8,although the internal communication wait for all replies costs of components are reduced by our ability to construct a balanced tree by adding new edges to the communication Let V be the union of all sets of nodes appear- graph. ing in the replies.If V v1,v2,send messages In carrying out this strategy,we have to be very careful (u,v1,pair,v2)and (u,v2,pair,v1)to vi and v2,and to ensure that the atomic operations of the daemon threads send (u,parent,respond-probe,to parent.If instead continue to appear atomic.If we were willing to accept sub- IV<1,send (u,parent,respond-probe,V)to parent.with T(0, ·, ·) = T(·, 0, ·) = T(·, ·, 0) = 0. We will now show by induction on w+m+n that T(w, m, n) ≤ w min(m, n). Clearly this holds for the base cases. Now consider T(w, n, m) where w, n, and m are all nonzero, and suppose the bound holds for w ′+n ′+m′ < w+n+m. Then T(w, n, m) ≤ 1 + T(w − 1, n0, m0) + T(w − 1, n1, m1) ≤ 1 + (w − 1) min(n0, m0) + (w − 1) min(n1, m1) ≤ 1 + (w − 1) min(n, m) ≤ w min(n, m). 3.2.3 Merging Many Patricia trees To merge many Patricia trees, observe first that in the pairwise merge procedure above a new root is determined immediately—it is not necessary to wait for the recursive merges of subtrees to complete. It is thus possible to start a new merge between the combined tree (as represented by its new root) and another Patricia tree (which may also be the result of a recently-initiated merge) without waiting for the subtree merges to complete. If we think of the first merge operation as a wave propagating down through the trees, the second merge operation propagates as a wave just a few steps behind it. Subsequent merge operations can be similarly pipelined, so long as we have enough processes to handle the operations at the individual tree nodes. The result is that a tree of merges of maximum depth k can be completed in O(k + W) time. We do not use pipelined merges, as limitations of the pairing algorithm and supernode simulation require merges to be carried out sequentially. However, we can imagine an improved pairing algorithm that chooses new pairings while merges are still in progress. The ability to pipeline merges may also have other applications, such as building a Patricia tree from a pre-existing but unsorted balanced tree of nodes obtained by some other means. 3.3 Simulating Supernodes Intuitively, the idea behind the tree-construction algorithm is to use the merging procedure to join each pair of nodes selected by the pairing algorithm into a single component that acts like a single “supernode” in the next round of pairing. If contention were not an issue, we could simulate such a supernode easily by choosing a single representative of each component, and have it handle all the edges that previously went into some member of the component. The problem with this approach is that we will quickly raise the degree of the representative nodes; toward the end of the algorithm we this accumulation of edges could produce both linear contention and a linear slowdown in the pairing procedure. To avoid these problems, we organize the members of a component in a binary tree, and leave the task of communicating with other components to the leaves, with the root acting as a global coordinator. In this construction, the neighbors set is distributed across the leaf nodes, while the other components of the supernode’s state reside at the root. The resulting protocol is similar in many ways to the classic distributed minimum spanning tree protocol of Gallager et al. [8], although the internal communication costs of components are reduced by our ability to construct a balanced tree by adding new edges to the communication graph. In carrying out this strategy, we have to be very careful to ensure that the atomic operations of the daemon threads continue to appear atomic. If we were willing to accept substantial overhead, a natural approach would be to serialize the processing of incoming messages at the leaves using a parallel-prefix computation [15], taking advantage that the effects of processing the various incoming messages on the states can be summarized as simple state updates plus possible assignment to the variables waiting and obj, and such operations are composable. However, this approach could in the worst case require a parallel-prefix operation on the entire tree to collect and process a single incoming message, giving an O(n) worst-case blow-up in message traffic. Instead, we will consider the structure of the pairing algorithm carefully, and show that many incoming messages can be sent directly to the root of the tree without significantly increasing contention, while others can be processed using a convergecast operations. The key observation is that at any time a supernode has at most one designated predecessor and at most one designated successor node, and that only these nodes can send propose, pair, and no pair messages. So these messages (and their responses) can be sent directly between roots, and the roots can update the state of the simulated supernode locally. (To enable this, we assume that all messages are extended by including the identity of the root node of the sending component, and all message-ids within.) For reasons of space, we do not discuss processing of these messages further. However, probe messages and their responses occur in much greater abundance, and thus require special handling. 3.3.1 Consolidating Probes and Probe Responses Probe messages appear in the algorithm in two places: in the main thread, the supernode sends probes to all neighbors and waits for responses, and in the daemon thread handling received probes, the supernode must accept only the first probe. The main thread must also collect up to d responses per leaf node and pair off those that accept. The task of sending probes and collecting responses is handled by a modified convergecast procedure, initiated by the root. Pseudocode for each node’s role in this procedure is given below. A wrinkle that does not appear in the simple pairing algorithm is the same component response; this allows a node to detect that its neighbor is in the same component and should not be troubled further. Upon receiving message (parent, u, initiate probe) from node parent do: If u is a leaf node: For each v ∈neighbors, send a message (u, v, probe, root) to v and wait for all replies. For each node v that replies with (v, u, same component), remove v from neighbors. Let v1, . . . vk be the nodes that reply with ‘accept.’ For each odd i less than k, send a message (u, vi, pair, vi+1) to vi and (u, vi+1, pair, vi) to vi+1. If k is odd, send (u, parent, respond probe, vk) to parent. Else u is an internal node: For each child node c, send (u, c, initiate probe) and wait for all replies. Let V be the union of all sets of nodes appearing in the replies. If V = v1, v2, send messages (u, v1, pair, v2) and (u, v2, pair, v1) to v1 and v2, and send (u, parent, respond probe,) to parent. If instead |V | < 1, send (u, parent, respond probe, V ) to parent