are of the form (s,t,)or (s,t,o,U),where s is the sender, In Section 3.2,we show that Patricia trees 17,using a t is the receiver,o is a message type,and U (if present)is a parallelized version of the Patricia tree merge procedure of vector of O(1)process identifiers.The state of each process Okasaki and Gill 18,are a good choice for the balanced tree consists of a state variable g together with a set of successors data structure.Using Patricia trees,we obtain a sorted final S.Upon receiving a set M of one or more messages,a data structure in time O(W log n)(or O((d+W)log n)in the process s adds all process identifiers that appear in M to lower-bandwidth synchronous model).with O(d)contention S,and then executes a probabilistic transition function 6 and O((d+W)nlog n)messages. mapping (g,S,M)to (g',m),where g'is a new state and Finally,we briefly discuss constructing a ring (Section 3.4) m is either L (indicating no message sent)or a message and the effects of node departures and arrivals (Section 3.5). (s,t,o,u)where t is in S and u is in SU{).When and how this message is delivered depends on whether we are 3.1 Pairing in a synchronous or asynchronous model:we discuss both The pairing problem has some similarities to the problem variants below. of finding a matching,but because we are not restricted in In the synchronous model,the computation proceeds in which nodes we pair off-except for the limits imposed by rounds,and all messages sent to a process s in round i are communication along edges of the knowledge graph-our al- delivered simultaneously in round i+1.In other words gorithm can perform an initial pruning step that pairs off we assume the standard synchronous message-passing model many of the nodes deterministically.leaving only a degree- with the added restrictions that processes can only commu- 2 surviving subgraph.We then run a simple randomized nicate with "known"processes and can only send one mes- sage per round.This yields a model essentially identical to matching algorithm on this subgraph using a coin-flipping technique similar to that of Law and Siu16 to resolve con- the one used in the resource discovery literature,except that flicts. we have added a limitation on the number of identifiers that From a very high-level perspective,the algorithm proceeds can be sent in a single message.We are also interested in as follows.Start with an directed graph G with maximum minimizing contention.which we take to be the maximum degree d.Each node starts by sending a probe to all of its number of messages received by any single process in any successors.The recipient of such a probe responds by ac- single round of the computation cepting the first one and rejecting subsequent probes;in In the asynchronous model,messages arrive one at a time this way every node has at most one designated predecessor after a delay that may vary from message to message and producing a graph of designated predecessors Gi in which that is controlled by an adversary scheduler.It is assumed, every node has in-degree at most 1.This graph is further however,that no message is delayed by more than one time pruned by having each node with two or more successors unit and that messages between any two nodes are delivered pair them off,leaving a graph G2 in which every node has in FIFO order.Processing time is treated as zero. both in-degree and out-degree at most 1.Each component Defining contention for an asynchronous model can be in such a graph is either a line or a cycle,and a constant tricky,as the adversary could choose to deliver many mes- fraction of the nodes can be matched along edges by simply sages in a short period of time;we adopt a simple measure having each node that is not at the end of a line flip a coin of contention equal to the maximum number of distinct mes- to decide whether to pair with its remaining predecessor or sages with the same recipient that are in transit at any point successor,and pairing those adjacent nodes whose choices in time.-Assuming that each process sends at most one are consistent.A simple calculation shows that on average message to each neighbor in the knowledge graph before re- half of the nodes in G2(and all nodes in G-G2)are paired ceiving a reply,the contention is trivially bounded by the by this procedure,from which we can deduce that about half degree of the knowledge graph in both the synchronous and of the nodes are paired on average in the worst case where asynchronous models. G-G2 is empty. The algorithm sketched above can be implemented di- 3.ALGORITHMS rectly in a synchronous system where all nodes start simul- taneously,because after the initial probing phase there is This section contains our main results,a family of algo- no confusion about the structure of the graph,and after rithms for quickly constructing tree-structured overlay net- a phase consisting of a known number of rounds any un- works starting with a weakly-connected communication graph. matched nodes can simply restart the protocol along with We begin by describing (in Section 3.1)a randomized dis- any supernodes resulting from merges in the previous phase. tributed algorithm for pairing nodes:this produces a match- But in an asynchronous setting the situation is more com- ing on the set of nodes that includes a constant fraction of plicated.While some of the early pruning steps can still be the nodes on average,in time O(d),with O(d)contention used (in particular,we still have each node accept and re- and O(n)messages,each of size at most O(W),where W is spond to a single designated predecessor),the final matching the maximum identifier size.Paired nodes are then joined stages require more care. together into simulated combined nodes that are internally There are two main problems.The first is that no node organized as balanced trees (see Section 3.3).The partici- can detect when a phase of the pairing protocol has fin- pants in each combined node are carefully deployed so that ished,so that an unmatched node cannot detect the end of the pairing and joining algorithms in later rounds still pro- a pairing phase and restart the protocol.Instead,the best duce only O(d)contention;however,communication within an unmatched node can hope for is that the faithless suitor each subtree adds an factor to the cost of communication in who spurned it initially will return to accept its advances af- the pairing algorithm that depends on the depth of the tree. ter it finishes digesting luckier candidates.But this creates 2 An alternative assumption is that each process is only guar- the possibility of creating very long chains of nodes,each anteed to accept at least one message per time unit,with waiting for the next to finish a merge that is itself delayed other messages waiting in a delivery queue.This yields sim- by waiting for nodes further down the chain. ilar time bounds.except that the running time must be mul- This problem is compounded by the fact that a node that tiplied by the contention to account for queuing delays. has not yet received a probe cannot tell whether it has noare of the form (s, t, σ) or (s, t, σ, U), where s is the sender, t is the receiver, σ is a message type, and U (if present) is a vector of O(1) process identifiers. The state of each process consists of a state variable q together with a set of successors S. Upon receiving a set M of one or more messages, a process s adds all process identifiers that appear in M to S, and then executes a probabilistic transition function δ mapping (q, S, M) to (q ′ , m), where q ′ is a new state and m is either ⊥ (indicating no message sent) or a message (s, t, σ, u) where t is in S and u is in S ∪ {⊥}. When and how this message is delivered depends on whether we are in a synchronous or asynchronous model; we discuss both variants below. In the synchronous model, the computation proceeds in rounds, and all messages sent to a process s in round i are delivered simultaneously in round i + 1. In other words, we assume the standard synchronous message-passing model with the added restrictions that processes can only communicate with “known” processes and can only send one message per round. This yields a model essentially identical to the one used in the resource discovery literature, except that we have added a limitation on the number of identifiers that can be sent in a single message. We are also interested in minimizing contention, which we take to be the maximum number of messages received by any single process in any single round of the computation. In the asynchronous model, messages arrive one at a time after a delay that may vary from message to message and that is controlled by an adversary scheduler. It is assumed, however, that no message is delayed by more than one time unit and that messages between any two nodes are delivered in FIFO order. Processing time is treated as zero. Defining contention for an asynchronous model can be tricky, as the adversary could choose to deliver many messages in a short period of time; we adopt a simple measure of contention equal to the maximum number of distinct messages with the same recipient that are in transit at any point in time.2 Assuming that each process sends at most one message to each neighbor in the knowledge graph before receiving a reply, the contention is trivially bounded by the degree of the knowledge graph in both the synchronous and asynchronous models. 3. ALGORITHMS This section contains our main results, a family of algorithms for quickly constructing tree-structured overlay networks starting with a weakly-connected communication graph. We begin by describing (in Section 3.1) a randomized distributed algorithm for pairing nodes; this produces a matching on the set of nodes that includes a constant fraction of the nodes on average, in time O(d), with O(d) contention and O(n) messages, each of size at most O(W), where W is the maximum identifier size. Paired nodes are then joined together into simulated combined nodes that are internally organized as balanced trees (see Section 3.3). The participants in each combined node are carefully deployed so that the pairing and joining algorithms in later rounds still produce only O(d) contention; however, communication within each subtree adds an factor to the cost of communication in the pairing algorithm that depends on the depth of the tree. 2An alternative assumption is that each process is only guaranteed to accept at least one message per time unit, with other messages waiting in a delivery queue. This yields similar time bounds, except that the running time must be multiplied by the contention to account for queuing delays. In Section 3.2, we show that Patricia trees [17], using a parallelized version of the Patricia tree merge procedure of Okasaki and Gill [18], are a good choice for the balanced tree data structure. Using Patricia trees, we obtain a sorted final data structure in time O(W log n) (or O((d+W) log n) in the lower-bandwidth synchronous model), with O(d) contention and O((d + W)n log n) messages. Finally, we briefly discuss constructing a ring (Section 3.4) and the effects of node departures and arrivals (Section 3.5). 3.1 Pairing The pairing problem has some similarities to the problem of finding a matching, but because we are not restricted in which nodes we pair off—except for the limits imposed by communication along edges of the knowledge graph—our algorithm can perform an initial pruning step that pairs off many of the nodes deterministically, leaving only a degree- 2 surviving subgraph. We then run a simple randomized matching algorithm on this subgraph using a coin-flipping technique similar to that of Law and Siu [16] to resolve con- flicts. From a very high-level perspective, the algorithm proceeds as follows. Start with an directed graph G with maximum degree d. Each node starts by sending a probe to all of its successors. The recipient of such a probe responds by accepting the first one and rejecting subsequent probes; in this way every node has at most one designated predecessor, producing a graph of designated predecessors G1 in which every node has in-degree at most 1. This graph is further pruned by having each node with two or more successors pair them off, leaving a graph G2 in which every node has both in-degree and out-degree at most 1. Each component in such a graph is either a line or a cycle, and a constant fraction of the nodes can be matched along edges by simply having each node that is not at the end of a line flip a coin to decide whether to pair with its remaining predecessor or successor, and pairing those adjacent nodes whose choices are consistent. A simple calculation shows that on average half of the nodes in G2 (and all nodes in G − G2) are paired by this procedure, from which we can deduce that about half of the nodes are paired on average in the worst case where G − G2 is empty. The algorithm sketched above can be implemented directly in a synchronous system where all nodes start simultaneously, because after the initial probing phase there is no confusion about the structure of the graph, and after a phase consisting of a known number of rounds any unmatched nodes can simply restart the protocol along with any supernodes resulting from merges in the previous phase. But in an asynchronous setting the situation is more complicated. While some of the early pruning steps can still be used (in particular, we still have each node accept and respond to a single designated predecessor), the final matching stages require more care. There are two main problems. The first is that no node can detect when a phase of the pairing protocol has finished, so that an unmatched node cannot detect the end of a pairing phase and restart the protocol. Instead, the best an unmatched node can hope for is that the faithless suitor who spurned it initially will return to accept its advances after it finishes digesting luckier candidates. But this creates the possibility of creating very long chains of nodes, each waiting for the next to finish a merge that is itself delayed by waiting for nodes further down the chain. This problem is compounded by the fact that a node that has not yet received a probe cannot tell whether it has no