Disjoint Set Union
This data structure provides the following capabilities. We are given several elements, each of which is a separate set. A DSU will have an operation to combine any two sets, and it will be able to tell in which set a specific element is. The classical version also introduces a third operation, it can create a set from a new element.
Thus the basic interface of this data structure consists of only three operations: - make_set() - create a new set consisting of the new element v. - union_sets(a, b) - merges the two specified sets (the set in which the element a is located, and the set in which the element b is located). - find_set(v) - return the representative (also called leader) of the set that contains the element v. This representative is an element of its corresponding set. It is selected in each set by the data structure itself (and can change over time, namely after union_sets call). This representative can be used to check if two elements are part of the same set or not. a and b are exactly in the same set, if find_set(a) == find_set(b). Otherwise they are in different sets.
As described on more detail later, the data structure allows us to do each of these operations in almost \(O(1)\) time on average.
Also in one of the subsections an alternative structure of a DSU is explained, which achieves a slower average complexity of \(O(\log n)\), but can be more powerful than the regular DSU structure.
Build an efficient data structure
We will store the sets in the form of trees: each tree will correspond to one set. And the root of the tree will be the representative/leader of the set.
In the following image we can see the representative of such trees.
![[DSU_example.png]] In the beginning, every element starts as a single set, there fore each vertex is its own tree. Then we combine the set containing the element 1 and the et containing the element 2. Then we combine the set containing the element 3 and the set containing the element 4. And in the last step, we combine the set containing the element 1 and the set containing the element 3.
For the implementation this means that we will have to maintain an array parent that stores a reference to its immediate ancestor in the tree.
Naive implementation
We can already write the first implementation of the Disjoint Set Union data structure. It will be pretty inefficient at first, but later we can improve it using two optimizations, so that it will take nearly constant time for each function call.
As we said, all the information about the sets of elements will be kept in an array parent.
To create a new set (operation make_set(v)), we simply create a tree with root in the vertex x, meaning that it is its own ancestor.
To combine two sets (operation union_sets(a, b)), we first find the representative of the set in which a is located, and the representative of the set in which b is located. If the representatives are identical, that we have nothing to do, the sets are already merged. Otherwise, we can simply specify that one of the representatives is the parent of the other representative - thereby combining the two trees.
Finally the implementation of the find representative function (operation find_set(v)): we simply climb the ancestors of the vertex v until we reach the root, i.e. a vertex such that the reference to the ancestor leads to itself. This operation is easily implemented recursively.
void make_set(int v) {
parent[v] = v;
}
int find_set(int v) {
if (v == parent[v])
return v;
return find_set(parent[v]);
}
void union_sets(int a, int b) {
a = find_set(a);
b = find_set(b);
if (a != b)
parent[b] = a;
}However this implementation is inefficient. It is easy to construct an example, so that the trees degenerate into long chains. In that case each call find_set(v) can take \(O(n)\) time.
This is far away from the complexity that we want to have (nearly constant time). Therefore we will consider two optimizations that will allow to significantly accelerate the work.
Path compression optimization
This optimization is designed for speeding up find_set.
If we call find_set() for some vertex x, we actually find the representative p for all vertices that we visit on the path between v and the actually representative p. The trick is to make the paths for all those nodes shorter, by setting the parent of each visited vertex directly to p.
We can see the operation in the following image. On the left there is a tree, and on the right side there is the compressed tree after calling find_set(7), which shortens the paths for the visited node 7, 5, 3 and 2.
![[DSU_path_compression.png]]
The new implementation of find_set is as follows:
int find_set(int v) {
if (v == parent[v])
return v;
return parent[v] = find_set(parent[v]);
}The simple implementation does what was intended: first find the representative of the set (root vertex), and then in the process of stack unwinding the visited nodes are attached directly to the representative.
This simple modification of the operation already achieves the time complexity \(O(\log n)\) per call on average (here without proof). There is a second modification, that will make it even faster.
Union by size / rank
In this optimization we will change the union_set operation. To be precise, we will change which tree gets attached to the other one. In the naive implementation the second tree always got attached to the first one. In practice that can lead to trees containing chains of length \(O(n)\). With this optimization we will avoid this by choosing very carefully which tree gets attached.
There are many possible heuristic that can be used. Most popular are the following two approaches: In the first approach we use the size of the trees as rank, and in the second one we use the depth of the tree (more precisely, the upper bound on the tree depth, because the depth will get smaller when applying path compression).
In both approaches the essence of the optimization is the same: we attach the tree with the lower rank to the one with the bigger rank.
Here is the implementation of union by size:
void make_set(int v) {
parent[v] = v;
size[v] = 1;
}
void union_sets(int a, int b){
a = find_set(a);
b = find_set(b);
if (a != b) {
if (size[a] < size[b])
swap(a, b);
parent[b] = a;
size[a] += size[b];
}
}and here is the implementation of union by rank based on the depth of the trees:
void make_set(int v) {
parent[v] = v;
rank[v] = 0;
}
void union_sets(int a, int b) {
a = find_set(a);
b = find_set(b);
if (a != b) {
if (rank[a] < rank[b])
swap(a, b);
parent[b] = a;
if (rank[a] == rank[b])
rank[a]++;
}
}Both optimizations are equivalent in terms of time and space complexity. So in practice we can use any of them.
Time complexity
As mentioned before, if we combine both optimizations - path compression with union by size / rank - we will reach nearly constant time queries. It turns out, that the final amortized complexity is \(O(\alpha(n))\), where \(\alpha (n)\) is the inverse Ackermann function, which grows very slowly. In fact it grows so lowly, that it doesn’t exceed 4 for all reasonable \(n\) (approximately \(n<10^{600}\)).
Amortized complexity is the total time per operation, evaluated over a sequence of multiple operations. The idea is to guarantee the total time of the entire sequence, while allowing single operations to be much slower than the amortized time. E.g. in our case a single call might take \(O(\log n)\) in the worst case, but if we do \(m\) such calls back to back we will end up with an average time of \(O(\alpha (n))\).
We will also not present a proof for this time complexity, since it is quite long and complicated.
Also, it’s worth mentioning that DSU with union by size / rank, but without path compression works in \(O(\log n)\) time per query.
Linking by index / coin-flip linking
Both union by rank and union by size require that we store additional data for each set, and maintain these values during each union operation. There exist also a randomized algorithm, that simplifies the union operation a little bit: linking by index.
We assign each set a random value called the index, and we attach the set with the smaller index to the one with the larger one. It is likely that a bigger set will have a bigger than the smaller set, therefore this operation is closely related to union by size. In fact it can be proven, that this operation has the same time complexity as union by size. However in practice it is slightly slower than union by size.