Introduction to Disjoint Set Data Structure | DSU

Revision en2, by kartik8800, 2023-09-18 23:31:55

Introduction to Disjoint Set Data Structures

Hello Codeforces!

I recently read about Disjoint Sets in the book Introduction to Algorithms and wanted to share what I learned in a simplified manner.

So the article below (and the corresponding videos) is an attempt to create a good starting point for people who:

  1. want to learn about DSU
  2. want to practice some good DSU problems

Contents

  1. Introduction through an illustrative problem (video version: https://youtu.be/JDycPHW4kIs?si=WEp5Ft2jBWp9KnO2)
  2. Sample SPOJ Problem and Implementation (video version: https://youtu.be/O4w-aX5mSks?si=vr0XSbyUswcXx-Yv)
  3. Upcoming Practice Problems
  4. Some Applications and References

We will learn about disjoint set data structures and their operations, union and find, through a sample problem.

Illustrative Problem

Q: Given an undirected $$$Graph(V, E)$$$, answer Q queries of the form $$$(u, v)$$$, where the answer to a query is true if there is a path from u to v and false otherwise.

image

So for the above graph:

  1. $$$query(D,A)$$$ = true
  2. $$$query(D,C)$$$ = true
  3. $$$query(D,F)$$$ = false

Some solutions to the above problem

Solution 1:

  1. For each query $$$(u, v)$$$, perform a BFS/DFS from u and check whether v is visited.
  2. Time complexity will be $$$O(Q * (V + E))$$$.
  3. In the worst case the graph can have on the order of $$$V^2$$$ edges.
  4. So the worst-case complexity can be $$$O(Q * V^2)$$$.
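As a baseline, Solution 1 could be sketched roughly as below. This is only an illustration: the function name `reachable`, the adjacency-list format and the edge list are assumptions (the edges are hypothetical, chosen only so that the three sample queries above give the same answers).

```python
from collections import deque

def reachable(adj, u, v):
    """Return True if v can be reached from u, using one BFS per query."""
    seen = {u}
    queue = deque([u])
    while queue:
        node = queue.popleft()
        if node == v:
            return True
        for nxt in adj.get(node, []):
            if nxt not in seen:
                seen.add(nxt)
                queue.append(nxt)
    return False

# Hypothetical undirected graph (edges assumed for illustration only).
edges = [("A", "B"), ("B", "C"), ("C", "D"), ("A", "E"), ("G", "H")]
adj = {}
for a, b in edges:
    adj.setdefault(a, []).append(b)
    adj.setdefault(b, []).append(a)
adj.setdefault("F", [])  # F is an isolated node

answers = [reachable(adj, "D", "A"),
           reachable(adj, "D", "C"),
           reachable(adj, "D", "F")]
```

Each call may visit every node and edge, which is where the $$$O(Q * (V + E))$$$ bound comes from.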

Solution 2:

  1. Perform one BFS/DFS per node of the graph.
  2. When the BFS is started from node X, store all the nodes that can be visited from X.
  3. Per-query time is $$$O(1)$$$, so $$$O(Q)$$$ overall for the queries.
  4. Preprocessing time is $$$O(V * (V + E)) = O(V^3)$$$ in the worst case.
  5. Hence $$$O(Q + V^3)$$$ overall.

What is a Disjoint Set Data Structure?

  1. It is a collection of disjoint dynamic sets: $$$D = \{S_1, S_2, S_3, \ldots, S_k\}$$$
  2. Each set has a representative R and consists of some elements.
  3. Assume the total number of elements is N: $$$Size(S_1) + Size(S_2) + \ldots + Size(S_k) = N$$$

A disjoint set structure supports:

  1. $$$MAKE-SET(X):$$$ Creates a new Set with only element X and representative X.
  2. $$$FIND(X):$$$ Returns the representative of the set to which X belongs.
  3. $$$UNION(X, Y):$$$ Unites the sets containing the elements X and Y into a single new set. The representative of the new set is usually either the representative of X's old set or the representative of Y's old set.

Diagrammatic example of a disjoint set structure with 8 elements in total and 3 sets:

image

From above:

  1. $$$Find(H) = G$$$
  2. $$$Find(F) = F$$$
  3. $$$Find(B) =Find(E) = A$$$

[Assume that A, G and F are the representative elements of their sets]

Using Disjoint Set DS for solving the problem

  1. Run $$$MAKE-SET(u)$$$ for each of the V nodes in the graph.
  2. Run $$$UNION(u,v)$$$ for each edge $$$(u,v)$$$ in the graph.
  3. For each query $$$(u,v)$$$:
     a) If $$$FIND(u) == FIND(v)$$$, then answer = true.
     b) Else answer = false.

Running steps 1 and 2 on the sample graph constructs the disjoint set data structure shown in the diagram.
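The three steps above could be sketched as follows. The `DisjointSet` class here is only a placeholder (a plain parent map with no optimisations), and the class name, function name and edge list are assumptions made for illustration; the edges are hypothetical and only chosen to match the sets in the diagram.

```python
class DisjointSet:
    """Placeholder DSU: plain parent pointers, no heuristics."""

    def __init__(self):
        self.parent = {}

    def make_set(self, x):
        self.parent[x] = x  # x is its own representative

    def find(self, x):
        while self.parent[x] != x:
            x = self.parent[x]
        return x

    def union(self, x, y):
        rx, ry = self.find(x), self.find(y)
        if rx != ry:
            self.parent[ry] = rx

def answer_queries(nodes, edges, queries):
    dsu = DisjointSet()
    for u in nodes:            # step 1: MAKE-SET for every node
        dsu.make_set(u)
    for u, v in edges:         # step 2: UNION for every edge
        dsu.union(u, v)
    # step 3: two nodes are connected iff they share a representative
    return [dsu.find(u) == dsu.find(v) for u, v in queries]

nodes = list("ABCDEFGH")
edges = [("A", "B"), ("B", "C"), ("C", "D"), ("A", "E"), ("G", "H")]  # assumed
queries = [("D", "A"), ("D", "C"), ("D", "F")]
result = answer_queries(nodes, edges, queries)
```

Note that the queries never touch the graph again: once the sets are built, only FIND is needed.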

Time complexity for DSU solution

Overall Complexity is sum of:

  1. $$$O(V * MAKE-SET)$$$
  2. $$$O(E * UNION) = O(V^2 * UNION)$$$
  3. $$$O(Q * FIND)$$$

Disjoint Set — Linked List Implementation

image

  1. Each set is represented as a linked list.
  2. The set has a HEAD pointer to the representative element and also a TAIL pointer.
  3. Each element of the set has a back-pointer to the set.

Complexity analysis for the linked list implementation

  1. Make Set is O(1) -> only need to create a new set with 1 element.
  2. Find Set is O(1) -> thanks to back-pointers.
  3. Union is linear in the size of the appended (2nd) set, which in the worst case is the longer one -> no thanks to back-pointers (the back-pointers of all elements of the 2nd set need to be updated to point to the 1st set).

Note: For a total of N elements in the collection there will be at most N-1 union operations that actually merge two sets, since after that all elements will be in the same set.

The worst case for union is when:

  1. Initially all sets have size 1.
  2. 1st union: we unite two sets of size 1 and get a set of size 2 -> cost is 1 back-pointer change.
  3. 2nd union: we append the set of size 2 to a set of size 1 -> cost is 2 back-pointer changes.
  4. i-th union: we append the set of size i to a set of size 1 -> cost is i back-pointer changes.
  5. Overall cost over the N-1 union operations is $$$1 + 2 + 3 + \ldots + (N-1) = O(N^2)$$$

Hence union is still $$$O(N)$$$ per operation in the worst case, even when averaged over the whole sequence.
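The quadratic behaviour can be reproduced with a small sketch of the linked list representation. The class and function names below are assumptions, and `union` returns the number of back-pointer changes purely so that the cost can be inspected.

```python
class Node:
    def __init__(self, value):
        self.value = value
        self.next = None
        self.owner = None  # back-pointer to the set object

class LinkedSet:
    def __init__(self, node):
        self.head = node   # representative element
        self.tail = node
        node.owner = self

def make_set(value):
    node = Node(value)
    LinkedSet(node)
    return node

def find(node):
    return node.owner.head.value  # O(1) thanks to the back-pointer

def union(x, y):
    """Append y's list to x's list; return the number of back-pointers changed."""
    sx, sy = x.owner, y.owner
    if sx is sy:
        return 0
    changed = 0
    cur = sy.head
    while cur is not None:   # every element of the 2nd set is re-pointed
        cur.owner = sx
        changed += 1
        cur = cur.next
    sx.tail.next = sy.head   # splice the 2nd list after the 1st
    sx.tail = sy.tail
    return changed

n = 6
nodes = [make_set(i) for i in range(n)]
# Worst case: always append the growing set to a fresh singleton.
costs = [union(nodes[i], nodes[i - 1]) for i in range(1, n)]
```

Here `costs` comes out as 1, 2, 3, 4, 5, matching the sum in the list above.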

Weighted Union Heuristic for the linked list implementation

While performing $$$union(x,y)$$$:

  1. Always take the smaller set and attach it to the larger set.
  2. This requires maintaining the size of each set (which is easy: add the two sizes on every union).

Complexity analysis: Union is now amortised $$$O(logN)$$$ (a single union can still cost $$$O(N)$$$), but why?

  1. The cost of a union operation is the cost of changing the back-pointers of the elements in the smaller set.
  2. Say we change the back-pointer of an element X whose current set $$$S_x$$$ has $$$|S_x|$$$ elements. Since X was in the smaller set, the resulting set has at least $$$2 * |S_x|$$$ elements.
  3. If the back-pointer of X has been changed K times, X's set must contain $$$>= 2^K$$$ elements.
  4. K can be at most $$$log_2(N)$$$, as we only have N elements.
  5. Hence for any given element the back-pointer changes at most logN times, and the overall cost of all unions is $$$<= NlogN$$$.
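The argument above can be checked on a small example. The sketch below is an assumed implementation of the weighted union heuristic on linked lists; the `moves` counter and the two test orders are additions for illustration only.

```python
import math

class Node:
    def __init__(self, value):
        self.value = value
        self.next = None
        self.owner = None  # back-pointer to the set object
        self.moves = 0     # how many times the back-pointer was changed

class LinkedSet:
    def __init__(self, node):
        self.head = self.tail = node
        self.size = 1
        node.owner = self

def make_set(value):
    node = Node(value)
    LinkedSet(node)
    return node

def union(x, y):
    """Attach the smaller set to the larger one; return back-pointers changed."""
    big, small = x.owner, y.owner
    if big is small:
        return 0
    if big.size < small.size:
        big, small = small, big
    cur = small.head
    while cur is not None:      # only the smaller set is re-pointed
        cur.owner = big
        cur.moves += 1
        cur = cur.next
    big.tail.next = small.head
    big.tail = small.tail
    big.size += small.size
    return small.size

n = 8
# Order that was quadratic without the heuristic: now 1 change per union.
nodes = [make_set(i) for i in range(n)]
chain_cost = sum(union(nodes[i], nodes[i - 1]) for i in range(1, n))

# Merging equal-sized sets is the expensive case for this heuristic.
nodes = [make_set(i) for i in range(n)]
balanced_cost = 0
step = 1
while step < n:
    for i in range(0, n, 2 * step):
        balanced_cost += union(nodes[i], nodes[i + step])
    step *= 2
max_moves = max(node.moves for node in nodes)
```

For N = 8 the balanced order costs 12 pointer changes in total and no element is moved more than 3 times, which stays within the logN per element and NlogN overall bounds.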

Revisiting the sample problem

Worst Case complexity of Graph problem has now Improved :)

  1. $$$O(V * MAKE-SET) = O(V)$$$
  2. $$$O(E * UNION) = O(V^2 * UNION) = O(V^2 * logV)$$$
  3. $$$O(Q * FIND) = O(Q)$$$

So $$$O(V^2 * logV)$$$ instead of $$$O(V^3)$$$

Disjoint Set — Forest Implementation

image

  1. Each set is represented as a tree.
  2. Each element is a node of the tree and maintains a pointer to its parent in the tree.
  3. The representative element is the parent of itself.

$$$Find(X) = \begin{cases} X & \text{if } parent[X] = X \\ Find(parent[X]) & \text{otherwise} \end{cases}$$$

Forest Implementation — Time Complexities

We may still end up getting a chain :(

Worst case complexities:

  1. UNION is $$$O(1) + O(FIND)$$$ in the worst case (we only need to change the parent pointer of one representative to the other; the expensive part is finding the representatives using FIND).
  2. MAKE SET is $$$O(1)$$$ in the worst case (we only need to create a set with 1 element which is its own parent).
  3. FIND however is $$$O(N)$$$ in the worst case (the tree may degenerate into a linked list).
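A minimal sketch of the plain forest representation, without any heuristic, shows how such a chain can appear. The function names and the `depth` helper are assumptions for illustration; `union` here always hangs the first root under the second.

```python
parent = {}

def make_set(x):
    parent[x] = x  # the representative is its own parent

def find(x):
    # Direct translation of the recursive definition of Find.
    return x if parent[x] == x else find(parent[x])

def union(x, y):
    rx, ry = find(x), find(y)
    if rx != ry:
        parent[rx] = ry  # no heuristic: first root goes under second root

def depth(x):
    """Number of parent pointers followed from x to its representative."""
    steps = 0
    while parent[x] != x:
        x = parent[x]
        steps += 1
    return steps

n = 6
for i in range(n):
    make_set(i)
for i in range(1, n):
    union(i - 1, i)  # each union puts the old root under a new node

chain_depth = depth(0)
representative = find(0)
```

With this order of unions the forest is the single chain 0 -> 1 -> ... -> 5, so a find on element 0 follows N-1 pointers.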

Time Complexities with Heuristics

Heuristic: Union by Rank

While performing a union, always take the set (tree) with the smaller height and attach it to the root of the set with the greater height.

  1. The height of any tree after N-1 unions will be at most of the order of logN.
  2. This ensures Find is no worse than $$$O(logN)$$$.

Heuristic: Path Compression

When performing a find operation, change the parent pointer of each visited node to point directly to the representative of its set.

image

The time complexity when applying both heuristics together is:

  1. Make Set is $$$O(1)$$$
  2. Find Set is amortised $$$O(\alpha(n))$$$
  3. Union is amortised $$$O(\alpha(n))$$$
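Both heuristics together could be sketched as below. This is one common way to write it, not necessarily the implementation used in the linked editorial; the class name `DSU`, the 0-based integer elements and the boolean return value of `union` are assumptions.

```python
class DSU:
    def __init__(self, n):
        self.parent = list(range(n))  # MAKE-SET for elements 0..n-1
        self.rank = [0] * n           # upper bound on the height of each tree

    def find(self, x):
        # First pass: locate the representative.
        root = x
        while self.parent[root] != root:
            root = self.parent[root]
        # Second pass (path compression): point every visited node at the root.
        while self.parent[x] != root:
            nxt = self.parent[x]
            self.parent[x] = root
            x = nxt
        return root

    def union(self, x, y):
        """Merge the sets of x and y; return False if they were already merged."""
        rx, ry = self.find(x), self.find(y)
        if rx == ry:
            return False
        if self.rank[rx] < self.rank[ry]:
            rx, ry = ry, rx           # rx is now the root of higher rank
        self.parent[ry] = rx          # union by rank: lower tree goes under
        if self.rank[rx] == self.rank[ry]:
            self.rank[rx] += 1
        return True

dsu = DSU(8)
dsu.union(0, 1)
dsu.union(2, 3)
dsu.union(0, 2)   # element 3 is now two steps away from the root
dsu.union(4, 5)
root_of_3 = dsu.find(3)  # compresses the path from 3 to the root
```

The iterative `find` avoids deep recursion; a recursive version with the same effect is equally valid.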

What is $$$\alpha(n)$$$?

  1. $$$\alpha(n)$$$ is the inverse of the (very fast growing) Ackermann function; in Introduction to Algorithms it is defined as $$$\alpha(n) = \min\{k : A_k(1) \ge n\}$$$
  2. $$$\alpha(n) <= 4$$$ for all $$$n <= 16^{512}$$$
  3. $$$16^{512} \gg 10^{80}$$$
  4. $$$10^{80}$$$ is the estimated number of atoms in the observable universe

Hence for all practical purposes $$$\alpha(n) <= 4$$$, i.e. a constant.

The proof is harder and outside the scope of this article; refer to Introduction to Algorithms by Thomas H. Cormen et al.

Revisiting the sample problem

  1. Make Set is $$$O(1)$$$
  2. Find Set is amortised $$$O(\alpha(n))$$$, effectively $$$O(1)$$$
  3. Union is amortised $$$O(\alpha(n))$$$, effectively $$$O(1)$$$

Worst Case complexity of Graph problem has now Improved :)

  1. $$$O(V * MAKE-SET) = O(V)$$$
  2. $$$O(E * UNION) = O(V^2 * UNION) = O(V^2 * \alpha(V))$$$
  3. $$$O(Q * FIND) = O(Q)$$$

Hence time complexity is now $$$O(V^2 + Q)$$$ for all practical purposes.

SPOJ Problem — FRNDCIRC + Generic Implementation

Editorial

Upcoming Practice Problems

Currently I have planned the problem https://codeforces.me/problemset/problem/150/B and will soon be adding both a written and a video editorial for it.

A few other practice problems can be found here: https://codeforces.me/blog/entry/55219?#comment-390897 (DSU tag). I will be using some of these to create more editorials.

If you have more suggestions, please add them in the comments.

Applications and References

Some direct applications:

  1. Finding cycles in a graph
  2. Kruskal's minimum spanning tree algorithm
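Both applications rely on the same check: an edge whose endpoints already share a representative would close a cycle. A rough sketch of Kruskal's algorithm built on that check is given below. The function name, the `(weight, u, v)` edge format and the sample edges are assumptions, and the inlined DSU uses union by size with path halving, which are variants of the heuristics described earlier.

```python
def kruskal(n, edges):
    """edges: list of (weight, u, v) with nodes 0..n-1.
    Returns (total_weight, chosen_edges) of a minimum spanning forest."""
    parent = list(range(n))
    size = [1] * n

    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x

    total, chosen = 0, []
    for w, u, v in sorted(edges):          # process edges by increasing weight
        ru, rv = find(u), find(v)
        if ru == rv:
            continue                       # this edge would close a cycle
        if size[ru] < size[rv]:
            ru, rv = rv, ru
        parent[rv] = ru                    # union by size
        size[ru] += size[rv]
        total += w
        chosen.append((w, u, v))
    return total, chosen

# Hypothetical weighted graph on 4 nodes.
edges = [(1, 0, 1), (4, 0, 2), (2, 1, 2), (7, 2, 3), (3, 1, 3)]
total, chosen = kruskal(4, edges)
```

For cycle detection alone, the same loop can simply report the first edge for which `ru == rv`.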

Some references:

  1. https://www.youtube.com/@AlgosWithKartik
  2. Introduction to Algorithms (book) by Cormen, Leiserson, Rivest and Stein
  3. CP algorithms: https://cp-algorithms.com/data_structures/disjoint_set_union.html
