Next: Kd-Trees Up: Data Structures Previous: Graph Data Structures

## Set Data Structures

Input description: A universe of items and a collection of subsets , where .

Problem description: Represent each subset so as to efficiently (1) test whether , (2) find the union or intersection of and , and (3) insert or delete members of S.

Discussion: In mathematical terms, a set is an unordered collection of objects drawn from a fixed universal set. However, it is usually useful for implementation to represent each set in a single canonical order, typically sorted, so as to speed up or simplify various operations. Sorted order turns the problem of finding the union or intersection of two subsets into a linear-time operation - just sweep from left to right and see what you are missing.     It also makes possible element searching in sublinear time. Finally, printing the elements of a set in a canonical order paradoxically reminds us that order really doesn't matter.

We distinguish sets from two other kinds of objects: strings and dictionaries. If there is no fixed-size universal set, a collection of objects is best thought of as a dictionary, as discussed in Section .   If the order does matter in a subset, i.e. if is not the same as , then your structure is more profitably thought of as a string, so see Sections and .

When each subset has cardinality exactly two, they form edges in a graph whose vertices are the universal set. A system of subsets with no restrictions on the cardinality of its members is called a hypergraph. It often can be profitable to consider whether your problem has a graph-theoretical analogy, like connected components or shortest path in a hypergraph.

Your primary alternatives for representing arbitrary systems of subsets are:

• Bit vectors -   If your universal set U contains n items, an n-bit vector or array can represent any subset . Bit i will be 1 if , otherwise bit i is 0. Since only one bit is used per element, bit vectors can be very space efficient for surprisingly large values of |U|. Element insertion and deletion simply flips the appropriate bit. Intersection and union are done by ``and-ing'' or ``or-ing'' the bits together. The only real drawback of a bit vector is that for sparse subsets, it takes O(n) time to explicitly identify all members of S.
• Containers or dictionaries -     A subset can also be represented using a linked list, array, binary tree, or dictionary containing exactly the elements in the subset. No notion of a fixed universal set is needed for such a data structure. For sparse subsets, dictionaries can be more space and time efficient than bit vectors and easier to work with and program. For efficient union and intersection operations, it pays to keep the elements in each subset sorted, so a linear-time traversal through both subsets identifies all duplicates.

In many applications, the subsets are all pairwise disjoint, meaning that each element is in exactly one subset.   For example, consider maintaining the connected components of a graph or the party affiliations of politicians.     Each vertex/hack is in exactly one component/party. Such a system of subsets is called a set partition.   Algorithms for constructing partitions of a given set are provided in Section .

For data structures, the primary issue is maintaining a given set partition as things change over time, perhaps as edges are added or party members defect. The queries we are interested in include ``which set is a particular item in?'' and ``are two items in the same set?'' as we modify the set by (1) changing one item, (2) merging or unioning two sets, or (3) breaking a set apart.   Your primary options are:

• Dictionary with subset attribute -   If each item in a binary tree has associated a field recording the name of the subset it is in, set identification queries and single element modifications can be performed in the time it takes to search in the dictionary, typically . However, operations like performing the union of two subsets take time proportional to (at least) the sizes of the subsets, since each element must have its name changed. The need to perform such union operations quickly is the motivation for the ...
• Union-Find Data Structure - Suppose we represent a subset using a rooted tree, where each node points to its parent instead of its children.   Further, let the name of the subset be the name of the item that is the root. Finding out which subset we are in is simple, for we keep traversing up the parent pointers until we hit the root. Unioning two subsets is also easy. Just make the root of one of two trees point to the other, so now all elements have the same root and thus the same subset name.

Certain details remain, such as which subset should be the ultimate root of a union, but these are described in most every algorithms text. Union-Find is a fast, extremely simple data structure that every programmer should know about. It does not support breaking up subsets created by unions, but usually this is not an issue.

Neither of these options provides access to all of the items in a particular subset without traversing all the items in the set. However, both can be appropriately augmented with extra pointers if it is important that this operation be fast.

Implementations: LEDA (see Section ) provides dictionary data structures to maintain sets and the union-find data structure to maintain set partitions, all in C++.

LINK is an environment for combinatorial computing that provides special support for hypergraphs, including visualization of hypergraphs.   Although written in C++, it provides an additional Scheme language interface for interacting with the graphs.   LINK is available from http://dimacs.rutgers.edu/Projects/LINK.html.

Many textbooks contain implementations of the union-find data structure, including [MS91] (see Section ).   An implementation of union-find underlies any implementation of Kruskal's minimum spanning tree algorithm. Section contains a selection of minimum spanning tree codes.

Notes: Optimal algorithms for such set operations as intersection and union were presented in [Rei72]. Good expositions on set data structures include [AHU83].

Galil and Italiano [GI91] survey data structures for disjoint set union.   Expositions on the union-find data structure appear in most algorithm texts, including [CLR90, MS91]. The upper bound of on m union-find operations on an n-element set is due to Tarjan [Tar75], as is a matching lower bound on a restricted model of computation [Tar79]. The inverse Ackerman function grows notoriously slowly, so this performance is close to linear. Expositions on the Ackerman bound include [AHU74]. An interesting connection between the worst-case of union-find and the length of Davenport-Schintzl sequences, a combinatorial structure that arises in computational geometry, is established in [SA95].

The power set of a set S is the collection of all subsets of S.   Explicit manipulation of power sets quickly becomes intractable due to their size.   Implicit representations of power sets in symbolic form becomes necessary for nontrivial computations. See [BCGR92] for algorithms for and computational experience with symbolic power set representations.

Related Problems: Generating subsets (see page ), generating partitions (see page ), set cover (see page ), minimum spanning tree (see page ).

Next: Kd-Trees Up: Data Structures Previous: Graph Data Structures

Algorithms
Mon Jun 2 23:33:50 EDT 1997