next up previous contents index CD CD Algorithms
Next: Kd-Trees Up: Data Structures Previous: Graph Data Structures

Set Data Structures



Input description: A universe of items tex2html_wrap_inline26702 and a collection of subsets tex2html_wrap_inline26704 , where tex2html_wrap_inline26706 .

Problem description: Represent each subset so as to efficiently (1) test whether tex2html_wrap_inline26708 , (2) find the union or intersection of tex2html_wrap_inline26710 and tex2html_wrap_inline26712 , and (3) insert or delete members of S.

Discussion: In mathematical terms, a set is an unordered collection of objects drawn from a fixed universal set. However, it is usually useful for implementation to represent each set in a single canonical order, typically sorted, so as to speed up or simplify various operations. Sorted order turns the problem of finding the union or intersection of two subsets into a linear-time operation - just sweep from left to right and see what you are missing.     It also makes possible element searching in sublinear time. Finally, printing the elements of a set in a canonical order paradoxically reminds us that order really doesn't matter.   

We distinguish sets from two other kinds of objects: strings and dictionaries. If there is no fixed-size universal set, a collection of objects is best thought of as a dictionary, as discussed in Section gif.   If the order does matter in a subset, i.e. if tex2html_wrap_inline26714 is not the same as tex2html_wrap_inline26716 , then your structure is more profitably thought of as a string, so see Sections gif and gif.    

When each subset has cardinality exactly two, they form edges in a graph whose vertices are the universal set. A system of subsets with no restrictions on the cardinality of its members is called a hypergraph. It often can be profitable to consider whether your problem has a graph-theoretical analogy, like connected components or shortest path in a hypergraph.  

Your primary alternatives for representing arbitrary systems of subsets are:

In many applications, the subsets are all pairwise disjoint, meaning that each element is in exactly one subset.   For example, consider maintaining the connected components of a graph or the party affiliations of politicians.     Each vertex/hack is in exactly one component/party. Such a system of subsets is called a set partition.   Algorithms for constructing partitions of a given set are provided in Section gif.

For data structures, the primary issue is maintaining a given set partition as things change over time, perhaps as edges are added or party members defect. The queries we are interested in include ``which set is a particular item in?'' and ``are two items in the same set?'' as we modify the set by (1) changing one item, (2) merging or unioning two sets, or (3) breaking a set apart.   Your primary options are:

Neither of these options provides access to all of the items in a particular subset without traversing all the items in the set. However, both can be appropriately augmented with extra pointers if it is important that this operation be fast.

Implementations: LEDA (see Section gif) provides dictionary data structures to maintain sets and the union-find data structure to maintain set partitions, all in C++.    

LINK is an environment for combinatorial computing that provides special support for hypergraphs, including visualization of hypergraphs.   Although written in C++, it provides an additional Scheme language interface for interacting with the graphs.   LINK is available from

Many textbooks contain implementations of the union-find data structure, including [MS91] (see Section gif).   An implementation of union-find underlies any implementation of Kruskal's minimum spanning tree algorithm. Section gif contains a selection of minimum spanning tree codes.    

Notes: Optimal algorithms for such set operations as intersection and union were presented in [Rei72]. Good expositions on set data structures include [AHU83].

Galil and Italiano [GI91] survey data structures for disjoint set union.   Expositions on the union-find data structure appear in most algorithm texts, including [CLR90, MS91]. The upper bound of tex2html_wrap_inline26728 on m union-find operations on an n-element set is due to Tarjan [Tar75], as is a matching lower bound on a restricted model of computation [Tar79]. The inverse Ackerman function tex2html_wrap_inline26730 grows notoriously slowly, so this performance is close to linear. Expositions on the Ackerman bound include [AHU74]. An interesting connection between the worst-case of union-find and the length of Davenport-Schintzl sequences, a combinatorial structure that arises in computational geometry, is established in [SA95].     

The power set of a set S is the collection of all tex2html_wrap_inline26732 subsets of S.   Explicit manipulation of power sets quickly becomes intractable due to their size.   Implicit representations of power sets in symbolic form becomes necessary for nontrivial computations. See [BCGR92] for algorithms for and computational experience with symbolic power set representations.

Related Problems: Generating subsets (see page gif), generating partitions (see page gif), set cover (see page gif), minimum spanning tree (see page gif).        

next up previous contents index CD CD Algorithms
Next: Kd-Trees Up: Data Structures Previous: Graph Data Structures

Mon Jun 2 23:33:50 EDT 1997