Next: Numerical Problems Up: Data Structures Previous: Set Data Structures

## Kd-Trees

Input description: A set S of n points in k dimensions.

Problem description: Construct a tree that partitions the space by half-planes such that each point is contained in its own box-shaped region.

Discussion: Although many different flavors of kd-trees have been devised, their purpose is always to hierarchically decompose space into a relatively small number of cells such that no cell contains too many input objects.     This provides a fast way to access any input object by position. We traverse down the hierarchy until we find the cell containing the object and then scan through the few objects in the cell to identify the right one.

Typical algorithms construct kd-trees by partitioning point sets. Each node in the tree is defined by a plane through one of the dimensions that partitions the set of points into left/right (or up/down) sets, each with half the points of the parent node.   These children are again partitioned into equal halves, using planes through a different dimension. Partitioning stops after levels, with each point in its own leaf cell. Alternate kd-tree construction algorithms insert points incrementally and divide the appropriate cell, although such trees can become seriously unbalanced.

The cutting planes along any path from the root to another node defines a unique box-shaped region of space, and each subsequent plane cuts this box into two boxes. Each box-shaped region is defined by 2k planes, where k is the number of dimensions. Indeed, the `kd' in kd-tree is short for k-dimensional tree. In any search performed using a kd-tree, we maintain the current region defined by the intersection of these half-spaces as we move down the tree.

Different flavors of kd-trees differ in exactly how the splitting plane is selected. Options include:

• Cycling through the dimensions - partition first on , then before cycling back to .
• Cut along the largest dimension - select the partition dimension so as to make the resulting boxes as square or cube-like as possible. Selecting a plane to partition the points in half does not mean selecting a splitter in the middle of the box-shaped regions, since all the points may be in the left side of the box.
• Quadtrees or Octtrees -     Instead of partitioning with single planes, use all axis-parallel planes that pass through a given partition point. In two dimensions, this means creating four child cells, in 3D this means eight child cells. Quadtrees seem particularly popular on image data, where leaf cells imply that all pixels in the regions have the same color.

Nonorthogonal (i.e. not axis-parallel) cutting planes have also been used, although they make maintaining the cell boundaries more complicated.

Ideally, our partitions evenly split both the space (ensuring nice, fat, regular regions) and the set of points (ensuring a log height tree) evenly, but this can be impossible for a given point set. The advantages of fat cells become clear in many applications of kd-trees:

• Point location -   To identify which cell a query point q lies in, we start at the root and test which side of the partition plane contains q. By repeating this process on the appropriate child node, we travel the tree to find the leaf cell containing q in time proportional to its height. See Section for more on point location.
• Nearest neighbor search -   To find the point in S closest to a query point q, we perform point location to find the cell c containing q. Since c is bordered by some point p, we can compute the distance d(p,q) from p to q. Point p is likely very close to q, but it might not be the single closest neighbor. Why? Suppose q lies right at the boundary of a cell. Then q's nearest neighbor might lie just to the left of the boundary in another cell. Thus we must traverse all cells that lie within a distance of d(p,q) of cell c and verify that none of them contain closer points. With nice, fat cells, very few cells should need to be tested. See Section for more on nearest neighbor search.
• Range search -   Which points lie within a query box or region? Starting from the root, check to see whether the query region intersects or contains the cell defining the current node. If it does, check the children; if not, none of the leaf cells below this node can possibly be of interest. We quickly prune away the irrelevant portions of the space. See Section for more on range search.
• Partial key search -     Suppose we want to find a point p in S, but we do not have full information about p. Say we are looking for someone of age 35 and height 5'8'' but of unknown weight in a 3d-tree with dimensions age, weight, and height. Starting from the root, we can identify the correct decendant for all but the weight dimension. To be sure we find the right point, we must search both children of this node. We are better off the more fields we know, but such partial key search can be substantially faster than checking all points against the key.

Kd-trees are most effective data structures for small and moderate numbers of dimensions, say from 2 up to maybe 20 dimensions. As the dimensionality increases, they lose effectiveness, primarily because the ratio of the volume of a unit sphere in k-dimensions shrinks exponentially compared to a unit cube in k-dimensions.    Thus exponentially many cells will have to be searched within a given radius of a query point, say for nearest-neighbor search. Also, the number of neighbors for any cell grows to 2k and eventually become unmanageable.

The bottom line is that you should try to avoid working in high-dimensional spaces, perhaps by discarding the least important dimensions.

Implementations: Ranger is a tool for visualizing and experimenting with nearest neighbor and orthogonal range queries in high-dimensional data sets, using multidimensional search trees.   Four different search data structures are supported by Ranger: naive kd-trees, median kd-trees, nonorthogonal kd-trees, and the vantage point tree. For each of these, Ranger supports queries in up to 25 dimensions under any Minkowski metric.   It includes generators for a variety of point distributions in arbitrary dimensions.   Finally, Ranger provides a number of features to aid in visualizing multidimensional data, best illustrated by the accompanying video [MS93]. To identify the most appropriate projection at a glance, Ranger provides a matrix of all two-dimensional projections of the data set. Ranger is written in C,   runs on Silicon Graphics and HP workstations, and is available from the algorithm repository.

The 1996 DIMACS implementation challenge focuses on data structures for higher-dimensional data sets. The world's best kd-tree implementations were likely to be identified in the course of the challenge, and they are accessible from http://dimacs.rutgers.edu/.

Bare bones implementations in C of kd-tree and quadtree data structures appear in [GBY91]. See Section for details on how to ftp them.

Notes: The best reference on kd-trees and other spatial data structures are two volumes by Samet [Sam90a, Sam90b], in which all major variants are developed in substantial detail.

Bentley [Ben75] is generally credited with developing kd-trees, although they have the typically murky history associated with most folk data structures. The most reliable history is likely from Samet [Sam90b].

An exposition on kd-trees for orthogonal range queries in two dimensions appears in [PS85].   Expositions of grid files and other spatial data structures include [NH93].

Algorithms that quickly produce a point provably close to the query point are a recent development in higher-dimensional nearest neighbor search.   A sparse weighted-graph structure is built from the data set, and the nearest neighbor is found by starting at a random point and walking greedily in the graph towards the query point. The closest point found during several random trials is declared the winner. Similar data structures hold promise for other problems in high-dimensional spaces. See [AM93, AMN 94].

Related Problems: Nearest-neighbor search (see page ), point location (see page ), range search (see page ).

Next: Numerical Problems Up: Data Structures Previous: Set Data Structures

Algorithms
Mon Jun 2 23:33:50 EDT 1997