As software engineers, our job is not just to find a solution to problems, but to find the best solution to a problem.
You don’t aspire to be just another monkey with a fancy title!
To demonstrate the power of good algorithms, let’s borrow a case study from Sedgewick and Wayne’s beautifully crafted book.
The Dynamic Connectivity problem
We start with the following problem specification: The input is a sequence of pairs of integers, where each integer represents an object of some type and we are to interpret the pair p q as meaning “p is connected to q.”
We assume that “is connected to” is an equivalence relation, which means that it is
■ Reflexive: p is connected to p.
■ Symmetric: If p is connected to q, then q is connected to p.
■ Transitive: If p is connected to q and q is connected to r, then p is connected to r.
An equivalence relation partitions the objects into equivalence classes. In this case, two objects are in the same equivalence class if and only if they are connected. Our goal is to write a program to filter out extraneous pairs (pairs where both objects are in the same equivalence class) from the sequence.
See the illustration below,
Assume the universe consists of 10 points numbered from 0-9. Initially, none of the points are connected. The inputs are the pair of numbers on the left.
The first pair (4,3). We check if there exists a connection between the pair. So, we create a connection between the two nodes and prints out the connection. Same for the next 4 inputs of (3,8), (6,5), (9,4) and (2,1).
But the next input, (8,9). Even though there is no direct connection between 8 and 9, they are part of a connected component, [3,4,8,9]. So, no new connection is to be made and nothing is printed. And so on..
Given a grid with numerous connected components, identifying the connected components or if two nodes are connected pose a computational problem. See the universe below,
You can quickly identify component consisting of five sites at the bottom left, but you might have difficulty verifying that all of the other sites are connected to one another. For a program, the task is
even more difficult, because it has to work just with site names and connections and has no access to the geometric placement of sites in the diagram. How can we tell quickly whether or not any given two sites in such a network are connected?
A solution to the problem provides an effective implementation for,
void union(int p, int q) – add connection between p and q
int find(int p) – component identifier for p (0 to N-1)
boolean connected(int p, int q) – return true if p and q are in the same component
The union() operation merges two components if the two sites are in different components, the find() operation returns an integer component identifier for a given site, the connected() operation determines whether two sites are in the same component.
As we shall soon see, the development of an algorithmic solution for dynamic connectivity thus reduces to the task of developing an implementation of this API. Every implementation has to
■ Define a data structure to represent the known connections
■ Develop efficient union(), find(), connected() implementations that are based on that data structure
As usual, the nature of the data structure has a direct impact on the efficiency of the algorithms, so data structure and algorithm design go hand in hand. The API already specifies the convention that both sites and components will be identified by int values between 0 and N-1, so it makes sense to use a site-indexed array id[] as our basic data structure to represent the components.
Initially, none of the sites are connected and each site corresponds to a component. Initially, we start with N components, each site in its own component, so we initialize id[i] to i for all i from 0 to N-1.
All of our implementations use a one-line implementation of connected() that returns the boolean value find(p) == find(q). The find(int p) methods return the identified for the component. So, it should make perfect sense to assert that, if two sites correspond to the same component, they are connected.
Implementations
We shall consider three different implementations, all based on using the site-indexed id[] array, to determine whether two sites are in the same connected component.
Quick-find
One approach is to maintain the invariant that p and q are connected if and only if id[p] is equal to id[q]. In other words, all sites in a component must have the same value in id[]. This method is called quick-find because find(p) just returns id[p], which immediately implies that connected(p, q) reduces to just the test id[p] == id[q] and returns true if and only if p and q are in the same component.
For implementing union, we first check whether they are already in the same component, in which case there is nothing to do. Otherwise, we are faced with the situation that all of the id[] entries corresponding to sites in the same component as p have one value and all of the id[] entries corresponding to sites in the same component as q have another value. To combine the two components into one, we have to make all of the id[] entries corresponding to both sets of sites the same value, as shown in the example. To do so, we go through the array, changing all the entries with values equal to id[p] to the value id[q].
We could have decided to change all the entries equal to id[q] to the value id[p]—the choice between these two alternatives is arbitrary.
The implementation is pretty simple,
public int find(int p) { return id[p]; } public void union(int p, int q) { // Put p and q into the same component. int pID = find(p); int qID = find(q); // Nothing to do if p and q are already in the same component. if (pID == qID) return; // Rename p’s component to q’s name. for (int i = 0; i < id.length; i++){ if (id[i] == pID) { id[i] = qID; } count--; } }
The find() operation is certainly quick, as it only accesses the id[] array once in order to complete the operation. But quick-find is typically not useful for large problems because union() needs to scan through the whole id[] array for each input pair.
In particular, suppose that we use quick-find for the dynamic connectivity problem and wind up with a single component. This requires at least (N-1) calls to union(). union() iterates through all the n elements at least once, which clearly gives us a N 2 complexity. This analysis generalizes to say that quick-find is quadratic for typical applications where we end up with a small number of components. Modern computers can execute hundreds of millions or billions of instructions per second, so this cost is not noticeable if N is small, but we also might find ourselves with millions or billions of sites and connections to process in a modern application. Trying to solve a problem with millions or billions of sites will not give us an acceptable time.
Most definitely Quick-Find can be improved! Let’s see that on Part 2