Paul Brown's Weblog

ATTENTION: This blog has moved to a new location. Please update your bookmarks, browsing habits, and aggregators.

Wednesday, June 18, 2003

Ruminations on String Matching and Switch Statements

This post (and this weblog) has a new home.

I'm not really a grumpy old fart, but sometimes I feel like it... I can't let discussions (and some comments from Bob Lee) on string comparison go by without at least chiming in with a history lesson.

Perfect Hash Functions

My first comment is on perfect hash functions and probably lies off the beaten path for people who didn't have a hard-core CS education. A perfect hash function is pre-constructed based on a known (at compile time) set of keys and has the property that the function returns a distinct values for distinct keys. (If you plug in a string that is not part of the initial set, you could (and probably will) have a collision.) Java usage in a switch statement can be done in a non-awkward way:

public class PerfectHash { public static final int FRANK = f("FRANK"); public static final int BEANS = f("BEANS"); public static int f(String s) { ... } }

And then to use it:

... public foo() { switch(PerfectHash.f(s)) { case PerfectHash.FRANK: ... break; case PerfectHash.BEANS: ... break; } ... } }

The somewhat awkward part comes in generating the hash function in a convenient way (from the perspective of the IDE or build system) and this almost certainly counts as premature optimization.

The following is (probably) the easiest way to get an int to switch on:

public static final int FRANK = 0; public static final int BEANS = 1; ... public static final HashMap _map = new HashMap(); static { _map.put("FRANK", new int[] {FRANK}); _map.put("BEANS", new int[] {BEANS}); ... } public static int f(String s) { Object o = _map.get(s); return (o == null) ? -1 : ((int[])o)[0]; }

The use of int[] is simply a personal preference over java.lang.Integer.

The fact is that Bob is probably right most of the time — the hashcode() is a perfect hash function under many circumstances, but it will require pre-checking to ensure that there are no collisions.

String Comparison Algorithms

My second comment is about string comparison algorithms. Equality checking is straightfoward: every character must be checked at some point, and while there are various heuristics (Java hashcode, length, etc.) that can be applied to determine non-matches early, there is no way to do better than O(s.length()). For Java purposes, using the intern() function is a good idea in the context of a reasonable (i.e., known and finite) number of strings. (In fact, most modern XML parsers automatically intern() all Names (element names, attribute names, etc.). See the relevant SAX feature.) The only issues occur because there is no way to mark a function to indicate that it returns an intern'd string as opposed to a generic string.

The problem of containment is more interesting. It's a relatively old topic and is well-studied, e.g., in the classic Knuth-Morris-Pratt and Boyer-Moore from 1977 (making it older than several FiveSight employees...), which give linear O(s.length() + t.length()) (KMP) and sublinear O(s.length()/t.length()) (B-M) algorithms for searching for s as a substring of t. (Here is a comparison from the horse's mouth, as it were.) Last time I looked (JDK 1.4), the Java String.indexOf(String) function used the naive method for matching a substring which takes O(s.length() * t.length()) but requires no pre-computation. I'll have to Google for an implementation that computes and caches workers that search for specific strings.
5:26:22 PM