There is a world where Google, Yahoo! and Microsoft compete to build better search engines -- and for our money. Then there is a completely different world: the corporate market. And the next big thing in Web search in this other world might be the WebFountain supercomputing project from IBM. It's not an ordinary project. It took 200 IBM engineers four years and tens of millions of dollars to build. It also needs lots of hardware resources: several hundred powerful processors and 160 terabytes of storage. The project's goal is impressive: transform the huge amounts of structured and unstructured data available on the Web into business trends. Not what Google does -- and not at the same price either. For example, Factiva, an information services company, has licensed WebFountain and plans to offer it to its customers for about $200,000 a year.
This analysis looks at the goals, the resources and the status of this project as well as its future.
[Note: Most of the excerpts come from the CNET News.com article referenced at the bottom of this page.]
Harnessing the Internet's data to find meaning is a visionary ideal of Web search that has yet to be attained. As more companies manage their businesses on the Web, however, analysts predict they will be looking to extract value from its bits and bytes, and many software companies are now examining ways to bring that value to them.
IBM is hoping to cash in on the trend with the 4-year-old WebFountain project, which is just now coming of age. It's an ambitious research platform that relies on the Web's structured and unstructured data, as well as on storage and computational capacity, and IBM's computing expertise.
Whether WebFountain can deliver today, the problem it hopes to crack holds particular attractions for IBM. Big Blue has been pushing a new computing business model in which customers would rent processing power from a central provider rather than purchase their own hardware and software. WebFountain dovetails nicely with this utility computing model. IBM hopes to use the project to create a platform that would be used as a back end by other software developers interested in tapping data-mining capabilities.
It's not only a complex problem to solve, it's a big one.
Analysts said they expect to see increasing demand from corporations for services that mine so-called unstructured data on the Web. According to a study from researchers at the University of California at Berkeley, the static Web is an estimated 167 terabytes of data. In contrast, the deep Web is between 66,800 and 91,850 terabytes of data.
Providing services for unstructured-information management is an estimated $6.46 billion market this year and a $9.72 billion industry by 2006, according to research from IDC.
IBM assigned 200 researchers to the project, almost all of them located at IBM's Almaden Research Center. And they have some large hardware resources.
A main cluster consists of 32 eight-server racks running dual 2.4GHz Intel Xeon processors, capable of writing 10GB of data per second to disk. The system can store 160 terabytes of compressed data.
The central cluster is supported by two adjacent 64 dual-processor clusters that handle auxiliary tasks. One bank crawls the Web -- indexing about 250 million pages weekly -- while the other handles queries.
This represents 768 processors in total (512 in the main cluster plus 256 in the two auxiliary clusters), but the center is expecting an upgrade.
The cluster and storage system is migrating to blade servers this year, which will save space and provide 896 processors for data mining and 256 for storage. In total, the system will then have 1,152 processors, allowing it to process as many as 8 billion Web pages within 24 hours.
In other words, WebFountain will index more pages in one day than a complete Google Dance.
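As a sanity check on these figures, a few lines of Python reproduce the arithmetic behind the processor counts and the claimed indexing rate. The per-second figure derived at the end is a back-of-the-envelope calculation from the article's numbers, not something IBM states:

```python
# Back-of-the-envelope check of the cluster figures quoted above.

# Current configuration: 32 racks x 8 servers x 2 Xeon processors each,
# plus two auxiliary clusters of 64 dual-processor machines.
main_cluster = 32 * 8 * 2        # 512 processors
auxiliary = 2 * 64 * 2           # 256 processors
print(main_cluster + auxiliary)  # 768, matching the article

# Post-upgrade blade configuration: 896 for data mining + 256 for storage.
total_after_upgrade = 896 + 256
print(total_after_upgrade)       # 1152 processors

# Claimed throughput: 8 billion pages in 24 hours.
pages_per_second = 8_000_000_000 / (24 * 3600)
print(round(pages_per_second))   # 92593, i.e. roughly 92,600 pages per second
```

That last number puts the scale of the claim in perspective: nearly a hundred thousand pages every second, around the clock.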
Several applications are already running, such as this partnership with Semagix.
In one of the first public applications of the technology, IBM on Tuesday teamed with software provider Semagix to offer an anti-money-laundering system for financial institutions, with Citibank as its first customer.
The two companies have quietly been working together for months to develop an application that helps banks flag suspects attempting to legitimize stolen funds. Those efforts are in accordance with the USA Patriot Act, signed into law two years ago to fight terrorism.
Then there is another one with Factiva, an information retrieval company owned by Dow Jones and Reuters, as reported by the San Jose Mercury News.
IBM is still figuring out how to lower the cost of such analysis. So far, Factiva, an information services company, has licensed WebFountain and plans to offer it to its customers for about $200,000 a year. Eventually, says Andrew Tompkins, one of the creators of WebFountain, these tools will be at everyone's fingertips.
IBM also does consulting and provides WebFountain services to other corporations.
IBM says the WebFountain service has already yielded some promising results in early test runs, pointing to 2002 market research done on behalf of oil conglomerate British Petroleum as one telling example.
BP already knew that gas prices and car washes are customers' chief concerns while at the pump. But by unearthing news of a tiny Chicago-area gas station that created "cop-landing" areas for police officers, WebFountain called attention to another consumer worry: crime. Now BP is exploring plans to improve safety at its stations, giving away coffee, doughnuts and Internet connections to attract police officers.
WebFountain promises to combine its intelligence with visualization tools to chart industry trends or identify a set of emerging rivals to a particular company. The platform could be used to analyze financial information over a five-year span to see if the economy is growing, for example. Or it could be used to look at job listings to pinpoint emerging trends in employment.
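The kind of trend analysis described here can be illustrated with a toy sketch: count how often a topic is mentioned in dated, unstructured text, and watch the counts change over time. The listings and topics below are invented for illustration, and this says nothing about WebFountain's actual algorithms:

```python
import re

# Invented sample of dated job listings (unstructured text).
listings = [
    ("2003-Q1", "Seeking Java developer with web services experience"),
    ("2003-Q1", "C++ engineer wanted for embedded systems"),
    ("2003-Q2", "Java architect needed, XML and web services a plus"),
    ("2003-Q3", "Java and XML skills required for enterprise team"),
    ("2003-Q3", "Web services consultant, Java preferred"),
]

# Hypothetical topics to track across quarters.
topics = ["java", "xml", "web services"]

def topic_counts(quarter):
    """Count mentions of each topic in all listings for one quarter."""
    text = " ".join(t.lower() for q, t in listings if q == quarter)
    return {topic: len(re.findall(re.escape(topic), text)) for topic in topics}

for quarter in ["2003-Q1", "2003-Q2", "2003-Q3"]:
    print(quarter, topic_counts(quarter))
```

Scaled from five listings to billions of pages, with entity extraction and categorization instead of simple string matching, this counting-over-time idea is the essence of spotting an emerging trend in unstructured data.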
"The Web has become just a huge bulletin board, and if you can look at that over time and see how things have changed, it answers the question, 'Tell me what's going on?'" said Sue Feldman, analyst at market research firm IDC. "This looks for the predicable structure in text, and uses that just the way people do, to do some analysis, categorize information and to understand it."
- IBM sets out to make sense of the Web, by Stefanie Olsen, CNET News.com, February 5, 2004
- Monster librarian at work, by Dean Takahashi, San Jose Mercury News, February 4, 2004
- The Rise of Intelligent Agents: Automated Conversion of Data to Information, by Martha Young and Michael Jude, Computerworld, February 5, 2004
- IBM progressing on the development of its search engine technology, by Serge Thibodeau, for Search Engine Journal, February 5, 2004
- the IBM Almaden Research Center
- and Rewriting the rules of business, also from IBM
Sources: Roland Piquepaille, March 1, 2004; with the above references