Michael Fioritto's Radio Weblog

Wednesday, July 31, 2002

More content categorizers.
Information overload.

We cope but it isn't getting much better.

And sometimes finding what we're looking for is like finding a needle in a field of haystacks. Or a leaf in a forest of trees.

Search alone is rarely enough to find what you need in very large data spaces. For example, Google search results and Monster candidate listings often return thousands of close hits. Matching engines narrow the field by applying criteria to a two-sided search: both employer and worker have demands to be met, and each supplies ways to meet the other's demands.
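To make the two-sided idea concrete, here is a minimal sketch in Python. It is not how Monster or any real matching engine works; the field names (`demands`, `offers`) and all the sample data are made up for illustration. A pair counts as a match only when each side's demands are covered by what the other side offers.

```python
# Hypothetical records: every name, field, and value is invented for illustration.
def satisfies(demands, offers):
    """True when every demanded item appears in what the other side offers."""
    return set(demands) <= set(offers)

def two_sided_matches(employers, workers):
    """Return (employer, worker) name pairs where each side meets the other's demands."""
    return [
        (e["name"], w["name"])
        for e in employers
        for w in workers
        if satisfies(e["demands"], w["offers"]) and satisfies(w["demands"], e["offers"])
    ]

employers = [
    {"name": "Acme", "demands": {"python", "sql"}, "offers": {"remote", "health plan"}},
    {"name": "Globex", "demands": {"java"}, "offers": {"health plan"}},
]
workers = [
    {"name": "Pat", "offers": {"python", "sql", "xml"}, "demands": {"remote"}},
    {"name": "Lee", "offers": {"java"}, "demands": {"remote"}},
]

matches = two_sided_matches(employers, workers)
print(matches)
```

Note that Globex and Lee agree on skills but still don't pair up, because Lee's own demand goes unmet. A one-sided search would have returned that hit.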

Taxonomies are another approach. Yahoo! and Open Directory show the value of navigating through clumps and clusters of related sites. But you have your own data to mine. And creating a taxonomy by hand is expensive and slow.

Enter taxonomy helpers. They do several things:

Analyze source files: Suck metadata from your diverse resources (documents, web pages, emails, news feeds, etc.) into a common and comparable format
Define clusters: Help define your topics and how the topics are related. This is compute- and storage-intensive, so it is often done bit by bit, starting with broad categories and then refining and splitting them as they fill up.
Categorize: Assign each resource to one or more categories in the taxonomy, typically using metadata.
Serve: Manage a user experience for surfing or flying through the taxonomy.
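The four steps above can be sketched as a toy pipeline. This is only an illustration under heavy assumptions: the "metadata" is just a set of lowercase words, the topics are seeded by hand rather than discovered by cluster analysis, and the function names (`analyze`, `categorize`, `build_index`) are my own, not any vendor's API.

```python
import re
from collections import defaultdict

def analyze(resource_id, text):
    """Analyze: reduce any resource to a common record of lowercase terms."""
    return {"id": resource_id, "terms": set(re.findall(r"[a-z]+", text.lower()))}

def categorize(record, taxonomy, min_overlap=1):
    """Categorize: pick every topic whose keywords overlap the record's terms."""
    return sorted(
        topic
        for topic, keywords in taxonomy.items()
        if len(record["terms"] & keywords) >= min_overlap
    )

def build_index(records, taxonomy):
    """Serve: map each topic to resource ids so a reader can browse by topic."""
    index = defaultdict(list)
    for record in records:
        for topic in categorize(record, taxonomy):
            index[topic].append(record["id"])
    return dict(index)

# Define clusters: topics are seeded by hand here; a real helper would propose them.
taxonomy = {
    "search": {"search", "query", "index"},
    "standards": {"rdf", "daml", "xml"},
}

records = [
    analyze("doc1", "Search engines index the web"),
    analyze("doc2", "RDF and DAML are metadata standards"),
    analyze("doc3", "An XML search index"),
]
index = build_index(records, taxonomy)
print(index)
```

A resource can land in more than one category (doc3 does here), which matches the "one or more categories" wording above.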

Here's a roundup of some shipping categorizers.

First, I noted Quiver, a tool that recommends topics for human review and approval.

Back in April, eContent Magazine wrote a piece on Taxonomy's Role in Content Management.

Taxonomy technology greatly assists the sharing of enterprise knowledge. But don't expect to sit back and watch it go. Experts agree that those searching for an out-of-the-box solution shouldn't hold their breath. Count on adding a little elbow grease, but the results will be worth it.

They mentioned taxonomy vendors:

Autonomy creates and maintains outlines using pattern and cluster analysis. Separate components analyze documents for their content and categorize them to taxonomy branches and leaves.

Inxight Software's Categorizer filters, classifies and delivers content to users and corporate knowledge bases. It scales to millions of documents and thousands of topics in multiple languages. A sister product, MetaText Server, elicits structured data from unstructured sources.

Lotus Discovery Server extracts, analyzes, and categorizes structured and unstructured content to reveal the relationships between the information as well as the people, topics, and user activity in an organization.

Microsoft's SharePoint Portal Server has manual content categorization features.

Semio's SemioTagger autocategorizes content.

Sopheon autocategorizes content from multiple sources, including sources external to the enterprise.

They also pointed out taxonomy visualization sites.

Antarcti.ca uses cartography to map clusters of information spatially.

Inxight VizServer's Star Tree and Table Lens help you to meaningfully surf large dataspaces.

TheBrain Technologies
www.thebrain.com

Now eWeek reviews three more products in this space:

Applied Semantics' Auto-Categorizer 1.1
Interwoven's Metatagger 3.0
Thunderstone Software's Texis Categorizer 4.1.

eWeek's overview of the comparison findings is worth reading, as is their eVal Scorecard: Content Categorization. Note that they used very small record sets, in the low thousands. Even a small company will organize hundreds of thousands of records, if not millions.

One last note. Standards in this area are few and rarely implemented. The few that exist are RDF (Resource Description Framework), DAML (DARPA Agent Markup Language), and DAML+OIL (Ontology Inference Layer).

Now where should I categorize this post?
[a klog apart]
12:59:22 PM

What She Means Is "Access" Will Cost.
Factiva CEO: News Will Cost in Two Years

"Consumers will be coughing up for all online media content by 2004, according to Factiva CEO Clare Hart, who sees a two-year turnaround for ISPs to get with the paid-for-content program....

According to Hart, consumers do not want to pay for online content because they have been trained not to, whereas business users are used to putting their hands in their pockets for particular information.
'I think that the media has trained the online consumer that there is no value in what they publish,' Hart said. 'In two years we will see a turnaround in the consumer market. It's going to take some time for publishers to build the infrastructure to bill consumers. In the meantime, consumers are going to learn that they have to pay, business users on the other hand have known this for a long time.' Analyst group AMR Interactive said media companies are almost as unanimous in their insistence that paid content is the only way forward as Internet users are in their unwillingness to come up with the cash." [ZDNet, via The Virtual Acquisition Shelf & News Desk]

If Hart is right about this to any degree, then it's going to cause an even bigger rift between publishers/aggregators and libraries. Expect libraries to continue being the hot cyber-battleground as everyone works through the digital rights management & fee-based model versus fair use & information in the "commons" debate.
[The Shifted Librarian]
12:47:58 PM

PHP Class 'AmazonLiteXMLParser' released. Given XML from Amazon's Web Service, this class parses the XML and creates an array of product information. [XML News by CodingTheWeb.com]
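For a feel of what a parser like that does, here is a rough sketch of the same idea in Python rather than PHP. It is not the AmazonLiteXMLParser code, and the sample XML is invented: the element names (`products`, `product`, `title`, `price`) are assumptions, not Amazon's actual response format.

```python
import xml.etree.ElementTree as ET

# Made-up sample: element names are illustrative, not Amazon's real schema.
SAMPLE = """
<products>
  <product>
    <title>Example Book</title>
    <price>$10.00</price>
  </product>
  <product>
    <title>Another Book</title>
    <price>$12.50</price>
  </product>
</products>
"""

def parse_products(xml_text, item_tag="product"):
    """Turn each <product> element into a dict of its child tags and text."""
    root = ET.fromstring(xml_text.strip())
    return [
        {child.tag: (child.text or "").strip() for child in item}
        for item in root.findall(item_tag)
    ]

products = parse_products(SAMPLE)
for product in products:
    print(product["title"], product["price"])
```

Each product becomes one dictionary, so the result is a plain list you can loop over or hand to a template.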
12:39:16 PM

What is a weblog?
Some good news. I've been given permission to republish Meg Hourihan's excellent essay on weblogs. At the time it came out I was getting ready to write something similar, it was the right time for the weblog world to define weblogs, because so many journalists had been trying to do it. Meg did such a great job, and I want to carry more voices through DaveNet, so I asked her, and then her editor at O'Reilly for permission, and this morning they said yes.

From there, I want to start an outline about what a weblog is, because there's more to say. Maybe it'll be a three-column table. In column 1, a topic. For example: Fact-checking. In the second column, how centralized journalism does it; and in the third column, how it works in the weblog world. That way, if someone understands how fact-checking works in the print world, they have a basis for understanding how it works when done in the open.

Perhaps you see more errors in weblogs, but they can get corrected quickly. I guess the diff is that you can see the process in weblogs. Some people say this is a bad thing, but I think it's good. When I see writing that's too polished, where the grammar is too perfect, I am suspicious that at a deeper level it has been sanitized and dumbed-down. I like getting my news and opinion straight from the source without the middleman.

Another row. In column 1, "Research". In column 2, "A reporter spends two weeks interviewing experts, with transcription errors, dumbing-down, etc added." In column 3, "Experts spend a lifetime trying new ideas, learning from their mistakes, and learning how to explain their philosophy. Weblogs let them publish their ideas without intermediaries."
[Scripting News]
12:33:33 PM