Steve Land's Brain
Unformed thoughts are much more interesting than hardened opinions
















 

Metadata

Let's take a stroll outside. Oh look, there's a tree. I know it's a tree, because there is a scrap of paper nailed to it with the word "tree" written in black Sharpie. At the base of this tree is a plaque, which reads: "This California Black Oak was planted in 1929 by a state-funded road crew." Step back a few feet and you can see the larger sign behind the tree, which says "Plant" and points to the tree with a big red arrow. There is a metal band around the tree that has something stamped in it... let's see: "Plant #8428753-0034"

Metadata, which literally means "data about data", is becoming a popular approach to organizing content and managing digital resources. There are standards for metadata, such as Dublin Core and RDF. People make very compelling arguments for the utility of metadata.

It's interesting to me that Metadata gets a special name, and other kinds of data do not. You don't say, this is Productdata, which is literally "data about products", or Customerdata. Why is data about data held up as a unique solution?

It might be the prefix "meta". I found a great definition for this on a site about complex systems, calresco.org:

Meta-: A prefix used to denote a higher level of thought about the subject, e.g. metascience (where we consider how we approach science), meta-ethics where we consider how we define normative behaviour. Each level in a complex system can be considered as a meta-viewpoint upon the previous level of emergence. Relates to category or type theory and higher-order logic.

So, once you have any data at all, you can of course have a higher level of thought about that data, and create something you can call Metadata. Say I have data about the number of hits to a Web site. This is just a count of page requests. I could step back and think about this data in a more abstract way, and say, wouldn't it be great if we could also know what caused the hits to occur? Did a user follow a link, open a Bookmark, or type in the address? If I could capture this extra data, it could be called "meta-" to the original data.

However, relational database theory is already doing all of this. If I have a table "Person" in a relational database, nobody would think that I'm a genius if I also had a table called "PersonType" that contained classifications of persons. Each person record would contain a pointer to a record in the PersonType table, and we'd know, for each record, whether this person is an Employee, Vendor, or Customer, for instance. Same would go for my Web hit counter. I probably would have a table called "HitSource" with normalized values representing all possible sources for a hit to a Web page. Each record would then refer to one of these values, and boom, you've done the meta thing.
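The hit counter described above can be sketched in a few lines. This is only an illustration under assumptions: the table and column names ("Hit", "source_id", and so on) and the sample rows are hypothetical, and Python's built-in sqlite3 module stands in for whatever database you would really use.

```python
import sqlite3

# Hypothetical schema: table names, column names, and rows are illustrative only.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE HitSource (
        id   INTEGER PRIMARY KEY,
        name TEXT NOT NULL UNIQUE
    );
    CREATE TABLE Hit (
        id        INTEGER PRIMARY KEY,
        page      TEXT NOT NULL,
        source_id INTEGER NOT NULL REFERENCES HitSource(id)
    );
""")

# The normalized list of possible sources: this is the "meta" layer.
conn.executemany(
    "INSERT INTO HitSource (id, name) VALUES (?, ?)",
    [(1, "link"), (2, "bookmark"), (3, "typed address")],
)

# Each hit record points at exactly one of those values.
conn.executemany(
    "INSERT INTO Hit (page, source_id) VALUES (?, ?)",
    [("/index.html", 1), ("/index.html", 2), ("/about.html", 1)],
)


def hits_by_source(connection):
    """Return {source name: hit count}, including sources with no hits."""
    rows = connection.execute("""
        SELECT s.name, COUNT(h.id)
        FROM HitSource AS s
        LEFT JOIN Hit AS h ON h.source_id = s.id
        GROUP BY s.id, s.name
        ORDER BY s.id
    """).fetchall()
    return dict(rows)


print(hits_by_source(conn))
```

Nothing here is labeled metadata, yet the HitSource table plays exactly that role for each Hit row.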

Perhaps the difference now is that relational data can be expressed as detached documents, such as XML, so that the metadata itself can be part of the "data" (or thing) that the metadata is being meta about. Similarly, a document, such as a Microsoft Word document, can hold Properties that describe the document itself, not the contents: Author, Keywords, etc.
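A small sketch of what "the metadata travels inside the thing" might look like. The element names ("document", "metadata", "keyword", "body") are made up for this example and do not follow Dublin Core or any other standard; the parsing uses Python's standard xml.etree.ElementTree module.

```python
import xml.etree.ElementTree as ET

# Hypothetical document format: the element names are invented for illustration.
DOC_XML = """
<document>
  <metadata>
    <author>Steve Land</author>
    <keyword>metadata</keyword>
    <keyword>semantics</keyword>
  </metadata>
  <body>Let's take a stroll outside.</body>
</document>
""".strip()


def read_metadata(xml_text):
    """Pull the descriptive fields out of a document without touching its body."""
    root = ET.fromstring(xml_text)
    meta = root.find("metadata")
    return {
        "author": meta.findtext("author"),
        "keywords": [k.text for k in meta.findall("keyword")],
    }


print(read_metadata(DOC_XML))
```

The body and the description of the body live in one file, so the description goes wherever the file goes, with no database lookup required.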

Relational databases work because people work for companies, and companies can mandate definitions for things that everyone agrees with. These definitions become semantic boundaries that are formalized in a database schema, and then reinforced through time as real data is added to tables in that database. Every once in a while, the existing database structure will be found to no longer be adequate to hold the conceptual information that is needed by the business, and the database has to change.

The problem here is that the database has to have a schema at all. The schema itself pre-defines all the categories and nomenclature that will be used to codify business concepts. The design generally needs to be decided up front. Good analysts and database architects can build something that is flexible to some extent, but even the flexibility itself is confined by the underlying schemas.

On the other extreme, metadata empowers "thing-creators" to determine classifications of their "things" in a more ad-hoc way. All metadata schemes pre-define fields that authors or editors can fill in with values that will describe the things at this "higher level of thought". Within an organization, industry, or standard, controlled lists of permitted values may exist to constrain authors from being completely original with their metadata values, making the metadata-adding process a bit more like a relational database.
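To make the controlled-list idea concrete, here is a minimal sketch. The field names and permitted values are hypothetical, not taken from any real standard; fields without a controlled list are treated as free text.

```python
# Hypothetical controlled-value lists; a real organization would define its own.
CONTROLLED_VALUES = {
    "subject": {"nature", "people", "architecture"},
    "format": {"photo", "illustration"},
}


def invalid_fields(metadata):
    """Return the sorted names of fields whose value is not on its controlled list.

    Fields with no entry in CONTROLLED_VALUES are free text and are not checked.
    """
    return sorted(
        field
        for field, value in metadata.items()
        if field in CONTROLLED_VALUES and value not in CONTROLLED_VALUES[field]
    )


item = {"title": "Black Oak at dusk", "subject": "nature", "format": "sketch"}
print(invalid_fields(item))
```

The author is free with the title but not with the format, which is about as close as metadata gets to a foreign key constraint.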

There are a few problems here. One is that each individual has different ideas about how important the metadata is for their item. A marketer will want to add as many keywords as possible, related or not, to make her item appear in any search result. A craftsman might think his item stands on its own, and might think the metadata process is not important at all. The importance might also be influenced by workload, stress, original idea of who your audience is, whether your thing is already slated to become part of a publication or not, etc. Let's call this the "Quality problem".

Another problem is that, no matter how much metadata you have about a thing, there is a great likelihood that an unexpected new usage of the thing will come up that cannot be supported using the metadata that you already have. Imagine a library of digital photos. Each digital photo has a title, keywords, author, subject, and so on. Now, a marketing team wants photos that would be good for cheering up patients in the burn unit of a hospital. There is a high likelihood that none of the metadata would have any bearing on the decisions to use one photo over the next, and human judgment takes over from the metadata world. Let's call this the "Unanticipated requirements" problem.

Another problem is that the metadata itself is also just data. So, it will need some metadata of its own. What is an author? How do we choose one set of metadata classification schemes over another? Yes, the standards do address this, but there is nothing inherently "correct" about these standards either. What if we have metadata for "Author", "Creator", "Publisher", and "Contributor", and we want to get a list of all the people who worked on a piece. Some of these fields may have people's names, but others might have company names. There is nothing about these fields that ensures that they always contain humans; so it's back to metadata on top of metadata. Let's call this the "Meta-meta problem".
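A rough sketch of the "Meta-meta problem", using made-up records. The names, the company, and the "kind" annotation are all hypothetical; the point is only that the second function needs a second layer of description before it can answer the question.

```python
ROLE_FIELDS = ("author", "creator", "publisher", "contributor")

# Hypothetical record: nothing says which of these values is a human being.
record = {
    "author": "Jane Doe",
    "publisher": "Example Press, Inc.",
    "contributor": "John Roe",
}


def naive_people(rec):
    """Collect every role value, wrongly assuming each one names a person."""
    return [rec[f] for f in ROLE_FIELDS if f in rec]


# The same record with metadata about the metadata: a "kind" for each value.
typed_record = {
    "author": {"value": "Jane Doe", "kind": "person"},
    "publisher": {"value": "Example Press, Inc.", "kind": "organization"},
    "contributor": {"value": "John Roe", "kind": "person"},
}


def people(rec):
    """Collect only the role values that are annotated as persons."""
    return [
        rec[f]["value"]
        for f in ROLE_FIELDS
        if f in rec and rec[f]["kind"] == "person"
    ]


print(naive_people(record))
print(people(typed_record))
```

Of course, "kind" is itself just another field that someone has to define, fill in, and agree on, so the regress continues.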

The problems that I find with metadata are really just symptoms of the problems I see with semantic labels being mistaken for reality. The map is not the territory. Data is a particular view of reality, constrained by somebody's concepts and semantics, captured in normalized structures. Individual bits of data are viewpoints of reality, removed from reality, and thus de-contextualized. Metadata attempts to solve the lack of context problem by adding bits of context around the edges. This is often useful, mostly because we are still using data, and any context we can get is helpful.

The solution to the impedance mismatch between semantics and reality, I think, will be different in kind, in a way that metadata is not. Reality is context. Semantics divides reality into bits, and which bits you choose to focus on depends on who you are, what your culture is, and why you are focusing in the first place. Data takes semantics to another level, forcing the next person to also focus on the bits of reality that you have already entrenched in your data schema.

So, in one scenario, we go for a walk and see the tree with all this metadata on it. Someone else decided what was important about the tree, what form the metadata will take (bits of paper, plaques, signs with arrows), and then captured what they thought communicated about the essence of the tree best. In another scenario, we are walking along and see the tree directly. No labels, no words. Just a tree in context of the earth and sky and sun and moon and gravity and air and water and whatever actual reality bits we choose to focus on. We can widen or narrow our focus at will, looking at the veins of individual leaves, the twists of branches, the texture of bark, the shape of the ground above the roots, the watershed that funnels rainfall around the tree, the geopolitical boundaries that define the nationality of the tree, whatever. To do so, we interact with a whole and use our minds to divide reality. This is totally different than getting pre-digested bits of reality and interacting with them.

 


© Copyright 2005 Steve Land.
Last update: 4/21/2005; 8:22:05 AM.