How do RSS search engines differ from 'conventional' HTML search engines such as Google? Ignore for now that Google can index other data sources such as PDF files.
The following are not criticisms of Feedster or any RSS search engine. Instead I see them as challenges and without challenges there'd be no progress. Some have easy solutions, others not. In this piece I've used Feedster as the canonical example of an RSS search engine. The questions raised are relevant to all RSS search engines but as Feedster is the nom du jour I've taken the liberty of using it as a convenient reference.
In no particular order; which of these matters most depends upon what you think is important:
1. Relevance ranking.
Compare these two searches for 'RSS search engines', firstly using Google then using Feedster. I know Feedster is just starting up but we need to know how it's searching its data and what the relevance ranking model is.
2. Finding and searching legacy data.
How will Feedster cope with legacy data? Take a look at my weblog home page RSS feed:
It lists only 3 items, yet I've been writing this weblog for a couple of years. How will Feedster discover the rest of my written content? If the date of the first post listed in any one site's RSS file is taken as year 0 for that site, then fair enough; at least it'll be current. But as I write more, older posts will fall off the end of my RSS feed. Will these items be stored in Feedster's database, or will the engine only ever search what's in the RSS feed at any one time? If the latter, then a lot of relevant content will always be hidden from Feedster's gaze (see also Walking the web - content scalability).
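To make the "stored in the database" option concrete, here is a minimal sketch in Python of an archive that keeps every item it has ever seen, so posts survive after they drop off the live feed. This is not how Feedster works (I don't know that); the names parse_items, FeedArchive and make_feed are made up for illustration, and it assumes a plain RSS 2.0 feed without XML namespaces.

```python
import xml.etree.ElementTree as ET


def parse_items(rss_xml):
    """Return (key, title) pairs for every <item> in an RSS 2.0 document."""
    root = ET.fromstring(rss_xml)
    items = []
    for item in root.iter("item"):
        title = item.findtext("title", default="")
        link = item.findtext("link", default="")
        guid = item.findtext("guid", default="")
        # Assumption: guid, else link, else title is unique enough as a key.
        key = guid or link or title
        items.append((key, title))
    return items


class FeedArchive:
    """Hypothetical store that never forgets an item once it has seen it."""

    def __init__(self):
        self.items = {}  # key -> title

    def ingest(self, rss_xml):
        """Add unseen items from one feed snapshot; return how many were new."""
        new = 0
        for key, title in parse_items(rss_xml):
            if key not in self.items:
                self.items[key] = title
                new += 1
        return new


def make_feed(post_numbers):
    """Build a tiny RSS snapshot listing only the given posts (demo helper)."""
    items = "".join(
        f"<item><title>Post {n}</title><link>http://example.com/{n}</link></item>"
        for n in post_numbers
    )
    return f"<rss version='2.0'><channel><title>Demo</title>{items}</channel></rss>"


archive = FeedArchive()
archive.ingest(make_feed([1, 2, 3]))  # first visit: the feed lists posts 1-3
archive.ingest(make_feed([2, 3, 4]))  # later visit: post 1 has fallen off
print(sorted(archive.items.values()))
```

After the second visit the archive still holds Post 1 even though the feed no longer lists it. Anything published before the engine's first visit is still invisible, which is the legacy data problem.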
3. Walking the web - content scalability.
Google can find new content by walking the web. It picks up a link and walks it to find pages with yet more links. Eventually the whole web, at least in theory, can be walked as Google plays the ultimate six degrees of Kevin Bacon game (everything is related to everything else by a number of links).
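The walk itself is easy to sketch. The Python below does a breadth-first walk over a toy, in-memory link graph; it illustrates link-following in general and makes no claim about how Google's crawler is built. The walk function and the page names are invented for the example.

```python
from collections import deque


def walk(link_graph, seed):
    """Return pages reachable from seed, in breadth-first order.

    link_graph maps a page to the list of pages it links to.
    """
    seen = {seed}
    queue = deque([seed])
    order = []
    while queue:
        page = queue.popleft()
        order.append(page)
        for target in link_graph.get(page, []):
            if target not in seen:  # never queue the same page twice
                seen.add(target)
                queue.append(target)
    return order


# Toy web: "e" links out to "a", but nothing links in to "e".
toy_web = {
    "a": ["b", "c"],
    "b": ["c", "d"],
    "c": [],
    "d": ["a"],
    "e": ["a"],
}
print(walk(toy_web, "a"))  # ['a', 'b', 'c', 'd']
```

Page "e" is never found because no page already known links to it, hence the "at least in theory".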
Feedster can walk this walk to some extent using RSS autodiscovery but sooner or later (most likely sooner) it'll come up against a page with no associated RSS file and presumably the walk stops there. One day maybe all pages will be part of an RSS feed somewhere but not for a while yet I suspect.
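For anyone who hasn't met RSS autodiscovery: a page advertises its feed with a link element in its head, with rel="alternate" and type="application/rss+xml". Here is a rough Python sketch of reading that out of a page using only the standard library. FeedLinkFinder and discover_feeds are names I've made up, and whether Feedster does anything like this is my assumption.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin

# Assumption: only RSS feeds are of interest here.
FEED_TYPES = {"application/rss+xml"}


class FeedLinkFinder(HTMLParser):
    """Collect feed URLs advertised via <link rel="alternate" ...> tags."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.feeds = []

    def handle_starttag(self, tag, attrs):
        if tag != "link":
            return
        a = {name: (value or "") for name, value in attrs}
        rels = a.get("rel", "").lower().split()
        is_feed = a.get("type", "").lower() in FEED_TYPES
        if "alternate" in rels and is_feed and a.get("href"):
            # Resolve relative hrefs against the page's own URL.
            self.feeds.append(urljoin(self.base_url, a["href"]))


def discover_feeds(html, base_url):
    """Return the list of feed URLs a page advertises (empty if none)."""
    finder = FeedLinkFinder(base_url)
    finder.feed(html)
    return finder.feeds


page = """<html><head>
<link rel="alternate" type="application/rss+xml" title="RSS" href="rss.xml">
</head><body>A weblog page</body></html>"""
print(discover_feeds(page, "http://example.com/blog/"))
print(discover_feeds("<html><body>No feed here</body></html>", "http://example.com/"))
```

The second call returns an empty list, and that is where the walk stops.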
Here's my weblog:
I actually use my weblog as 3 separate weblogs, the home page and 2 categories (in Radio UserLand parlance). So in Feedster presumably my weblog has 3 instances:
Now could Feedster guess that? Well yes, to some extent, it could walk my weblog and discover the RSS links from each page coming up with these 3 unique RSS URLs. But could Feedster use my weblog's domain name to discover more RSS feeds from other people's weblogs? Probably not. The domain name where my weblog resides is:
How can you infer from this what other weblogs exist in this domain, let alone what their home URLs are?
So my guess is that right now Feedster has a more complex database than Google. It probably has a table listing all unique RSS feed URLs, then a larger database of individual RSS items. The RSS items database is probably what gets searched. My guess is that Google doesn't maintain a separate table of top-level domains. Is this a problem for either system? Well, I guess it depends upon what you want to achieve. Feedster can use these extra data to its advantage in its advanced search. But so can Google.
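Here is what that guessed layout might look like, sketched with Python's built-in sqlite3 module: one table of unique feed URLs and a larger table of items that gets searched. To be clear, this is my guess made concrete, not Feedster's actual schema; the table names, the column names and the crude LIKE match are all assumptions.

```python
import sqlite3


def create_index(conn):
    """Create the two guessed tables: unique feeds, and the items they carry."""
    conn.executescript("""
        CREATE TABLE feeds (
            id  INTEGER PRIMARY KEY,
            url TEXT NOT NULL UNIQUE
        );
        CREATE TABLE items (
            id          INTEGER PRIMARY KEY,
            feed_id     INTEGER NOT NULL REFERENCES feeds(id),
            link        TEXT NOT NULL,
            title       TEXT NOT NULL,
            description TEXT NOT NULL,
            UNIQUE (feed_id, link)
        );
    """)


def add_item(conn, feed_url, link, title, description):
    """Store one item, registering its feed on first sight; duplicates are skipped."""
    conn.execute("INSERT OR IGNORE INTO feeds (url) VALUES (?)", (feed_url,))
    feed_id = conn.execute(
        "SELECT id FROM feeds WHERE url = ?", (feed_url,)
    ).fetchone()[0]
    conn.execute(
        "INSERT OR IGNORE INTO items (feed_id, link, title, description) "
        "VALUES (?, ?, ?, ?)",
        (feed_id, link, title, description),
    )


def search(conn, term):
    """Return (feed URL, item title) rows whose title or description contains term."""
    pattern = f"%{term}%"
    rows = conn.execute(
        "SELECT feeds.url, items.title FROM items "
        "JOIN feeds ON feeds.id = items.feed_id "
        "WHERE items.title LIKE ? OR items.description LIKE ? "
        "ORDER BY items.id",
        (pattern, pattern),
    )
    return rows.fetchall()


conn = sqlite3.connect(":memory:")
create_index(conn)
add_item(conn, "http://example.com/rss.xml", "http://example.com/1",
         "RSS search engines", "Notes on Feedster")
add_item(conn, "http://example.org/rss.xml", "http://example.org/a",
         "Gardening", "Tomatoes")
print(search(conn, "feedster"))
```

Because every item row points back at its feed, a search could be restricted to a single feed, which is the kind of extra data an advanced search could exploit. A substring match like this says nothing about relevance ranking, which is a separate challenge.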
Feedster can only discover content if it's part of an RSS feed. Google isn't limited in this respect and can find any page on the web, provided some page already in the Google database links to it (you can of course suggest a link to both Google and Feedster to get a new site into their databases). Right now it would seem that RSS search engines are limited to a finite data-set: those pages with an associated RSS feed. This may change with time (see also Finding and searching legacy data).