Robots.txt Support is (Partially) Implemented
I just wanted to let people know that we've now partially implemented robots.txt support. Before anyone flames us: the main Feedster spider, the thing that checks sites regularly, now honors robots.txt. We fetch it once per week, every Sunday. Please note that there is NO standard for how often a crawler should re-fetch it; this is simply what we decided. If you need us to check more often, let us know.
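For anyone curious what "honoring robots.txt on a weekly refresh" looks like in practice, here is a minimal sketch in Python. It is not our actual spider code. The function names and the "Feedster" user-agent token are assumptions made up for illustration, and a plain seven-day interval stands in for our "every Sunday" schedule.

```python
from datetime import datetime, timedelta
from urllib.robotparser import RobotFileParser

# Assumption: a flat seven-day interval approximates "once per week".
REFRESH_INTERVAL = timedelta(days=7)


def needs_refresh(last_fetched, now):
    """Return True if robots.txt was never fetched or is at least a week old."""
    return last_fetched is None or now - last_fetched >= REFRESH_INTERVAL


def allowed(robots_txt, user_agent, url):
    """Check a URL against the text of an already-downloaded robots.txt."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)
```

With a robots.txt that says `User-agent: Feedster` followed by `Disallow: /`, `allowed(...)` is false for the hypothetical "Feedster" agent on any URL of that site, while a crawler the file does not mention is still allowed.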
And please bear in mind that if you use robots.txt to turn off our indexing of your blog (which is fine), we will still periodically check your URL to make sure that you haven't changed your mind. If you want us to just go the heck away, never touch you, etc., then you have to let us know personally, and we'll flip the database bit that says "A real live human made an intelligent reasoned decision to opt out of Feedster so we're never bothering them again unless they specifically ask us to".
As per the ill-defined robots.txt spec, we look for robots.txt in the root directory of a weblog. If your blog is located at http://radio.weblogs.com/0103807/ then this means that we're looking for the file http://weblogs.com/robots.txt. Subdomains and blogs are way, way, way too random for us to check every possible location.
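To make the lookup rule above concrete, here is a rough Python sketch of how a blog URL could be mapped to a robots.txt location. The function name is hypothetical, and "keep only the last two labels of the hostname" is my simplifying assumption for this illustration; it is not a description of our actual code, and it would guess wrong for domains like example.co.uk.

```python
from urllib.parse import urlsplit


def robots_url_for(blog_url):
    """Map a blog URL to the single robots.txt location we would check."""
    parts = urlsplit(blog_url)
    host = (parts.hostname or "").lower()
    # Assumption: drop subdomains by keeping only the last two host labels.
    root = ".".join(host.split(".")[-2:])
    scheme = parts.scheme or "http"
    # The blog's own path is ignored; only the site root is consulted.
    return f"{scheme}://{root}/robots.txt"
```

Under that assumption, `robots_url_for("http://radio.weblogs.com/0103807/")` gives `http://weblogs.com/robots.txt`, matching the example above.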
Now what our robots.txt support isn't doing yet is handling images. The reason is that we use a separate crawler for images and, in particular, we have to do a bunch of path analysis to compare relative paths against absolute ones, plus deal with the whole issue of www versus no www. So while we're making progress, the current implementation is, indeed, buggy and hence turned off. We'll get it in as soon as possible, but with Gnomedex this week and my traveling to, of all places, Des Moines, Iowa, this isn't really likely. At least not a stable, reliable version.
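For a feel of the path analysis involved, here is a small Python sketch, assuming all we need is a comparable (host, path) pair for each image. The function name is hypothetical and this is not the image crawler's real logic. It resolves a relative image reference against the page it appears on and treats the www and non-www hostnames as the same site.

```python
from urllib.parse import urljoin, urlsplit


def canonical_image_key(page_url, image_src):
    """Resolve an image reference and return a (host, path) pair for comparison."""
    # Turn a relative src such as "../img/a.png" into an absolute URL.
    absolute = urljoin(page_url, image_src)
    parts = urlsplit(absolute)
    host = (parts.hostname or "").lower()
    # Assumption: "www.example.com" and "example.com" are the same site.
    if host.startswith("www."):
        host = host[4:]
    return host, parts.path or "/"
```

The resulting path is what would then be checked against the site's robots.txt rules. Real sites have more corner cases (ports, query strings, hosts where the www and non-www versions really are different sites), which is part of why this is hard to get stable.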
When: 12:12:11 PM