The FuzzyBlog :: Scott Johnson's Blog

Scott Johnson / The FuzzyGroup, Feedster / PHP Consulting / Random geeky stuff / I Blog Therefore I Am.

Thursday, July 17, 2003

Ye 'Olde Random Pile O' Links

Once again here's what I'm reading:

Laugh Out Loud: Strange Solutions
Disgraceful: Our Deficit
I'm a Speaker: PHP CON San Francisco
OSCOM Online: http://www.cmsreview.com/oscom/audiovideo.html
Shades of Larry Niven: FlashMobs / FlashMobs Boston
Well Said: Dave on Women
I Need to Read: http://About-AdSense.com/
Free Microsoft Development Tools: MS Versus SourceForge
Aggregators: Adam Wants an Aggregator for Dummies | RSS Reader for FireBird
Russell: Python Versus Java
Google, Blogs and Popularity: Luke (he's right -- I'm on the 2nd page of results for just "scott")

When: 8:11:58 PM | Permalink:

WinerWatcher Revisited or "We're Saying No to this One"

To start: Thank You!!! I got more quality feedback on this issue than I have on, I think, any other issue ever. That's wonderful. You know, when I got into blogging, I was a bit hesitant about having the product design largely influenced by others' input. I'd always run smaller, tighter teams and I have to say this "engineering in the open" is, well, for lack of a better term, just plain sweet.

And I think the verdict is in -- there are a lot of people who say turn this on now. Go for it. Boo Yah! But I learned a few things a long time ago:

When you don't know what to do, if you don't have to do anything immediately, then just plain don't. Listen and learn.
Just because you can doesn't mean you should (as Dave also cited in my comments and his blog)
Get the opinions of people who've been doing something longer than you have. While I'm not a blogging carpetbagger in the sense of Reconstruction, I do notice that those who argued against this have been doing this longer than I have. And while I don't know everyone's "blogging street cred", I got the general feel that the ones for it have been blogging less than the ones who are against it. That's interesting.
No (or darn few) technical problem is about technology -- they're all about the social issues. I had an interesting lunch yesterday with Mark Bernstein and Aaron Swartz, and looking back on it a day later makes me realize that virtually all the issues we argued about, including my taking an anti-Open Source position, were social.

So while there's nothing architecturally to stop us from doing it, I'm not yet certain that we should and for that reason Francois and I are saying "Winer Watcher: Just Say No". I'm sure we'll think about this again but for now there seem to be enough compelling reasons to not do it. And if you haven't read Burning Bird's very cogent essay on the topic then you probably should do so.

Note: One particularly good idea did come out of this -- a noarchive element for RSS. And hopefully someone from the echo / atom / pie project is reading. Now how that would be implemented when virtually every aggregator has to archive to display, I don't know. But ...

When: 4:48:49 PM | Permalink:

Feedster, Yahoo and Google or "Feedster: Not Just A Big Pile of PHP Scripts"

Hm.... Dave's entry on Feedster, Yahoo and Google has brought out some skeptics. Here's Dave's comment:

Isn't it obvious that either Google or Yahoo will buy Feedster so their search engine can understand RSS. Then the other guy is going to wonder why they missed the boat.

More...

And interestingly Tara from (the excellent) ResearchBuzz also said this:

If I were Ask Jeeves, I'd take some of that stock and go on a buying spree. Daypop. Technorati. Feedster. Hyperbee. Gigablast. (If any of these are available.)

More...

Now I have to publicly say thanks to both Dave and Tara for the overwhelmingly strong votes of confidence. Now Jeremy had this to say:

What makes Dave think that Yahoo and Google's technology doesn't already "understand" RSS, I wonder? RSS is simple. Really simple. And structured. Hardly the mess that HTML is. It's not a really hard problem if you already have crawling infrastructure and the ability to query structured data.

More...

And just in case you want to see what others had to say, see this link which pulls together most of it:

Feedster Search

The Rebuttal from the Feedster Perspective

So the first question becomes "Why didn't I even mention it here? Is something afoot? Are you ashamed of your technology? Is there really nothing there?". Well the answers are simple:

I didn't mention it because both I and Francois have done this in the past -- sold companies successfully. And we both know just how hard it is to make happen. Whether we talk about it or not isn't going to make it happen although it could give us swelled heads. My preference is just plain relentless focus on building a great product (something interesting and major coming next week). If we do that then this all works out for everyone -- provided we don't get stupid*.
No, nothing is afoot. While Jeremy (who works at Yahoo) might know we exist, I doubt anyone else does. Blogging is interesting, sure, but the big money today is in stuff like Overture and Yahoo's doing the right things there.
No, we're not ashamed of our technology -- far from it.
Yes, there's real technology underneath (more below).

Every Problem Looks Simple -- Until You Hit the Real World

Now Jeremy says that RSS is just plain simple and, well, on the surface that may be true. But in the real world it isn't simple. RSS is a simple standard that damn few comply with correctly. There's RSS, there's RDF, there's RSS .9x and 2.0 and now Echo. And then there's everything that doesn't validate but users expect to work. And oh yeah, what about the melange of "standards" that Feedster supports -- blogchalk, geourls, etc.?
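To make the "which flavor is this?" problem concrete, here is a minimal Python sketch that sniffs a feed's format from its root element. It is an illustration only, not Feedster's code: the function name `detect_feed_flavor` and the labels it returns are made up for this example, and it assumes Echo / Atom feeds use a root element named `feed`.

```python
import xml.etree.ElementTree as ET

# Namespace used by the root element of RDF-based (RSS 1.0 style) feeds.
RDF_NS = "http://www.w3.org/1999/02/22-rdf-syntax-ns#"


def detect_feed_flavor(xml_text):
    """Return a rough label for the feed format (hypothetical helper)."""
    try:
        root = ET.fromstring(xml_text)
    except ET.ParseError:
        # Not well-formed XML: the "doesn't validate" pile.
        return "invalid"

    # ElementTree spells namespaced tags as "{namespace}localname".
    if root.tag == "{%s}RDF" % RDF_NS:
        return "rdf"

    local_name = root.tag.rsplit("}", 1)[-1]
    if local_name == "rss":
        # RSS 0.9x and 2.0 carry the version as an attribute on <rss>.
        return "rss " + root.get("version", "unknown")
    if local_name == "feed":
        # Assumption: Echo / Atom feeds use <feed> as the root element.
        return "echo/atom"
    return "unknown"
```

Even this toy version needs four branches plus an error path, and it says nothing yet about feeds that are almost but not quite well-formed.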

I guess what I am saying is that every problem looks simple -- until you hit the real world. Let me draw an analogy that is near and dear to Jeremy's heart -- MySQL Performance. On the surface this is simple too -- just fully normalize your data and add an index or two and it's all good, right? You'll have great performance, right? Well if so then why is Jeremy writing an (excellent) book on it? Well because in the real world, this isn't a simple problem. Neither is RSS.

More from the Comment I left on Jeremy's Blog

The big issue with any technology is always scaling it for the real world. Part of this is performance and part of this is making the technology deal with real live users. And we've done a good job here (I will pat my own back for this).

So What About Our Technology

I've seen our code referred to as "just a bunch of php scripts". Yeah right (laughs). Let's break it down and see what we're doing behind the scenes:

A user interface that is admittedly written in PHP. So what? Yahoo thought enough of PHP to hire Rasmus and recently another PHP heavy. So that can't be an issue. Sure the PHP is what you see from the external world but you probably have no idea about the rest of Feedster (read on MacDuff).
A network crawler / change monitor (written in Perl) that understands weblogs.com and a half dozen other change monitoring services.
Various and sundry data extraction / recognition routines for working with and applying metadata.
A back end data warehouse currently holding over 3,000,000 posts and 62,000 plus feeds.
Robust auto discovery routines to find RSS feeds. No I don't mean "read the link tag" -- I mean a real world approach that allows for all the myriad users who don't ever set it -- we know how virtually every major blogging tool generates RSS feeds and we can look for feeds of all sorts.
A fault tolerant RSS parser that allows for all the myriad ways people screw up their RSS feed. The first version of Feedster didn't handle XML errors well (it basically aborted the index for that record). Now we merrily continue on our way recognizing that while users may make mistakes, good software handles it and continues.
An extensible architecture at the user interface, crawler, engine and database level to let us quickly deploy new services and formats. Remember last week when I added Echo / Atom support? Know what took the most time? Figuring out the date and time transformations.
A portable full text search engine, written in C, capable of very significant search operations including boolean, wildcard, similarity, fields and more. Want to search Feedster for only blogs written in English sorted by descending date and looking for an expression with compound booleans including wildcards and precedence? It's all in there (Tara on this). When Francois and I merged Feedster and RSS Search, the engine technology that his company developed was one of, if not the, major reasons I pursued him and worked really hard to get the merger to happen. I've worked with lots of engines including the one that currently powers the US patent office and this one is both rock solid and damn good.
Oh and the search engine is fast. Raw search performance is generally sub 1 second, roughly .6 to .8 seconds per query. Displaying the result set and network i/o are the limits we have, not search performance.
A network / protocol / gateway layer for the search engine that is actually similar to how Apache operates with multiple servers that can be pre-launched for performance. This is key to our scaling Feedster into a multiple server / clustered server architecture. As soon as the load grows, we'll be able to add new boxes quite easily (right now Feedster is on 3 boxes by the way). In short we're able to have multiple query servers searching one or more replicated databases and still keep indexing fast (ever seen Feedster results when they're under a half hour old -- it's just plain cool).
SOAP API (ok we haven't announced it formally yet but it's there; how else could we support Apple's Sherlock?)
Oh and by the way between just Francois and myself, we have about 35 years of full text searching experience.
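The fault tolerant parser item above is the easiest one to illustrate. What follows is a rough Python sketch of the general idea (try a strict parse, fall back to salvaging whatever items you can, and never abort the run), not the actual Feedster parser. The names `parse_strict`, `parse_lenient` and `index_feed` are invented for the example, and it assumes plain, un-namespaced `<item>` elements as in RSS 0.9x / 2.0.

```python
import re
import xml.etree.ElementTree as ET

# Crude pattern for one <item>...</item> block; good enough for a sketch.
ITEM_RE = re.compile(r"<item\b.*?</item>", re.DOTALL | re.IGNORECASE)


def _post_from(item):
    """Pull the two fields this sketch cares about out of an <item> element."""
    return {
        "title": (item.findtext("title") or "").strip(),
        "link": (item.findtext("link") or "").strip(),
    }


def parse_strict(xml_text):
    """Parse the whole feed as XML; raises ET.ParseError on any error."""
    root = ET.fromstring(xml_text)
    return [_post_from(item) for item in root.iter("item")]


def parse_lenient(xml_text):
    """Salvage items one at a time from a feed that is not well-formed."""
    posts = []
    for match in ITEM_RE.finditer(xml_text):
        try:
            item = ET.fromstring(match.group(0))
        except ET.ParseError:
            continue  # this one item is unusable; keep going with the rest
        posts.append(_post_from(item))
    return posts


def index_feed(xml_text):
    """Return (posts, status) and never raise on malformed XML."""
    try:
        return parse_strict(xml_text), "ok"
    except ET.ParseError:
        return parse_lenient(xml_text), "salvaged"
```

Fed a feed with a bare ampersand in one title and a missing closing tag, this still returns the items that are intact and reports "salvaged" instead of throwing the whole feed away. A real parser would need to go a lot further (namespaced items, encodings, entities), which is rather the point.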

Now I've seen one person point out to me that we should have just written everything in straight C or Java. Well that's an opinion and it's worth what I paid for it. Personally I try not to be a language bigot and say "right tool for the job". Given that we're building a network application for a server farm we control as opposed to software for resale, I think we've actually made exactly the right choices.

So anyone can say whatever they want. And whatever happens, happens. I have no idea. What really matters is this:

I'm very, very happy with Feedster
Francois is very, very happy with Feedster
The industry is widely adopting us right down to building us into software like NewsMonster, SharpReader, BottomFeeder and others
And most of all -- our users are very happy with Feedster

*Heaven knows that I've fallen victim to that disease in the past. Hopefully I'm wiser now.

When: 1:08:07 PM | Permalink:

ATOM Support Added

Ok all I did was change the picture of the format. Still I did do **something**.

When: 7:46:30 AM | Permalink:

What If I Winer Watched You?

Although the recent controversy over the "Winer Watcher" (WW) has now ceased due to its shutdown, I'd like to raise the issues again. But before I do, here's a recap.

Dave Winer, the subject of the WW, is a prominent blogger who, like other people, sometimes changes his blog posts.
Mark Pilgrim is another prominent blogger who put up an application which tracks every post and shows you all versions of the same post.
Dave has called this stalking while Mark maintains that it is useful since Dave changes his posts and often to Mark's detriment.

Now to start you might be wondering why I'm bringing this up again. Well a Feedster user, one I respect quite a lot, has asked for us to implement this type of WW functionality -- but for anyone. So the question becomes: Evil or Not Evil? And I'm really not going to hash out the rights and wrongs here. That's not my job 'mon so to speak. What I am going to point out is this:

What felt unfair to me is that Mark didn't subject himself to the same microscope. He put up the application for Dave and Dave alone. Why didn't he subject his own posts to the same scrutiny? If Mark never changes his posts then we'll all know that once and for all. And if he does then isn't it just plain "more fair" to know that?
It would be technically trivial for Feedster to implement Winer Watcher. All we need to do is add a version bit to our database since we track updated posts anyway. What we currently do is update our posts table when a post is updated. It would be easy to just store the updated version, only index the latest version and then show on our cached posts page all versions of a post sorted by newest to oldest.
How would you feel if we did this not to just Dave but to anyone? How would you feel if we did it to you?
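For the curious, the "technically trivial" point above can be sketched in a few lines of Python with an in-memory SQLite database. This is only a sketch under assumptions: the table `post_versions`, its columns, and the helper names are hypothetical and are not Feedster's schema, and a version number column stands in for the "version bit". The idea is simply to insert a new row when a post changes instead of overwriting the old one, treat the highest version as the one to index, and list all versions newest to oldest for the cached posts page.

```python
import sqlite3


def open_store():
    """Create an in-memory store with one row per version of each post."""
    db = sqlite3.connect(":memory:")
    db.execute(
        """
        CREATE TABLE post_versions (
            post_url TEXT NOT NULL,
            version  INTEGER NOT NULL,
            body     TEXT NOT NULL,
            PRIMARY KEY (post_url, version)
        )
        """
    )
    return db


def record_post(db, post_url, body):
    """Store a new version only if the body changed; return the version number."""
    row = db.execute(
        "SELECT version, body FROM post_versions "
        "WHERE post_url = ? ORDER BY version DESC LIMIT 1",
        (post_url,),
    ).fetchone()
    if row is not None and row[1] == body:
        return row[0]  # unchanged since the last crawl, nothing to add
    version = row[0] + 1 if row is not None else 1
    db.execute(
        "INSERT INTO post_versions (post_url, version, body) VALUES (?, ?, ?)",
        (post_url, version, body),
    )
    return version


def latest_body(db, post_url):
    """The only version that would go into the search index."""
    row = db.execute(
        "SELECT body FROM post_versions "
        "WHERE post_url = ? ORDER BY version DESC LIMIT 1",
        (post_url,),
    ).fetchone()
    return row[0] if row is not None else None


def history(db, post_url):
    """All versions of a post, newest to oldest, for a cached posts page."""
    return db.execute(
        "SELECT version, body FROM post_versions "
        "WHERE post_url = ? ORDER BY version DESC",
        (post_url,),
    ).fetchall()
```

Which is exactly why the hard part of this question isn't the code.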

Please understand that I'm not saying that we are going to do this or we aren't. Just because technology makes something possible, that doesn't mean that it should be done. My biggest objection to Mark's WW was the lack of fairness -- he subjected one person and one person only to the Microscope and didn't reverse the tool onto himself. Quite honestly we saw the death of privacy a long time ago and now all we have is an illusion. Perhaps that illusion is a good one. I'm not certain. I do think there is a lot to be said for making people accountable for what they said. And if there is no record then there is no accountability. None. I personally think that one of the things that keeps debate in the blog world more civil than it is in other cyberspace forums is that there is a good record. Mailing lists and chatrooms, while they may have records, are never as easy to navigate as are blogs. Does a global WW actually lead to a better, more civilized blogspace? Is being accountable for what you say a bad thing?

Now let's think of this in another way. What about our elected officials? I mean Howard Dean is now a blogger. How good would it be to have a WW that tracked what Dean says? Or a WW that tracked what any politician said? Or is it that we only want that WW technology when it's applied to people we don't like?

Personally I come down on the "accountability is good" side of the argument. But that's me. I will admit, however, that I love the thought of a WW for our politicians.

Comments requested please. I really do think that this is an important issue.

When: 7:41:49 AM | Permalink:
