March 2003
Update
Well my brain has now turned to a frothy mush so I think I could do more harm than good right now. But seriously, I finished all the code that:
- Grabs changes.xml
- Parses it
- Grabs the RSS file for the URL if I know it, or guesses at it if I don't
- Parses the RSS file
- Indexes it
After running it like 3 times, we went from 1674 RSS feeds to 1950. And the number of indexed posts is now 19,834. I feel really good about this. Tomorrow I'll be running it on a more frequent basis and then it'll be a scheduled job once I add a bit more logging and error handling.
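For anyone curious, here is a minimal sketch of the first three steps, not the actual code. It assumes changes.xml is a list of `weblog` elements carrying `name`, `url` and `when` attributes, and the `KNOWN_FEEDS` table and `GUESS_PATHS` list are hypothetical stand-ins for the database lookup and the guessing logic.

```python
import xml.etree.ElementTree as ET
from urllib.parse import urljoin

# Tiny stand-in for a downloaded changes.xml (contents are made up).
SAMPLE_CHANGES = """<weblogUpdates version="1" count="2">
  <weblog name="Example Blog" url="http://example.com/blog/" when="120"/>
  <weblog name="Another Blog" url="http://example.org" when="300"/>
</weblogUpdates>"""

# Hypothetical table of feed URLs already in the database.
KNOWN_FEEDS = {"http://example.com/blog/": "http://example.com/blog/rss.xml"}

# Hypothetical file names to try for sites with no known feed.
GUESS_PATHS = ["rss.xml", "index.xml", "index.rdf"]


def parse_changes(xml_text):
    """Return (name, url, when) for each weblog entry in the file."""
    root = ET.fromstring(xml_text)
    return [
        (w.get("name"), w.get("url"), int(w.get("when", "0")))
        for w in root.iter("weblog")
    ]


def feed_candidates(site_url):
    """Use the known feed URL if there is one, otherwise guess."""
    if site_url in KNOWN_FEEDS:
        return [KNOWN_FEEDS[site_url]]
    base = site_url if site_url.endswith("/") else site_url + "/"
    return [urljoin(base, path) for path in GUESS_PATHS]


for name, url, when in parse_changes(SAMPLE_CHANGES):
    print(name, feed_candidates(url))
```

Each guessed URL would still have to be fetched and checked before it gets parsed and indexed; that part is left out here.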
First thing tomorrow: generating search results in RSS (most likely). Brent requested it and it just plain makes sense. I'm really not sure how to format it (is each search result an item? Probably). Any thoughts are hugely welcome.
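One way the "each search result is an item" idea could look, as a rough sketch only. The `results_to_rss` function and the `title`, `link` and `excerpt` keys are made up for illustration, and the output is a bare-bones RSS 2.0 channel.

```python
import xml.etree.ElementTree as ET


def results_to_rss(query, search_url, results):
    """Build an RSS 2.0 document with one <item> per search result."""
    rss = ET.Element("rss", version="2.0")
    channel = ET.SubElement(rss, "channel")
    ET.SubElement(channel, "title").text = "Search results for: " + query
    ET.SubElement(channel, "link").text = search_url
    ET.SubElement(channel, "description").text = "%d results" % len(results)
    for result in results:
        item = ET.SubElement(channel, "item")
        ET.SubElement(item, "title").text = result["title"]
        ET.SubElement(item, "link").text = result["link"]
        ET.SubElement(item, "description").text = result["excerpt"]
    # ElementTree escapes characters such as & and < in the text for us.
    return ET.tostring(rss, encoding="unicode")


# Made-up results for illustration.
sample = [
    {"title": "First post", "link": "http://example.com/1", "excerpt": "RSS & search"},
    {"title": "Second post", "link": "http://example.com/2", "excerpt": "More on feeds"},
]
print(results_to_rss("rss", "http://example.com/search?q=rss", sample))
```

Subscribing to the search URL would then show new matches as new items in an aggregator.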
10:48:39 PM
Update
I just wrote the first pass at connecting to weblogs.com automatically -- and it worked!!! Now I'm writing the RSS URL discoverer part so URLs that aren't in the database can be used also. The index is getting fresher! 18,716 posts indexed so far. Once this gets done and I can stop babying the load process, I'll make search results subscribable as an RSS feed.
7:56:57 PM
Future Directions
Ok. This is sort of a conversation with myself over what features to add next and trying to codify my thinking. Input is quite welcome though, just leave a comment.
- Fact: Users like simple clean search UIs.
- Response: True. All search engines are moving this way.
- Fact: Users enter simple queries.
- Response: Which means they will get too many results.
- Fact: Search overload happens
- Response: Add a "Filter" widget to the top of a result set so people can reduce the search results to something manageable
- Fact: RSS feeds expire items as they flow off the home page
- Response: One search filter option could be Filter Old, Filter Current, Filter to Only Today (I know the English ain't right)
- Fact: Filters need to be successive
- Response: Sigh but yup.
- Fact: People know the google search variables
- Response: I should support them for use with 3rd party engines / meta searching, etc. Inventing my own is worse, except where Google has no equivalent option.
- Fact: An XML-RPC api is a frothy good thing
- Response: That'll take longer but ok. Note to self: Check out Keith Devens' library for this
- Fact: Language detection is needed
- Response: Good ideas in that post. Thanks all!
- Fact: Need a preferences tab so you can set one or more RSS URLs to never be shown or searched
- Response: Yup. I sense PHP sessions coming shortly for faster development.
- Fact: Must move to an AND'd search.
- Response: In progress
- Fact: Take into account changes.xml analysis for generating frequency of change metrics
- Response: Partially done
- Fact: OPML loader needed
- Response: Most of the loader code is written. Handling for invalid OPML isn't written yet. Quotation marks are delimiters, not part of the damn name element! (or that's my thinking)
- Fact: Ability to view a properties sheet for a feed would be useful
- Response: Tie into changes.xml metrics
- Fact: Add and index RSS in real time
- Response: Mostly done but needs error handling and prettifying. The separate indexing process is eliminated
- Fact: Restrict search to my own blog would be useful
- Response: Heck I want this if nothing else
- Fact: Generate search results in RSS
- Response: Other than concerns about server load, that's cool
- Fact: Need ability to restrict searches to just entries in my OPML file / blogroll.
- Response: Yup. Needed.
- Question: Should I be worrying about integration w/ aggregators?
- Answer: Unknown. Example: Search from within NNW.
- Question: Is integration w/ blogging tools interesting?
- Answer: Probably. Andy has already asked for it for Blosxom. Need to think here.
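To make the OPML loader item above concrete, here is a minimal sketch of the easy half: pulling feed URLs out of a well-formed OPML file. It assumes feeds are stored in the `xmlUrl` attribute of `outline` elements, the sample data is made up, and it does nothing for invalid OPML, which would simply raise a parse error here.

```python
import xml.etree.ElementTree as ET

# Made-up blogroll with one nested group and one outline that has no feed.
SAMPLE_OPML = """<opml version="1.1">
  <head><title>Example blogroll</title></head>
  <body>
    <outline text="News">
      <outline text="Example Blog" xmlUrl="http://example.com/rss.xml"/>
    </outline>
    <outline text="Another Blog" xmlUrl="http://example.org/index.xml"/>
    <outline text="No feed here"/>
  </body>
</opml>"""


def load_feed_urls(opml_text):
    """Collect xmlUrl attributes from every outline, nested or not."""
    root = ET.fromstring(opml_text)
    return [o.get("xmlUrl") for o in root.iter("outline") if o.get("xmlUrl")]


print(load_feed_urls(SAMPLE_OPML))
```

The same list of URLs could drive both the loader and the "restrict searches to my blogroll" option.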
That's a start on a rough core dump of what little brain I have left. More later. Clearly I'm thinking in a model of "Oversearch and then subtract". That resonates with me. Comments? Other things are still summarized in comments here, not in this post. I'm particularly curious as to what features people think they need immediately versus later.
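A rough sketch of what "oversearch and then subtract" with successive filters might mean in code. Everything here is an assumption for illustration: results are plain dicts, `in_feed` is a made-up flag for "still present in the live RSS feed", and "Current" versus "Old" is defined by that flag.

```python
from datetime import date


def only_today(results, today):
    """Keep results posted on the given day."""
    return [r for r in results if r["posted"] == today]


def only_current(results):
    """Keep results that are still in the live feed."""
    return [r for r in results if r["in_feed"]]


def only_old(results):
    """Keep results that have already expired from the feed."""
    return [r for r in results if not r["in_feed"]]


def apply_filters(results, filters):
    """Apply each filter to the output of the previous one."""
    for f in filters:
        results = f(results)
    return results


# Made-up result set for illustration.
today = date(2003, 3, 10)
results = [
    {"title": "A", "posted": date(2003, 3, 10), "in_feed": True},
    {"title": "B", "posted": date(2003, 3, 8), "in_feed": True},
    {"title": "C", "posted": date(2003, 2, 1), "in_feed": False},
]
narrowed = apply_filters(results, [only_current, lambda rs: only_today(rs, today)])
print([r["title"] for r in narrowed])  # prints ['A']
```

Because each filter takes a result list and returns a result list, stacking them in any order is just a loop.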
2:17:47 PM
Update
Search engine is partially switched over now. Ranking is completely different (and better I think). Default relationship is still OR (that's changing, still testing). It now reports total hits and limits the results shown to the top 50 (if anyone needs more for now, contact me separately until the result set navigation is done).
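To show the difference between the OR default and an AND'd search, here is a toy in-memory sketch. It is not how the real engine works: the inverted index, the term-count ranking and the function names are all assumptions for illustration. It does report total hits and cap the list at 50, as described above.

```python
from collections import defaultdict

MAX_RESULTS = 50  # only the top 50 are returned


def build_index(posts):
    """Map each lowercased word to the set of post ids containing it."""
    index = defaultdict(set)
    for post_id, text in posts.items():
        for word in text.lower().split():
            index[word].add(post_id)
    return index


def search(index, posts, query, mode="OR"):
    """Return (total_hits, top_ids) for an OR or AND query."""
    terms = query.lower().split()
    if not terms:
        return 0, []
    sets = [index.get(t, set()) for t in terms]
    matches = set.intersection(*sets) if mode == "AND" else set.union(*sets)

    def score(post_id):
        # Naive ranking: how many times the query terms appear in the post.
        words = posts[post_id].lower().split()
        return sum(words.count(t) for t in terms)

    ranked = sorted(matches, key=lambda p: (-score(p), p))
    return len(matches), ranked[:MAX_RESULTS]


# Made-up posts for illustration.
posts = {1: "rss search engine", 2: "rss feeds everywhere", 3: "search tips"}
index = build_index(posts)
print(search(index, posts, "rss search", mode="OR"))   # prints (3, [1, 2, 3])
print(search(index, posts, "rss search", mode="AND"))  # prints (1, [1])
```

With OR, every post mentioning either word comes back; with AND, only the post containing both does, which is why AND cuts down on result overload.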
11:21:05 AM
R... Update
I just had the nicest email exchange with Cory Doctorow over at Boing Boing. Apparently he dealt with some of these Google trademark issues last week. On his advice I pulled the R.. logo entirely while I figure out a new domain and name. I have a temporary one but I don't know if it's right. Email me any thoughts you have on this.
Other:
- It now tells you that a search had no results
- I added a small faq and RSS overview to the about page [_Go_]
- Re-indexed and rebuilt. Something like 17,000-odd posts now. 1,500 URLs.
- I have the new search engine working but I'm getting false positives on some terms so I'm figuring that out.
Hacker breakfast today: Pizza and Caffeine.
9:48:02 AM
Morning and Roogle Metrics
Wow. We've hit Daypop Word Bursts (#1), Blogdex (#5) and there is the continuing traffic from Slashdot. We're seeing MySQL queries peaking at 64 per second now and up to 464,000 total hits with 19,000 unique IPs logged. And, of course, the load on the UserLand back end continues. Wow.
Interestingly, with all this, I found an article by CmdrTaco (who is one of the Slashdot creators) about all their RSS traffic. Cool. [_Go_]
Oh and with all the IMs I'm getting right now, someone recommended I start using DeadAIM. I did. It's cool. [_Go_]
Traffic to my blog has surprisingly stayed where it was (I'm still at #24). Is there latency in the server or do Slashdotters not read blogs? [_Go_]
And with all this, I haven't even had the time to read the Slashdot thread beyond like the first three messages. Can anyone tell me if it's positive or negative? We got 200 comments, which isn't bad for a Sunday I don't think -- it was the #3 most active Sunday topic.
Surprisingly Dave didn't mention that a Radio user and Radio author (me) was behind all this in his post. How odd. 1 and 2
Oh and as long as folks are bashing me for ripping off Google, at least it has a long history. Take a look at the Reverse Google. [_Go_]
We're no longer on the front page of Slashdot so the wild ride will probably start subsiding now.
Even as someone who has analyzed Slashdot traffic before, this is surprising. [_Go_]
6:57:49 AM
Night All
Ok. It's officially now 19 straight hours at the keyboard. The last database update for today is done. The database is up to 14,818 posts, 1,309 RSS URLs and all seems well and good. Am I 100% happy? Of course not. There's like a zillion and one things to do but "bit by bit" you know.
I'll be online and grinding on this all day tomorrow I suspect.
12:18:57 AM