Eric Hartwell's InfoWeb

InfoScraper
Tools and techniques to extract information from web pages and newsletters

September 26, 2002

11:22:03 AM
InfoScraper design

Frustrated by information overload, but encouraged by the aggregation and filtering capabilities of RSS feeds, I've been looking for a tool to convert existing newsletters and other sources into tidy RSS. I expect that pretty soon most newsletters will offer RSS format in addition to Text and HTML, but I don't want to wait.

I tried a number of existing tools, but none of them does everything I want. In particular, none can convert both text and HTML email.

I decided to use regular expressions instead of XSLT. XSLT needs well-formed XML as input, and many source pages use poorly formed HTML, so an automatic conversion to XHTML may make things worse. Not to mention that some sources are plain text.

First exercise: comics

Assorted observations:

One level of items is not enough; at least two are needed, conditionally
Need to strip some HTML formatting (keep basic <H>, <B>, <I> but not Word markup)
Need to drop some items automatically (e.g. "Your Feedback is Important")
Need to combine the onChange and filter scripts
Want to add a DHTML hide/show script to limit the initially visible length of posts
Generic scraper should work with web pages and email (HTML and text)
Output should be in some kind of RSS feed format
Should be able to run from Radio Userland, but should not require Radio (prefer generic data->XML tool)
It's better to use RegEx instead of XSLT because:

many source pages use poorly formed HTML, so an automatic conversion to XHTML may make things worse
some sources are plain text
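
A minimal sketch of the tag-stripping idea noted above, in Python and using only regular expressions. The set of retained tags, the function name strip_tags, and the sample markup are assumptions made for illustration; the input is deliberately not well-formed (the <p> is never closed), which is the case where a regex pass is more forgiving than an XHTML conversion.

```python
import re

# Tags to keep; everything else is dropped. This set is an assumption
# based on the "keep basic <H>, <B>, <I>" note, not a fixed specification.
KEEP = {"h1", "h2", "h3", "h4", "h5", "h6", "b", "i"}

# Matches any opening or closing tag and captures its name
# (including namespaced Word tags such as <o:p>).
TAG = re.compile(r"</?\s*([a-zA-Z][a-zA-Z0-9:]*)[^>]*>")

def strip_tags(html, keep=KEEP):
    """Remove every tag not in `keep`; kept tags lose their attributes."""
    def replace(match):
        name = match.group(1).lower()
        if name not in keep:
            return ""
        closing = match.group(0).startswith("</")
        return f"</{name}>" if closing else f"<{name}>"
    return TAG.sub(replace, html)

sample = '<p class=MsoNormal><b style="x">Hello</B> <o:p>world</o:p><i>!</i>'
print(strip_tags(sample))  # prints: <b>Hello</b> world<i>!</i>
```

Only the tags are removed; the text between them is left untouched, so the same function is harmless on plain-text sources.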

Patterns should be nestable, sequential, or both

It should be easy to make multiple passes to extract information from different parts of a document
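
One way nested patterns and multiple passes might look, as a rough Python sketch. The newsletter layout, the "== NAME ==" section markers, and the function name extract are all invented for illustration: an outer pattern isolates each section, then an inner pattern runs over each section body.

```python
import re

# Hypothetical newsletter text; the markers and layout are invented.
doc = """\
== NEWS ==
* First headline
* Second headline
== TIPS ==
* A tip
== END ==
"""

# Pass 1 (outer pattern): capture a section name and its body, stopping
# just before the next line that starts a section.
SECTION = re.compile(r"^== (\w+) ==\n(.*?)(?=^== )", re.S | re.M)

# Pass 2 (nested pattern): pull the bulleted items out of one section body.
ITEM = re.compile(r"^\* (.+)$", re.M)

def extract(text):
    """Return {section name: [items]} using the two passes above."""
    return {name: ITEM.findall(body) for name, body in SECTION.findall(text)}

print(extract(doc))
```

Here the two levels correspond to the "at least 2 levels of items" observation: sections are the outer level and bullets the inner one.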

The text matched by a pattern can be either included in or excluded from the extracted info
Options to strip styles, tags (or maybe specify tags to retain)
Syntax could follow XSLT ...
The whole thing needs to be table-driven, starting with the feed identifier, the RSS header info, and the collection of patterns for the items.
Naturally, the specification table will be XML. This means we can use an XML parser to search and process the table.
For email, the channel can be determined from the "From" and "Subject" fields, and the <pubDate> from the "Sent" field.
It might be useful to specify a pattern for items to be ignored.
It would be useful to have a way to highlight special keywords, and/or items containing keywords
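
A possible shape for the XML specification table, sketched in Python with the standard-library XML parser. Every element and attribute name in it (feed, channel, item, ignore, pattern) is hypothetical, as are the function names load_spec and scrape; the notes above only say that the table holds the feed identifier, the RSS header info, and the item patterns.

```python
import re
import xml.etree.ElementTree as ET

# Hypothetical specification table; all names here are invented.
SPEC = r"""<feed id="example-newsletter">
  <channel>
    <title>Example Newsletter</title>
    <link>http://example.com/</link>
  </channel>
  <item pattern="^\* (.+)$"/>
  <ignore pattern="Your Feedback is Important"/>
</feed>
"""

def load_spec(xml_text):
    """Parse the table into the feed id, RSS header info, and compiled patterns."""
    root = ET.fromstring(xml_text)
    return {
        "id": root.get("id"),
        "channel": {child.tag: child.text for child in root.find("channel")},
        "items": [re.compile(e.get("pattern"), re.M) for e in root.findall("item")],
        "ignore": [re.compile(e.get("pattern")) for e in root.findall("ignore")],
    }

def scrape(text, spec):
    """Return item bodies matched by the spec, minus any that match an ignore pattern."""
    found = []
    for pattern in spec["items"]:
        for match in pattern.finditer(text):
            body = match.group(1)
            if not any(ig.search(body) for ig in spec["ignore"]):
                found.append(body)
    return found

spec = load_spec(SPEC)
print(scrape("* Headline one\n* Your Feedback is Important\n* Headline two\n", spec))
```

Keeping the patterns in the table means a new source needs a new XML entry rather than new code.
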
Search for start pattern
Search for end pattern
Extract body (start-end, inclusive or exclusive).
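If the patterns are included in the body, then this step is a single regular expression: {start}.*?{end}. The non-greedy .*? stops at the first end pattern; a greedy .* would run on to the last end pattern in the document.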
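
The three steps can be sketched in Python roughly as follows. The function name extract_body, its inclusive flag, and the comment markers in the sample text are assumptions for illustration; start and end are passed in as regular-expression fragments.

```python
import re

def extract_body(text, start, end, inclusive=True):
    """Return the first span running from `start` to `end`, or None.

    With inclusive=True the matched start and end text is part of the
    result; with inclusive=False only the text between them is returned.
    """
    # re.S lets the body span several lines.
    pattern = re.compile(
        f"(?P<start>{start})(?P<body>.*?)(?P<end>{end})", re.S
    )
    match = pattern.search(text)
    if match is None:
        return None
    return match.group(0) if inclusive else match.group("body")

# Hypothetical source text with invented markers.
text = "intro <!--begin-->Item one<!--end--> filler <!--begin-->Item two<!--end-->"

print(extract_body(text, "<!--begin-->", "<!--end-->"))
print(extract_body(text, "<!--begin-->", "<!--end-->", inclusive=False))
```

Markers that contain regex metacharacters would need to be escaped with re.escape before being passed in.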

"Data! data! data!" he cried impatiently. "I can't make bricks without clay."
- Sherlock Holmes to Dr. Watson in "The Adventure of the Copper Beeches" by Arthur Conan Doyle.

"I like deadlines," cartoonist Scott Adams once said. "I especially like the whooshing sound they make as they fly by."

"There is nothing like that feeling of spending days and days banging your head against a wall trying to solve a programming problem then suddenly finding that one tiny obscure and seemingly unrelated piece of the puzzle that unlocks the solution. Oh yeah!"

- Chris Maunder, CodeProject Newsletter 28 Jan 2002

"Management at eSnipe, which is me, is also feeling the pain of the 2002 bear market. So rather than pout about it, I bought some stuff on eBay that I really didn't need, but made me feel better."

- Tom Campbell, president of eSnipe