InfoScraper
Tools and techniques to extract information from web pages and newsletters



 September 26, 2002
  11:22:03 AM  

InfoScraper design

Frustrated by information overload, but encouraged by the aggregation and filtering capabilities of RSS feeds, I've been looking for a tool to convert existing newsletters and other sources into tidy RSS. I expect that pretty soon most newsletters will offer RSS format in addition to Text and HTML, but I don't want to wait.

I tried a number of existing tools, but none of them does everything I want. In particular, none can convert both text and HTML email.

I decided to use regular expressions instead of XSLT because many source pages use poorly formed HTML, so an automatic conversion to XHTML may make things worse. Not to mention that some sources are plain text.
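
As a rough illustration of why regular expressions cope with markup that an XML toolchain would reject, the sketch below pulls link URLs and labels out of a deliberately sloppy fragment (unclosed tags, mixed case, an unquoted attribute). The sample markup, the LINK pattern and the extract_links name are all invented for this example; it is not the InfoScraper code.

```python
import re

# Deliberately sloppy markup: unclosed <p> and <a>, mixed-case tags,
# one unquoted href. An XML parser would refuse this input outright.
SLOPPY_HTML = """<P><a href=http://example.com/one>First headline<br>
<p><B>Second</b> headline <A HREF="http://example.com/two">more</a>
"""

# Hypothetical pattern: an href value (quoted or not), then the label,
# which runs up to the next tag of any kind.
LINK = re.compile(
    r'<a\s+href\s*=\s*["\']?([^"\'\s>]+)["\']?[^>]*>(.*?)(?=<)',
    re.IGNORECASE | re.DOTALL,
)


def extract_links(text):
    """Return (url, label) pairs for every link the pattern can find."""
    return [(url, label.strip()) for url, label in LINK.findall(text)]


if __name__ == "__main__":
    for url, label in extract_links(SLOPPY_HTML):
        print(url, "|", label)
```

The pattern never checks that tags are balanced, so the same approach also applies to plain text sources, where there are no tags at all and only the pattern changes.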

First exercise: comics

Assorted observations:

  • One level of items is not enough: some sources need at least two, applied conditionally
  • Need to strip some HTML formatting (keep basic <H>, <B>, <I> but not Word markup)
  • Need to drop some items automatically (e.g. "Your Feedback is Important")
  • Need to combine onChange and filter scripts
  • Want to add a DHTML hide/show script to limit the initially visible length of posts
  • Generic scraper should work with web pages and email (HTML and text)
  • Output should be in some kind of RSS feed format
  • Should be able to run from Radio UserLand, but should not require Radio (prefer a generic data->XML tool)
  • It's better to use RegEx instead of XSLT because:
    • many source pages use poorly formed HTML, so an automatic conversion to XHTML may make things worse
    • some sources are plain text
  • Patterns should be nestable, sequential, or both
    • It should be easy to make multiple passes to extract information from different parts of a document
  • Text matched by a pattern can be either included in or excluded from the extracted info
  • Options to strip styles, tags (or maybe specify tags to retain)
  • Syntax could follow XSLT ...
  • The whole thing needs to be table-driven, starting with the feed identifier, the RSS header info, and the collection of patterns for the items.
  • Naturally, the specification table will be XML. This means we can use an XML parser to search and process the table.
  • For email, the channel can be determined from the "From" and "Subject" fields, and the <pubDate> from the "Sent" field.
  • It might be useful to specify a pattern for items to be ignored.
  • It would be useful to have a way to highlight special keywords, and/or items containing keywords
  • Search for start pattern
  • Search for end pattern
  • Extract body (start-end, inclusive or exclusive).
    If the patterns are included in the body, then this step is a simple regular expression: {start}.*{end}
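
The table-driven idea and the start/end extraction steps above can be sketched as follows. This is only a minimal illustration under stated assumptions: the specification layout (the feed, channel, item and ignore elements and the start, end and inclusive attributes) is hypothetical, as are the sample newsletter text and the extract_items name. One deliberate difference from {start}.*{end}: the sketch uses the non-greedy .*? with dot-matches-newline, because a greedy .* would run from the first start pattern to the last end pattern when a document holds several items.

```python
import re
import xml.etree.ElementTree as ET

# Hypothetical specification table. Every element and attribute name
# here is invented for illustration.
SPEC_XML = r"""
<feed id="example-newsletter">
  <channel>
    <title>Example Newsletter</title>
    <link>http://example.com/</link>
  </channel>
  <item start="\[item\]" end="\[end\]" inclusive="no"/>
  <ignore>Your Feedback is Important</ignore>
</feed>
"""

# Made-up plain text newsletter with three items, one of them unwanted.
TEXT = """Weekly digest
[item] First story
Body of the first story.
[end]
[item] Your Feedback is Important
Please take our survey.
[end]
[item] Second story
Body of the second story.
[end]
"""


def extract_items(spec_xml, text):
    """Return item bodies found between the start and end patterns."""
    spec = ET.fromstring(spec_xml)
    rule = spec.find("item")
    start, end = rule.get("start"), rule.get("end")
    if rule.get("inclusive") == "yes":
        template = "((?:%s).*?(?:%s))"   # keep the delimiters
    else:
        template = "(?:%s)(.*?)(?:%s)"   # drop the delimiters
    # Assumes the start/end patterns contain no capturing groups of
    # their own, so group 1 is always the item body.
    pattern = re.compile(template % (start, end), re.DOTALL)
    ignore = [re.compile(node.text) for node in spec.findall("ignore")]

    items = []
    for match in pattern.finditer(text):
        body = match.group(1).strip()
        if any(rx.search(body) for rx in ignore):
            continue  # item matched an "ignore" pattern
        items.append(body)
    return items


if __name__ == "__main__":
    for item in extract_items(SPEC_XML, TEXT):
        print(repr(item))
```

Because the patterns live in the XML table rather than in the code, a second feed would only need a second table. Nested or multi-pass patterns are not covered by this sketch.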

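For the email case, deriving the channel from the message headers might look like the sketch below. It assumes the "Sent" field shown by a mail client corresponds to the message's Date header, and the sample message, the channel_info name and the choice of which header feeds which RSS element are all assumptions made for illustration.

```python
from email import message_from_string
from email.utils import format_datetime, parseaddr, parsedate_to_datetime

# Made-up message; only the headers matter for this sketch.
RAW_MESSAGE = """From: Example News <news@example.com>
Subject: Example News Weekly, Issue 12
Date: Thu, 26 Sep 2002 11:22:03 -0400

Body text.
"""


def channel_info(raw_message):
    """Map email headers to a few RSS channel fields (hypothetical mapping)."""
    msg = message_from_string(raw_message)
    name, address = parseaddr(msg["From"])
    sent = parsedate_to_datetime(msg["Date"])
    return {
        "title": name or address,          # sender name identifies the channel
        "description": msg["Subject"],
        "pubDate": format_datetime(sent),  # RFC 2822 style date string
    }


if __name__ == "__main__":
    for key, value in channel_info(RAW_MESSAGE).items():
        print(key + ":", value)
```

A real feed table would probably match the From and Subject values against patterns to pick the channel, rather than copying them straight through as this sketch does.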

© Copyright 2002 Eric Hartwell.
Last update: 03/10/2002; 10:59:53 AM.


"Data! data! data!" he cried impatiently. "I can't make bricks without clay."
— Sherlock Holmes to Dr. Watson in "The Adventure of the Copper Beeches" by Arthur Conan Doyle. 


"I like deadlines," cartoonist Scott Adams once said. "I especially like the whooshing sound they make as they fly by."


"There is nothing like that feeling of spending days and days banging your head against a wall trying to solve a programming problem then suddenly finding that one tiny obscure and seemingly unrelated piece of the puzzle that unlocks the solution. Oh yeah!"

- Chris Maunder, CodeProject Newsletter 28 Jan 2002


"Management at eSnipe, which is me, is also feeling the pain of the 2002 bear market. So rather than pout about it, I bought some stuff on eBay that I really didn’t need, but made me feel better."

- Tom Campbell, president of eSnipe