September 26, 2002
11:22:03 AM InfoScraper design
Frustrated by information overload, but encouraged by the aggregation and filtering capabilities of RSS feeds, I've been looking for a tool to convert existing newsletters and other sources into tidy RSS. I expect that pretty soon most newsletters will offer RSS format in addition to Text and HTML, but I don't want to wait.
I tried a number of existing tools, but none of them does everything I want, especially converting both text and HTML email.
I decided to use regular expressions instead of XSLT because many source pages use poorly formed HTML, so an automatic conversion to XHTML may make things worse. Not to mention that some sources are plain text.
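As a rough illustration of that trade-off, the sketch below runs a regular expression over a deliberately sloppy fragment that an XML parser rejects. The sample markup, the `HEADLINE` pattern, and both function names are made up for this example; a real newsletter would need its own pattern.

```python
import re
import xml.etree.ElementTree as ET

# Hypothetical newsletter fragment: unclosed tags, mixed-case tag names,
# and an unquoted attribute value. None of this is valid XML.
SAMPLE = """<P><B>First headline<br>
Some text about the first item
<p><b>Second headline</B><BR>
More text<font size=2>
"""

# Capture the text after an opening <b> tag, up to the next tag of any kind.
HEADLINE = re.compile(r"<b>\s*([^<]+?)\s*<", re.IGNORECASE)


def headlines(html):
    """Return the bold headline texts found in the markup."""
    return HEADLINE.findall(html)


def is_well_formed(markup):
    """Report whether a strict XML parser accepts the markup."""
    try:
        ET.fromstring(markup)
        return True
    except ET.ParseError:
        return False


if __name__ == "__main__":
    print(is_well_formed(SAMPLE))
    print(headlines(SAMPLE))
```

The regular expression never needs the document to be well formed; it only needs the small region around each headline to look the way the pattern expects.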
First exercise: comics
Assorted observations:
- one level of items is not enough - need to have at least 2, conditionally
- need to strip some HTML formatting (keep basic <H>, <B>, <I> but not Word markup)
- need to drop some items automatically (Your Feedback is Important)
- need to combine onChange and filter scripts
- want to add DHTML hide/show script to limit initially visible length of posts
- Generic scraper should work with web pages and email (HTML and text)
- Output should be in some kind of RSS feed format
- Should be able to run from Radio Userland, but should not require Radio (prefer generic data->XML tool)
- It's better to use RegEx instead of XSLT because:
  - many source pages use poorly formed HTML, so an automatic conversion to XHTML may make things worse
  - some sources are plain text
- Patterns should be nestable and/or sequential (and/or)
- It should be easy to make multiple passes to extract information from different parts of a document
- Matched text can be either included in or excluded from the extracted info
- Options to strip styles, tags (or maybe specify tags to retain)
- Syntax could follow XSLT ...
- The whole thing needs to be table-driven, starting with the feed identifier, the RSS header info, and the collection of patterns for the items.
- Naturally, the specification table will be XML. This means we can use an XML parser to search and process the table.
- For email, the channel can be determined from the "From" and "Subject" fields, and the <pubDate> from the "Sent" field.
- It might be useful to specify a pattern for items to be ignored.
- it would be useful to have a way to highlight special keywords, and/or items containing keywords
- Search for start pattern
- Search for end pattern
- Extract body (start-end, inclusive or exclusive).
If the patterns are included in the body, then this step is a simple regular expression: {start}.*{end}
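The start/end extraction step might look something like the sketch below. The function name `extract_body` and its `inclusive` flag are assumptions for illustration. It also makes two choices the notes above do not specify: the match is non-greedy (`.*?`), so it stops at the first end pattern rather than the last, and `re.DOTALL` is set so that `.` can span line breaks.

```python
import re


def extract_body(text, start, end, inclusive=True):
    """Return the text between the first `start` match and the next `end` match.

    `start` and `end` are regular expression strings.
    Returns None if either pattern is not found.
    """
    if inclusive:
        # Single-expression form: {start}.*{end}, made non-greedy here.
        m = re.search(f"(?:{start}).*?(?:{end})", text, re.DOTALL)
        return m.group(0) if m else None

    # Exclusive form: find the start, then look for the end after it,
    # and return only what lies between the two matches.
    s = re.search(start, text)
    if not s:
        return None
    e = re.compile(end).search(text, s.end())
    if not e:
        return None
    return text[s.end():e.start()]


if __name__ == "__main__":
    page = "intro\n<!-- begin -->\nHello\nWorld\n<!-- end -->\nfooter"
    print(repr(extract_body(page, r"<!-- begin -->", r"<!-- end -->")))
    print(repr(extract_body(page, r"<!-- begin -->", r"<!-- end -->", inclusive=False)))
```

With a greedy `.*`, a page containing several end markers would yield one oversized body running to the last marker, which is probably not what a multi-item newsletter needs.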
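The table-driven idea could be prototyped along these lines. Everything about the layout is hypothetical: the element and attribute names (`feeds`, `feed`, `match`, `channel`, `item`, `ignore`), the sample address, and the function names are invented for this sketch. It reads the "Date" header for `<pubDate>`, on the assumption that this is what a mail client displays as "Sent".

```python
import email
import re
import xml.etree.ElementTree as ET

# Hypothetical specification table: one <feed> per source.
SPEC_XML = r"""
<feeds>
  <feed id="example-news">
    <match from="news@example\.com" subject="Weekly Digest"/>
    <channel>
      <title>Example Weekly Digest</title>
      <link>http://example.com/</link>
    </channel>
    <item start="^== " end="^-- $"/>
    <ignore pattern="Your Feedback is Important"/>
  </feed>
</feeds>
"""


def load_specs(xml_text):
    """Parse the specification table into a list of plain dictionaries."""
    root = ET.fromstring(xml_text.strip())
    specs = []
    for feed in root.findall("feed"):
        match = feed.find("match")
        item = feed.find("item")
        specs.append({
            "id": feed.get("id"),
            "from": re.compile(match.get("from"), re.IGNORECASE),
            "subject": re.compile(match.get("subject"), re.IGNORECASE),
            "title": feed.findtext("channel/title"),
            "item_start": re.compile(item.get("start"), re.MULTILINE),
            "item_end": re.compile(item.get("end"), re.MULTILINE),
            "ignore": [re.compile(i.get("pattern")) for i in feed.findall("ignore")],
        })
    return specs


def find_feed(specs, sender, subject):
    """Return the first spec whose From and Subject patterns both match."""
    for spec in specs:
        if spec["from"].search(sender) and spec["subject"].search(subject):
            return spec
    return None


def channel_for_message(specs, raw_message):
    """Pick the feed for a raw email and return (spec, pubDate string)."""
    msg = email.message_from_string(raw_message)
    spec = find_feed(specs, msg["From"] or "", msg["Subject"] or "")
    return spec, msg["Date"]


if __name__ == "__main__":
    raw = (
        "From: Example News <news@example.com>\n"
        "Subject: Weekly Digest #42\n"
        "Date: Thu, 26 Sep 2002 11:22:03 -0400\n"
        "\n"
        "== First item\n"
    )
    spec, pub_date = channel_for_message(load_specs(SPEC_XML), raw)
    print(spec["id"], pub_date)
```

Keeping the patterns in the table means a new source needs only a new `<feed>` entry, not new code.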
© Copyright 2002 Eric Hartwell.
Last update: 03/10/2002; 10:59:53 AM.
"Data! data! data!" he cried impatiently. "I can't make bricks without clay."
— Sherlock Holmes to Dr. Watson in "The Adventure of the Copper Beeches" by
Arthur Conan Doyle.
"I like deadlines," cartoonist Scott Adams once said. "I especially like the
whooshing sound they make as they fly by."
"There is nothing like that feeling of spending days and days banging your head
against a wall trying to solve a programming problem then suddenly finding that
one tiny obscure and seemingly unrelated piece of the puzzle that unlocks the
solution. Oh yeah!"
- Chris Maunder, CodeProject Newsletter 28 Jan 2002
"Management at eSnipe, which is me, is also feeling the pain of the 2002 bear
market. So rather than pout about it, I bought some stuff on eBay that I really
didn’t need, but made me feel better."
- Tom Campbell, president of eSnipe