September 21, 2002
8:11:45 PM More scraping thoughts ...
- Generic scraper should work with web pages and email (HTML and text)
- Output should be in some kind of RSS feed format
- Should be able to run from Radio Userland, but should not require Radio (prefer generic data->XML tool)
- It's better to use RegEx instead of XSLT because:
  - many source pages use poorly formed HTML, so an automatic conversion to XHTML may make things worse
  - some sources are plain text
- Patterns should be nestable, sequential, or both
- It should be easy to make multiple passes to extract information from different parts of a document
- Option to include or exclude the matched text itself from the extracted info
- Options to strip styles, tags (or maybe specify tags to retain)
- Syntax could follow XSLT ...
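
A rough sketch of how the ideas above might fit together, in Python. Everything here is an assumption for illustration: the function names (`strip_tags`, `scrape`, `to_rss_items`), the sample page, and the example.com URLs are all hypothetical. It handles only one level of nesting (an outer pattern that isolates each item, then field patterns run inside it) and emits bare `<item>` fragments, not a complete RSS feed.

```python
import re
from xml.sax.saxutils import escape

# Naive tag matcher: good enough for a sketch, and it tolerates sloppy HTML.
TAG_RE = re.compile(r"<[^>]+>")

def strip_tags(text, keep=()):
    """Remove tags, except those whose (lowercase) names are listed in keep."""
    def repl(m):
        name = re.match(r"</?\s*([A-Za-z0-9]+)", m.group(0))
        return m.group(0) if name and name.group(1).lower() in keep else ""
    return TAG_RE.sub(repl, text)

def scrape(text, outer, fields, keep=()):
    """Nested patterns: `outer` isolates each item, `fields` run inside it.

    Only capture group 1 is extracted, so the delimiters that were
    matched around it are excluded from the result.
    """
    items = []
    for block in re.finditer(outer, text, re.S | re.I):
        body = block.group(1)
        item = {}
        for name, pattern in fields.items():  # one pass per field
            m = re.search(pattern, body, re.S | re.I)
            if m:
                item[name] = strip_tags(m.group(1), keep).strip()
        items.append(item)
    return items

def to_rss_items(items):
    """Render each extracted dict as an RSS-style <item> element."""
    out = []
    for item in items:
        parts = "".join(f"<{k}>{escape(v)}</{k}>" for k, v in item.items())
        out.append(f"<item>{parts}</item>")
    return "\n".join(out)

# Poorly formed input on purpose: the second <li> is never closed.
page = ('<li><a href="http://example.com/a">First <b>story</b></a></li>'
        '<li><a href="http://example.com/b">Second</a>')

items = scrape(
    page,
    outer=r"<li>(.*?)(?=<li>|$)",
    fields={"title": r"<a[^>]*>(.*?)</a>", "link": r'href="([^"]+)"'},
)
print(to_rss_items(items))
```

Because the patterns never require well-formed markup, the same `scrape` call should work on plain-text sources such as email bodies; only the patterns change.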
© Copyright 2002 Eric Hartwell. Last update: 03/10/2002; 10:59:51 AM.
"Data! data! data!" he cried impatiently. "I can't make bricks without clay."
— Sherlock Holmes to Dr. Watson in "The Adventure of the Copper Beeches" by
Arthur Conan Doyle.
"I like deadlines," cartoonist Scott Adams once said. "I especially like the
whooshing sound they make as they fly by."
"There is nothing like that feeling of spending days and days banging your head
against a wall trying to solve a programming problem then suddenly finding that
one tiny obscure and seemingly unrelated piece of the puzzle that unlocks the
solution. Oh yeah!"
- Chris Maunder, CodeProject Newsletter 28 Jan 2002
"Management at eSnipe,
which is me, is also feeling the pain of the 2002 bear market. So rather than
pout about it, I bought some stuff on eBay that I really didn’t need, but made
me feel better."
- Tom Campbell, president of eSnipe