Trivial Thoughts
Thoughts and discussion on programming projects using the Python language.



Saturday, May 24, 2003
 

Quest for Massage, Part 3

Here's how to use ElementTree to parse HTML:

from elementtree import HTMLTreeBuilder, ElementTree
# Create an HTML-aware tree builder and hand it to parse() as the parser.
htmlParserObj = HTMLTreeBuilder.TreeBuilder()
treeObj = ElementTree.parse(file, parser=htmlParserObj)

'file' can be either a file name, or an open file object (or file-like object).  Just what I need. 
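To make the file-name-or-file-object point concrete, here is a minimal sketch. It assumes the standard library's xml.etree.ElementTree (the descendant of the elementtree package) rather than HTMLTreeBuilder, so it only handles well-formed markup. The sample document and the helper name root_tag are made up for illustration.

```python
import io
import os
import tempfile
import xml.etree.ElementTree as ET

# Hypothetical sample document; well-formed so the strict parser accepts it.
SAMPLE = b"<html><body><p>hello</p></body></html>"

def root_tag(source):
    """Parse `source` (a file name or a file-like object) and return the root tag."""
    return ET.parse(source).getroot().tag

# 1. Pass an open file-like object.
tag_from_object = root_tag(io.BytesIO(SAMPLE))

# 2. Pass a file name (written to a temporary file first).
with tempfile.NamedTemporaryFile(suffix=".html", delete=False) as tmp:
    tmp.write(SAMPLE)
try:
    tag_from_name = root_tag(tmp.name)
finally:
    os.remove(tmp.name)

print(tag_from_object, tag_from_name)
```

Either way parse() ends up reading the same bytes, which is why the object returned by urlopen can be handed to it directly.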

So, to put this all together and actually parse a web page:

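from urllib2 import urlopen
from elementtree import ElementTree, HTMLTreeBuilder
# Fetch the page; urlopen returns a file-like object that parse() accepts.
targetURL = "http://www.cnn.com"
urlFileObj = urlopen(targetURL)
# Parse the response with the HTML tree builder, then print the resulting tree.
htmlParserObj = HTMLTreeBuilder.TreeBuilder()
treeObj = ElementTree.parse(urlFileObj, parser=htmlParserObj)
ElementTree.dump(treeObj)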

Once the web page is parsed into a tree of elements, we just dump it out to the console so we can see what we got.  Sadly, what we get is not a tree but a traceback:

...much traceback elided...
AssertionError: end tag mismatch (expected input, got td)

After some investigation, I came to the conclusion that ElementTree's HTML parser needs really well-formed HTML.  As the error shows, it choked on an 'input' element that was still open when the closing 'td' tag arrived.  To solve this problem, ElementTree provides a way of using the standard HTML fixer-upper utility 'Tidy'.  More on this in part 4.
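The failure is easy to reproduce in miniature. The sketch below uses the standard library's strict xml.etree.ElementTree parser as a stand-in. That is an assumption: it is not HTMLTreeBuilder, and its error type and message differ. The two fragments are made up, and the sloppy one has the same problem the traceback reports, an input tag that is never closed before its td ends.

```python
import xml.etree.ElementTree as ET

# Hypothetical fragments: typical real-world HTML versus a well-formed version.
SLOPPY = "<table><tr><td><input type='text'></td></tr></table>"
CLEAN = "<table><tr><td><input type='text' /></td></tr></table>"

def parses_cleanly(markup):
    """Return True if the strict parser accepts `markup`, False otherwise."""
    try:
        ET.fromstring(markup)
        return True
    except ET.ParseError:
        return False

print(parses_cleanly(SLOPPY))  # the unclosed <input> trips the parser
print(parses_cleanly(CLEAN))   # the self-closed <input /> is accepted
```

Most pages in the wild look like the first fragment, which is why a clean-up pass before parsing is attractive.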

 


11:16:44 AM


© Copyright 2003 Michael Kent.
Last update: 6/16/2003; 9:23:41 PM.