Python Projects
Projects I'm working on in the Python programming language.





 

 

Saturday, May 24, 2003
 

Quest for Massage, Part 3

Here's how to use ElementTree to parse HTML:

from elementtree import HTMLTreeBuilder, ElementTree
htmlParserObj = HTMLTreeBuilder.TreeBuilder()
treeObj = ElementTree.parse(file, parser=htmlParserObj)

'file' can be either a file name, or an open file object (or file-like object).  Just what I need. 
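The file-name-or-file-object point can be sketched as follows. This is only an illustration under some assumptions: it uses Python 3 and the standard library's xml.etree.ElementTree (the descendant of the elementtree package) with its default XML parser, not HTMLTreeBuilder. The sample markup and the helper name root_tag are made up for the example.

```python
import io
import os
import tempfile
import xml.etree.ElementTree as ET

# Hypothetical, well-formed sample markup (not taken from a real page).
SAMPLE = b"<html><body><p>hello</p></body></html>"

def root_tag(source):
    """Parse 'source' (a file name or a file-like object) and return the root tag."""
    tree = ET.parse(source)
    return tree.getroot().tag

# 1. Pass an open file-like object.
tag_from_object = root_tag(io.BytesIO(SAMPLE))

# 2. Pass a file name (written to a temporary file first, removed afterwards).
with tempfile.NamedTemporaryFile(suffix=".html", delete=False) as tmp:
    tmp.write(SAMPLE)
tag_from_name = root_tag(tmp.name)
os.remove(tmp.name)

print(tag_from_object, tag_from_name)
```

Either way the caller gets the same tree back, which is why the object returned by urlopen can be handed straight to parse.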

So, to put this all together and actually parse a web page:

from urllib2 import urlopen
from elementtree import ElementTree, HTMLTreeBuilder
targetURL = "http://www.cnn.com"
# urlopen returns a file-like object, which parse() accepts directly
urlFileObj = urlopen(targetURL)
htmlParserObj = HTMLTreeBuilder.TreeBuilder()
treeObj = ElementTree.parse(urlFileObj, parser=htmlParserObj)
# print the parsed tree to the console
ElementTree.dump(treeObj)

Once we parse the web page into a tree of elements, we just dump it out to the console so we can see what we got.  Sadly, we get a mess:

...much traceback elided...
AssertionError: end tag mismatch (expected input, got td)

After some investigation, I came to the conclusion that ElementTree's HTML parser needs really well-formed HTML.  To solve this problem, ElementTree provides a way of using the standard HTML fixer-upper utility 'Tidy'.  More on this in part 4.
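To get a feel for the kind of markup that trips up a strict parser, here is a minimal sketch. It assumes Python 3 and the standard library's xml.etree.ElementTree XML parser, which is not the same parser as HTMLTreeBuilder and reports a differently worded error. The two fragments are hypothetical, modeled on what the assertion message suggests: an 'input' tag that is never closed inside a 'td'.

```python
import xml.etree.ElementTree as ET

# Hypothetical fragments: browsers accept the first, a strict parser does not.
SLOPPY = "<table><tr><td><input type='text'></td></tr></table>"
TIDIED = "<table><tr><td><input type='text' /></td></tr></table>"

def try_parse(markup):
    """Return the root tag on success, or None if the parser rejects the markup."""
    try:
        return ET.fromstring(markup).tag
    except ET.ParseError:
        # The unclosed <input> makes the following </td> a mismatched end tag.
        return None

sloppy_result = try_parse(SLOPPY)
tidied_result = try_parse(TIDIED)
print(sloppy_result, tidied_result)
```

The only difference between the two fragments is the self-closing 'input', which is the sort of repair a cleanup pass is expected to make before parsing.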

 


11:16:44 AM


© Copyright 2003 Michael Kent.
Last update: 6/26/2003; 12:13:56 PM.