Trivial Thoughts

Python Projects
Projects I'm working on in the Python programming language.


Saturday, May 24, 2003

Quest for Massage, Part 3

Here's how to use ElementTree to parse HTML:
from elementtree import HTMLTreeBuilder, ElementTree
htmlParserObj = HTMLTreeBuilder.TreeBuilder()
treeObj = ElementTree.parse(file, parser=htmlParserObj)

'file' can be either a file name, or an open file object (or file-like object). Just what I need.
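
To make the file-name-or-file-object point concrete, here's a small sketch. It is not the code from this post: it uses xml.etree.ElementTree, the standard-library descendant of the elementtree package, so it runs on a current Python without extra installs, and it feeds the parser well-formed markup rather than real-world HTML. The sample markup and the helper name root_tag_of are made up for illustration.

```python
import io
import os
import tempfile
import xml.etree.ElementTree as ET

# Made-up, well-formed sample markup (an assumption for illustration).
SAMPLE = b"<html><body><p>hello</p></body></html>"


def root_tag_of(source):
    """Parse `source` (a file name or a file-like object) and return the root tag."""
    return ET.parse(source).getroot().tag


# 1. Pass an open file-like object.
tag_from_object = root_tag_of(io.BytesIO(SAMPLE))

# 2. Pass a file name: write the same bytes to a temporary file first.
with tempfile.NamedTemporaryFile(suffix=".html", delete=False) as tmp:
    tmp.write(SAMPLE)
try:
    tag_from_name = root_tag_of(tmp.name)
finally:
    os.remove(tmp.name)  # clean up the temporary file

print(tag_from_object, tag_from_name)
```

Either way the same parse call does the work, which is why the object returned by urlopen can be handed straight to it.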

So, to put this all together and actually parse a web page:
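from urllib2 import urlopen
from elementtree import ElementTree, HTMLTreeBuilder
targetURL = "http://www.cnn.com"
# urlopen returns a file-like object, which ElementTree.parse accepts
urlFileObj = urlopen(targetURL)
# parse the page with the HTML tree builder instead of the default XML one
htmlParserObj = HTMLTreeBuilder.TreeBuilder()
treeObj = ElementTree.parse(urlFileObj, parser=htmlParserObj)
# print the resulting element tree to the console
ElementTree.dump(treeObj)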

Once we parse the web page into a tree of elements, we just dump it out to the console so we can see what we got. Sadly, we get a mess:
...much traceback elided...
AssertionError: end tag mismatch (expected input, got td)

After some investigation, I came to the conclusion that ElementTree's HTML parser needs really well-formed HTML, with end tags that match their start tags. To solve this problem, ElementTree provides a way of using the standard HTML fixer-upper utility 'Tidy'. More on this in part 4.
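
For a feel of what "well-formed" means here, this is a rough sketch of the same kind of failure. It assumes the traceback came from an input tag that was still open when the closing td arrived, which is my reading of the message rather than something I confirmed against the page. It also swaps in the strict XML parser from the standard library's xml.etree.ElementTree instead of HTMLTreeBuilder, so the exception type differs (ParseError, not AssertionError). The two fragments and the helper name parses_cleanly are made up for illustration.

```python
import xml.etree.ElementTree as ET

# Made-up fragment: <input> is never closed. Browsers accept this,
# but a parser that insists on matching end tags does not.
SLOPPY = '<table><tr><td><input type="text"></td></tr></table>'

# The same fragment with the input element explicitly closed.
CLEAN = '<table><tr><td><input type="text" /></td></tr></table>'


def parses_cleanly(markup):
    """Return True if the strict parser accepts `markup`, False otherwise."""
    try:
        ET.fromstring(markup)
        return True
    except ET.ParseError:
        # Raised on mismatched tags and other well-formedness errors.
        return False


print(parses_cleanly(SLOPPY))
print(parses_cleanly(CLEAN))
```

Closing the one stray tag is enough to get the fragment through, which is the sort of repair a fixer-upper like Tidy is meant to do across a whole page.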

11:16:44 AM