Quest for Massage, Part 3
Here's how to use ElementTree to parse HTML:

    from elementtree import HTMLTreeBuilder, ElementTree

    htmlParserObj = HTMLTreeBuilder.TreeBuilder()
    treeObj = ElementTree.parse(file, parser=htmlParserObj)
'file' can be either a file name, or an open file object (or file-like object). Just what I need.
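To see the "file name or file-like object" point in isolation, here is a minimal sketch. It assumes Python 3 and the standard library's xml.etree.ElementTree (the descendant of the elementtree package) with its default XML parser rather than HTMLTreeBuilder, so the made-up MARKUP string has to be well-formed.

```python
import io
import os
import tempfile
import xml.etree.ElementTree as ET

# Made-up, well-formed markup used only for this illustration.
MARKUP = "<html><body><p>hello</p></body></html>"

# 1. Parse from a file-like object (anything with a read() method).
tree_from_obj = ET.parse(io.StringIO(MARKUP))

# 2. Parse from a file name: write the same markup to a temporary file first.
with tempfile.NamedTemporaryFile("w", suffix=".html", delete=False) as tmp:
    tmp.write(MARKUP)
    path = tmp.name
try:
    tree_from_name = ET.parse(path)
finally:
    os.remove(path)  # clean up the temporary file

# Both calls produce an equivalent tree.
print(tree_from_obj.getroot().tag, tree_from_name.findtext("body/p"))
```

This is also why an object returned by urlopen can be handed straight to parse: it only needs to behave like an open file.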
So, to put this all together and actually parse a web page:

    from urllib2 import urlopen
    from elementtree import ElementTree, HTMLTreeBuilder

    targetURL = "http://www.cnn.com"
    urlFileObj = urlopen(targetURL)
    htmlParserObj = HTMLTreeBuilder.TreeBuilder()
    treeObj = ElementTree.parse(urlFileObj, parser=htmlParserObj)
    ElementTree.dump(treeObj)
Once we parse the web page into a tree of elements, we just dump it out to the console so we can see what we got. Sadly, we get a mess:

    ...much traceback elided...
    AssertionError: end tag mismatch (expected input, got td)
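A tag mismatch of this kind is easy to reproduce offline. The sketch below is an analogy, not the same code path: it assumes Python 3 and uses the standard library's strict XML parser instead of HTMLTreeBuilder, and the SLOPPY and CLEAN fragments are hypothetical, not taken from the page above. The sloppy one leaves an input element unclosed inside a td, which browsers accept without complaint.

```python
import xml.etree.ElementTree as ET

# Hypothetical fragments: in SLOPPY the <input> is never closed,
# in CLEAN it is self-closed so every tag has a match.
SLOPPY = "<table><tr><td><input type='text'></td></tr></table>"
CLEAN = "<table><tr><td><input type='text' /></td></tr></table>"


def try_parse(markup):
    """Return the root tag on success, or None if the parser rejects the markup."""
    try:
        return ET.fromstring(markup).tag
    except ET.ParseError:
        return None


sloppy_result = try_parse(SLOPPY)  # rejected: </td> arrives while <input> is open
clean_result = try_parse(CLEAN)    # accepted
print(sloppy_result, clean_result)
```

The only difference between the two strings is the closing slash on the input element, which is exactly the sort of detail real-world pages tend to skip.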
After some investigation, I came to the conclusion that ElementTree's HTML parser needs really well-formed HTML: here it was expecting the end tag of an input element and got the end of a td instead. To solve this problem, ElementTree provides a way of using the standard HTML fixer-upper utility 'Tidy'. More on this in part 4.
11:16:44 AM