Trivial Thoughts
Thoughts and discussion on programming projects using the Python language.


Sunday, May 25, 2003
 

Quest for Massage, Part 4.1

As I noted, the 'tidy' function in the mx extensions requires a true file object as its input. What I want to give it is the result of urlopen.  My initial solution was to use a temporary file.  It worked, but I didn't like the explicit steps I had to go through.

When you have an X, and you need a Y, you are looking at a use for the Adapter Pattern.  So I started working on an adapter that could take a file-like object, and give me a true file object.  Here's my initial look at it:

import os
import types

class FileAdapter:
    """A FileAdapter instance takes a 'file-like' object having at least a 'read' method
    and, via the file method, returns a true file object."""
    def __init__(self, fileObj):
        self.fileObj = fileObj
        self.chunksize = 1024 * 10
        # A true file object is used as-is; anything else is copied to a temporary file.
        if type(self.fileObj) != types.FileType:
            if not hasattr(fileObj, "read"):
                raise ValueError("not a file-like object")

            self.tmpFileObj = os.tmpfile()
            # Copy the data across in chunks until read() returns an empty string.
            while True:
                data = fileObj.read(self.chunksize)
                if len(data) == 0:
                    break
                self.tmpFileObj.write(data)
            del data
            self.tmpFileObj.flush()
            # Rewind so the caller starts reading from the beginning.
            self.tmpFileObj.seek(0, 0)
            self.fileObj = self.tmpFileObj

    def file(self):
        return self.fileObj


Note that I'm still using a temporary file, but now my interface to it is much cleaner.  I can use it like this:

realFileObj = FileAdapter(urlopen("http://www.cnn.com")).file()

This takes the file-like object returned by urlopen, reads in all of its data to a temporary file, and returns that file as the real file object.
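
The adapter above leans on os.tmpfile and types.FileType, which exist only in Python 2. As a rough sketch of the same idea for Python 3, something like the following might do. The names to_real_file and _is_real_file are made up for this illustration, and the code assumes the source yields bytes (as urlopen does), treating "real file" as "has a working fileno() and can seek":

```python
import shutil
import tempfile


def _is_real_file(obj):
    """Assumption for this sketch: a 'real' file has a usable fileno() and can seek."""
    try:
        obj.fileno()
        return obj.seekable()
    except (AttributeError, OSError, ValueError):
        # io.BytesIO and friends raise io.UnsupportedOperation, a subclass of OSError.
        return False


def to_real_file(file_like, chunk_size=10 * 1024):
    """Return file_like itself if it is a real file, else a temporary file copy.

    Hypothetical helper, not part of mx.Tidy or the original FileAdapter.
    """
    if _is_real_file(file_like):
        return file_like
    if not hasattr(file_like, "read"):
        raise ValueError("not a file-like object")
    tmp = tempfile.TemporaryFile()  # binary mode; removed when closed
    shutil.copyfileobj(file_like, tmp, chunk_size)  # chunked copy
    tmp.seek(0)  # rewind so the caller reads from the start
    return tmp
```

The caller is responsible for closing the returned file. A plain function is used here instead of a class because the adapter only ever hands back one object.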

This gives me what I need to make a url file-like object play nice with tidy.  But you know, maybe I can do better.  It's pretty obvious that tidy works like a filter.  I'd like to treat it like one, from within my program.  That bears thought.


9:06:50 PM

Quest for Massage, Part 4

I left off talking about how ElementTree has provisions for running messy HTML through the standard 'Tidy' command to get well-formed HTML.  After some investigation, I still haven't gotten it to work.  So...

M.-A. Lemburg's mx extensions for Python include a whole bunch of handy stuff - like a version of the Tidy program turned into a Python extension module.  To get it, download the 'experimental' package.

Here's how I'm using it:

from mx.Tidy import Tidy
nerrors, nwarnings, outputdata, errordata = Tidy.tidy(
    input, output=None, errors=None, output_xhtml=1)

Tidy takes a bunch of keyword options (see the docs).  Here, I'm telling it to output XHTML. 'input' can be either an open file object, or a string.  If output is specified, it must be an open file object, which the output will be written to.  If not specified, output will be written as a string to the return value tuple element 'outputdata'.  The same kind of thing happens for error output - if the 'errors' parameter is set to an open file object, error output is written to it.  Otherwise, error output is written as a string to the return value tuple element 'errordata'.

Sadly, experimenting with it has shown that Tidy will only work with actual file objects, not 'file-like' objects.  This means, for example, that I can't give tidy an object returned by urlopen.  I must read in the web page, write it out to a temporary file, pass this temporary file to tidy and collect its output in a second temporary file, then pass that second file to ElementTree.
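
Those steps could be sketched roughly as follows, in Python 3 with the standard library's xml.etree.ElementTree. This is only an illustration: html_to_tree is a made-up name, and tidy_func is supplied by the caller and merely assumed to follow the calling convention described above (input file, output= file, keyword options, a four-item return tuple). Nothing here imports or verifies mx.Tidy itself.

```python
import shutil
import tempfile
import xml.etree.ElementTree as ET


def html_to_tree(page, tidy_func):
    """Run a file-like 'page' (yielding bytes) through tidy_func and parse the result.

    Hypothetical helper; tidy_func is assumed to behave as described in this post.
    """
    with tempfile.TemporaryFile() as raw, tempfile.TemporaryFile() as clean:
        # 1. Spool the page into a real file.
        shutil.copyfileobj(page, raw)
        raw.seek(0)
        # 2. Let tidy read one real file and write XHTML to the other.
        nerrors, nwarnings, _, errordata = tidy_func(
            raw, output=clean, output_xhtml=1)
        if nerrors:
            raise ValueError("tidy reported %d error(s): %s" % (nerrors, errordata))
        # 3. Rewind and parse the cleaned-up output.
        clean.seek(0)
        return ET.parse(clean)
```

Both temporary files are closed, and so deleted, once the tree has been built.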

Putting all of this together, I get an actual, working HTML-to-ElementTree parser.  Here's the working code:

http://radio.weblogs.com/0124960/stories/2003/05/23/iwannamassagepyV01.html

Kinda crude, and I know I can improve on this.

 


12:59:26 PM


© Copyright 2003 Michael Kent.
Last update: 6/16/2003; 9:23:42 PM.