Seeking a higher level of control and integration in the tools I use a lot.
Personal Software Integration

 Wednesday, November 05, 2003

I'm aggravated! Microsoft's updated support page/Knowledge Base article format will not save in Web Archive format. This is a recent change, and I have responded at least once using their feedback form. I have heard nothing.

I suspect it is related to the cross-domain stylesheet reference, but I can't be sure without researching it (Update: I don't think that's the problem now--all stylesheets have relative paths). But I'm miffed! I save MSDN/KB articles as archives all the time. MSDN still works, but now the new layout at support.microsoft.com prevents saving in any format except "HTML only", or (*gasp*) plain text. Believe me, the "HTML only" format looks pretty crappy upon reloading.

Dismay

Maybe the world at large has not discovered the joys of the MHT format. Or maybe they have no concern when they are thwarted. Maybe the workaround is sufficient for them. My most recent Google search seems to consistently point to the same (mostly worthless) Microsoft KB article (#235589).

Are my interests that obscure? Somehow, I doubt it.

Hope?

A few search results point to an IE security patch that affects sites in the Restricted Sites group. I doubt that Microsoft.com is in the Restricted Sites group by default. I did, however, find some possibly related security settings in the IE options that affect frame navigation, scripting, etc. Interesting...

Background

I use the Web Archive format (*.MHT) extensively with Internet Explorer to save a snapshot of a web page with all parts in a single file. All of the images, stylesheets, and scripts are embedded as message parts in a multipart MIME message. The format is actually not proprietary--it complies with IETF RFC 2557 for MHTML. The page is reconstructed by Internet Explorer as it existed when it was saved.
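To make the "multipart MIME message" idea concrete, here is a minimal sketch that lists the parts of an MHT file using Python's standard email package. The SAMPLE archive is hand-made for illustration (the example.com URLs, the boundary string, and the placeholder image bytes are all assumptions), and list_mht_parts is a hypothetical helper name. A real IE-saved file has more headers but the same overall shape.

```python
from email import message_from_bytes

# A tiny hand-written archive standing in for a real .mht file (illustrative only).
SAMPLE = (
    b"MIME-Version: 1.0\r\n"
    b'Content-Type: multipart/related; boundary="----=_demo"; type="text/html"\r\n'
    b"\r\n"
    b"------=_demo\r\n"
    b'Content-Type: text/html; charset="us-ascii"\r\n'
    b"Content-Location: http://example.com/page.htm\r\n"
    b"\r\n"
    b'<html><body><img src="logo.gif"></body></html>\r\n'
    b"------=_demo\r\n"
    b"Content-Type: image/gif\r\n"
    b"Content-Transfer-Encoding: base64\r\n"
    b"Content-Location: http://example.com/logo.gif\r\n"
    b"\r\n"
    b"R0lGODlhAQABAAAAACw=\r\n"  # placeholder bytes, not a meaningful image
    b"------=_demo--\r\n"
)


def list_mht_parts(raw):
    """Return (content type, Content-Location) for each non-container part."""
    msg = message_from_bytes(raw)
    parts = []
    for part in msg.walk():
        if part.is_multipart():
            continue  # skip the multipart/related wrapper itself
        parts.append((part.get_content_type(), part.get("Content-Location")))
    return parts


if __name__ == "__main__":
    for content_type, location in list_mht_parts(SAMPLE):
        print(content_type, location)
```

To try it on a real archive, read the file in binary mode and pass the bytes to list_mht_parts.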

In many cases, this is better than link rot. I don't have to print out the web page unnecessarily, and I can guarantee that the contents of the page will not change the next time I "visit." Granted, the file size is a bit chunky, and you don't benefit from things like image caching since every web archive file has every element embedded. But hey, gigs are cheap, right?

An older alternative format with Internet Explorer was the "Web Page Complete" format. It saves a .htm page file and creates a child folder with the same name as the page file. All supporting objects (images, etc.) are saved as distinct files in the child folder. Some people like that for scavenging web page parts. I personally hate it because it becomes a file management nightmare. Every time I want to move the web page, I have to also deal with an accompanying folder. Most of the time, Windows Explorer groups folders separately from files, so the two items are seldom close together in a file list. Forget it!

So I use MHT files a lot. Occasionally, however, Internet Explorer gives an error message and refuses to save a web page as a Web Archive. This can happen for a number of reasons, very few of which I understand. But in general, there seems to be a problem on some pages with Flash items. That happens to be a lot of pages--ouch! Microsoft acknowledges the error and lists a long line of browser products that exhibit the error--pretty much all of the versions of Internet Explorer since the Web Archive format was introduced. A recent search of Google Groups also suggested that there were some security considerations involving spreadsheets from a domain other than where the page was located.

Microsoft addresses a couple of reasons here (sorry, you can't save that article as a Web Archive!). Excel 2000 HTML spreadsheets with multiple worksheets (which Excel creates by default--Sheet1, Sheet2, & Sheet3) don't save.

The article above provides this help: "To work around this problem, save the Web page in Internet Explorer using a different Save as format." Um, what if my problem is that I CAN'T SAVE IN THE WEB ARCHIVE FORMAT??

Possible issues to consider

1.) The site with the desired page might know how to block this type of operation. Don't know how yet, but I haven't ruled it out. MSN.com is one site Microsoft mentions specifically, and I haven't found any good reason why. Heck, the fact that they document MSN.com in particular in a support article seems to be evidence that they intend that phenomenon to persist. Sure, I imagine that the MSN.com folks and the support article folks probably can't get together and address this specific problem, so it's easier to write about it than to make a change.

2.) The page might not be reproducible without DHTML or the appropriate cookie or some other element that only exists in the rendered version you are viewing. It appears that IE tries to reacquire the page in a separate "duplicate" request, rather than just saving the page you are viewing. Wouldn't that be convenient? Well, maybe not--the MSHTML component in Windows does some HTML correction to reconcile poorly formed HTML. It standardizes case and inserts closing tags. In other words, you wouldn't be saving the true source of the page; you would be saving IE's processed view of the source. Depending on your needs, this may not be desirable. If you're interested in what the processed version of the page looks like, save a page as "Web Page, complete", and look at the difference in capitalization between the saved .htm file and the source displayed by clicking View Source in IE.
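The capitalization comparison in item 2 can be roughed out in a few lines of Python. This is only a sketch: tag_case_profile is a hypothetical helper, and the two sample strings are made up to illustrate the idea. They are not real IE output, and a regex is a crude stand-in for a real HTML parser.

```python
import re

# Matches the name of an opening or closing tag, e.g. "p" in "<p>" or "</p>".
TAG_RE = re.compile(r"<\s*/?\s*([A-Za-z][A-Za-z0-9]*)")


def tag_case_profile(html):
    """Count tag names by capitalization style: upper, lower, or mixed."""
    counts = {"upper": 0, "lower": 0, "mixed": 0}
    for name in TAG_RE.findall(html):
        if name.isupper():
            counts["upper"] += 1
        elif name.islower():
            counts["lower"] += 1
        else:
            counts["mixed"] += 1
    return counts


# Made-up samples: "original" mimics hand-written source with a missing </p>,
# "processed" mimics a normalized copy with uniform case and the tag closed.
original = "<html><body><p>Hi<br></body></html>"
processed = "<HTML><BODY><P>Hi<BR></P></BODY></HTML>"

if __name__ == "__main__":
    print("original: ", tag_case_profile(original))
    print("processed:", tag_case_profile(processed))
```

Running it on the saved .htm file and on the text from View Source would show at a glance whether the case of the tags was rewritten.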

Alternatives

A.) Create the file yourself: painful, but I'm beginning to think about it. The format is documented well enough that one should be able to assemble an IE-compatible MHT file without IE's help in saving it. It requires arcane MIME knowledge though--message headers, sifting through dense RFCs--not for the faint-of-heart or crunched-for-time.

B.) Find products that will save MHT files and don't rely on the Internet Explorer API to do it. Don't know of any yet.

C.) Test the Chilkat MHT library and hope it doesn't require the IE API either. The Chilkat library is a bit pricey for the individual, but you can buy a license for their entire suite that covers redistribution. The price isn't too bad (you'll have to check their site) if you're writing and selling software that can defray the cost.
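For alternative A, here is a rough sketch of assembling an archive by hand with Python's standard email package. The function name build_mht, the example.com URLs, and the dummy image bytes are all assumptions for illustration. It produces a multipart/related message along the lines of RFC 2557, but I have not verified that Internet Explorer will open the result, and it leaves out the hard part: fetching the page and finding every resource it references.

```python
from email import message_from_bytes
from email.mime.image import MIMEImage
from email.mime.multipart import MIMEMultipart
from email.mime.text import MIMEText


def build_mht(page_url, html, images):
    """Assemble a multipart/related archive.

    images is a list of (url, image_subtype, data) tuples, e.g.
    ("http://example.com/logo.gif", "gif", b"...").
    """
    root = MIMEMultipart("related", type="text/html")

    # The HTML page goes first; Content-Location records where it came from.
    page = MIMEText(html, "html", "utf-8")
    page["Content-Location"] = page_url
    root.attach(page)

    # Each image becomes its own base64-encoded part, keyed by its URL.
    for url, subtype, data in images:
        part = MIMEImage(data, _subtype=subtype)
        part["Content-Location"] = url
        root.attach(part)

    return root.as_bytes()


if __name__ == "__main__":
    raw = build_mht(
        "http://example.com/page.htm",
        '<html><body><img src="http://example.com/logo.gif"></body></html>',
        [("http://example.com/logo.gif", "gif", b"GIF89a-demo-bytes")],
    )
    with open("page.mht", "wb") as handle:
        handle.write(raw)
    # Round-trip check: parse the bytes back and show the part types.
    for part in message_from_bytes(raw).walk():
        print(part.get_content_type())
```

The image references in the HTML have to match the Content-Location values of the attached parts, which is why the sketch uses absolute URLs in both places.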

Related links

Gordon Weakliem had some problems that included the culprit message, but I doubt that his resolution will help me.


12:50:01 PM

