I'm aggravated! Microsoft's updated support page/Knowledge Base article
format will not save in Web Archive format. This is a recent change, and
I have responded at least once using their feedback form. I have heard
nothing.
I suspect it is related to the cross-domain stylesheet reference, but I
can't be sure without researching it (Update: I don't think that's the
problem now--all stylesheets have relative paths). But I'm miffed! I
save MSDN/KB articles as archives all the time. MSDN still works, but now
the new layout at support.microsoft.com prevents saving in any format
except "HTML only", or (*gasp*) plain text. Believe me, the "HTML only"
format looks pretty crappy upon reloading.
Dismay
Maybe the world at large has not discovered the joys of the MHT format.
Or maybe they have no concern when they are thwarted. Maybe the
workaround is sufficient for them. My most recent Google search seems to
consistently point to the same (mostly worthless) Microsoft KB article
(#235589).
Are my interests that obscure? Somehow, I doubt it.
Hope?
A few search results point to an IE security patch that affects sites in
the Restricted Sites group. I doubt that Microsoft.com is in the
Restricted Sites group by default. I did, however, find some
possibly-related security settings in the IE options that affect frame
navigation, scripting, etc. Interesting...
Background
I use the Web Archive format (*.MHT) extensively with Internet Explorer to
save a snapshot of a web page with all parts in a single file. All of the
images, stylesheets, and scripts are embedded as message parts in a
multipart MIME message. The format is actually not proprietary--it
complies with IETF RFC 2557 for MHTML. The page is reconstructed by
Internet Explorer as it existed when it was saved.
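To make the "message parts" idea concrete, here is a minimal sketch that uses Python's standard email package to list the parts of an archive. The sample bytes are hand-written for illustration (example.com URLs, a made-up boundary, placeholder image data), not something IE produced, and the function name list_mht_parts is my own. A real file saved by IE will carry more headers than this.

```python
import email

# Hand-written sample in the multipart/related layout described above.
# The URLs, boundary string, and image bytes are placeholders.
SAMPLE = b"""\
Subject: Example page
MIME-Version: 1.0
Content-Type: multipart/related; type="text/html"; boundary="EXAMPLE-BOUNDARY"

--EXAMPLE-BOUNDARY
Content-Type: text/html; charset="utf-8"
Content-Transfer-Encoding: quoted-printable
Content-Location: http://example.com/page.htm

<html><body><img src=3D"http://example.com/logo.gif"></body></html>

--EXAMPLE-BOUNDARY
Content-Type: image/gif
Content-Transfer-Encoding: base64
Content-Location: http://example.com/logo.gif

R0lGODlhAQABAAAAACw=

--EXAMPLE-BOUNDARY--
"""


def list_mht_parts(raw_bytes):
    """Return (content type, Content-Location) for each embedded part."""
    message = email.message_from_bytes(raw_bytes)
    parts = []
    for part in message.walk():
        if part.is_multipart():
            continue  # skip the multipart/related container itself
        parts.append((part.get_content_type(), part.get("Content-Location")))
    return parts


if __name__ == "__main__":
    for content_type, location in list_mht_parts(SAMPLE):
        print(content_type, location)
```

Pointing the same function at the bytes of a real .mht file (read in binary mode) should show one entry per image, stylesheet, and script that was embedded.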
In many cases, this beats falling victim to link rot. I don't have to print out
the web page unnecessarily, and I can guarantee that the contents of the
page will not change the next time I "visit." Granted, the file size is a
bit chunky, and you don't benefit from things like image caching since
every web archive file has every element embedded. But hey, gigs are
cheap, right?
An older alternative format with Internet Explorer was the "Web Page
Complete" format. It saves a .htm page file and creates a child folder
with the same name as the page file. All supporting objects (images,
etc.) are saved as distinct files in the child folder. Some people like
that for scavenging web page parts. I personally hate it because it
becomes a file management nightmare. Every time I want to move the web
page, I have to also deal with an accompanying folder. Most of the time,
Windows Explorer groups folders separately from files, so the two items
are seldom close together in a file list. Forget it!
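For anyone stuck with "Web Page Complete" saves, the two-item problem can at least be scripted. This is a rough sketch, not a tested tool: it assumes the companion folder sits next to the page and is named after the page plus a "_files" suffix (adjust folder_suffix if your saves are named differently), and the function name move_saved_page is my own invention.

```python
import shutil
from pathlib import Path


def move_saved_page(page, dest_dir, folder_suffix="_files"):
    """Move a saved .htm file and its companion folder together.

    folder_suffix is an assumption about how the companion folder is
    named; change it to match what you actually see on disk.
    """
    page = Path(page)
    dest_dir = Path(dest_dir)
    dest_dir.mkdir(parents=True, exist_ok=True)

    # The companion folder is expected to sit beside the page file.
    companion = page.with_name(page.stem + folder_suffix)

    moved = [Path(shutil.move(str(page), str(dest_dir / page.name)))]
    if companion.is_dir():
        target = dest_dir / companion.name
        moved.append(Path(shutil.move(str(companion), str(target))))
    return moved
```

Because both items land in the same destination folder, the relative links from the page into its companion folder should keep working.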
So I use MHT files a lot. Occasionally, however, Internet Explorer gives
an error message and refuses to save a web page as a Web Archive. This
can happen for a number of reasons, very few of which I understand. But in
general, there seems to be a problem on some pages with Flash items. That
happens to be a lot of pages--ouch! Microsoft acknowledges the error and
lists a long line of browser products that exhibit the error--pretty much
all of the versions of Internet Explorer since the Web Archive format was
introduced. A recent search of Google groups also suggested that there
were some security considerations involving spreadsheets from a domain
other than where the page was located.
Microsoft addresses a couple of reasons here (sorry, you
can't save that article as a Web Archive!). Excel 2000 HTML spreadsheets
with multiple worksheets (which Excel creates by default--Sheet1, Sheet2,
& Sheet3) don't save.
The article above provides this help: "To work around this problem, save
the Web page in Internet Explorer using a different Save as format." Um,
what if my problem is that I CAN'T SAVE IN THE WEB ARCHIVE FORMAT??
Possible issues to consider
1.) The site with the desired page might know how to block this type of
operation. Don't know how yet, but I haven't ruled it out. MSN.com
is one site Microsoft mentions specifically, and I haven't found any good
reason why. Heck, the fact that they document MSN.com in particular in a
support article seems to be evidence that they intend that phenomenon to
persist. Sure, I imagine that the MSN.com folks and the support article
folks probably can't get together and address this specific problem, so
it's easier to write about it than to make a change.
2.) The page might not be reproducible without DHTML or the
appropriate cookie or some other element that only exists in the rendered
version you are viewing. It appears that IE tries to reacquire the page
in a separate "duplicate" request, rather than just saving the page you
are viewing. Wouldn't saving the rendered page be convenient? Well, maybe not--the MSHTML
component in Windows does some HTML correction to reconcile poorly formed
HTML. It standardizes case and inserts closing tags. In other words, you
wouldn't be saving the true source of the page; you would be saving IE's
processed view of the source. Depending on your needs, this may not be
desirable. If you're interested in what the processed version of the page
looks like, save a page as "Web Page, complete", and look at the
difference in capitalization between the saved .htm file and the source
displayed by clicking View Source in IE.
Alternatives
A.) Create the file yourself: painful, but I'm beginning to think about
it. The format is documented well enough that one should be able to
assemble an IE-compatible MHT file without IE's help in saving it. It
requires arcane MIME knowledge though--message headers, sifting through
dense RFCs--not for the faint-of-heart or crunched-for-time.
B.) Find products that will save MHT files and don't rely on the Internet
Explorer API to do it. Don't know of any yet.
C.) Test the Chilkat MHT library and hope it doesn't require the IE API
either. The Chilkat library is a bit pricey for the individual, but you
can buy a license for their entire suite that covers redistribution. The
price isn't too bad (you'll have to check their site) if you're writing
and selling software that can defray the cost.
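For alternative A, much of the arcane MIME work can be handed to a library. Here is a minimal sketch using Python's standard email package to assemble a multipart/related message with a Content-Location header on each part. The function name build_mht, the Subject text, and the shape of the resources argument are all my own assumptions. I have not verified that IE will open the result, and a real tool would also need to download each resource and handle stylesheets and scripts, not just images.

```python
from email.mime.image import MIMEImage
from email.mime.multipart import MIMEMultipart
from email.mime.text import MIMEText


def build_mht(page_url, html, images):
    """Assemble an MHT-style archive and return it as bytes.

    images is a list of (url, image subtype, raw bytes) tuples,
    for example ("http://example.com/logo.gif", "gif", data).
    """
    # multipart/related with type="text/html" marks the HTML part as the root.
    archive = MIMEMultipart("related", type="text/html")
    archive["Subject"] = "Hand-built web archive"

    # The page itself goes first, tagged with the URL it came from.
    root = MIMEText(html, "html", "utf-8")
    root["Content-Location"] = page_url
    archive.attach(root)

    # Each image becomes its own base64-encoded part, keyed by its URL,
    # so references in the HTML can be matched to the embedded copies.
    for url, subtype, data in images:
        part = MIMEImage(data, _subtype=subtype)
        part["Content-Location"] = url
        archive.attach(part)

    return archive.as_bytes()
```

Writing the returned bytes to a file with an .mht extension is the obvious next step. Whether IE accepts it is exactly the experiment I would want to run.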
Related links
Gordon Weakliem had some problems that included the culprit message, but I
doubt that his resolution will help me.