What a pain

Posted on June 24, 2003 @ 15:43 in Research

Sometimes it's the simplest things that will drive you crazy. You won't hear me say that archiving websites is a simple matter, but this is one stupid problem.

I use GNU Wget to archive personal home pages that I've identified as interesting for my research. Apart from its inability to follow links that are buried in Flash movies, Java applets, or JavaScript (short of modifying and recompiling Wget yourself), it generally works okay. Wget, however, is a bit too well-behaved for its own good. You can't really fault it for that, because it's Microsoft who's in the wrong here. Let me explain...

I'm trying to archive a website that contains a number of pages without an extension. Normally a webpage has the extension .html, .htm, .php, or some such, but these pages have no extension at all. Instead of "page.html" they're simply called "page". The webserver is Apache, and when a file has no extension that Apache recognizes and no content type has been configured for it, Apache serves it as "text/plain" rather than "text/html."
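
To make that fallback concrete, here is a rough sketch in Python. It is not Apache's code; it only mimics the behaviour described above, using Python's own extension table from the standard mimetypes module. The function name served_type and the text/plain default are assumptions made for illustration.

```python
import mimetypes

# Assumption: the server falls back to text/plain when it cannot map
# a file name to a type, as described for this particular Apache setup.
DEFAULT_TYPE = "text/plain"


def served_type(filename, default=DEFAULT_TYPE):
    """Guess the Content-Type a server would announce for a file name."""
    guessed, _encoding = mimetypes.guess_type(filename)
    # guess_type returns None for the type when the name has no known extension.
    return guessed or default


if __name__ == "__main__":
    for name in ("page.html", "page"):
        print(name, "->", served_type(name))
```

With a file called "page" there is no extension to look up, so the guess comes back empty and the default wins.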

Opera and IE are not picky about the content type that these pages are served with. Even though the webserver tells them, "Hey, the next page is a plain text document," they still render the page as HTML. Mozilla-based browsers, however, take the server at its word: when a webserver sends out a page that it declares to be plain text, they render it as plain text. That means that Mozilla actually shows all the HTML code on screen.

Wget, being another fine open-source software project just like Mozilla, also sticks to the specifications and refuses to parse files served as "text/plain" for links. It's simple to see why this is important. If you have a plain text file on your server, you want it to show up as a plain text file in the browser. Who knows what will go wrong if browsers just decide by themselves to start interpreting everything that a webserver sends out as HTML, instead of following the content type specified by the webserver?
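
The difference between the two camps can be sketched in a few lines of Python. This is a simplification for illustration, not the actual logic of any browser or of Wget: the set of HTML types, the function names, and the "peek at the first 512 characters" heuristic are all assumptions.

```python
# Assumption: these are the declared types a strict client treats as HTML.
HTML_TYPES = {"text/html", "application/xhtml+xml"}


def media_type(content_type_header):
    """Strip parameters such as '; charset=...' and normalise the case."""
    return content_type_header.split(";", 1)[0].strip().lower()


def strict_is_html(content_type_header):
    """Trust the server: only content declared as HTML counts as HTML."""
    return media_type(content_type_header) in HTML_TYPES


def lenient_is_html(content_type_header, body):
    """Second-guess the server: peek at the body for HTML-looking markup."""
    if strict_is_html(content_type_header):
        return True
    head = body[:512].lower()
    return any(tag in head for tag in ("<!doctype html", "<html", "<head", "<body"))
```

A strict client given "text/plain" stops at the header and never looks at the body. A lenient one looks anyway, which is why a mislabelled page still renders and its author never notices the mistake.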

So now I'm stuck with a website of 50 or so pages that I cannot archive automatically, because Microsoft (and Opera) decided that they don't have to play by the rules. If IE hadn't been able to display these pages as HTML, the author would have corrected his mistake a long time ago, as soon as he noticed that his webpages weren't showing up the way he intended. Unless I find a way to get Wget to parse text/plain files, I'm stuck with visiting each page in IE and saving them all by hand, one by one. Sigh.
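
One possible way around the manual route, sketched here in Python rather than Wget, is a tiny crawler that simply ignores the declared content type and parses every response as HTML. This is a minimal sketch under stated assumptions, not a replacement for Wget: the names LinkCollector, extract_links, fetch, and crawl are made up for this example, it only follows plain <a href> links on the same host, it decodes everything as Latin-1, and it keeps the pages in memory instead of writing them to disk.

```python
from html.parser import HTMLParser
from urllib.parse import urldefrag, urljoin, urlparse
from urllib.request import urlopen


class LinkCollector(HTMLParser):
    """Collects the href values of <a> tags."""

    def __init__(self):
        super().__init__()
        self.hrefs = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.hrefs.append(value)


def extract_links(html_text, base_url):
    """Return absolute, fragment-free URLs linked from html_text."""
    collector = LinkCollector()
    collector.feed(html_text)
    links = []
    for href in collector.hrefs:
        absolute, _fragment = urldefrag(urljoin(base_url, href))
        links.append(absolute)
    return links


def fetch(url):
    """Download a URL and decode it as text, ignoring the Content-Type."""
    with urlopen(url) as response:
        # Latin-1 never fails to decode; a guess that suits a rough sketch.
        return response.read().decode("latin-1")


def crawl(start_url, fetch=fetch, limit=100):
    """Breadth-first crawl of pages on the same host as start_url.

    Every body is parsed as HTML no matter what the server declared.
    Returns a dict mapping each visited URL to its text.
    """
    host = urlparse(start_url).netloc
    queue, pages = [start_url], {}
    while queue and len(pages) < limit:
        url = queue.pop(0)
        if url in pages:
            continue
        pages[url] = fetch(url)
        for link in extract_links(pages[url], url):
            if urlparse(link).netloc == host and link not in pages:
                queue.append(link)
    return pages
```

Calling crawl with the address of the site's start page would return the text of every page reachable through ordinary links, up to the limit. Saving each entry to a file, and handling images and other binary content, is left out on purpose to keep the sketch short.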
