XHTML and Search Engines
08 February 2008
A Little XHTML Background
In 1999, the W3C released a specification for XHTML 1.0. It was supposed to be identical to HTML 4.01 but expressed in XML rather than SGML.
It hasn't been much of a success. Almost all the XHTML you find on the web is masquerading as HTML (that is, served as text/html) because Internet Explorer doesn't support the language yet (at the time of writing, the latest version of Internet Explorer is version 7). When XHTML pretends to be HTML, the advantages of using it on the client side go away.
Client Support in 2008
Most of the major browsers have supported XHTML for a while and, with Internet Explorer 8 on the horizon, it is just about possible that XHTML support will finally be added to the biggest of the Big Four. Assuming it is, will XHTML be feasible on the WWW?
Most authors care, first and foremost, that their pages work in web browsers, but there are other user agents out there. Search engine indexing robots come a close second to browsers: a very large number of authors want their pages to show up in search engines.
This raises the question: Do search engines support XHTML?
Experiments in Content-Type
I decided to do some experimentation. I created a set of test pages served with the application/xhtml+xml content type (one well-formed, and one with errors that should cause XML parsers to give up and stop processing), along with an index page (served as text/html). Then I waited for search engines to start finding them. If the control page shows up in a search engine's results, it is reasonable to assume that the other pages will show up too, if they are ever going to.
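To make the difference between the two XHTML test pages concrete, here is a minimal Python sketch of what "causing an XML parser to give up" looks like. The two documents are hypothetical stand-ins, not the actual test pages, and `is_well_formed` is a name invented for this illustration. It uses the standard library's `xml.etree.ElementTree`, which rejects any document that is not well-formed.

```python
import xml.etree.ElementTree as ET

# Hypothetical stand-in for the well-formed test page.
WELL_FORMED = (
    '<html xmlns="http://www.w3.org/1999/xhtml">'
    "<head><title>Well-formed test page</title></head>"
    "<body><p>Every element is closed.</p></body>"
    "</html>"
)

# Hypothetical stand-in for the broken test page: the <p> element is
# never closed, which is a fatal error for an XML parser.
NOT_WELL_FORMED = (
    '<html xmlns="http://www.w3.org/1999/xhtml">'
    "<head><title>Broken test page</title></head>"
    "<body><p>This paragraph is never closed.</body>"
    "</html>"
)


def is_well_formed(document: str) -> bool:
    """Return True if the XML parser accepts the document, False if it gives up."""
    try:
        ET.fromstring(document)
    except ET.ParseError:
        return False
    return True


print(is_well_formed(WELL_FORMED))
print(is_well_formed(NOT_WELL_FORMED))
```

An HTML parser would typically recover from the unclosed paragraph, which is why the broken page is a useful probe: whether it gets indexed hints at which kind of parser a robot uses.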
After publishing my initial results, Mike Davies suggested that the apparent file extension in the URI might be a factor. I've added two new pages, still served as application/xhtml+xml but with .html file extensions. I haven't gathered any results from them yet, though.
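The point of the extra pages is that the content type is set by the server and need not follow the file extension. The following Python sketch shows one way that could be arranged with the standard library's `http.server`. The paths, the page body, and the names `CONTENT_TYPES`, `content_type_for`, `TestPageHandler`, and `run` are all assumptions made for illustration; this is not how the actual test pages are served.

```python
from http.server import BaseHTTPRequestHandler, HTTPServer

XHTML = "application/xhtml+xml"

# Hypothetical URL paths. The media type is chosen per path,
# not derived from the file extension.
CONTENT_TYPES = {
    "/index.html": "text/html",
    "/well-formed.xhtml": XHTML,
    "/well-formed.html": XHTML,  # .html extension, still served as XHTML
}

# Hypothetical page body shared by every path in this sketch.
PAGE = (
    '<html xmlns="http://www.w3.org/1999/xhtml">'
    "<head><title>Test page</title></head>"
    "<body><p>Test content.</p></body></html>"
)


def content_type_for(path: str):
    """Return the media type configured for a path, or None if unknown."""
    return CONTENT_TYPES.get(path)


class TestPageHandler(BaseHTTPRequestHandler):
    def do_GET(self):
        content_type = content_type_for(self.path)
        if content_type is None:
            self.send_error(404)
            return
        body = PAGE.encode("utf-8")
        self.send_response(200)
        self.send_header("Content-Type", content_type + "; charset=utf-8")
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)


def run(port: int = 8000) -> None:
    """Serve the test pages on localhost until interrupted."""
    HTTPServer(("127.0.0.1", port), TestPageHandler).serve_forever()
```

Calling `run()` starts the server locally. A request for `/well-formed.html` then returns a `Content-Type` of application/xhtml+xml despite the extension, which is the situation the two new pages are meant to test.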
Here are the most recent results:
Search engine | Indexed pages: Control | Indexed pages: XHTML | Indexed pages: Not Well-formed
---|---|---|---

Date | Update
---|---
23 January 2008 | My initial post
29 January 2008 | This page created
 | File extensions and accept headers
3 February 2008 | Initial investigation incorrect
You can follow updates through the news feed of my blog.