extended semantics/crawl before you fly

When I talk about html I usually try to stress its structural function, as it is so often forgotten or ignored. Today though, I'm ready to do some fussing about semantics, in particular why its promise still doesn't deliver half as much as it could (and should). Let's face it, we're not putting so much time into writing structurally and semantically valuable html just because screenreader users could benefit from it.

google sucks

One of my main internet frustrations of the past couple of years is the lack of progress in the search engine field. With an insanely high market penetration of around 85%, Google is the industry leader, but their search engine hasn't really evolved all that much. The internet has, though. It's been growing ever since it was conceived, making it more difficult to find valuable sources of information with each passing day. I simply spend too much time wading through irrelevant and outdated sites.

In my opinion Google currently lacks two very important elements. First of all there is the date factor. Older articles have had more time to build a strong link base and will often rank higher than more recent articles, increasing the danger of receiving outdated information. A publish date filter is nonexistent, at least to my knowledge. But more importantly (and relevant to this article), Google's search engine lacks solid recognition of content types. When I look for film reviews, I want to receive a list of actual reviews, not pages that merely have the word review on them (usually grayed out because none have been submitted yet). And that's where semantics would come in handy.

we all wish to fly

Obviously I'm not the first one to think of this. Several steps have been taken in the past to extend the semantic power of our html code. Currently there are two (common) methodologies that try to accomplish this: Microformats and html5 microdata. Then there's RDF, but I'm going to leave that out of the discussion for now.

Microformats extend html semantics through the use of standardized (not necessarily semantic) class names. The most popular Microformat is hCard, which holds the data of a person or company (name, address, contact data, ...). There are a couple of other formats defined too, but they are mostly ignored by the web (though Google does parse some of them). The adoption rate of Microformats is depressingly slim, yet as a developer I can't say I'm all that surprised. The syntax is often fuzzy, unclear and downright impractical.
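To give an idea of what that looks like in practice, here is a minimal hCard sketch. The class names (vcard, fn, org, adr, ...) come from the hCard format itself; the person, company and contact details are made up for illustration.

```html
<!-- "vcard" is the root class that marks this block as an hCard -->
<div class="vcard">
  <!-- fn = formatted name, org = organization -->
  <span class="fn">Jane Doe</span>
  <span class="org">Example Company</span>

  <!-- adr groups the address sub-properties -->
  <div class="adr">
    <span class="street-address">1 Example Street</span>
    <span class="locality">Springfield</span>
    <span class="postal-code">12345</span>
  </div>

  <!-- contact data (illustrative values only) -->
  <a class="email" href="mailto:jane@example.com">jane@example.com</a>
  <span class="tel">+1 555 0100</span>
</div>
```

Note how names like fn and adr are standardized but not exactly self-explanatory, which is part of what I mean by fuzzy.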

Then there's html5 (yay, hype!) microdata. You can read the spec yourself, but currently it's still a working draft with hopefully a lot of drafting left to be done. Through the use of four (4!) attributes (itemscope, itemtype, itemid, itemprop) you are able to add extra semantics to your html. Two main problems exist here. First of all, it all sounds overly complex for what it's supposed to do. On top of that, most values for the itemprop attribute seem to correspond with the class names you'd normally put on there, which you still need for styling. So it sounds an awful lot like double effort to me.
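A minimal sketch of that double effort, assuming a film review as the content type. The vocabulary URL and the property names (title, author, rating) are hypothetical and only serve as an example; itemid is left out because it would need a real global identifier for the item.

```html
<!-- itemscope starts a new item, itemtype names its (hypothetical) vocabulary -->
<div class="review" itemscope itemtype="http://example.com/vocab/review">
  <!-- each itemprop value repeats the class name next to it -->
  <h2 class="title" itemprop="title">Some Film</h2>
  <span class="author" itemprop="author">Jane Doe</span>
  <span class="rating" itemprop="rating">4</span>
</div>
```

Every element carries the same word twice: once as a class for styling and once as an itemprop for meaning.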

processability vs findability

The problem as I see it is that we're overreaching here. Of course it would be awesome to automatically and fully process content types on the web. Google is trying to do just that with Google Squared (thanks to Mathias for the heads-up), but I would be more than happy if it would just find what my damn search queries are asking for.

The complexity of Microformats and microdata lies in trying to provide a full, standardized description of a content type, while most people would be happy with the raw data itself. I don't need a full matrix of data comparisons when looking to buy a dvd; I would be thrilled enough if Google could direct me to valid product pages only. Attempts to process everything at once are holding back technological advancement. We're waiting for full-fledged definitions of content types while basic recognition would simply suffice for now.

conclusion

Rather than define a complex model for content types, why not start by defining a simple, standardized and semantic base identifier? For most content types these identifiers would hardly need discussing. Use "event" for events, use "product" for products, use "review" for reviews. Prefix them (maybe), but stop there and try to make that work for a start. After that, there will be plenty of time to try and process all the data within.
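A rough sketch of what I have in mind. This is not an existing standard; the "ct-" prefix in the last example is purely an assumption to show what a prefixed variant could look like.

```html
<!-- one base identifier per content type, nothing more -->
<div class="review">
  <h2>Some Film</h2>
  <p>The actual review text goes here.</p>
</div>

<div class="product">
  <h2>Some Film (dvd)</h2>
</div>

<!-- hypothetical prefixed variant, to avoid clashes with existing class names -->
<div class="ct-event">
  <h2>Some Film Festival</h2>
</div>
```

The markup inside each block stays free-form. The only promise made to a search engine is that the block is a review, a product or an event.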

In my opinion, classes should suffice for this. Design and meaning are actually linked closely enough to warrant the use of class names. I'm really a big fan of the Microformat ideology; I just think it's overcomplicated and over-descriptive at the moment. Which is a shame, because bad search results are actively ruining my internet experience every single day.