google and microdata

Not too long ago I wrote about the real-life use of html5 microdata and how it takes us one step closer to the ideal of a semantic web. While I'm still pretty excited to see the web expand in this direction, there is at least one serious bump in the road worth mentioning. Bottom line: the easier it is for crawlers and other pieces of software to read our data, the easier it becomes for them to steal our data for their own gain. And currently we have no way to protect ourselves.

like thieves in the night

This is not a new problem, of course. There is plenty of software out there today that crawls specific sites and pages in order to harvest data. As long as websites do not provide an API to access their data, this is the only feasible way to accomplish certain tasks. For example, a site like icheckmovies provides a service where users can import their IMDb votes, but IMDb does not offer other sites a way to access this particular data. So icheckmovies asks you for the url of the page containing your votes and crawls that page looking for the data it needs. As long as the html source does not change, this is a pretty reliable way to extract data online. When IMDb does change the source html (like they did a couple of weeks ago), the service breaks and has to be adapted to match the new html structure.
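To make that fragility concrete, here is a minimal sketch in Python of a scraper that depends on the page structure. The markup, the class names and the `scrape_votes` helper are all made up for illustration; this is not how icheckmovies or IMDb actually work.

```python
from html.parser import HTMLParser


class VoteScraper(HTMLParser):
    """Collects the text of every <td class="vote"> cell (hypothetical markup)."""

    def __init__(self):
        super().__init__()
        self._in_vote_cell = False
        self.votes = []

    def handle_starttag(self, tag, attrs):
        # The scraper is tied to one specific tag and class name.
        if tag == "td" and dict(attrs).get("class") == "vote":
            self._in_vote_cell = True

    def handle_endtag(self, tag):
        if tag == "td":
            self._in_vote_cell = False

    def handle_data(self, data):
        if self._in_vote_cell and data.strip():
            self.votes.append(data.strip())


def scrape_votes(html):
    parser = VoteScraper()
    parser.feed(html)
    parser.close()
    return parser.votes


# Invented "before" and "after" versions of the same votes page.
OLD_PAGE = '<table><tr><td class="title">Alien</td><td class="vote">9</td></tr></table>'
NEW_PAGE = '<ul><li><span class="name">Alien</span> <span class="rating">9</span></li></ul>'

print(scrape_votes(OLD_PAGE))  # ['9']
print(scrape_votes(NEW_PAGE))  # [] (same data, but the scraper no longer finds it)
```

The vote is still on the redesigned page, but because the scraper was written against the old table layout it comes back empty until someone rewrites it.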

I'm not sure about the specifics, but legally speaking this is somewhat of a gray area. When the data is public it can be used by others. On the other hand, you can't just copy a whole database of information from another site. That's why big sites like IMDb (or any other database-fueled data site) introduce known errors into their data (Google Maps has a couple of non-existent towns, for example). If these errors show up on other sites, the original owners know they've been robbed of their hard work.

the new google

Search engines like Google Search also crawl your site for data. This is not really a problem because, if all goes well, they will direct people to your site based on the search criteria those people entered. A search engine uses your data simply to produce a result snippet, so users can make some kind of initial decision before they click through to your site. Google generates traffic for our websites, so nobody minds.

But what if Google were to use the data on your site for other things besides generating links to your site? According to an article published on HBR, Google is aiming to produce immediate answers to direct questions, effectively bypassing the sites where it got its information. It's nothing more than an extension of what they are already doing with exchange rate calculations and simple math problems, but because Google has access to an almost unlimited amount of data, it can actually start aggregating and analyzing that data to predict the answer to more complex questions. In the end, it's not even stealing your data, but simply using it to predict the correct answer.

google and microdata

Semantics (more specifically microdata) are crucial in this process. They allow machines to understand data that would otherwise be captured in language-dependent full sentences. Google isn't guessing anymore, it knows. And because it knows, it will answer you directly rather than point to a source that might hold the answer to your question. For users of Google this is superb, as it saves a few clicks and they still get the information they were looking for. Other services too will have a much easier time figuring out your data. A site author can change the html all he wants: as long as the microdata implementation remains the same (which in theory it should), services that crawl his pages don't need to be rewritten every time something changes in the source.
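As a rough illustration of why microdata makes life so easy for crawlers, here is a small Python sketch that reads `itemprop` values and ignores everything else about the markup. The two page layouts and the `extract_properties` helper are invented for this example, and the extractor is deliberately simplified: it ignores `itemscope` nesting and assumes an `itemprop` element never contains another `itemprop` element. A real microdata parser has to handle both.

```python
from html.parser import HTMLParser

# Elements that never have a closing tag, so they must not affect nesting depth.
VOID_TAGS = {"br", "hr", "img", "input", "link", "meta"}


class ItempropExtractor(HTMLParser):
    """Maps each itemprop name to the text of the element that carries it."""

    def __init__(self):
        super().__init__()
        self.properties = {}
        self._current = None  # itemprop name currently being read
        self._depth = 0       # open child elements inside the current element
        self._buffer = []

    def handle_starttag(self, tag, attrs):
        attr_map = dict(attrs)
        if self._current is not None:
            if tag not in VOID_TAGS:
                self._depth += 1
            return
        name = attr_map.get("itemprop")
        if name is None:
            return
        if "content" in attr_map:
            # e.g. <meta itemprop="..." content="..."> carries its value inline
            self.properties[name] = attr_map["content"]
        elif tag not in VOID_TAGS:
            self._current, self._depth, self._buffer = name, 0, []

    def handle_endtag(self, tag):
        if self._current is None or tag in VOID_TAGS:
            return
        if self._depth == 0:
            self.properties[self._current] = "".join(self._buffer).strip()
            self._current = None
        else:
            self._depth -= 1

    def handle_data(self, data):
        if self._current is not None:
            self._buffer.append(data)


def extract_properties(html):
    parser = ItempropExtractor()
    parser.feed(html)
    parser.close()
    return parser.properties


# Two invented layouts of the same review; only the microdata stays constant.
TABLE_LAYOUT = (
    '<div itemscope itemtype="http://schema.org/Review"><table><tr>'
    '<td itemprop="name">Alien</td>'
    '<td itemprop="reviewBody">Still <em>terrifying</em>.</td>'
    '</tr></table></div>'
)
LIST_LAYOUT = (
    '<article itemscope itemtype="http://schema.org/Review">'
    '<h2 itemprop="name">Alien</h2>'
    '<p itemprop="reviewBody">Still <em>terrifying</em>.</p>'
    '</article>'
)

print(extract_properties(TABLE_LAYOUT))
print(extract_properties(LIST_LAYOUT))
```

Both calls return the same dictionary, `{'name': 'Alien', 'reviewBody': 'Still terrifying.'}`, even though the surrounding html is completely different. That is exactly the convenience described above, and it works for anyone who wants the data, whatever their intentions.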

As content authors though, we could feel a bit cheated by this. External services are using our carefully marked up data for their own benefit. Google does provide extra links to its sources, but only in a collapsed view which is likely to be ignored by people just looking for the answer. What this means is that we are doing all the hard work while Google is taking all the credit.

Blogs like mine might (at least for some time) escape the first few blows because we offer opinions and contextual articles, not so much single answers to direct questions. Then again, I believe it's probably just a matter of time before we're going to feel the consequences of this. Google could just as well roll out a list of film reviews (with some source links in the footer that nobody is going to click anyway), reliably harvesting its information from sites that use the movie and review microdata formats. That way it shows our reviews without giving us the proper credit for writing them.


What bothers me the most is that we as content authors gain very little by going the extra step to mark up our data with microdata. We may even lose a part of our audience that way. Sure, the people we lose are probably just looking for a simple answer and may not be particularly interested in the rest of our site, but branding works in mysterious ways. Currently there is no way to protect ourselves from this, and we are at the mercy of Google and other search engines to provide visible source links and quotes so we are at least given the proper credit for our work.

If search engine developers play this right, both engines and content authors could benefit from the semantic web. But if they're going to claim all the credit for the data we are providing, many people are going to be discouraged from writing for the web. Not only that, it could hurt the success of the semantic web itself, setting us back several steps in the process of making more sense out of this enormous cluster of information we call the internet.