Linked Data: my challenge

What with Gordon Brown’s recent (just an hour or so ago) announcement of lots of digital goodness at the “Building Britain’s Digital Future” event, the focus sharpens once again on Linked Data.

I’ve been sitting on the sidelines sniping gently at Linked Data since it apparently replaced the Semantic Web as The Next Big Thing. I was cynical about the Semantic Web all the way through, and as of right now I remain cynical about Linked Data as well.

This might seem odd from someone obsessed with – and a clear advocate of – the opening up of data. I’ve blogged about, talked about and written papers about what I’ve come to call MRD (Machine Readable Data). I’ve gone so far as to believe that if it doesn’t have an API, it doesn’t – or shouldn’t – exist.

So what is my problem with Linked Data? Surely what Linked Data offers is the holy grail of MRD? Shouldn’t I be embracing it as everyone else appears to be?

Yes. I probably should.

But…Linked Data runs headlong into one of the things I also blog about all the time here, and the thing I believe in probably more than anything else: simplicity.

If there is one thing I think we should all have learned from RSS, simple APIs, YQL, Yahoo Pipes, Google Docs and the like, it is this: for a technology to gain traction it has to be not only accessible, but simple and usable, too.

Here’s how I see Linked Data as of right now:

1. It is completely entrenched in a community who are deeply technically focused. They’re nice people, but I’ve had a good many conversations and not once has anyone been able to articulate for me the why or the how of Linked Data, or why it is better than focusing on simple MRD approaches. In that lack of understanding we have a problem: I’m not the sharpest tool, but I’m not stupid either, and I’ve been trying to understand for a fair amount of time…

2. There are very few (read: almost zero) compelling use-cases for Linked Data. And I don’t mean the Tim Berners-Lee “hey, imagine if you could do X” scenario; I mean real use-cases – things that people have actually built. And no, Twine doesn’t cut it.

3. The entry cost is high – deeply arcane and overly technical – whilst the value remains low. Find me something you can do with Linked Data that you can’t do with an API. If the value were way higher, the cost wouldn’t matter so much. But right now, what do you get if you publish Linked Data? And what do you get if you consume it?

Now, I’m deeply aware that I don’t actually know much about Linked Data. But I’m also aware that if someone like me – with my background and interests – doesn’t know much about Linked Data, there is a massive problem somewhere in the chain.

I genuinely want to understand Linked Data. I want to be a Linked Data advocate in the same way I’m an API/MRD advocate. So here is my challenge, and it is genuinely an open one. I need you, dear reader, to show me:

1. Why I should publish Linked Data. The “why” means I want to understand the value returned by the investment of time required – and by this I mean compelling, possibly visual and certainly useful examples.

2. How I should do this, and easily. If you need to use the word “ontology” or “triple”, or make me understand the deepest horrors of RDF, consider your approach a failed one.

3. Some compelling use-cases which demonstrate that this is better than a simple API/feed-based approach.

There you go – the challenge is on. Arcane technical types need not apply.

Pushing MRD out from under the geek rock

The week before last (30th June – 1st July 2009), I was at the JISC Digital Content Conference, having been asked to take part in one of their parallel sessions.

I thought I’d use the session to talk about something I’m increasingly interested in – shifting the message about machine readable data (think APIs, RSS, OpenSearch, Microformats, Linked Data, etc.) from the world of geek to the world of non-geek.

My slides are here:

[slideshare id=1714963&doc=dontthinkwebsitesthinkdatafinal-090713100859-phpapp02]

Here’s where I’m at: I think that MRD (that’s Machine Readable Data – I couldn’t seem to find a better term…) is probably about as important as it gets. It underpins an entire approach to content which is flexible, powerful and open. It embodies the notion of freely moving data, and it encourages innovation and visualisation. It is also not nearly as hard as it appears – or doesn’t have to be.

In the world of the geek (a world I dip into just long enough to see the potential before heading back out here into the sun), the proponents of MRD are many and passionate. Find me a Web 2.0 application without an API (or one “on the development road-map”) and I’ll find you a pretty unusual company.

These people don’t need preaching at. They’re there, lined up, building apps for Twitter (to the tune of ten times the traffic that visits twitter.com), developing a huge array of services and visualisations, graphs, maps, inputs and outputs.

The problem isn’t the geeks. The problem is that for these technologies to become truly embedded, MRD needs to move beyond the realm of the geek and into the realm of the content owner, the budget holder and the strategist. We need copyright holders and funders lined up at the start of the project, prepared for the fact that our content will be delivered through multiple access routes, across unspecified timespans and to unknown devices. We need our specifications to be focused on re-purposing, not on single-point delivery. We need solution providers delivering software with web APIs built in. We need to be prepared for a world in which no-one visits our websites any more, instead picking, choosing and mixing our content from externally syndicated channels.

In short, we now need the relevant people evangelising about the MRD approach.

Geeks have done this well so far, but now they need help. Try searching on “ROI for APIs” (or any combination thereof) and you’ll find almost nothing: very little evidence outlining how much APIs cost to implement, what cost savings you are likely to see from them or how they reduce content development time, and few guidelines on how to deal with copyright issues around syndicated content.

Partly, this knowledge gap is because many of the technologies we’re talking about are still quite young. But a lot of the problem is about the communication of technology, the divided worlds that Nick Poole (Collections Trust) speaks about. This was the core of my presentation: ten reasons why MRD is important, from the perspective of a non-geek (links go to relevant slides and examples in the slide deck):

  1. Content is still king
  2. Re-use is not just good, it’s essential
  3. “Wouldn’t it be great if…”: Life is easier when everyone can get at your data
  4. Content development is cheaper
  5. Things get more visual
  6. Take content to users, not users to content (“If you build it, they probably won’t come”)
  7. It doesn’t have to be hard
  8. You can’t hide your content
  9. We really is bigger and better than me
  10. Traffic

All this is a starter for ten. Bigger, better and more informed people than me probably have another hundred reasons why MRD is a good idea. I think the knowledge is out there – we just need to surface and collect it so that more (of the right) people can benefit from these approaches.

hoard.it: bootstrapping the NAW

What seems like a looong time ago, I came up with an idea for “bootstrapping” the Non API Web (NAW), particularly around extracting unstructured content from (museum) collections pages.

The idea of scraping pages when there’s no data access API isn’t new: Dapper launched a couple of years ago with a model for mapping and extracting “ordinary” HTML into a more programmatically useful format like RSS, JSON or XML, and before that numerous projects did the same (PiggyBank, Solvent, etc.). Dapper has about the friendliest Web 2.0-style interface so far, but IMHO it still fails in a number of ways.

Of course, there’s always the alternative approach, which Frankie Roberto outlined in his paper at Museums and the Web this year: don’t worry about the technology; instead approach the institution for data via an FOI request…

The original prototype I developed was based around a bookmarklet: the idea was that a user would navigate to an object page (although any templated “collection” or “catalogue” page is essentially treated the same). If they wanted to “collect” the object on that page they’d click the bookmarklet, a script would look for data “shapes” against a pre-defined store, and then extract the data. Here are some screen grabs of the process (click for bigger):

[Screen grab: Science Museum object page] An object page on the Science Museum website.
[Screen grab: bookmarklet pop-up] The user clicks on the bookmarklet and a pop-up tells them that this page has been “collected” before; the data is separated by the template and “structured”.
[Screen grab: bookmarklet pop-up] Here, the object hasn’t been collected, but the tech spots that the template is the same, so it knows how to deal with the “data shape”.
[Screen grab: defining fields in the hoard.it interface] The hoard.it interface, showing how the fields are defined.
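
To make the “data shape” idea a bit more concrete, here’s a rough sketch in Python. To be clear: this is not the actual hoard.it code, and the template fingerprint and field selectors below are entirely invented for illustration – the point is just that a store maps each known page template to a set of selectors, and “collecting” a page means finding the matching shape and pulling its fields out:

```python
# A hypothetical sketch of the "data shape" store - not the real hoard.it
# code. Assumes the requests and beautifulsoup4 libraries are installed.
import requests
from bs4 import BeautifulSoup

# One entry per known page template, mapping field names to CSS selectors.
# Both the fingerprint and the selectors here are made up for illustration.
SHAPE_STORE = {
    "sciencemuseum-object": {
        "fingerprint": "div#object-details",
        "fields": {
            "dc.title": "div#object-details h1",
            "dc.description": "div#object-details p.description",
            "location.made": "div#object-details span.made-in",
        },
    },
}

def collect(url):
    """Fetch a page, find a shape whose fingerprint matches, extract fields."""
    soup = BeautifulSoup(requests.get(url, timeout=10).text, "html.parser")
    for shape in SHAPE_STORE.values():
        if soup.select_one(shape["fingerprint"]) is None:
            continue  # this template doesn't match the page
        record = {}
        for field, selector in shape["fields"].items():
            element = soup.select_one(selector)
            if element is not None:
                record[field] = element.get_text(strip=True)
        return record
    return None  # unknown template: this page hasn't been "shaped" yet
```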

I got talking to Dan Zambonini a while ago and showed him this first-pass prototype; he got excited about the potential straight away. Since then we’ve met a couple of times and exchanged ideas about what to do with the system, which we code-named “hoard.it”.

One of the ideas we pushed about early on was the concept of building web spidering into the system: instead of primarily having end-users as the “data triggers”, it should – we reasoned – be reasonably straightforward to define templates and then send a spider off to do the scraping instead.

The hoard.it spider

Dan has taken that idea and run with it. He built a spider in PHP, gave it a set of rules for templates and link-navigation, and set it going. A couple of days ago he sent me a link to the data he’s collected – at the time of writing, over 44,000 museum objects from 7 museums.
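
Dan’s spider is PHP and I haven’t seen under its hood, but the shape of the thing is roughly this – a toy Python sketch reusing the hypothetical collect() from the earlier snippet, with a made-up link-navigation rule:

```python
# A toy crawler in the same spirit: breadth-first over links that match a
# navigation rule, extracting a record wherever a known "shape" fits.
# (Pages are fetched twice here - once inside collect() - for brevity.)
import re
from collections import deque
from urllib.parse import urljoin

import requests
from bs4 import BeautifulSoup

FOLLOW = re.compile(r"/objects/")  # hypothetical link-navigation rule

def spider(seed_url, limit=100):
    seen, queue, records = set(), deque([seed_url]), []
    while queue and len(seen) < limit:
        url = queue.popleft()
        if url in seen:
            continue
        seen.add(url)
        record = collect(url)  # try the shape store first
        if record:
            records.append(record)
        # Then queue up any links the navigation rule allows.
        soup = BeautifulSoup(requests.get(url, timeout=10).text, "html.parser")
        for anchor in soup.find_all("a", href=True):
            link = urljoin(url, anchor["href"])
            if FOLLOW.search(link):
                queue.append(link)
    return records
```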

Dan has put together a REST-like querying method for getting at this data. Queries are passed in via the URL and constructed as attribute/value pairs – the query can be as long as you like, allowing fine-grained data access.

Data is returned as XML – there isn’t a schema right now, but that can follow in further prototypes. Dan has done quite a lot of munging to normalise dates and locations, and has then squeezed the results into a simplified Dublin Core format.

Here’s an example query (click to see results – opens new window):

http://feeds.boxuk.com/museums/xmlfeed/location.made/Japan/

So this means “show me everything where location.made=Japan”.

Getting more fine-grained:

http://feeds.boxuk.com/museums/xmlfeed/location.made/Japan/dc.subject/weapons,entertainment

Yes, you guessed it – this is “things where location.made=Japan and dc.subject=weapons or entertainment”.
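
If you’d rather build these queries in code, here’s a small Python helper. The endpoint is the one above, but since there’s no schema yet, the element names I read back out of the XML are guesses on my part:

```python
# Build hoard.it-style attribute/value query URLs and parse the XML reply.
# The base URL comes from the post; the layout of the response (one child
# element per object, with fields as sub-elements) is an assumption.
import requests
import xml.etree.ElementTree as ET

BASE = "http://feeds.boxuk.com/museums/xmlfeed"

def query(*pairs):
    """Chain (attribute, value) pairs onto the URL, e.g. location.made/Japan."""
    path = "/".join(f"{attr}/{value}" for attr, value in pairs)
    response = requests.get(f"{BASE}/{path}/", timeout=10)
    response.raise_for_status()
    return ET.fromstring(response.text)

root = query(("location.made", "Japan"), ("dc.subject", "weapons,entertainment"))
for item in root:  # assumed: one child element per museum object
    fields = {child.tag: (child.text or "").strip() for child in item}
    print(fields.get("dc.title", "(untitled)"))  # "dc.title" is a guessed name
```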

Dan has done some lovely first-pass displays of ways in which this data could be used.

Also, any query can be appended with “/format/html” to show a simple HTML rendering of the request:

http://feeds.boxuk.com/museums/xmlfeed/location.made/exeter/format/html

What does this all mean?

The exposing of museum data in “machine-useful” form is a topic about which you’ll have noticed I’m pretty passionate. It’s a hard call, though – and one I’m working on with a number of other museum enthusiasts: getting museums to understand the value of exposing data in this way.

The hoard.it method is a lovely workaround for those who don’t have machine-accessible object data, can’t afford it, or don’t understand why it’s important. On the one hand, it’s a hack – screen-scraping is by definition a “dirty” method for getting at data – and we’d all much prefer it if there were a better way: preferably, that all museums everywhere did this anyway. But the reality is very, very different. Most museums are still in the land of the NAW. I should also add that some (including the initial 7 museums spidered for the purposes of this prototype) have APIs that they haven’t exposed. Hoard.it can help those who have already done the work of digitising but haven’t exposed the data in a machine-readable format.

Now that we’ve got this kind of data returned, we can of course parse it and deliver back… pretty much anything, from mobile-formatted results to e-cards to kiosks to… well, use your imagination…
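
For instance – one more hypothetical sketch, with the same caveats about guessed element names as before – re-serving a hoard.it XML feed as JSON, which a mobile front end, an e-card generator or a kiosk could consume directly:

```python
# Re-serve a hoard.it XML feed as JSON - one trivial example of "delivering
# back pretty much anything". The response layout is an assumption, as before.
import json
import xml.etree.ElementTree as ET

def feed_to_json(xml_text):
    """Turn the XML feed into a JSON list of objects."""
    root = ET.fromstring(xml_text)
    objects = [
        {child.tag: (child.text or "").strip() for child in item}
        for item in root  # assumed: one child element per museum object
    ]
    return json.dumps(objects, indent=2)
```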

What next?

I’m running another mashed museum day the day before the annual Museums Computer Group conference in Leicester, and this data will be made available to anyone who wants to use it to build applications, visualisations or whatever around museum objects. Dan has got a bunch of ideas about how to extend the application, as have I – but I guess the main thing is that now it’s exposed, you can get into it and start playing!

How can I find out more?

We’re just in the process of putting together a simple series of wiki pages with some FAQs. Please use those, or the comments on this post, to get in touch. I look forward to hearing from you!