A possible next step for hoard.it?

I first wrote about hoard.it, the bootstrapped “API spider” that Dan Zambonini and I built, back in 2008. We followed the technology up with a paper for Museums and the Web 2009, which talked about some possible future directions for the service: if you scroll down the paper you’ll see a section entitled “next steps” which outlines some of these.

I’ve always been very excited by the possibilities that the hoard.it model might uncover. Not, as you’ll see from the paper, because this is a new or particularly original approach – screenscraping is as old as the hills – but because it offers up some rapid and very easy solutions to some bigger problems. These problems are even older than screenscraping – legacy content, no resources, little expertise – and they’re firmly entrenched in pretty much every web presence there is.

Much as the world seems to be heading along a Linked Data route, it comes back to me time and time again that however beautiful these RDF or RDFa or even API approaches are in theory, they mean little without uptake and critical mass. Simplicity of approach is what made RSS a success, for example: even though it isn’t perfect by a long shot, it gets things done rapidly and easily, and as a consequence it has become a cornerstone of the Machine Readable Data (MRD) web that we see today.

This post isn’t, however, about Linked Data – I’ll leave that for another day (let’s leave it at: I’m ready to be convinced, but I’m not there yet). That said, one of the major concerns I have about the LD approach is that, however you look at it, it requires a fair amount of effort to make the necessary changes. One day, I and others may be convinced that this effort is worth it, but for now this still leaves the huge problem that we outlined in our paper: legacy systems, legacy sites and legacy content – all of it put together at huge expense and effort, and none of it available in this new MRD world.

I got thinking about the hoard.it approach over the weekend and started to focus in on the ideas articulated in the “template mapping” part of the paper. Lurking at the back of my brain has been the question about who defines the mapping, and whether there might be any way to make this more flexible: make it something an organisation, individual or even a crowd-sourced democracy might be able to define.

The idea I’d like to float is this:

1. A page, section or even an entire website has a corresponding file which lays out a simple schema. This schema maps things in the HTML DOM to data fields using jQuery-like (simple, universal, understandable!) DOM pointers. Here’s what a single field definition might look like:

DC.Title > $("div#container ul.top-nav li:eq(0)").text();

This example means, hopefully obviously: “the text content at the DOM location div#container ul.top-nav li:eq(0) is the DC.Title field”.
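And a whole transformation file – for, say, a typical collections object page – might then be nothing more than a handful of these lines (the field names and selectors below are invented purely for illustration):

DC.Title > $("div#content h1").text();
DC.Creator > $("div#content span.maker").text();
DC.Date > $("div#content span.production-date").text();
DC.Identifier > $("div#content span.accession-number").text();
DC.Description > $("div#content div.object-description").text();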

2. If the institution creates this file then they can use a <link> element with a “rel” attribute somewhere in the head to point to this file and the mapping type – so for instance you might have <link rel="simple.dublincore.transformation" href="somelocalfile.txt" /> or even <link rel="hcard.microformat.transformation" />. This means that any MRD parser (or search engine!) which comes to the page could quickly parse the content, apply the transformation and then return the data in a pre-defined format: XML, RSS, vCard or maybe even RDF (I don’t know – could you? Help me out here!).
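As a rough illustration of that discovery step – just a sketch, written in Python for convenience, where BeautifulSoup and the function name are my own assumptions rather than anything specified – a parser might pick up those hints like this:

from urllib.parse import urljoin
from bs4 import BeautifulSoup    # assumed: any HTML parser with attribute access would do

def discover_transformations(html, base_url):
    """Return {rel value: absolute URL} for any <link rel="*.transformation"> hints in the page."""
    soup = BeautifulSoup(html, "html.parser")
    found = {}
    for link in soup.find_all("link", href=True):
        rels = link.get("rel") or []
        rels = rels.split() if isinstance(rels, str) else rels
        for rel in rels:
            if rel.endswith(".transformation"):
                found[rel] = urljoin(base_url, link["href"])
    return found

# e.g. discover_transformations(page_html, "http://example.org/objects/123") might return
# {"simple.dublincore.transformation": "http://example.org/somelocalfile.txt"}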

3. If the institution doesn’t create this file, then anyone could write a transformation using this simple approach, publish their transformation file on the web and then use it to get data from the page using a simple parser. I’m writing an example parser as we speak – right now all it accepts is the source url (the data) and the definition url (the transformation file) but this could be much more effective with some further thought…
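To make that a bit more concrete, here’s the rough shape such a parser might take. This is only a minimal sketch in Python, not the real thing: requests, BeautifulSoup and all of the names are my own assumptions, and it assumes the definition file sticks to plain CSS selectors (jQuery-only bits like :eq() would need translating into something like :nth-of-type() first).

import re
import requests                  # assumed: any HTTP client would do
from bs4 import BeautifulSoup    # assumed: any CSS-selector-capable parser would do

# One mapping line, in the syntax floated above:
#   DC.Title > $("div#container ul.top-nav li").text();
MAPPING_LINE = re.compile(r'^\s*(\S+)\s*>\s*\$\(\s*["\'](.+?)["\']\s*\)\.text\(\)\s*;?')

def load_transformation(definition_url):
    """Fetch a transformation file and return a {field: css selector} mapping."""
    mappings = {}
    for line in requests.get(definition_url).text.splitlines():
        match = MAPPING_LINE.match(line)
        if match:
            field, selector = match.groups()
            mappings[field] = selector
    return mappings

def parse(source_url, definition_url):
    """Fetch the source page, apply the transformation and return {field: text}."""
    soup = BeautifulSoup(requests.get(source_url).text, "html.parser")
    data = {}
    for field, selector in load_transformation(definition_url).items():
        node = soup.select_one(selector)   # plain CSS only in this sketch
        data[field] = node.get_text(strip=True) if node else None
    return data

# e.g. parse("http://example.org/objects/123", "http://example.org/somelocalfile.txt")
# might return {"DC.Title": "Some object title", ...}

From there the result could be re-serialised as XML, RSS, vCard or whatever else the requesting application wants.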

And that, simply, is it. The advantages are:

1. Machine Readable Data (albeit simple MRD) out of the box: no changes (apart from a rel link if you want it) to the page. No arcane languages, no API development. No time lag (I’d see most applications reading this data in real time – just an HTTP request to the page, which is then parsed).

2. Any change to the structure of the page (the long-recognised problem with screen-scraping) can be quickly accommodated by updating a single defining file for the whole part of the site which has that “shape”. If this file is democratised in some way then the community could spot errors caused by changed page structure and amend the file accordingly.

3. Multiple files can be defined for a single page; likewise, sub-pages or sections can have more specific, “cascade-like” files tailored to that particular content shape (see the sketch just below).
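As a very rough sketch of how that cascade might work (reusing the hypothetical load_transformation from the parser sketch above, with made-up URLs), the more specific file only needs to re-map the fields that differ, and everything else falls through to the site-wide one:

site_wide = load_transformation("http://example.org/transform/site-wide.txt")
object_pages = load_transformation("http://example.org/transform/object-pages.txt")
mappings = {**site_wide, **object_pages}   # the more specific file wins wherever both define a field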

I know of two approaches which are similar but different. The first is the YQL-like approach to screen scraping, which is very, very elegant but also a) specific to Yahoo! (a company not in the best financial or future-proof shape) and b) as far as I know, something that has to be specified on a page-by-page basis rather than for collections of pages (let me know if this isn’t the case…). The second approach, which is similar in different ways, is GRDDL. This is more like the open approach I suggest here, but has the problem of a) being based around XSLT, which means the source document has to be valid XML, and b) requiring on-page edits, too.

These are the only two conceptual approaches that I know of which are similar, but it could well be that there are well-defined lightweight ways of doing this which others have written about.

I’d very much like your feedback on this roughly-shaped idea. I’ll get the parser up and running shortly and it’ll then become clear whether it might be an interesting approach – but if you know of similar ways of doing this stuff, I’d love to hear them.