Museum mash

Most of you will probably have seen this already, but I’m running another museum mashup day in Leicester on 18th June 2008, the day before the annual UK Museums on the Web conference. It’ll be a developer/techie focussed day working with various bits of museum data to see what we can mash together in just one day.

We’re having to be very strict with the number of people allowed and the list is starting to look pretty full…but if you’re up for it, visit www.mashedmuseum.org.uk to find out more, or jump straight to the “I’m interested” form.

On a related note, I knocked out a quick application (the “hoard.it collector”) over the weekend based on data from the hoard.it application which Dan and I unveiled last week. It’s a terribly quick (read: a bit flaky) prototype which lets you “collect” museum object records and then share them so others can see what you’ve saved. Think Listmania but for museum stuff. The main purpose? Not to build something that’s never been done before, but (as ever) to demonstrate that it is possible to do interesting stuff without funding, the usual museum treaclewading or worrying too much about perfection. I’ll hone and polish the idea as time goes on – any ideas, stick ’em in the comments…

Dan and I are planning for the hoard.it data to be made available for the Mashed Museum day, but if you fancy a play before that, or can’t come along, you can read the documentation, do some building, and then let me or Dan know so we can chuck something in the hoard.it application gallery.

hoard.it : bootstrapping the NAW

What seems like a looong time ago I came up with an idea for “bootstrapping” the Non API Web (NAW), particularly around extracting un-structured content from (museum) collections pages.

The idea of scraping pages when there’s a lack of data access API isn’t new: Dapper launched a couple of years ago with a model for mapping and extracting from ‘ordinary’ html into a more programmatically useful format like RSS, JSON or XML. Before that there have been numerous projects that did the same (PiggyBank, Solvent, etc); Dapper is about the friendliest web2y interface so far, but it still fails IMHO in a number of ways.

Of course, there’s always the alternative approach, which Frankie Roberto outlined in his paper at Museums and the Web this year: don’t worry about the technology; instead approach the institution for data via an FOI request…

The original prototype I developed was based around a bookmarklet: the idea was that a user would navigate to an object page (although any templated “collection” or “catalogue” page is essentially the treated the same). If they wanted to “collect” the object on that page they’d click the bookmarklet, a script would look for data “shapes” against a pre-defined store, and then extract the data. Here’s some screen grabs of the process (click for bigger)

Science Museum object page An object page on the Science Museum website
Bookmarklet pop-up User clicks on the bookmarklet and a popup tells them that this page has been “collected” before. Data is separated by the template and “structured”
Bookmarklet pop-up Here, the object hasn’t been collected but the tech spots that the template is the same, so knows how to deal with the “data shape”
Defining fields in the hoard.it interface The hoard.it interface, showing how the fields are defined

I got talking to Dan Zambonini a while ago and showed him this first-pass prototype and he got excited about the potential straight away. Since then we’ve met a couple of times and exchanged ideas about what to do with the system, which we code-named “hoard.it”.

One of the ideas we pushed about early on was the concept of building web spidering into the system: instead of primarily having end-users as the “data triggers”, it should – we reasoned – be reasonably straightforward to define templates and then send a spider off to do the scraping instead.

The hoard.it spider

Dan has taken that idea and run with it. He built a spider in PHP, gave it a set of rules for templates and link-navigation and set it going. A couple of days ago he sent me a link to the data he’s collected – at time of writing, over 44,000 museum objects from 7 museums.

Dan has put together a REST-like querying method for getting at this data. Queries are passed in via URL and constructed in the form attribute/value – the query can be as long as you like, allowing fine-grained data access.

Data is returned as XML – there isn’t a schema right now, but that can follow in further prototypes. Dan has done quite a lot of munging to normalise dates and locations and then squeezed results into a simplified Dublin Core format.

Here’s an example query (click to see results – opens new window):

http://feeds.boxuk.com/museums/xmlfeed/location.made/Japan/

So this means “show me everything where location.made=Japan'”

Getting more fine-grained:

http://feeds.boxuk.com/museums/xmlfeed/location.made/Japan/dc.subject/weapons,entertainment

Yes, you guessed it – this is “things where location.made=Japan and dc.subject=weapons or entertainment”

Dan has done some lovely first-pass displays of ways in which this data could be used:

Also, any query can be appended with “/format/html” to show a simple html rendering of the request:

http://feeds.boxuk.com/museums/xmlfeed/location.made/exeter/format/html

What does this all mean?

The exposing of museum data in “machine-useful” form is a topic about which you’ll have noticed I’m pretty passionate. It’s a hard call, though (and one I’m working on with a number of other museum enthusiasts) – to get museums to understand the value of exposing data in this way.

The hoard.it method is a lovely workaround for those who don’t have, can’t afford or don’t understand why machine-accessible object data is important. On the one hand, it’s a hack – screenscraping is by definition a “dirty” method for getting at data. We’d all much prefer it if there was a better way – preferably, that all museums everywhere did this anyway. But the reality is very, very different. Most museums are still in the land of the NAW. I should also add that some (including the initial 7 museums spidered for the purposes of this prototype) have some API’s that they haven’t exposed. Hoard.it can help those who have already done the work of digitising but haven’t exposed the data in a machine-readable format.

Now that we’ve got this kind of data returned, we can of course parse it and deliver back…pretty much anything, from mobile-formatted results to ecards to kiosk to…well, use your imagination…

What next?

I’m running another mashed museum day the day before the annual Museums Computer Group conference in Leicester, and this data will be made available to anyone who wants to use it to build applications, visualisations or whatever around museum objects. Dan has got a bunch of ideas about how to extend the application, as have I – but I guess the main thing is that now it’s exposed, you can get into it and start playing!

How can I find out more?

We’re just in the process of putting together a simple series of wiki pages with some FAQ’s. Please use that, or the comments on this post to get in touch. Look forward to hearing from you!