Quantity or quality?

This might seem like an odd question, especially given the vast (vast) quantity of effort that goes into digitisation, rights checking, caption authoring and so on. But I’m also a fan of taking a step back at least every so often and asking odd, obvious and possibly stupid questions.

The question is in part prompted by an (apparently controversial) post on Read Write Web (I Don’t Know Much About Art But I Know What’s Online). I say “apparently controversial” because it seemed to kick off a fair-sized discussion on the MCG list, at least one blog post and a bunch of tweets from people who seemed to be a bit cross about it.

FWIW, it seemed to me to be an interesting and mostly fair post, albeit with moments of obvious silliness. Defining a single “museum experience”, for example, is easily as foolish as defining a single “shopping experience” or single “reading experience” or single any experience – it seems blindingly obvious there is no single experience, no single context, no single person. At the same time the point – that there is, really, nothing quite like seeing the real thing, no matter which way you cut it – is a fair one.

All of that aside, the interesting questions asked by the post seemed to be:

1. Is the Holy Grail of collections online to get THE LOT up on the web?

2. What makes for a good online collections experience, especially if you’ve delivered 1) and your collections number tens of thousands?

And of course, underlying these two questions is, for me, the interesting one: why? Why do it at all? Why spend hundreds of thousands (actually, millions upon millions…) of pounds digitising collections for distribution to a digital audience?

Clearly, the use-cases for online collections are as varied as anything else but there must be some answers here, right? If you’re a medium-sized museum considering your digitisation strategy, how do you choose what to do? Is it all about quantity, about some kind of “number of collections items online up 400% this year!” box-ticking exercise? And if it isn’t about quantity but quality, how do you go about measuring the impact of your strategy?

I find it hard to see past my own perspective on this one: personally, I’d always prefer a tiny number of objects (hundreds, or even tens!) where each has been given real, personal attention. Seeing enormous great lists of stuff where QUANTITY IS ALL seems somehow to miss the entire point. For me, this isn’t about the mass of objects but is somehow about the “gaps” between the objects: the relationships between them, the relationships to people and, most importantly, the stories. George Cavan’s now-famous matchbox means nothing without the story attached to it: with it, it has a huge and tear-jerking impact.

There again, I’m a punter and not a researcher. Maybe they’d think very differently.

Update: see Frankie Roberto’s post: “..what an art museum experience might feel like online”

hoard.it: bootstrapping the NAW

What seems like a looong time ago I came up with an idea for “bootstrapping” the Non API Web (NAW), particularly around extracting un-structured content from (museum) collections pages.

The idea of scraping pages when there’s a lack of a data access API isn’t new: Dapper launched a couple of years ago with a model for mapping and extracting from ‘ordinary’ HTML into a more programmatically useful format like RSS, JSON or XML. Before that there were numerous projects doing much the same (PiggyBank, Solvent, etc.); Dapper is about the friendliest web2y interface so far, but it still fails IMHO in a number of ways.

Of course, there’s always the alternative approach, which Frankie Roberto outlined in his paper at Museums and the Web this year: don’t worry about the technology; instead approach the institution for data via an FOI request…

The original prototype I developed was based around a bookmarklet: the idea was that a user would navigate to an object page (although any templated “collection” or “catalogue” page is essentially treated the same). If they wanted to “collect” the object on that page they’d click the bookmarklet, a script would look for data “shapes” against a pre-defined store, and then extract the data. Here are some screen grabs of the process (click for bigger):

Screen grab 1: an object page on the Science Museum website.
Screen grab 2: the user clicks the bookmarklet and a popup tells them that this page has been “collected” before. The data is separated out by the template and “structured”.
Screen grab 3: here the object hasn’t been collected, but the tech spots that the template is the same, so it knows how to deal with the “data shape”.
Screen grab 4: the hoard.it interface, showing how the fields are defined.
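To make the “data shape” idea a touch more concrete, here’s a minimal sketch – in Python, and emphatically not the actual hoard.it code – of how a stored template might map field names to selectors, and how an object page could be matched against the store and extracted. The domain keys, field names and selectors are all invented for illustration.

```python
# A minimal sketch of the "data shape" idea (illustrative only, not hoard.it).
# Each known template maps field names to selectors; an incoming page is
# matched against the store by domain before its data is extracted.
from urllib.parse import urlparse

import requests
from bs4 import BeautifulSoup

# Hypothetical store of templates: keys, field names and selectors are made up.
DATA_SHAPES = {
    "www.sciencemuseum.org.uk": {
        "dc.title":       "h1.object-title",
        "dc.description": "div.object-description p",
        "location.made":  "td.place-made",
    },
}

def extract_object(url, html=None):
    """Return a dict of field -> value if the page matches a known shape."""
    shape = DATA_SHAPES.get(urlparse(url).netloc)
    if shape is None:
        return None  # unknown template: nothing we can do with this page yet
    if html is None:
        html = requests.get(url, timeout=30).text
    soup = BeautifulSoup(html, "html.parser")
    record = {}
    for field, selector in shape.items():
        node = soup.select_one(selector)
        if node:
            record[field] = node.get_text(strip=True)
    return record
```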

I got talking to Dan Zambonini a while ago and showed him this first-pass prototype; he got excited about the potential straight away. Since then we’ve met a couple of times and exchanged ideas about what to do with the system, which we code-named “hoard.it”.

One of the ideas we pushed about early on was the concept of building web spidering into the system: instead of primarily having end-users as the “data triggers”, it should – we reasoned – be reasonably straightforward to define templates and then send a spider off to do the scraping.

The hoard.it spider

Dan has taken that idea and run with it. He built a spider in PHP, gave it a set of rules for templates and link navigation, and set it going. A couple of days ago he sent me a link to the data he’s collected – at the time of writing, over 44,000 museum objects from 7 museums.
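The crawl loop itself is conceptually simple. Dan’s spider is written in PHP; the sketch below is a rough Python equivalent of the same idea – a seed URL, a pattern deciding which links to follow, and a per-page extractor like the one sketched above – rather than a copy of his implementation.

```python
# Rough sketch of the spidering idea (not Dan's PHP implementation):
# breadth-first crawl from a seed page, extract a record wherever the
# template is recognised, and queue any links matching link_pattern.
import re
from collections import deque
from urllib.parse import urljoin

import requests
from bs4 import BeautifulSoup

def crawl(seed_url, link_pattern, extract, max_pages=500):
    """Return the records extracted from pages reachable from seed_url."""
    seen, queue, records, fetched = {seed_url}, deque([seed_url]), [], 0
    while queue and fetched < max_pages:
        url = queue.popleft()
        fetched += 1
        try:
            html = requests.get(url, timeout=30).text
        except requests.RequestException:
            continue  # a real spider would log, retry and rate-limit
        record = extract(url, html)  # e.g. extract_object above; None if unknown
        if record:
            records.append(record)
        for a in BeautifulSoup(html, "html.parser").find_all("a", href=True):
            link = urljoin(url, a["href"])
            if re.search(link_pattern, link) and link not in seen:
                seen.add(link)
                queue.append(link)
    return records
```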

Dan has put together a REST-like querying method for getting at this data. Queries are passed in via URL and constructed in the form attribute/value – the query can be as long as you like, allowing fine-grained data access.

Data is returned as XML – there isn’t a schema right now, but that can follow in further prototypes. Dan has done quite a lot of munging to normalise dates and locations and then squeezed results into a simplified Dublin Core format.

Here’s an example query (click to see results – opens new window):

http://feeds.boxuk.com/museums/xmlfeed/location.made/Japan/

So this means “show me everything where location.made=Japan”.

Getting more fine-grained:

http://feeds.boxuk.com/museums/xmlfeed/location.made/Japan/dc.subject/weapons,entertainment

Yes, you guessed it – this is “things where location.made=Japan and dc.subject=weapons or entertainment”
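Because it’s all plain URLs returning XML, consuming the feed from code is trivial. Here’s a quick Python sketch – and since there’s no schema yet, the element names inside the XML aren’t pinned down, so this just walks whatever record elements come back rather than assuming particular tags.

```python
# Fetch a hoard.it query and walk the XML generically: one child element per
# record is assumed, but no particular tag names are relied upon.
import xml.etree.ElementTree as ET

import requests

url = ("http://feeds.boxuk.com/museums/xmlfeed/"
       "location.made/Japan/dc.subject/weapons,entertainment")

root = ET.fromstring(requests.get(url, timeout=30).content)

for record in root:
    fields = {child.tag: (child.text or "").strip() for child in record}
    print(fields)
```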

Dan has done some lovely first-pass displays of ways in which this data could be used.

Also, any query can be appended with “/format/html” to show a simple HTML rendering of the request:

http://feeds.boxuk.com/museums/xmlfeed/location.made/exeter/format/html

What does this all mean?

Exposing museum data in “machine-useful” form is a topic you’ll have noticed I’m pretty passionate about. It’s a hard sell, though (and one I’m working on with a number of other museum enthusiasts): getting museums to understand the value of exposing data in this way.

The hoard.it method is a lovely workaround for those who don’t have machine-accessible object data, can’t afford it, or don’t understand why it’s important. On the one hand, it’s a hack – screen-scraping is by definition a “dirty” method for getting at data. We’d all much prefer it if there was a better way – preferably, that all museums everywhere exposed their data properly anyway. But the reality is very, very different: most museums are still in the land of the NAW. I should also add that some (including the initial 7 museums spidered for the purposes of this prototype) have APIs that they haven’t exposed. Hoard.it can help those who have already done the work of digitising but haven’t exposed the data in a machine-readable format.

Now that we’ve got this kind of data returned, we can of course parse it and deliver back… pretty much anything, from mobile-formatted results to ecards to kiosks to… well, use your imagination…
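As a trivial example of that re-delivery step, here’s a sketch that takes records parsed from the feed (as above) and turns them into a bare-bones HTML fragment that could be dropped into a mobile page or an ecard. The field names it looks for are assumptions, not a fixed schema.

```python
# Toy "deliver it back as something else" step: turn parsed records into a
# minimal HTML list. Field names such as "dc.title" are assumed, not fixed.
from html import escape

def to_html_fragment(records):
    items = []
    for rec in records:
        title = escape(rec.get("dc.title", "Untitled object"))
        desc = escape(rec.get("dc.description", ""))
        items.append(f"<li><strong>{title}</strong> {desc}</li>")
    return "<ul>\n" + "\n".join(items) + "\n</ul>"
```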

What next?

I’m running another mashed museum day the day before the annual Museums Computer Group conference in Leicester, and this data will be made available to anyone who wants to use it to build applications, visualisations or whatever around museum objects. Dan has got a bunch of ideas about how to extend the application, as have I – but I guess the main thing is that now it’s exposed, you can get into it and start playing!

How can I find out more?

We’re just in the process of putting together a simple series of wiki pages with some FAQs. Please use those, or the comments on this post, to get in touch. Looking forward to hearing from you!