Linked Data: my challenge

What with Gordon Brown’s recent (just an hour or so ago) announcement of lots of digital goodness at the “Building Britain’s Digital Future” event, the focus sharpens once again on Linked Data.

I’ve been sitting on the sidelines sniping gently at Linked Data since it apparently replaced the Semantic Web as The Next Big Thing. I remained cynical about the SW all the way through, and as of right now I remain cynical about Linked Data as well.

This might seem odd from someone obsessed with – and a clear advocate of – the opening up of data. I've blogged about, talked about and written papers about what I've come to call MRD (Machine Readable Data). I've gone so far as to believe that if it doesn't have an API, it doesn't – or shouldn't – exist.

So what is my problem with Linked Data? Surely what Linked Data offers is the holy grail of MRD? Shouldn’t I be embracing it as everyone else appears to be?

Yes. I probably should.

But…Linked Data runs headlong into one of the things I also blog about all the time here, and the thing I believe in probably more than anything else: simplicity.

If there is one thing I think we should all have learned from RSS, simple API’s, YQL, Yahoo Pipes, Google Docs, etc it is this: for a technology to gain traction it has to be not only accessible, but simple and usable, too.

Here’s how I see Linked Data as of right now:

1. It is completely entrenched in a community who are deeply technically focused. They're nice people, but I've had a good many conversations and never once has anyone been able to articulate for me the why or the how of Linked Data, or why it is better than focusing on simple MRD approaches – and in that lack of understanding we have a problem. I'm not the sharpest tool, but I'm not stupid either, and I've been trying to understand for a fair amount of time…

2. There are very few (read: almost zero) compelling use-cases for Linked Data. And I don’t mean the TBL “hey, imagine if you could do X” scenario, I mean real use-cases. Things that people have actually built. And no, Twine doesn’t cut it.

3. The entry cost is high – deeply arcane and overly technical, whilst the value remains low. Find me something you can do with Linked Data that you can’t do with an API. If the value was way higher, the cost wouldn’t matter so much. But right now, what do you get if you publish Linked Data? And what do you get if you consume it?

Now, I'm deeply aware that I don't actually know much about Linked Data. But I'm also aware that when someone like me – with my background and interests – doesn't know much about Linked Data, there is a massive problem somewhere in the chain.

I genuinely want to understand Linked Data. I want to be a Linked Data advocate in the same way I’m an API/MRD advocate. So here is my challenge, and it is genuinely an open one. I need you, dear reader, to show me:

1. Why I should publish Linked Data. The “why” means I want to understand the value returned by the investment of time required, and by this I mean compelling, possibly visual and certainly useful examples

2. How I should do this, and easily. If you need to use the word “ontology” or “triple” or make me understand the deepest horrors of RDF, consider your approach a failed approach

3. Some compelling use-cases which demonstrate that this is better than a simple API/feed based approach

There you go – the challenge is on. Arcane technical types need not apply.

Pushing MRD out from under the geek rock

The week before last (30th June – 1st July 2009), I was at the JISC Digital Content Conference having been asked to take part in one of their parallel sessions.

I thought I'd use the session to talk about something I'm increasingly interested in – the shifting of the message about machine readable data (think API's, RSS, OpenSearch, Microformats, Linked Data, etc) from the world of geek to the world of non-geek.

My slides are here:

[slideshare id=1714963&doc=dontthinkwebsitesthinkdatafinal-090713100859-phpapp02]

Here's where I'm at: I think that MRD (that's Machine Readable Data – I couldn't seem to find a better term…) is probably about as important as it gets. It underpins an entire approach to content which is flexible, powerful and open. It embodies notions of freely moving data, and it encourages innovation and visualisation. It is also not nearly as hard as it appears – or doesn't have to be.

In the world of the geek (that’s a world I dip into long enough to see the potential before heading back out here into the sun), the proponents of MRD are many and passionate. Find me a Web2.0 application without an API (or one “on the development road-map”) and I’ll find you a pretty unusual company.

These people don’t need preaching at. They’re there, lined up, building apps for Twitter (to the tune of 10x the traffic which visits twitter.com), developing a huge array of services and visualisations, graphs, maps, inputs and outputs.

The problem isn’t the geeks. The problem is that MRD needs to move beyond the realm of the geek and into the realm of the content owner, the budget holder, the strategist, for these technologies to become truly embedded. We need to have copyright holders and funders lined up at the start of the project, prepared for the fact that our content will be delivered through multiple access routes, across unspecified timespans and to unknown devices. We need our specifications to be focused on re-purposing, not on single-point delivery. We need solution providers delivering software with web API’s built in. We need to be prepared for a world in which no-one visits our websites any more, instead picking, choosing and mixing our content from externally syndicated channels.

In short, we now need the relevant people evangelising about the MRD approach.

Geeks have done this well so far, but now they need help. Try searching on "ROI for API's" (or any combination thereof) and you'll find almost nothing: very little evidence outlining how much API's cost to implement, what cost savings you are likely to see from them or how they reduce content development time, and few guidelines on how to deal with syndicated content copyright issues.

Partly, this knowledge gap is because many of the technologies we’re talking about are still quite young. But a lot of the problem is about the communication of technology, the divided worlds that Nick Poole (Collections Trust) speaks about. This was the core of my presentation: ten reasons why MRD is important, from the perspective of a non-geek (links go to relevant slides and examples in the slide deck):

  1. Content is still king
  2. Re-use is not just good, it’s essential
  3. “Wouldn’t it be great if…”: Life is easier when everyone can get at your data
  4. Content development is cheaper
  5. Things get more visual
  6. Take content to users, not users to content (“If you build it, they probably won’t come”)
  7. It doesn’t have to be hard
  8. You can’t hide your content
  9. We really is bigger and better than me
  10. Traffic

All this is a starter for ten. Bigger, better and more informed people than me probably have another hundred reasons why MRD is a good idea. I think this knowledge may be there – we just need to surface and collect it so that more (of the right) people can benefit from these approaches.

Scraping, scripting, hacking

I just finished my talk at Mashed Library 2009 – an event for librarians wanting to mash and mix their data. My talk was almost definitely a bit overwhelming, judging by the backchannel, so I thought I’d bang out a quick blog post to try and help those I managed to confuse.

My talk was entitled "Scraping, Scripting and Hacking your way to API-less data", and was intended to give a high-level overview of some of the techniques that can be used to "get at data" on the web when the "nice" options of feeds and API's aren't available to you.

The context of the talk was this: almost everything we're talking about with regard to mashups, visualisations and so on relies on data being available to us. At the cutting edge of Web2 apps, everything has an API, a feed, a developer community. In the world of museums, libraries and government, this just isn't the case. Data is usually held on-page as html (xhtml if we're lucky), and programmatic access is nowhere to be found. If we want to use that data, we need to find other ways to get at it.
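
To make "other ways" concrete before you dig into the slides, here's the simplest possible sketch of the idea: fetch an object page and lift a couple of fields straight out of the HTML. Everything in it – the URL, the CSS selectors, the field names – is invented for illustration, and it assumes you have the requests and BeautifulSoup libraries installed.

```python
# A minimal scraping sketch: pull "structured" data out of a plain HTML page
# when no API or feed exists. The URL and selectors are hypothetical -- point
# them at a real collections page and adjust to match its markup.
import requests
from bs4 import BeautifulSoup

OBJECT_PAGE = "http://www.example-museum.org/collections/object/12345"  # made up

html = requests.get(OBJECT_PAGE, timeout=30).text
soup = BeautifulSoup(html, "html.parser")

# Map the fields we want to wherever they live in the page template.
raw = {
    "title": soup.select_one("h1"),
    "description": soup.select_one("div.object-description"),  # assumed class
    "date": soup.select_one("span.object-date"),                # assumed class
}

# Keep the text only, and drop any field this particular template didn't have.
record = {field: el.get_text(strip=True) for field, el in raw.items() if el is not None}
print(record)
```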

My slides are here:

[slideshare id=1690990&doc=scrapingscriptinghacking-090707060418-phpapp02]

A few people asked that I provide the URLs I mentioned together with a bit of context. Many of the slides above have links to examples, but here’s a simple list for those who’d prefer that:

Phew. Now I can see why it was slightly overwhelming 🙂

The Brooklyn Museum API – Q&A with Shelley Bernstein and Paul Beaudoin

The concept and importance of museum-based API’s are notions that I’ve written about consistently (boringly, probably) both on this blog and elsewhere on the web. Programmatic and open access to data is – IMO – absolutely key to ensuring the long-term success of online collections.

Many conversations have been going on about how to make API's happen over the last couple of years, and I think we're finally seeing these conversations move away from niche groups of enthusiastic developers (e.g. Mashed Museum) into a more mainstream debate which also involves budget holders and strategists. These conversations have been aided by metrics from social media sites like Twitter which indicate that API access figures sometimes outstrip "normal web" browsing by a factor of 10 or more.

On March 4th 2009, Brooklyn Museum announced the launch of their API, the latest in a series of developments around their online collection. Brooklyn occupies a space which generates a fair amount of awe in museum web circles: Shelley Bernstein and team are always several steps ahead of the curve – innovating rapidly, encouraging a "just do it" attitude, and most importantly, engaging wholly with a totally committed tribe of users. Many other museums try to do social media. Brooklyn lives social media.

So, as they say – without further ado – here’s Shelley and Paul talking about what they did, how they did it, and why.

Q: First and foremost, could you please introduce yourselves – what your main roles and responsibilities are and how you fit within the museum.

Shelley Bernstein, Chief of Technology. I manage the department that runs the Museum’s helpdesk, Network Administration, Website, gallery technology, and social media.

Paul Beaudoin, Programmer. I push data around on the back-end and build website features and internal tools.

Q: Can you explain in as non-technical language as possible what exactly the Brooklyn API is, and what it lets people do?

SB: It's basically a way for outside programmers to query our Collections data and create their own applications using it.

Q: Why did you decide to build an API? What are the main things you hope to achieve …and what about those age old “social web” problems like authority, value and so-on?

SB: First, practical… in the past we'd been asked to be a part of larger projects where institutions were trying to aggregate data across many collections (like d*hub). At the time, we couldn't justify allocating the time to provide data sets which would become stale as fast as we could turn over the data. By developing the API, we can create this one thing that will work for many people, so it no longer becomes a project every time we are asked to take part.

Second, community… the developer community is not one we’d worked with before. We’d recently had exposure to the indicommons community at the Flickr Commons and had seen developers like David Wilkinson do some great things with our data there. It’s been a very positive experience and one we wanted to carry forward into our Collection, not just the materials we are posting to The Commons.

Third, community+practical… I think we needed to recognize that ideas about our data can come from anywhere, and encourage outside partnerships. We should recognize that programmers from outside the organization will have skills and ideas that we don’t have internally and encourage everyone to use them with our data if they want to. When they do, we want to make sure we get them the credit they deserve by pointing our visitors to their sites so they get some exposure for their efforts.

Q: How have you built it? (Both from a technical and a project perspective: what platform, backend systems, relationship to collections management / website; also how long has it taken, and how have you run the project?)

PB: The API sits on top of our existing "OpenCollection" code (no relation to its namesake at http://www.collectiveaccess.org) which we developed about a year ago. OpenCollection is a set of PHP classes sitting on top of a MySQL database, which contains all of the object data that's been approved for the Web.

All that data originates in our internal collections management systems and digital asset systems. SSIS scripts run nightly to identify approved data and images and push them to our FreeBSD servers for processing. We have several internal workflow tools that also contribute assets like labels, press releases, videos, podcasts, and custom-cropped thumbnails. A series of BASH and PHP scripts merge the data from the various sources and generate new derivatives as required (ImageMagick). Once compiled, new collection database dumps and images are pushed out to the Web servers overnight. Everything is scheduled to run automatically, so new data and images approved on Monday will be available in the wee hours Tuesday.

The API itself took about four weeks to build and document (documentation may have consumed the better part of that). But that seems like a misleading figure because so much of the API piggy-backs on our existing codebase. OpenCollection itself – and all of the data flow scripts that support it – took many months to build.
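
(As an aside from me: the nightly "approved data in, web derivatives out" pattern Paul describes is worth sketching, if only to show how little magic is involved. The snippet below is purely illustrative – invented paths, Python rather than their BASH/PHP, and an ImageMagick resize standing in for the real derivative generation.)

```python
# Purely illustrative: the general shape of a nightly "publish approved images"
# step, not Brooklyn's actual scripts. Paths and sizes are invented.
import subprocess
from pathlib import Path

APPROVED = Path("/data/approved/images")   # hypothetical staging area
WEB = Path("/var/www/collection/images")   # hypothetical web destination
WEB.mkdir(parents=True, exist_ok=True)

for source in APPROVED.glob("*.jpg"):
    derivative = WEB / source.name
    if derivative.exists():
        continue  # already published on a previous night
    # Generate a web-sized derivative with ImageMagick's convert tool
    # ("800x800>" means only shrink images larger than 800px).
    subprocess.run(["convert", str(source), "-resize", "800x800>", str(derivative)], check=True)
```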

Cool diagrams. Every desk should have some.

Q: How did you go about communicating the benefits of an API to internal stakeholders?

SB: Ha, well we used your hoard.it website as an example of what can happen if we don't! The general discussion centered around how we can work with the community and develop a way people can do this under our own terms, the alternative being that people are likely to do what they want anyway. We'd rather work with, than against. It also helped us immensely that an API had been released by DigitalNZ, so we had an example out there that we could follow.

Q: It’s obviously early days, but how much interest and take-up have you had? How much are you anticipating?

SB: We are not expecting a ton, but we’ve already seen a lot of creativity flowing which you can check out in our Application Gallery. We already know of a few things brewing that are really exciting. And Luke over at the Powerhouse is working on getting our data into d*hub already, so stay tuned.

Q: Can you give us some indication of the budget – at least ballpark, or as a % compared to your annual operating budget for the website?

SB: There was no budget specifically assigned to this project. We had a window of time in which we thought we could slot in the development, and we took it. Moving forward, we will make changes to the API and add features as time can be allocated, but it will often need to be secondary to other projects we need to accomplish.

Q: How are you dealing with rights issues?

SB: Anything that is under copyright is being delivered at a very small thumbnail size (100px on the longest side) for identification purposes only.

Q: What restrictions do you place on users when accessing, displaying and otherwise using your data?

SB: I’m not even going to attempt to summarize this one. Here’s the Terms of Service – everyone go get a good cup of coffee before settling down with it.

Q: You chose a particular approach (REST) to expose your collections. Could you talk a bit about the technical options you considered before coming to this solution, and why you preferred REST to these others?

PB: Actually it's been pointed out that our API isn't perfectly RESTful, so let me say first that, humbly, we consider our API REST-inspired at best. I've long been a fan of REST and tend to gravitate to it in principle. But when it comes down to it, development time and ease of use are the top concerns.

At the time the API was spec’ed we decided it was more important to build something that someone could jump right into than something meeting some aesthetic ideal. Of course those aren’t mutually exclusive goals if you have all the dev time in the world, but we don’t. So we thought about our users and looked to the APIs that seemed to be getting the most play (Flickr, DigiNZ, and many Google projects come to mind) and borrowed aspects we thought worked (api keys, mindful use of HTTP verbs, simple query parameters) and left out the things we thought were extraneous or personally inappropriate (complicated session management, multiple script gateways). The result is, I think, a lightweight API with very few rules and pretty accommodating responses. You don’t have to know what an XSD is to jump in.
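
(Another illustrative aside from me rather than anything from the Brooklyn documentation: the "API key plus simple query parameters" pattern Paul describes looks roughly like this from the consumer's side. The endpoint, parameters and response shape below are all made up – check their real docs for the actual methods.)

```python
# Illustrative only: a key-plus-query-parameters call in the style described
# above. The endpoint, parameters and response shape are invented -- consult
# the real Brooklyn Museum API documentation for the actual methods.
import requests

API_KEY = "your-api-key-here"                          # issued when you register
ENDPOINT = "http://api.example-museum.org/collection"  # hypothetical endpoint

response = requests.get(
    ENDPOINT,
    params={"api_key": API_KEY, "method": "search", "keyword": "vase", "limit": 10},
    timeout=30,
)
response.raise_for_status()

for item in response.json().get("results", []):        # assumed response shape
    print(item.get("title"))
```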

Q: What advice would you give to other museums / institutions wanting to follow the API path?

SB: You mean other than “do it” <insert grin here>? No, really, if it’s right for the institution and their goals, they should consider it. Look to the DigitalNZ project and read this interview with their team (we did and it inspired us). Try and not stress over making it perfect first time out, just try and see what it yields…then adjust as you go along. Obviously, the more institutions that can open their data in this way, the richer the applications can become.

_______

Many, many thanks to Shelley and Paul for putting in the time to answer my questions. You can follow the development of the Brooklyn Museum collections and API over on their blog, or by following @brooklynmuseum on Twitter. More importantly, go build something cool 🙂

(Selling) content in a networked age

I’m just back from Torquay where I’d been asked to speak at the 32nd annual UKSG conference. I first came across UKSG more than a year ago when they asked me to speak at a London workshop they were hosting. Back then, I did a general overview of API’s from a non-technical perspective.

This time around, my presentation was about opening up access to content: the title “If you love your content, set it free?” builds on some previous themes I’ve talked and written about. Presenting on “setting content free” to a room of librarians and publishers is always likely to be difficult. Both groups are – either directly or indirectly – dependent on income from published works. I’m also neither publisher nor librarian, and although I spent some time working for Waterstone’s Online and know bits about the book trade, my knowledge is undoubtedly hopelessly out of date.

Actually, I had two very receptive crowds (thank you for coming if you were there!) and some really interesting debate around the whole notion of value, scarcity and network effects.

[slideshare id=1228656&doc=settingcontentfreeuksg2009final-090331123331-phpapp01]

Like any sector, publishers and librarians have their own language, their own agendas and their own histories of successes and failures. Also like any sector, they are often challenged to spend time thinking about the bigger picture. Day jobs are about rights and DRM, OPAC and tenure. They aren’t (usually) about user experience, big-picture strategy or considering and comparing approaches from other sectors.

What I wanted to do with the presentation was to look at some of the big challenges which face (commercial) material in the networked world by thinking a bit more holistically about people’s relationship with that content, and the modes of use that they apply to the stuff that they acquire via this networked environment.

The – granted, rather challenging – title of the presentation is actually a question cunningly disguised as a statement. Or maybe it’s a statement cunningly disguised as a question. I lost track. The thing I was trying to do with this questatement (and some people missed this, more fool me for being too subtle) was to say: “Look, here’s how many people are talking about content now: they’re making it free and available; they’re encouraging re-use; they’re providing free and open API’s. They’re understanding that users are fickle, content-hungry and often unfussy about the origin of that content. What, exactly, do we do in an environment like this? What are the strategies that might serve us best? Can we still sell stuff, and if so, how?”

The wider proposition (that content fares rather better when it is freed on the network than when it is tethered and locked down) is a source of fairly passionate debate. I've written extensively about Paulo Coelho's experiments in freeing his books, about API's, about "copywrong", about value, authority and authenticity. The suggestion that if you free it up you will see more cultural capital is starting to be established in museums and galleries. The suggestion that you might, just might, increase your financial capital by opening up is for the most part considered PREPOSTEROUS to publishers. Giving away PDF's increases book sales? Outrageous. Apart from the only example I've actually seen documented, of course, which is Coelho's, and that seems to indicate a completely different story.

There are fine – and all the finer the closer you examine them – levels of detail. Yes, an academic market is vastly different from a popular one: you don't have the scale of the crowd, the articles are used in different ways, the works are generally shorter, the audiences worlds apart. But nonetheless, Clay Shirky's robust (if deeply depressing) angle on the future – sorry, lack of future – of the newspaper industry needs close examination in any content-rich sector. I don't think anyone can deny the core proposition he holds up: that the problems that (newspaper) publishing solves (printing, marketing and distribution) are no longer problems in the networked age. I don't think that what he's saying is that we won't have newspapers in the future, and he's definitely not saying that we won't need journalists. What he is saying – and this was the angle I focused on in my slides – is that this change is akin to living through a revolution. And with this revolution needs to come revolutionary responses and an understanding that the change is far bigger and more profound than almost anyone can anticipate. The open API is one such response (The Guardian "Open Platform" being an apposite example). Free PDF's / paid books is another. Music streaming and the killing of DRM is another.

Revolutions are uncomfortable. The wholesale examination of an entire industry is horrifically uncomfortable. Just take a look at the music business and you’ll see a group of deeply unhappy executives sitting around the ashes of a big pile of CD’s as they mourn the good ‘ole times. But over there with music, new business models are also beginning to evolve and emerge from these ashes. Spotify is based on streaming, Last.fm is based on social, Seeqpod is a lightweight wrapper for Google searches, The Pirate Bay ignores everyone else and provides stuff for free.

Which ones are going to work? Which ones will make money? Which ones will work but displace the money-making somewhere else? The simple answer, of course, is that no-one really knows. Some models will thrive, others will fail. Some will pave a new direction for the industry, others we'll laugh at in five years' time.

So where can the answers be found? Predictably for me, I think all sectors (including academic publishing!) need to take a punt and do some lightweight experimentation. I think they need to be trying new models of access based around personalisation, attention data and identity. They need to examine who gets paid, how much and when. They need to be setting stuff free in an environment where they can measure – effectively – the impact of this freedom across a range of returns, from marketing to cultural to financial. If they do this then they’re at least going to have some solid intelligence to use when deciding which models to take ahead. And it may be that this particular industry isn’t as challenged as most people assume, and that the existing models can carry on – lock it down, slap on some DRM, charge for access. It’d be far less uncomfortable if this was the case. But at least that decision would be made with some solid knowledge backing it up.

Open Access is one clear way of forging this debate ahead. But once you get under the apparently simple hood of the OA proposition, it turns out not only that many institutions are simply ignoring guidelines to produce OA versions of published works, but that the payment models are complicated and based on a historical backdrop which to many seems inherently broken. I'd be interested to hear from someone with way more knowledge than me on the successes and failures of – or the market research done on – setting content free in this way.

It was clear to me in talking to a range of people at UKSG – librarians, publishers, content providers – that there are huge swathes of evidence missing; surprising, perhaps, from sectors which pride themselves on accuracy and academic rigour. When I asked "how many people aren't coming to your site because search engines can't see your content?" or "what is your e-commerce drop-out rate?" or "how much of your stuff do you estimate is illegally pirated?", very few had coherent (or even vague, or any!) answers.

More telling, perhaps, is that the informal straw-poll question I posed to various people during the conference – "Do you feel that this is a healthy industry?" – was almost always answered with a negative response. And when I asked why, the near-consistent reply was: "It's too complicated; too political; too entangled", or, from one person: "the internet has killed us".

I’m really not as naive as I sometimes appear 🙂 I know how terribly, terribly hard it is to unpick enormous, political and emotive histories. When I suggest that “we need to start again”, I’m obviously not suggesting that we can wipe the slate clean and redefine the entire value proposition across a multi-billion dollar, multi-faceted industry. But I think – simply – that awareness of the networked environment, a knowledge of how people really use the web today and an open mind that things might need to change in profound ways are very powerful starting points in what will clearly be an ongoing, fraught and fascinating discussion.

Creative Spaces – just…why?

There's been a fair bit of buzz around the NMOLP (National Museums Online Learning Project) – now apparently renamed "Creative Spaces" for launch.

I've known about this project for a long while – when I was at the Science Museum, very early discussions were taking place at the V&A about how to search and display collections results from more than one institution. The Science Museum were invited to take part in the project, but in the end didn't because of resourcing and budgetary issues.

My second touch on the project was from the agency end – the ITT briefly crossed my desk at my current employer, Eduserv. We considered bidding, but in the end decided that it wasn’t a project we could deliver satisfactorily given the particulars of the scope and budget.

Back then – and I think now, although someone from NMOLP will have to confirm – the project was divided into two main sections: a series of "webquests" (online learning experiences, essentially) and a cross-museum collections search. The webquests can be seen here, but I'm not going to consider them in this post because I haven't spent enough time playing with them to have an opinion yet.

The Creative Spaces site is at http://bm.nmolp.org/creativespaces/ – at first glance, it’s clean and nicely designed, with a bit of a web2.0 bevel thing going on. It’s certainly visually more pleasing than many museum web projects I’ve seen. The search is quick, and there’s at least a surface appearance of “real people” on the site. I hesitate to use the word “community” for reasons that I’ll highlight in a minute.

Design aside, I have some fairly big issues with the approach that is being taken here:

Firstly, this site, much like Europeana (which I’ll get my teeth into in a future post…) seemingly fails to grasp what it is about the web that makes people want to engage. I’m very surprised that we’re this many years into the social web and haven’t learnt about the basic building blocks for online communities, and are apparently unable to take a step back from our institutional viewpoint and think like a REAL user, not a museum one. Try looking at this site with a “normal person” hat on. Now ask yourself: “what do I want to DO here?” or “how can this benefit me?” or “how can I have fun”? Sure, you can create a “notebook” or a “group” (once you’ve logged in, obviously..). The “why” is unclear.

I'm also struck by how underwhelming the technology is. Take a look at www.ingenious.org.uk – a NOF-digitise project which I worked on maybe 5-6 years ago. Now, I'm not over-proud of this site – it took ages, nearly killed a few people from stress, and the end result could be better, but hey – it has cross-collections search, you can send an e-card, you can save things to your lightbox, you can create a web gallery. And this was more than five years ago. Even then, I was underwhelmed by what we managed to do. NMOLP doesn't seem to have pushed the boundary beyond this at all, and as museums I think we should always be looking to drive innovation forward.

Secondly, I’m not sure that there is a reason why. Why would I possibly want to create a profile? Where is my incentive? Here’s Wikipedia talking about the Network Effect:

"A more natural strategy is to build a system that has enough value without network effects, at least to early adopters. Then, as the number of users increases, the system becomes even more valuable and is able to attract a wider user base. Joshua Schachter has explained that he built Del.icio.us along these lines – he built an online system where he could keep bookmarks for himself, such that even if no other user joined, it would still be valuable to him."

The other day, I had a Twitter conversation with Giv Parvaneh, the Technical Manager at NMOLP, regarding this post, which talks about "monetizing" media. He blogged his response here. Now, we had a minor crossed-wires moment (it's hard to discuss in 140 characters) – but my point was not that museums should "monetize" everything (although I DO think that museums should learn about real business practices, but that's another post altogether). My point was that users need to feel special to take part. They need to be part of a tribe, a trusted group who can do and say things that they find personally useful. They need experiences with integrity. If you're not sure what I mean, just spend some time on the Brooklyn Museum collections pages. These guys get it – the "posse", the "tag game", the openness. Compare this to the rather shallow experience you get on NMOLP. Now ask yourself – "where would I spend MY time?".

The third reason is that, once again, we're failing to take our content to our users. This is a huge shortcoming of Europeana, too. People want experiences on their own terms, not on ours. Let's not have another collections portal. Spend your social media money adding and updating entries on Wikipedia, or create an object-sharing Facebook application. Or just put everything on Flickr. And, please, please create an API or at the very least an OpenSearch feed. If the issue is something around copyright – go back to your funders and content providers and sit them down in front of Google Images for an hour so they can begin to understand how the internet works, before renegotiating terms with them!

The final reason hangs off the search facility. My vested interest here is of course hoard.it – and if you want to hear our rantings about the money spent on big, bad technology projects, then keep an eye out for our Museums and the Web paper. We aren't necessarily suggesting that the hoard.it approach should be the technology behind cross-collections searching. But we are suggesting that the approach that NMOLP have taken is expensive, old, clunky and ultimately flawed. Although it is a trifle over-simplistic as a response: why not just spend £20-30k on a Google Search Appliance and simply spider the sites? Why reinvent the wheel and build search from scratch?

If I were less of a grumpy old man, I'd feel bad about being this negative – I like the people involved, I like the institutions, and I understand the reasons why (museum) projects spiral into directions you probably wouldn't ever choose. But then I remember that this site cost taxpayers just short of £2 million, and that Europeana will cost €120 million. And then I realise that we have an obligation to keep badgering, nagging and criticising until we start to get these things right.

At the end of the day, Frankie sums it all up much more succinctly in his email to the MCG list than I do in this post. He simply asks: why?

If you love something, set it free

Last week, I had the privilege of being asked to be one of the keynote speakers at a conference in Amsterdam called Kom je ook? – literally, "Are you coming too?". Billed as a "Heritage Upgrade", it describes itself as "a symposium for cultural heritage institutions, theatres and museums".

I was particularly excited about this one: firstly, my fellow keynoters were Nina Simon (Museum Two) and Shelley Bernstein (Community Manager at the Brooklyn Museum) – both very well known and very well respected museum and social web people. Secondly (if I'm allowed to generalise): "I like the Dutch" – I like their attitude to new media, to innovation and to culture in general. And thirdly – it looked like fun.

Nina talked about "The Participatory Museum" – in particular she focussed on an oft-forgotten point: the web isn't social technology per se; it is just a particularly good tool for making social technology happen. The fact that the online medium allows you to track, access, publish and distribute is a good reason for using the web, BUT the fact that this happens to populate one space shouldn't limit your thinking to that space, and shouldn't alter the fact that this is always, always about people and the ways in which they come together. The changing focus of the museum, moving from being a content provider to being a platform provider, also rang true with me in so many ways. Nina rounded off with "ten tips for social technology" (slide 12 onwards).

Shelley gave another excellent talk on the incredible work she is doing at the Brooklyn Museum. She and I shared a session on Web2 at Museums and the Web 2007, and once again it was the genuine enthusiasm and authenticity permeating everything she does that really came across. This isn't "web2 for web2's sake" – this is genuine, pithy, risky, real content from enthused audiences who really want to take part in the life of the museum.

My session was on setting your data and content free:

[slideshare id=768086&doc=mikeellisifyoulovesomethingsetitfreefinal-1227110930707512-9&w=425]

Hopefully the slides speak for themselves, but in a nutshell my argument is that although we've focussed heavily on the social aspects of Web2.0 from a user perspective, it is the stuff going on under the hood which really pushes the social web into new and exciting territory. It is the data sharing, the mashing, the API's and the feeds which are at the heart of this new generation of web tools. We can resist the notion of free data by pretending that people use the web (and our sites) in a linear, controlled way, but the reality is we have fickle and intelligent users who will get to our content any which way. Given this, we can either push back against freer content by pretending we can lock it down, or – as I advocate – do what we can to give users access to it.

Museum mash

Most of you will probably have seen this already, but I’m running another museum mashup day in Leicester on 18th June 2008, the day before the annual UK Museums on the Web conference. It’ll be a developer/techie focussed day working with various bits of museum data to see what we can mash together in just one day.

We’re having to be very strict with the number of people allowed and the list is starting to look pretty full…but if you’re up for it, visit www.mashedmuseum.org.uk to find out more, or jump straight to the “I’m interested” form.

On a related note, I knocked out a quick application (the “hoard.it collector”) over the weekend based on data from the hoard.it application which Dan and I unveiled last week. It’s a terribly quick (read: a bit flaky) prototype which lets you “collect” museum object records and then share them so others can see what you’ve saved. Think Listmania but for museum stuff. The main purpose? Not to build something that’s never been done before, but (as ever) to demonstrate that it is possible to do interesting stuff without funding, the usual museum treaclewading or worrying too much about perfection. I’ll hone and polish the idea as time goes on – any ideas, stick ’em in the comments…

Dan and I are planning for the hoard.it data to be made available for the Mashed Museum day, but if you fancy a play before that, or can’t come along, you can read the documentation, do some building, and then let me or Dan know so we can chuck something in the hoard.it application gallery.

hoard.it: bootstrapping the NAW

What seems like a looong time ago I came up with an idea for “bootstrapping” the Non API Web (NAW), particularly around extracting un-structured content from (museum) collections pages.

The idea of scraping pages when there's a lack of data access API isn't new: Dapper launched a couple of years ago with a model for mapping and extracting from 'ordinary' html into a more programmatically useful format like RSS, JSON or XML. Before that there were numerous projects that did the same (PiggyBank, Solvent, etc); Dapper has about the friendliest web2y interface so far, but it still fails IMHO in a number of ways.

Of course, there’s always the alternative approach, which Frankie Roberto outlined in his paper at Museums and the Web this year: don’t worry about the technology; instead approach the institution for data via an FOI request…

The original prototype I developed was based around a bookmarklet: the idea was that a user would navigate to an object page (although any templated "collection" or "catalogue" page is essentially treated the same). If they wanted to "collect" the object on that page they'd click the bookmarklet, a script would look for data "shapes" against a pre-defined store, and then extract the data. Here are some screen grabs of the process (click for bigger):

  1. An object page on the Science Museum website
  2. The user clicks on the bookmarklet and a popup tells them that this page has been "collected" before. Data is separated by the template and "structured"
  3. Here, the object hasn't been collected, but the tech spots that the template is the same, so knows how to deal with the "data shape"
  4. The hoard.it interface, showing how the fields are defined
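
If "data shapes" sounds abstract, here's a minimal sketch of what the idea boils down to: a little store with one entry per known page template, mapping field names to where they live in the markup, so that any page built from a recognised template can be turned into structured data. The URL fragment, selectors and field names below are all invented for illustration (and it assumes BeautifulSoup is installed):

```python
# A minimal sketch of the "data shape" idea: one entry per known page template,
# mapping field names to where they live in the markup. The URL fragment and
# selectors are invented for illustration.
from bs4 import BeautifulSoup

TEMPLATE_STORE = {
    # crude template key: any URL containing this fragment uses these rules
    "example-museum.org/objects/": {
        "title": "h1",
        "maker": "td.maker",          # assumed selector
        "date_made": "td.date-made",  # assumed selector
    },
}

def collect(url, html):
    """Return a structured record if the page matches a known template."""
    for fragment, shape in TEMPLATE_STORE.items():
        if fragment not in url:
            continue
        soup = BeautifulSoup(html, "html.parser")
        found = {field: soup.select_one(selector) for field, selector in shape.items()}
        return {field: el.get_text(strip=True) for field, el in found.items() if el}
    return None  # unknown template: nothing we can do (yet)
```

The real store is presumably richer than a hard-coded dictionary, but the principle – match the template, apply its shape, get structure out – is the same.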

I got talking to Dan Zambonini a while ago and showed him this first-pass prototype and he got excited about the potential straight away. Since then we’ve met a couple of times and exchanged ideas about what to do with the system, which we code-named “hoard.it”.

One of the ideas we kicked around early on was the concept of building web spidering into the system: instead of primarily having end-users as the "data triggers", it should – we reasoned – be reasonably straightforward to define templates and then send a spider off to do the scraping.

The hoard.it spider

Dan has taken that idea and run with it. He built a spider in PHP, gave it a set of rules for templates and link-navigation and set it going. A couple of days ago he sent me a link to the data he’s collected – at time of writing, over 44,000 museum objects from 7 museums.
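
The nuts and bolts of a spider like that aren't especially exotic – here's a toy version of the pattern to give a flavour (Dan's real spider is PHP and considerably smarter about politeness, queuing and de-duplication; the start URL and link rules below are invented):

```python
# A toy crawler showing the "rules for templates and link-navigation" idea:
# start somewhere, only follow links matching a rule, and flag pages that look
# like object records. Dan's real spider is PHP; URLs and rules here are made up.
from collections import deque
from urllib.parse import urljoin

import requests
from bs4 import BeautifulSoup

START_URL = "http://www.example-museum.org/collections/"  # hypothetical
FOLLOW_IF = "/collections/"   # stay within the collections section
OBJECT_IF = "/objects/"       # pages matching this look like object records
MAX_PAGES = 50                # keep the toy crawl small and polite

seen, queue = set(), deque([START_URL])
while queue and len(seen) < MAX_PAGES:
    url = queue.popleft()
    if url in seen:
        continue
    seen.add(url)
    try:
        html = requests.get(url, timeout=30).text
    except requests.RequestException:
        continue
    if OBJECT_IF in url:
        print("object page:", url)  # the real spider would extract the data shape here
    for anchor in BeautifulSoup(html, "html.parser").find_all("a", href=True):
        link = urljoin(url, anchor["href"])
        if FOLLOW_IF in link and link not in seen:
            queue.append(link)
```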

Dan has put together a REST-like querying method for getting at this data. Queries are passed in via URL and constructed in the form attribute/value – the query can be as long as you like, allowing fine-grained data access.

Data is returned as XML – there isn’t a schema right now, but that can follow in further prototypes. Dan has done quite a lot of munging to normalise dates and locations and then squeezed results into a simplified Dublin Core format.

Here’s an example query (click to see results – opens new window):

http://feeds.boxuk.com/museums/xmlfeed/location.made/Japan/

So this means "show me everything where location.made=Japan"

Getting more fine-grained:

http://feeds.boxuk.com/museums/xmlfeed/location.made/Japan/dc.subject/weapons,entertainment

Yes, you guessed it – this is “things where location.made=Japan and dc.subject=weapons or entertainment”

Dan has done some lovely first-pass displays of ways in which this data could be used:

Also, any query can be appended with “/format/html” to show a simple html rendering of the request:

http://feeds.boxuk.com/museums/xmlfeed/location.made/exeter/format/html
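
If you'd rather drive the feed from code than from a browser, the attribute/value URL pattern is easy to build up programmatically. Here's a rough sketch – note that I'm guessing at the element names in the returned XML (the data is squeezed into simplified Dublin Core, so inspect the actual response and adjust), and the feed itself is a prototype that may move or change:

```python
# Build an attribute/value query against the feed and parse the XML response.
# Element names below are guesses (the data is "simplified Dublin Core"), and
# the feed itself is a prototype that may move or change.
import urllib.request
import xml.etree.ElementTree as ET

BASE = "http://feeds.boxuk.com/museums/xmlfeed"

def query(pairs):
    """Turn attribute/value pairs into the /attribute/value/... URL form."""
    path = "/".join(f"{attribute}/{value}" for attribute, value in pairs.items())
    with urllib.request.urlopen(f"{BASE}/{path}") as response:
        return ET.parse(response).getroot()

root = query({"location.made": "Japan", "dc.subject": "weapons,entertainment"})
for item in root:                      # assuming one child element per object
    title = item.findtext("dc.title")  # assumed element name
    print(title or ET.tostring(item, encoding="unicode")[:80])
```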

What does this all mean?

The exposing of museum data in "machine-useful" form is a topic about which, you'll have noticed, I'm pretty passionate. It's a hard call, though – and one I'm working on with a number of other museum enthusiasts – to get museums to understand the value of exposing data in this way.

The hoard.it method is a lovely workaround for those who don’t have, can’t afford or don’t understand why machine-accessible object data is important. On the one hand, it’s a hack – screenscraping is by definition a “dirty” method for getting at data. We’d all much prefer it if there was a better way – preferably, that all museums everywhere did this anyway. But the reality is very, very different. Most museums are still in the land of the NAW. I should also add that some (including the initial 7 museums spidered for the purposes of this prototype) have some API’s that they haven’t exposed. Hoard.it can help those who have already done the work of digitising but haven’t exposed the data in a machine-readable format.

Now that we've got this kind of data returned, we can of course parse it and deliver back… pretty much anything, from mobile-formatted results to ecards to kiosks to… well, use your imagination…

What next?

I’m running another mashed museum day the day before the annual Museums Computer Group conference in Leicester, and this data will be made available to anyone who wants to use it to build applications, visualisations or whatever around museum objects. Dan has got a bunch of ideas about how to extend the application, as have I – but I guess the main thing is that now it’s exposed, you can get into it and start playing!

How can I find out more?

We're just in the process of putting together a simple series of wiki pages with some FAQ's. Please use those, or the comments on this post, to get in touch. We look forward to hearing from you!

Are synapses intelligent?

It's hard not to be fascinated by the emerging and developing conversations around museums and the Semantic Web. Museums, apart from anything else, have lots of stuff, and a constant problem finding ways of intelligently presenting and cross-linking that stuff. Search is ok if you know what you're looking for, but browsing as an alternative is usually a terribly pedestrian experience, failing to match the serendipity and excitement you get in a physical exhibition or gallery.

During the Museums and the Web conference, there was a tangible thread of conversation and thought around the API'd museum, better ways of doing search, and varied opinions about openness and commerce – but the endless tinnitus of the semantic web was never far away from people's consciousnesses.

As well as the ongoing conversation, there were some planned moments as well, among them a workshop run by Eric Miller (ex. W3C sem web guru), Ross Parry‘s presentation and discussion of the “Cultural Semantic Web” AHRC-funded think tank and the coolness of Open Calais being applied to museum collections data by Seb Chan at the Powerhouse (article on ReadWrite Web here – nice one Seb!).

During the week I also spent some time hanging out with George Oates and Aaron Straup Cope from Flickr, and it’s really from their experiences that some thoughts started to emerge which I’ve been massaging to the surface ever since.

Over a bunch of drinks, George told me a couple of fairly mind-blowing statistics about the quantity of data on Flickr: more than 2 billion images, which are being uploaded at a rate of more than 3 million a day…

What comes with these uploads is data – huge, vast, obscene quantities of data – tags, users, comments, links. And that vat of information has a value which is hugely amplified because of the sheer volume of stuff.

To take an example: at the individual tag level, the flaws of misspellings and inaccuracies are annoying and troublesome, but at a meta level these inaccuracies are ironed out; flattened by sheer mass: a kind of bell-curve peak of correctness. At the same time, inferences can be drawn from the connections and proximity of tags. If the word “cat” appears consistently – in millions and millions of data items – next to the word “kitten” then the system can start to make some assumptions about the related meaning of those words. Out of the apparent chaos of the folksonomy – the lack of formal vocabulary, the anti-taxonomy – comes a higher-level order. Seb put it the other way round by talking about the “shanty towns” of museum data: “examine order and you see chaos”.
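
As a toy illustration of how meaning can fall out of nothing more than volume and proximity, here's the kind of co-occurrence counting I'm talking about – with a handful of made-up tag lists standing in for Flickr's billions:

```python
# Toy illustration: infer that "cat" and "kitten" are related purely from how
# often they appear together. The tag lists are made up; the point is scale.
from collections import Counter
from itertools import combinations

photos = [
    ["cat", "kitten", "cute"],
    ["cat", "kitten", "sleeping"],
    ["cat", "garden"],
    ["dog", "puppy", "garden"],
    ["cat", "kitten", "whiskers"],
]

pair_counts = Counter()
for tags in photos:
    for a, b in combinations(sorted(set(tags)), 2):
        pair_counts[(a, b)] += 1

# The most frequent pairs are the system's best guess at "related" terms.
for (a, b), count in pair_counts.most_common(3):
    print(f"{a} <-> {b}: tagged together {count} times")
```

At five photos this proves nothing; at two billion images the noise washes out and the strong pairs – cat and kitten – rise to the top.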

The total “value” of the data, in other words, really is way, way greater than the sum of the parts.

This is massively, almost inconceivably powerful. I talked with Aaron about how this might one day be released as a Flickr API: a way of querying the "clusters" in order to get further meaning from phrases or words submitted. He remained understandably tight-lipped about the future of Flickr, but conceptually this is an important idea, and it leads the thinking in some interesting directions.

On the web, the ideas of the wisdom of crowds and massively distributed systems are hardly new. We really is better than me.

I got thinking about how this can all be applied to the Semantic Web. It increasingly strikes me that the distributed nature of the machine-processable, API-accessible web carries many similar hallmarks. Each of those distributed systems – the Yahoo! Content Analysis API, the Google postcode lookup, Open Calais – is essentially a dumb system. But hook them together, start to patch the entire thing into a distributed framework, and things take on an entirely different complexion.

I've supped many beers with many people over "The Semantic Web". Some have been hardcore RDF types – with whom I usually lose track at about paragraph three of our conversation, but stumble blindly on in true "just be confident, hopefully no-one will notice you don't know what you're talking about" style. Others have been more "like me" – in favour of the lightweight, top-down, "easy" approach. Many people I've talked to have simply not been given (or been able to give) any good examples of what or why – and the enduring (by now slightly stinky, embarrassing and altogether fishy) albatross around the neck of anything SW is that no-one seems to be doing it in ways that anyone ~even vaguely normal~ can understand.

Here’s what I’m starting to gnaw at: maybe it’s here. Maybe if it quacks like a duck, walks like a duck (as per the recent Becta report by Emma Tonkin at UKOLN) then it really is a duck. Maybe the machine-processable web that we see in mashups, API’s, RSS, microformats – the so-called “lightweight” stuff that I’m forever writing about – maybe that’s all we need. Like the widely accepted notion of scale and we-ness in the social and tagged web, perhaps these dumb synapses when put together are enough to give us the collective intelligence – the Semantic Web – that we have talked and written about for so long.

Here’s a wonderful quote from Emma’s paper to finish:

"By 'semantic', Berners-Lee means nothing more than 'machine processable'. The choice of nomenclature is a primary cause of confusion on both sides of the debate. It is unfortunate that the effort was not named 'the machine-processable web' instead."