The semantic web: good ideas poorly supported?

The semantic web and open linked data open up the vision of scientific data published in machine-readable form. But their adoption faces some challenges, many self-inflicted.

Last week I taught a short course on programming context-aware systems at a Swiss doctoral winter school. The idea was to follow the development process from the ideas of context, through modelling and representation, to reasoning and maintenance. Context has some properties that make it challenging for software development. The data available tends to be heterogeneous, dynamic and richly linked. All these properties can impede the use of normal approaches like object-oriented design, which tend to favour systems that can be specified in a static object model up-front. Rather than use an approach that’s highly structured from its inception, an alternative is to use an open, unstructured representation and then add structure later using auxiliary data.

This leads naturally into the areas of linked open data and the semantic web. The semantic web is a term coined by Tim Berners-Lee as a natural follow-on from his original web design. Web pages are great for browsing but don’t typically lend themselves to automated processing. You may be able to extract my phone number from my contact details page, for example, but that’s because you understand typography and abbreviation: there’s nothing that explicitly marks the number out from the other characters on the page.

Just as the web makes information readily accessible to people, the semantic web aims to make information equally accessible to machines. It does this by allowing web pages to be marked up using a format that’s more semantically rich than the usual HTML. This uses three additional technologies: the Resource Description Framework (RDF) to assert facts about objects, SPARQL to access the model using queries, and the Web Ontology Language (OWL) to describe the structure of a particular domain of discourse.
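
To make "asserting facts about objects" a little more concrete, here is a minimal sketch in plain Python of contact details held as subject-predicate-object triples. It uses no RDF library, and everything in it is an assumption made for illustration: the `example.org` URIs, the vocabulary terms, and the helper functions `objects` and `match` (the latter a crude stand-in for the kind of pattern matching a SPARQL query performs).

```python
# Hypothetical namespace prefixes, invented for this sketch.
EX = "http://example.org/people/"
VOCAB = "http://example.org/vocab#"

# Each fact is one (subject, predicate, object) triple.
triples = {
    (EX + "alice", VOCAB + "name", "Alice Example"),
    (EX + "alice", VOCAB + "phone", "+00 000 000 0000"),
    (EX + "alice", VOCAB + "email", "alice@example.org"),
    (EX + "alice", VOCAB + "worksFor", EX + "example-university"),
    (EX + "example-university", VOCAB + "name", "Example University"),
}


def objects(graph, subject, predicate):
    """Return every object asserted for the given subject and predicate."""
    return sorted(o for (s, p, o) in graph if s == subject and p == predicate)


def match(graph, s=None, p=None, o=None):
    """Return the triples matching a pattern, where None is a wildcard."""
    return sorted(
        t for t in graph
        if (s is None or t[0] == s)
        and (p is None or t[1] == p)
        and (o is None or t[2] == o)
    )


# The phone number is now explicitly marked out, not inferred from typography.
print(objects(triples, EX + "alice", VOCAB + "phone"))

# Everything asserted about one subject.
for triple in match(triples, s=EX + "alice"):
    print(triple)
```

The point of the sketch is only that the phone number becomes an explicitly labelled fact that a program can look up, rather than a run of digits a human has to recognise.
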
Using the contact-details example above, RDF would mark up the phone number, email address etc. explicitly, using terminology described in OWL to let a computer understand the relationships between, for example, a name, an email address, an employing institution and so on. Effectively the page, as well as conveying content for human consumption, can carry content marked up semantically for machines to use autonomously. And of course you can also create pages that are purely for machine-to-machine interaction, essentially treating the web as a storage and transfer mechanism with RDF and OWL as semantically enriched transport formats.

So far so good. But RDF, SPARQL and OWL are far from universally accepted “in the trade”, for a number of quite good reasons. The first is verbosity: RDF is usually encoded in XML, which is a verbose, textual format. Second is complexity: RDF makes extensive use of XML namespaces, which add structure and prevent misinterpretation but make pages harder to create and parse. Third is the exchange overhead, whereby data has to be converted from the in-memory form that programs work with into RDF for exchange and then back again at the other end, each step adding more complexity and risk of error. Fourth is the unfamiliarity of many of the concepts, such as the dynamic non-orthogonal classification used in OWL rather than the static class hierarchies of common object-oriented approaches. Fifth is the disconnect between program data and model, with SPARQL sitting off to one side like SQL. Finally there is the need for all these technologies en masse (in addition to understanding HTTP, XML and XML Schemata) to perform even quite simple tasks, leading to a steep learning curve and a high degree of commitment in a project ahead of any obvious returns.

So the decision to use the semantic web isn’t without pain, and one needs to place sufficient value on its advantages (open, standards-based representation, easy exchange and integration) to make it worthwhile.
It’s undoubtedly attractive to be able to define a structure for knowledge that exactly matches a chosen sub-domain, to describe the richness of this structure, and to have it compose more or less cleanly with other such descriptions of complementary sub-domains defined independently, and to be able to exchange all this knowledge with anyone on the web. But this flexibility comes with a cost and (often) no obvious immediate, high-value benefits.

Having taught this stuff, I think the essential problem is one of tooling and integration, not core behaviour. The semantic web does include some really valuable concepts, but their realisation is currently poor and this poses a hazard to their adoption. In many ways the use of XML is a red herring: no sane person holds data to be used programmatically as XML. It is, and was always intended to be, an exchange format, not a data structure. So the focus needs to be on the data model underlying RDF (subject-predicate-object triples with subjects and predicates represented using URIs) rather than on the use of XML.

There are standard libraries and tools for use with the semantic web (in Java these include Jena for representing models, Pellet and other reasoners providing ontological reasoning, and Protégé for ontology development), but their level of abstraction and integration with the rest of the language remain quite shallow. It is hard to ensure the validity of an RDF graph against an ontology, for example, and even harder to validate updates. The type systems also don’t match, either statically or dynamically: OWL performs classification based on attributes rather than by defining hard classes, and the classification may change unpredictably as attributes are changed.
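
A toy sketch may help show why attribute-based classification sits uneasily with conventional type systems. This is plain Python and not OWL: the class names and membership tests are hypothetical, and it makes a closed-world simplification (a missing attribute counts as absent), whereas real OWL reasoners work under an open-world assumption and are considerably subtler.

```python
# Hypothetical class definitions: each class is a membership test over an
# individual's attributes, loosely mimicking attribute-based classification.
CLASS_DEFINITIONS = {
    "Person": lambda attrs: "name" in attrs,
    "Employee": lambda attrs: "name" in attrs and "worksFor" in attrs,
    "Contactable": lambda attrs: "email" in attrs or "phone" in attrs,
}


def classify(attributes):
    """Return the set of classes whose membership test the attributes satisfy."""
    return {name for name, test in CLASS_DEFINITIONS.items() if test(attributes)}


individual = {"name": "Alice Example", "worksFor": "Example University"}
before = classify(individual)
print(sorted(before))  # ['Employee', 'Person']

# An ordinary-looking data update...
del individual["worksFor"]

# ...changes which classes the individual belongs to.
after = classify(individual)
print(sorted(after))  # ['Person']
```

Any code that had been handed `individual` on the understanding that it was an `Employee` is now operating on something that no longer is one, and nothing in the host language notices.
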
(This isn’t just a problem for statically-typed programming languages, incidentally: having the objects you’re working with reclassified can invalidate the operations you’re performing at a semantic level, regardless of whether the type system complains.) The separation of querying and reasoning from representation is awkward, rather like the use of SQL embedded into programs: the query doesn’t fit naturally into the host language, which typically has no syntactic support for constructing queries.

Perhaps the solution is to step back and ask: what problem does the semantic web solve? In essence it addresses the open and scalable mark-up of data across the web according to semantically meaningful schemata. But programming languages don’t do this: they’re about nailing down data structures, performing local operations efficiently, and letting developers share code and functionality. So there’s a mismatch between the goals of the two system components, and their strengths don’t complement each other in the way one might like. This suggests that we revisit the integration of RDF, OWL and SPARQL into programming languages; or, alternatively, that we look at what features would provide the best computational capabilities alongside these technologies. A few characteristics spring to mind:

Use classification throughout, a more dynamic type structure than classical type systems
Access RDF data “native”, using binding alongside querying
Adopt the XML Schemata types “native” as well
Make code polymorphic in the manner of OWL ontologies, so that code can be exchanged and re-used. This implies basing typing on reasoning rather than being purely structural
Hide namespaces, URIs and the other elements of RDF and OWL behind more familiar (and less intrusive) syntax (while keeping the semantics)
Allow programmatic data structures, suited to local use in a program, to be layered onto the data graph without forcing the graph itself into convoluted structures
Think about the global, non-local data structuring issues
Make access to web data intrinsic, not something that’s done outside the normal flow of control and syntax
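
As a rough illustration of the "binding" and "hiding URIs" items in this list, the sketch below wraps a set of triples so that predicates read like ordinary attributes. It is plain Python and purely hypothetical: the `Resource` class, the `example.org` URIs and the vocabulary terms are invented here, and it sidesteps reasoning, updates and everything else that makes the real problem hard.

```python
class Resource:
    """Hypothetical wrapper exposing one subject's predicates as attributes."""

    def __init__(self, graph, uri, prefix):
        self._graph = graph    # a set of (subject, predicate, object) triples
        self._uri = uri        # the subject this wrapper stands for
        self._prefix = prefix  # namespace prepended to attribute names

    def __getattr__(self, name):
        # Only called when normal attribute lookup fails, so the private
        # fields set in __init__ never reach this method.
        if name.startswith("_"):
            raise AttributeError(name)
        predicate = self._prefix + name
        values = sorted(
            o for (s, p, o) in self._graph
            if s == self._uri and p == predicate
        )
        if not values:
            raise AttributeError(name)
        # A single value is returned bare; several are returned as a list.
        return values[0] if len(values) == 1 else values


VOCAB = "http://example.org/vocab#"
ALICE = "http://example.org/people/alice"

graph = {
    (ALICE, VOCAB + "name", "Alice Example"),
    (ALICE, VOCAB + "email", "alice@example.org"),
    (ALICE, VOCAB + "email", "alice@example.com"),
}

alice = Resource(graph, ALICE, VOCAB)
print(alice.name)   # the full predicate URI never appears at the call site
print(alice.email)  # a multi-valued predicate comes back as a list
```

The graph itself is untouched: the wrapper is a local, program-friendly view layered over it, which is the spirit of several of the characteristics above.
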

The challenges here are quite profound, not least from relatively pedestrian matters like concurrency control, but at least we would then be able to leverage the investment in data mark-up and exchange to obtain some of the benefits the semantic web clearly offers.