Studentships available

We have around six fully-funded PhD studentships available across computer science.

The School of Computer Science at the University of St Andrews has funding for students to undertake PhD research in any of the School’s research areas (http://www.cs.st-andrews.ac.uk/research). We are looking for highly-motivated research students with an interest in these exciting areas. Our only requirements are that the proposed research would be good, that we have staff to supervise it, and that you would be good at doing it.

We have up to six funded studentships available for students interested in working towards a PhD. The studentships cover fees and provide an annual tax-free maintenance stipend of about £13,590 per year for 3.5 years. Exceptionally well-qualified and able students may be awarded an enhanced stipend of an additional £2,000 per year. Students should normally have, or expect to obtain, at least an upper-2nd class Honours degree or a Masters degree in Computer Science or a related discipline.

For further information on how to apply, see our postgraduate web pages. The closing date for applications is March 1st 2012 and we will make decisions on studentship allocation by May 1st 2012. (Applications after March 1st may be considered, at our discretion.) Informal enquiries can be directed to pg-admin-cs@st-andrews.ac.uk or to potential supervisors.

The semantic web: good ideas poorly supported?

The semantic web and linked open data open up the vision of scientific data published in machine-readable form. But their adoption faces some challenges, many self-inflicted.

Last week I taught a short course on programming context-aware systems at a Swiss doctoral winter school. The idea was to follow the development process from the ideas of context, through modelling and representation, to reasoning and maintenance. Context has some properties that make it challenging for software development. The data available tends to be heterogeneous, dynamic and richly linked. All these properties can impede the use of normal approaches like object-oriented design, which tend to favour systems that can be specified in a static object model up-front. Rather than use an approach that’s highly structured from its inception, an alternative approach is to use an open, unstructured representation and then add structure later using auxiliary data. This leads naturally into the areas of linked open data and the semantic web.

The semantic web is a term coined by Tim Berners-Lee as a natural follow-on from his original web design. Web pages are great for browsing but don’t typically lend themselves to automated processing. You may be able to extract my phone number from my contact details page, for example, but that’s because you understand typography and abbreviation: there’s nothing that explicitly marks the number out from the other characters on the page. Just as the web makes information readily accessible to people, the semantic web aims to make information equally accessible to machines. It does this by allowing web pages to be marked up in a format that’s more semantically rich than the usual HTML, relying on three additional technologies: the Resource Description Framework (RDF) to assert facts about objects, SPARQL to access the model using queries, and the Web Ontology Language (OWL) to describe the structure of a particular domain of discourse. Using the example above, RDF would mark up the phone number, email address and so on explicitly, using terminology described in OWL to let a computer understand the relationships between, for example, a name, an email address, an employing institution and so on. Effectively the page, as well as conveying content for human consumption, can carry content marked-up semantically for machines to use autonomously. And of course you can also create pages that are purely for machine-to-machine interaction, essentially treating the web as a storage and transfer mechanism with RDF and OWL as semantically enriched transport formats.

So far so good. But RDF, SPARQL and OWL are far from universally accepted “in the trade”, for a number of quite good reasons. The first is verbosity: RDF uses XML as an encoding, which is quite a verbose, textual format. Second is complexity: RDF makes extensive use of XML namespaces, which add structure and prevent misinterpretation but make pages harder to create and parse. Third is the exchange overhead, whereby data has to be converted from the in-memory form that programs work with into RDF for exchange and then back again at the other end, each step adding more complexity and risk of error. Fourth is the unfamiliarity of many of the concepts, such as the dynamic, non-orthogonal classification used in OWL rather than the static class hierarchies of common object-oriented approaches. Fifth is the disconnect between program data and model, with SPARQL sitting off to one side like SQL.
Finally, there is the need for all these technologies en masse (in addition to understanding HTTP, XML and XML Schemata) to perform even quite simple tasks, leading to a steep learning curve and a high degree of commitment in a project ahead of any obvious returns. So the decision to use the semantic web isn’t without pain, and one needs to place sufficient value on its advantages — open, standards-based representation, easy exchange and integration — to make it worthwhile. It’s undoubtedly attractive to be able to define a structure for knowledge that exactly matches a chosen sub-domain, to describe the richness of this structure, and to have it compose more or less cleanly with other such descriptions of complementary sub-domains defined independently — and to be able to exchange all this knowledge with anyone on the web. But this flexibility comes with a cost and (often) no obvious immediate, high-value benefits.

Having taught this stuff, I think the essential problem is one of tooling and integration, not core behaviour. The semantic web does include some really valuable concepts, but their realisation is currently poor, and this poses a hazard to their adoption. In many ways the use of XML is a red herring: no sane person holds data to be used programmatically as XML. It is — and was always intended to be — an exchange format, not a data structure. So the focus needs to be on the data model underlying RDF (subject-predicate-object triples, with subjects and predicates represented using URIs) rather than on the use of XML.

While there are standard libraries and tools for use with the semantic web — in Java these include Jena for representing models, Pellet and other reasoners for ontological reasoning, and Protégé for ontology development — their level of abstraction and integration with the rest of the language remain quite shallow. It is hard to ensure the validity of an RDF graph against an ontology, for example, and even harder to validate updates. The type systems also don’t match, either statically or dynamically: OWL performs classification based on attributes rather than by defining hard classes, and the classification may change unpredictably as attributes are changed. (This isn’t just a problem for statically-typed programming languages, incidentally: having the objects you’re working with re-classified can invalidate the operations you’re performing at a semantic level, regardless of whether the type system complains.) The separation of querying and reasoning from representation is awkward, rather like the use of SQL embedded into programs: the query doesn’t fit naturally into the host language, which typically has no syntactic support for constructing queries (there’s a concrete sketch of what this looks like in code at the end of this post).

Perhaps the solution is to step back and ask: what problem does the semantic web solve? In essence it addresses the open and scalable mark-up of data across the web according to semantically meaningful schemata. But programming languages don’t do this: they’re about nailing down data structures, performing local operations efficiently, and letting developers share code and functionality. So there’s a mis-match between the goals of the two system components, and their strengths don’t complement each other in the way one might like. This suggests that we re-visit the integration of RDF, OWL and SPARQL into programming languages; or, alternatively, that we look at what features would provide the best computational capabilities alongside these technologies. A few characteristics spring to mind:

  • Use classification throughout, a more dynamic type structure than classical type systems
  • Access RDF data natively, using binding alongside querying
  • Adopt the XML Schemata types natively as well
  • Make code polymorphic in the manner of OWL ontologies, so that code can be exchanged and re-used. This implies basing typing on reasoning rather than purely on structure
  • Hide namespaces, URIs and the other elements of RDF and OWL behind more familiar (and less intrusive) syntax (while keeping the semantics)
  • Allow programmatic data structures, suited to local use in a program, to be layered onto the data graph without forcing the graph itself into convoluted structures
  • Think about the global, non-local data structuring issues
  • Make access to web data intrinsic, not something that’s done outside the normal flow of control and syntax
The challenges here are quite profound, not least from relatively pedestrian matters like concurrency control, but at least we would then be able to leverage the investment in data mark-up and exchange to obtain some of the benefits the semantic web clearly offers.
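To make the integration complaints concrete, here is a minimal sketch of the current state of play using Apache Jena, the Java toolkit mentioned above. It is only illustrative: the package names are those of a recent Jena release, and the contact vocabulary, URIs and values are made up for the example. It builds a tiny contact-details graph as subject-predicate-object triples, serialises it as RDF/XML to show the verbosity, and then pulls the phone number back out with a SPARQL query embedded as a plain string, exactly the query/host-language disconnect described above.

    import org.apache.jena.query.QueryExecution;
    import org.apache.jena.query.QueryExecutionFactory;
    import org.apache.jena.query.QuerySolution;
    import org.apache.jena.query.ResultSet;
    import org.apache.jena.rdf.model.Model;
    import org.apache.jena.rdf.model.ModelFactory;
    import org.apache.jena.rdf.model.Property;

    public class ContactSketch {
        public static void main(String[] args) {
            // A made-up namespace for the example vocabulary.
            String ns = "http://example.org/contacts#";

            // Assert some facts as subject-predicate-object triples.
            Model model = ModelFactory.createDefaultModel();
            Property name  = model.createProperty(ns, "name");
            Property phone = model.createProperty(ns, "phone");
            Property email = model.createProperty(ns, "email");

            model.createResource(ns + "sd")
                 .addProperty(name,  "Simon Dobson")
                 .addProperty(phone, "+44 1334 000000")         // placeholder value
                 .addProperty(email, "mailto:sd@example.org");  // placeholder value

            // Serialise the graph: the RDF/XML form is strikingly verbose
            // for the three facts it encodes.
            model.write(System.out, "RDF/XML");

            // Query the model. The SPARQL lives in an ordinary string: the host
            // language offers no syntactic or type-level help in building it.
            String sparql =
                "PREFIX c: <" + ns + "> " +
                "SELECT ?phone WHERE { ?person c:name \"Simon Dobson\" ; c:phone ?phone }";
            QueryExecution qe = QueryExecutionFactory.create(sparql, model);
            try {
                ResultSet results = qe.execSelect();
                while (results.hasNext()) {
                    QuerySolution row = results.next();
                    System.out.println("phone: " + row.getLiteral("phone"));
                }
            } finally {
                qe.close();
            }
        }
    }

Even for a toy example the ceremony is considerable relative to the three facts involved, and binding the query results back into typed Java objects is left entirely to the programmer: that, in miniature, is the tooling and integration problem discussed above.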

Ebook-ery

Ebooks (and more particularly e-journals) seem to need plenty of different affordances to be really useful as replacements for print versions.

I’ve been getting a lot of reading done in the west of Ireland, some of which has been work-related and some of which has included looking through electronic journals (and art e-journals, in Linda’s case). It really highlights the benefits and drawbacks of electronic publishing. First the benefits:

  • Searchability. Being able to search both an individual issue and all a journal’s issues is fantastic both when looking for broad background and when trying to find detail.
  • Access. Saving a PDF locally (or, better, in Evernote) reduces repeated searching, allows tagging, and facilitates off-line reading. No matter what people would have us believe, there are times when we’re off-line either by accident (on “soft” days my mi-fi modem stops working) or design (driving, resting, on the train). And this in turn means I can carry a vast range of material with me wherever I go: far more than in print form.
  • Availability. Pretty much anything available anywhere in the scientific literature is directly accessible from my study in the wilds of Sligo.
All well-known benefits, of course. The disadvantages only really become apparent in special circumstances:
  • Limited browsing. Searching is great when you can frame the search terms: otherwise it’s a positive impediment.
  • Digital restrictions. Some publishers go through hoops to make sure you can’t disseminate their material — embedding a Flash-based reader into a web page to prevent local storage is my current bugbear — and so obviate the benefits of access and tagging.
There’s no sense of these deficiencies outweighing the benefits of searching, access and availability, of course, but it does suggest that e-journals especially need multiple affordances in their interfaces. The problem does seem to be unique to journals, too: books intended to be read in toto work very well in the main when converted to ebooks, but collections intended to be dipped into convert less well.

Perhaps the problem is the need to be overly precise. Searching works well enough, but only if done at the appropriate granularity, only if the content is tagged according to the way you think about it, and in fact only if it matches the way you think about it at the moment. This last problem, conceptual drift, makes long-term tagging painfully fragile: older content relating to slim touch-screen computers will not be tagged ipad or tablet computer, for example, and so will be essentially invisible to a modern searcher.

Serendipitous finding of material seems to work better in traditional linear form, where one can browse and dip in as required. Zinio allows this kind of interaction on tablets — I’ve started taking The Economist this way rather than in print. Of course you’re then trapped by digital restrictions, being forced to read within the confines of a particular application that doesn’t support user-provided hyperlinking, tagging, annotation or any of the other good things that come with ebooks. I suppose all this is an argument for open ebook and e-journal publishing, using open standards that can be linked to, tagged and manipulated within different applications, and that comes in both “collected” (browsable) and “individual” (searchable) forms to facilitate the different modes of access that seem to be suggested by different user tasks.

The mirror

The internet is just a mirror, and an amplifier.

It’s a commonplace to observe that the internet is changing the nature of discourse. Many of the concerns raised in such discussions are, in my opinion, severely over-blown, but it’s certainly the case that we can observe changes in behaviour in the real world that can be linked back to discussions in the digital world. The mechanisms of this are somewhat less discussed, but they seem to arise from the very nature of the medium and so will be hard to change quickly. The future-watcher Esther Dyson once made the following observation:

      The Internet is like alcohol in some sense. It accentuates what you would do anyway. If you want to be a loner, you can be more alone. If you want to connect, it makes it easier to connect.
Perhaps we can generalise this slightly, and say instead that the internet is both a mirror and an amplifier: it identifies distinctions between people, and then reinforces and amplifies them. How does this happen?

It begins from the observation that every niche community is globally large. No matter how specialised one’s interests, they will be shared by others elsewhere in the world. This has been an enormous boon to many people, most notably those who have (or who have relatives who have) rare illnesses: no matter how rare a condition is in the local gene pool, globally there will be a substantial population in a similar position who can provide understanding, advice and support. Often these will include individuals with access to and understanding of the latest research and treatments, often the scientists and doctors themselves: expertise that can be world-beating, and may completely overshadow what’s available locally. What makes such communities possible is the low cost of setting up and maintaining a web site, and the power of search to allow such sites to be found by anyone sufficiently interested in them to spend some time crafting the right search terms. That is, the internet provides for the construction of specialised, distributed, communicating communities such as have never existed before: it’s hard to conceive of doing something similar through the postal service, or even the telephone.

So the internet mirrors the human condition, from porn to poetry, and also democratises it. If you want to write poetry, you can hang out in communities that resemble the Bloomsbury Group in terms of their intellectual depth — and you can do so from wherever you live, limited only by your ability to contribute. You can get feedback on your work or comments from others, perhaps more experienced than yourself, and so improve your own understanding and ability in your chosen field. The same thing has happened since the start of the internet: Conscience of a hacker, a poem in praise of meritocratic technical communities, still resonates with many computer people (including myself).

There is a side-effect of this specialisation, however, that’s less positive. By forming a dedicated niche exactly suited to the needs of a particular community, the internet often removes those people from the general forums of discussion. Why engage with a general community when one can be with people exactly in tune with what you want? The answer, sometimes, is that what you want and what you need may be different, and that groupthink is far easier and more prevalent in smaller groups than in larger ones.

This is where amplification kicks in. If everyone in a community is kind of the same as yourself, it’s harder to stand out — and many people love to stand out. You can do so by being more helpful than anyone else, or by being more perceptive and insightful — or by being more extreme. The very specialisation of internet communities, combined with the ease of communication, the partial or complete anonymity, and the desire to stand out, can be very disinhibiting, and can amplify discussion quite quickly to the extremities of what that group considers acceptable. If you hang out in some subcultural chat rooms, what’s acceptable is very elastic and discussion will quickly head to the extremes: more extreme than most people can easily imagine, in many cases.
So amplification seems to be a direct result of specialisation — one of the things that makes internet communities so powerful in the first place — and can operate both positively and negatively, making the good better and the bad worse.

These effects are also becoming more widespread than discussion groups, through the phenomenon of “search bubbles”. Google and other search engines now perform substantial pre-processing on search results before presenting them, including, for example, factors of localisation, user preferences and history. This means that the results you see in response to a search aren’t the whole story: you’re seeing what Google thinks will be of most relevance and interest to you. This is simply another form of specialisation: more discreet than what happens with discussion forums, but still separating out a sub-set of the whole internet’s information that’s closely targeted to you. This certainly suggests that it might be subject to the same amplification effects, as people perceive (wrongly) that a large amount of content on the internet agrees with them and fits their pre-modelled interests and preferences — and prejudices.

It’s hard to know what to suggest to deal with these issues. Actually I’m inclined to say that nothing should be done, other than to be aware that the mirror magnifies, as it were. The good of specialist communities almost certainly outweighs the disadvantages of these (or other) communities pushing extremism, but it’d be good if those engaged in such discussions kept the amplifying effects of internet discussions in the backs of their minds for when things start to get weird.

British Computer Society Distinguished Dissertations competition

Submissions are now open for the BCS Distinguished Dissertations competition for recently-submitted PhDs in the UK.

The 2011 / 2012 CPHC/BCS Distinguished Dissertations competition is now open for submissions via the web site http://www.easychair.org/conferences/?conf=disdis12. The closing date is Monday 2nd April 2012. Further details can be found below and on the webpage http://www.bcs.org/server.php?show=nav.5820.

The Conference of Professors and Heads of Computing (CPHC), in conjunction with BCS, The Chartered Institute for IT, annually selects for publication the best British PhD/DPhil dissertations in computer science. The scheme aims to make more visible the significant contribution made by Britain — in particular by post-graduate students — to computer science. Publication also serves to provide a model for future students.

The selection panel on behalf of BCS/CPHC consists of experienced computer scientists, not more than one from any institution, each normally serving on the panel for three years. The 2012 panel members are: Simon Thompson (Kent, Chair), Teresa Attwood (Manchester), Russell Beale (Birmingham), Simon Dobson (St Andrews), Zoubin Ghahramani (Cambridge), Joemon Jose (Glasgow), Daniel Kroening (Oxford), Ralph Martin (Cardiff), Alexander Romanovsky (Newcastle) and Jon Whittle (Lancaster).

Any dissertation is eligible which is submitted for a doctorate in the British Isles in what is commonly understood as Computer Science. (Theses which are basically in some other discipline but which make use, even very extensive use, of computing will not be regarded as eligible.) However, there is a limit of THREE dissertations per year per university, and one per research group within any university. To be considered, a dissertation should:

  • make a noteworthy contribution to the subject;
  • reach a high standard of exposition;
  • place its results clearly in the context of computer science as a whole; and
  • enable a computer scientist with significantly different interests to grasp its essentials.
It is reasonable to submit a thesis to the scheme if it has all of the above qualities in good measure, and if it is comparable in standard with the top 10% of dissertations in the subject. Long dissertations are not encouraged; if the main text is more than 80,000 words, there should be good justification.

The dissertation should be submitted electronically (as a PDF file) by the author’s examiners, or by the Head of Department with the examiners’ advice. The submitted version of the dissertation must be the final version after any required corrections have been made. The competition period for 2012 is for theses accepted from 1st January 2011 until the closing date of 2nd April 2012. A dissertation cannot be submitted to the competition more than once. The dissertation should be accompanied by a written nomination comprising the following information:
  • a justification, of about 300 words, by one of the examiners — preferably the external — explaining the dissertation’s claim to distinction (against the criteria listed above);
  • the name of the primary supervisor and the research group within the university to which the student was primarily affiliated;
  • an assurance that within the competition period the examiners have recommended to the author’s institution that the doctorate should be awarded; and
  • the names and contact details of three suggested reviewers who are not in the same Department as the nominated thesis and who are independent of the supervision and examining of the thesis. The nominated reviewers must have confirmed that they are willing to provide a review.
An indication should be given if the dissertation is being considered for publication elsewhere. In addition the author’s written agreement that their thesis may be considered for the Distinguished Dissertation competition should be emailed by the author to disdis12@easychair.org.

Submissions should be made on-line via http://www.easychair.org/conferences/?conf=disdis12. The first author name submitted should be that of the thesis author; the individual submitting the nomination should list themselves as the second author. The title and abstract should be those of the thesis being nominated. The first file uploaded should be the 300-word nomination; the thesis document should be uploaded as an attachment. If any problems are experienced, or you have any questions, please email disdis12@easychair.org for assistance. The deadline for submission is 2nd April 2012.