We have five more PhD scholarships tenable from September, for students interested in any of our areas of research. The details are available here. Please note that I'm not taking on a PhD student this year, so don't bother applying to work with me.
Beware of books bearing illnesses.
I recently read Hindenburg: the wooden titan, John Wheeler-Bennett's biography of the presidency of the man who ushered Hitler into power. It's an interesting book, written just after Hindenburg's death and so before the second world war and the realisation of exactly what had been brought forth. You can also find contemporary book reviews online which are likewise fascinating in what they don't know. It really brings home AJP Taylor's comment (in The struggle for mastery in Europe) that the hardest thing about the study of history is to remember that events now long in the past once lay in the future.
What was unexpectedly fascinating was the bookplate in the front of the volume I borrowed from the university library. The book was published in 1937 (I think it's a first edition). The top half of the plate is fairly standard, but the bottom part describes something you wouldn't expect to find in a library book:
"A person shall not return any public library book which he knows to have been exposed to infection from a notifiable disease ... [including] smallpox, cholera, diphtheria, ..."
And so on. The list of diseases includes things against which we are now routinely vaccinated (diphtheria, tuberculosis), those we haven't seen in the UK in my lifetime (typhus), and those most people would now brush off with a course of antibiotics (flu, pneumonia). Except, of course, that at the time there was far less vaccination and no antibiotics at all, since mass production didn't start until the 1940s.
So as well as being an observation of a period in history made at a point in time from which its significance was only poorly understood, this book is a time capsule from a period when diseases were a far bigger and broader threat than they are now.
Do we now have some post-modern programming languages?
Programming languages change all the time. It's never been the case that one could learn just one language and then use it for everything: programming is too rich and complicated for that to be a solution for the majority of programmers. However, we seem to be seeing a change in what we think of as programming languages and how they're used to build larger software systems.
At the risk of gross over-simplification (and ridicule), let's divide languages into various "eras". In the beginning, the early languages were built around some explicit notion of the machine on which they were running. Assemblers were devoted to a single machine code. Higher-level languages like Forth abstracted away from the machine code but still exposed the guts of the machine, and in particular its data bus width and memory access mechanisms.
The next step in abstraction -- to pre-modernism -- was to provide a machine that was still recognisably a Von Neumann system but that hid the details of the underlying machine code and architecture. C and Pascal fall into this category: their operations and abstractions are still largely those of the physical machine, but at a sufficient remove to achieve portability and a usable programming model. But we still have a view of the machine's underlying limitations. Modern languages like Perl, Java and Haskell offer a model of computation, and especially a model of memory, that's considerably removed from the underlying machine and enormously simplifies the programmer's task, at least for "suitable" programs.
(Interestingly enough, languages don't fall into eras in the sequence of their design or popularity. Lisp considerably pre-dates C but is definitely modern in the terminology above.)
What's perhaps not obvious is that all these languages share a common design philosophy. They are all based on a small set of primitive notions together with a few ways of combining them, yielding compound abstractions that resemble their primitive underpinnings. In C we get a small number of base types, some compound structs and unions, and pointers, and most interesting compound structures one builds will contain pointers all over the place. In Java we can build classes that create and combine objects; in Haskell we get lists and higher-order functions as the basic building blocks. The point is that the basic concepts are general, as are the composition operators, and the larger capabilities are in some sense a consequence of these initial choices, which characterise the programs it's easy (or hard) to write.
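To make that concrete, here's a minimal and purely illustrative Java sketch (the class and field names are invented for the example): one small class, composed with itself through object references, from which larger capabilities such as traversal simply fall out of the initial choices.

```java
// A tiny compound structure built only from a class and object references,
// in the same spirit as C's structs-and-pointers or Haskell's lists.
public class Tree<T> {
    private final T value;
    private final Tree<T> left, right;   // composition via references; null marks a missing child

    public Tree(T value, Tree<T> left, Tree<T> right) {
        this.value = value;
        this.left = left;
        this.right = right;
    }

    // Larger capabilities are consequences of the primitives: a recursive traversal
    public int size() {
        return 1
             + (left == null ? 0 : left.size())
             + (right == null ? 0 : right.size());
    }
}
```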
This philosophy makes perfect sense, of course, which explains why it's so widespread. Programmers have to get to grips with a small set of basic concepts and composition methods. Everything else flows from this, and the extensions to the language -- libraries and the like -- will be the same as code one could write oneself: they're time-savers, but not fundamentally different from your own code. As a consequence the basic concepts have to be well-chosen to support the range of programs people will want to write. Some will be easier to write than others -- that's what makes the choice of language more than simply a matter of fashion -- but there's an acceptance that pretty much anything is writeable.
However, we're now seeing languages emerge that aren't like this. Instead they're very much targeted at specific domains and problem sets. We've always had domain-specific languages (DSLs), of course, but the current crop seem to me to be rather different. They've abandoned the design philosophy of a small, basic set of abstractions and powerful composition, in favour of large basic abstractions and limited composition and extension. This significantly changes development, and the balance of power between the language designer and the programmer.
XSLT is the language that's brought this most to mind, and especially the transition from XSLT 1.0 to XSLT 2.0.
First we have to defend the notion that XSLT is a programming language. I first thought of it as just a transformation system for documents, and so it is: but it's targeted at such a general set of applications that it takes on the flavour of a DSL. It's essentially a functional language with pattern-matching and recursion, whose main data structure is the XML document. It lets one write programs that are driven by these documents: another way to look at it is that it gives executable semantics to a set of XML tags, allowing them to perform computation.
XSLT's design doesn't start with a small set of primitives. Instead it provides a collection of tags that can be used to perform the common document operations. If you want to number a list, for example, XSLT contains tags that support the common numbering formats: numbers, letters with brackets, and so forth. What? You wanted something else? Well, you can probably do it, but it'll be indescribably hard, because you'll be writing a program in a language for which full computation is an afterthought. Instead, XSLT 2.0 adds new numbering formats that you can select using a helpful set of tags, and this is done within the XSLT processor, not using the XSLT processor. Meaningful extension requires the intervention of the language designer.
This is a major philosophical change, a post-modern approach to language design. Don't focus on getting the core abstractions right: instead, build a system -- any system -- that demonstrates initial functionality, and then extend it with new features to meet new demands. There's no overarching conceptual scheme to the system. Instead it's judged purely on the functions it provides, regardless of how they fit together. (Larry Wall described Perl as post-modern, but I don't think he meant quite what I'm getting at here -- although there are similarities.)
Post-modernism works for three reasons. Firstly, no-one is expecting to write very large amounts of code in these systems. They're intended for sub-systems, not for systems themselves, and so the need for generality is limited. Secondly, there's a premium on speed of development rather than on maintenance, which puts the emphasis on getting some result out quickly rather than on ensuring that, in the future, we can get out any result we need. The sub-systems are intended to change on a short timescale, so maintenance and extensibility are perceived to be less of an issue than agility. Thirdly, a lot of these languages target people who aren't programmers -- or at least don't think of themselves as programmers -- who are focused on something other than the code.
Post-modernism isn't wrong, and appears in middleware too. It's also important to realise the benefits: it makes it easier to put together larger systems, and increases the ambition of projects one can undertake compared to having to build so much from scratch. But it may be misguided. The people who spent the 1960s writing all the COBOL code that's still running today never thought it'd still be being maintained -- and the consequences of that have led to code that's now impossible to change. Simple solutions have a tendency to grow into more complex ones -- "just one more feature and we're done" -- which can push the costs of inappropriate choices out along the project lifecycle.
More importantly, for researchers post-modernism is a seductive siren call that lets you step away from tackling some difficult choices. Instead of picking a small set of concepts, choose any set and then extend them, in any way, to build up your system. I think this obscures some insights one might otherwise gain from simplifying the set of concepts.
I think it's also worth remembering that increasing the size and number of abstractions isn't without cost: not for the computer or the compiler, but for the programmer. The more programmers have to remember, and in particular the less well the things to be remembered fit together as a conceptual whole, the harder it is to use the system to its fullest advantage. People instead stay with small sub-sets of the language or -- worse -- settle for programs producing results that aren't those they want, just to stay within their language comfort zone. This sort of simplification is a step backwards, and we need to be careful of the consequences.
The School of Computer Science has a number of opportunities available for Brazilian students through this programme.
The Brazilian government is funding studentships for Brazilian nationals with a range of international universities and disciplines. The studentships are "co-tutelle", being split between an international and a Brazilian host university, and aim to improve international collaboration as well as enhancing the student experience.
A number of Schools at St Andrews are part of this programme, ranging across the physical and biological sciences in areas in which we have international-quality research. In Computer Science, we have a wide range of project proposals on offer covering all our main research activities. Interested students should contact the supervisors offering the proposed projects to check on suitability and availability. All students will be subject to the University's usual rules on quality and eligibility.
We are introducing a new MSci degree programme with direct entry to second year.
We currently offer two traditional Scottish four-year degrees, in Computer Science and Internet Computer Science. A number of factors have contributed to a desire to change this. We'd like to offer a straight-line programme from undergraduate to masters, more in line with European norms (the "Bologna process"). For non-Scottish UK students (also known as "RUKs"), we'd like to offer direct entry into second year so that their more advanced A Level qualifications shorten their degree and reduce fee and accommodation costs accordingly. A straight-through programme will also let us present a more integrated collection of modules and improve our targeting for specialised degrees.
All this means we're introducing a five-year MSci degree in the next application cycle, for 2013 entry. Details are available here.
We have around six fully-funded PhD studentships available across computer science.
The School of Computer Science at the University of St Andrews has funding for students to undertake PhD research in any of the School's research areas.
We are looking for highly-motivated research students with an interest in these exciting areas. Our only requirements are that the proposed research would be good, that we have staff to supervise it, and that you would be good at doing it. We have up to six funded studentships available for students interested in working towards a PhD. The studentships cover fees and provide a tax-free maintenance stipend of about £13,590 per year for 3.5 years. Exceptionally well-qualified and able students may be awarded an enhanced stipend of an additional £2,000 per year. Students should normally have, or expect to obtain, at least an upper-second class Honours degree or a Masters degree in Computer Science or a related discipline.
For further information on how to apply, see our postgraduate web pages. The closing date for applications is March 1st 2012 and we will make decisions on studentship allocation by May 1st 2012. (Applications after March 1st may be considered, at our discretion.) Informal enquiries can be directed to firstname.lastname@example.org or to potential supervisors.
The semantic web and open linked data open-up the vision of scientific data published in machine-readable form. But their adoption faces some challenges, many self-inflicted.
Last week I taught a short course on programming context-aware systems at a Swiss doctoral winter school. The idea was to follow the development process from the ideas of context, through modelling and representation, to reasoning and maintenance.
Context has some properties that make it challenging for software development. The data available tends to be heterogeneous, dynamic and richly linked. All these properties can impede the use of normal approaches like object-oriented design, which tend to favour systems that can be specified in a static object model up-front. Rather than use an approach that's highly structured from its inception, an alternative approach is to use an open, unstructured representation and then add structure later using auxiliary data. This leads naturally into the areas of linked open data and the semantic web.
The semantic web is a term coined by Tim Berners-Lee as a natural follow-on from his original web design. Web pages are great for browsing but don't typically lend themselves to automated processing. You may be able to extract my phone number from my contact details page, for example, but that's because you understand typography and abbreviation: there's nothing that explicitly marks the number out from the other characters on the page. Just as the web makes information readily accessible to people, the semantic web aims to make information equally accessible to machines. It does this by allowing web pages to be marked-up using a format that's more semantically rich than the usual HTML. This rests on several additional technologies: the Resource Description Framework (RDF) to assert facts about objects, SPARQL to access the model using queries, and the Web Ontology Language (OWL) to describe the structure of a particular domain of discourse. Using the example above, RDF would mark up the phone number, email address and so on explicitly, using terminology described in OWL to let a computer understand the relationships between, for example, a name, an email address, an employing institution and so on. Effectively the page, as well as conveying content for human consumption, can carry content marked-up semantically for machines to use autonomously. And of course you can also create pages that are purely for machine-to-machine interaction, essentially treating the web as a storage and transfer mechanism with RDF and OWL as semantically enriched transport formats.
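To make the triple model concrete, here's a minimal sketch using the Java library Jena (discussed further below). The example.org URIs, property names and values are placeholders invented for the example rather than any standard vocabulary.

```java
import org.apache.jena.rdf.model.Model;
import org.apache.jena.rdf.model.ModelFactory;
import org.apache.jena.rdf.model.Property;
import org.apache.jena.rdf.model.Resource;

public class ContactCard {
    public static void main(String[] args) {
        // An in-memory RDF model: a set of subject-predicate-object triples
        Model model = ModelFactory.createDefaultModel();

        // Placeholder URIs for the subject and predicates (illustrative only)
        String ns = "http://example.org/contact#";
        Resource person = model.createResource("http://example.org/people/simon");
        Property phone  = model.createProperty(ns, "phone");
        Property email  = model.createProperty(ns, "email");

        // Each addProperty call asserts one triple about the person
        person.addProperty(phone, "+44 1334 000000")
              .addProperty(email, "someone@example.org");

        // Serialise the model for exchange; Turtle is terser than RDF/XML
        model.write(System.out, "TURTLE");
    }
}
```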
So far so good. But RDF, SPARQL and OWL are far from universally accepted "in the trade", for a number of quite good reasons.
The first is verbosity. RDF uses XML as an encoding, which is quite a verbose, textual format. Second is complexity: RDF makes extensive use of XML namespaces, which add structure and prevent misinterpretation but make pages harder to create and parse. Third is the exchange overhead, whereby data has to be converted from the in-memory form that programs work with into RDF for exchange and then back again at the other end, each step adding more complexity and risks of error. Fourth is the unfamiliarity of many of the concepts, such as the dynamic non-orthogonal classification used in OWL rather than the static class hierarchies of common object-oriented approaches. Fifth is the disconnect between program data and model, with SPARQL sitting off to one side like SQL. Finally there is the need for all these technologies en masse (in addition to understanding HTTP, XML and XML Schemata) to perform even quite simple tasks, leading to a steep learning curve and a high degree of commitment in a project ahead of any obvious returns.
So the decision to use the semantic web isn't without pain, and one needs to place sufficient value on its advantages -- open, standards-based representation, easy exchange and integration -- to make it worthwhile. It's undoubtedly attractive to be able to define a structure for knowledge that exactly matches a chosen sub-domain, to describe the richness of this structure, and to have it compose more or less cleanly with other such descriptions of complementary sub-domains defined independently -- and to be able to exchange all this knowledge with anyone on the web. But this flexibility comes with a cost and (often) no obvious immediate, high-value benefits.
Having taught this stuff, I think the essential problem is one of tooling and integration, not core behaviour. The semantic web does include some really valuable concepts, but their realisation is currently poor and this poses a hazard to their adoption.
In many ways the use of XML is a red herring: no sane person holds data to be used programmatically as XML. It is -- and was always intended to be -- an exchange format, not a data structure. So the focus needs to be on the data model underlying RDF (subject-predicate-object triples with subjects and predicates represented using URIs) rather than on the use of XML.
While there are standard libraries and tools for use with the semantic web -- in Java these include Jena for representing models, Pellet and other reasoners providing ontological reasoning, and Protégé for ontology development -- their level of abstraction and integration with the rest of the language remain quite shallow. It is hard to ensure the validity of an RDF graph against an ontology, for example, and even harder to validate updates. The type systems also don't match, either statically or dynamically: OWL performs classification based on attributes rather than by defining hard classes, and the classification may change unpredictably as attributes are changed. (This isn't just a problem for statically-typed programming languages, incidentally: having the objects you're working with re-classified can invalidate the operations you're performing at a semantic level, regardless of whether the type system complains.) The separation of querying and reasoning from representation is awkward, rather like the use of SQL embedded into programs: the query doesn't fit naturally into the host language, which typically has no syntactic support for constructing queries.
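As a hedged illustration of that last point, here's a minimal sketch using Jena's ARQ query API, re-using the placeholder contact vocabulary from the earlier sketch (and assuming a reasonably recent Apache Jena release): the SPARQL lives in an opaque string that the Java compiler cannot check, and the results come back as untyped variable bindings rather than as the program's own objects.

```java
import org.apache.jena.query.Query;
import org.apache.jena.query.QueryExecution;
import org.apache.jena.query.QueryExecutionFactory;
import org.apache.jena.query.QueryFactory;
import org.apache.jena.query.QuerySolution;
import org.apache.jena.query.ResultSet;
import org.apache.jena.rdf.model.Model;

public class PhoneLookup {
    // The query is just a string as far as Java is concerned: no syntactic or type checking
    static final String QUERY =
        "PREFIX c: <http://example.org/contact#> " +
        "SELECT ?who ?phone WHERE { ?who c:phone ?phone }";

    public static void printPhones(Model model) {
        Query query = QueryFactory.create(QUERY);            // only parsed at run time
        try (QueryExecution qe = QueryExecutionFactory.create(query, model)) {
            ResultSet results = qe.execSelect();
            while (results.hasNext()) {
                QuerySolution row = results.next();           // untyped variable bindings
                System.out.println(row.get("who") + " -> " + row.get("phone"));
            }
        }
    }
}
```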
Perhaps the solution is to step back and ask: what problem does the semantic web solve? In essence it addresses the open and scalable mark-up of data across the web according to semantically meaningful schemata. But programming languages don't do this: they're about nailing-down data structures, performing local operations efficiently, and letting developers share code and functionality. So there's a mis-match between the goals of the two system components, and their strengths don't complement each other in the way one might like.
This suggests that we re-visit the integration of RDF, OWL and SPARQL into programming languages; or, alternatively, that we look at what features would provide the best computational capabilities alongside these technologies. A few characteristics spring to mind:
- Use classification throughout, as a more dynamic type structure than classical type systems
- Access RDF data natively, using binding alongside querying
- Adopt the XML Schemata types natively as well
- Make code polymorphic in the manner of OWL ontologies, so that code can be exchanged and re-used. This implies basing typing on reasoning rather than being purely structural
- Hide namespaces, URIs and the other elements of RDF and OWL behind more familiar (and less intrusive) syntax (while keeping the semantics)
- Allow programmatic data structures, suited to local use in a program, to be layered onto the data graph without forcing the graph itself into convoluted structures
- Think about the global, non-local data structuring issues
- Make access to web data intrinsic, not something that's done outside the normal flow of control and syntax
Ebooks (and more particularly e-journals) seem to need plenty of different affordances to be really useful as replacements for print versions.
I've been getting a lot of reading done in the west of Ireland, some of which has been work-related and some of which has included looking through electronic journals (and art e-journals, in Linda's case). It really highlights the benefits and drawbacks of electronic publishing.
First the benefits:
- Searchability. Being able to search both an individual issue and all a journal's issues is fantastic both when looking for broad background and when trying to find detail.
- Access. Saving a PDF locally (or, better, in Evernote) reduces repeated searching, allows tagging, and facilitates off-line reading. No matter what people would have us believe, there are times when we're off-line either by accident (on "soft" days my mi-fi modem stops working) or design (driving, resting, on the train). And this in turn means I can carry a vast range of material with me wherever I go: far more than in print form.
- Availability. Pretty much anything available anywhere in the scientific literature is directly accessible from my study in the wilds of Sligo.
And the drawbacks:
- Limited browsing. Searching is great when you can frame the search terms: otherwise it's a positive impediment.
- Digital restrictions. Some publishers go to great lengths to make sure you can't disseminate their material -- embedding a Flash-based reader into a web page to prevent local storage is my current bugbear -- and so obviate the benefits of access and tagging.
Perhaps the problem is the need to be overly precise. Searching works well enough, but only if done at the appropriate granularity, only if the content is tagged according to the way you think about it, and in fact only if it matches the way you think about it at the moment. This last problem, conceptual drift, makes long-term tagging painfully fragile: older content relating to slim touch-screen computers will not be tagged ipad or tablet computer, for example, and so will be essentially invisible to a modern searcher.
Serendipitous finding of material seems to work better in traditional linear form, where one can browse and dip in as required. Zinio allows this kind of interaction on tablets -- I've started taking The Economist this way rather than in print. Of course you're then trapped by digital restrictions, being forced to read within the confines of a particular application that doesn't support user-provided hyperlinking, tagging, annotation or any of the other good things that come with ebooks.
I suppose all this is an argument for open ebook and e-journal publishing, using open standards that can be linked to, tagged and manipulated within different applications, and that come in both "collected" (browsable) and "individual" (searchable) forms to facilitate the different modes of access that seem to be suggested by different user tasks.
The internet is just a mirror, and an amplifier
It's a commonplace to observe that the internet is changing the nature of discourse. Many of the concerns raised in such discussions are, in my opinion, severely over-blown, but it's certainly the case that we can observe changes in behaviour in the real world that can be linked back to discussions in the digital world. The mechanisms of this are somewhat less discussed, but they seem to arise from the very nature of the medium and so will be hard to change quickly.
The future-watcher Esther Dyson once made the following observation:
The Internet is like alcohol in some sense. It accentuates what you would do anyway. If you want to be a loner, you can be more alone. If you want to connect, it makes it easier to connect.

Perhaps we can generalise this slightly, and say instead that the internet is both a mirror and an amplifier: it identifies distinctions between people, and then reinforces and amplifies them.
How does this happen? It begins from the observation that every niche community is globally large. No matter how specialised one's interests, they will be shared by others elsewhere in the world. This has been an enormous boon to many people, most notably those who have (or who have relatives who have) rare illnesses: no matter how rare a condition is in the local gene pool, globally there will be a substantial population in a similar position who can provide understanding, advice and support. Often these will include individuals with access to and understanding of the latest research and treatments, often the scientists and doctors themselves: expertise that can be world-beating, and may completely overshadow what's available locally. What makes such communities possible is the low cost of setting up and maintaining a web site, and the power of search to allow such sites to be found by anyone sufficiently interested in them to spend some time crafting the right search terms. That is, the internet provides for the construction of specialised, distributed, communicating communities such as have never existed before: it's hard to conceive of doing something similar through the postal service, or even the telephone.
So the internet mirrors the human condition, from porn to poetry, and also democratises it. If you want to write poetry, you can hang out in communities that resemble the Bloomsbury Group in terms of their intellectual depth -- and you can do so from wherever you live, limited only by your ability to contribute. You can get feedback on your work or comments from others, perhaps more experienced than yourself, and so improve your own understanding and ability in your chosen field. The same thing has happened since the start of the internet: Conscience of a hacker, a poem in praise of meritocratic technical communities, still resonates with many computer people (including myself).
There is a side-effect of this specialisation, however, that's less positive. By forming a dedicated niche exactly suited to the needs of a particular community, the internet often removes those people from the general forums of discussion. Why engage with a general community when one can be with people exactly in tune with what you want? The answer, sometimes, is that what you want and what you need may be different, and that groupthink is far easier and more prevalent in smaller groups than in larger ones.
This is where amplification kicks in. If everyone in a community is kind of the same as yourself, it's harder to stand out -- and many people love to stand out. You can do so by being more helpful than anyone else, by being more perceptive and insightful -- or by being more extreme. The very specialisation of internet communities, combined with the ease of communication, the partial or complete anonymity, and the desire to stand out can be very disinhibiting, and can amplify discussion quite quickly to the extremities of what that group considers acceptable. If you hang out in some subcultural chat rooms, what's acceptable is very elastic and discussion will quickly head to the extremes: more extreme than most people can easily imagine, in many cases.
So amplification seems to be a direct result of specialisation -- one of the things that makes internet communities so powerful in the first place -- and can operate both positively and negatively, making the good better and the bad worse.
These effects are now spreading beyond discussion groups, through the phenomenon of "search bubbles". Google and other search engines now perform substantial pre-processing on search results before presenting them, for example factoring in localisation, user preferences and history. This means that the results you see in response to a search aren't the whole story: you're seeing what Google thinks will be of most relevance and interest to you. This is simply another form of specialisation: more discreet than happens with discussion forums, but still separating out a sub-set of the whole internet's information that's closely targeted to you. This certainly suggests that it might be subject to the same amplification effects, as people perceive (wrongly) that a large amount of content on the internet agrees with them and fits their pre-modelled interests and preferences -- and prejudices.
It's hard to know what to suggest to deal with these issues. Actually I'm inclined to say that nothing should be done, other than to be aware that the mirror magnifies, as it were. The good of specialist communities almost certainly outweighs the disadvantages of these (or other) communities pushing extremism, but it'd be good if those engaged in such discussions kept the amplifying effects of internet discussions in the backs of their minds for when things start to get weird.
Submissions are now open for the BCS Distinguished Dissertations competition for recently-submitted PhDs in the UK.
The 2011/2012 CPHC/BCS Distinguished Dissertations competition is now open for submissions via the web site.
The closing date is Monday 2nd April 2012. Further details can be found below and on the webpage.
The Conference of Professors and Heads of Computing (CPHC), in conjunction with BCS, The Chartered Institute for IT, annually selects for publication the best British PhD/DPhil dissertations in computer science.
The scheme aims to make more visible the significant contribution made by Britain -- in particular by post-graduate students -- to computer science. Publication also serves to provide a model for future students.
The selection panel on behalf of BCS/CPHC consists of experienced computer scientists, not more than one from any institution, each normally serving on the panel for three years. The 2012 panel members are: Simon Thompson (Kent, Chair), Teresa Attwood (Manchester), Russell Beale (Birmingham), Simon Dobson (St Andrews), Zoubin Ghahramani (Cambridge), Joemon Jose (Glasgow), Daniel Kroening (Oxford), Ralph Martin (Cardiff), Alexander Romanovsky (Newcastle) and Jon Whittle (Lancaster).
Any dissertation is eligible which is submitted for a doctorate in the British Isles in what is commonly understood as Computer Science. (Theses which are basically in some other discipline but which make use, even very extensive use, of computing will not be regarded as eligible.) However, there is a limit of THREE dissertations per year per university, and one per research group within any university.
To be considered, a dissertation should:
- make a noteworthy contribution to the subject;
- reach a high standard of exposition;
- place its results clearly in the context of computer science as a whole; and
- enable a computer scientist with significantly different interests to grasp its essentials.
The dissertation should be submitted electronically (as a PDF file) by the author's examiners, or by the Head of Department with the examiner's advice. The submitted version of the dissertation must be the final version after any required corrections have been made. The competition period for 2012 is for theses accepted from 1st January 2011 until the closing date of 2nd April 2012. A dissertation cannot be submitted to the competition more than once.
The dissertation should be accompanied by a written nomination comprising the following information:
- a justification, of about 300 words, by one of the examiners -- preferably the external -- explaining the dissertation's claim to distinction (against the criteria listed above);
- the name of the primary supervisor and the research group within the university to which the student was primarily affiliated;
- an assurance that within the competition period the examiners have recommended to the author's institution that the doctorate should be awarded; and
- the names and contact details of three suggested reviewers who are not in the same Department as the nominated thesis and who are independent of the supervision and examining of the thesis. The nominated reviewers must have confirmed that they are willing to provide a review.
Submissions should be made on-line via http://www.easychair.org/conferences/?conf=disdis12. The first author name submitted should be that of the thesis author; the individual submitting the nomination should list themselves as the second author. The title and abstract should be those of the thesis being nominated. The first file uploaded should be the 300 word nomination; the thesis document should be uploaded as an attachment. If any problems are experienced, or you have any questions, please email email@example.com for assistance.
The deadline for submission is 2nd April 2012.