The next challenges for situation recognition

As a pervasive systems research community we’re doing quite well at automatically identifying simple things happening in the world. What is the state of the art, and what are the next steps? Pervasive computing is about letting computers see and respond to human activity. In healthcare applications, for example, this might involve monitoring an elderly person in their home and checking for signs of normality: doors opening, the fridge being accessed, the toilet flushing, the bedroom lights going on and off at the right times, and so on. A collection of simple sensors can provide the raw observational data, and we can monitor this stream of data for the “expected” behaviours. If we don’t see them — no movement for over two hours during daytime, for example — then we can sound an alarm and alert a carer. Done correctly this sort of system can literally be a life-saver, but also makes all the difference for people with degenerative illnesses living in their own homes. The science behind these systems is often referred to as activity and situation recognition, both of which are forms of context fusion. To deal with these in the wrong order: context fusion is the ability to take several streams of raw sensor (and other) data and use it to make inferences; activity recognition is the detection of simple actions in this data stream (lifting a cup, chopping with a knife); and situation recognition is the semantic interpretation of a high-level process (making tea, watching television, in a medical emergency). Having identified the situation we can then provide an appropriate behaviour for the system, which might involve changing the way the space is configured (dimming the lights, turning down the sound volume), providing information (“here’s the recipe you chose for tonight”) or taking some external action (calling for help). This sort of context-aware behaviour is the overall goal. The state of the art in context fusion uses some sort of uncertain reasoning including machine learning and other techniques that are broadly in the domain of artificial intelligence. These are typically more complicated than the complex event processing techniques used in financial systems and the like, because they have to deal with significant noise in the data stream. (Ye, Dobson and McKeever. Situation recognition techniques in pervasive computing: a review. Pervasive and Mobile Computing. 2011.) The results are rather mixed, with a typical technique (a naive Bayesian classifier, for example) being able to identify some situations well and others far more poorly: there doesn’t seem to be a uniformly “good” technique yet. Despite this we can now achieve 60-80% accuracy (by F-measure, a unified measure of the “goodness” of a classification technique) on simple activities and situations. That sounds good, but the next steps are going to be far harder. To see what the nest step is, consider that most of the systems explored have been evaluated under laboratory conditions. These allow fine control over the environment — and that’s precisely the problem. The next challenges for situation recognition come directly from the loss of control of what’s being observed. Let’s break down what needs to happen. Firstly, we need to be able to describe situations in a way that lets us capture human processes. This is easy to do to another human but tricky to a computer: the precision with which we need to express computational tasks gets in the way. For example we might describe the process of making lunch as retrieving the bread from the cupboard, the cheese and butter from the fridge, a plate from the rack, and then spreading the butter, cutting the cheese, and assembling the sandwich. That’s a good enough description for a human, but most of the time isn’t exactly what happens. One might retrieve the elements in a different order, or start making the sandwich (get the bread, spread the butter) only to remember that you forgot the filling, and therefore go back to get the cheese, then re-start assembling the sandwich, and so forth. The point is that this isn’t programming: people don’t do what you expect them to do, and there are so many variations to the basic process that they seem to defy capture — although no human observer would have the slightest difficulty in classifying what they were seeing. A first challenge is therefore a way of expressing the real-world processes and situations we want to recognise in a way that’s robust to the things people actually do. (Incidentally, this way of thinking about situations shows that it’s the dual of traditional workflow. In a workflow system you specify a process and force the human agents to comply; in a situation description the humans do what they do and the computers try to keep up.) The second challenge is that, even when captured, situations don’t occur in isolation. We might define a second situation to control what happens when the person answers the phone: quiet the TV and stereo, maybe. But this situation could be happening at the same time as the lunch-making situation and will inter-penetrate with it. There are dozens of possible interactions: one might pause lunch to deal with the phone call, or one might continue making the sandwich while chatting, or some other combination. Again, fine for a human observer. But a computer trying to make sense of these happenings only has a limited sensor-driven view, and has to try to associate events with interpretations without knowing what it’s seeing ahead of time. The fact that many things can happen simultaneously enormously complicates the challenge of identifying what’s going on robustly, damaging what is often already quite a tenuous process. We therefore need techniques for describing situation compositions and interactions on top of the basic descriptions of the processes themselves. The third challenge is also one of interaction, but this time involving multiple people. One person might be making lunch whilst another watches television, then the phone rings and one of them answers, realises the call is for the other, and passes it over. So as well as interpenetration we now have multiple agents generating sensor events, perhaps without being able to determine exactly which person caused which event. (A motion sensor sees movement: it doesn’t see who’s moving, and the individuals may not be tagged in such a way that they can be identified or even differentiated between.) Real spaces involve multiple people, and this may place limits on the behaviours we can demonstrate. But at the very least we need to be able to describe processes involving multiple agents and to support simultaneous situations in the same or different populations. So for me the next challenges of situation recognition boil down to how we describe what we’re expecting to observe in a way that reflects noise, complexity and concurrency of real-world conditions. Once we have these we can explore and improve the techniques we use to map from sensor data (itself noisy and hard to program with) to identified situations, and thence to behaviour. In many ways this is a real-world version of the concurrency theories and process algebras that were developed to describe concurrent computing processes: process languages brought into the real world, perhaps. This is the approach we’re taking in a European-funded research project, SAPERE, in which we’re hoping to understand how to engineer smart, context-aware systems on large and flexible scales.