Wednesday, September 11, 2019

Face Recognition, Bad People and Bad Data

  • We worry about face recognition just as we worried about databases - we worry what happens if they contain bad data and we worry what bad people might do with them
  • It’s easy to point at China, but there are large grey areas where we don’t yet have a clear consensus on what ‘bad’ would actually mean, or on how far we worry because this is genuinely different rather than just because it’s new and unfamiliar
  • Like much of machine learning, face recognition is quickly becoming a commodity tech that many people can and will use to build all sorts of things. ‘AI Ethics’ boards can go a certain way but can’t be a complete solution, and regulation (which will take many forms) will go further. But Chinese companies have their own ethics boards and are already exporting their products.
Way back in the 1970s and early 1980s, the tech industry created a transformative new technology that gave governments and corporations an unprecedented ability to track, analyse and understand all of us. Relational databases meant that for the first time things that had always been theoretically possible on a small scale became practically possible on a massive scale. People worried about this, a lot, and wrote books about it, a lot.


Specifically, we worried about two kinds of problem:
  • We worried that these databases would contain bad data or bad assumptions, and in particular that they might inadvertently and unconsciously encode the existing prejudices and biases of our societies and fix them into machinery. We worried people would screw up.
  • And, we worried about people deliberately building and using these systems to do bad things.
That is, we worried what would happen if these systems didn’t work and we worried what would happen if they did work.

We’re now having much the same conversation about AI in general (or more properly machine learning) and especially about face recognition, which has only become practical because of machine learning. And, we’re worrying about the same things - we worry what happens if it doesn’t work and we worry what happens if it does work. We’re also, I think, trying to work out how much of this is a new problem, how much of it we’re actually worried about, and why.

First, ‘when people screw up’.

When good people use bad data

People make mistakes with databases. We’ve probably all heard some variant of the old joke that the tax office has misspelled your name and it’s easier to change your name than to get the mistake fixed. There’s also the not-at-all-a-joke problem that you have the same name as a wanted criminal and the police keep stopping you, or indeed that you have the same name as a suspected terrorist and find yourself on a no-fly list or worse. Meanwhile, this spring a security researcher claimed that he’d registered ‘NULL’ as his custom licence plate and now gets hundreds of random misdirected parking tickets.

These kinds of stories capture three distinct issues:
  • The system might have bad data (the name is misspelled)…
  • Or have a bug or bad assumption in how it processes data (it can’t handle ‘Null’ as a name, or ‘Scunthorpe’ triggers an obscenity filter)
  • And, the system is being used by people who don’t have the training, processes, institutional structure or individual empowerment to recognise such a mistake and react appropriately.
Of course, all bureaucratic processes are subject to this set of problems, going back a few thousand years before anyone made the first punch card. Databases gave us a new way to express it on a different scale, and so now does machine learning. But ML brings different kinds of ways to screw up, and these are inherent in how it works.

So: imagine you want a software system that can recognise photos of cats. The old way to do this would be to build logical steps - you’d make something that could detect edges, something that could detect pointed ears, an eye detector, a leg counter and so on… and you’d end up with several hundred steps all bolted together and it would never quite work. Really, this was like trying to make a mechanical horse - perfectly possible in theory, but in practice the complexity was too great. There’s a whole class of computer science problems like this - things that are easy for us to do but hard or impossible for us to explain how we do them. Machine learning changes these from logic problems to statistics problems. Instead of writing down how you recognise a photo of X, you take a hundred thousand examples of X and a hundred thousand examples of not-X and use a statistical engine to generate (‘train’) a model that can tell the difference to a given degree of certainty. Then you give it a photo and it tells you whether it matched X or not-X and with what degree of certainty. Instead of telling the computer the rules, the computer works out the rules based on the data and the answers (‘this is X, that is not-X’) that you give it. (...)
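To make that concrete, here is a minimal, purely illustrative sketch of the ‘X versus not-X’ approach in Python, using scikit-learn and synthetic stand-in data - the feature vectors, class sizes and names are all invented for illustration, and real face or cat recognisers use deep neural networks trained on far larger labelled image sets:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)

# Stand-in for 'a hundred thousand examples of X and not-X': each 'photo' is
# reduced to a 64-number feature vector, and the two classes differ only
# statistically, not by any explicit rule.
cats = rng.normal(loc=0.3, scale=1.0, size=(1000, 64))       # examples of X
not_cats = rng.normal(loc=-0.3, scale=1.0, size=(1000, 64))  # examples of not-X

X = np.vstack([cats, not_cats])
y = np.array([1] * 1000 + [0] * 1000)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# 'Training' is just fitting a statistical model that separates the two example sets.
model = LogisticRegression(max_iter=1000).fit(X_train, y_train)

# The model never learns rules about ears or whiskers; it only learns which
# statistical patterns distinguish the labelled sets it was given.
print("held-out accuracy:", model.score(X_test, y_test))
print("P(X) for one new example:", model.predict_proba(X_test[:1])[0, 1])
```

Note that the last line is a probability, not a verdict - a point that matters later.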

This works fantastically well for a whole class of problem, including face recognition, but it introduces two areas for error.

First, what exactly is in the training data - in your examples of X and Not-X? Are you sure? What ELSE is in those example sets?

My favourite example of what can go wrong here comes from a project for recognising cancer in photos of skin. The obvious problem is that you might not have an appropriate distribution of samples of skin in different tones. But another problem that can arise is that dermatologists tend to put rulers in the photo of cancer, for scale - so if all the examples of ‘cancer’ have a ruler and all the examples of ‘not-cancer’ do not, that might be a lot more statistically prominent than those small blemishes. You inadvertently built a ruler-recogniser instead of a cancer-recogniser.

The structural thing to understand here is that the system has no understanding of what it’s looking at - it has no concept of skin or cancer or colour or gender or people or even images. It doesn’t know what these things are any more than a washing machine knows what clothes are. It’s just doing a statistical comparison of data sets. So, again - what is your data set? How is it selected? What might be in it that you don’t notice - even if you’re looking? How might different human groups be represented in misleading ways? And what might be in your data that has nothing to do with people and no predictive value, yet affects the result? Are all your ‘healthy’ photos taken under incandescent light and all your ‘unhealthy’ pictures taken under LED light? You might not be able to tell, but the computer will be using that as a signal.
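As a hedged, deliberately contrived illustration of that point, the toy simulation below (every number and feature is made up) gives a model one weak ‘real’ signal and one incidental signal - a stand-in for the ruler or the lighting - that happens to correlate perfectly with the label in the training data. The model leans on the incidental signal, and falls apart as soon as that signal is absent:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(1)
n = 2000

# Feature 0: a weak 'real' signal. Feature 1: the incidental signal (say,
# 'ruler present'), which correlates perfectly with the label in training.
label = rng.integers(0, 2, size=n)
real_signal = label * 0.3 + rng.normal(0, 1, size=n)
confound = label.astype(float)

X_train = np.column_stack([real_signal, confound])
model = LogisticRegression(max_iter=1000).fit(X_train, label)
print("learned weights [real signal, confound]:", model.coef_[0])

# At deployment the incidental signal is absent (no rulers, different lighting),
# so performance collapses towards chance even though training looked perfect.
test_label = rng.integers(0, 2, size=n)
test_real = test_label * 0.3 + rng.normal(0, 1, size=n)
X_test = np.column_stack([test_real, np.zeros(n)])
print("training accuracy:", model.score(X_train, label))
print("accuracy without the confound:", model.score(X_test, test_label))
```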

Second, a subtler point - what does ‘match’ mean? The computers and databases that we’re all familiar with generally give ‘yes/no’ answers. Is this licence plate reported stolen? Is this credit card valid? Does it have available balance? Is this flight booking confirmed? How many orders are there for this customer number? But machine learning doesn’t give yes/no answers. It gives ‘maybe’, ‘maybe not’ and ‘probably’ answers. It gives probabilities. So, if your user interface presents a ‘probably’ as a ‘yes’, this can create problems.
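The interface point is easy to show in code. In the hypothetical sketch below (the score and thresholds are invented), the model produces a probability, and it is whoever builds the interface who decides what threshold turns that ‘probably’ into a ‘yes’:

```python
# All numbers here are invented. The model's output is a probability; the
# interface designer chooses the threshold that turns it into an answer.
def present_result(probability: float, threshold: float) -> str:
    """Turn a model score into the answer the user actually sees."""
    return "MATCH" if probability >= threshold else "no match"

score = 0.62  # the model's honest answer: 'probably, maybe'

print(present_result(score, threshold=0.99))  # cautious interface: 'no match'
print(present_result(score, threshold=0.50))  # careless interface: 'MATCH'
```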

You can see both of these issues coming together in a couple of recent publicity stunts: train a face recognition system on mugshots of criminals (and only criminals), and then take a photo of an honest and decent person (normally a politician) and ask if there are any matches, taking care to use a fairly low confidence level, and the system says YES! - and this politician is ‘matched’ against a bank robber.

To a computer scientist, this can look like sabotage - you deliberately use a skewed data set, deliberately set the confidence threshold too low for the use case and then (mis)represent a probabilistic result as YES WE HAVE A MATCH. You could have run the same exercise with photos of kittens instead of criminals, or indeed photos of cabbages - if you tell the computer ‘find the closest match for this photo of a face amongst these photos of cabbages’, it will say ‘well, this cabbage is the closest.’ You’ve set the system up to fail - like driving a car into a wall and then saying ‘Look! It crashed!’ as though you’ve proved something.
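A small sketch of why ‘closest match’ always returns something, whatever the gallery contains - the embeddings and names below are invented stand-ins for what a real face recognition system would produce:

```python
import numpy as np

rng = np.random.default_rng(2)

# A 'gallery' of a thousand cabbages, each represented by an invented
# 128-number embedding, plus one query photo that is not a cabbage at all.
gallery = {f"cabbage_{i}": rng.normal(size=128) for i in range(1000)}
query = rng.normal(size=128)

# Nearest neighbour by distance: there is always a 'best' match, however poor
# it is in absolute terms - which is why the confidence threshold matters so much.
best = min(gallery, key=lambda name: np.linalg.norm(gallery[name] - query))
print("closest match:", best)
```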

But of course, you have proved something - you’ve proved that cars can be crashed. And these kinds of exercises have value because people hear ‘artificial intelligence’ and think that it’s, well, intelligence - that it’s ‘AI’ and ‘maths’ and a computer and ‘maths can’t be biased’. The maths can’t be biased but the data can be. There’s a lot of value to demonstrating that actually, this technology can be screwed up, just as databases can be screwed up, and they will be. People will build face recognition systems in exactly this way and not understand why they won’t produce reliable results, and then sell those products to small police departments and say ‘it’s AI - it can never be wrong’.

These issues are fundamental to machine learning, and it’s important to repeat that they have nothing specifically to do with data about people. You could build a system that recognises imminent failure in gas turbines and not realise that your sample data has biased it against telemetry from Siemens sensors. Equally, machine learning is hugely powerful - it really can recognise things that computers could never recognise before, with a huge range of extremely valuable use cases. But, just as we had to understand that databases are very useful but can be ‘wrong’, we also have to understand how this works, both to try to avoid screwing up and to make sure that people understand that the computer could still be wrong. Machine learning is much better at doing certain things than people, just as a dog is much better at finding drugs than people, but we wouldn’t convict someone on a dog’s evidence. And dogs are much more intelligent than any machine learning.

by Benedict Evans