AI's Problems: More Than You Know

Zach Wilkins
Nov 29, 2020

Nowadays, AI is a term that gets the willy-nilly treatment. Everyone and their mom seems to prescribe its use like aspirin. "A little AI never hurt nobody!" "An artificial intelligence a day keeps the monotonous jobs away!" Let's talk about what AI is in the first place, then debunk a few common myths, and finally discuss some big problems the field is facing.

At the highest level, artificial intelligence is anything that makes machines autonomous in some way or gives them even the loosest resemblance to human intelligence. This can include human-written scripts that a robot follows or simple chatbots that recognize common terms. Within AI sits the subfield of machine learning, which is largely about how the machine gains its "intelligence": it uses data to train models that complete a task as well as possible. Classifying an image, recognizing the sentiment of a sentence, finding flaws in a piece of steel, predicting COVID hotspots: all of these tasks can be automated or modeled with machine learning. Data science overlaps with machine learning, but not necessarily with the rest of AI, which is usually the domain of engineers and computer scientists. Data science includes building tools, statistics, programming/scripting, and generally solving problems with data. This article written by IBM does a very good job of explaining all of this with some nice visuals. I don't find it necessary to go further for my purposes, but it gives a more detailed rundown for those who are curious.
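To make the difference between human-written rules and learning from data concrete, here is a toy sketch in Python. The function names and the four training sentences are hypothetical, and the "learning" is deliberately crude word counting, nothing like a production sentiment model. The scripted version only knows the one keyword a human wrote into it, while the learned version picks up its word weights from labeled examples.

```python
from collections import Counter
from typing import List, Tuple


def scripted_sentiment(text: str) -> str:
    """Rule-based 'AI': a human hard-codes the logic (one keyword)."""
    return "positive" if "great" in text.lower().split() else "negative"


def train_word_weights(examples: List[Tuple[str, str]]) -> Counter:
    """Toy 'machine learning': each word's weight is estimated from labeled data."""
    weights: Counter = Counter()
    for text, label in examples:
        delta = 1 if label == "positive" else -1
        for word in text.lower().split():
            weights[word] += delta
    return weights


def learned_sentiment(weights: Counter, text: str) -> str:
    """Score a sentence by summing the learned weights of its words."""
    score = sum(weights[word] for word in text.lower().split())  # unseen words count as 0
    return "positive" if score > 0 else "negative"


# Hypothetical labeled examples.
training_data = [
    ("great product loved it", "positive"),
    ("loved the service", "positive"),
    ("terrible product broke fast", "negative"),
    ("awful service never again", "negative"),
]
weights = train_word_weights(training_data)

print(scripted_sentiment("loved it"))          # negative (the rule only knows "great")
print(learned_sentiment(weights, "loved it"))  # positive (learned from the data)
```

The learned version is only as good as its examples, which is exactly where the problems below begin.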

Over the past decade, and especially the last 5 years, the hype around AI has been bonkers. 2020 Democratic presidential hopeful Andrew Yang ran on the fear that AI would take people's jobs much sooner than we thought. The automated driving research of Tesla, Uber, Google, and others has terrified the trucking industry. TikTok is under fire because user data could be feeding AI models in China (which is certainly what the other social media giants are doing, but for domestic profit). We've even got AI doctor's assistants in the works. Generative adversarial networks (GANs, invented in 2014) can fake pictures and videos almost perfectly, fueling misinformation. From the sounds of it, AI is taking over the world. Every time I look up, better models have been published. This post documents popular computer vision models (for classifying images) over the past 5 years. Each time I looked around, I'd find some totally new design with slightly better accuracy than anything that came before.

Focusing on just the AI piece of all this, there are lots of reports that it's just not working nearly as well in practice as promised. Many shortcomings have been known for quite some time, but evidence is also coming out about problems I'd never have expected. I'm starting to believe that AI is going to fall short of where decision-makers want it to be. In some cases, a path to a solution isn't obvious.

Bad Data

Andrew Ng, founder of deeplearning.ai and former head of Google Brain, has been finding that human-level performance, as a threshold to beat, is not providing great results in practice. In manufacturing, even labeling flaws in the first place can be very iffy. I've worked with a steel dataset before and found that discoloration and glare can make the metal look defective when it's really just fine. I'd have failed at labeling it, but even professionals can have trouble. The same problem of flawed training data flows into cancer classification, finding sentiment in YouTube comments, flagging spam, and even classifying a cat versus a dog! So, one problem of AI is that it's not always trained on correct information. Garbage in, garbage out. Bad data shows up across many machine learning problems. For simple classification problems like "bird vs. not bird", these issues are easily rectified with more or better training data. However, some outcomes are very grey.
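Garbage in, garbage out can be shown with a toy example. The sketch below assumes a hypothetical one-number "irregularity score" per steel image and a systematic labeling error: annotators also flag glare and discoloration as defects. A simple threshold fit to those flawed labels looks perfect on them, yet it is wrong on a fifth of the images when checked against the truth. Real defect models use images, not a single score; the point is only that the model faithfully learns whatever the labels say.

```python
from typing import List


def fit_threshold(scores: List[int], labels: List[bool]) -> int:
    """Pick the cutoff t (predict 'defective' when score >= t) with the fewest training errors."""
    def errors(t: int) -> int:
        return sum((s >= t) != y for s, y in zip(scores, labels))
    return min(sorted(set(scores)), key=errors)


def accuracy(scores: List[int], labels: List[bool], t: int) -> float:
    """Fraction of images where the thresholded prediction matches the given labels."""
    return sum((s >= t) == y for s, y in zip(scores, labels)) / len(labels)


# Hypothetical irregularity scores 0..99 for 100 steel images.
scores = list(range(100))
true_labels = [s >= 50 for s in scores]   # ground truth: only 50+ is a real defect
noisy_labels = [s >= 30 for s in scores]  # annotators also flag glare/discoloration (30-49)

t_noisy = fit_threshold(scores, noisy_labels)
t_clean = fit_threshold(scores, true_labels)
print(t_noisy, accuracy(scores, true_labels, t_noisy))  # 30 0.8
print(t_clean, accuracy(scores, true_labels, t_clean))  # 50 1.0
```

The model trained on noisy labels has zero training error, so nothing in the training loop itself reveals the problem; only better labels do.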

The Goal Line

As mentioned in the Ng article above, experts are increasingly wondering whether beating human-level performance (HLP) is enough. If doctors classify cancer correctly 98% of the time, does it help if a model can do 99.5%? One argument for doctors is that they are actually more sophisticated: they can recognize other ailments, use outside information such as a history of trauma to inform their decisions, and do better at classification when positives are very rare. AI gets worse when it has to solve multiple problems at once or handle soft outcomes. "Is a pterodactyl a bird or a dinosaur? Well, it flies like a bird and looks like a dinosaur, but it's actually a pterosaur." If a model had never seen a pterosaur, it wouldn't know that's even an option. Are we supposed to live in a world where models can't cross domains, or do we dump them? Ng argues that in some cases a model can serve as a supplement, a second set of eyes. In other cases, the model takes the first pass at a problem, and whenever it is even slightly uncertain, humans come in to do the rest of the labeling. The question remains: what is the actual performance threshold for using AI? Is it beating HLP? Is it some metric like mean squared error or area under the curve? Is it beating the previous model?
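Two of the points above are easy to make concrete. With rare positives, raw accuracy flatters a useless model, which is one reason the choice of metric matters so much. And the "first pass" workflow amounts to sending only confident predictions through automatically. The sketch below uses made-up numbers (1% prevalence) and hypothetical confidence cutoffs of 0.2 and 0.8; in practice those cutoffs would be tuned on validation data.

```python
from typing import List


def accuracy(y_true: List[int], y_pred: List[int]) -> float:
    """Share of all cases predicted correctly."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


def recall(y_true: List[int], y_pred: List[int]) -> float:
    """Share of actual positives the model catches."""
    preds_on_positives = [p for t, p in zip(y_true, y_pred) if t == 1]
    return sum(preds_on_positives) / len(preds_on_positives) if preds_on_positives else 0.0


def triage(scores: List[float], low: float = 0.2, high: float = 0.8) -> List[str]:
    """Auto-label confident model scores; send the uncertain middle band to a human."""
    routes = []
    for s in scores:
        if s >= high:
            routes.append("auto_positive")
        elif s <= low:
            routes.append("auto_negative")
        else:
            routes.append("human_review")
    return routes


# 1,000 hypothetical scans, only 10 of them positive (1% prevalence).
y_true = [1] * 10 + [0] * 990
always_negative = [0] * 1000  # a "model" that never flags anything

print(accuracy(y_true, always_negative))  # 0.99, which looks great
print(recall(y_true, always_negative))    # 0.0, it misses every case
print(triage([0.05, 0.5, 0.93]))          # ['auto_negative', 'human_review', 'auto_positive']
```

A model that "beats" a 98% human on accuracy could be this useless, which is why the threshold question cannot be answered with a single number.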

Developing for Papers

When developing for papers, there are a few problems. One is that the benchmark for success is very controversial; another is that replication is becoming a joke. For new computer vision models, popular datasets like CIFAR are used as benchmarks. If your model architecture tops the CIFAR leaderboard, it warrants a new paper. The problem is that these models don't always perform so well on other problems. In a class I took with renowned linguist Noam Chomsky, he declared that machines couldn't possibly learn language, but could maybe emulate it under the best conditions. A colleague, Professor Sandiway Fong, presented flaws in how Google's BERT handled syntax: it struggled with even simple sentences. Second, training these models is becoming enormously expensive. The entry of corporations into academic research has made it very hard to replicate recent results. An article in MIT Technology Review recently focused on an incident in which Google was published in Nature even though it shared little information about the model architecture, the data used to train it, or the code that built it all. Worse, these models are so expensive that only the richest companies in the world can afford to train them. The article reports that "only 15% of AI papers share their code" and cites "the language generator GPT-3 estimated to have cost OpenAI $10 to $12 million—and that’s just the final model". There are definitely pros to having this research done at all, as models are improved to use fewer resources and are tweaked by the masses. Currently, though, there are serious troubles with AI in papers.
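On the replication side, one small habit that helps is fixing random seeds and saving the configuration next to the results, so that anyone with the code can rerun the exact experiment. The sketch below is a hypothetical stand-in, not a real training loop: `run_experiment` just draws "weights" from a seeded generator. Seeds alone do not guarantee identical results across hardware and library versions (GPU kernels, for example, can be nondeterministic), and none of this helps if the code and data are never released.

```python
import json
import random


def run_experiment(seed: int, n_weights: int = 5) -> dict:
    """Hypothetical stand-in for training: every random draw comes from one seeded generator."""
    rng = random.Random(seed)  # local generator, so no hidden global random state
    weights = [rng.gauss(0.0, 1.0) for _ in range(n_weights)]
    # Store the configuration alongside the result so the run can be repeated exactly.
    return {"config": {"seed": seed, "n_weights": n_weights}, "weights": weights}


first = run_experiment(seed=42)
second = run_experiment(seed=42)
other = run_experiment(seed=7)

print(first == second)              # True: same seed, same "model"
print(first == other)               # different seed, different "model"
print(json.dumps(first["config"]))  # {"seed": 42, "n_weights": 5}
```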

Curse of The Black Box

Data shift is rarely mentioned, and underspecification has only very recently been recognized. Data shift answers a natural question that even the greenest analysts ask: "What if the underlying system changes? Will my model still work?" For stock trading algorithms, pent-up fear in the market might decrease trading volume, producing data that looks nothing like what was previously observed. For image data, the skill of camera operators or the quality of photos may change in practice. This could render current models useless, or at least biased in some way.

Underspecification is even more worrying. The idea is that the effects being modeled can have many possible causes, and modeling that system can be more difficult than we realize. For machine learning, this means that multiple sets of parameters may produce a nearly perfect model on training data, yet some of them may fail catastrophically on totally new data in the future. The consequence is similar to data shift, but the cause is different: even if the underlying system of data and relationships stays the same, training never pinned down the right model in the first place. A good model and a bad one can look nearly identical, differing only by seemingly random tweaks in parameters. The researchers in this article believe models need much more stress testing, but we fundamentally do not understand how minute differences may play huge roles in practice. There may not be a solution; underspecification is a rampant harvester of dreams.

I haven't even mentioned how models can learn biases inherent in the data, which often earns them the label of "racist models". Usually, the issue is that creators feed the model information that can lead to bias, like directly including race and/or ethnicity, or showing many more images of white people than Black people. It's hard to investigate why a model picks up those biases, and almost impossible to see how it uses that information. That is one feature of machine learning that sows distrust.
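Underspecification, and how a shift in the data exposes it, can be illustrated with a deliberately tiny linear example. The data and weights below are made up: in training, the second feature is an exact copy of the first, so three different models fit the training set perfectly. Once the two features stop moving together, their errors diverge even though the true rule never changed. Real cases are subtler, involving large networks whose differences come from seemingly arbitrary training choices, but the core issue is the same: the training data never forced the model to pick the right explanation.

```python
from typing import List, Sequence, Tuple

Point = Tuple[float, float]


def predict(weights: Sequence[float], x: Sequence[float]) -> float:
    """Linear model: weighted sum of the features."""
    return sum(w * xi for w, xi in zip(weights, x))


def mse(weights: Sequence[float], xs: List[Point], ys: List[float]) -> float:
    """Mean squared error of a linear model on a dataset."""
    return sum((predict(weights, x) - y) ** 2 for x, y in zip(xs, ys)) / len(ys)


# Training data: feature 2 always equals feature 1, and the true rule is y = feature 1.
train_x = [(1.0, 1.0), (2.0, 2.0), (3.0, 3.0)]
train_y = [1.0, 2.0, 3.0]

# New data after a shift: the features no longer move together (true rule unchanged).
new_x = [(1.0, 3.0), (2.0, 0.0)]
new_y = [1.0, 2.0]

models = {
    "uses feature 1": (1.0, 0.0),
    "uses feature 2": (0.0, 1.0),
    "splits evenly": (0.5, 0.5),
}
for name, w in models.items():
    print(f"{name}: train MSE={mse(w, train_x, train_y)}, new MSE={mse(w, new_x, new_y)}")
# uses feature 1: train MSE=0.0, new MSE=0.0
# uses feature 2: train MSE=0.0, new MSE=4.0
# splits evenly: train MSE=0.0, new MSE=1.0
```

Nothing in the training error distinguishes the three models; only data that breaks the spurious correlation does, which is why stress testing on deliberately different data matters.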

There are so many issues with machine learning and AI once you look beyond the hype. It's hard to imagine the field going much further without addressing them. Some of these issues may be intractable, and some may have no solution at all. AI works very well in many situations, but its growth is stalling, and I can't see a way out. I predict many businesses will sour on AI, which may take some heat off the field for a while. Understanding the black box is huge, but we're far from where we need to be. At this point, I think we should take a step back and work on these issues. Further adoption of AI in complex settings just raises more alarms to shut off.