Windows Into AI: How New Interpretability Tools Might Enable Trustworthy AI

Designed by Isabelle Qiu

Now is the time for researchers, governments, and industry to act. They should take this opportunity to invest in our tools for interpreting AI and, when feasible, implement high standards based on these tools.

“The clock is ticking,” sighs Chief Nursing Officer McCabe, blinking away tears as she recalls a sick patient’s wait for COVID-19 treatment. 1 Her patient was not unique. Since March 2020, surging case counts have made many hospitals overflow, forcing some medical staff to choose which patients had to wait for treatment. 2 Traditionally, staff made these choices themselves; more recently, these decisions have been increasingly informed by machines.

During early waves of the COVID-19 pandemic, many hospitals turned to artificial intelligence (AI) tools to decide in which order they would treat patients. 3 Around the world, healthcare facilities started to use software that quickly scanned patients’ chest x-rays and made predictions about which patients had COVID-19. If these algorithms predicted that someone had COVID-19, that patient would be moved up in a waitlist.

In important ways, this decision helped. It partly made up for staff shortages, likely saving lives.

But along with these benefits, the rushed deployment of AI tools for COVID-19 prognosis brought a major downside: it introduced algorithmic bias into life-and-death decisions. 4 While confirming individual cases of bias would require further research, it’s likely that patients were denied quick access to life-saving treatments on the basis of their race or gender.

We may be tempted to respond by abandoning these algorithms outright, but that would be very costly and likely counterproductive as a way of supporting vulnerable communities. After all, understaffed hospitals without rapid prognosis tools will need to leave more patients — disproportionately minorities — with dangerously long wait times. 5 Before abandoning an AI tool, we should consider whether it’s feasible to sufficiently improve it and to hold it to minimum standards.

For years, engineers’ understanding of leading AI algorithms has been limited — constraining their ability to reduce algorithmic bias — but recent research discoveries suggest new ways forward. 6 By offering new ways to make sense of AI tools’ inner workings, these findings suggest novel ways engineers can identify and eliminate certain forms of algorithmic discrimination. 

In light of these possibilities, now is the time for researchers, governments, and industry to act. They should take this opportunity to invest in our tools for interpreting AI and, when feasible, implement high standards based on these tools. On top of their immediate benefits, such actions would also build our institutions’ experience with using AI interpretability tools to improve AI — experience that will help prepare us to address other AI challenges.

As we wait for AI interpretability methods to mature, algorithmic bias will continue to cause harm, so policymakers and industry leaders should complement long-term investments with practices that will reduce bias today. Among other steps, they should expand access to and use of diverse datasets, implement existing methods for detecting and reducing algorithmic bias, and diversify engineering staff.

Before we get too far into these solutions, let’s take a closer look at the problem.

Opaque AI and Its Downsides

Users might assume that the engineers who design AI algorithms surely know how to eliminate their bias. But that is not always the case, because today’s leading AI algorithms are machine learning (ML) models — they learn on their own, and even engineers often don’t know precisely what they’ve learned.

As a refresher, cutting-edge ML models are a kind of AI algorithm inspired by the human brain. They’re structured as neural networks, which partly means they’re made up of many small pieces (“neurons”) that send signals to each other. When someone runs an ML model, information starts at one column of neurons, is passed and processed through a bunch of intermediary columns, and gets churned out at a final column. Collectively, these neurons can do things like read a scan of someone’s lungs and output whether the patient has COVID-19. To succeed at this, neural networks benefit from having many neurons (e.g. millions of them), and from having neurons connect to each other in just the right ways — a result achieved through automated learning. 6
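To make the “columns of neurons” picture concrete, here is a minimal sketch in Python (using NumPy). The layer sizes, the random weights, and the sigmoid activation are illustrative assumptions; real prognosis models are vastly larger, and their weights are set by automated learning rather than chosen by hand.

```python
import numpy as np

rng = np.random.default_rng(0)

def sigmoid(x):
    # Squashes each neuron's combined signal into the range (0, 1).
    return 1.0 / (1.0 + np.exp(-x))

# Toy network: a column of 8 input neurons -> 5 hidden neurons -> 1 output neuron.
W_hidden = rng.normal(size=(8, 5))  # connections from input column to hidden column
W_output = rng.normal(size=(5, 1))  # connections from hidden column to output column

def forward(features):
    # Information flows column by column: each neuron combines incoming signals,
    # applies its activation, and passes the result on to the next column.
    hidden = sigmoid(features @ W_hidden)
    return sigmoid(hidden @ W_output)  # e.g., an (uncalibrated) "likely COVID-19" score

print(forward(rng.normal(size=(1, 8))))
```

Even in this toy version, the output depends on dozens of learned numbers at once, which hints at why the full-size versions are so hard to read.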

The complex structure of machine learning models — and the involvement of automated learning — often means that, although an ML model produces meaningful results, no one knows how it gets there. Engineers can follow each small step of the algorithm, but a big-picture understanding is typically absent. As a rough analogy, it’s as if someone had learned how to make a cake, but we didn’t understand the steps in the recipe, or the ingredients.

Engineers’ limited knowledge about ML models means that, when something is wrong with an algorithm (e.g. it’s biased because of biased historical data), they might not know how to fix it. Building on our previous analogy: if someone made a cake that tasted funny, and you didn’t understand the recipe, teaching them how to fix it (without messing up the cake in other ways) would be hard.

Going back to COVID-19, the opacity of machine learning models has made it harder for developers and users of prognosis tools to identify and eliminate problems. And there have been plenty of problems. A study published in Nature reviewed 62 ML models designed to detect COVID-19 using chest scans. 7 The researchers report a bleak finding: “none of the models identified are of potential clinical use due to methodological flaws and/or underlying biases.” Flaws in the ML models they studied included demographically unrepresentative data, a problem that has historically contributed to major inequities in healthcare provision. 8

The life-and-death limitations of current ML models are especially clear in the context of hospital triage. But similar problems crop up in other uses of AI; whenever complex ML models are used — whether that’s to screen job applications, 9 to inform court sentencing decisions, 10 or to identify military targets 11 — engineers and operators lack a human-interpretable understanding of precisely how these algorithms make decisions. This will cause growing harms as high-stakes applications of AI expand.

With little insight into ML models, people can only try to correct errors retroactively. But in high-pressure settings like hospitals, time is often too short for such retroactive error correction. In these critical decision situations, avoiding algorithmic harms requires that we have enough insight into ML models to address their flaws proactively.

Ways Forward – Recent Discoveries in Understanding AI

Fortunately, some AI researchers have now spent years developing techniques to better understand what machine learning models are doing. Some of these techniques aim for post-hoc explanations: case-by-case explanations of an AI tool’s decisions, produced after each decision is made. Another branch of work is AI transparency research, which tries to understand how these algorithms work in a more general way.
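To give a flavor of the first, post-hoc branch, here is a minimal sketch in Python (with PyTorch) of one common technique of this kind, a gradient-based saliency map, which asks which pixels of a single input most affected a single prediction. The model here is a generic placeholder rather than any deployed prognosis tool, and this method is just one illustrative example, not one singled out by the research discussed below.

```python
import torch
from torchvision.models import resnet18

# Placeholder classifier; a real audit would use the deployed model itself.
model = resnet18()
model.eval()

# Stand-in for one preprocessed chest x-ray. Post-hoc methods explain one case at a time.
image = torch.randn(1, 3, 224, 224, requires_grad=True)

score = model(image).max()   # the model's top class score for this input
score.backward()             # gradients: how much each pixel nudged that score
saliency = image.grad.abs().max(dim=1).values  # per-pixel influence map, shape (1, 224, 224)

print(saliency.shape)
```

Explanations like this can flag a suspicious decision after the fact, but they say little about how the model behaves in general.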

Within this second branch of work — AI transparency — recent research has revealed the existence of what are called “multimodal neurons” within some image labeling ML models. 12 In short, these neural networks seem to have neurons that detect abstract concepts. Going back to our cake analogy, researchers are learning to identify some of the fancier ingredients involved in algorithms’ “recipes.”

For example, a single neuron in one of these models seems to track the concept of Halloween. It activates in response to photos of the word “Halloween.” 13 That’s not all; it also activates in response to a bunch of sights that are conceptually related to Halloween: the word “haunted,” horror movie poster fonts, spooky masks, the word “spooky,” gravestones, and Jack O’Lanterns.

There are plenty more exciting findings. Other neurons in this ML model seem to keep track of other abstract concepts, including the emotion of surprise, Lady Gaga, and history.

(Researchers figured this out by asking questions like: Out of the photos and text in our data, which most activate a specific neuron? And out of all possible inputs — even if they’re not in actual data — which most activate a specific neuron?)
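A minimal sketch of the first question might look like the code below, written in Python with PyTorch: record a chosen neuron’s activation for every image in a dataset, then rank the images by it. The model, layer, and neuron index are placeholders assumed for illustration; the actual multimodal-neuron research involves far more machinery, including optimizing synthetic inputs to answer the second question.

```python
import torch
from torchvision.models import resnet18

# Placeholder model: in practice, probe the actual trained model of interest.
model = resnet18()
model.eval()
neuron_index = 42  # hypothetical neuron (channel) we want to understand

activations = {}

def save_activation(module, inputs, output):
    # Forward hook: stash this layer's output every time the model runs.
    activations["layer"] = output.detach()

model.layer4.register_forward_hook(save_activation)

def neuron_score(image_batch):
    # How strongly does the chosen neuron fire for each image in the batch?
    with torch.no_grad():
        model(image_batch)
    return activations["layer"][:, neuron_index].flatten(1).mean(dim=1)

# Rank a stack of (already preprocessed) images by how much they excite the neuron.
images = torch.randn(16, 3, 224, 224)  # stand-ins for real, normalized photos
top = torch.argsort(neuron_score(images), descending=True)[:5]
print("Images that most activate neuron", neuron_index, ":", top.tolist())
```

Inspecting the top-ranked images by hand is what lets researchers guess which concept, if any, a neuron is tracking.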

Similarly, at least some of these image classification models seem to be identifying group identities that are federally protected from discrimination, including male, female, and elderly categories. These findings are very recent, so there is likely still much to learn. Still, they point to new tools for identifying and mitigating algorithmic discrimination.

Multimodal neurons may make it possible to directly, precisely detect whether a model is making decisions on the basis of a characteristic that is federally protected from discrimination, such as sex or race, and to reduce or even eliminate those biases.

The finding is promising in part because researchers identified these neurons in the step right before ML models’ outputs (in the second-to-last column of neurons). If researchers found something similar in neural networks that are applied to other tasks — such as COVID-19 prognosis — this would be very informative: the connection between one of these neurons and the neural network’s output would tell us exactly how categories such as “female” influence an algorithm’s recommendations. 

At its best, that would enable both accountability and improvement. Engineers could look at the connection between such neurons and an algorithm’s output to directly assess how the algorithm uses gender information. They could then tweak these connections so that certain demographic information influences the output in desired ways. For example, engineers could set the connections between a “female” neuron and an algorithm’s output to have zero weight, so that neurons which identify women have zero influence on the algorithm’s final recommendation.
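In code, that kind of targeted intervention could look roughly like the sketch below (Python with PyTorch). The toy model, the layer sizes, the index of the supposed “female” neuron, and the assumption that zeroing one set of weights cleanly removes its influence are all illustrative, not a recipe validated on real prognosis systems.

```python
import torch
import torch.nn as nn

# Toy model: a penultimate "column" of 512 neurons feeding one output score.
model = nn.Sequential(
    nn.Linear(64, 512),
    nn.ReLU(),
    nn.Linear(512, 1),  # final layer: one weight per penultimate neuron
)

female_neuron = 123  # hypothetical index of a neuron found to track "female"

with torch.no_grad():
    final_layer = model[2]
    # Zero the connection from that neuron to the output, so its activation
    # no longer shifts the model's final recommendation.
    final_layer.weight[:, female_neuron] = 0.0

# Everything else in the network is untouched; only that one influence is removed.
print(final_layer.weight[:, female_neuron].item())
```

The appeal of this kind of edit is its precision: it changes one identified pathway rather than retraining or discarding the whole model.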

That wouldn’t be a miracle cure; identifying the relevant neurons could be very time-intensive, or even impossible for ML models that are too simple to have neurons that straightforwardly track relevant demographic categories. Still, it would be a powerful addition to engineers’ arsenals of tools for countering algorithmic bias. Unlike some blunter approaches, this approach zeroes in precisely on the parts of an algorithm that are discriminatory, so it has the potential to more thoroughly eliminate algorithmic bias, while leaving other useful parts of algorithms intact. 14

How Government and Industry Can Help

To help mitigate algorithmic bias in healthcare, researchers should use methods similar to those used by Goh et al.: trying to identify representations of protected categories in a wide range of ML models used in high-stakes applications, modifying existing AI tools so they represent protected group identities in more interpretable ways, and testing ways to address the associated biases while minimizing downsides. 15 They should also seek ways to distinguish between eliminating these biases and merely obfuscating them — removing clear representations of protected categories may, for example, push certain kinds of discrimination into proxies rather than eliminating them.

As a start, governments, businesses, and academic institutions should support such work. The National Science Foundation could issue a “Dear Colleague” letter encouraging and committing to funding high-quality grant proposals of this variety. Advanced research projects agencies could fund additional projects in this area. AI labs could direct more of their researchers to apply these techniques. And research universities could encourage the same. 

Once these techniques have matured enough for best practices to become clear, industry or governments should establish those practices as industry standards. AI developers should proactively refine and implement cost-competitive approaches to identifying and eliminating algorithmic discrimination.

If industry fails to take this initiative, advocates and governments should use refined AI interpretability tools to enforce existing anti-discrimination laws in the context of AI. After all, established civil rights law applies to AI-supported healthcare decisions just as much as it applies to plain old healthcare decisions. 16 Depending on the technical landscape, other regulatory options — like creating a review process for algorithms before they’re implemented in high-risk settings — could also be promising. AI developers might even support such regulation; it could prove to be a low-cost way to increase (deserved) trust in their products. 

Throughout the above, stakeholders should ensure that AI transparency research — not only the creation of case-by-case explanations, although that is also valuable — features prominently in efforts to create trustworthy AI.

As we invest in AI transparency, we should also recognize and address its limitations. The above techniques remain far from shovel-ready, and they will take years to mature; in the meantime, we should take additional, more reliable steps to tackle the immediate problems of algorithmic bias. Following researchers’ recommendations, government and industry actors should improve the quality, accessibility, and especially the diversity of COVID-19 databases. 17 18 They should ensure that AI developers train ML models with such representative data, and that engineers use promising practices that have already been tried for reducing algorithmic bias (such as testing algorithms for bias before use and modifying the parts that most contribute to unfair outcomes). 19 On top of this, relevant employers should diversify the staff that creates and uses these algorithms.
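As one example of the kind of already-available bias test mentioned above, developers can compare a model’s error rates across demographic groups before deployment. The sketch below (Python with NumPy) computes false negative rates by group for hypothetical predictions; the data are made up, and what counts as an acceptable gap is a policy judgment, not something the code settles.

```python
import numpy as np

def false_negative_rate_by_group(y_true, y_pred, groups):
    # Fraction of truly positive cases the model misses, per demographic group.
    rates = {}
    for g in np.unique(groups):
        positives = (groups == g) & (y_true == 1)
        if positives.sum() == 0:
            continue  # no positive cases for this group in the audit data
        rates[str(g)] = float(np.mean(y_pred[positives] == 0))
    return rates

# Hypothetical audit data: true labels, model predictions, and group membership.
y_true = np.array([1, 1, 0, 1, 1, 0, 1, 0])
y_pred = np.array([1, 0, 0, 1, 0, 0, 1, 1])
groups = np.array(["A", "A", "A", "B", "B", "B", "B", "B"])

print(false_negative_rate_by_group(y_true, y_pred, groups))
# Large gaps between groups are a red flag worth investigating before deployment.
```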

Beyond AI in COVID-19 Prognosis

The technological and institutional tools we develop to address algorithmic bias in healthcare could serve as a model and inspiration for progress on other AI issues. Similar techniques could help address algorithmic bias in other AI applications, such as job application screening software. In addition, better understanding the inner workings of ML models could help us predict when some ML model will fail to generalize what it has learned to a new environment. And more tentatively, AI transparency tools might help us directly access information that is implicitly stored in ML models — learn what machine learning models have learned, and use that knowledge to help solve problems.

As Chief Nursing Officer McCabe said, the clock is ticking. AI is likely to continue playing increasingly influential roles in healthcare decisions and other high-stakes areas. Whether we gain the insight into ML models to make this influence fair and beneficial is up to us.

