
Machine Learning Misses the Mark on Equality

Robert Williams had just pulled into his Farmington Hills driveway after another long day’s work, relieved to spend time with his wife and two daughters. Nothing seemed amiss.

But something was very wrong. A police car waiting down the street inched forward, trapping Williams against his own house. Two officers approached, cuffing him in front of his family as they cried out in bewilderment and distress. His wife’s pleas to learn where he was being taken went unanswered, though one of the officers snarkily replied, “Google it.”

Williams was driven to a nearby detention center, where law enforcement took his mugshot, fingerprints, and DNA. He was held overnight, interrogated by two detectives, and assigned a court date. But, once in court, prosecutors dropped the charges. His arrest, they admitted, was based on “insufficient evidence.”

Later, state police revealed their “insufficient evidence” was the product of a facial recognition machine learning algorithm. It had analyzed grainy video footage from a robbed Detroit retail store and determined that Williams was the offender, even though Williams recalled he looked nothing like the robber.

Williams’s wrongful arrest may be the first of its kind, but it won’t be the last. One in four law enforcement agencies has unregulated access to facial recognition software, and more than half of Americans are in police facial recognition databases. It also isn’t surprising that Williams, a Black man, was the first case; according to the National Institute of Standards and Technology, a majority of facial recognition software is racially biased.

To be fair to the Michigan police who arrested Williams, it seems outlandish that machine learning could be racist.

Machine learning algorithms are just lines of computer code that minimize error between values in a dataset—called training data—to generate predictive models.


Error is calculated however the coder wants it to be, and the code runs until error reaches zero or the coder instructs it to stop.

For instance, one form of error used in machine learning algorithms is the difference between a value and an average value. This allows programmers to quantify the error of entire datasets. For every value in a dataset, an error can be squared and summed with the other errors, producing a sum of squared errors (SSE).
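As a minimal sketch of the SSE described above (the function name and the sample values are illustrative assumptions, not taken from any real dataset), the calculation might look like this:

```python
def sum_of_squared_errors(values):
    """Sum of squared differences between each value and the dataset mean."""
    mean = sum(values) / len(values)
    return sum((v - mean) ** 2 for v in values)

# Hypothetical values chosen only to make the arithmetic easy to follow.
crime_rates = [2.0, 4.0, 6.0, 8.0]
print(sum_of_squared_errors(crime_rates))  # 20.0
```

The mean here is 5, so the squared errors are 9, 1, 1, and 9, which sum to 20.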

Some machine learning algorithms minimize dataset error by forming smaller subsets and recalculating the SSE for each.

Adding the resulting SSEs allows the algorithm to generate a total SSE unique to each split. The split that produces the smallest total SSE is considered the best and is recorded for later.

Imagine using a machine learning model to analyze a dataset describing crime in North Carolina. If we fed the variable “population density” to a machine learning algorithm, it would split the data into different groups for every density value until it found the split that minimized the total SSE.
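The search for the best split on a single variable can be sketched as follows. This is an illustration under stated assumptions: the density and crime numbers are invented, and candidate thresholds are placed halfway between neighboring density values, which is one common convention rather than the only one.

```python
def sse(values):
    """Sum of squared errors around the mean; an empty subset has no error."""
    if not values:
        return 0.0
    mean = sum(values) / len(values)
    return sum((v - mean) ** 2 for v in values)

def best_split(xs, ys):
    """Return (threshold, total_sse) for the split on xs with the lowest total SSE."""
    best = None
    candidates = sorted(set(xs))
    for lo, hi in zip(candidates, candidates[1:]):
        threshold = (lo + hi) / 2
        left = [y for x, y in zip(xs, ys) if x <= threshold]
        right = [y for x, y in zip(xs, ys) if x > threshold]
        total = sse(left) + sse(right)
        if best is None or total < best[1]:
            best = (threshold, total)
    return best

# Hypothetical data: population density and crime rate for six counties.
density = [1.0, 2.0, 3.0, 10.0, 11.0, 12.0]
crime = [5.0, 6.0, 7.0, 20.0, 21.0, 22.0]
print(best_split(density, crime))  # (6.5, 4.0)
```

Splitting at a density of 6.5 separates the low-crime and high-crime counties, leaving an SSE of 2 in each subset.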

But the dataset includes other variables, such as year, probability of arrest, and police officers per capita. Machine learning algorithms account for the different variables in training data by minimizing the total SSE for each one, then selecting the best split out of all the variables.

The algorithm repeatedly forms subsets and splits the data within each one until error reaches zero or a set endpoint. The best splits are organized into a model that can be fed new data to predict outcomes.
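Putting these steps together, a toy regression tree might be sketched like this. The variable names, the four data rows, and the `max_depth` endpoint are all assumptions made for illustration; real CART implementations add pruning and other refinements that are omitted here.

```python
def sse(values):
    if not values:
        return 0.0
    mean = sum(values) / len(values)
    return sum((v - mean) ** 2 for v in values)

def best_split(rows, targets, features):
    """Find the (total_sse, feature, threshold) with the lowest total SSE."""
    best = None
    for feature in features:
        values = sorted({row[feature] for row in rows})
        for lo, hi in zip(values, values[1:]):
            threshold = (lo + hi) / 2
            left = [t for r, t in zip(rows, targets) if r[feature] <= threshold]
            right = [t for r, t in zip(rows, targets) if r[feature] > threshold]
            total = sse(left) + sse(right)
            if best is None or total < best[0]:
                best = (total, feature, threshold)
    return best

def build_tree(rows, targets, features, max_depth=2):
    """Split recursively until error is zero, no split exists, or depth runs out."""
    split = best_split(rows, targets, features)
    if max_depth == 0 or sse(targets) == 0 or split is None:
        return sum(targets) / len(targets)  # leaf: predict the subset mean
    _, feature, threshold = split
    left = [(r, t) for r, t in zip(rows, targets) if r[feature] <= threshold]
    right = [(r, t) for r, t in zip(rows, targets) if r[feature] > threshold]
    return {
        "feature": feature,
        "threshold": threshold,
        "left": build_tree([r for r, _ in left], [t for _, t in left], features, max_depth - 1),
        "right": build_tree([r for r, _ in right], [t for _, t in right], features, max_depth - 1),
    }

def predict(tree, row):
    """Follow the recorded splits down to a leaf and return its mean."""
    while isinstance(tree, dict):
        branch = "left" if row[tree["feature"]] <= tree["threshold"] else "right"
        tree = tree[branch]
    return tree

# Hypothetical rows; the target is an invented crime rate.
rows = [
    {"density": 1.0, "police_per_capita": 0.002},
    {"density": 2.0, "police_per_capita": 0.004},
    {"density": 10.0, "police_per_capita": 0.002},
    {"density": 11.0, "police_per_capita": 0.004},
]
crime = [5.0, 7.0, 20.0, 22.0]
tree = build_tree(rows, crime, ["density", "police_per_capita"])
print(tree["feature"], tree["threshold"])  # density 6.0
print(predict(tree, {"density": 12.0, "police_per_capita": 0.002}))  # 22.0
```

In this toy data, density explains far more of the variation than police per capita, so the first split lands on density.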

These complex calculations have no obvious racist influences; machine learning algorithms themselves can’t discriminate. However, the data that programmers feed to algorithms can train them to.

Machine learning models have no ethical code. If predicting an outcome by race minimizes error and improves predictive accuracy, the algorithm will do it. The same is true when predicting by sex, religion, nose length, or any characteristic.


In practice, this means that demographically skewed datasets can train discriminatory algorithms. It doesn’t matter whether people use machine learning for facial recognition or medical diagnosis—biased training data creates biased models.

For instance, imagine a Boston-based technology company, Boston SuperTech, using machine learning algorithms for hiring. They use a database of employees’ demographic information, ratings of their old résumés, and their time spent at the company to train a basic Classification and Regression Tree (CART) model. Then they leverage the CART to predict which incoming job applicants are likely to spend at least ten years at the company.

None of this sounds unreasonable on its face. Maybe staying at the company appeals to some groups of people more than others, or maybe people with stronger résumés are more likely to climb the corporate ladder.

But if Boston SuperTech applied this algorithm in hiring, nearly every minority applicant would be turned away. The model racially discriminates by using ZIP code as a stand-in for race. ZIP codes 02124, 02125, and 02111 are the only majority-minority districts represented in Boston SuperTech’s employee database. The CART model would classify applicants from all of them as unlikely long-term hires.

Minorities have been underrepresented among computer science majors for over two decades, so most of Boston SuperTech’s previous long-term hires were White. When Boston SuperTech gave the CART algorithm their training data, the algorithm took advantage of the indirect association between race and corporate loyalty to improve predictive accuracy.
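The proxy effect can be illustrated with a deliberately simplified stand-in for the hypothetical CART: a rule that looks only at ZIP code. Everything here is an assumption made for the sketch, including the record counts, the race labels, and the `"other"` placeholder for all remaining ZIP codes. Race is stored only so the outcome can be inspected afterward; the rule never reads it.

```python
from collections import defaultdict

# Hypothetical training records: (zip_code, race, stayed_ten_years).
training = (
    [("other", "White", True)] * 8
    + [("other", "White", False)] * 2
    + [("02124", "Black", False)] * 2
    + [("02125", "Black", True)] * 1
    + [("02125", "Black", False)] * 2
    + [("02111", "Asian", False)] * 1
)

def fit_zip_rule(records):
    """Learn, per ZIP code, whether at least half of past employees stayed ten years."""
    counts = defaultdict(lambda: [0, 0])  # zip_code -> [stayed, total]
    for zip_code, _race, stayed in records:
        counts[zip_code][0] += int(stayed)
        counts[zip_code][1] += 1
    return {z: stayed / total >= 0.5 for z, (stayed, total) in counts.items()}

rule = fit_zip_rule(training)

# Hypothetical applicants; race is shown only to audit the predictions.
applicants = [("other", "White"), ("02124", "Black"), ("02125", "Black"), ("02111", "Asian")]
for zip_code, race in applicants:
    print(zip_code, race, "predicted long-term hire:", rule[zip_code])
```

Although the rule is given no racial information, every applicant from the three majority-minority ZIP codes is predicted to be an unlikely long-term hire.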

Just like in Robert Williams’s case, machine learning translated neutral intentions into blatantly unjust outcomes. Seemingly reasonable corporate practices, like using machine learning to predict corporate loyalty, can embed discrimination in hiring.

In this hypothetical, Boston SuperTech could generate a visual output of their CART model and discover that it uses ZIP code as a proxy for race.

However, scientists can’t visualize most machine learning models as easily. More advanced modeling methods, like random forests, are more accurate than simple CART-like models but are much less interpretable.

Random forests predict outputs by aggregating hundreds to thousands of individual decision trees’ predictions to generate a “forest” of trees. But this makes it extremely difficult to understand why a random forest predicts the way it does. One random forest can generate thousands of decision trees, each with its own set of rules. And random forest decision trees are generated differently than CARTs, making them orders of magnitude more complex.
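To make the aggregation idea concrete, here is a heavily simplified sketch. It is not a full random forest: each "tree" is a single split (a stump) on one randomly chosen variable, fitted to a random resample of the data, and the forest predicts by majority vote. The data, function names, tree count, and seed are all assumptions for illustration.

```python
import random
from collections import Counter

def fit_stump(rows, labels, feature):
    """One-split tree on one feature: pick the threshold with the fewest mistakes."""
    best = None
    values = sorted({r[feature] for r in rows})
    for lo, hi in zip(values, values[1:]):
        threshold = (lo + hi) / 2
        left = [l for r, l in zip(rows, labels) if r[feature] <= threshold]
        right = [l for r, l in zip(rows, labels) if r[feature] > threshold]
        left_label = Counter(left).most_common(1)[0][0]
        right_label = Counter(right).most_common(1)[0][0]
        mistakes = sum(l != left_label for l in left) + sum(l != right_label for l in right)
        if best is None or mistakes < best[0]:
            best = (mistakes, threshold, left_label, right_label)
    if best is None:  # the resample held a single value for this feature
        majority = Counter(labels).most_common(1)[0][0]
        return (feature, float("inf"), majority, majority)
    return (feature,) + best[1:]

def fit_forest(rows, labels, n_trees=25, seed=0):
    """Fit each stump on a bootstrap resample and a randomly chosen feature."""
    rng = random.Random(seed)
    forest = []
    for _ in range(n_trees):
        idx = [rng.randrange(len(rows)) for _ in rows]
        feature = rng.randrange(len(rows[0]))
        forest.append(fit_stump([rows[i] for i in idx], [labels[i] for i in idx], feature))
    return forest

def predict(forest, row):
    """Majority vote across all stumps."""
    votes = Counter(left if row[f] <= t else right for f, t, left, right in forest)
    return votes.most_common(1)[0][0]

# Hypothetical data: two numeric variables and a yes/no outcome.
rows = [(1, 10), (2, 11), (3, 12), (8, 20), (9, 21), (10, 22)]
labels = ["no", "no", "no", "yes", "yes", "yes"]
forest = fit_forest(rows, labels)
print(len(forest), "stumps; prediction:", predict(forest, (9.5, 21.5)))
```

Even in this toy version, explaining a prediction means reading 25 separate rules. A real forest has far more trees, and each tree has many splits.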

To figure out if a random forest is discriminating the same way Boston SuperTech’s CART model did, a scientist would need to follow many rules and outputs for thousands of individual trees. It’s simply not feasible.

So scientists generate databases designed to test algorithm fairness, then run them through allegedly discriminatory models. If outputs differ from expected nondiscriminatory results, then a model is probably discriminating. But creating a database and accessing a company’s modeling software aren’t easy.
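A fairness test of this kind can be sketched by treating the model as a black box and comparing how often it says yes for each group. The stand-in model, the group labels, and the test cases below are hypothetical; in a real audit the model would be the company's own software and the database would be far larger.

```python
def selection_rates(model, test_cases):
    """Share of positive predictions per group for a black-box model."""
    totals, positives = {}, {}
    for features, group in test_cases:
        totals[group] = totals.get(group, 0) + 1
        positives[group] = positives.get(group, 0) + int(model(features))
    return {g: positives[g] / totals[g] for g in totals}

def suspect_model(features):
    """Hypothetical stand-in for a model under audit; it keys on ZIP code."""
    return features["zip"] not in {"02124", "02125", "02111"}

# Test database: matched applicants who differ only in ZIP code.
test_cases = [
    ({"resume_score": score, "zip": zip_code}, group)
    for score in (6, 7, 8, 9)
    for zip_code, group in (("other", "majority"), ("02124", "minority"))
]

rates = selection_rates(suspect_model, test_cases)
gap = max(rates.values()) - min(rates.values())
print(rates, "gap:", gap)  # {'majority': 1.0, 'minority': 0.0} gap: 1.0
```

Because the matched applicants are otherwise identical, a nondiscriminatory model should produce a gap near zero. A large gap suggests the model is keying on something that tracks group membership.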

It has taken scientists months or years to root out discriminatory hiring, ad delivery, and facial recognition models created by Amazon, Facebook, and Google, respectively. This difficulty is why scientists will probably never know just how many discriminatory algorithms exist.

More than half of hiring managers and recruiters report that artificial intelligence—a broad term used to describe machine learning concepts—is helpful in sourcing and screening job candidates. Yet the effort needed to uncover bias means that only a fraction of companies’ algorithms will be investigated.


THE DATA DILEMMA PART TWO: A MINI EXPERIMENT

To demonstrate the ease with which machine learning can discriminate, I created my own facial recognition random forest model and tested it with images of White and Black faces.

My model identifies faces by reading the spatial distribution of color across an image and comparing it to images I’ve told it are faces. If the color distributions of a new image are similar enough to one it already knows is a face, then it predicts that the new image is a face as well.
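The comparison step can be sketched in a few lines. This is not the random forest used in the experiment: it swaps in a plain nearest-match rule on grid-averaged brightness, and the tiny 4x4 grayscale "images," the grid size, and the distance cutoff are all invented for illustration.

```python
def grid_features(image, cells=2):
    """Mean brightness of each cell in a cells x cells grid laid over the image."""
    height, width = len(image), len(image[0])
    features = []
    for gy in range(cells):
        for gx in range(cells):
            block = [
                image[y][x]
                for y in range(gy * height // cells, (gy + 1) * height // cells)
                for x in range(gx * width // cells, (gx + 1) * width // cells)
            ]
            features.append(sum(block) / len(block))
    return features

def looks_like_face(image, known_faces, max_distance=40.0):
    """Predict 'face' if the image is close enough to any known face."""
    features = grid_features(image)
    for face in known_faces:
        reference = grid_features(face)
        distance = sum((a - b) ** 2 for a, b in zip(features, reference)) ** 0.5
        if distance <= max_distance:
            return True
    return False

# Hypothetical 4x4 grayscale images (0 = black, 255 = white).
known = [[200, 200, 190, 190], [200, 200, 190, 190],
         [180, 180, 170, 170], [180, 180, 170, 170]]
similar = [[v + 5 for v in row] for row in known]   # nearly the same brightness
darker = [[v - 120 for v in row] for row in known]  # same pattern, much darker

print(looks_like_face(similar, [known]))  # True
print(looks_like_face(darker, [known]))   # False
```

The darker image has exactly the same spatial pattern, but it falls outside the cutoff because nothing like it appears among the known faces.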

For the purposes of this experiment, I trained my model only on White faces. Because White and Black faces have different color distributions, I hypothesized that my model would recognize new White faces more often than it would recognize new Black faces.

As expected, when I exposed my model to fifteen new White faces and fifteen new Black faces, it recognized 87 percent of the White faces, but only 33 percent of the Black faces. A chi-squared test confirmed that my model’s facial recognition depends on race (p=0.003).
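The test can be checked by hand. The sketch below assumes the reported percentages correspond to 13 of 15 White faces and 5 of 15 Black faces recognized (the nearest whole counts), and it assumes a Pearson chi-squared test without a continuity correction, since that choice reproduces the reported p-value.

```python
import math

def chi_squared_2x2(a, b, c, d):
    """Pearson chi-squared (no continuity correction) for the table [[a, b], [c, d]]."""
    n = a + b + c + d
    chi2 = n * (a * d - b * c) ** 2 / ((a + b) * (c + d) * (a + c) * (b + d))
    # With one degree of freedom, the upper-tail probability reduces to erfc.
    p_value = math.erfc(math.sqrt(chi2 / 2))
    return chi2, p_value

# Rows: White faces, Black faces. Columns: recognized, not recognized.
# Counts are inferred from the rounded percentages, not stated in the text.
chi2, p = chi_squared_2x2(13, 2, 5, 10)
print(round(chi2, 2), round(p, 3))  # 8.89 0.003
```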

Of course, in the real world, facial recognition models aren’t trained just on White faces.

However, any disparity in the racial makeup of training data could result in discrimination, even slight disparities that aren’t obvious to the human eye. If my model were more accurate, it would recognize faces according to smaller differences in skin tone. These more detailed predictions would be more difficult to test for discrimination; testing databases would need to include many more than thirty images and each image set would need to be categorized by more specific criteria.

Regardless, my experiment shows that if machine learning models are trained on the wrong dataset, their predictions will be biased, even if the code itself isn't.

Machine learning is extraordinarily useful. It can be applied in all manner of scientific research, from ecological surveys to biological protein modeling. Doing away with it entirely would be disastrous.

But in many social, political, and economic contexts, machine learning just isn’t the right tool. Its very nature is to decide outcomes according to broad group trends in whatever time period its training data is from.

For instance, if a machine learning model were created in 1970 to help colleges admit biology majors using data from the past five years, it would probably recommend they admit mostly men. If an identical model were created in 2008, it would probably recommend they admit mostly women. This is because between 1965 and 1970, approximately 30 percent of biology majors were women, while between 2003 and 2008, about 60 percent were.

And blatant bigotry pervades many historical group trends. Facebook’s discriminatory ad delivery program learned to show “home for sale” ads to White users because minorities were less likely to engage with these types of ads, but this pattern of behavior can be traced back to slavery, redlining, and discriminatory lending. These practices stymied minority wealth accumulation, reducing their home-buying potential today.

Outside of scientific research, machine learning should be applied sparingly. Corporations and government agencies must be required by law to prove that their algorithms do not discriminate, or else be forbidden from using them at all. And even when implemented, institutions should interpret model outputs in the context of other data before making decisions.

For now, people are better decision-makers than algorithms.
