Machine learning is increasingly being used to predict individuals’ attitudes, behaviors, and preferences across an array of applications — from personalized marketing to precision medicine. Unsurprisingly, given the speed of change and ever-increasing complexity, there have been several recent high-profile examples of “machine learning gone wrong.”
A chatbot trained using Twitter was shut down after only a single day because of its obscene and inflammatory tweets. Machine learning models used in a popular search engine struggled to differentiate human images from those of gorillas, and showed female searchers ads for lower-paying jobs than those shown to male users. More recently, a study compared the commonly used crime risk analysis tool COMPAS against recidivism predictions from 400 untrained workers recruited via Amazon Mechanical Turk. The results suggest that COMPAS has learned implicit racial biases, causing it to be less accurate than the novice human predictors.
When models don’t perform as intended, people and process are normally to blame. Bias can manifest itself in many forms across various stages of the machine learning process, including data collection, data preparation, modeling, evaluation, and deployment. Sampling bias may produce models trained on data that is not fully representative of future cases. Performance bias can exaggerate perceptions of predictive power, generalizability, and performance homogeneity across data segments. Confirmation bias can cause information to be sought, interpreted, emphasized, and remembered in a way that confirms preconceptions. Anchoring bias may lead to over-reliance on the first piece of information examined. So how can we mitigate bias in machine learning?
In our federally funded project (with Rick Netemeyer, David Dobolyi, and Indranil Bardhan), we are developing a patient-centric mobile/IoT platform for those at early risk of cardiovascular disease in the Stroke Belt, a region spanning the southeastern United States where the incidence rates for stroke are 25% to 40% higher than the national average. As part of the project, we built machine learning models based on various types of unstructured inputs, including user-generated text and telemetric and sensor-based data. One critical component of the project involved developing deep learning text analytics models to infer psychometric dimensions (such as measures of numeracy, literacy, trust, and anxiety), which have been shown to have a profound impact on health outcomes including wellness, future doctor visits, and adherence to treatment regimens. The idea is that if a doctor could know that a patient was, for example, skeptical of the health profession, they could tailor their care to overcome that lack of trust. Our models predict these psychometric dimensions based on the data we collected.
Given that cardiovascular disease disproportionately affects health-disparate populations, we knew that removing racial, gender, and socio-economic biases from our text analytics models would be vitally important. Borrowing from the concept of “privacy by design” popularized by the European Union’s General Data Protection Regulation (GDPR), we employed a “fairness by design” strategy encompassing a few key facets. Companies and data scientists looking to similarly design for fairness can take the following steps:
1. Pair data scientists with a social scientist. Data scientists and social scientists speak somewhat different languages. To a data scientist, “bias” has a particular technical meaning: it usually refers to systematic error in a model’s predictions rather than to unfair treatment of people. Similarly, the term “discriminatory potential” refers to the extent to which a model can accurately differentiate classes of data (e.g., patients at high versus low risk of cardiovascular disease). In data science, greater “discriminatory potential” is a primary goal. By contrast, when social scientists talk about bias or discrimination, they’re more likely to be referring to questions of equity. Social scientists are generally better equipped to provide a humanistic perspective on fairness and bias.
In our Stroke Belt project, from the start, we made sure to include psychologists, psychometricians, epidemiologists, and folks specialized in dealing with health-disparate populations. This allowed us to have a better awareness of demographic biases that might creep into the machine learning process.
2. Annotate with caution. Unstructured data such as text and images is often labeled by human annotators, who provide structured category labels that are then used to train machine learning models. For instance, annotators can label images containing people, or mark which texts contain positive versus negative sentiments.
Human annotation services have become a major business model, with numerous platforms emerging at the intersection of crowd-sourcing and the gig economy. Although the quality of annotation is adequate for many tasks, human annotation is inherently prone to a plethora of culturally ingrained biases.
In our project, we anticipated that annotation might introduce bias into our models. For example, given two individuals with similar levels of health numeracy, one is much more likely to be scored lower by annotators if their writing contains misspellings or grammatical mistakes. This can cause biases to seep into the trained models, such as overemphasizing the importance of misspellings relative to more substantive cues when predicting health numeracy.
One effective approach we have found is to include potential bias cases in annotator training modules to increase awareness. However, in the Stroke Belt project, we circumvented annotation entirely, instead relying on self-reported data. While this approach is not always feasible, and may come with its share of issues, it allowed us to avoid annotation-related racial biases.
3. Combine traditional machine learning metrics with fairness measures. The performance of machine learning classification models is typically measured using a small set of well-established metrics that focus on overall performance, class-level performance, and all-around model generalizability. However, these can be augmented with fairness measures designed to quantify machine learning bias. Such key performance indicators are essential for garnering situational awareness; as the saying goes, “if it cannot be measured, it cannot be improved.” In the recidivism prediction study mentioned earlier, researchers using fairness measures noted that existing models were heavily skewed in their risk assessments for certain groups.
In our project, we examined model performance within various demographic segments, as well as underlying model assumptions, to identify demographic segments with higher susceptibility to bias in our context. The fairness measures we incorporated included within- and across-segment true positive, false positive, true negative, and false negative rates, as well as the level of reliance on demographic variables. Segments with disproportionately higher false positive or false negative rates might be prone to over-generalizations. For segments with seemingly fair outcomes at present, if demographic variables are weighted heavily relative to others and act as primary drivers of predictions, there might be potential for susceptibility to bias in future data.
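To make the within-segment error rates concrete, here is a minimal sketch of how false positive and false negative rates could be computed per demographic segment. The function name `segment_rates`, the toy labels, and the segment names "A" and "B" are illustrative assumptions, not part of the Stroke Belt project.

```python
from collections import defaultdict

def segment_rates(y_true, y_pred, segments):
    """Return false positive and false negative rates per segment.

    Assumes binary labels coded as 0 (negative) and 1 (positive).
    """
    counts = defaultdict(lambda: {"tp": 0, "fp": 0, "tn": 0, "fn": 0})
    for truth, pred, seg in zip(y_true, y_pred, segments):
        if truth == 1 and pred == 1:
            counts[seg]["tp"] += 1
        elif truth == 0 and pred == 1:
            counts[seg]["fp"] += 1
        elif truth == 0 and pred == 0:
            counts[seg]["tn"] += 1
        else:  # truth == 1 and pred == 0
            counts[seg]["fn"] += 1

    rates = {}
    for seg, c in counts.items():
        negatives = c["fp"] + c["tn"]
        positives = c["tp"] + c["fn"]
        rates[seg] = {
            # None marks a rate that is undefined for this segment.
            "fpr": c["fp"] / negatives if negatives else None,
            "fnr": c["fn"] / positives if positives else None,
        }
    return rates

# Hypothetical toy data: the model is perfect on segment A, poor on segment B.
y_true   = [1, 0, 1, 0, 1, 0, 1, 0]
y_pred   = [1, 0, 1, 0, 0, 1, 1, 1]
segments = ["A", "A", "A", "A", "B", "B", "B", "B"]

rates = segment_rates(y_true, y_pred, segments)
for seg in sorted(rates):
    print(seg, rates[seg])
```

An overall accuracy figure would hide the gap shown here: the errors are concentrated entirely in segment B, which is the kind of disparity a per-segment breakdown is meant to surface.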
4. When sampling, balance representativeness with critical mass constraints. For data sampling, the age-old mantra has been to ensure that samples are statistically representative of the future cases that a given model is likely to encounter. This is generally a good practice. The one issue with representativeness is that it undervalues minority cases, those that are statistically less common. While on the surface this seems intuitive and acceptable (there are always going to be more- and less-common cases), issues arise when certain demographic groups are statistical minorities in your dataset. Essentially, machine learning models are incentivized to learn patterns that apply to large groups, in order to become more accurate, meaning that if a particular group isn’t well represented in your data, the model will not prioritize learning about it. In our project, we had to significantly oversample cases related to certain demographic groups in order to ensure that we had a critical mass of training samples necessary to meet our fairness measures.
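The article does not say how the oversampling was carried out. One common option, sketched below under that caveat, is to resample existing rows with replacement until every group reaches a minimum count; collecting additional real cases from under-represented groups is a different and often preferable route. The function name `oversample_to_minimum`, the `group` field, and the threshold of 5 are hypothetical.

```python
import random
from collections import Counter, defaultdict

def oversample_to_minimum(records, group_key, min_per_group, seed=0):
    """Top up each group to at least min_per_group rows.

    Extra rows are drawn with replacement from the group's own records,
    so they are duplicates rather than new information.
    """
    rng = random.Random(seed)  # fixed seed keeps the sketch reproducible
    by_group = defaultdict(list)
    for rec in records:
        by_group[rec[group_key]].append(rec)

    result = []
    for rows in by_group.values():
        result.extend(rows)
        shortfall = min_per_group - len(rows)
        if shortfall > 0:
            result.extend(rng.choices(rows, k=shortfall))
    return result

# Hypothetical toy dataset: 8 majority rows, 2 minority rows.
records = (
    [{"group": "majority", "text_id": i} for i in range(8)]
    + [{"group": "minority", "text_id": 100 + i} for i in range(2)]
)

balanced = oversample_to_minimum(records, "group", min_per_group=5)
group_counts = Counter(rec["group"] for rec in balanced)
print(group_counts)
```

Groups already above the threshold are left untouched, so the majority sample stays representative while the minority group reaches the chosen critical mass. Oversampling should be applied to training data only, so that evaluation data still reflects the cases the model will actually see.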
5. When building a model, keep de-biasing in mind. Even with the aforementioned steps, de-biasing during the model building and training phase is often necessary. Several tactics have been proposed. One approach is to completely strip the training data of any demographic cues, explicit and implicit. In the recidivism prediction study discussed earlier, the novice human predictors weren’t provided with any race information. Another approach is to build fairness measures into the model’s training objectives, for instance, by “boosting” the importance of certain minority or edge cases.
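One simple way to “boost” minority cases inside a training objective is to weight each example inversely to its group’s frequency. The sketch below is an assumption about how that could look, not a description of any specific published method; `inverse_frequency_weights` and `weighted_error` are illustrative names.

```python
from collections import Counter

def inverse_frequency_weights(groups):
    """Weight each example by n / (k * group_size).

    n is the number of examples and k the number of groups, so every
    group contributes the same total weight and the weights sum to n.
    """
    counts = Counter(groups)
    n, k = len(groups), len(counts)
    return [n / (k * counts[g]) for g in groups]

def weighted_error(y_true, y_pred, weights):
    """Share of total weight that falls on misclassified examples."""
    wrong = sum(w for t, p, w in zip(y_true, y_pred, weights) if t != p)
    return wrong / sum(weights)

# Hypothetical toy data: every mistake falls on the small group B.
groups = ["A"] * 8 + ["B"] * 2
y_true = [1] * 10
y_pred = [1] * 8 + [0, 0]

weights = inverse_frequency_weights(groups)
plain = weighted_error(y_true, y_pred, [1.0] * len(groups))
boosted = weighted_error(y_true, y_pred, weights)
print(plain, boosted)
```

Under the unweighted objective, failing on all of group B costs only 20% error; with the weights it costs 50%, so a model optimized against the weighted objective has a much stronger incentive to learn patterns for the smaller group.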
In our project, we found that it was helpful to train our models within demographic segments algorithmically identified as being highly susceptible to bias. For example, if segments A and B are prone to superfluous generalizations (as quantified by our fairness measures), learning patterns within these segments provides some semblance of demographic homogeneity and alleviates majority/minority sampling issues, thereby forcing the models to learn alternative patterns. In our case, this approach not only enhanced fairness measures markedly (by 5% to 10% for some segments), but also boosted overall accuracy by a couple of percentage points.
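The idea of training within flagged segments can be illustrated with a deliberately tiny stand-in. The project used deep learning text models and identified segments algorithmically; the sketch below instead assumes a one-feature threshold classifier and a hand-supplied list of flagged segments, purely to show the routing logic. All names (`fit_threshold`, `fit_segmented`, `predict`, `GLOBAL`) are hypothetical.

```python
GLOBAL = "__global__"  # key for the model trained on all data

def fit_threshold(xs, ys):
    """Toy learner: choose the cut-off with the fewest training errors.

    The resulting model predicts 1 when x >= cut-off, otherwise 0.
    """
    best_cut, best_errors = None, None
    for cut in sorted(set(xs)):
        errors = sum((x >= cut) != bool(y) for x, y in zip(xs, ys))
        if best_errors is None or errors < best_errors:
            best_cut, best_errors = cut, errors
    return best_cut

def fit_segmented(xs, ys, segments, flagged):
    """Fit one global model plus one model per flagged segment."""
    models = {GLOBAL: fit_threshold(xs, ys)}
    for seg in flagged:
        idx = [i for i, s in enumerate(segments) if s == seg]
        models[seg] = fit_threshold([xs[i] for i in idx],
                                    [ys[i] for i in idx])
    return models

def predict(models, x, segment):
    """Use the segment's own model if it has one, else the global model."""
    cut = models.get(segment, models[GLOBAL])
    return int(x >= cut)

# Hypothetical toy data: segment A (majority) flips to 1 at x = 5,
# segment B (minority) flips at x = 3.
xs       = [1, 2, 3, 4, 5, 6, 7, 8, 2, 3]
ys       = [0, 0, 0, 0, 1, 1, 1, 1, 0, 1]
segments = ["A"] * 8 + ["B"] * 2

models = fit_segmented(xs, ys, segments, flagged=["B"])
print(models)
print(predict(models, 3, "A"), predict(models, 3, "B"))
```

In this toy case the global model adopts the majority pattern and misclassifies the minority case at x = 3, while the model fitted within segment B recovers that segment’s own pattern. Whether such a split helps on real data depends on having enough samples per segment, which is where the critical-mass sampling step comes in.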
A few months back, we were at a conference where the CEO of a major multinational lamented “the principle of precaution overshadowing the principle of innovation.” This is a concern voiced within C-suites and machine learning groups worldwide, with regard to both privacy and bias. But fairness by design isn’t about prioritizing political correctness above model accuracy. With careful consideration, it can allow us to develop high-performing models that are accurate and conscionable. Buying into the idea of fairness by design entails examining different parts of the machine learning process from alternative vantage points, using competing theoretical lenses. In our Stroke Belt project, we were able to develop models with higher overall performance, greater generalizability across various demographic segments, and enhanced model stability, potentially making it easier for the health care system to match the right person with the right intervention in a timely manner.
By making fairness a guiding principle in machine learning projects, we didn’t just build fairer models — we built better ones, too.