The data used to train AIs reflects society’s stereotypes and biases, for example against underrepresented groups. Keeping sensitive data, such as health data, private while also ensuring that models are not biased requires adapting training methods.
Several scandals have erupted in recent years involving decision-making systems based on artificial intelligence (AI) that produce racist or sexist results.
This was the case, for example, with Amazon’s recruitment tool that exhibited bias against women, or with the care-management system used in an American hospital that systematically favored white patients over Black patients. In response to the problem of bias in AI and machine learning algorithms, legislation has been proposed, such as the AI Act in the European Union or the National AI Initiative Act in the United States.
A widely used argument about the presence of bias in AI and machine learning models is that they simply reflect a ground truth: the biases are already present in real-world data. For example, data from patients with a disease that specifically affects men yields an AI that is biased with respect to women, but that AI is not necessarily incorrect.
While this argument is valid in some cases, there are many others where the data was collected incompletely and does not reflect the diversity of reality on the ground, or where statistically rare cases end up underrepresented, or even absent, in machine learning models. This is what happened with Amazon’s recruiting tool, which exhibited a bias against women: because women working in the sector are statistically few in number, the resulting AI simply rejected female applications.
What if, rather than reflecting or even exacerbating a current dysfunctional reality, AI could be virtuous and serve to correct biases in society, for a more inclusive society? This is what researchers are proposing with a new approach: “federated learning”.
Towards decentralized AI
AI-based decision support systems are data-driven. In traditional machine learning approaches, data from multiple sources must first be fed to a repository (e.g., a cloud server) that centralizes them, before running a machine learning algorithm on this centralized data.
But this raises data protection issues. Indeed, under current legislation, a hospital does not have the right to hand over its patients’ sensitive medical data to a third party, nor does a bank have the right to share its customers’ private banking transaction information.
Therefore, to better preserve data confidentiality in AI systems, researchers are developing approaches based on so-called “distributed” AI, where the data remains on the sites that hold it and the machine learning algorithms run in a distributed manner across these different sites – an approach also known as “federated learning”.
Concretely, each data owner (a participant in the federated learning) trains a local model on its own data, then transmits its local model’s parameters to a third-party entity, which aggregates the parameters of all the local models (for example, via an average weighted by each participant’s data volume). This entity then produces a global model that the different participants use to make their predictions.
Thus, it is possible to build global knowledge from everyone’s data, without revealing one’s own data and without accessing the data of others. For example, patients’ medical data remains in each hospital center that holds it; it is the federated learning algorithms that are executed and coordinated across these different sites.
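To make the mechanism concrete, here is a minimal sketch in Python of one federated-averaging round. The participants, their data and the least-squares “local training” are illustrative stand-ins under simplifying assumptions, not the exact algorithm of any particular system.

import numpy as np

def train_local_model(X, y):
    # Stand-in for local training: a least-squares fit on the local data only.
    X1 = np.c_[X, np.ones(len(X))]                   # add an intercept term
    params, *_ = np.linalg.lstsq(X1, y, rcond=None)
    return params                                    # only these parameters leave the site

def aggregate(local_params, data_volumes):
    # Server-side aggregation: average weighted by each participant's data volume.
    weights = np.asarray(data_volumes) / np.sum(data_volumes)
    return np.average(np.stack(local_params), axis=0, weights=weights)

# Three hypothetical participants (e.g. hospitals) with very different data volumes.
rng = np.random.default_rng(0)
true_coefs = np.array([1.0, -2.0, 0.5])
local_data = [rng.normal(size=(n, 3)) for n in (5000, 800, 150)]
local_labels = [X @ true_coefs + rng.normal(scale=0.1, size=len(X)) for X in local_data]

local_params = [train_local_model(X, y) for X, y in zip(local_data, local_labels)]
global_params = aggregate(local_params, [len(X) for X in local_data])
# The raw data never left the participants; only model parameters were exchanged.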
With such an approach, a small hospital center in a less populated geographical area than the large metropolises – which therefore has less medical data than a large hospital center and, consequently, an a priori less well-trained AI – can benefit from an AI reflecting global knowledge, trained in a decentralized manner on data from different hospital centers.
Similar applications can be imagined: several banks building a global fraud-detection AI, several smart buildings determining appropriate energy management, and so on.
Biases in decentralized AI are more complex to understand
Compared with the classic centralized machine learning approach, decentralized AI and its federated learning algorithms can, on the one hand, exacerbate bias even further and, on the other hand, make bias harder to treat.
Indeed, the local data of participants in a federated learning system can have very heterogeneous statistical distributions (different data volumes, different representations of certain demographic groups, etc.). A participant contributing a large volume of data will have more influence on the global model than a participant with a small volume of data. If the latter is located in a geographical area home to a particular social group, that group will unfortunately be reflected only weakly, if at all, in the global model.
Furthermore, the presence of bias in the data of one of the participants in a federated learning system can cause that bias to propagate to the other participants via the global model. Even if a participant has made sure its own local data is unbiased, it will inherit the bias present in the others’ data.
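A toy calculation, with purely hypothetical numbers, makes both effects visible: under an average weighted by data volume, a large participant with biased data dominates the global model, and a small participant with unbiased data inherits that bias.

import numpy as np

# One model parameter is read here as a penalty applied to a demographic group
# (0.0 = no penalty). This reading is a deliberate simplification for illustration.
biased_large   = np.array([-0.8])    # participant with 90 000 records, biased data
unbiased_small = np.array([ 0.0])    # participant with 10 000 records, unbiased data

volumes = np.array([90_000, 10_000])
weights = volumes / volumes.sum()                         # [0.9, 0.1]
global_param = weights @ np.stack([biased_large, unbiased_small])
print(global_param)   # [-0.72]: the small, unbiased participant now uses a biased model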
And more difficult to correct
Moreover, the techniques classically used to prevent and correct bias in the centralized case cannot be directly applied to federated learning. Indeed, the classical approach to bias correction mainly consists of preprocessing the data before machine learning so that they have certain statistical properties and are therefore no longer biased.
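As an illustration of what this classical, centralized approach can look like, here is a minimal sketch of one common preprocessing technique, reweighing, which weights examples so that the sensitive attribute and the label become statistically independent. The column names are hypothetical, and the key point is that the technique requires direct access to the full dataset.

import pandas as pd

def reweighing_weights(df, sensitive="gender", label="hired"):
    # Weight each example so that sensitive attribute and label are independent
    # in the weighted data: w(s, y) = P(S=s) * P(Y=y) / P(S=s, Y=y).
    weights = pd.Series(1.0, index=df.index)
    for s_val in df[sensitive].unique():
        for y_val in df[label].unique():
            mask = (df[sensitive] == s_val) & (df[label] == y_val)
            if mask.sum() == 0:
                continue
            expected = (df[sensitive] == s_val).mean() * (df[label] == y_val).mean()
            observed = mask.mean()
            weights[mask] = expected / observed
    return weights   # passed as sample weights to a standard learning algorithm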
However, in the case of decentralized AI and federated learning, it is not possible to access the participants’ data, nor to have knowledge of the global statistics of the decentralized data.
In this case, how can bias be dealt with in decentralized AI systems?
Measuring AI bias without access to decentralized data
A first step is to be able to measure the biases present in the decentralized data of federated learning participants, without having direct access to that data.
With my colleagues, we have designed a new method to measure and quantify bias in federated learning systems, based on analyzing the parameters of the participants’ local models. This method has the advantage of being compatible with the protection of participant data, while allowing several bias metrics to be measured.
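To give an idea of what such an audit targets, here are two common group-fairness metrics in a minimal Python sketch; this only illustrates the metrics themselves, not the parameter-analysis method, which precisely avoids gathering the raw data and predictions in one place.

import numpy as np

def statistical_parity_difference(y_pred, sensitive):
    # P(positive prediction | group 1) - P(positive prediction | group 0)
    y_pred, sensitive = np.asarray(y_pred), np.asarray(sensitive)
    return y_pred[sensitive == 1].mean() - y_pred[sensitive == 0].mean()

def equal_opportunity_difference(y_true, y_pred, sensitive):
    # Difference in true-positive rates between the two groups.
    y_true, y_pred, sensitive = map(np.asarray, (y_true, y_pred, sensitive))
    tpr = lambda g: y_pred[(sensitive == g) & (y_true == 1)].mean()
    return tpr(1) - tpr(0)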
Capturing the interdependence between multiple types of biases, and correcting them in decentralized AI
But there can also be multiple types of demographic bias, expressed through different sensitive attributes (gender, race, age, etc.), and we have shown that mitigating one type of bias can have the side effect of increasing another. It would be unfortunate if a solution for mitigating racial bias, for example, led to an exacerbation of gender bias.
We then proposed a multi-objective method for comprehensively measuring biases and for jointly and coherently treating the several types of bias that occur in federated learning systems.
Author Bio: Sara Bouchenak is Professor of Computer Science at INSA Lyon – University of Lyon