Personal data: nothing to hide, but a lot to lose


Our personal data circulates on the Internet: names, addresses, bank or social security details, real-time location… Related cases have become a fixture of public debate, from the Facebook-Cambridge Analytica scandal to the theft of data from the Red Cross, the recent paralysis of hospitals by ransomware, and the banning of the TikTok application for civil servants in several countries.

But while it is increasingly understood that our personal data is “valuable” and offers unprecedented opportunities for commercialization and innovation, it is sometimes difficult to understand or explain why it should be protected.

What are the risks associated with disclosing my personal data?

The first risk is losing control over our own data. This is what happens, for example, when we authorize tracking by sites or applications: we allow our activities on the Web or on our smartphone (pages visited, geolocation) to be recorded and this data to be exchanged, and once this consent has been given, we no longer have any power over how our data circulates.

This information is most often used for profiling, which feeds the personalized advertising economy. That economy is now governed by auction platforms that trade data about user profiles for advertising placements.

But this information can also be misused. Knowing your location can help a burglar, for example, and knowing your interests or political opinions can expose you to influence operations.

The Cambridge Analytica scandal is an example of this: the personal data of millions of Facebook users was exploited for targeted disinformation campaigns meant to influence voting intentions. More recently, Le Monde’s revelations about disinformation companies indicate that this practice is not an isolated case.

Another risk concerns phishing: if a fraudulent email or SMS contains personal information, it will seem more credible to you and will lower your guard. Phishing is often used to infect the target with ransomware: cybercriminals use personalized information to gain recipients’ trust and trick them into opening attachments or clicking on malicious links. This then allows the attackers to lock the victim’s data and block access to it. A ransom is then demanded to unlock it.

Although the most publicized ransomware attacks involve organizations, such as hospitals, individuals are also affected.

In the case of identity theft, a malicious person uses, without our consent, personal information that allows us to be identified (to “log in”): for example, by creating a fake profile on a platform and writing comments under the victim’s identity in order to damage their reputation.

At another level, the mass surveillance carried out by some States captures their citizens’ personal information in order, for example, to hinder freedom of expression or to keep files on individuals. Increased surveillance can create a feeling of lost privacy and thus lead individuals to restrain their own behavior.

In Europe, the GDPR (General Data Protection Regulation) limits the collection of personal data, in particular by governments, which must provide sufficient justification for any surveillance.

Each of us has a unique digital footprint

These issues affect all of us. Indeed, in an increasingly digital world where we generate data daily through our Internet browsing, our smartphones, or our connected watches, we all have a “unique digital footprint”.

Simply put, it is usually possible to re-identify someone just from the “traces” they leave behind on their digital devices.

For example, observing just four randomly chosen places that a person has visited is enough to form a unique signature for 98% of individuals. This uniqueness extends to a large number of human behaviors.
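
To make the idea of a unique signature concrete, here is a minimal Python sketch. The three users, their places, and the `unicity` function are invented for illustration and do not come from the study behind the figure above. The sketch only estimates how often a few places drawn at random from one person's trace match nobody else in a toy dataset.

```python
import random

def unicity(traces, k, trials=200, seed=0):
    """Estimate the share of draws in which k random places of one user
    are found in no other user's trace (hypothetical helper)."""
    rng = random.Random(seed)
    users = [u for u, places in traces.items() if len(places) >= k]
    if not users:
        return 0.0
    unique = 0
    for _ in range(trials):
        user = rng.choice(users)
        # Draw k places from this user's trace, as an observer might see them.
        sample = set(rng.sample(sorted(traces[user]), k))
        # Every user whose trace contains all the sampled places is a candidate.
        matches = [u for u, places in traces.items() if sample <= places]
        if matches == [user]:
            unique += 1
    return unique / trials

# Toy dataset (assumption): each user is a set of visited places.
traces = {
    "alice": {"home_A", "office_X", "gym", "bakery"},
    "bob": {"home_B", "office_X", "gym", "cinema"},
    "carol": {"home_C", "office_Y", "gym", "bakery"},
}

print("1 place :", unicity(traces, 1))
print("4 places:", unicity(traces, 4))
```

In this toy dataset a single place such as the gym is shared by everyone, while the combination of four places points to exactly one person.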

Hiding the identity of the owner of personal data behind a pseudonym alone is not sufficient protection against the risk of re-identification: the data itself must be anonymized.

Synthetic data, federated learning: new methods to protect personal data

Like the members of a “black bloc” trying to be indistinguishable from each other by dressing identically in a heated demonstration, the anonymization of data aims to prevent a person from standing out from the rest of the population considered, in order to limit the information that a cyberattacker could extract.

In the case of geolocation data, one could, for example, modify the data so that several users share the same places visited, or introduce noise to add uncertainty to the places actually visited.
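
The two ideas mentioned here can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function names `generalize` and `add_noise`, the grid size, the noise level, and the coordinates are all hypothetical, and plain Gaussian noise as used below does not by itself provide a formal privacy guarantee.

```python
import random

def generalize(lat, lon, cell=0.01):
    """Snap a position to the corner of its grid cell so that all users
    inside the same cell report the same place (cell size in degrees)."""
    return (round(lat // cell * cell, 6), round(lon // cell * cell, 6))

def add_noise(lat, lon, sigma=0.005, rng=None):
    """Blur a position with Gaussian noise of standard deviation sigma (degrees)."""
    rng = rng or random.Random()
    return (lat + rng.gauss(0, sigma), lon + rng.gauss(0, sigma))

# Two hypothetical users a few hundred meters apart.
user_a = (45.7640, 4.8357)
user_b = (45.7689, 4.8312)

# After generalization, both users share the same reported place.
print(generalize(*user_a), generalize(*user_b))

# With noise, the reported place is only approximately the real one.
print(add_noise(*user_a, rng=random.Random(42)))
```

A larger cell or a larger `sigma` hides individuals better but makes the reported positions less useful, which is the trade-off between protection and usefulness.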

But this anonymization has a cost because it “distorts” the data and reduces its value: modifying the raw data too heavily degrades the information carried by the anonymized data. Moreover, guaranteeing that no re-identifying fingerprint remains requires very significant modifications, which are often incompatible with many applications.

Finding the right compromise between protection and usefulness of anonymized information remains a challenge. At a time when some see data as the new oil of the 21st century, the stakes are high because anonymous data is no longer considered personal data and falls outside the GDPR, which means that it can be shared without the owner’s consent.

This difficulty in finding an acceptable compromise between protection and usefulness of data through anonymization mechanisms has led to changes in practices. New paradigms of personal data protection have emerged.

A first trend is to generate synthetic data reproducing the same statistical properties as the real data.

This artificially generated data is therefore not linked to a person and would no longer be regulated by the GDPR. Many companies see in this solution the promise of less restricted information sharing. In practice, the residual risks of synthetic generation models are not negligible and are still under study.
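
As a deliberately simplistic illustration of the principle, the Python sketch below fits a mean and a standard deviation to a small set of invented ages and then draws new values from a normal distribution with those parameters. The data, the `fit_and_sample` function, and the choice of a normal distribution are assumptions made for the example. Real synthetic data generators are far more elaborate, and this toy version says nothing about the residual risks mentioned above.

```python
import random
import statistics

def fit_and_sample(real_values, n, seed=0):
    """Draw n synthetic values from a normal distribution that has the
    same mean and standard deviation as the real values (hypothetical helper)."""
    rng = random.Random(seed)
    mu = statistics.mean(real_values)
    sigma = statistics.stdev(real_values)
    return [rng.gauss(mu, sigma) for _ in range(n)]

# Invented "real" data: the ages of eight people.
real_ages = [23, 35, 41, 29, 52, 38, 46, 31]

synthetic_ages = fit_and_sample(real_ages, n=1000)

# The synthetic values follow similar statistics without copying any record.
print("real mean     :", statistics.mean(real_ages))
print("synthetic mean:", round(statistics.mean(synthetic_ages), 2))
```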

Another solution that limits the risk of sharing personal data is federated learning. In conventional machine learning, data is centralized by one entity to train a model.

In federated learning, each user is assigned a model that they train locally on their own data. They then send the result to an entity that aggregates all the local models. Repeated over several rounds, this decentralized learning produces a shared model without disclosing personal data.
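
The loop described above can be sketched in Python with a one-parameter model, y ≈ w × x. This is a minimal sketch under stated assumptions, not a production protocol: the client datasets, the learning rate, the function names, and the choice of averaging weighted by dataset size are all illustrative.

```python
def local_update(w, data, lr=0.01, epochs=5):
    """One client's training: a few gradient steps on its own (x, y) pairs."""
    for _ in range(epochs):
        grad = sum(2 * (w * x - y) * x for x, y in data) / len(data)
        w -= lr * grad
    return w

def federated_round(w_global, clients):
    """Aggregation: average the locally trained weights, weighted by
    dataset size. Only the weights are combined, never the raw pairs."""
    total = sum(len(data) for data in clients)
    return sum(local_update(w_global, data) * len(data) for data in clients) / total

# Each inner list stays on one (hypothetical) user's device.
clients = [
    [(1.0, 2.0), (2.0, 4.1)],
    [(3.0, 6.2), (4.0, 7.9)],
    [(5.0, 10.1)],
]

w = 0.0
for _ in range(50):
    w = federated_round(w, clients)

print("shared model weight:", round(w, 3))
```

All the toy data roughly follows y = 2x, and the shared weight settles close to 2 even though no client ever sends its pairs to the aggregator.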

This new personal data protection paradigm is generating a lot of enthusiasm. However, several limitations remain, in particular regarding robustness against malicious actors who would like to influence the training process. A participant could, for example, modify their own data to cause the model to err on a particular classification task.
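
A small numerical sketch shows why a plain average is fragile. It simplifies the scenario above: instead of simulating the tampered data, it directly represents the result as one skewed local weight, and all values and names are invented for the example.

```python
def aggregate(updates):
    """Plain average of the local model weights sent by the participants."""
    return sum(updates) / len(updates)

# Four honest participants send similar weights (hypothetical values).
honest = [2.01, 1.98, 2.03, 2.00]

# Same round, but the last participant trained on tampered data
# and sends a weight far from the others.
poisoned = honest[:-1] + [-10.0]

print("honest aggregate  :", round(aggregate(honest), 3))
print("poisoned aggregate:", round(aggregate(poisoned), 3))
```

A single participant is enough to pull the aggregated weight away from the value the honest participants agree on.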

Author Bio: Antoine Boutet is a Lecturer in Privacy and AI at the CITI laboratory, Inria, at INSA Lyon – University of Lyon