Ethics in an age of data breaches


In August 2015, a hacking group calling itself The Impact Team released data stolen from Ashley Madison, a website designed to attract funds from men seeking an extramarital affair.

Before the year was out, academics were drawing on the Ashley Madison breach data.

I’ve found five journal articles or scholarly papers that draw on the data.

  • Grieser, William, Rachel Li, and Andrei Simonov. ‘Integrity, Creativity, and Corporate Culture’. SSRN Scholarly Paper. Rochester, NY: Social Science Research Network, 19 April 2017.

Grieser, Li and Simonov (all based in the USA) matched corporate email domains to estimate the proportion of each firm’s staff appearing in the Ashley Madison breach data, and compared this with occurrences of corporate fraud.
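As a rough sketch of this kind of matching (this is not the authors’ actual code, and the firms, headcounts and addresses below are invented), the domain-matching step might look something like:

```python
from collections import Counter

def domain(email):
    """Extract the lowercased domain from an email address."""
    return email.split("@")[-1].lower()

# Hypothetical breach records and firm headcounts, for illustration only.
breach_emails = [
    "alice@examplecorp.com",
    "bob@examplecorp.com",
    "carol@widgets.example",
]
firm_headcount = {"examplecorp.com": 200, "widgets.example": 50}

# Count breach records per corporate domain...
hits = Counter(domain(e) for e in breach_emails)

# ...then express them as a proportion of each firm's staff,
# giving a per-firm measure that can be related to fraud or financing data.
proportions = {d: hits[d] / n for d, n in firm_headcount.items()}
```

A real analysis would also need to discard personal email providers (gmail.com and the like), which don’t map to any employer.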

  • Griffin, John M., Samuel Kruger, and Gonzalo Maturana. ‘Do Personal Ethics Influence Corporate Ethics?’ SSRN Scholarly Paper. Rochester, NY: Social Science Research Network, 26 July 2017.

Griffin, Kruger and Maturana (all based in the USA) identified Chief Executive Officers and Chief Financial Officers in the Ashley Madison breach data and compared that data with corporate infraction data.

  • Chohaney, Michael L., and Kimberly A. Panozzo. ‘Infidelity and the Internet: The Geography of Ashley Madison Usership in the United States’. Geographical Review 108, no. 1 (1 January 2018): 69–91.

Chohaney and Panozzo (based in the USA) grouped Ashley Madison breach data by US Metropolitan Statistical Area (roughly analogous to large cities) and related this to patterns of affluence and other aspects of those areas.

Chen, Xia and Zhang (all based in Australia) matched corporate email domains to estimate the proportion of each firm’s staff appearing in the Ashley Madison breach data, and compared this with the firm’s financing costs.

  • Arfer, Kodi B., and Jason J. Jones. ‘American Political-Party Affiliation as a Predictor of Usage of an Adultery Website’. Archives of Sexual Behavior, 12 July 2018.

Arfer and Jones (based in the USA) compared Ashley Madison membership with voter registration records.

What concerns me about these papers is the ethics process that sat behind them. I couldn’t imagine the process that an Institutional Review Board (in the case of the US researchers) or a Human Research Ethics Committee (in the case of the Australian researchers) would have used to authorise the use of such sensitive, illegally obtained data.

The Ashley Madison data was obtained illegally. The breach data included more than 30 million unique email addresses. The records included: dates of birth, email addresses, ethnicities, genders, names, passwords, payment histories, phone numbers, physical addresses, security questions and answers, sexual orientations, usernames, and website activity.

Some of the papers gave an indication of the ethics approval process for their projects. Grieser, Li and Simonov said:

“We use anonymized data on individual users and do not conduct any analysis at the user level. Furthermore, we do not disclose in any way the names of corporations with employee email accounts in the database. We have received exemption from Institutional Review Board and approval by the universities with which we are associated because of the anonymization process, public availability of the data, and the aggregate nature of the measures that enter our analysis.”

Griffin, Kruger and Maturana stated:

“We have privately discussed the use of the data with attorneys who believe that the data is permissible to use for research purposes because the data is now in the public domain and available for research use in the same way that it is available to and used by the press.”

Chohaney and Panozzo wrote:

“Using the stolen Ashley Madison user account information as a dataset for academic research raises several ethical concerns. The dissemination of user identities compromised an unknown number of relationships, tarnished the reputation of several public figures, and even triggered several suicides… Individuals revealed to be seeking homosexual partners faced capital punishment in countries where homosexuality warrants severe physical punishment or the death penalty… Therefore, we handled and processed these data with the utmost concern for personal security and privacy. No individual user identities or locations can be derived from the information presented in this article.”

Chen, Xia and Zhang did not discuss the ethical issues relating to the data.

Arfer and Jones wrote:

“Using data from the 2016 [Ashley Madison] leak for scientific research raises ethical questions. The Impact Team’s original act of obtaining and publishing the data would generally be regarded as unethical, for reasons ranging from illegally accessing [Ashley Madison’s] private files to compromising the privacy of [Ashley Madison’s] many users. Does it follow that using the published data is inherently unethical, even when following the usual guidelines for ethical research? An analogous issue exists for the more serious case of the murderous Nazi research on hypothermia: The original experiments should not have been conducted, so should we refrain from citing the experimenters’ reports…? We believe that using data that were originally collected unethically is itself ethically permissible. To forbid such use would be closing the stable door after the horse has bolted. In the case of [Ashley Madison] in particular, not only have the data already been publicly available since 2015, it has been widely discussed in the news…, with some reports even describing how to obtain and use the data … We cannot undo the past, but we can make the most of the present by getting what social and scientific value we can out of undesirable events, whether those events are natural disasters, disease epidemics, or human wrongdoing.”

Kodi B. Arfer was also kind enough to engage with the commentary on Neuroskeptic’s original blog post where I found out about this issue (and kudos to them for doing that). Kodi made it clear that:

“…we did get [Institutional Review Board] approval, from a UCLA [Institutional Review Board]. They certified the study as exempt from review (because it didn’t involve new data collection or interacting with subjects), although my understanding is that there was actually a full committee review (and a lot of back-and-forth with a lawyer) due to the sensitive nature of the data…”

“This isn’t “data collection” in the sense that [Institutional Review Boards] use to define human-subjects research because it doesn’t involve interacting with or measuring people; rather, it’s the use of existing measurements.”

Several of the researchers refer to the ‘public availability of the data’ and the fact that it was ‘in the public domain’. That concerns me. We aren’t talking about Open Data here. I don’t know that we should be condoning the use of illegally obtained data on the basis that it happens to be publicly available.

I’m concerned that, if breach data is not formally included in the remit of Institutional Review Boards and Human Research Ethics Committees, it will eventually occur to unethical researchers that they can just pay hackers to obtain the data they want, and release it to the public.

This is important because there is a growing swag of breached data out there. At the time of writing, the website Have I Been Pwned can tell you if your details are contained within the 5,429,399,504 accounts from 313 breached websites recorded on the site. That’s five billion hacked accounts, people!

All the data breaches on the site have been verified by the site owner, Troy Hunt. He doesn’t pay for data, but people (both trusted and untrusted) do send it to him. He often works with the organisations involved, but not always.

Troy has thought about these issues a lot. Have I Been Pwned allows you to check whether your personal data has been compromised in data breaches. For most breaches, you can type in an email address and the site will confirm whether that address is included in the breach. He didn’t do that for the Ashley Madison breach, because that system allows you to determine whether anyone’s data is in the breach, not just your own. For Ashley Madison (and similar breaches, such as Adult Friend Finder), he invoked the concept of ‘sensitive’ data.
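To see the problem in miniature (a deliberately naive sketch, not how Have I Been Pwned is actually implemented; the addresses are made up), consider what an open per-email lookup amounts to:

```python
# Hypothetical breach index, for illustration only.
breach_index = {"victim@example.com", "someone@example.org"}

def in_breach(email):
    """Naive public lookup: anyone can query *any* address, not just
    their own, which is exactly why a sensitive breach can't safely
    be searchable this way."""
    return email.strip().lower() in breach_index
```

Nothing stops a third party from querying a spouse’s or an employee’s address, which is why breaches like Ashley Madison warrant different handling.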

Privacy legislation in many jurisdictions recognises that some data is more sensitive than other data, and therefore the law requires you to be more careful with that data. In Australia (where I am from), the legal definition of sensitive data includes racial or ethnic origin; political opinions; membership of a political association; religious beliefs or affiliations; philosophical beliefs; membership of a professional or trade association; membership of a trade union; sexual preferences or practices; or criminal record. Australia also has special rules for financial data and for health data. Other jurisdictions vary on what is considered sensitive.

Troy is a security consultant. He doesn’t work at a university. You can see this when he says:

“I’ve been asked a few times now what the process for flagging a breach as sensitive is and the answer is simply this: I make a personal judgement call.” – The Ethics of Running a Data Breach Search Service.

He doesn’t need to go through an Institutional Review Board or a Human Research Ethics Committee. We do. We work within a structure that provides rules and guidelines. I think that this gives us two different issues to talk about:

  1. How do we run the system:
    What is the role of ethics committees and institutional review boards in an age of breached data? Do we need to tighten up the framework?
  2. How do we police ourselves:
    As a researcher, what attitude should you take to using breach data?

Here are some questions that might help to shape your thinking around this:

  • Would you pay for access to breach data for research purposes?
  • Would you use breach data for research if it was provided to you anonymously (ie not yet public)?
  • Would you use breach data for research if it was provided to you by the hackers (ie not yet public, and from an illegal source)?
  • Would you use breach data for research if it was provided to you by the organisation that was breached (ie not yet public, and from the data owner)?
  • Would you use breach data for research if it was provided to you by a trusted third party (ie not yet public, and from another researcher, for example)?
  • Would you collect the data yourself, for research purposes, if you found that it was ‘leaking’ (eg publicly accessible, but only in a manner that the site owner did not anticipate, such as URI-manipulation)?
  • Would you only use breach data that had already been released to the public by someone else (ie anonymous or known source, no payment involved)?
  • Would you link multiple sets of breached data for research purposes?

There is a heap of breach data out there. If you wanted to, you could build a research niche just using this data alone. I wouldn’t, though.