Confidential health records from UK BioBank project exposed online

3 months ago 75

Confidential health data has been exposed online on dozens of occasions, a Guardian investigation can reveal, raising questions about the safeguarding of patient records by one of the UK’s flagship medical research projects.

UK Biobank, which holds the medical records of 500,000 British volunteers, is one of the world’s most comprehensive stores of health information and is credited with driving breakthroughs in cancer, dementia and diabetes research. But scientists approved to access Biobank’s sensitive data appear to have sometimes been cavalier about its security.

The files, which seem to have been inadvertently posted online by researchers using the data, do not include names or addresses, but they may still pose privacy concerns. One dataset found by the Guardian contained millions of hospital diagnoses and associated dates for more than 400,000 participants.

With the consent of a Biobank volunteer, the Guardian was able to pinpoint what appeared to be extensive hospital diagnosis records for the volunteer, using only their month and year of birth and details of a major surgery they had undergone.

One data expert said the scale and persistence of the problem was “shocking” at a time when AI and social media were making it ever easier to cross-reference information online.

UK Biobank rejected the concerns, saying that no identifying data, such as names and addresses, were provided to researchers.

In a statement, Prof Sir Rory Collins, the chief executive of UK Biobank, said: “We have never seen any evidence of any UK Biobank participant being re-identified by others.”

’They said they would hold our data securely’

Founded in 2003 by the Department of Health and medical research charities, UK Biobank holds genome sequences, scans, blood samples and lifestyle information of 500,000 volunteers. Last month, the government extended Biobank’s access to volunteers’ GP records.

Scientists at universities and private companies across the world apply for access and, until late 2024, were free to download data directly on to their own computer systems.

Before this point, data had been inadvertently published online and Biobank appears to still be grappling with the problem.

The issue emerged because journals and funders increasingly require researchers to publish the code they have used to analyse large datasets. When intending to upload code, some researchers have also accidentally published partial or entire Biobank datasets to GitHub, a popular online code-sharing platform. UK Biobank prohibits researchers from sharing data outside their systems and says it has introduced further training for all researchers.

In the past year, the data leaks appear to have become a more urgent concern to UK Biobank. Between July and December 2025, it issued 80 legal notices to GitHub, which has complied with requests to remove data from the internet. Yet much still remains available.

Some of the data files contain just patient IDs, or test results for small numbers, others are more extensive. One dataset found online by the Guardian in January contained hospital diagnoses and associated diagnosis dates for about 413,000 participants, along with their sex and month and year of birth.

A data expert, who reviewed the file said: “It sent shivers down my spine to even open. I deleted the file immediately. It was very detailed and felt like a gross invasion of privacy even to glance at.”

To test the risk of re-identification, the Guardian approached several Biobank volunteers, two of whom had undergone medical procedures in the timeframe within the data and agreed to share these details with an external data scientist.

One volunteer, who provided treatment dates for a fracture and seizure, could not be located in the dataset. A second volunteer, a woman in her 70s, shared her month and year of birth and the month and year she had a hysterectomy. Only one person in the dataset matched these details. The apparent match was corroborated by five other diagnoses from the records that the volunteer had not initially disclosed.

“Effectively you were rehearsing the main parts of my medical history to me without me having given you any information at all. I didn’t expect that,” the volunteer said.

The woman said she was not too concerned about her own data being exposed and intended to remain a participant, saying that she viewed UK Biobank’s work as “extremely important”. But, she added: “I’m more concerned about whether Biobank has broken its agreement with people. They said they would hold our data securely … I just feel as though that has to come into the equation.”

UK Biobank said the re-identification scenario tested by the Guardian did not highlight a privacy risk because without additional information it would be impossible to identify individuals.

A Biobank spokesperson said: “As we have communicated to our participants, including on our website: ‘If a participant puts information that reveals something about their health and identity, such as genealogy data, on a public website, this could make it possible for their identity to be discovered by cross-referencing UK Biobank research data.’

“You have simply demonstrated why we tell participants not to do this.”

The spokesperson added that Biobank had taken extensive measures to protect participants’ privacy, including proactively searching GitHub, contacting researchers directly and issuing legal takedown notices, actions which they said had led to about 500 repositories being removed. Many of these, it said, contained only patient IDs, not health data.

‘There are tensions between driving research with data and protecting privacy’

Privacy experts said UK Biobank’s approach appeared at odds with the reality that many people, reasonably, shared some health information online and that in an age of AI this could readily be identified and cross-referenced.

“Are these people aware that the internet exists?” asked Prof Felix Ritchie, an economist at the University of the West of England. “The idea that they can rely on their volunteers never putting any other information out there about themselves is an entirely unreasonable thing to expect.”

Dr Luc Rocher, associate professor at the Oxford Internet Institute, who reviewed several Biobank datasets found online, said that removing identifiers often did not guarantee anonymity and that simply knowing a person’s birthday and, say, the date they broke a leg might be enough to pinpoint their record with high confidence.

“Once identified, that record could reveal sensitive information such as a psychiatric diagnosis, an HIV test result, or a history of drug abuse,” they said.

Prof Niels Peek, professor of data science and healthcare improvement at the University of Cambridge, said the scale of the problem was “shocking”. “If it had happened once or 10 times I’d probably say: ‘It’s not great that it’s happened but at the same time zero risk is impossible,’” he said. “Hundreds. That’s a little bit too much.”

In Peek’s view, Biobank’s actions show it has taken the issue seriously and “done everything that one can reasonably expect”. But, he added: “The scale and persistence with which this has happened demonstrates that there are huge tensions between the ambition to drive health research with data at scale and the legal and ethical imperative to protect people’s privacy.”

Experts questioned whether Biobank will be able to fully regain control of the data released online. Despite researchers and GitHub having taken down most of the offending repositories in response to Biobank’s requests, many of the relevant files remained available on a code archive website.

Additional reporting by Luke Hoyland

Read Entire Article