Sensitive US census data is vulnerable to theft and exposure

Computer scientists have designed a “reconstruction attack” that shows US Census data could be stolen or leaked using a laptop and machine learning code

A team of computer scientists says US citizens could have their identities stolen and exploited through a reverse-engineering exercise that attackers can carry out using machine learning algorithms on an ordinary laptop.

The "reconstruction attack" is part of a study led by Aaron Roth, the Henry Salvatori Professor of Computer and Cognitive Science in Computer and Information Science (CIS) at the University of Pennsylvania School of Engineering and Applied Science, and Michael Kearns, the National Center Professor of Management and Technology in CIS. The study was published in the Proceedings of the National Academy of Sciences (PNAS).

The researchers used machine learning and a standard laptop to demonstrate how protected information about individual respondents can be reverse-engineered from US Census Bureau statistics, potentially compromising the privacy of the US population.

This study establishes a benchmark for unacceptable susceptibility to exposure and highlights the likelihood of identity theft or discrimination resulting from this attack. The researchers also demonstrate how an attacker can determine the probability that a reconstructed record corresponds to the data of a real person.

“Over the last two decades, it has become clear that practices in widespread use for data privacy (anonymising or masking records, coarsening granular responses or aggregating individual data into large-scale statistics) do not work,” says Kearns. “In response, computer scientists have created techniques to provably guarantee privacy.”

“The private sector,” says Roth, “has been applying these techniques for years. But the Census’ long-running statistical programs and policies have additional complications attached.”

Data is critical for political, economic, and social purposes

The US Census Bureau is required by the Constitution to conduct a full population survey every decade, and the data collected is critical for various political, economic, and social purposes, including apportioning House seats, drawing district boundaries, allocating federal funding for state and local programs, disaster relief, welfare programs, infrastructure development, and demographic research.

While Census information is publicly available, strict laws are in place to protect individual privacy. Published statistics therefore aggregate respondents' survey answers, giving a mathematically precise overall picture of the population without directly revealing individuals' personal information.

However, attackers can use these aggregated statistics to reverse-engineer sets of records consistent with confirmed statistics, a process known as "reconstruction." In response to these risks, the Census conducted an internal reconstruction attack between the 2010 and 2020 surveys to evaluate the need for changes in reporting. The findings led to the implementation of "differential privacy," a provable protection technique that preserves the integrity of the larger data set while concealing individual data.
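To make the idea of reconstruction concrete, here is a minimal sketch in Python. It is not the researchers' method, which relies on machine learning over far larger tables. It is a brute-force toy that assumes a hypothetical census block of three people for which only a count, a total age, a median age and a maximum age have been published. The function name and the statistics are invented for illustration.

```python
from itertools import combinations_with_replacement


def reconstruct(count, total_age, median_age, oldest, ages=range(0, 100)):
    """Return every multiset of ages consistent with the published statistics.

    Hypothetical toy: `count` is assumed to be odd so the median is the
    middle value. Candidates are generated in non-decreasing order, so the
    middle element is the median and the last element is the maximum.
    """
    matches = []
    for candidate in combinations_with_replacement(ages, count):
        if (
            sum(candidate) == total_age
            and candidate[count // 2] == median_age
            and candidate[-1] == oldest
        ):
            matches.append(candidate)
    return matches


# Published (aggregate) statistics for an imaginary block of three residents.
candidates = reconstruct(count=3, total_age=90, median_age=30, oldest=52)
print(candidates)  # only one set of ages fits: (8, 30, 52)
```

In this toy case the four aggregates pin down every resident's age exactly, even though no individual answer was published. Real tables cover many more people and attributes, so an attacker would need a smarter search than plain enumeration, but the principle is the same.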

Differential privacy, invented by Cynthia Dwork, a computer science professor at Harvard University and a collaborator on the study, introduces strategic amounts of false data, known as "noise," to conceal individual data. While the noise's impact on statistical correctness is negligible at large scales, it can cause complications in demographic statistics describing small populations.
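The effect of noise at different scales can be sketched with a few lines of Python. The example below uses the Laplace mechanism, a textbook way to make a counting query differentially private. It is an illustration only and is not necessarily the mechanism the Census Bureau uses. The privacy parameter `epsilon` and the population sizes are assumptions chosen for the demonstration.

```python
import random


def laplace_noise(scale, rng):
    """Draw one sample from a Laplace distribution centred on 0.

    The difference of two independent exponential draws, each with mean
    `scale`, follows a Laplace distribution with that scale.
    """
    return rng.expovariate(1.0 / scale) - rng.expovariate(1.0 / scale)


def noisy_count(true_count, epsilon, rng):
    """Release a count with Laplace noise.

    Adding or removing one person changes a count by at most 1, so the
    noise scale is 1 / epsilon.
    """
    return true_count + laplace_noise(1.0 / epsilon, rng)


def mean_relative_error(true_count, epsilon, rng, trials=10_000):
    """Average absolute error of the noisy count, as a fraction of the truth."""
    total = sum(
        abs(noisy_count(true_count, epsilon, rng) - true_count)
        for _ in range(trials)
    )
    return total / trials / true_count


rng = random.Random(0)  # fixed seed so the sketch is repeatable
epsilon = 0.5           # hypothetical privacy budget, not a Census value

large_error = mean_relative_error(1_000_000, epsilon, rng)  # e.g. a large city
small_error = mean_relative_error(5, epsilon, rng)          # e.g. a tiny block

print(f"relative error, population 1,000,000: {large_error:.8f}")
print(f"relative error, population 5:         {small_error:.2f}")
```

The same amount of noise is added in both cases. It is a rounding error against a population of a million but a large fraction of a count of five, which is the small-population complication described above.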

Experts suggest that the trade-off between accuracy and privacy is complex. While some social scientists argue that publishing aggregate statistics poses no inherent risk, Roth and Kearns' work shows that the likelihood of reconstructing individual records is higher than previously thought.

“What’s novel about our approach is that we show that it’s possible to identify which reconstructed records are most likely to match the answers of a real person,” says Kearns. “Others have already demonstrated it’s possible to generate real records, but we are the first to establish a hierarchy that would allow attackers to, for example, prioritise candidates for identity theft by the likelihood their records are correct.”
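One way to picture such a ranking is sketched below in Python. The scoring rule is an assumption made for illustration, not a description of the study's procedure: it enumerates every dataset consistent with some published statistics and scores each candidate record by the share of those datasets in which it appears. The block of three residents and its statistics are hypothetical.

```python
from collections import Counter
from itertools import combinations_with_replacement


def consistent_blocks(count, total_age, median_age, ages=range(0, 100)):
    """List every multiset of ages matching the published statistics.

    Hypothetical toy: `count` is assumed odd, and candidates come out in
    non-decreasing order, so the middle element is the median.
    """
    return [
        block
        for block in combinations_with_replacement(ages, count)
        if sum(block) == total_age and block[count // 2] == median_age
    ]


def rank_records(blocks):
    """Score each age by the fraction of consistent blocks that contain it."""
    appearances = Counter()
    for block in blocks:
        appearances.update(set(block))  # count each age once per block
    scored = [(age, n / len(blocks)) for age, n in appearances.items()]
    # Highest score first; ties broken by age for a stable order.
    return sorted(scored, key=lambda pair: (-pair[1], pair[0]))


# Fewer published statistics than a full reconstruction would need.
blocks = consistent_blocks(count=3, total_age=90, median_age=30)
ranking = rank_records(blocks)

print(len(blocks), "datasets fit the statistics")
print("most confident record:", ranking[0])
print("a low-confidence record:", ranking[-1])
```

Here many datasets fit the statistics, so no single reconstruction can be trusted as a whole. Even so, a 30-year-old appears in every one of them, so that record sits at the top of the ranking, while every other candidate age appears in only one dataset and would be a poor bet for an attacker.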

On the matter of complications posed by adding error to statistics that play such a significant role in the lives of the US population, the researchers say they are being realistic. “The Census is still working out how much noise will be useful and fair to balance the trade-off between accuracy and privacy,” says Roth. “And, in the long run, it may be that public policymakers decide that the risks posed by non-noisy statistics are worth the transparency.” 
