How Census Is Building a Citizenship Database Covering Everyone Living in the U.S.
While the 2020 decennial count is underway, the Census Bureau is working on a separate effort to identify the percentage of the U.S. population that has legal citizenship. The result will be a Census-owned database of every person living in the U.S. with a statistical “citizenship estimate” linked to each individual.
The Trump administration initially pushed to include a citizenship question on the 2020 census. However, in June of last year, the Supreme Court ruled 5-4 to prevent the administration from asking the question, citing poor justification for its inclusion.
A month after the ruling, President Trump signed Executive Order 13880, requiring the bureau to produce data on the citizen voting-age population, or CVAP, by the end of March 2021, and mandating relevant agencies share databases to help Census achieve that end.
Next year, the bureau will release publicly available statistical modeling of citizen and non-citizen populations throughout the country, anonymized using a cutting-edge masking system. The effort will also create a dataset with a citizenship estimate for every person in the U.S., which, by law and by practice, should never be seen outside of the Census Bureau.
In an internal document obtained by Nextgov, bureau officials note the Census Unedited File — which is used to determine apportionments, including congressional representatives — will not contain any citizenship data. Instead, the bureau will create a separate micro-data file, or MDF, with the best citizenship estimate associated with each census respondent.
That micro-data file, along with the Census Edited File — an updated version of the CUF that corrects and backfills missing information — will be put through the 2020 Disclosure Avoidance System, “which will do the final record linkage and place a confidentiality protected citizenship variable on the same MDF as will be used to produce the redistricting data,” according to the documents.
While the citizenship status of individuals will not be made public, Census will be publishing CVAP tables that break down citizenship estimates at the block level — the most granular level of census data. Those tables are scheduled for release by March 31, 2021.
However, keeping that amount of public data anonymized is no simple thing. With surprisingly few bits of correlated data, a once-anonymous person can easily be identified. This becomes much easier when coupled with information publicly available on the internet, such as social media profiles.
To prevent criminals and other malicious actors from reverse engineering identities, Census is employing a new disclosure avoidance system for all 2020 census data shared publicly.
“Our decision to deploy a modernized disclosure avoidance system for the 2020 census was driven by research showing that methods we used to protect the 2010 census and earlier statistics can no longer adequately defend against today’s privacy threats,” John Abowd, Census’ associate director for research and methodology and chief scientist, and Victoria Velkoff, chief of the American Community Survey Office, wrote in an October 2019 blog post explaining the new system developed by cryptographers and data scientists.
The new differential privacy system injects “noise” into the datasets by using an algorithm that makes targeted changes to the data to prevent outside actors — malicious or otherwise — from reverse engineering identities.
Census has been using various forms of differential privacy, also known as formal privacy, since 2008, though never at the scale planned for the 2020 census data. In the past, Census added uncertainty only to select statistics with a high risk of deanonymization, to avoid adding so much noise that the statistics became unreliable.
For the coming count, uncertainty will be added to entire published datasets using state-of-the-art mathematical models.
“The new method allows us to precisely control the amount of uncertainty that we add according to privacy requirements,” Abowd and Velkoff wrote. “And, by documenting the properties of this uncertainty, we can help data users determine if published estimates are sufficiently accurate for their specific applications. In this manner, we can determine the data’s ‘fitness for use.’”
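To illustrate what controlling the amount of uncertainty can mean in practice, the following is a minimal sketch of the Laplace mechanism, a textbook differential privacy technique. It is not the bureau's 2020 Disclosure Avoidance System, whose details are not described here. The block count, the epsilon values and the function names are all hypothetical, chosen only to show that a smaller privacy-loss parameter (epsilon) produces noisier published counts.

```python
import random


def laplace_noise(scale, rng):
    # The difference of two independent exponential draws with mean `scale`
    # follows a Laplace distribution centered at 0 with that scale.
    return rng.expovariate(1.0 / scale) - rng.expovariate(1.0 / scale)


def noisy_count(true_count, epsilon, rng, sensitivity=1.0):
    # Adding or removing one person changes a count by at most 1 (the
    # sensitivity), so the noise scale is sensitivity / epsilon.
    return true_count + laplace_noise(sensitivity / epsilon, rng)


def mean_abs_error(true_count, epsilon, rng, trials=20000):
    # Average distance between the published value and the true value.
    total = sum(abs(noisy_count(true_count, epsilon, rng) - true_count)
                for _ in range(trials))
    return total / trials


rng = random.Random(2020)
true_count = 37  # hypothetical block-level count, not real census data
for epsilon in (0.1, 1.0, 10.0):
    published = round(noisy_count(true_count, epsilon, rng), 1)
    error = round(mean_abs_error(true_count, epsilon, rng), 2)
    print(f"epsilon={epsilon}: one published value={published}, "
          f"mean abs error={error}")
```

In this sketch the expected error equals the noise scale, so it can be documented ahead of time, which is the property that lets data users judge whether a published figure is accurate enough for their purpose.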
With the public datasets anonymized, it will be up to Census to protect the raw data.
While the disclosure avoidance system is designed to ensure personal data remains anonymous, Robert Groves, provost of Georgetown University, who led the Census Bureau during the 2010 decennial count, said two things will ensure the raw, nonanonymized database is never used to target individuals: law and culture.
Groves, in an interview with Nextgov after reviewing the documents, cited a legal provision known as “functional separation.”
“Once you enter a statistical agency environment, it’s a one-way street,” he explained. “As soon as that Homeland Security dataset enters behind the firewall of Census, the laws of Census apply. It’s no longer a Homeland Security dataset, in a sense. It is controlled by the Census Bureau. And, under the Title 13 law, it is absolutely crystal clear that the combined dataset never exits Census with individual person records on it. Only statistics can exit.”
That protection extends to the highest levels.
“Even if it’s requested by the president, it’s absolutely illegal,” Groves confirmed when asked. “And even if it were an executive order directing Census to do this, the statute would trump the order.”
Beyond the law, Groves said the culture of statisticians and public servants working at the Census Bureau would make it almost impossible for the data to leak out unnoticed.
“If there’s anything I believe most strongly, it’s if there’s any illegal act that is proposed or promulgated, the staff at the Census Bureau would call [reporters] within 30 seconds. They are devoted to supplying the country statistical information under the law,” he said, adding that this devotion is rooted in necessity.
“The reason those laws exist is if individual records were freely given for enforcement procedures from the decennial census, then the cooperation from the public with the census is decimated,” Groves said. “These statistical agencies work with a social confidence — a trust with the public that the laws will be followed — and the laws were established to enhance that trust.”
While the Census Bureau won’t be able to ask each individual in the U.S. about their citizenship status, leveraging access to data held by other agencies will enable statisticians to match census respondents with information they have shared with the government to build a “best citizenship” estimate for each individual.
The bureau has been working on the algorithm to produce that estimate since April 2018 and planned to finalize the “final specifications and modeling details” before the end of March, according to an internal document.
The bureau did not respond to repeated requests for comments and updates on the status of that work or a comprehensive breakdown of which federal databases are actively being shared for this work.
However, the document offers a look into the main databases being used and the additional data sources most likely to be tapped.
Bureau officials believe about 90% of the U.S. population will be covered by data from two sources: the Social Security Administration’s Numerical Identification System, or Numident, which stores Social Security numbers; and the IRS’ Individual Taxpayer Identification Numbers, or ITINs, which serve as substitute identifiers for people without Social Security numbers. Approximately 94% of SSN records include citizenship information.
However, if officials determine these sources are not sufficient, agencies control a host of other datasets that could be added to the mix, including databases managed by the Centers for Medicare and Medicaid Services, the departments of State and Housing and Urban Development, and Homeland Security Department components like U.S. Citizenship and Immigration Services and Immigration and Customs Enforcement.
In the briefing document, Census officials said additional data from Homeland Security, State and other departments “are expected to provide the [personally identifiable information] that enables record linkage for much of the balance of the resident population.” However, that comes with a caveat: “Provided that the PII on the 2020 Census is as reliable as it was in 2010.”
DHS released a privacy impact assessment in December outlining how it would share information with Census, though bureau officials did not respond to requests for confirmation that the DHS databases have been accessed or integrated into the citizenship estimates.
That data will be run through the finalized algorithm to produce a best estimate of citizenship for each person.
“For a single person, they’ll collect multiple data sources on citizenship. Inevitably, those sources won’t agree. Then, the question is what do you do to estimate the best response for citizenship for that particular person. They will estimate that with modeling across the various databases,” Groves said. “They’ll also use the same sort of model if, despite all their efforts, for you they can’t find a record that you’re a citizen or you’re not a citizen, they will impute your citizenship to that model.”
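One way to picture the kind of modeling Groves describes is a simple log-odds combination of sources that disagree. The sketch below is an illustration only and is not the bureau's algorithm, whose specifications were not public. The source names, the accuracy figures and the prior are hypothetical assumptions. Each source's report shifts the estimate in proportion to how reliable that source is assumed to be, and a person with no matching records falls back to the model's prior, which corresponds to imputation.

```python
import math

# Hypothetical probability that each source reports a person's true status.
# These figures are illustrative assumptions, not measured accuracies.
SOURCE_ACCURACY = {"source_a": 0.95, "source_b": 0.90, "source_c": 0.97}


def citizen_probability(reports, prior=0.9):
    """Estimate the probability that a person is a citizen.

    `reports` maps a source name to True (record says citizen) or
    False (record says non-citizen). `prior` is a hypothetical baseline
    used when no records are found.
    """
    log_odds = math.log(prior / (1 - prior))
    for source, says_citizen in reports.items():
        accuracy = SOURCE_ACCURACY[source]
        weight = math.log(accuracy / (1 - accuracy))
        # A more reliable source moves the estimate further.
        log_odds += weight if says_citizen else -weight
    return 1 / (1 + math.exp(-log_odds))


# Two sources disagree: the more reliable one carries more weight.
print(citizen_probability({"source_a": True, "source_b": False}))
# No record found in any source: the estimate is the model's prior.
print(citizen_probability({}))
```

This treats the sources as independent, which real administrative records often are not. A production model would need to account for that, along with records that were linked to the wrong person.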
Groves said we won’t know how accurate those estimates are until well after the fact.
“No one’s ever done this before,” he said. “No one, at this point, I think it’s fair to say, knows what the quality of the resulting estimates will be. We just don’t know that. We’ll know it after this, through evaluation studies. But this is just a good-faith statistical effort.”
“Unfortunately, we don’t have a lot of track record on this,” he added. “These datasets, to my knowledge, have never been assembled the way they’re trying to assemble them.”