Twitter has been abuzz with talk about the 2020 United States Census, particularly the microdata. Upzone New Jersey’s tweet gained a bunch of traction last week, and it was one of several that made the rounds on my timeline.
Most people were pointing to a post by IPUMS outlining how differential privacy was going to change the released census microdata. The IPUMS post is alarming. The claim that the block-level data is unfit for basic research is scary. Having been both a researcher using data and an employee of a data collection group, I take this with a grain of salt.
There are two groups at play here: researchers and data collectors. Each has its own biases. In a perfect world, a data collection group would like to share the data they collected with nobody. On the flip side, the perfect world of many researchers is one where they get raw versions of everything to minimize uncontrollable biases. Neither of these two groups should be taken at face value. I personally find both camps’ arguments to be lacking.
Twitter is the favored hangout of researchers procrastinating on their work and advertising their newest papers. There are some people there who have a background in data collection, but they are a small minority. The discussion on Twitter, therefore, reflects the researchers’ point of view. That point of view has some flaws.
For four and a half years I answered emails from data users from all over the world who used geographic data from The DHS Program.1 Like the census, our data was subject to various means to prevent the identification of our respondents. Every day, I would get an email from a researcher asking me for access to the deanonymized data. Even when I told them no, I was routinely asked to make an exception in our policy for them. The more creative researchers tried to find ways for them to get the data and protect confidentiality at the same time. At its core, the researchers who did not accept my no, from BA students at small colleges to professors at prestigious world-renowned universities, believed that their research project was extremely important and would outweigh any confidentiality issues.
I see parallels to my experience in watching American researchers get told no by the Census Bureau, on confidentiality grounds, for the first time. The number of people on Twitter who are saying something to the effect of “forget confidentiality, I need to work on my next paper” is striking and saddening.
Data collection groups
If you have never worked on the inside of a group that collects confidential data, it is hard to describe the bunker mentality. There is an element of fear of someone finding a way to reverse your data anonymization procedure that underpins quite a few decisions.
What also leads to fear is that you know your data better than anyone in the world. The risk assessment that the Census Bureau points to comes from an internal red team that was able to rebuild the raw census data. Reasonable people should be skeptical of this argument. I am one of the 10-30 people in the world who could try to reverse The DHS Program’s displacement procedure with some success because of how well I know how it works.2 Naturally, I am going to be able to think of more possible attacks than anyone outside of the data collection group.
Also, the Census Bureau holds all of the cards here. Anyone can try to deanonymize the census. The Census Bureau is under no obligation to confirm if that group is onto something. A steely silence to anyone who claims to reverse the census anonymization procedure is the best response.
None of the data that the decennial census collects is all that sensitive
If you listen to the stories that the Census Bureau spins, the decennial census data could be used for all sorts of evil things if the raw tabulations were ever released. The problem with their argument is that it does not hold much water. One of the reasons is that the census just does not collect information that is all that sensitive in the scheme of things.
The Census Bureau’s decennial census only collects the following data
That is it. The long-form census has been a thing of the past for decades and has been replaced by the American Community Survey. Compared to The DHS Program, which collects sensitive information such as the number of sexual partners a person has had, age of first sex, HIV status, FGC prevalence, etc., the census is small potatoes.
If the argument were over the American Community Survey, I might accept that there were risks. The more data points you have on a person, the easier it is to figure out who it is. Plus, information such as wages is much more distinctive than ethnicity.
Weighing everything, I am of the belief that the Census Bureau’s current approach is lacking and that they should release more data. If the Census Bureau is truly of the belief that census blocks are too small in some cases, they should not release any block-level data. Instead, here are two ways to release some of the block-level data to help researchers while protecting confidentiality.
Currently, the census blocks are created by taking the lowest level administrative boundaries in a state (normally the municipalities) and then dicing them by a series of TIGER Line files: roads, rivers, railroads, etc. This leads to a patchwork of tiny polygons, some of which have no population and never will. To combat this, blocks could be replaced by what I am going to call big blocks.
The idea of big blocks is that there is some mysterious number, let’s call it K, which is the smallest population that will hide all of the people in a polygon. The Census Bureau could just merge together contiguous blocks until every single polygon has at least K people. An upside of this approach is that in urban areas, the K value is probably right around the size of a block or two. To urban researchers, this would be just as useful as the pre-2010 blocks. Everyone using the data would be able to trust that the data they are using is accurate. It just might have worse spatial fidelity.
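One way the merging could work is sketched below in Python. This is only an illustration under assumptions of my own: the function name `merge_big_blocks`, the plain dictionaries standing in for block populations and block adjacency, and the greedy rule of folding the smallest region into its least populous neighbor are all hypothetical, not anything the Census Bureau has described.

```python
def merge_big_blocks(population, adjacency, k):
    """Greedily merge contiguous blocks until every region has >= k people.

    population: dict mapping block id -> number of people (hypothetical input)
    adjacency:  dict mapping block id -> set of neighboring block ids,
                assumed symmetric and free of self-loops
    k:          minimum population a released polygon must contain

    Returns a list of (frozenset_of_block_ids, total_population) pairs.
    """
    # Every block starts out as its own region, keyed by its block id.
    members = {b: {b} for b in population}
    totals = dict(population)
    neighbors = {b: set(adjacency.get(b, ())) for b in population}

    while True:
        # Regions still under the threshold that have something to merge with.
        small = [r for r in totals if totals[r] < k and neighbors[r]]
        if not small:
            break
        # Take the smallest region and fold it into its least populous
        # neighbor, which keeps the merged polygons as small as possible.
        r = min(small, key=lambda x: (totals[x], x))
        n = min(neighbors[r], key=lambda x: (totals[x], x))
        members[n] |= members.pop(r)
        totals[n] += totals.pop(r)
        # Rewire adjacency so that r's old neighbors now border n instead.
        for other in neighbors.pop(r):
            neighbors[other].discard(r)
            if other != n:
                neighbors[other].add(n)
                neighbors[n].add(other)

    return [(frozenset(blocks), totals[r]) for r, blocks in members.items()]


# Toy example: four blocks in a row plus one isolated block.
toy_population = {"a": 0, "b": 3, "c": 12, "d": 40, "e": 2}
toy_adjacency = {
    "a": {"b"},
    "b": {"a", "c"},
    "c": {"b", "d"},
    "d": {"c"},
    "e": set(),
}
big_blocks = merge_big_blocks(toy_population, toy_adjacency, k=10)
```

In the toy example, the empty block `a` and the small block `b` are absorbed into `c`, giving a big block of 15 people, while `d` is left alone. The isolated block `e` stays under K because it has nothing to merge with, so a real procedure would need a separate rule, such as suppression, for cases like that.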
A displacement approach
Another option would be to take a 10% sample of the census blocks within each census tract or county and release that data. You could not release that data just as it is if you thought it had confidentiality problems. So, you could take the center point of each of the selected blocks and move it a bit. This would allow researchers to get a picture of what is happening on the ground without releasing everything.
There are a couple of ways of doing this. If there were an established K value for the decennial census, then you could move the point anywhere within a surrounding area that holds at least that population. If that requires too much computation, then you could move the point a set distance based on some conditions while respecting some boundaries.
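The simpler, distance-based variant could look roughly like the Python sketch below. Everything in it is an assumption made for illustration: the function names `sample_blocks` and `displace`, the use of projected coordinates in meters, the uniform choice of angle and distance, and the `inside` callback standing in for a real tract or county boundary test. It is a generic random displacement, not The DHS Program’s procedure and not anything the Census Bureau uses.

```python
import math
import random


def sample_blocks(blocks_by_tract, fraction=0.10, rng=None):
    """Draw a simple random sample of block ids within each tract."""
    rng = rng or random.Random()
    sampled = {}
    for tract, blocks in blocks_by_tract.items():
        if not blocks:
            continue  # nothing to sample in an empty tract
        # Always keep at least one block per tract (an assumption).
        n = max(1, round(len(blocks) * fraction))
        sampled[tract] = rng.sample(sorted(blocks), n)
    return sampled


def displace(point, max_distance, inside, rng=None, max_tries=100):
    """Move a block center point a random distance in a random direction.

    point:        (x, y) in projected units such as meters
    max_distance: upper bound on how far the point may move
    inside:       hypothetical callback returning True if a candidate
                  point is still within the boundary to be respected
    Returns the displaced point, or None if no valid point was found.
    """
    rng = rng or random.Random()
    x, y = point
    for _ in range(max_tries):
        angle = rng.uniform(0, 2 * math.pi)
        distance = rng.uniform(0, max_distance)
        candidate = (x + distance * math.cos(angle),
                     y + distance * math.sin(angle))
        if inside(candidate):
            return candidate
    # Never fall back to the true location; let the caller drop the block.
    return None


# Toy usage: a square "tract" of 1 km by 1 km stands in for a real boundary.
def in_toy_tract(p):
    return 0 <= p[0] <= 1000 and 0 <= p[1] <= 1000


rng = random.Random(0)
chosen = sample_blocks({"tract_1": [f"block_{i}" for i in range(20)]}, rng=rng)
moved = displace((500.0, 500.0), max_distance=200.0, inside=in_toy_tract, rng=rng)
```

Returning `None` instead of the original coordinates when no valid point is found is a deliberate choice here, since releasing the true center point would defeat the purpose of the displacement.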
I have not worked for The DHS Program since September and I do not speak for them. Everything here is my thoughts alone.
No, I am not going to tell you how I might go about doing it. Please go away.