Mining census data without violating privacy


ITHACA, N.Y. -- With modern computing power, data from the U.S. Census Bureau, the Internal Revenue Service, law enforcement agencies and other sources can be combined to answer important public policy questions. The trick is to do this without violating people's privacy.

The National Science Foundation (NSF) has previously provided funding for Cornell University economist John Abowd and colleagues to develop techniques that enable social scientists to use such data while maintaining the confidentiality that both law and ethics demand.

This project has now been enlarged with a $2.9 million NSF Information Technology Research grant to expand the types of data to be made available and to ensure that the reprocessed data are valid. The grant will fund a Cornell-led consortium that includes Carnegie Mellon University, Duke University, the University of Michigan, Argonne National Laboratory and the Census Bureau.

Abowd is the Edmund Ezra Day Professor of Industrial and Labor Relations at Cornell and director of the Cornell Institute for Social and Economic Research (CISER). He also is a distinguished senior research fellow of the Census Bureau.

These days, "anonymizing" the data -- like taking the name and address off a census form -- isn't nearly enough. Widely available public databases make it possible to identify individuals based on combinations of, for example, income level, occupation, geographic area and age. Geospatial databases can associate a street address with its exact latitude and longitude, and probably tell you the size of that household's electric bill and its cable service provider. Data on businesses can be even more transparent. How many sheet metal fabrication shops are there, for example, in Ithaca, N.Y.?
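To make the re-identification risk concrete, here is a minimal Python sketch that counts how many records become unique once several attributes are combined. The records, field names and the `unique_records` helper are all made up for illustration; they are not drawn from any census file or from the project's own tools.

```python
from collections import Counter

# Hypothetical, made-up records with names and addresses already removed.
records = [
    {"income_band": "50-75k", "occupation": "teacher", "area": "A", "age_band": "30-39"},
    {"income_band": "50-75k", "occupation": "teacher", "area": "A", "age_band": "30-39"},
    {"income_band": "50-75k", "occupation": "welder", "area": "A", "age_band": "30-39"},
    {"income_band": "100k+", "occupation": "surgeon", "area": "B", "age_band": "50-59"},
]


def unique_records(rows, keys):
    """Return the rows whose combination of values for `keys` appears exactly once."""
    counts = Counter(tuple(row[k] for k in keys) for row in rows)
    return [row for row in rows if counts[tuple(row[k] for k in keys)] == 1]


# Looking at one attribute alone singles out few records...
by_area = unique_records(records, ["area"])

# ...but combining attributes singles out more of them.
by_combo = unique_records(records, ["income_band", "occupation", "area", "age_band"])

print(len(by_area), "record(s) unique by area alone")
print(len(by_combo), "record(s) unique by the full combination")
```

A record that is unique on such a combination can, in principle, be matched against an outside database that carries the same attributes alongside a name.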

One of Abowd's solutions is to create synthetic data: a dataset of "virtual households" which, taken together, produce the same overall statistical results as the original set of census forms or other records. Another is "coarsening," in which small groups of households or businesses are blended into single records. One of the early products of this work is Quarterly Workforce Indicators Online, which allows businesses and local governments to see where the jobs are, for what kind of workers, and how much workers can expect to earn and employers to pay, drilling down to individual counties or workforce investment areas.

But social scientists are naturally suspicious of synthetic data. Do they really produce the same statistical results as the actual census or other microdata from which they are derived? Abowd plans to test this by running actual research projects on both real and synthetic data.
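The two ideas, and the kind of check that compares real and synthetic results, can be illustrated with a toy Python sketch. It is not the method used by Abowd's group, whose techniques are not described here. It assumes a single income variable, stands in randomly generated numbers for the confidential data, and uses the simplest possible generator (a fitted normal distribution). The function names `coarsen` and `synthesize` are hypothetical.

```python
import random
import statistics


def coarsen(values, group_size):
    """Blend sorted values into small groups; return one (mean, count) record per group."""
    ordered = sorted(values)
    groups = [ordered[i:i + group_size] for i in range(0, len(ordered), group_size)]
    return [(statistics.fmean(g), len(g)) for g in groups]


def synthesize(values, n, rng):
    """Draw n synthetic values from a normal distribution fitted to the real values."""
    mu = statistics.fmean(values)
    sigma = statistics.stdev(values)
    return [rng.gauss(mu, sigma) for _ in range(n)]


rng = random.Random(0)

# Stand-in for confidential microdata (assumption: made-up incomes, not real records).
real_incomes = [rng.gauss(50_000, 12_000) for _ in range(2_000)]

# Coarsening: every published record now describes five households, not one.
coarse = coarsen(real_incomes, group_size=5)
coarse_mean = sum(m * c for m, c in coarse) / sum(c for _, c in coarse)

# Synthetic data: "virtual households" that match the real data only in aggregate.
synthetic_incomes = synthesize(real_incomes, n=2_000, rng=rng)

# The validity check: run the same analysis (here, just a mean) on each version.
real_mean = statistics.fmean(real_incomes)
synthetic_mean = statistics.fmean(synthetic_incomes)
print(round(real_mean), round(coarse_mean), round(synthetic_mean))
```

In this sketch the coarsened records reproduce the overall mean exactly, while the synthetic mean only approximates it. A real evaluation would compare far richer analyses than a single average, which is why the comparison on actual research projects matters.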

Scientists currently are allowed access to census microdata through nine Research Data Centers (RDCs) -- one located at CISER -- that provide encrypted links to confidential databases in physically secured settings. To use the RDCs, researchers must meet confidentiality requirements and demonstrate that their research will produce results meeting the missions of the various government agencies, especially the Census Bureau. Under the new grant, some of these researchers will be asked to apply their statistical analyses both to synthetic data and to the underlying real data, and the results will be compared.

One of the goals of the project is to create "virtual RDCs," which would provide access to synthetic data over the Internet.

A key component of the project, so far only partially funded by the grant, will be the creation at the Census Bureau of a new 256-processor parallel computer, supported by Intel, Unisys and SAS Institute, to be used for the creation of synthetic data and evaluation of the proposed products.

Source: Eurekalert & others

Last reviewed: By John M. Grohol, Psy.D. on 21 Feb 2009