Abstract
A continuing need in the contexts of homeland security, national defense, and counterterrorism is for statistical analyses that "integrate" data stored in multiple, distributed databases. There is some belief, for example, that integration of data from flight schools, airlines, credit card issuers, immigration records, and other sources might have prevented the terrorist attacks of September 11, 2001, or might be able to prevent recurrences. Proposals for large-scale integration of multiple databases face significant technical obstacles, not the least of which is poor data quality [KSS01, KSB05], and they have also engendered significant public opposition. Indeed, the outcry has been so strong that some plans have been modified or even abandoned.

The political opposition to "mining" distributed databases centers on deep, if not entirely precise, concerns about the privacy of database subjects and, to a lesser extent, database owners. The latter is an issue, for example, for databases of credit card transactions or airline ticket purchases. Integrating the data without protecting ownership could be problematic for all parties: the companies would reveal who their customers are, and the companies of which a given person is a customer would also be revealed.

For many analyses, however, it is not actually necessary to integrate the data. Instead, as we show in this paper, using techniques from computer science known generically as secure multiparty computation, the database holders can share analysis-specific sufficient statistics anonymously, yet in a way that still allows the desired analysis to be performed in a principled manner. If the sole concern is protecting the source rather than the content of data elements, it is even possible to share the data themselves, in which case any analysis can be performed.

238 Alan F. Karr et al.

The same need arises in nonsecurity settings as well, especially scientific and policy investigations.
For example, a regression analysis on integrated state databases about factors influencing student performance would be more insightful than individual analyses, or complementary to them. Yet another setting is proprietary data; pharmaceutical companies might all benefit, for example, from a statistical analysis of their combined chemical libraries, but do not wish to reveal which chemicals are in the libraries [KFL05].

The barriers to integrating databases are numerous. One is confidentiality; the database holders, which we term "agencies," almost always wish to protect the identities of their data subjects. Another is regulation; agencies such as the Census Bureau (CB) and Bureau of Labor Statistics (BLS) are largely forbidden by law to share their data, even with each other, let alone with a trusted third party. A third is scale; despite advances in networking technology, there are few ways to move a terabyte of data from point A today to point B tomorrow.

In this paper we focus on linear regression and related analyses. The regression setting is important because of its prediction aspect; for example, vulnerable critical infrastructure components might be identified using a regression model. We begin in Sect. 2 with background on data confidentiality and on secure multiparty computation. Linear regression is treated for "horizontally partitioned data" in Sect. 3 and for "vertically partitioned data" in Sect. 4. Two methods for secure data integration and an application to secure contingency tables appear in Sect. 5, and conclusions are given in Sect. 6.

Various assumptions are possible about the participating parties, for example, whether they use "correct" values in the computations, follow computational protocols, or collude against one another. The setting in this paper is that of agencies that wish to cooperate while preserving the privacy of their individual databases.
While each agency can "subtract" its own contribution from integrated computations, it should not be able to identify the other agencies' contributions. Thus, for example, if data are pooled, an agency can of course recognize data elements that are not its own, but should not be able to determine which other agency owns them. In addition, we assume that the agencies are "semihonest": each follows the agreed-on computational protocols, but may retain the results of intermediate computations.
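
To make the idea of sharing analysis-specific sufficient statistics concrete, the following is a minimal sketch rather than the method of the chapter. It assumes a ring-style masked summation as the secure multiparty primitive, a single predictor, and horizontally partitioned data in which each agency holds its own rows of (x, y) pairs. The names `secure_sum`, `local_statistics`, `pooled_regression`, `MODULUS`, and `SCALE` are all hypothetical, and the whole protocol is simulated in one process.

```python
import random

# Illustrative assumptions, not values taken from the chapter.
MODULUS = 2**64   # must exceed the magnitude of any encoded total
SCALE = 10**6     # fixed-point precision for real-valued statistics


def encode(value):
    """Map a real number to a fixed-point residue modulo MODULUS."""
    return round(value * SCALE) % MODULUS


def decode(residue):
    """Invert encode(), reading large residues as negative numbers."""
    signed = residue - MODULUS if residue >= MODULUS // 2 else residue
    return signed / SCALE


def secure_sum(local_values, rng):
    """Simulate masked summation around a ring of semihonest agencies.

    The first agency adds a random mask to its value and passes the
    total on; every later agency sees only a masked running total.
    The first agency removes the mask at the end.
    """
    mask = rng.randrange(MODULUS)
    running = (mask + encode(local_values[0])) % MODULUS
    for value in local_values[1:]:
        running = (running + encode(value)) % MODULUS
    return decode((running - mask) % MODULUS)


def local_statistics(rows):
    """Sufficient statistics for simple regression on one agency's rows."""
    return [
        len(rows),
        sum(x for x, _ in rows),
        sum(y for _, y in rows),
        sum(x * x for x, _ in rows),
        sum(x * y for x, y in rows),
    ]


def pooled_regression(agency_rows, rng):
    """Fit y = intercept + slope * x from securely summed statistics."""
    stats = [local_statistics(rows) for rows in agency_rows]
    n, sx, sy, sxx, sxy = (
        secure_sum([s[k] for s in stats], rng) for k in range(5)
    )
    slope = (n * sxy - sx * sy) / (n * sxx - sx * sx)
    intercept = (sy - slope * sx) / n
    return intercept, slope


# Three hypothetical agencies whose rows all satisfy y = 1 + 2x.
agency_rows = [[(0, 1), (1, 3)], [(2, 5)], [(3, 7), (4, 9)]]
intercept, slope = pooled_regression(agency_rows, random.Random(0))
```

Only the five pooled totals are revealed, so each agency can subtract its own contribution but cannot attribute the remainder to a particular agency. The seeded `random.Random` is used here for reproducibility; an actual deployment would presumably draw the mask from a cryptographically secure source, and this simple ring offers no protection against colluding neighbors.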
Original language | English (US) |
---|---|
Title of host publication | Statistical Methods in Counterterrorism |
Subtitle of host publication | Game Theory, Modeling, Syndromic Surveillance, and Biometric Authentication |
Publisher | Springer New York |
Pages | 237-261 |
Number of pages | 25 |
ISBN (Print) | 0387329048, 9780387329048 |
DOIs | |
State | Published - 2006 |
Externally published | Yes |
All Science Journal Classification (ASJC) codes
- Mathematics(all)