Authors: Md. Manzoor Murshed and Jeffrey W. Ohlmann
Database privacy, or disclosure control, is the practice of publishing anonymized data about individuals in such a way that sensitive information about them cannot be revealed. In privacy-preserving data publishing, both the privacy and the utility of the data are important. Although data utility for the research community is the main reason for data publishing, the consequences of privacy violations make data privacy an important and urgent research topic. It is a popular area of computer science research as well as a difficult problem. The major challenge is to store personal sensitive information in public databases in a manner that balances society's needs while guaranteeing the privacy of each individual whose data are in the database.
Data publishing also needs to address the reason for data collection, the verification of security mechanisms, the protection of each individual's information, and the consequences of disclosing or modifying private information. Several interesting solutions to this problem already exist in the literature. Sweeney proposed the k-anonymity principle. The main goal of k-anonymization is to hide every individual in a group of size at least k with respect to the non-sensitive (quasi-identifier) attributes, so that linking records to other databases and identifying someone becomes difficult. Although k-anonymization by suppressing and generalizing cells in the table is NP-hard, several approximation algorithms have been proposed as solutions. A good k-anonymization not only ensures anonymity but also minimizes the information loss resulting from the generalization and suppression used to guarantee it.
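To make the k-anonymity condition concrete, the following sketch checks whether a table satisfies it: every combination of quasi-identifier values must occur in at least k records. The attribute names, toy records, and generalized values (age ranges, truncated ZIP codes) are illustrative assumptions, not data from this paper.

```python
from collections import Counter

def is_k_anonymous(records, quasi_identifiers, k):
    """Return True if every quasi-identifier value combination
    appears in at least k records (the k-anonymity condition)."""
    groups = Counter(tuple(r[a] for a in quasi_identifiers) for r in records)
    return all(count >= k for count in groups.values())

# Hypothetical toy table: age generalized to ranges, ZIP codes truncated.
records = [
    {"age": "20-30", "zip": "537**", "disease": "flu"},
    {"age": "20-30", "zip": "537**", "disease": "cold"},
    {"age": "30-40", "zip": "541**", "disease": "flu"},
    {"age": "30-40", "zip": "541**", "disease": "cancer"},
]
print(is_k_anonymous(records, ["age", "zip"], 2))  # True: each group has size 2
```

Note that coarser generalization makes the check easier to pass but loses more information, which is exactly the trade-off the approximation algorithms target.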
Clustering similar records and anonymizing them together can minimize information loss. In this paper we propose an efficient clustering technique for k-anonymization that minimizes information loss while still guaranteeing high-quality data for data mining and related research. The main idea behind the proposed algorithm is to group similar and logically related records into the same cluster as much as possible, which naturally reduces information loss during generalization. We propose a modified one-pass k-means algorithm (MOKA) that runs in O(n²/k) time. Initial experimental results show improved performance over the k-means algorithm.
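The one-pass idea underlying this family of algorithms can be sketched as follows: seed the centroids, then assign each remaining record to its nearest centroid and update that centroid incrementally, in a single scan of the data. This is a generic single-pass k-means sketch under assumed numeric attributes, not the authors' MOKA algorithm itself.

```python
def one_pass_kmeans(points, k):
    """Single-pass k-means sketch: seed centroids with the first k
    points, then assign each remaining point to its nearest centroid
    and update that centroid as a running mean."""
    centroids = [list(p) for p in points[:k]]
    counts = [1] * k
    clusters = [[p] for p in points[:k]]
    for p in points[k:]:
        # Nearest centroid by squared Euclidean distance.
        j = min(range(k),
                key=lambda i: sum((a - b) ** 2
                                  for a, b in zip(p, centroids[i])))
        counts[j] += 1
        # Incremental mean update: c <- c + (p - c) / n
        for d in range(len(p)):
            centroids[j][d] += (p[d] - centroids[j][d]) / counts[j]
        clusters[j].append(p)
    return clusters

# Hypothetical 2-D records; each cluster would then be generalized together.
pts = [(1, 1), (9, 9), (1, 2), (8, 9), (2, 1), (9, 8)]
groups = one_pass_kmeans(pts, 2)
```

Because each of the n points is compared against k centroids exactly once, a single pass costs O(nk) distance computations; the O(n²/k) bound stated above for MOKA follows from the paper's own analysis and choice of k, which this sketch does not reproduce.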