Robust Classification for Large Datasets

J. Saketha Nath, Aharon Ben-Tal and Chiranjib Bhattacharyya



RCB-SOCP is a clustering-based algorithm for classifying large datasets [1,2].

The steps involved are:

  1. Cluster the positive and negative training data points efficiently using any online clustering algorithm, then estimate each cluster's moments up to second order (mean and variance).

  2. Solve one of the CB-SOCP, RCB1-SOCP, or RCB2-SOCP formulations [1], which use both the mean and the variance of the training data clusters to build the discriminating hyperplane.

  3. Once the discriminating hyperplane w'x - b = 0 is built, the label of a new test example x_t is sign(w'x_t - b).
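The first and last steps can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the code used in [1]: the "leader" clustering rule with a fixed radius is a simple stand-in for whichever online clustering algorithm is chosen, variance is tracked per dimension, and the names ClusterMoments, leader_cluster and predict are hypothetical. Solving the SOCP formulations themselves needs a cone solver and is not shown; w and b are assumed to come from that step.

```python
import math


class ClusterMoments:
    """Running mean and per-dimension variance of one cluster (Welford update)."""

    def __init__(self, dim):
        self.n = 0
        self.mean = [0.0] * dim
        self.m2 = [0.0] * dim  # running sum of squared deviations from the mean

    def add(self, x):
        """Fold one data point into the cluster statistics in a single pass."""
        self.n += 1
        for i, xi in enumerate(x):
            delta = xi - self.mean[i]
            self.mean[i] += delta / self.n
            self.m2[i] += delta * (xi - self.mean[i])

    def variance(self):
        """Population variance per dimension (zeros for an empty cluster)."""
        if self.n == 0:
            return [0.0] * len(self.m2)
        return [m / self.n for m in self.m2]


def leader_cluster(points, radius):
    """Assumed stand-in online clustering: assign each point to the nearest
    cluster whose mean lies within `radius`, otherwise start a new cluster."""
    clusters = []
    for x in points:
        best, best_dist = None, radius
        for c in clusters:
            dist = math.sqrt(sum((xi - mi) ** 2 for xi, mi in zip(x, c.mean)))
            if dist <= best_dist:
                best, best_dist = c, dist
        if best is None:
            best = ClusterMoments(len(x))
            clusters.append(best)
        best.add(x)
    return clusters


def predict(w, b, x_t):
    """Label of a test example: sign(w'x_t - b).
    A score of exactly zero is mapped to +1 here (an arbitrary choice)."""
    score = sum(wi * xi for wi, xi in zip(w, x_t)) - b
    return 1 if score >= 0 else -1


# Cluster one class of points, then classify with a hand-picked hyperplane.
clusters = leader_cluster([(0.0, 0.0), (0.0, 2.0), (10.0, 10.0)], radius=3.0)
for c in clusters:
    print(c.n, c.mean, c.variance())
print(predict([1.0, 1.0], 5.0, (10.0, 10.0)))
```

In practice the clustering would be run once on the positive points and once on the negative points, and the resulting per-cluster means and variances would be the inputs to the chosen SOCP formulation.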

The synthetic dataset D used in our scalability experiments [1] is available here. The scripts used to generate the synthetic datasets D1 and D2 can be downloaded from here.

References:

  1. J. Saketha Nath, Aharon Ben-Tal and Chiranjib Bhattacharyya. Robust Classification for Large Datasets. Submitted to JMLR, 2008.

  2. J. Saketha Nath, Chiranjib Bhattacharyya and M. N. Murty. Clustering based Large Margin Classification: A Scalable Approach using SOCP Formulation. Proceedings of the 12th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, 2006.

  3. Rashmin B, J. Saketha Nath, Krishnan S, Sivaramakrishnan, Chiranjib Bhattacharyya and M. N. Murty. Focused Crawling with Scalable Ordinal Regression Solvers. Proceedings of the International Conference on Machine Learning, 2007.

  4. Tian Zhang, Raghu Ramakrishnan and Miron Livny. BIRCH: An Efficient Data Clustering Method for Very Large Databases. Proceedings of the 1996 ACM SIGMOD International Conference on Management of Data, 1996.

  5. Jos F. Sturm. Using SeDuMi 1.02, a Matlab Toolbox for Optimization over Symmetric Cones. Available at http://www.optimization-online.org/DB_HTML/2001/10/395.html. Software can be downloaded from http://sedumi.mcmaster.ca/.