RCB-SOCP is a clustering-based algorithm for classification of large datasets [1,2]. The key features are:
The training time complexity is O(m), where m is the number of training examples.
The training data need not be stored in memory.
It uses moments up to second order (the mean and variance) of the clusters to build an optimal classifier, and hence performs better than methods that use only mean information.
It is robust to moment estimation errors, and hence can be used with any online clustering algorithm. The scalability of RCB-SOCP can be improved by choosing a faster clustering algorithm.
It can be extended to non-linear classifiers.
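As an illustration of why the training data need not be kept in memory, the sketch below maintains a cluster's mean and variance from running sums alone, in the spirit of the BIRCH clustering feature (count, linear sum, sum of squares). The class name `ClusterSummary` and its methods are hypothetical and are not taken from the released code. Here "variance" is assumed to mean the average squared distance of the points from the cluster mean.

```python
from typing import List, Sequence


class ClusterSummary:
    """Running summary of one cluster: count, linear sum, sum of squared norms.

    Hypothetical helper for illustration only; each point is seen once and
    then discarded, so memory use does not grow with the number of points.
    """

    def __init__(self, dim: int) -> None:
        self.n = 0                      # number of points absorbed so far
        self.linear_sum = [0.0] * dim   # coordinate-wise sum of the points
        self.square_sum = 0.0           # sum of squared Euclidean norms

    def add(self, x: Sequence[float]) -> None:
        """Absorb one point into the summary."""
        self.n += 1
        for i, v in enumerate(x):
            self.linear_sum[i] += v
        self.square_sum += sum(v * v for v in x)

    def mean(self) -> List[float]:
        """Cluster mean, recovered from the linear sum."""
        return [s / self.n for s in self.linear_sum]

    def variance(self) -> float:
        """Average squared distance of the points from the cluster mean."""
        mu = self.mean()
        # E||x||^2 - ||mu||^2, clipped at zero to absorb rounding error.
        return max(self.square_sum / self.n - sum(m * m for m in mu), 0.0)
```

For example, after adding the points (0, 0) and (2, 0), `mean()` gives the midpoint (1, 0) and `variance()` gives 1, since each point lies at distance 1 from the mean.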
The steps involved are:
Cluster the positive and negative training data points efficiently using any online clustering algorithm, and estimate the moments of each cluster: mean and variance.
We used the following BIRCH program for clustering in our experiments: birch.tgz. This is a slightly modified version of the original program by .
Solve the CB-SOCP/RCB1-SOCP/RCB2-SOCP formulations, which use both the mean and variance of the training data clusters to build the discriminating hyperplane.
We implemented the above formulations using SeDuMi for Matlab. The code is available here. See the README file for details.
Once the discriminating hyperplane w'x-b=0 is built, the label of a new test example x_t is sign(w'(x_t)-b).
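The final step can be sketched in a few lines. The function name `predict` is hypothetical, and the weight vector `w` and offset `b` are assumed to come from one of the solved formulations. The sketch also assumes that a point lying exactly on the hyperplane is labeled +1, a tie-breaking choice that the text above does not specify.

```python
from typing import Sequence


def predict(w: Sequence[float], b: float, x_t: Sequence[float]) -> int:
    """Return sign(w'x_t - b) as +1 or -1 for a single test example.

    Assumption: a score of exactly zero is mapped to +1.
    """
    if len(w) != len(x_t):
        raise ValueError("w and x_t must have the same dimension")
    score = sum(wi * xi for wi, xi in zip(w, x_t)) - b
    return 1 if score >= 0 else -1
```

With `w = [1.0, -1.0]` and `b = 0.5`, the point `[2.0, 0.0]` has score 1.5 and is labeled +1, while `[0.0, 2.0]` has score -2.5 and is labeled -1.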
The synthetic dataset D used in our scalability experiments is available here. The scripts used to generate the synthetic datasets D1 and D2 can be downloaded from here.
[1] J. Saketha Nath, Aharon Ben-Tal and Chiranjib Bhattacharyya. Robust Classification for Large Datasets. Submitted to JMLR, 2008.
[2] J. Saketha Nath, Chiranjib Bhattacharyya and M. N. Murty. Clustering based Large Margin Classification: A Scalable Approach using SOCP Formulation. Proceedings of the 12th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, 2006. [PDF]
[3] Rashmin B, J. Saketha Nath, Krishnan S, Sivaramakrishnan, Chiranjib Bhattacharyya, M N Murty. Focused Crawling with Scalable Ordinal Regression Solvers. Proceedings of the International Conference on Machine Learning, 2007. [PDF]
[4] Tian Zhang, Raghu Ramakrishnan, and Miron Livny. BIRCH: An Efficient Data Clustering Method for Very Large Databases. In Proceedings of the 1996 ACM SIGMOD International Conference on Management of Data, 1996.
[5] Jos F. Sturm. Using SeDuMi 1.02, a Matlab Toolbox for Optimization over Symmetric Cones. Available at http://www.optimization-online.org/DB_HTML/2001/10/395.html. Software can be downloaded from http://sedumi.mcmaster.ca/.