Learning Hardware-Friendly Classifiers Through Algorithmic Stability

Published: 29 January 2016

Abstract

Most state-of-the-art machine-learning (ML) algorithms do not account for the computational constraints of running the learned model on embedded devices, such as the limited depth of the arithmetic unit, the available memory, or the battery capacity. We propose a new learning framework, Algorithmic Risk Minimization (ARM), which relies on algorithmic stability and incorporates these constraints into the learning process itself. ARM makes it possible to train advanced resource-sparing ML models and to deploy them efficiently on smart embedded systems. Finally, we show the advantages of our proposal on a smartphone-based Human Activity Recognition application by comparing it with a conventional ML approach.
