DOI: 10.1145/1645953.1646301
poster

Stochastic gradient boosted distributed decision trees

Published: 02 November 2009

Abstract

Stochastic Gradient Boosted Decision Trees (GBDT) is one of the most widely used learning algorithms in machine learning today. It is adaptable, easy to interpret, and produces highly accurate models. However, most current implementations are computationally expensive and require all training data to be in main memory. As training data grows ever larger, there is motivation to parallelize the GBDT algorithm. Parallelizing decision tree training is intuitive, and various approaches have been explored in the existing literature. Stochastic boosting, on the other hand, is inherently a sequential process and has not been applied to distributed decision trees. In this work, we present two distributed methods that generate exact stochastic GBDT models: the first is a MapReduce implementation, and the second uses MPI on the Hadoop grid environment.
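
To make the sequential nature of stochastic boosting concrete, the following is a minimal single-machine sketch, assuming squared-error loss and depth-1 trees (stumps). It does not reproduce the MapReduce or MPI designs described in the abstract, and all function names and default parameters (`fit_stump`, `fit_sgbdt`, `shrinkage`, `sample_rate`) are illustrative assumptions rather than the authors' code.

```python
import random

def fit_stump(X, r):
    """Least-squares regression stump (one split) fitted to residuals r."""
    n, d = len(X), len(X[0])
    total = sum(r)
    best = None  # (gain, feature, threshold, left_value, right_value)
    for j in range(d):
        order = sorted(range(n), key=lambda i: X[i][j])
        left_sum = 0.0
        for k in range(1, n):
            left_sum += r[order[k - 1]]
            lo, hi = X[order[k - 1]][j], X[order[k]][j]
            if lo == hi:
                continue  # cannot split between equal feature values
            right_sum = total - left_sum
            # Maximising this quantity is equivalent to minimising squared error.
            gain = left_sum ** 2 / k + right_sum ** 2 / (n - k)
            if best is None or gain > best[0]:
                best = (gain, j, (lo + hi) / 2, left_sum / k, right_sum / (n - k))
    if best is None:  # all sampled rows identical: fall back to a constant
        mean = total / n
        return (0, float("inf"), mean, mean)
    return best[1:]

def predict_stump(stump, x):
    j, threshold, left, right = stump
    return left if x[j] <= threshold else right

def fit_sgbdt(X, y, n_trees=50, shrinkage=0.1, sample_rate=0.5, seed=0):
    """Stochastic gradient boosting: each tree is fit on a random subsample."""
    rng = random.Random(seed)
    n = len(X)
    base = sum(y) / n
    pred = [base] * n
    trees = []
    for _ in range(n_trees):
        # Residuals (negative gradient of squared loss) depend on every
        # earlier tree, so the boosting rounds must run one after another.
        idx = rng.sample(range(n), max(2, int(sample_rate * n)))
        stump = fit_stump([X[i] for i in idx], [y[i] - pred[i] for i in idx])
        trees.append(stump)
        for i in range(n):
            pred[i] += shrinkage * predict_stump(stump, X[i])
    return base, shrinkage, trees

def predict(model, x):
    base, shrinkage, trees = model
    return base + shrinkage * sum(predict_stump(t, x) for t in trees)

def mse(model, X, y):
    return sum((predict(model, x) - t) ** 2 for x, t in zip(X, y)) / len(y)
```

In this sketch the work inside one round (the split search in `fit_stump`) is the part that lends itself to parallelization, while the outer loop in `fit_sgbdt` cannot start a round until the previous tree has updated the predictions.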

Published In

CIKM '09: Proceedings of the 18th ACM conference on Information and knowledge management
November 2009
2162 pages
ISBN:9781605585123
DOI:10.1145/1645953

Publisher

Association for Computing Machinery

New York, NY, United States

Author Tags

  1. decision trees
  2. distributed learning
  3. gradient boosting
  4. hadoop
  5. learning to rank
  6. mpi
  7. web search ranking

Qualifiers

  • Poster

Conference

CIKM '09

Acceptance Rates

Overall Acceptance Rate 1,861 of 8,427 submissions, 22%

Cited By

  • (2024) Leveraging IoT and Machine Learning for Improved Health Prediction Systems. Sustainable Science and Intelligent Technologies for Societal Development, pp. 278-306. DOI: 10.4018/979-8-3693-1186-8.ch016. Online publication date: 5-Jan-2024
  • (2024) Artificial Intelligence in Identifying Patients With Undiagnosed Nonalcoholic Steatohepatitis. Journal of Health Economics and Outcomes Research, pp. 86-94. DOI: 10.36469/jheor.2024.123645. Online publication date: 25-Sep-2024
  • (2024) Artificial Intelligence in Identifying Patients With Undiagnosed Nonalcoholic Steatohepatitis. Journal of Health Economics and Outcomes Research 11:2. DOI: 10.36469/001c.123645. Online publication date: 25-Sep-2024
  • (2024) BPCoach: Exploring Hero Drafting in Professional MOBA Tournaments via Visual Analytics. Proceedings of the ACM on Human-Computer Interaction 8:CSCW1, pp. 1-31. DOI: 10.1145/3637303. Online publication date: 26-Apr-2024
  • (2024) Blaze: Holistic Caching for Iterative Data Processing. Proceedings of the Nineteenth European Conference on Computer Systems, pp. 370-386. DOI: 10.1145/3627703.3629558. Online publication date: 22-Apr-2024
  • (2024) Multi-Modal Traumatic Brain Injury Prognosis via Structure-Aware Field-Wise Learning. IEEE Transactions on Knowledge and Data Engineering 36:8, pp. 4089-4100. DOI: 10.1109/TKDE.2024.3364385. Online publication date: Aug-2024
  • (2024) Automatic Accident Detection, Segmentation and Duration Prediction Using Machine Learning. IEEE Transactions on Intelligent Transportation Systems 25:2, pp. 1547-1568. DOI: 10.1109/TITS.2023.3323636. Online publication date: Feb-2024
  • (2024) Reinforcement Learning with Neighborhood Search and Self-learned Rules. 2024 IEEE 13th Data Driven Control and Learning Systems Conference (DDCLS), pp. 1474-1481. DOI: 10.1109/DDCLS61622.2024.10606856. Online publication date: 17-May-2024
  • (2024) A Method for Predicting the Film Thickness of IC Deposited Films Based on FCBF-CATBOOST. 2024 Conference of Science and Technology for Integrated Circuits (CSTIC), pp. 1-3. DOI: 10.1109/CSTIC61820.2024.10531926. Online publication date: 17-Mar-2024
  • (2024) STuning-DL: Model-Driven Autotuning of Sparse GPU Kernels for Deep Learning. IEEE Access 12, pp. 70581-70599. DOI: 10.1109/ACCESS.2024.3402326. Online publication date: 2024
