An Online Tree-Based Approach for Mining Non-Stationary High-Speed Data Streams
Keywords: Data stream, decision tree, incremental learning, machine learning, online learning
Abstract

This paper presents a new learning algorithm for inducing decision trees from data streams. In these domains, large amounts of data arrive continuously over time, possibly at high speed. The proposed algorithm builds trees by top-down induction, recursively splitting leaf nodes until none of them can be expanded, and it combines two split methods. The first method guarantees, with statistical significance, that each chosen split is the same one that would be chosen with infinite examples; it thereby aims at keeping the tree induced online close to the optimal model. However, this method often requires too many examples to decide on the best split, which delays the accuracy improvement of the online predictive model. The second method is therefore used to split nodes more quickly, speeding up tree growth; it is based on the observation that larger trees can store more information about the training examples and represent more complex concepts. When sufficient evidence has accumulated, the first method is also used to correct splits previously suggested by the second. Finally, an additional procedure rebuilds the tree model according to the suggestions made with an adequate level of statistical significance. The proposed algorithm is empirically compared with several well-known algorithms for inducing decision trees from data streams; on various synthetic and real-world datasets, it proves competitive in both accuracy and model size.
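The abstract's first split method, which guarantees that the chosen split matches the one that would be chosen with infinite examples, can be illustrated with a minimal sketch in the spirit of Hoeffding-bound split tests commonly used for stream decision trees. The function names, the confidence parameter `delta`, and the tie-breaking threshold are illustrative assumptions, not the paper's exact formulation:

```python
import math

def hoeffding_bound(value_range, delta, n):
    """Deviation bound for the mean of n i.i.d. observations whose
    values lie in an interval of width value_range: with probability
    1 - delta, the true mean is within this distance of the sample mean."""
    return math.sqrt(value_range ** 2 * math.log(1.0 / delta) / (2.0 * n))

def should_split(best_gain, second_gain, value_range, delta, n, tie_threshold=0.05):
    """Decide, with statistical significance, whether the best split
    candidate at a leaf would also be the best with infinite examples.

    Split when the observed gain advantage of the best attribute over the
    runner-up exceeds the bound, or when the bound is so small that the
    two candidates are effectively tied."""
    eps = hoeffding_bound(value_range, delta, n)
    return (best_gain - second_gain > eps) or (eps < tie_threshold)
```

This illustrates why the statistically-safe method can be slow: the bound shrinks only as the square root of the number of examples seen at the leaf, so a close race between two attributes may keep the leaf from splitting for a long time, which is exactly the delay the paper's second, faster split method is designed to avoid.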