TY  - JOUR
AU  - Ortiz Díaz, Agustín Alejandro
AU  - Frías Blanco, Isvani Inocencio
AU  - Palomino Mariño, Laura María
AU  - Baldo, Fabiano
PY  - 2020/01/15
Y2  - 2024/03/29
TI  - An Online Tree-Based Approach for Mining Non-Stationary High-Speed Data Streams
JF  - Revista de Informática Teórica e Aplicada
JA  - RITA
VL  - 27
IS  - 1
SE  - Regular Papers
DO  - 10.22456/2175-2745.90822
UR  - https://seer.ufrgs.br/index.php/rita/article/view/RITA_27_V1_36
SP  - 36-47
AB  - This paper presents a new learning algorithm for inducing decision trees from data streams. In these domains, large amounts of data arrive continuously over time, possibly at high speed. The proposed algorithm uses a top-down induction method for building trees, splitting leaf nodes recursively until none of them can be expanded. The new algorithm combines two split methods in the tree induction. The first method is able to guarantee, with statistical significance, that each split chosen is the same as the one that would be chosen using infinitely many examples. By doing so, it aims at ensuring that the tree induced online is close to the optimal model. However, this split method often needs too many examples to make a decision about the best split, which delays the accuracy improvement of the online predictive learning model. Therefore, the second method is used to split nodes more quickly, speeding up the tree growth. The second split method is based on the observation that larger trees are able to store more information about the training examples and to represent more complex concepts. The first split method is also used to correct splits previously suggested by the second one, once it has sufficient evidence. Finally, an additional procedure rebuilds the tree model according to the suggestions made with an adequate level of statistical significance. The proposed algorithm is empirically compared with several well-known induction algorithms for learning decision trees from data streams. In tests on various synthetic and real-world datasets, the proposed algorithm is more competitive in terms of accuracy and model size.
ER  -