Biclustering and coclustering: concepts, algorithms and viability for text mining

Alexandra Katiuska Ramos Diaz; Sarajane Marques Peres

doi:10.22456/2175-2745.89063

Authors

Alexandra Katiuska Ramos Diaz Universidade de São Paulo
Sarajane Marques Peres Universidade de São Paulo http://orcid.org/0000-0003-3551-6480

DOI:

https://doi.org/10.22456/2175-2745.89063

Keywords:

Coagrupamento, Biagrupamento, Mineração de Textos

Abstract

Biclustering and coclustering are data mining tasks capable of extracting relevant information from data by applying similarity criteria simultaneously to rows and columns of data matrices. Algorithms used to accomplish these tasks simultaneously cluster objects and attributes, enabling the discovery of biclusters or coclusters. Although similar, the natures and aims of these tasks are different, and coclustering can be seen as a generalization of biclustering. An accurate study on algorithms related to biclustering and coclustering is essential to achieve effectiveness when solving real-world problems. Determining the values appropriate for the parameters of these algorithms is even more difficult when complex real-world data are analyzed. For example, when biclustering or coclustering is applied to textual data (i.e., in text mining), a representation through a vector space model is required. Such representation usually generates vector spaces with a high number of dimensions and high sparsity, which influences the performance of many algorithms. This tutorial aims to didactically present concepts related to the biclustering and coclustering tasks and how two basic algorithms address these concepts. In addition, experiments are presented in data contexts with a high number of dimensions and high sparsity, represented by both a synthetic dataset and a corpus of real-world news. In general and comparative terms, the results obtained show the algorithm used for coclustering (i.e., NBVD) as the most appropriate for the experiments’ context. Although the biclustering algorithm (i.e., Cheng and Church) was responsible for producing less relevant results in textual data than NBVD, its application in data with a high number of dimensions and high sparsity provided a suitable study environment to understand its operation.

Downloads

Download data is not yet available.

Author Biography

Sarajane Marques Peres, Universidade de São Paulo

Sarajane Marques Peres possui graduação em Ciência da Computação, pela Universidade Estadual de Maringá (1996), mestrado em Engenharia de Produção pela Universidade Federal de Santa Catarina(1999) e doutorado em Engenharia Elétrica pela Universidade Estadual de Campinas (2006). Atualmente é professora e pesquisadora na Universidade de São Paulo. Tem interesse de pesquisa em Inteligência Artificial, atuando principalmente nos temas: Redes Neurais Artificiais e Reconhecimento de Padrões.

References

FELDMAN, R.; SANGER, J. The text mining handbook: advanced approaches in analyzing unstructured data. New York: Cambridge University Press, 2007.

SILVA, L. A. da; PERES, S.; BOSCARIOLI, C. Introdução À Mineração de Dados - Com Aplicação Em R. São Paulo, Brasil: Elsevier Editora Ltda., 2016.

MADEIRA, S. C.; OLIVEIRA, A. L. Biclustering algorithms for biological data analysis: a survey. IEEE/ACM Transactions on Computational Biology and Bioinformatics, IEEE Computer Society Press, v. 1, n. 1, p. 24–45, 2004.

NADIF, M. et al. Co-Clustering: Models, Algorithms and Applications. London: John Wiley & Sons, 2013.

XUAN, J. et al. Doubly nonparametric sparse nonnegative matrix factorization based on dependent Indian buffet processes. IEEE Transactions on Neural Networks and Learning Systems, IEEE, v. 29, n. 5, p. 1835–1849, 2018.

HUANG, S.; XU, Z.; LV, J. Adaptive local structure learning for document co-clustering. Knowledge-Based Systems, Elsevier, v. 148, p. 74–84, 2018.

FRANÇA, F. O. de. A hash-based co-clustering algorithm for categorical data. Expert Systems with Applications, Elsevier, v. 64, p. 24–35, 2016.

HONDA, K. et al. MMMs-induced k-member co-clustering for k-anonymization of cooccurrence information. In: IEEE. Proceedings of the International Joint Conference on Neural Networks. Vancouver, BC, Canadá, 2016. p. 2961–2966.

AILEM, M.; ROLE, F.; NADIF, M. Sparse poisson latent block model for document clustering. IEEE Transactions on Knowledge and Data Engineering, IEEE, v. 29, n. 7, p. 1563–1576, 2017.

SALAH, A.; ROGOVSCHI, N.; NADIF, M. Stochastic co-clustering for document-term data. In: SIAM. Proceedings of the SIAM International Conference on Data Mining. Miami, Florida, USA, 2016. p. 306–314.

MEI, J.-P. et al. Large scale document categorization with fuzzy clustering. IEEE Transactions on Fuzzy Systems, IEEE, v. 25, n. 5, p. 1239–1251, 2017.

AILEM, M.; ROLE, F.; NADIF, M. Graph modularity maximization as an effective method for co-clustering text data. Knowledge-Based Systems, Elsevier, v. 109, p. 160–173, 2016.

AILEM, M.; ROLE, F.; NADIF, M. Co-clustering document-term matrices by direct maximization of graph modularity. In: ACM. Proceedings of the 24th ACM International on Conference on Information and Knowledge Management. Melbourne, VIC, Australia, 2015. p. 1807–1810.

CHENG, Y.; CHURCH, G. M. Biclustering of expression data. In: AAAI. Proceedings of the Intelligent Systems for Molecular Biology. La Jolla, San Diego, CA, USA, 2000. v. 8, n. 2000, p. 93–103.

LONG, B.; ZHANG, Z. M.; YU, P. S. Co-clustering by block value decomposition. In: ACM. Proceedings of the 11th ACM SIGKDD International Conference on Knowledge Discovery in Data Mining. Chicago, Illinois, USA, 2005. p. 635–640.

BANERJEE, A. et al. A generalized maximum entropy approach to bregman co-clustering and matrix approximation. Journal of Machine Learning Research, v. 8, n. Aug, p. 1919–1986, 2007.

ANAGNOSTOPOULOS, A.; DASGUPTA, A.; KUMAR, R. Approximation algorithms for co-clustering. In: ACM. Proceedings of the 27th ACM SIGMOD-SIGACT- SIGART Symposium on Principles of Database Systems. Vancouver, BC, Canada, 2008. p. 201–210.

PENSA, R. G.; BOULICAUT, J.-F. Constrained co-clustering of gene expression data. In: SIAM. Proceedings of the SIAM International Conference on Data Mining. Atlanta, Georgia, USA, 2008. p. 25–36.

DEODHAR, M. et al. A scalable framework for discovering coherent co-clusters in noisy data. In: ACM. Proceedings of the 26th Annual International Conference on Machine Learning. Montreal, Quebec, Canada, 2009. p. 241–248.

LEE, M. et al. Biclustering via sparse singular value decomposition. Biometrics, Wiley Online Library, v. 66, n. 4, p. 1087–1095, 2010.

DEODHAR, M.; GHOSH, J. SCOAL: A framework for simultaneous co-clustering and learning from complex data. ACM Transactions on Knowledge Discovery from Data, ACM, v. 4, n. 3, p. 11, 2010.

CHARRAD, M.; AHMED, M. B. Simultaneous clustering: A survey. In: SPRINGER. Proceedings of the International Conference on Pattern Recognition and Machine Intelligence. Moscow, Russia, 2011. p. 370–375.

ORZECHOWSKI, P.; BORYCZKO, K. Hybrid biclustering algorithms for data mining. In: SPRINGER. Proceedings of the European Conference on the Applications of Evolutionary Computation. Porto, Portugal, 2016. p. 156–168.

MIRKIN, B. Mathematical classification and clustering: From how to what and why. In: BALDERJAHN, I.; MATHAR, R.; SCHADER, M. (Ed.). Proceedings of the Classification, Data Analysis, and Data Highways. Berlin, Heidelberg: Springer Berlin Heidelberg, 1998. p. 172–181.

PAPADIMITRIOU, S.; SUN, J. Disco: Distributed co-clustering with map-reduce: A case study towards petabyte-scale end-to-end mining. In: IEEE. Proceedings of the 2008 8th IEEE International Conference on Data Mining. Las Vegas, Nevada, USA, 2008. p. 512–521.

PENSA, R. G. et al. Co-clustering numerical data under user-defined constraints. Statistical Analysis and Data Mining, Wiley Online Library, v. 3, n. 1, p. 38–55, 2010.

MUKHOPADHYAY, A.; MAULIK, U.; BANDYOPADHYAY, S. On biclustering of gene expression data. Current Bioinformatics, Bentham Science Publishers, v. 5, n. 3, p. 204–216, 2010.

HARTIGAN, J. A. Direct clustering of a data matrix. Journal of the American Statistical Association, Taylor & Francis Group, v. 67, n. 337, p. 123–129, 1972.

PONTES, B.; GIRÁLDEZ, R.; AGUILAR-RUIZ,

J. S. Biclustering on expression data: A review. Journal of Biomedical Informatics, Elsevier, v. 57, p. 163–180, 2015.

LEE, D. D.; SEUNG, H. S. Learning the parts of objects by non-negative matrix factorization. Nature, Nature Publishing Group, v. 401, n. 6755, p. 788–791, 1999.

HOFMANN, T.; PUZICHA, J.; JORDAN, M. I. Learning from dyadic data. In: NIPS. Proceedings of the Advances in Neural Information Processing Systems. Denver, Colorado, USA, 1999. p. 466–472.

DHILLON, I. S.; MALLELA, S.; MODHA, D. S. Information-theoretic co-clustering. In: ACM. Proceedings of the ninth ACM SIGKDD International Conference on Knowledge Discovery and Data Mining. [S.l.], 2003. p. 89–98.

YOO, J.; CHOI, S. Orthogonal nonnegative matrix tri-factorization for co-clustering: Multiplicative updates on stiefel manifolds. Information Processing & Management, Elsevier, v. 46, n. 5, p. 559–570, 2010.

BUONO, N. D.; PIO, G. Non-negative matrix tri-factorization for co-clustering: An analysis of the block matrix. Information Sciences, Elsevier, v. 301, p. 13–26, 2015.

IENCO, D. et al. Parameter-less co-clustering for star-structured heterogeneous data. Data Mining and Knowledge Discovery, Springer, v. 26, n. 2, p. 217–254, 2013.

HOFMANN, T. Probabilistic latent semantic analysis. In: MORGAN KAUFMANN. Proceedings of the 15th Conference on Uncertainty in Artificial Intelligence. Stockholm, Sweden, 1999. p. 289–296.

HALKIDI, M.; BATISTAKIS, Y.; VAZIRGIANNIS, M. Cluster validity methods: part I. ACM Sigmod Record, ACM, v. 31, n. 2, p. 40–45, 2002.

HALKIDI, M.; BATISTAKIS, Y.; VAZIRGIANNIS, M. Clustering validity checking methods: part II. ACM Sigmod Record, ACM, v. 31, n. 3, p. 19–27, 2002.

WANG, H. et al. Fast nonnegative matrix tri- factorization for large-scale data co-clustering. In: IJCAI/AAAI. Proceedings of the International Joint Conference on Artificial Intelligence. Barcelona, Catalonia, Spain, 2011. v. 22, n. 1, p. 1553.

BRUNIALTI, L. F. et al. The BinOvNMTF algorithm: Overlapping columns co-clustering based on non-negative matrix tri-factorization. In: SBC. Proceedings of the 2017 6th Brazilian Conference on Intelligent Systems. Uberlândia, MG, Brazil, 2017. p. 330–335.

CASTRO, P. A. de et al. Query expansion using an immune-inspired biclustering algorithm. Natural Computing, Springer, v. 9, n. 3, p. 579–602, 2010.

HALKIDI, M.; BATISTAKIS, Y.; VAZIRGIANNIS, M. On clustering validation techniques. Journal of Intelligent Information Systems, Springer, v. 17, n. 2-3, p. 107–145, 2001.

DESGRAUPES, B. Clustering indices. University of Paris Ouest-Lab ModalX, v. 1, p. 34, 2013.

SOLER, J. et al. Data clustering and similarity. In: AAAI. Proceedings of the 26th International Florida Artificial Intelligence Research Society Conference. St. Pete Beach, Florida, USA.

DAVIES, D. L.; BOULDIN, D. W. A cluster separation measure. IEEE Transactions on Pattern Analysis and Machine Intelligence, IEEE, n. 2, p. 224–227, 1979.

FRANCA, F. O. de. Biclusterização na Análise de Dados Incertos. Tese (Doutorado) — Universidade Estadual de Campinas, 2010.

TANAY, A.; SHARAN, R.; SHAMIR, R. Biclustering algorithms: A survey. Handbook of Computational Molecular Biology, Citeseer, v. 9, n. 1-20, p. 122–124, 2005.

BRUNIALTI, L. F. Fatoração de matrizes no problema de coagrupamento com sobreposição de colunas. Dissertação (Mestrado) — Escola de Artes, Ciências e Humanidades da Universidade de São Paulo, 2016.

DING, C. et al. Orthogonal nonnegative matrix t-factorizations for clustering. In: ACM. Proceedings of the 12th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining. Philadelphia, PA, USA, 2006. p. 126–135.

DIAZ, A. K. R. et al. Uma análise comparativa das ferramentas de pré-processamento de dados textuais: NLTK, PreTexT e R. Relatório Técnico: PPgSI-001/2018. Universidade de São Paulo. São Paulo, 2018.

DIAZ, A. K. R. Biagrupamento heurístico e coagrupamento baseado em fatoração de matrizes: um estudo em dados textuais. Dissertação (Mestrado) — Universidade de São Paulo, São Paulo, SP, Brasil, 2018.

PAGNOSSIM, J. L. M. Uma abordagem híbrida para sistemas de recomendação de notícias. Dissertação (Mestrado) — Escola de Artes, Ciências e Humanidades da Universidade de São Paulo, 2018.

BAEZA-YATES, R.; RIBEIRO-NETO, B. Modern Information Retrieval: The Concepts and Technology behind Search (ACM Press Books). USA: Addison-Wesley Professional, 2011.

MARTIN-FERNANDEZ, J. A.; PALAREA- ALBALADEJO, J.; OLEA, R. A. Dealing with zeros. Compositional data analysis: Theory and applications, John Wiley & Sons, Ltd Hoboken, NJ, p. 43–58, 2011.

NOURASHRAFEDDIN, S.; MILIOS, E.; ARNOLD, D. V. An evolutionary algorithm for feature selective double clustering of text documents. In: IEEE. Proceedings of the 2013 IEEE Congress on Evolutionary Computation (CEC). Cancun, Mexico, 2013. p. 446–453.

FRANCA, F. O. D. Scalable overlapping co-clustering of word-document data. In: IEEE. Proceedings of the 2012 11th International Conference on Machine Learning and Applications. Boca Raton, FL, USA, 2012. v. 1, p. 464–467.

YAN, Y.; CHEN, L.; TJHI, W.-C. Fuzzy semi-supervised co-clustering for text documents. Fuzzy Sets and Systems, Elsevier, v. 215, p. 74–89, 2013.

WANG, S.; HUANG, A. Penalized nonnegative matrix tri-factorization for co-clustering. Expert Systems with Applications, Elsevier, v. 78, p. 64–73, 2017.

ALLAB, K.; LABIOD, L.; NADIF, M. SemiNMF-PCA framework for sparse data co-clustering. In: ACM. Proceedings of the 25th ACM International on Conference on Information and Knowledge Management. Indianapolis, IN, USA, 2016. p. 347–356.

LI, Z.; WU, X. Single multiplicatively updated matrix factorization for co-clustering. In: ACM. Proceedings of the 29th Annual ACM Symposium on Applied Computing. Gyeongju, Korea, 2014. p. 97–104.

MA, H.; ZHAO, W.; SHI, Z. A nonnegative matrix factorization framework for semi-supervised document clustering with dual constraints. Knowledge and information systems, Springer, v. 36, n. 3, p. 629–651, 2013.

CASTRO, P. A. de et al. Applying biclustering to text mining: an immune-inspired approach. In: Proceedings of the Artificial Immune Systems. Santos, SP, Brazil: Springer, 2007. p. 83–94.

PRIAM, R.; NADIF, M.; GOVAERT, G. Generalized topographic block model. Neurocomputing, Elsevier, v. 173, p. 442–449, 2016.

MIMAROGLU, S.; UEHARA, K. Bit sequences and biclustering of text documents. In: IEEE. Proceedings of the 7th IEEE International Conference on Data Mining Workshops. Omaha, Nebraska, USA, 2007. p. 51–56.

GU, Q.; DING, C.; HAN, J. On trivial solution and scale transfer problems in graph regularized nmf. In: IJCAI. Proceedings of the International Joint Conference on Artificial Intelligence. Barcelona, Catalonia, Spain, 2011. v. 22, n. 1, p. 1288.

CAO, Y.; HUANG, M.; ZHU, X. Clustering sentiment phrases in product reviews by constrained co-clustering. In: SPRINGER. Proceedings of the National CCF Conference on Natural Language Processing and Chinese Computing. Nanchang, China, 2015. p. 79–89.

GARG, V. K.; CHAUDHARI, S.; NARANG, A. Multi-regularization for fuzzy co-clustering. In: SPRINGER. Proceedings of the International Conference on Neural Information Processing. Daegu, Korea, 2013. p. 67–75.

RHO, V.; PENSA, R. G. Concept-enhanced multi-view co-clustering of document data. In: SPRINGER. Proceedings of the International Symposium on Methodologies for Intelligent Systems. Warsaw, Poland, 2017. p. 457–467.

HUANG, F. et al. Semi-supervised hierarchical co-clustering. In: SPRINGER. Proceedings of the International Conference on Rough Sets and Knowledge Technology. Chengdu, China, 2012. p. 310–319.