COVID-19 and the circulation information on social networks: analysis in a Brazilian Facebook group about the Coronavirus

This article aims to quantify and qualify the information circulating in social media groups about COVID-19, the subjects covered in posts, and the possible relations with other subjects and social events, in order to generate a representative panorama of the perception of and social reaction to the coronavirus pandemic. To this end, statistical, data mining and machine learning techniques are used for the characterization, pattern detection and grouping of textual data. The experiments are carried out on a dataset of textual data extracted from a Brazilian public group about COVID-19 (SARS-CoV-2) on the social network Facebook. Statistical analyses are cross-referenced with data on the growth in the number of infected people and with specific political and social events, revealing variations and influences in terms of participation and engagement in the analyzed group. In addition, the clustering method used detects two main groups of posts, the first with content geared to governmental issues and the second to personal issues. The results also allow a reflection on the possible social impacts of the creation or absence of public policies to deal with the COVID-19 pandemic.


Introduction
In December 2019, the city of Wuhan, China, witnessed the emergence of one of the most alarming pandemics recorded in the Contemporary Age, caused by the Severe Acute Respiratory Syndrome Coronavirus 2 (SARS-CoV-2) and known as COVID-19. Identified as a zoonotic coronavirus, similar to other known viruses such as the Severe Acute Respiratory Syndrome Coronavirus (SARS-CoV) and the Middle East Respiratory Syndrome Coronavirus (MERS-CoV) (LIU et al., 2020), it surpassed one million infections worldwide in early April 2020, with more than fifty thousand deaths registered, according to data released by the World Health Organization (WHO, 2020a). According to Wu and McGoogan (2020), the three aforementioned viruses have considerably similar characteristics, presenting fever, cough and problems in the lower respiratory tract, with attenuation associated with age, as well as with underlying conditions of other diseases. However, COVID-19 was potentially more aggressive in epidemiological terms, with an exponential increase in the number of infected people.
According to WHO (2020a), the number of coronavirus cases in Italy, for example, jumped from 888 on February 29, 2020, to 105,792 on March 31, 2020. In a study by Pan et al. (2020), 21 patients with pulmonary infection caused by COVID-19 were followed, and the hospitalization period was found to range from 11 to 26 days, which, combined with the growing number of infected people, ends up leading health systems to collapse.
In Brazil, the first case of coronavirus infection was confirmed on February 26, 2020 (MELO et al., 2020), and by March 31, 2020, 5,717 infections and 201 deaths had been recorded (WHO, 2020a). In a global context, the number of infected people increases exponentially from the moment the first infections occur.
Given the pandemic scenario declared by the WHO (2020b) on March 11, 2020, several countries adopted containment policies and strategies involving social isolation, quarantine and border closure. At the same time, due mainly to the large number of people isolated in their homes, a significant increase in the use of the Internet for information and entertainment purposes was observed. A survey conducted by the German portal Statista between March 16 and 20, with a sample of 12,845 individuals aged between 16 and 64 years, distributed across different countries, found a 40% growth in the use of laptops and a 70% growth in the use of smartphones, among which there was a 44% growth in the use of social networks.
It is interesting to note that the possibilities of interaction promoted by the popularization of mobile devices connected to the Internet, in the context of Web 2.0 (O'REILLY, 2005), enabled the emergence of virtual communities and groups in which individuals come together to seek knowledge, disseminate information and discuss different topics. Such spaces are characterized by the continuous sharing of content in different presentation formats (text, image and video), and within them new forms of relationship and new approaches to common interests are built (PENNI, 2017).
Considering the naturally collective character of Web 2.0, even though there are barriers to direct interaction between two individuals on the network, access to content that is generated or shared remains possible and expected. This phenomenon clearly occurs in the context of social networks, where there is a certain freedom of expression and the identity of individuals is not verified, which ends up promoting greater participation and openness in terms of opinions, feelings, debates and the dissemination of information (CERCEL; TRAUSAN-MATU, 2014).
The environment promoted by social networks stimulates interactivity and relationships between generators and consumers of information. This interaction breaks down the barriers traditionally defined between these elements and is essential to the circulation of information (SREEJESH et al., 2020). Furthermore, the emergence of new forms of communication gives rise to scenarios where collaborative information, or even misinformation, emerges (LOGAN, 2016).

Based on this, this article analyzes the textual data of a set of posts published from January to March 2020 in a Brazilian group about the coronavirus on the social network Facebook. The central focus is to quantify and qualify the information circulating in this group, the themes dealt with and the possible relations with other subjects, generating a representative panorama of the perception of and social reaction to the COVID-19 pandemic.
The analysis also seeks to generate inputs for reflection on the possible relationships and influences of specific events on users' participation in the analyzed social group. To that end, statistical techniques are used for descriptive analysis of the data, in order to reveal the implicit patterns of the collected data, together with Doc2Vec, a method for analyzing similarity between terms and between sentences based on data mining and machine learning, in order to infer groups of posts that indicate the subjects covered. In addition, the development of the experiments follows a systematization of activities, which is of interest from the point of view of data analysis on social networks. The analysis follows a paradigm based on Big Data and algorithms, considering individuals in a performative way, from the data extracted from the social network, as described in Fisher and Mehozay (2019).

Online social networks and data analysis
The exponential growth of the Internet since the 1990s, attributed to technological development, has instituted new forms of production, dissemination and sharing of information. Social media is a class of information technologies that support interpersonal communication and collaboration through Internet-based platforms, providing the environment for the formation of dynamic structures for connecting people and interacting: social networks (KANE et al., 2014). Social networks are a set of interrelated nodes that establish bonds generally conceptualized as a social relationship ("friend of" or "boss of") or a dyadic interaction ("conversation with", "sells to", "works with").

It is interesting to note that the advent of Web 2.0 represents an important milestone in the evolution of communications, fostering an environment of media convergence and circulation of information in virtual spaces, as well as an expansion in access to and manipulation of data, driven both by existing technologies and by advances in availability and access models. In this context, every individual connected to the Internet ends up becoming a potential content generator, whatever the purpose of the connection. Belk (2014) states that the technologies that characterize Web 2.0 have allowed and enhanced the development of sharing environments and spaces.
Considering the intrinsically collective character of Web 2.0, and even where direct interaction between two individuals on the network is not possible, generated or shared content may still be accessible, since the free flow of data is natural to virtual spaces. This phenomenon can be clearly seen in the context of social networks, which, as pointed out by Lipschultz (2018), stand out for their freedom of expression and for not requiring the identification of the individual, encouraging greater openness from their users in sharing opinions and feelings, or in discussions on the network.
In addition, Web 2.0 is also reflected in the productive structure of information in virtual spaces, through the emergence of new forms of communication that are decentralized, personalized and interactive, which, enhanced by the wide and democratic access to technological devices (MCLUHAN, 1994), end up generating scenarios where the figures of citizen journalism, collaborative information, or even disinformation emerge.
The context of Web 2.0 and the possibilities of interaction promoted by the popularization of mobile devices connected to the Internet drove important changes in the ways of communicating. It is possible to observe the emergence of virtual communities (LEVY, 1997), where individuals come together to discuss different topics. In these spaces, a variety of content is shared and new ways of relating and grouping around common subjects are established. On the other hand, it is also important to note that the democratization of Internet access contrasts with a continuous movement of individualization, deterritorialization and inequality, promoted by content

In Brazil, 70% of the population (about 150 million people) has access to the Internet and, of these, 96.2% (about 144 million people) are active on social networks (KEMP, 2020). The most accessed platforms are YouTube, Facebook, WhatsApp, Instagram, Twitter and LinkedIn, in that order, and an increase of 8.2% in social network users in Brazil was reported between April 2019 and January 2020 (KEMP, 2020). In addition, between January 2020 and March 2020, there was a 50% increase in social media usage time by Brazilians (STATISTA, 2020a).
Social networks have become a field of study for research related to the organization and treatment of large amounts of data, in addition to providing an ideal environment for extracting knowledge through the application of data mining techniques. The most common structuring elements of social networks, such as user profiles, comments, updates, evaluations and metadata, are often used as data sources. Through profiles, for example, it is possible to identify people with common interests and map their relationships (ARNABOLDI et al., 2017; LI; DAS, 2020), or to check rumors, misinformation and fake news (FERRARA, 2017; HUSSAIN et al., 2018; AHSAN; KUMARI; SHARMA, 2019).
Social networks generally exhibit a rich internal structure, in which users, through their activity and involvement, define different types and intensities of interaction. As an example of the analytical potential of social networks, Tang et al. (2009) quantitatively analyzed social influence at the level of a given content, identifying representative nodes (users) in the context of the topics and the connections created from the degree of influence. The study demonstrated that some connections are characterized by high bandwidth and diversity of context, exhibiting high efficiency of information diffusion. Thus, it was possible to identify the network connections where higher-quality interactions occur and how much influence they exert in a given context.
Considering the relevance of social networks as a means of rapid information exchange when disasters occur, some studies (KEIM; NOJI, 2011; KRYVASHEYEU et al., 2016; KIM; BAE; HASTAK, 2018; DRAGOVIĆ et al., 2019; LIU; ZHANG; ZHANG, 2020) used data extraction techniques, especially on Facebook and Twitter, to analyze the information circulating on social networks in order to map human behavior and understand its impact during emergency situations, such as natural disasters, epidemics, terrorist attacks and public order demonstrations.
Understanding social networks as the new forum for collective intelligence, social convergence and community activism, Keim and Noji (2011) discuss the immediate consequences of the 2010 earthquake in Haiti with regard to the information circulating in discussion groups on MySpace and Facebook. The first finding presented in their work is that much of what people around the world were learning about the earthquake came from these sources.
In addition to sharing information, these groups were also used for donations and to offer comfort and support (psychological benefits) to vulnerable populations. Such a functional organization, observed in social networks in the context of that disaster, contrasts with the ingrained view of media as one-way transmitters of information, as in radio, TV, newspapers and magazines.
The study suggests, conclusively, that large-scale interaction on social networks can alter the way in which the world reacts to disasters, as the positive effects of this response would increase people's degree of resilience, personal and collective responsibility, decreasing risks in socioeconomic reorganization.
Despite the potentialities highlighted, the authors emphasize the importance of managing information on social networks, in order to prevent the spread of rumors that could lead to widespread panic and negative social impacts.
Considering that the dissemination of information on social networks provides situational awareness to its users, Kryvasheyeu et al. (2016) use the spatio-temporal distribution of disaster-related messages to suggest a model for real-time monitoring and evaluation of the disaster itself. The authors treated Twitter activity before, during and after Hurricane Sandy as a case study, pointing out in their results that real and perceived threats, together with the effects of physical disasters, are directly observable through the intensity and composition of the flow of Twitter messages for a wide range of disasters. They also found that user activity on Twitter is strongly correlated with the per capita economic damage inflicted by the hurricane. The authors suggest the use of data from social networks in the rapid assessment of the damage caused by a large-scale disaster.
From the perspective of mental health, disasters in general have substantial social consequences. Analyzing data from Twitter, Gruebner et al. (2018) identify emotions in space-time relations: an increased discomfort, that is, negative emotions accumulated after a disaster, compared to the emotions presented during the disaster. The study suggests that significant associations of negative emotional responses in the space and time of a natural disaster can be used in alert systems to identify regions or social groups that need attention with a view to mental health.
An important question addressed by Liu et al. (2020a) is that, despite the opportunities offered by social networks in communicating disasters compared to traditional media, freedom of opinion can result in distorted information. To assess this impact, the authors analyzed the functional structure of social networks, according to the ability to control and influence the nodes, in communicating information about disasters and in communicating the risk of disasters or threats. The results showed that the activity of nodes in small groups favors dissemination over long distances, while the communication of disaster risks is strongly dependent on the activity of key nodes (opinion leaders) and, furthermore, this performance is prone to the generation of rumors.

Methodology
Knowledge discovery from databases is of great importance and interest. In an environment dominated by Big Data, it demands the use of automated and intelligent techniques that allow the generation of strategic information for the purposes of the problem to be addressed. It is essential to execute a series of steps to guarantee the accuracy of the results, minimizing the impact of noise in the data sets. The experiments proposed in this work are guided by the process called Knowledge Discovery in Databases (KDD), proposed by Fayyad, Piatetsky-Shapiro and Smyth (1996). KDD is a process that guides the generation of information and the recognition of patterns through five steps: selection, pre-processing, transformation, data mining and interpretation. One of the main advantages of KDD is that it is both an interactive and an iterative process: it is presented in a sequential and organized way, and it allows interventions in its activities (PROVOST; FAWCETT, 2013).
The first step to be performed is the selection of the data. During this stage, the databases to be used to solve the problem are defined, as well as the actions related to the collection of these data (HAN; KAMBER; PEI, 2016). The data used come from a Brazilian public group on the online social network Facebook.

After selecting the data, the pre-processing step must be performed, in which data cleaning activities treat possible records that present noise or missing data, in order to guarantee the quality of the analyses (HAN et al., 2016). During the pre-processing phase, routines were applied to clean noise and remove data unnecessary for the analysis. We use rules based on regular expressions to remove links, non-alphabetic characters, mentions of other users (terms starting with the @ symbol), hashtags (identified by the # symbol), stop words and repeated letters (example: "I likeeee this") (SALLOUM; AL-EMRAN; SHAALAN, 2017; LI et al., 2019). Occurrences of emoticons were identified through regular expressions and replaced with the corresponding words (WANG; CASTANON, 2015).
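The cleaning rules described above can be sketched as follows. This is a minimal illustration, not the authors' code: the regular expressions, the tiny `STOP_WORDS` set and the `EMOTICONS` table are assumptions made for the example, since the article does not list the resources it used.

```python
import re

# Hypothetical, abbreviated resources (assumptions for illustration only).
STOP_WORDS = {"a", "the", "this", "of", "and", "to", "i"}
EMOTICONS = {":)": "happy", ":(": "sad"}

URL_RE = re.compile(r"https?://\S+|www\.\S+")
EMOTICON_RE = re.compile("|".join(re.escape(e) for e in EMOTICONS))
MENTION_RE = re.compile(r"@\w+")          # references to other users
HASHTAG_RE = re.compile(r"#\w+")
NON_ALPHA_RE = re.compile(r"[^\w\s]|[\d_]")  # keeps letters (incl. accented) and spaces
REPEAT_RE = re.compile(r"(\w)\1{2,}")     # three or more of the same character


def clean_post(text: str) -> str:
    """Apply the cleaning rules in sequence and return the normalized text."""
    text = URL_RE.sub(" ", text)
    # Replace each emoticon with its corresponding word.
    text = EMOTICON_RE.sub(lambda m: f" {EMOTICONS[m.group(0)]} ", text)
    text = MENTION_RE.sub(" ", text)
    text = HASHTAG_RE.sub(" ", text)
    text = NON_ALPHA_RE.sub(" ", text)
    # Collapse exaggerated repetitions: "likeeee" -> "like".
    text = REPEAT_RE.sub(r"\1", text)
    tokens = [t for t in text.lower().split() if t not in STOP_WORDS]
    return " ".join(tokens)
```

Links are removed before emoticons so that sequences such as `:/` inside a URL are not mistaken for emoticons. Collapsing a run of three or more letters to a single letter is a simplification chosen for this sketch; it would shorten words that legitimately contain a doubled letter followed by the same letter.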
The third step of the process refers to the transformation of the data into a format more suitable for the purposes of analysis. After the execution of the treatment routines, the data were transformed into two formats, based on the defined analysis purposes. First, a structured dataset in CSV (comma-separated values) format was generated, containing all the attributes of the extracted posts, which is used for the descriptive analysis. In addition, a textual corpus was generated, containing only the post id and text data, which is used by the similarity detection, grouping and content analysis methods.
Under the KDD methodology, data mining is considered the most important step, since it is mainly responsible for generating information. Data mining, according to Tan et al. (2019), refers to a set of processes aimed at the exploration and analysis of large data sets, with the purpose of uncovering patterns, associations, anomalies and forecasts. Provost and Fawcett (2013) highlight that data mining corresponds to the application of computational solutions that allow knowledge to be extracted from pre-processed databases. In general, the features of data mining can be explored under two approaches: descriptive and predictive.
Descriptive methods are aimed at characterizing, summarizing and discriminating data, while predictive methods seek inference or prediction from the analysis of databases (HAN et al., 2016). The methods of descriptive analysis applied are based on statistical, computational and information visualization techniques, aimed at surveying unidentified trends and patterns (GREENELTCH, 2019). The pre-processed data are used to generate graphs covering the temporal evolution of the number of posts, the most frequently used terms and the engagement of the posts, that is, the involvement obtained by the posts from users, given by the sum of the number of likes and comments.
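As a small illustration of the engagement measure defined above (likes plus comments), the sketch below computes the average engagement per day. The field names `date`, `likes` and `comments` are assumptions for this example and are not taken from the article's dataset.

```python
from collections import defaultdict


def mean_daily_engagement(posts):
    """Return {day: average (likes + comments) per post} for an iterable of post dicts.

    Each post is assumed to be a dict with the hypothetical keys
    "date", "likes" and "comments".
    """
    totals = defaultdict(int)
    counts = defaultdict(int)
    for post in posts:
        # Engagement of a post = number of likes + number of comments.
        totals[post["date"]] += post["likes"] + post["comments"]
        counts[post["date"]] += 1
    return {day: totals[day] / counts[day] for day in sorted(totals)}
```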
A point of fundamental importance is that, in natural language processing routines, the information generated depends not only on the terms in isolation, but also on the context in which they are found, allowing the discovery of patterns and groups of specific textual content. One of the objectives of this article is to find the possible groups of posts present in the selected sample, in order to determine the content these groups deal with. For this, we use the Natural Language Processing (NLP) technique called Doc2Vec (LE; MIKOLOV, 2014).
The Paragraph Vector method, also called Doc2Vec, was introduced by Le and Mikolov (2014) and can be described as an NLP tool for the representation of documents, being considered a generalization of the Word2Vec method (MIKOLOV et al., 2013). In general, this technique can be described as an unsupervised learning model based on distributed vector representations of the terms or words of a text. The texts in the database can therefore be of variable size, from sentences to complete documents and, in general, the vectors are trained to predict words or terms in a paragraph and thus assign a semantic representation.
Then, the method performs a mapping based on probabilities, so that words with similar meanings lie close together in the same vector space, making it possible to distinguish semantically between the words in a paragraph. Next, the method maps the paragraphs to distinct vectors, concatenating the paragraph vector with several vectors of words present in the paragraph in order to predict the next word in the context considered. In this way, the variable length of sentences, word order and semantics are taken into account. Both the word and paragraph vectors are

Next, the implemented model deals with the context. In the vector space in which the documents are mapped, the proximity between vectors represents similar usage patterns, so that words used in the same context are close to each other. This representation therefore considers the variable size of the document, the word order and the semantics. The interpretation, then, depends on the set of terms and not on the specific elements of the descriptive text.

From the trained and validated model, it is possible to perform the semantic similarity analysis of the posts, which can be visualized through a weighted graph based on the similarity matrix, illustrating the relationships between the posts and the possible detected groups. Through the segmentation of the classes found, a similarity graph is generated between the terms present in the textual corpus, based on the method presented in Bouriche (2005), considering the semantic and syntactic relations. The results are analyzed in view of the epidemiological advances of COVID-19, considering the world situation as well as the Brazilian scenario of the spread of the virus.

Results
With the process of extracting data from the group selected on the social network Facebook, 7,523 publications were obtained. Quantitative details about the collected publications are presented in Table 1. After the cleaning and preprocessing step, the sample was reduced to 5,118 publications, with a total of 64,854 comments and 301,611 likes. We apply data mining to the data set for descriptive analysis, pattern extraction and content analysis.

The results obtained through descriptive analysis are of fundamental importance for understanding the database's characteristics, as well as for detecting specific patterns and behaviors, which, in this case, may be associated with events directly linked to the COVID-19 pandemic, or with derived events, such as those linked to socio-economic issues. During the descriptive analysis, the data were normalized to allow comparison with the temporal curve of the number of people infected with COVID-19 in Brazil. This normalization avoids bias in the analysis towards variables with a higher order of magnitude. The normalization was done by dividing the daily value of the variable of interest, $x_i$, by the maximum value of this variable over the entire historical series considered, $x$, resulting in a maximum value equal to 1:

$$\hat{x}_i = \frac{x_i}{\max(x)}$$

The first analysis concerns the evolution of the daily number of posts published in the selected group (Figure 1). It is possible to observe that, although there is a relative growth in the number of posts, it does not follow the curve of the number of infections in Brazil, and even drops after the daily peak of posts. However, it is interesting to note some important phenomenological features. The selected group is the largest Brazilian public group on the subject on Facebook; however, the number of posts remained stable and at a minimum level, fewer than forty posts, between the date of creation of the group and the occurrence of the first case in Brazil, which caused the growth curve to reach one of its peaks, with a total of 263 posts in a single day.
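The normalization by the series maximum described above can be sketched in a few lines. The function name and the handling of empty or all-zero series are choices made for this example only.

```python
def max_normalize(series):
    """Divide every value by the maximum of the series, so the peak becomes 1."""
    if not series:
        return []
    peak = max(series)
    if peak == 0:
        # An all-zero series has no peak to scale by; return zeros unchanged.
        return [0.0 for _ in series]
    return [value / peak for value in series]
```

Applied to both the daily number of posts and the daily number of infections, this places the two curves on the same 0 to 1 scale, which is what allows them to be compared in a single chart.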
The peak values of daily postings present in Figure 1

March 20, 2020, corresponding to the date of approval of the public state of calamity in Brazil by the Federal Senate (BRASIL, 2020).

The number of effectively active participants, that is, those who published at least one post at some point, is 2,108 users, which corresponds to 4.5% of the total number of participants in the group. This reveals that there is considerable participation by the other users in terms of following the publications posted in the group.

With the purpose of validating the model, term vectors were generated through inference for the documents of the train-corpus. In this sense, the inference algorithm predicts the terms based on the word vectors, and these new vectors can be compared with the vectors of the trained model. Basically, in this approach, the train-corpus is treated as unknown data by the model and, once

showing the distribution between the posts (Figure 3). In the graph, each node corresponds to a post, and the edges correspond to the similarity distance between them, that is, the closer two nodes are, the more similar they are. For visualization purposes, a threshold was applied to the similarity values, and only those greater than 0.7 were considered. The application of this threshold is necessary to allow the identification of possible classes of similar elements. The value of 0.7 was defined based on empirical checks. Nodes that are far apart have a lower similarity value compared to the others, and nodes that lie between the two classes have comparable similarity values with the elements that make up each of the classes.
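The thresholding of the similarity graph can be illustrated as below. This is a sketch under assumptions: the article does not state which similarity measure was used, so cosine similarity between document vectors is assumed here, and the function names are hypothetical.

```python
import math
from itertools import combinations


def cosine(u, v):
    """Cosine similarity between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def similarity_edges(vectors, threshold=0.7):
    """Return (i, j, similarity) for every pair of posts whose similarity exceeds the threshold."""
    edges = []
    for (i, u), (j, v) in combinations(enumerate(vectors), 2):
        score = cosine(u, v)
        if score > threshold:
            edges.append((i, j, score))
    return edges
```

Each returned triple corresponds to one weighted edge of the graph; posts with no similarity above 0.7 to any other post end up as isolated nodes.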
From the identification of the elements that make up each of the groups identified through Doc2Vec, that is, the posts most similar to each other, the

For a more careful analysis of the information circulating in the group, the similarity graphs were generated. Figure 4 presents the results obtained for Class 1, where the central terms are: brazil, hospital, china, government and quarantine. Associated with these, there are terms that confirm the characteristic of content related to politics and informative news. Figure 5 shows the results obtained for Class 2. It is interesting to note that in these results the most relevant term is person, which is related to others that refer to family, food, health and personal care. The results obtained through the similarity graphs demonstrate that, in the group analyzed, there are two main patterns of content regarding COVID-19: content of an informative and general nature, focused on issues at the level of social groups, and content of a more personal character, focused on the individual's concerns, opinions and desires.
by the evolution of suggestion and automated selection algorithms (JUST; LATZER, 2017).
Facebook was launched in early 2004 and reached approximately two billion active users worldwide in 2020, with 120 million active users in Brazil (STATISTA, 2020b). In the context of the data explored in our work, the group selected for the experiments is dedicated to the dissemination of information about the coronavirus and had its first post published on January 25, 2020. The group has approximately 45 thousand active users.

During the selection phase, after defining the data sample, the first activity is data extraction. For this, a Web Scraping solution (JARMUL; LAWSON, 2017) based on the Python programming language was built. All posts from the defined time period, from January 2020 to March 2020, were extracted, including the following attributes: post id, user id, post content, post data, number of likes and number of comments. The identification data of users were neither collected nor stored, and the user id code was replaced at run time by applying the one-way cryptographic hash function Message-Digest Algorithm 5 (MD5) (RIVEST, 1992), so that the original values cannot be reconstructed. In total, 7,523 published posts were extracted, of which 68% contain textual data and 32% contain only image or video data. For the experiments, only posts with textual data were considered.
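The replacement of user ids by their MD5 digest can be sketched with Python's standard `hashlib` module. The function name and the UTF-8 encoding of the id are assumptions for this example; the article only states that MD5 was applied at run time.

```python
import hashlib


def pseudonymize_user_id(user_id: str) -> str:
    """Return the hexadecimal MD5 digest of a user id, to be stored in place of the original."""
    return hashlib.md5(user_id.encode("utf-8")).hexdigest()
```

Because the same id always maps to the same digest, posts by one user can still be grouped together (for example, to count active participants) without the original identifier being stored.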

trained using stochastic gradient descent and backpropagation (RUMELHART; HINTON; WILLIAMS, 1986). It is important to note that while paragraph vectors are unique among paragraphs, word vectors are shared (the vector of a word is the same for all paragraphs that contain that word). At prediction time, the paragraph vectors are inferred by keeping the word vectors fixed and training the new paragraph vector until convergence. The method can use different strategies for generating paragraph vectors, the main ones being (LE; MIKOLOV, 2014):
a) Distributed Memory Model of Paragraph Vectors (PV-DM): each paragraph is mapped to a unique vector, represented by a column in a matrix. Each word is also mapped to a unique vector, represented by a column in a matrix. The concatenation or mean of the paragraph vector with the word vectors is used to predict the next word in a context. The paragraph vector can be considered a pseudo-word and represents the information that is missing in the current context, acting as a memory of the topic of the paragraph in question.
b) Distributed Bag of Words version of Paragraph Vector (PV-DBOW): context words are ignored in the input, and words randomly sampled from the paragraph are predicted from the paragraph vector.
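For reference, the training objective behind PV-DM can be written as below, following the formulation in Le and Mikolov (2014). The notation comes from that paper rather than from this article: $w_1, \ldots, w_T$ is the sequence of training words, $k$ is the context window size, $W$ and $D$ are the word and paragraph matrices, $U$ and $b$ are the softmax parameters, and $h$ denotes the concatenation or average of the context vectors.

```latex
\frac{1}{T} \sum_{t=k}^{T-k} \log p\left(w_t \mid w_{t-k}, \ldots, w_{t+k}\right),
\qquad
p\left(w_t \mid w_{t-k}, \ldots, w_{t+k}\right) = \frac{e^{y_{w_t}}}{\sum_i e^{y_i}},
\qquad
y = b + U \, h\left(w_{t-k}, \ldots, w_{t+k}; W, D\right)
```

The model maximizes the average log probability on the left; each $y_i$ is the unnormalized log probability of output word $i$, and the paragraph vector enters the prediction through $h$, alongside the word vectors of the context.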
demonstrate a clear reaction to events directly linked to the Brazilian scenario, such as the confirmation of the first case of COVID-19 in Brazil, the confirmation of the infection of a member of the Brazilian presidential executive team, and the confirmation of the first death due to COVID-19 in Brazil. Such data demonstrate a greater reactive social force in terms of sharing, discussing and socializing opinions and information that can more directly affect the daily lives of the participants in this social group. In contrast, Figure 2 shows the evolution of the average daily engagement in relation to the growth curve of COVID-19 infections in Brazil. The results demonstrate that the growth rate of the average daily engagement accompanies the growth in the number of infections. A single point of greater prominence occurred on

Figure 1 - Evolution of the daily number of posts.

Figure 2 - Evolution of the daily engagement.

Table 1 - Quantitative details of the publications collected from the Facebook group.

Em Questão, Porto Alegre, v. 27, n. 3, p. 42-67, jul./set. 2021. https://doi.org/10.19132/1808-5245273.42-67 E-ISSN 1808-5245

similarity between the vectors (inferred and modeled) is identified, a notion of the model's consistency is obtained. Although this is not a true accuracy value, it is a way of validating how representative the model is of the characteristics of the database documents.

most frequent terms within these groups were calculated. Table 2 presents the twenty most frequent terms for both classes. It is possible to observe that the terms related to Class 1 are those most present in governmental issues and factors directly associated with the pandemic, such as quarantine, hospital, epicenter and death. On the other hand, Class 2 presents terms that are more related to people's daily lives, with emphasis on terms related to food, health and personal care.
Source: research data.