Thick Big Data: Doing Digital Social Sciences

Dariusz Jemielniak

Print publication date: 2020

Print ISBN-13: 9780198839705

Published to Oxford Scholarship Online: May 2020

DOI: 10.1093/oso/9780198839705.001.0001


PRINTED FROM OXFORD SCHOLARSHIP ONLINE (oxford.universitypressscholarship.com). (c) Copyright Oxford University Press, 2021. All Rights Reserved. An individual user may print out a PDF of a single chapter of a monograph in OSO for personal use. date: 06 December 2021

3 Methods of Researching Online Communities

Oxford University Press

Abstract and Keywords

The chapter presents the idea of Thick Big Data, a methodological approach combining big data sets with thick, ethnographic analysis. It presents different quantitative methods, including Google Correlate, social network analysis (SNA), online polls, culturomics, and data scraping, as well as easy tools to start working with online data. It describes the key differences in performing qualitative studies online, focusing on the example of digital ethnography. It also discusses the use of case studies of digital communities. It gives specific guidance on conducting interviews online, and describes how to perform narrative analysis of digital culture. It concludes by describing methods of studying online cultural production, and discusses the notions of remix culture, memes, and trolling.

Keywords:   Big Data, thick data, digital ethnography, culturomics, social network analysis, data scraping, digital cultural studies, remix culture

Some researchers of social phenomena consider the combination of qualitative and quantitative methods risky or at least problematic (Bryman, 2007; D. R. Buchanan, 1992). While taking their reservations into account, I believe that the advantages of using a variety of tools and approaches outweigh the disadvantages (Hammersley, 1992), although the whole needs to be situated in a single coherent paradigm. Although the combined use of qualitative and quantitative methods has a long history in traditional social research (Jick, 1979), in the digital world it offers especially many advantages: ordered, quantified data is much easier to access, while the parallel need for its deep interpretation is complicated by the massive inflow of sources and their ambiguity. Mixed methods fit Internet studies particularly well (Hine, 2015). In digital social science, I propose the use of Thick Big Data, the conscious, programmatic combination of Big Data (highly quantified datasets) with thick data (deeply qualitative fieldwork).

Especially in research on online phenomena, quantitative studies of an exploratory character are an excellent choice, unlike in traditional social science research. They enable a problem to be sketched, and then exposed and explained through qualitative research (Spillman, 2014); they also help to identify people and communities that can be explored, and to select subsets of text for narrative analysis. It needs to be noted, though, that the division of research into exploratory and explanatory is an artificial one (Stebbins, 2001), and many research projects are based on a pragmatic iterative approach, returning to the same questions and blurring the division.

The interpretation of large amounts of data through a deep qualitative project is an excellent way of positioning Big Data in social research (Curran, 2013), and it does not exclude the “anthropological frame of mind” (Czarniawska-Joerges, 1992) and the reflection that are typical of qualitative research. The use of different research methods facilitates non-stereotyped thinking and increases the possibility of theorizing across micro and macro perspectives (J. Mason, 2006). Given the power of quantitative data enabled by the Internet, it makes particular sense to use such data to visualize problems that are then explained in detail through a qualitative process. Of course, one cannot assume that the compilation of data with varying levels of detail and depth will always be problem-free. It is therefore of key importance to contextualize the results and to use the data to create a common, sensible, and consolidated interpretation (Brannen, 2005). Thick Big Data may similarly rely on the use of qualitative research for an initial pilot study, in order to identify the areas and sensible research questions for a Big Data problem. Naturally, there is no reason to strictly separate the pilot study from the full-fledged study, because one needs to remember the fluidity of such divisions (Nadai & Maeder, 2005).

The sense of combining quantitative and qualitative research methods in social sciences was noted twenty years ago (Sudweeks & Simoff, 1999). Sociology and social sciences were adapting to the new reality, developing independent and mixed methods of online research. In the meantime, strong competition to those sciences arose. First, many social studies based on online data are being conducted by large corporations, using data which is not available to the public. Second, questions reserved for sociology started to be answered by academics specializing in data analysis, information sciences, mathematics, and widely understood computational methods (Newman et al., 2011). The social sciences have also been invaded by specialists from biology, medicine or physics (Barabási et al., 2002; Palla, Barabási, & Vicsek, 2007). There is also a significant rise in the digital humanities whose scope of interest has begun to encompass areas previously occupied by social sciences (Sayers, 2018). There has been a similar development in digital culture studies (Jenkins, Ford, & Green, 2018).

What is interesting is that because those researchers did not participate in the decades-long discussions on research methods and traditions, they usually begin by ignoring the canon of good practices or the established paradigms of the social sciences, including the fundamentals of research ethics (Frade, 2016). Access to vast amounts of data has posed a great challenge to sociology as a discipline, making it necessary for it to redefine and consolidate its research identity (Lazer & Radford, 2017; McCarthy, 2016). Without reacting to these developments and adjusting to the new possibilities, sociology is threatened with a loss of importance, although the excess of data and of home-grown sociologists may increase the need for more reflection and methodological rigor, because the illusion that data speaks for itself will quickly dissipate (Dourish & Gómez Cruz, 2018).

As Wellman, a pioneer in online social research, wrote in 2004 (Wellman, 2004), at first all one needed to write an article on online behavior were some interesting thoughts and insights. Later, the era of the data presentation craze dawned. Afterwards came a period of focusing on analysis and interpretation instead of merely reporting observations. It can be added that currently it is even more important to skillfully combine different tools, especially Big Data analysis and/or network analysis with ethnographic and qualitative studies. Big Data in this wide sense needs thick data: when quantitative methods are based on large datasets, it is all the more important to give them sense through qualitative analysis (Blok & Pedersen, 2014; T. Wang, 2013). Thick data allows the interpretation of big data, because data do not speak for themselves and require contextualization. The opposite is also true: qualitative research increasingly needs IT support (Ducheneaut, Yee, & Bellotti, 2010), if only because social life increasingly permeates the virtual realm and the digital part of everyday life can no longer be excluded.

Combining thick data and Big Data is also pragmatic. Big Data allows for immersion in unprecedented amounts of data on human behavior. We may even speak of the “datafication” of sociology (Millington & Millington, 2015). At the same time, relying solely on research of this kind places sociology in an untenable position: Big Data analysis as performed by professional data scientists is serious competition. Naturally, the supposed plummeting demand for specialist social science knowledge is exaggerated, and data in itself is of low value when disconnected from skillful analysis, but the wider audience and the institutions that decide on the financing of research do not necessarily understand this.

Interpretation, especially when supplemented with deep qualitative research, allows a proper understanding of the results of Big Data (Halavais, 2015). It is a paradox, but the greater the inflow of quantitative data, the greater the need for qualitative analyses (Babones, 2016). The interpretation of traditional quantitative studies on their own is possible, because such research is usually already quite well contextualized through the research questions and the selection of input material. There is a lot less context in Big Data, though. Access to data ceases to be a problem, while making sense of the data becomes increasingly problematic. Naturally, the opening of social sciences to Big Data leads to the “wild interdisciplinary character” of research in which sociology meets anthropology, organization theory, or information sciences (Goulden et al., 2017), and, in researching works of culture, media sciences, cultural studies, or even literary analysis, but it allows the delivery of really useful and rich social research. In this sense, we may speak of “symphonic social research projects” (Halford & Savage, 2017). In the analysis of large datasets, other disciplines are more advanced; however, sociology holds a uniquely privileged position: it has very high quantitative competences combined with a long tradition of qualitative sociology, with the added benefit of purely ethnographic research (A. Goffman, 2014; Willis, 2013), and a deeply developed and proceduralized (Atkinson, 2013) canon of such approaches as grounded theory (Hodkinson, 2015; Konecki, 2008a), as well as developed interpretive standards within sociological theories. Sociology is thus uniquely positioned to develop the canon for digital social research, as it has a strong tradition of using qualitative and quantitative approaches, as well as of developing methodologies for human-subject research.

The use of Big Data ought not to be a goal in itself but rather a road to specific knowledge, for which qualitative data can be a valid supplement (Alles, 2014). Canons and ways of combining such approaches are still in the making (Huc-Hepher, 2015). For example, GPS geographical data and MySpace information helped contextualize and localize ethnographic data (Hsu, 2014). Bornakke and Due (Bornakke & Due, 2018) show how Big Data can be combined with ethnographic contextualization: using observations and interviews together with data from 1000 hours of video footage that recorded the most frequent trajectories of customers walking around a store, or combining the GPS data of 371 cyclists with observation and interviews. Similarly, a Danish research team combined large mobile phone datasets with ethnographic insights (Blok et al., 2017).

Latzko-Toth, Bonneau, and Millette suggest that Big Data can be contextualized with thick description (Latzko-Toth, Bonneau, & Millette, 2017), composed of:

  • trace interviews: talking to selected people whose quantitative data we have access to, and gathering their comments and a shared interpretation of that data;

  • manual data collection: collecting quantitative data not with automated tools but through purposeful selection, for example by creating a database of tweets that is not based on specific searches but on reading each tweet and consciously classifying it into a specific category. The next step can be a quantitative analysis of such a qualitatively identified network.

Table 3.1 presents a possible sequence of research stages, from quantitative research to qualitative. Naturally, within a specific research project one needs to choose the quantitative and the qualitative tools and methods of data collection. The opposite can also be applied, progressing from qualitative analysis and the generated theories to formulating hypotheses and verifying them within a quantitative project. It is important here to understand the power behind Big Data and thick data.

Table 3.1 Stages of Thick Big Data research

This part of the book is devoted to researching online communities, that is, the behaviors, organization, and culture of people and avatars. It starts with a description of the available arsenal of quantitative methods and progresses to qualitative methods. Complex research of communities may also be supplemented by research on cultural artifacts, which is described later in this volume.

3.1 Quantitative Research

3.1.1 Big Data

With billions of people using the Internet, we now have the possibility of tracing even the minute factors that would ordinarily be imperceptible—because of the small sample size. For instance, taking into account the changes in communication patterns, we are able to ascertain that a given person is unemployed (Llorente, Garcia-Herranz, Cebrian, & Moro, 2015). We may also observe hourly changes in a population’s moods, compare habits of individual communities, and reactions to headaches or alcohol consumption just by analyzing public tweets (Golder & Macy, 2011).

Even raw data, if based on sufficiently large samples, may be a useful starting point for future research. For example, one of the most popular porn websites, and the world’s 22nd most often visited page, Pornhub, publishes an annual report on its users. In 2016, there were 23 billion visits to the website, and the visitors watched more than 91 billion videos, making it a remarkably large database. The report indicated some interesting cultural differences: the longest visits came from the Philippines and lasted an average of 12:45 minutes; the shortest were from Cuba, lasting an average of 4:57 minutes. The average visit from Mongolia lasted 5:23 minutes. This data, naturally, does not allow for interpretations or cultural generalizations, since only a fraction of the population visits pornographic sites, and in each country the visitors might come from a different cultural and demographic group. Nevertheless, the data is a treasure trove for social researchers of sexual behavior across the world. A piece of information from the report, that the word “teen” was among the most searched keywords, may be valuable for sexologists and criminologists attempting to research sexual interest in minors (A. Walker & Panfil, 2017).

There is also a proliferation of information sources. For instance, a sentiment analysis of online movie reviews allows an automatic assessment of their emotional load (Thet, Na, & Khoo, 2010), which can be especially interesting in cross-community research. Mountains of data on various kinds of human activity are also growing. In the Quantified Self movement (Swan, 2013), people measure the parameters of their own activity (Lyall & Robards, 2018), relying on sport trackers like Fitbit or Garmin and often making the results publicly available, and the practice is reaching a mass scale. Access to such data allows for better insight into the daily activity cycle, stress at work, and the influence of physical activity, issues that have long been of interest to the sociology of health (Pantzar, Ruckenstein, & Mustonen, 2017). The scope of biometric data is also increasing: simplified EEG measuring equipment has entered mass usage, even though it is most usable in lab conditions or during meditation (Przegalińska, Ciechanowski, Magnuski, & Gloor, 2018). Mood detection in elderly people allows for a dynamic adaptation of their environment (Capodieci, Budner, Eirich, Gloor, & Mainetti, 2018). Professional sensors inserted into the body of the research subject are also increasingly available (Rich & Miah, 2017).

In yet another area, though still associated with technological development, the expansion of IoT (Internet of Things) devices, that is, all sorts of equipment that is constantly online and transmits data, also opens new fields for social analyses (Dale & Kyle, 2016; Ytre-Arne & Das, 2019). We can therefore predict the onset of computational social sciences (Lazer et al., 2009). As a counterweight, there have also been advancements in contemplative (Janesick, 2016) and humanist sociology (Giorgino, 2015).

Big Data may lead to the discovery of fascinating dependencies and intimate knowledge. In 2013, Kosiński, Stillwell, and Graepel published an article (Kosinski, Stillwell, & Graepel, 2013) on predicting some potentially private data, such as sexual orientation, ethnicity, age, gender, religion, political views, psychological features like life satisfaction, intelligence, and personality, the tendency to use drugs, and the marital status of parents, based on Facebook likes alone. The model is 88 percent accurate in assessing men’s sexual orientation, and as much as 95 percent accurate in differentiating between African Americans and white Americans.

The research covered 58,000 individuals who needed to grant access to their Facebook profiles in exchange for taking a free psychological test. The researchers could access the results of all the tests and match them with the profiles, so they were able to pinpoint, on a large sample, which likes were the best predictors of specific features or preferences. Even though as many as 300 likes per person were taken into account, even single likes could have a predictive value. As an example, liking the television show The Colbert Report was a good predictor of high intelligence, while liking Harley Davidson could be an indicator of lower intelligence. Naturally, some likes were directly related to the researched trait: liking the page “I love being gay” was a clear indicator of the person’s sexual orientation. Others, however, were far less obvious but still effective, such as liking “Shaquille O’Neal’s fanpage,” which was a good predictor of male heterosexuality.

In a sample of over 86,000 individuals, Youyou, Kosiński, and Stillwell (2015) showed that computer predictions based on the analysis of likes may be more accurate than the assessment of the research subjects’ friends. In some cases related to the use of psychoactive substances, political sympathies or health, prognoses may even be more accurate than the self-assessment of the researched person. A simplified version of a tool showing how a similar system may work can be found at applymagicsauce.com (for a limited number of Facebook likes and tweets) on the website run by the Psychometrics Center at the University of Cambridge, where Kosiński worked for a few years. For social researchers, the key message is clear: completely unrelated collections of large datasets may bring valuable, verifiable information.

This phenomenon is of immense value to marketing. Andrew Pole, a statistician for Target, an American hypermarket chain, asked an interesting research question in 2002: how, based on the data at the chain’s disposal, can we guess that a customer is pregnant, even when she is reluctant to reveal this information? This is a crucial question for hypermarkets, because young parents are a gold mine, and their needs are easy to define. If they can be won over at an early stage, by establishing their habit of buying diapers, they will most likely make other purchases there as well, staying with the chain for longer. This is why Amazon has offered the “Amazon Mom” program for young parents since 2010. Members of the program can use Amazon Prime services, which cost about US$119 per year, for free for up to a year, as long as they meet certain purchasing criteria. Target used its data for this kind of prediction, and the dataset was quite rich, as Target keeps each customer’s credit card information, demographic data, address, and email in its database. Knowing the email address makes it possible to connect the chain’s data with online databases of consumer behavior, which are often quite well developed and based on online customization systems. Additionally, the chain buys data on the secondary market; these datasets often contain information on marital status, credit rating, education, and even typical subjects of online conversations. The research team analyzed the purchasing history of women who had signed up for the baby registry. With sufficient amounts of data and research subjects, interesting patterns started to emerge. For instance, at the beginning of the second trimester a large group of pregnant women started to purchase odorless body balms. In the first 20 weeks of pregnancy, many of them started to stock up on calcium, zinc, and magnesium. Finally, the research team limited the predictive model to 25 products, based on which they could safely assume whether or not a customer was pregnant, and predict the due date (Duhigg, 2012).

Based on the algorithm, the chain started to send discount coupons for baby products. In 2010, a man in the Minneapolis area complained about his daughter receiving these coupons. Target naturally apologized for the mistake, but as it turned out, the man’s daughter was indeed pregnant (K. Hill, 2012). Since then, Big Data analysis has made considerable progress. It is now combined with AI research, with the use of neural networks and machine learning. In 2017, Kosiński published the results of his research, according to which neural networks, following the analysis of over 35,000 images of people from a dating site, were able to state with quite high accuracy the sexual orientation of the photo subjects. The accuracy was 91 percent for men when five photos of each person were analyzed, although this figure refers to distinguishing between pictures presented in pairs rather than to classifying randomly selected pictures (Y. Wang & Kosinski, 2018). Still, it proved more accurate than human assessment.

Earlier research by Kosiński’s team had been exploited by Cambridge Analytica, a private corporation established in 2013 to interfere in American political campaigns. The company became controversial because of its support for Brexit in Great Britain and Donald Trump’s successful presidential campaign. Cambridge Analytica collects extensive data on voters. It uses all possible sources, including “free” psychological tests and polls, datasets collected by market research agencies, and specially developed mobile applications, often without the consent and knowledge of the participants (H. Davies, 2015). The company prides itself on the use of “behavioral microtargeting,” which, as it states, may forecast the needs of the research subjects and get to know them better than they could themselves verbalize. For the 2016 elections in the US, all adult Americans were categorized according to 32 personality types, so that the language of communication could be adjusted to specific people, and the political sympathies of the poll subjects were hinted to the polltakers, in order to be more persuasive on specific issues. In the USA this is so much easier, as the two main political parties have been developing voter databases for years, trying to pinpoint non-obvious common denominators of political beliefs for small groups of people. This way, parties may adjust their message to undecided supporters of the other party who can be convinced to stay at home, either by offering negative suggestions or by convincing them they need to go on vacation on the day of the elections. Such practices are definitely controversial, because they are based on interference in the democratic process. Additionally, closely targeted advertising is only weakly regulated: a public broadcast, such as a television spot, can be grounds for legal action for infringement of personal rights or defamation, whereas when a microtargeted advertisement is shown, the defamed person may not even be aware of it. In 2019, Facebook introduced a public Ad Library in response to similar concerns. The revelation of Cambridge Analytica’s actions in 2018 created a backlash against Facebook, which had done little to protect the privacy of its users. Nevertheless, all kinds of consumer-related data are collected by thousands of companies in every technically possible way, and Cambridge Analytica is not an isolated case. Since Facebook introduced strict limits on data gathering, it has been difficult to mine the information, also for scientific purposes. Leaving these practical considerations aside, this application of Big Data has huge scientific potential.

The analysis of Big Data reveals some interesting data distributions, often diverging from the bell curve. Normal distribution assumes that outliers are rare: following the three-sigma rule as much as 99.7 percent of the area under the normal distribution curve lies within three standard deviations from the center. It is typical for demographic phenomena, like age in a defined population or intelligence in standard assessment models. Many journals in the social sciences rely on this Gaussian distribution (Andriani & McKelvey, 2009). This may make it difficult to notice phenomena of different characteristics.
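The three-sigma rule mentioned above can be checked numerically. The following is a minimal sketch in Python; the function name `share_within` is an illustrative choice, and the calculation uses the standard relation between the normal distribution and the error function.

```python
import math


def share_within(k: float) -> float:
    """Share of a normal distribution lying within k standard deviations of the mean."""
    # For a normal variable, P(|X - mean| <= k * sigma) = erf(k / sqrt(2)).
    return math.erf(k / math.sqrt(2))


for k in (1, 2, 3):
    print(f"within {k} standard deviation(s): {share_within(k):.2%}")
```

For three standard deviations the share comes to roughly 99.7 percent, which is why values further from the center are treated as rare under the assumption of normality.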

Natural events, such as avalanches, fires, and epidemics, often show a power-law distribution (sometimes loosely called exponential). This is a regularity of the form $y = kx^{a}$, where $y$ and $x$ are variables, $a$ is an exponent, and $k$ is a constant. Power-law distributions show that small-scale events are very common, but there are also a few major cases. One of the first observed examples is Zipf’s law, which is often treated as synonymous with the power-law distribution. Zipf was a Harvard linguist who concluded that a language’s most frequently used word occurs twice as often as the second most frequent word, and three times as often as the third (Reed, 2001). Similarly, income follows the Pareto law: 20 percent of individuals receive 80 percent of the income. Similar regularities may be observed in many social phenomena, such as the size of enterprises measured by number of employees and market value (Gabaix, Gopikrishnan, Plerou, & Stanley, 2003), or the salaries paid to CEOs (Edmans & Gabaix, 2011). In classical sociological research, the analysis of social clashes in Chicago from 1881 to 1886 showed similar characteristics; the research took into account the number of employees and companies (Biggs, 2005).
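The regularity of the form y = kx^a and Zipf’s rank-frequency rule described above can be illustrated with a short sketch in Python. The function names and the count of 6000 occurrences for the most frequent word are hypothetical, chosen only for illustration; the sketch shows an idealized case and does not fit any real corpus.

```python
def power_law(x: float, k: float, a: float) -> float:
    """Value of y = k * x**a for a given x, constant k, and exponent a."""
    return k * x ** a


def zipf_expected(top_count: float, max_rank: int) -> list[float]:
    """Expected word counts under an idealized Zipf's law: count(rank) = top_count / rank."""
    return [top_count / rank for rank in range(1, max_rank + 1)]


# Hypothetical corpus in which the most frequent word occurs 6000 times.
for rank, count in enumerate(zipf_expected(top_count=6000, max_rank=5), start=1):
    # Zipf's law is the special case of the power law with exponent a = -1.
    assert count == power_law(rank, k=6000, a=-1)
    print(f"rank {rank}: about {count:.0f} occurrences")
```

In this idealized case the second word appears half as often as the first and the third a third as often, which is the pattern the rule describes.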

Research on Internet communities confirms that the researched features of online groups follow a power-law pattern (Johnson, Faraj, & Kudaravalli, 2014). It is frequently the consequence of systemic complexity: in complex sets of codependent individuals, the normal distribution gives way to the power-law distribution (Andriani & McKelvey, 2009). The number of website visits (L. A. Adamic & Huberman, 2000) or the number of referring links (Albert, Jeong, & Barabási, 1999) can be mentioned here. An important element of the sales strategy of Amazon.com, which made the company so successful, was to accommodate “the long tail,” satisfying the needs of niche customers placed at the end of the demand distribution (Spencer & Woods, 2010).

In open collaboration communities, the power-law distribution may be applied to social actors. For instance, this is what the popularity of Web users looks like (Johnson et al., 2014). Similarly, rules of involvement in most Internet communities are so similar that we may invoke a 1 percent rule, where 1 percent of the population of the community generates 99 percent of the content (Hargittai & Walejko, 2008). The number of Wikipedia articles per user follows this regularity (Zhang, Li, Gao, Fan, & Di, 2014), and the top 0.1 percent of editors provides as much as 44 percent of content (Priedhorsky et al., 2007). Still, when the number of active participants rises, the proportions may change dynamically (Van Dijck & Nieborg, 2009). Such interesting observations provide previously inaccessible knowledge about human behaviors in large communities.
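The concentration of contributions described above can be measured directly once per-user contribution counts are available. The following is a minimal sketch in Python; the function name `top_share` and the edit counts are hypothetical and serve only as an illustration, not as data from any of the cited studies.

```python
def top_share(contributions: list[int], top_fraction: float) -> float:
    """Share of all content produced by the most active `top_fraction` of users."""
    total = sum(contributions)
    if total == 0:
        raise ValueError("no contributions to analyze")
    ranked = sorted(contributions, reverse=True)
    # Always include at least one user, even in very small communities.
    n_top = max(1, round(len(ranked) * top_fraction))
    return sum(ranked[:n_top]) / total


# Hypothetical community: one very active user and 99 occasional contributors.
edits = [500] + [5] * 99
print(f"top 1% of users made {top_share(edits, 0.01):.1%} of edits")
print(f"top 10% of users made {top_share(edits, 0.10):.1%} of edits")
```

Applying the same function to different cut-offs (1 percent, 0.1 percent) gives a quick way of comparing communities before any deeper qualitative work.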

Big Data research, however, requires large datasets. Fortunately for researchers, many large databases can be legally examined for free. Wikimedia project data may be openly downloaded through the API in many popular formats, including JSON, XML, PHP, and even HTML. This matters because commercial services often impose limitations on access to their data: Twitter does make its API accessible, but with strict limits on queries and time scope. No wonder: paid access to such data, through the company Gnip.com, is one of the enterprise’s more profitable products. What is worse, social media site licenses do not allow the source database to be made available for review purposes, which means that little of the research conducted on this data is verifiable. This is also true of research propagated by the social media sites themselves. In 2014, Facebook’s Data Science team published an interesting analysis of the evolution of memes on the site. It rested on a very strong empirical basis, as it used an enormous dataset of over a million statuses from half a million user accounts (E. A. L. Adamic, Lento, & Ng, 2014). However, because the research team did not make the source data available, the academic community can neither verify the results nor use the database for other supplementary analyses.
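As an illustration of how openly Wikimedia data can be reached, the sketch below builds a request to the MediaWiki Action API for the latest revisions of an article and returns the JSON response. It is only one possible way of querying the API: the choice of the English Wikipedia endpoint, the article title, the helper names `build_query_url` and `fetch_revisions`, and the User-Agent string are all assumptions made for the example.

```python
import json
from urllib.parse import urlencode
from urllib.request import Request, urlopen

# Assumed endpoint: the English Wikipedia; other language editions use their own host.
API_URL = "https://en.wikipedia.org/w/api.php"


def build_query_url(title: str, limit: int = 5) -> str:
    """Build a MediaWiki Action API URL asking for the latest revisions of a page."""
    params = {
        "action": "query",
        "prop": "revisions",
        "titles": title,
        "rvprop": "timestamp|user|size",
        "rvlimit": limit,
        "format": "json",
    }
    return f"{API_URL}?{urlencode(params)}"


def fetch_revisions(title: str, limit: int = 5) -> dict:
    """Download and decode the JSON response (requires network access)."""
    request = Request(
        build_query_url(title, limit),
        # Hypothetical identifier; a descriptive User-Agent is good practice for API clients.
        headers={"User-Agent": "thick-big-data-example/0.1 (research sketch)"},
    )
    with urlopen(request, timeout=30) as response:
        return json.load(response)


# Example call (commented out so that the sketch does not need a connection):
# data = fetch_revisions("Big data", limit=3)
```

For larger projects, the complete database dumps published by Wikimedia are usually a better starting point than repeated API calls.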

Corporations often limit access to Big Data, which makes it more difficult for the academic community to validate research results; it also creates circles of research elites that are privileged to access the data (McCarthy, 2016). This is grounds for a conflict of interest if the interpretation of the data is unfavorable for the corporations that gave access to the information and to the rest of the academic world (boyd & Crawford, 2012). In this sense, Facebook and similar corporations are undermining the development of social sciences, although they have more structured and complete data than most governments and statistical offices (Farrell, 2017).

For these reasons, large databases, access to which is neither paid nor limited, such as Wikimedia project databases, are invaluable. I include them in my research projects. Nevertheless, not all research conducted on large databases can be called Big Data analysis. For instance, with Maciej Wilamowski of the University of Warsaw we conducted an analysis of data from eight Wikipedia language versions. We researched 41,000 of the best articles across the projects, using the criteria of the individual project communities for assessing quality. We supported our research with a bot developed in PHP by my co-author. We asked a simple but important question: are standards for the best articles consistent or different among language editions of Wikipedia? If they were similar, we could assume that people have certain universal beliefs about the presentation of encyclopedic knowledge. In light of cultural globalization, similar organization of societies, identical technology and presentation, and cooperation across projects, this would not be surprising, especially given the visible, strong paraprofessional culture of Wikipedians. However, if they were inconsistent, we could conclude that standards for presenting knowledge are strongly influenced by local cultures and cannot be universalized. We showed significant differences in the number of words and characters for the best articles in the sample, with a power-law distribution. We also noted a large discrepancy in the average number of images used to illustrate articles, the average number of bibliographical references, as well as the numbers of external and internal links. Above all, we found major differences among language versions. This led to the conclusion that there are divergences in social preferences, most likely conditioned by the culture of a language, because individual Wikipedias are defined by language, not by country (Jemielniak & Wilamowski, 2017). For instance, countries where Eastern Romance languages are spoken have a preference for more images, while the French show a strong inclination towards large bibliographies (from the viewpoint of the average absolute number of references within articles, although not so much from the perspective of saturation in comparison to numbers of words). This is interesting, because it suggests that the conviction about the neutral objectivism of knowledge and of the ways to present it is largely a myth. This forms part of the wider stream of research in the sociology of knowledge.

Wikidata is a free-licensed database with massive potential for social science. It is yet another Wikimedia project, but it contains very ordered data, easily exported in several formats. Unlike Wikipedia data, this project’s data does not need time-consuming parsing, sanitation, and clean-up. In the long run, Wikidata may contain most of the Wikipedia-relevant data that can be recorded in an unambiguous way, without referring to any specific language, such as dates of birth and death. The database is still developing but yields some interesting observations. For instance, with Natalia Banasik-Jemielniak and Wojciech Pędzich, we collected the lifespan data, in days, for more than 800 bishops from six countries who had died in the past 30 years, and compared it with analogous data for priests and male academics. We wanted to check if a group of people who receive millions of prayers live longer than those who do not. The source of inspiration was an observation that each Roman Catholic bishop receives a few million prayers yearly, on average, because the Mass is regulated by the Roman Missal, under which the congregation recites such a prayer as a fixed element. On a large dataset we did not observe a significant difference in lifespan between bishops and male academics, although we found that bishops outlived regular priests, which, however, we explained in terms of pre-selection (only priests who are over 35 may become bishops) and material status. Naturally, the results do not prove the efficacy of prayers as such, as we could not account for the commitment of the praying, the intentionality, emotional attitude towards the prayer or simple physical proximity to the person to which (p.37) the prayer was directed. Nevertheless, in a rather humorous project we managed to learn something that we were simply curious about but which was only possible with the use of Wikidata.

Without going into details, it is worth noting that Wikimedia project data (Wikipedia, Wikidata, Wiktionary, Wikivoyage, Commons) may be a phenomenal source of research material that is limited only by the researcher’s imagination, the need to ask the right questions and to seek solutions to interesting problems. Even the structure of the data can be the subject of social analysis, because the way in which large groups of people organize and categorize information is also the source of social preferences, stereotypes, and beliefs (J. Adams & Brückner, 2015). However, even though the project did research large amounts of data, a question remains if it was a Big Data project, because such a project is the research of massive datasets, usually streamed, whose analysis needs to be supported with something more than classical statistical tools and which results in the emergence of either predictive conclusions or behavioral models. George et al. (2014) suggest that the size of a dataset should not be considered as important as the “smartness” of the acquired information and granularity of data about an individual.

Quite recently, Google set up datacommons.org, which combines integrated, ordered, and cleaned-up data from Wikipedia, the US Census Bureau, the FBI, weather agencies, and American election commissions. This tool is worth keeping in mind for geography research. (p.38) Stanford University’s Large Network Dataset Collection and Harvard University’s Dataverse are also of great use.

An increasing number of interesting free tools support the analysis of Big Data. GOOGLE CORRELATE3 trend analysis makes it possible to find Google search queries whose popularity dynamics follow a defined trend. The trend may be plotted on a graph (Figure 3.1).

Figure 3.1 GOOGLE CORRELATE example 1

In the US, Google search queries that matched the plotted series between early 2004 and March 2017 were:

  1. free web (r = 0.9953)
  2. download (r = 0.9933)
  3. free ftp (r = 0.9932)
  4. Microsoft FrontPage (r = 0.9932)
  5. amplifiers (r = 0.9931)
  6. web page (r = 0.9929)
  7. Japanese language (r = 0.9929)
  8. comparisons (r = 0.9927)
  9. pdr (r = 0.9922)
  10. real media (r = 0.9922)

Each of the results may be visualized on a graph. Result 1 correlates with the plotted curve in Figure 3.2 as follows:

Figure 3.2 GOOGLE CORRELATE example 2 (plotted curve)

Source: https://www.google.com/trends/correlate/search?e=id:Ou5W8zluSUP&t=weekly&p=us

There is also a possibility of seeing a scatter plot; “free ftp” looks like this (Figure 3.3):

Figure 3.3 GOOGLE CORRELATE example 3 (scatter plot)

The interest in “free web,” “download,” “free ftp,” and “Microsoft FrontPage” has plummeted. I am surprised that the interest in free web has dropped very similarly to the Microsoft WYSIWYG HTML editor, but this is understandable. The drop in searches for download and free FTP (a file exchange technology) also corresponds well with the transition to streaming content.

GOOGLE CORRELATE’s feature that allows the input of time trends in a CSV format is much more interesting. Based on available data, such as sales of a defined product or a sickness rate, we can see which queries are correlated with the phenomenon. We may even check what queries are correlated with other queries: “losing weight” is strongly correlated (p.39) in US search results with phrases like “physical exercises,” “losing kilograms,” “whey,” “increase of muscle mass,” and “burning fat.” Even when used as the only tool, GOOGLE CORRELATE can carry some quite serious research projects, or at least complement them.
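The idea behind uploading a time series and receiving a ranked list of matching queries can be illustrated with a minimal sketch. It assumes that the ranking is based on the Pearson correlation coefficient r, as the r values reported above suggest; the CSV layout, the column names, and all the numbers are hypothetical and serve only as an illustration, not as a description of how the service was implemented.

```python
import csv
import io
from math import sqrt


def pearson_r(xs, ys):
    """Pearson correlation coefficient of two equal-length numeric series."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("series must have the same length (at least 2)")
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("a constant series has no defined correlation")
    return cov / sqrt(var_x * var_y)


def rank_queries(target, candidates):
    """Return (query, r) pairs, strongest positive correlation first."""
    scored = [(name, pearson_r(target, series)) for name, series in candidates.items()]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)


# Hypothetical weekly data: a target series (e.g. product sales) and
# invented search-volume series for three made-up queries.
CSV_TEXT = """week,target,query_a,query_b,query_c
1,10,20,5,7
2,12,24,4,9
3,15,30,3,6
4,19,38,2,8
5,24,48,1,7
"""

rows = list(csv.DictReader(io.StringIO(CSV_TEXT)))
target = [float(row["target"]) for row in rows]
candidates = {
    name: [float(row[name]) for row in rows]
    for name in ("query_a", "query_b", "query_c")
}
ranking = rank_queries(target, candidates)
for name, r in ranking:
    print(f"{name}: r = {r:.4f}")
```

In this toy data, query_a is an exact multiple of the target and therefore ranks first, while query_b moves in the opposite direction and ranks last. A high rank says nothing about why two series move together, which is the caveat the chapter goes on to discuss.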

Simple Google location and query data can serve practical social goals. In 2018, FINDER, an epidemiology project based on mobile phone location data and Google searches for food poisoning, used machine learning algorithms to allow the Centers for Disease (p.40) Control and Prevention (CDC) to pinpoint restaurants that needed health inspection with more than three times the accuracy of traditional methods (Sadilek et al., 2018).

At the same time, the tool had its limitations. De facto, the same data was used to run the Google Flu Trends project, which, on the basis of search query trends, estimated the probability of propagation of flu and dengue viruses. The service was launched in 2008 and seemed to be an ideal application of Big Data (Ginsberg et al., 2009): a combination of search trends with CDC data resulted in an accurate estimate of flu epidemics two weeks ahead of the traditional epidemiologic models, which had great social value and life-saving potential. In 2013, however, the service had a spectacular failure, deviating from the actual results by 140 percent in the peak season for flu infections. Google Flu Trends had lost some of its predictive capabilities (Lazer, Kennedy, King, & Vespignani, 2014). This was largely a consequence of trusting in spurious correlations, of no bearing on infections, which contaminated the dataset. In 2015 the service was closed to public access, although historical data can still be viewed, and research teams may still apply to Google to put the data to better use. Putting exclusive trust in Internet-based data, without relying on any real-world data, requires a precision of the algorithm and cleanliness of the data which are not fully accessible (Rogers, 2017).

One must simply remember that correlation data itself, without context, can be misleading. It is very easy to choke on Big Data and see correlations that do not exist because with large amounts of data come spurious correlations—which is catastrophic for science, because researchers are motivated to show correlations, and many commercial computer tools allow sifting of data in search of correlations. The scale of the problem is so large that some researchers simply conclude that “most of the published research is wrong” (Ioannidis, 2005).
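A small simulation can make the point about spurious correlations concrete. The sketch below is an illustration under stated assumptions, not a reconstruction of any cited study: it draws one random "target" series and compares it with a growing number of equally random, unrelated series (the series length, the number of series, and the seed are arbitrary choices).

```python
import random
from math import sqrt


def pearson_r(xs, ys):
    """Pearson correlation coefficient of two equal-length numeric series."""
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)


def strongest_chance_correlation(n_series, n_points, seed=0):
    """Largest |r| between one random target and n_series unrelated random series."""
    rng = random.Random(seed)
    target = [rng.gauss(0, 1) for _ in range(n_points)]
    best = 0.0
    for _ in range(n_series):
        noise = [rng.gauss(0, 1) for _ in range(n_points)]
        best = max(best, abs(pearson_r(target, noise)))
    return best


# With the same seed, the first 5 noise series are identical in both runs,
# so the second run can only find an equal or stronger chance correlation.
few = strongest_chance_correlation(n_series=5, n_points=10)
many = strongest_chance_correlation(n_series=1000, n_points=10)
print(f"strongest |r| among 5 random series:    {few:.3f}")
print(f"strongest |r| among 1000 random series: {many:.3f}")
```

None of the series carries any information about the target, yet sifting through enough of them reliably turns up a strong-looking coefficient. Short series make the effect worse, which is worth remembering when yearly data are compared.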

Tyler Vigen brings a lighthearted perspective to the problem of “too much data” in Spurious Correlations (Vigen, 2015) and tylervigen.com. The correlation between US spending on science, space exploration, and development of technology on one side and suicides by hangings and suffocation on the other is worth a look (Figure 3.4):

Figure 3.4 Correlation between US spending on science, space exploration, and technology versus number of deaths by hanging and suffocation

Source: tylervigen.com, used with permission.

In Big Data analysis the important irremovable constraints of the algorithms must be kept in mind. For example, even though online dating systems operate on the data of tens of millions of users and develop matching algorithms that rely on preferences and psychological profiles, (p.41) research shows that they will not meet expectations (Finkel, Eastwick, Karney, Reis, & Sprecher, 2012). Moreover, one of the largest online dating services driven by algorithmic matching, OkCupid, provides a large amount of anonymized user data (Kirkegaard & Bjerrekær, 2016). This data can be used in research and exercises in using quantitative data (Kim & Escobedo-Land, 2015). Making it available, however, is an important ethical problem (Fiesler et al., 2015): seemingly unimportant data can be used to create a profile and successfully predict users’ behavior and features, or perhaps even identify them, despite the obfuscation of identifying information. The issue is controversial and I consider the publication of such classified information risky, even if the publishing party considers the data to be correctly and fully anonymized (Fiesler, Wisniewski, Pater, & Andalibi, 2016).

It is worth making a distinction between Big Data analysis, machine learning, and Deep Learning, which can result in finding regularities in the data. Examples can include predicting poverty based on satellite photography (Jean et al., 2016), optimization of statistical analysis in questionnaires (Fu, Guo, & Land, 2018), prediction of prison violence (Baćak & Kennedy, 2018), classifying social media posts based on content analysis (Vergeer, 2015), or sentiment analysis in press articles (Joseph, Wei, Benigni, & Carley, 2016).

3.1.2 Social Network Analysis

Social network analysis (SNA) has quite a long history in social science (Carrington, Scott, & Wasserman, 2005; J. Scott, 1988). Its first famous (p.42) application was related to weak ties (Granovetter, 1973)—a phenomenon that explains why, when looking for work, it is better to seek help from acquaintances instead of friends (Montgomery, 1992). The strength of weak ties resides in their length. They act as bridges, connecting individuals socially far apart and exchanging information and resources across distances.

SNA is an excellent way to research online communities. It can also be applied offline, but it is in the virtual world that it has found popular application, because data on the relations and connections between avatars are easy to access online. It is also a new research field with much to be discovered. Some researchers claim that Internet social networks are based on weak ties (De Meo, Ferrara, Fiumara, & Provetti, 2014), or contacts among acquaintances and strangers. This seems to explain the “Twitter revolutions” or social change movements in Tunisia, Egypt, Spain, or the international Occupy movement that relied on social media (Kidd & McIntosh, 2016). Other research shows that even people who know each other only online may form long-lasting and strong social ties (Ostertag & Ortiz, 2017), and the nature of social relations is an aggregate of online and offline acquaintance and contact (Chayko, 2014). Regardless, social network analysis will do a fine job, owing to easy access to data.

SNA is based on the research of ties within a network. It is usually a network of individuals or avatars, but the network can also be among devices and workstations. The two types of network connection are states (acquaintances, friends) and events (conversation, exchange, transaction) (Borgatti & Halgin, 2011). What else is measured? Within the connections, one also may research the homophily, ways in which an avatar builds connections with similar others, and how they deal with those that are unlike themselves according to specified criteria such as political preferences, education, gender, or place of birth. Much data of this kind is available in the networks’ public profiles. It is worth adding here that in network analysis it is hard to differentiate homophily from influence mechanisms—to observe whether a given reaction is the result of two people being similar and acting similarly, albeit independently, or whether this is the result of one person inspiring the other (Shalizi & Thomas, 2011).
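One simple way to look for homophily is to compare the share of ties that connect similar nodes with the share that would be expected if ties ignored the attribute altogether. The following is a minimal sketch of that comparison; the names, the "party" attribute, and the ties are hypothetical, and the random-pairing baseline is only one of several possible reference points.

```python
from collections import Counter


def same_group_tie_share(edges, attribute):
    """Share of ties that connect two nodes with the same attribute value."""
    if not edges:
        raise ValueError("at least one tie is required")
    same = sum(1 for a, b in edges if attribute[a] == attribute[b])
    return same / len(edges)


def expected_share_if_random(attribute):
    """Share of same-group pairs among all possible pairs of nodes."""
    n = len(attribute)
    total_pairs = n * (n - 1) / 2
    group_sizes = Counter(attribute.values()).values()
    same_pairs = sum(size * (size - 1) / 2 for size in group_sizes)
    return same_pairs / total_pairs


# Hypothetical public-profile data: declared political preference and ties.
party = {
    "ann": "red", "bob": "red", "cat": "red",
    "dan": "blue", "eve": "blue", "fay": "blue",
}
ties = [
    ("ann", "bob"), ("bob", "cat"), ("ann", "cat"),
    ("dan", "eve"), ("eve", "fay"), ("cat", "dan"),
]

observed = same_group_tie_share(ties, party)
baseline = expected_share_if_random(party)
print(f"observed same-group share: {observed:.2f}")
print(f"share expected at random:  {baseline:.2f}")
```

Here five of six ties are within the same group, against a baseline of 0.4. As noted above, such a gap alone cannot tell whether similar people chose each other or became similar through mutual influence.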

In social network analysis, the focus is on the reciprocity/equivalence of connections, their transitivity (is the friend of our friend also our (p.43) friend?), and density, strength of connections or centrality. Selection of these indicators makes sense only after establishing which of them is a good measure of what feature. Determining the number of contacts (e.g., phone calls) needed to confirm the presence of a bond is a valid methodological issue (Borgatti & Halgin, 2011).
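Three of the indicators named above can be computed by hand on a toy network. The sketch below assumes directed "who wrote to whom" ties for reciprocity and ignores direction for density and transitivity; the data are hypothetical, and dedicated SNA packages offer these measures in more general forms.

```python
from itertools import combinations


def reciprocity(directed_edges):
    """Share of directed ties that are returned (a->b and b->a both present)."""
    edges = set(directed_edges)
    if not edges:
        raise ValueError("at least one tie is required")
    mutual = sum(1 for a, b in edges if (b, a) in edges)
    return mutual / len(edges)


def to_undirected(directed_edges):
    """Neighbor sets, ignoring tie direction and self-loops.

    Nodes without any tie do not appear in the result.
    """
    neighbors = {}
    for a, b in directed_edges:
        if a == b:
            continue
        neighbors.setdefault(a, set()).add(b)
        neighbors.setdefault(b, set()).add(a)
    return neighbors


def density(neighbors):
    """Existing undirected ties divided by all possible ties."""
    n = len(neighbors)
    ties = sum(len(ns) for ns in neighbors.values()) / 2
    return ties / (n * (n - 1) / 2)


def transitivity(neighbors):
    """Share of connected triples (paths a-b-c) that are closed into triangles."""
    closed = 0
    triples = 0
    for ns in neighbors.values():
        for u, v in combinations(sorted(ns), 2):
            triples += 1
            if v in neighbors[u]:
                closed += 1
    return closed / triples if triples else 0.0


# Hypothetical message log: (sender, recipient).
messages = [
    ("ann", "bob"), ("bob", "ann"),
    ("bob", "cat"), ("cat", "ann"), ("cat", "dan"),
]
network = to_undirected(messages)
print(f"reciprocity:  {reciprocity(messages):.2f}")
print(f"density:      {density(network):.2f}")
print(f"transitivity: {transitivity(network):.2f}")
```

Whether a single message is enough to count as a tie is exactly the methodological question raised above; raising that threshold changes all three numbers.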

There are many applications of SNA. Financial specialists serving on supervisory boards can be one subject of research (Mizruchi & Stearns, 1988). SNA is ideal for exploratory research—although it can form a tool of its own for a complete research project. It is also a good introduction to ethnographic research (Bellotti & Mora, 2016).

The purpose of social network analysis is to research the structure of connections and patterns of connections. Instead of categorizing individuals by features, the analysis is based on relations. The focus, as with the systemic approach, is set on the whole structure; the goal is to observe patterns and relations that make it possible to distinguish a subnet, observe cliques, or draw conclusions about the acting and organizing of individual units (nodes or objects) of the network (Wellman & Berkowitz, 1988). The nodes may be people, but also projects or teams, organizations, events, and even ideas. Defining a relation and confirming its occurrence are, naturally, conventional and dependent on the research perspective: a relation may result from close proximity, but also from trust, membership in a group, or conflict. Noticing patterns of interaction improves understanding of the social mechanism, which adds valuable context to research of individuals. It is important in the study of boundary spanning, a social phenomenon based on the observation that spreading information and new knowledge in organizations often depends on people who are well-networked and communicative, both inside and outside of the organization (Meyer & Rowan, 1977; Tushman & Scanlan, 1981). In some cases, we observe boundary spanning in sister organizations. It is especially visible in open software projects, where an important role is to communicate between the community of creators and users; people undertaking such roles often do this for different, unconnected projects and voice similar opinions in each case (Barcellini, Détienne, & Burkhardt, 2008). Discovering similar social relations networks is one of sociology’s interesting contemporary challenges, also leading to theoretical developments (Erikson & Occhiuto, 2017; D. Wang, Piazza, & Soule, 2018).

(p.44) Many tools are capable of performing social network analysis, because there are many things to research. Some research can be completed exclusively with online tools. An interesting project is myheatmap.com (a free version of the tool was openheatmap.com), which allows users to input data from a CSV sheet or a Google Doc and visualize it on a map (Figure 3.5). This way, by feeding the system with addresses of McDonald’s restaurants, we may see their density:

Figure 3.5 Saturation with McDonald’s restaurants

Source: http://www.openheatmap.com/view.html?map=NonspeculativeCalcariferousOchidore

Pete Warden developed a web crawler that collected Facebook profile information. The script was operational for six months and collected data from 210 million user profiles, saving given names, family names, locations, friends, and interests. Warden had intended to publish the dataset in 2010 after having it anonymized; however, upon learning of the intention, Facebook’s legal department threatened a lawsuit for breaching the service’s terms of use by not obtaining written permission.4 (p.45) As a result, Warden needed to delete the entire dataset (Giles, 2010). This was a shame, because many of the observations from the anonymized dataset might have been interesting. For instance, Warden remarked that groups of American cities, when analyzed in terms of Facebook friendships, form clusters, with strong connections within them and weak connections outside.5 These clusters can sometimes but not always be explained geographically: it is hard, for example, to envision why Missouri, Arkansas, and Louisiana have stronger ties to Texas than to Georgia. The data was also used for other sociological observations. In the South, “God” ranked high on the list of top 10 liked pages, while sports and beer dominated in the North. The names Ahmed and Mohamed were especially popular in Alexandria, Louisiana. Such trivia, naturally, has little cognitive value on its own, but it could be a good start for quantitative research that could contextualize these observations and lead to actual discoveries.

As part of our research of unequal involvement in open-source projects, coauthored by Peter Gloor and Tadeusz Chełkowski and based on the ideas and data of the latter, we were able to indicate, based on the analysis of almost all Apache Foundation projects, that even though open source projects are frequently described as “open collaboration,” in practice the element of collaboration is illusory (Chełkowski, Gloor, & Jemielniak, 2016). Our quantitative analysis—the first analysis of such an extensive dataset on open software—proved that the vast majority of open source programmers work independently. We also observed that the input of the individual project participants shows an exponential distribution.

We used network analysis to show the connections of all 4661 developers with their 263 projects (Figure 3.6). Apache Taglibs had the highest betweenness centrality, the degree to which it acts as an intermediary between other projects; this was a good indicator of the importance of a project. It was developed by 527 programmers over 15 years. It uses the popular Java Server Pages technology and is modular—which helps to explain its popularity. We correlated the betweenness in the network with the number of code lines, number of participants, and number (p.46) of commits.6 We observed a strong correlation between the number of contributors and the betweenness of a project (r = 0.907, p < 0.001, N = 263). When analyzing the programmer network, the user “jukka” with 6345 tasks had the highest betweenness level. This user participated in 20 projects; the correlation between the number of tasks and betweenness was r = 0.222 (p < 0.001, N = 4660)—it is meaningful but not strong, as other users had more tasks and less betweenness.

Figure 3.6 Social network analysis of Apache programmers

Source: Chełkowski et al., 2016.
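Betweenness centrality, the measure used in the Apache analysis, counts how often a node lies on the shortest paths between other nodes. The sketch below applies Brandes' algorithm to a tiny developer-project network; the choice of algorithm, the node names, and the ties are assumptions made for illustration, since the text does not state which software or procedure was used in the study.

```python
from collections import deque


def betweenness(neighbors):
    """Unnormalized betweenness for an undirected, unweighted graph (Brandes)."""
    score = {v: 0.0 for v in neighbors}
    for source in neighbors:
        # Breadth-first search counting shortest paths from the source.
        order = []
        preds = {v: [] for v in neighbors}
        n_paths = {v: 0 for v in neighbors}
        dist = {v: -1 for v in neighbors}
        n_paths[source] = 1
        dist[source] = 0
        queue = deque([source])
        while queue:
            v = queue.popleft()
            order.append(v)
            for w in neighbors[v]:
                if dist[w] < 0:
                    dist[w] = dist[v] + 1
                    queue.append(w)
                if dist[w] == dist[v] + 1:
                    n_paths[w] += n_paths[v]
                    preds[w].append(v)
        # Walk back from the farthest nodes, accumulating path dependencies.
        dependency = {v: 0.0 for v in neighbors}
        while order:
            w = order.pop()
            for v in preds[w]:
                dependency[v] += n_paths[v] / n_paths[w] * (1 + dependency[w])
            if w != source:
                score[w] += dependency[w]
    # In an undirected graph every pair was counted from both endpoints.
    return {v: s / 2 for v, s in score.items()}


# Hypothetical contribution records: (developer, project).
contributions = [
    ("dev_a", "proj_x"), ("dev_b", "proj_x"),
    ("dev_b", "proj_y"), ("dev_c", "proj_y"), ("dev_d", "proj_y"),
]
graph = {}
for dev, proj in contributions:
    graph.setdefault(dev, set()).add(proj)
    graph.setdefault(proj, set()).add(dev)

scores = betweenness(graph)
for node, value in sorted(scores.items(), key=lambda item: -item[1]):
    print(f"{node}: {value:.1f}")
```

In this toy network dev_b touches only two projects, yet is the sole bridge between them and scores almost as high as the busiest project. The same logic underlies the observation that a developer's task count and betweenness need not go together.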

Thanks to the social network analysis, we were able to show that in open collaboration projects, the most involved users, in terms of (p.47) number of tasks, do not need to play a central role. There are developers in the center of the social network who have only a moderate number of tasks, but who are of key importance to the network.

This observation was confirmed in other quantitative research on different open collaboration projects, such as Wikipedia—for instance, users with the highest edit count are not necessarily elected as organizational functionaries (Burke & Kraut, 2008).

As part of Internet process research, social network analysis is used in political marketing. For the last few years, active bot armies have been found on Twitter, Facebook, and other social media sites (Ferrara, Varol, Davis, Menczer, & Flammini, 2016). There is strong evidence that some states, especially Russia, use bots and armies of professional trolls and commentators to reach political goals (Aro, 2016). It is a fascinating subject for new research projects from the borderlands of sociology of politics and sociology of Internet, where different qualitative and quantitative research techniques need to be combined. When SNA and text analysis are combined with deep learning, it is possible to identify distinctive “virtual tribes,” groups of people sharing word usage and behavioral patterns, often closely correlated with similar lifestyles, political choices, and worldviews (Gloor, Fronzetti Colladon, de Oliveira, & Rovelli, 2019).

How can social network analysis supported by modern technology be applied to offline research? For instance, the Human Speechome Project became the source of recognition for a linguist who installed cameras and microphones at his home and recorded the first three years of his child’s language development (Roy et al., 2006; Tay, Jebb, & Woo, 2017). Users’ geolocation and phone call data made it possible to predict friendship relations with 95 percent accuracy (Eagle, Pentland, & Lazer, 2009), although because of the sudden decline of the phone’s popularity as a contact medium, the situation may alter significantly in the future. Analysis of phone calls of 65 million subscribers showed the relation between social network structure and diversity and access to socioeconomic opportunities (Eagle, Macy, & Claxton, 2010). A study of Twitter usage during a tsunami showed Twitter’s usefulness for crowd-sourced disaster management, provided that it includes opinion leaders and influencers, not just official governmental accounts (Carley, Malik, Landwehr, Pfeffer, & Kowalchuck, 2016).

(p.48) Social network analysis of an archived corpus of Enron’s employee emails led to some interesting conclusions on the dynamics of organizational crisis and informal communication (Diesner, Frantz, & Carley, 2005); in times of organizational crisis, inter-employee communication is intensified across organizational hierarchies. Email pattern analysis combined with machine learning allows the identification of top performers at work (Wen, Gloor, Fronzetti Colladon, Tickoo, & Joshi, 2019). In a wider perspective, all interactions within virtual teams beg for quantitative analyses. Collaborative Innovation Networks (COINs), based on distributed creative, IT-communication-savvy teams are the most efficient source of innovation, constituting a new research subject (Gloor, 2005).

MIT researchers (Olguín et al., 2009) conducted a project with the use of specially designed devices for research subjects to carry. Using the data, the researchers measured the length and number of interactions, proximity to other team members, and physical activity. This allowed them to reach conclusions on patterns of behavior in teamwork, and to quantify social reactions.

A similar project could be based on phone apps. Developing an Android/iOS app is expensive, but worth including in the total costs of a research grant if it will generate useful data. In any case, it will cost less than constructing a dedicated device, and will allow us to specify what kind of data we want to collect. At present, it is more difficult to come up with a research question than to access the data—the data is either already available or it is possible to obtain at a reasonably low cost. In any case, doing social network research is useful as a small component of a larger, Thick Big Data study and as a standalone project. There are many excellent textbooks on the topic (Hennig, Brandes, Pfeffer, & Mergel, 2012; McCarty, Lubbers, Vacca, & Molina, 2019; Robins, 2015).

3.1.3 Online Polls

Online behavior research based on existing data has the advantage of being non-invasive. When conducting such a study, researchers avoid the Hawthorne effect—the influence of the research on the results—although they will never avoid the influence of the need of self-presentation (p.49) of the research subjects towards their reference groups. Thanks to this type of data we may observe racial preferences in matchmaking—the analysis of Yahoo’s match-making portal user profiles showed that white heterosexual male Americans who express a preference usually indicate a lack of interest in black people, while female Americans excluded people of Asian descent (Robnett & Feliciano, 2011). It would be difficult to expect similarly strong declarations, potentially showing racial prejudice in a regular poll, although such questions are asked there frequently. Computational research based on large datasets has exposed a lack of accuracy of many traditional research methods—another example can be the discrepancy between declarations of people being in a specific place at a specific time when shown their mobile phone GPS data (Burrows & Savage, 2014).

Nevertheless, online social research is often based on active participation, for instance through experiments (Hergueux & Jacquemet, 2015). Sometimes the scope of experimental manipulation can be minor: for example, during a project co-run with Facebook on a sample of 61 million people, through differentiating access to information of whether a person’s online friends have already voted in the US Congressional elections, research was done on the social influence on the political activity mobilization (Bond et al., 2012). This was research that raised serious ethical considerations, as it de facto interfered with the elections (Ralph Schroeder, 2014). These issues will be discussed later.

The poll is the easiest form of quantitative online sociological research with the participation of the research subjects. Emailed questionnaires used to be popular, but currently there are often no reasons not to use online questionnaires (Van Selm & Jankowski, 2006), or polls collected in face-to-face meetings. The advantage of online polls over traditional, paper-based ones is obvious: data is collected directly in the form of a database, usually allowing nearly instantaneous creation of simple analyses and charts. This has transformed the structure of the collected data, even during census studies (Aragona & Zindato, 2016). For this reason, the use of online tools to collect data is practically a standard, although we cannot assume that everyone uses the Internet and with distance, issues of controlling the sample or acquiring a sufficient number of responses will arise. Additionally, the Internet is used to contact the interviewees.

(p.50) It makes no sense to abandon polls for some research questions, for instance when representativeness is important. For example, even though some research indicates that Facebook interaction analysis may provide more accurate predictions than polls (Chmielewska-Szlajfer, 2018), and that there is a visible correlation between Twitter mentions and political success (DiGrazia, McKelvey, Bollen, & Rojas, 2013), closer analysis shows that to reach political success, responsiveness and the ability to direct the narrative are of higher importance (Kreiss, 2016). Because Twitter mentions alone may be indicative of interest in politics, not the readiness to voice political support in elections (Jungherr, Schoen, Posegga, & Jürgens, 2017), it is problematic to replace political preference polls with Twitter analysis. In addition, news outlets and corporate accounts use Twitter in a more one-directional way than people do, with different hashtags and behaviors (Malik & Pfeffer, 2016), so their influence in the general poll may skew the results, if not filtered out.

This means that Big Data analysis can be used to create prognostic models but they should supplement polls, not replace them. It seems, however, that there is no escape from the increased use of the new methods of quantitative research, both those that are based on primary data, because the inclination to participate in traditional phone or paper-based polls is on the decline. It is additionally difficult to receive funding for such research, and technological changes are causing many households to get rid of their landlines. At the same time, people are reluctant to answer phone calls that originate from unknown numbers. A system of unique IP addresses combined with browser data may identify people with the same level of precision as phone polls. Nevertheless, the leading use of online polls directed towards the population is in specialized research panels, which partially solves the representation problem.

Online polls are becoming increasingly popular as a research method, even though fewer people have Internet access than phone access (Couper, 2017). The disparities in technology exacerbate other kinds of inequality (Dutton & Reisdorf, 2017). In contrast to phone-based polls, online polls do not allow for a simple representative sampling (Schonlau & Couper, 2017). There are no good ways of unambiguously defining an individual’s identity (one person may have several email addresses, each of which is difficult to link to a physical location or (p.51) demographic characteristics). There is no census or list of individuals that subjects can be drafted from, although with social networks such as Facebook, authorization with social network credentials might eventually be useful.

For these reasons, non-random sampling is dominant, even in academic research, as its advantage is its affordability, even though there are sensible responses to the issue of sampling (Fricker Jr, 2016). “Opt-in” polls are used, where anyone declaring to meet the criteria can sign up (M. Callegaro et al., 2014). Recruitment is conducted through banners, social networks, distribution lists, or mailing lists. Unfortunately, voices have been raised about the problem of careless or simply deceptive participants, overrepresentation of certain social groups, and difficulties in reducing that overrepresentation (Bethlehem, 2010). These problems are especially visible in polls that offer remuneration to the participants.
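One common textbook response to overrepresentation in opt-in samples is post-stratification weighting, in which each respondent is weighted by the ratio of their group's share in the population to its share in the sample. The text does not name this technique, so the sketch below is an assumption about one possible correction; the groups, the answers, and the population shares are hypothetical.

```python
from collections import Counter


def poststratification_weights(sample_groups, population_shares):
    """Weight per group = population share / sample share."""
    counts = Counter(sample_groups)
    n = len(sample_groups)
    weights = {}
    for group, pop_share in population_shares.items():
        if counts[group] == 0:
            raise ValueError(f"no respondents in group {group!r}; cannot weight")
        weights[group] = pop_share / (counts[group] / n)
    return weights


def weighted_share(responses, groups, weights):
    """Weighted share of 'yes' answers (responses are True/False)."""
    total = sum(weights[g] for g in groups)
    yes = sum(weights[g] for r, g in zip(responses, groups) if r)
    return yes / total


# Hypothetical opt-in sample in which young respondents are overrepresented.
groups = ["young"] * 8 + ["old"] * 2
answers = [True] * 6 + [False] * 2 + [False] * 2  # 6 of 8 young say yes, 0 of 2 old
population = {"young": 0.5, "old": 0.5}  # assumed known, e.g. from a census

weights = poststratification_weights(groups, population)
raw = sum(answers) / len(answers)
adjusted = weighted_share(answers, groups, weights)
print(f"raw share of 'yes':      {raw:.3f}")
print(f"weighted share of 'yes': {adjusted:.3f}")
```

Weighting only rebalances groups that are known and measured. It does not remove self-selection within a group, nor does it help with careless or deceptive participants.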

Yet another problem of online polls is an error resulting from a large number of people deciding to answer only some of the questions (Couper, 2000). In classical polls we are nearly positive about the number of people who received invitations and when they were invited, but if online polls encourage participation by banners, we will receive inaccurate information about the number of views because of ad-blocking software even though marketing research and the analysis of traces left by web surfers have made measurements more precise. With emailed invitations or forum posts, the situation is even worse—often we will not know how many people withdrew from participation and why. At the same time, we need to remember that online polls, even though they are quite convenient from the viewpoint of data collection, may be problematic in reference to participants’ privacy; they might also interfere with regular conversation in online communities (Cho & Larose, 1999).

To conduct online polls, both for large groups of participants who are strangers to us and for our own face-to-face polls, we can use many free tools. Google Forms (google.com/forms) does a good job for yes/no, multiple-choice and open-answer questionnaires (see Figure 3.7). It has an ergonomic and clear interface, the possibility of limiting access to specified Google accounts, conditional logic for the sequence of questions, and the possibility of randomizing questions within a section; its free version has no limitations, and it offers an (p.52) esthetic way of presenting the results or downloading them as CSV files. I have often used Google Forms in my quantitative research of Harvard students, when I was trying to capture their perception of the fairness of sharing media files (Hergueux & Jemielniak, 2019). I studied 50 people from one year group of the LL.M. and 60 from the next one, although I used the traditional method of sitting in one room with my interviewees. I relied on computer-assisted personal interviewing (CAPI) (Mario Callegaro, Manfreda, & Vehovar, 2015). My rationale was to make absolutely sure all participants would be from my target group. Additionally, I did not want the interviewees to multitask while filling in the questionnaire: doing many things at once is a serious risk in online research and may skew the results. Finally, because a single cohort of LL.M. students has around 200 people, I wanted to present (p.53) the questionnaire to a few dozen students from each year. An emailed questionnaire has a low return rate (Cleary, Kearney, Solan-Schuppers, & Watson, 2014), and a disadvantage of online polls as such is missing or incomplete answers (LaRose & Tsai, 2014).

Figure 3.7 Default presentation of a Google Forms questionnaire results

Among other tools, esurveycreator.com is free for researchers, and esurv.org, kwiksurvey.com and opinahq.com are free for everyone. They all contain many useful options, as does limesurvey.com. Among the paid tools, surveymonkey.com is popular, with some limitations in its free version.

When using polls, the researcher must exercise caution, as unintentional mistakes are easy to make (C. S. Fischer, 2009); these result from the sequence or wording of questions. As this is true of all questionnaires, I will not explore these issues; however, in online polls on large samples, such mistakes are easily magnified. When the study is conducted on a large group of anonymous people who are compensated for their participation (as with Amazon Mechanical Turk), one needs to remember that the value of the data may be low: the respondents will be motivated to finish the poll as quickly as possible, and controlling their conditions during poll-taking, as well as the general demographics, may be impossible.

Online research also makes it unfortunately easy to introduce manipulations. For example, in a poll on risky sexual behaviors among Latinos, a precise analysis of the answers led to the elimination of 11 percent of completed questionnaires because of suspected bad faith; it seemed that a single person who wanted to game the system had filled out as many as 6 percent of the questionnaires (Konstan, Simon Rosser, Ross, Stanton, & Edwards, 2005).
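A first screening for careless or duplicated questionnaires can be approximated with a few lines of Python. The following is a minimal sketch, not the procedure used in the study cited above: the field names (id, seconds, fingerprint), the 60-second threshold, and the sample records are all assumptions made for illustration.

```python
from collections import Counter

def flag_suspect_responses(responses, min_seconds=60, max_per_fingerprint=1):
    """Return the ids of responses that look careless or duplicated.

    `responses` is a list of dicts with hypothetical keys: 'id',
    'seconds' (completion time) and 'fingerprint' (for instance a hashed
    IP address or cookie; an assumption, not a field any tool guarantees).
    """
    per_fingerprint = Counter(r["fingerprint"] for r in responses)
    flagged = set()
    for r in responses:
        too_fast = r["seconds"] < min_seconds
        repeated = per_fingerprint[r["fingerprint"]] > max_per_fingerprint
        if too_fast or repeated:
            flagged.add(r["id"])
    return flagged

# Invented sample data, for illustration only
sample = [
    {"id": 1, "seconds": 420, "fingerprint": "a"},
    {"id": 2, "seconds": 35, "fingerprint": "b"},   # implausibly fast
    {"id": 3, "seconds": 300, "fingerprint": "c"},
    {"id": 4, "seconds": 310, "fingerprint": "c"},  # same source as id 3
]
suspect = flag_suspect_responses(sample)
share = len(suspect) / len(sample)  # share of questionnaires to inspect by hand
```

Flagged questionnaires are candidates for manual inspection rather than automatic removal, since a shared fingerprint may also mean a shared computer.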

3.1.4 Culturomics

The notion of culturomics was proposed by ten authors of an article in Science, which was very important from the viewpoint of methods of online social research (Michel et al., 2010). Their painstaking work required the support of Google and produced a striking outcome: they used a corpus of millions of digitized books, accounting for about 4 percent of all books that have been published in English.

(p.54) With this database, they were able to come up with an original form of computational lexicography, for use in the research of linguistic, cultural, and sociological trends. One of their observations was that the English lexicon had 597,000 words in 1950 and 1.022 million in 2000: an average annual increase of 8,500 words over a 50-year period. Culturomics allowed the indexing of more words than any dictionary could. Similarly, it became possible to research trends in grammar, such as the tendency to create regular past forms of previously irregular verbs. A simple study of changes in the frequency of specific terms also produced an interesting result: 1939 visibly, and for good reasons, marked a sharp decline in the use of the term “the Great War” in favor of “First World War” and “Second World War.”

The analysis of the most notable people born between 1800 and 1950 in samples of 50 people per year, with biographies acquired from Wikipedia, showed a visible trajectory of fame—the peak of the mentions for each individual was noted about 75 years after their birth. With time, fame came to notable people earlier and grew faster, although its timespan is shorter than what it had been, which is a good reason to research the phenomenon of celebrity not only in show business but also in academia.

Quite surprisingly, and importantly for our knowledge of the contemporary world and for the sociology of politics, publications have over time focused increasingly on current issues. References to 1880 had fallen by half by 1912. For references to 1973, the same decline took only ten years. This could indicate that fewer books on history are being written and that people are more eager to forget it. The article contains other fascinating observations and, above all, brings an interesting approach to digital sociology.

The co-authors of the Science article included Jean-Baptiste Michel and Erez Lieberman Aiden of Harvard University. They also participated in the creation of the Google Ngram Viewer, which allows research into cultural trends and the frequency of word occurrence over time, based on the volumes in Google Books. By 2015 the collection numbered 25 million volumes, about a fifth of the estimated 129 million volumes published from the invention of the printing press until 2010 (Taycher, 2010). The Ngram Viewer plots very useful charts, similarly to Google Correlate, but based on the corpus of books within the repository, not on search terms. Unfortunately, it is limited to eight languages, and probably will (p.55) not add more. Figures can be generated only for books published between 1800 and 2008. Out of curiosity, I counted the number of mentions and references to leading sociologists from 1920:

Figure 3.8 shows that Michel Foucault’s ideas were as popular as those of Max Weber, while from 1996 the frequency of mentions of both sociologists decreases similarly. Zygmunt Bauman’s plotted curve seems stable for years and resembles that of Erving Goffman. The creators of culturomics offer a tool on their website with which one can access the Ngram data for a given set of queries.7

Figure 3.8 Reference Ngram for prominent sociologists

Source: https://goo.gl/bEwtS8
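For readers who prefer to work with raw counts rather than ready-made charts, the logic behind such a plot can be sketched in Python. The sketch assumes that yearly counts for each phrase and yearly totals for the whole corpus have already been obtained; the tuple layout, the function names, and all the numbers below are invented for illustration and do not come from the Google Books data.

```python
from collections import defaultdict

def relative_frequency(rows, totals):
    """Turn raw counts into shares of all words printed in a given year.

    rows   -- iterable of (term, year, match_count) tuples (assumed layout)
    totals -- dict mapping year to the total number of words in the corpus
    """
    series = defaultdict(dict)
    for term, year, count in rows:
        series[term][year] = count / totals[year]
    return dict(series)

def peak_year(series_for_term):
    """Return the year in which a term reached its highest relative frequency."""
    return max(series_for_term, key=series_for_term.get)

# Invented counts, for illustration only
rows = [
    ("the Great War", 1930, 900),
    ("the Great War", 1945, 300),
    ("First World War", 1930, 10),
    ("First World War", 1945, 1200),
]
totals = {1930: 1_000_000, 1945: 1_500_000}
freq = relative_frequency(rows, totals)
```

Dividing by the yearly total matters because the number of books published grows over time; raw counts alone would make almost every term appear to become more popular.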

Other culturomics research from 2012 showed the dynamics of popularity of individual words in English, Hebrew, and Spanish, on a database of 10^7 words. With time, the number of words falling out of circulation increases, while the pace of inflow of neologisms drops, although scanning errors have contaminated the data. At the same time, the 20–30 years before 2008 showed a rapid increase in the use of neologisms, most likely associated with technical vocabulary. Neologisms and evolution of language were noticeably affected by wars and other major historical events. The peak in the increase of word use was noted around forty years after their introduction into the language.

It is easy to notice changes in terminology when they are associated with wider social changes. This type of quantitative observation simply begs to be supplemented with deep qualitative research. For instance, the English word “gay” originally meant a joyous or eccentric person. In the 1960s, the homosexual community claimed the term (Oxford English Dictionary, 2018), as opposed to other terms of pejorative character. After twenty years, it replaced a neutral medical word: “homosexual” (see Figure 3.9).

Figure 3.9 The Ngram of the use of the words “gay” and “homosexual”

Source: https://goo.gl/vGVMzd

This type of research can also be done on corpuses of magazines and popular press. The results can be fascinating. For instance, an analysis of thirty years’ worth of world media reports on important events, combined with geographical analysis, was able to forecast, although retrospectively, revolutions in Tunisia, Egypt, and Libya, and the stability of Saudi Arabia (Leetaru, 2011).

Culturomics is useful not only in social research but also in the digital humanities, where a small revolution in favor of quantitative research is (p.56) (p.57) (p.58) taking place (Nicholson, 2012). The use of the Google Books corpus has some important limitations, stemming from a clear bias toward academic books and literary fiction, which distorts the image of the use of language and conclusions on culture and society, although the subset of novels seems robust (Pechenick, Danforth, & Dodds, 2015). This is why it is useful to approach the results obtained with the sole use of this method with a grain of salt. Nevertheless, culturomics is an incredibly valuable supplement to other kinds of digital research.

In the social sciences, the supplemental use of this method is still taking shape, but in some areas, such as changes in the perception of professions (Mattson, 2015), it has considerable explanatory value. It is very useful in research on the sociology of fame and systems of social stratification (Van de Rijt, Shor, Ward, & Skiena, 2013). Other methods of computational text analysis for sociological studies are also taking shape (Evans & Aceves, 2016).

3.1.5 Scraping

In online social research, it is often necessary to collect simple, repetitive data available on a website but not always accessible through an API. We can imagine, for instance, that we would need the prices of all children’s books from Amazon’s website—such a manual collection of this data would be extremely problematic, if feasible at all. However, even pure price data may be a source of serious socioeconomic analyses (Cavallo, 2018).

What is necessary is the “scraping” of data. Programming-savvy people write their own scripts for such purposes, or adapt existing code. Luckily, there are some easy-to-use tools at our disposal that I will describe later.

Data scraping in itself may impose some notional categories through the structure of scraped data, and, as a construct which is foreign to social sciences, it may also impose perspectives which are not typical to the social sciences, such as obsession with the data being up-to-date (Marres & Weltevrede, 2013). However, there is no escaping “investigative social sciences” based on the collection of detailed data from the Internet (McFarland, Lewis, & Goldberg, 2016).


Example 1: Donald Trump’s tweets on climate

Let’s assume I want to scrape all historical tweets by Donald Trump containing the phrases “climate” or “global warming.” They are accessible from Twitter’s advanced search page, but copying and pasting them would be time-consuming, and I may also want to analyze the number of retweets, reactions or replies. For such a simple project I can use a Chrome plugin, Web Scraper, developed by the ScrapeHero team. It allows very easy data acquisition without any programming. A step-by-step guide is available here: https://www.scrapehero.com/how-to-scrape-historical-search-data-from-twitter/. The tool allows me to scrape the tweets into a CSV database, which I can work on in Excel. I only need to run both of the advanced searches, and after a couple of clicks I have my database. I immediately see that the numbers of retweets, comments, or favorites are in a text format. Unfortunately, thousands are rendered as “k,” so instead of 6000 I see 6k. For just 125 tweets a manual correction is acceptable; for larger sets it would make sense to run a conversion. I see that by far the most popular tweet about climate or global warming from Donald Trump is:

“Patrick Moore, co-founder of Greenpeace: ‘The whole climate crisis is not only Fake News, it’s Fake Science. There is no climate crisis, there’s weather and climate all around the world, and in fact carbon dioxide is the main building block of all life.’ @foxandfriends WOW!”

Donald Trump (@realDonaldTrump). March 12, 2019. Tweet.

The tweet makes the false claim that Patrick Moore, a climate change denier and an industry lobbyist, had been a co-founder of Greenpeace, a claim that Greenpeace USA immediately disputed.8
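For larger sets, the conversion of abbreviated counts mentioned in this example can be automated. A minimal sketch in Python, assuming that the scraped values look like “6k,” “1.2K,” “3,400,” or “125”; the “m” suffix for millions is an added assumption, and parse_count is a hypothetical helper name.

```python
def parse_count(text):
    """Convert strings such as '6k', '1.2K', '3,400' or '125' to an integer."""
    cleaned = text.strip().lower().replace(",", "")
    multipliers = {"k": 1_000, "m": 1_000_000}  # 'm' is an assumed extension
    if cleaned and cleaned[-1] in multipliers:
        # '6k' -> 6 * 1000; round() guards against floating-point residue
        return round(float(cleaned[:-1]) * multipliers[cleaned[-1]])
    return int(cleaned)

# Invented values standing in for one scraped column
retweets = [parse_count(value) for value in ["6k", "1.2K", "3,400", "125"]]
```

Note that an abbreviated value has already lost precision at the source: “6k” may stand for anything between roughly 5,950 and 6,049, so the converted numbers are suitable for ranking rather than for exact totals.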

For bigger projects I can use OctoParse, a handy installable tool that even in its free version scrapes data from different sources and offers simple templates for harvesting popular websites such as Twitter, Amazon, Booking, Instagram, YouTube, Google, and Yelp. This kind of data can be used for sentiment analysis. There are good corpuses of (p.60) positive and negative words available. One can imagine, for example, studying the sentiment of certain phrases in the official messaging of companies listed on the stock exchange. One useful tutorial on combining OctoParse scraping with sentiment analysis in Python can be found at: https://hackernoon.com/twitter-scraping-text-mining-and-sentiment-analysis-using-python-b95e792a4d64.
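The simplest form of sentiment analysis based on lists of positive and negative words can be sketched as follows. This is an illustration of the general lexicon-based technique, not the method of the tutorial linked above; the two tiny word lists are invented placeholders for a real lexicon, and sentiment_score is a hypothetical helper name.

```python
import re

# Toy lexicons, invented for illustration; real studies use published word lists
POSITIVE = {"good", "great", "growth", "strong", "gain"}
NEGATIVE = {"bad", "loss", "weak", "crisis", "decline"}

def sentiment_score(text):
    """Return (positive - negative) / matched words, or 0.0 if nothing matched.

    The score ranges from -1.0 (only negative words) to 1.0 (only positive).
    """
    words = re.findall(r"[a-z']+", text.lower())
    pos = sum(word in POSITIVE for word in words)
    neg = sum(word in NEGATIVE for word in words)
    matched = pos + neg
    return (pos - neg) / matched if matched else 0.0

example = sentiment_score("Strong growth despite a weak quarter")
```

Such word counting ignores negation (“not good”) and irony, which is one reason why its results call for qualitative reading of at least a sample of the texts.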

Additionally, some very useful tools not requiring coding include Google Sheets add-ons, such as Twitter Archiver, allowing free Twitter scraping, as well as Meaning Cloud, allowing a pretty solid sentiment analysis, and Wikidata Tools, which helps with querying Wikidata directly into Google Sheets.

Example 2: Quora

For this monograph, I will explain how to scrape data from Quora, arguably the most popular service used to ask questions, visited by 300 million active users. On Quora, people post rather longish answers to questions asked by others. For this demonstration, I assume that I wish to check if people who give most answers on Quora also ask the most questions.

Scraping can be performed with many tools and without skills in Python or R programming (although such skills allow for more flexibility), but some knowledge of website construction might come in handy. For instance, the syntax of websites that use Asynchronous JavaScript And XML (AJAX) is much more complex than that of sites that use simple HTML with added CSS, so the researcher should know that scraping these two types of websites can differ significantly. Here, I will use ParseHub (parsehub.com) but the reader may use any other service.

Quora’s structure was one reason for its success over Yahoo!Answers, a service established in 2005 that met a similar need. Quora allows fluid interaction among its users, including asking and answering questions, commenting on, upvoting and downvoting answers, marking the answers with thematic categories, browsing the topics, with the pages that contain statistics of each (including the most frequently read providers of answers), and tracing whether within the questions, topics, or answers of selected users there is any new material.

Profiles of Quora users may be public or private, and the most popular authors are in the “top writer” program. The functions have remained (p.61) practically unchanged for years. Quora does not make it easier to browse the answers in a homogenous structure, for instance on a comprehensive map of subpages or within a category tree, which makes it hard to assess the size of the database of questions and answers. The service is a gold mine of information for practitioners of online sociology, but it is difficult to perform quantitative analyses without data scraping.

The advantage of scraping as a method for collecting data is the automation of downloading bits of data, thanks to which we may benefit from the scale effect of a multitude of identically designed pages. Scraping tools, if set up properly, can extract data from the defined areas of the subpages. One needs to start with proper identification of the pages which have a repetitive layout. Taking into account that Quora is in the top 100 of the Web’s most popular sites, it is no wonder its structure is schematic and repetitive: each answer to each question is served by the same content management system, optimized for search engines.

In order to display large sets of ordered data, websites divide them into connected subpages, i.e. they allow navigation through internal links. Scraping tools may be adjusted to a pattern of navigation and the collection of specific data. For this example, I will focus on the subpages within Quora’s /topic/“thread-name”/writers range. The scraping algorithm will acquire the title of the thread and move to the profile of the author. The author data is anonymized. The sequence is presented in Figure 3.10:

Figure 3.10 Scraping algorithm sequence

Profiles of individual authors are available under addresses with the syntax /profile/“user_name,” where user data can be acquired if the profile is set as public. In this example, I will acquire only data about questions asked and answered. A user profile may look like this (Figure 3.11):

Figure 3.11 An example of an author’s profile on Quora

We provide these values at ParseHub and we indicate that they are to be scraped. The algorithm collects the data from the “top writers” profiles and automatically visits the next topic—if there was a map of subpages on Quora, it could be used to assist ParseHub in navigation; as it is absent, we move from one topic to another (Figure 3.12):

Figure 3.12 Choosing related topics on Quora

The algorithm will run until halted. Taking into account Quora’s size, it is hard to assess how many topics need to be visited so that sufficient amounts of data for analysis can be acquired. The problem could be solved by supplementing the scraping with the “crawlers” or “spiders” which would first collect all the topic subpages.
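The idea of a crawler that first collects topic subpages, and that stops after a fixed number of them instead of running until halted, can be sketched in Python. This is an assumption-laden illustration and not the ParseHub project: the page markup is invented, the pages live in memory, and fetch stands for whatever download function the researcher supplies. Only the /topic/ and /profile/ address patterns follow the description above.

```python
from collections import deque
from html.parser import HTMLParser

class LinkCollector(HTMLParser):
    """Collect the href attribute of every <a> tag on a page."""
    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            href = dict(attrs).get("href")
            if href:
                self.links.append(href)

def crawl_topics(start, fetch, max_topics=100):
    """Breadth-first walk over topic pages, stopping after max_topics.

    fetch -- caller-supplied function mapping a path to an HTML string
    Returns the topics visited (in order) and the set of profile paths seen.
    """
    queue, seen, profiles, visited = deque([start]), {start}, set(), []
    while queue and len(visited) < max_topics:
        topic = queue.popleft()
        visited.append(topic)
        parser = LinkCollector()
        parser.feed(fetch(topic))
        for href in parser.links:
            if href.startswith("/profile/"):
                profiles.add(href)
            elif href.startswith("/topic/") and href not in seen:
                seen.add(href)       # remember the topic so it is queued once
                queue.append(href)
    return visited, profiles

# A toy in-memory "site" with invented markup
PAGES = {
    "/topic/Sociology": '<a href="/profile/Ann">Ann</a><a href="/topic/Ethnography">E</a>',
    "/topic/Ethnography": '<a href="/profile/Bob">Bob</a><a href="/topic/Sociology">S</a>',
}
visited, profiles = crawl_topics("/topic/Sociology", lambda path: PAGES.get(path, ""))
```

Keeping a set of already seen topics prevents the crawler from looping between topics that link to each other, and the max_topics cap makes the size of the sample a deliberate research decision.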

(p.62) For those interested, the query algorithm is as follows:

The simple algorithm in Figure 3.13 acquired data of over 8000 users within 893 topics overnight. After a quick SPSS analysis of averages, it was apparent that the most-read authors are less eager to ask questions. On average, a member of the “top writers” category answered 244.99 questions, while asking 28.52. It may therefore be concluded that answering questions is a different social activity from asking them, and these two social roles—people sharing knowledge and people seeking knowledge—do not necessarily combine in one person, at least for the most popular providers of answers, which is counter-intuitive.

Figure 3.13 The scraping algorithm on ParseHub
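The comparison of averages reported above does not require SPSS; it can be reproduced with Python’s standard library. The sketch below assumes that each scraped profile has been reduced to two numbers, and the key names (answers, questions) as well as the three sample records are invented for illustration, so the figures do not match the study.

```python
from statistics import mean

def summarize(profiles):
    """Compare how much the scraped users answer and ask.

    profiles -- list of dicts with hypothetical keys 'answers' and 'questions'
    """
    answers = [p["answers"] for p in profiles]
    questions = [p["questions"] for p in profiles]
    more_answers = sum(a > q for a, q in zip(answers, questions))
    return {
        "mean_answers": mean(answers),
        "mean_questions": mean(questions),
        "share_answering_more": more_answers / len(profiles),
    }

# Invented sample, for illustration only
sample = [
    {"answers": 300, "questions": 10},
    {"answers": 150, "questions": 40},
    {"answers": 30, "questions": 35},
]
stats = summarize(sample)
```

Averages are sensitive to a handful of extremely active users, so the share of users who answer more than they ask is reported alongside them as a simple robustness check.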

Naturally, data scraping in itself is only one side of the story. The other is how far the data can be analyzed with basic statistical tools. Sometimes the use of more complex tools is required: these may be text mining, qualitative data analysis (QDA), or tedious Python/R processing. This example is a basic one. For the purpose of text mining or linguistic analysis of the collected data there are many ready-made tools that do not require the user to have specialist (p.63) (p.64) preparation: IBM Watson Analytics, RapidMiner, Leximancer, and libraries such as Linguistic Inquiry and Word Count (LIWC), although these are mainly usable for research in English. Qualitative analysis may also be supported by CAQDAS software, which offers the researcher partial automation of coding or transcription. Sites such as Search Engine Scraper or Lippmannian Device allow the user to research the frequency of specific word use on defined websites.

While scraping, we need to keep in mind that we are putting workload on the infrastructure of the services where we direct our queries. Therefore, if the websites offer making queries through their Application Programming Interface (API), this is definitely preferable to pure scraping; however, no such possibility exists for Quora. We also need to abide by the Terms of Service (TOS) and the rules included in the robots.txt file. Above all, what the researcher needs is common sense—and to limit the frequency of queries so as not to overload the servers.
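Both precautions, respecting robots.txt and limiting the frequency of queries, can be built into a script. The following is a minimal sketch using Python’s standard urllib.robotparser module; the robots.txt content, the paths, the “research-bot” name, and the two-second default delay are invented examples, and fetch stands for a download function supplied by the researcher.

```python
import time
from urllib.robotparser import RobotFileParser

def build_robot_rules(robots_txt):
    """Parse the text of a robots.txt file into a rule checker."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser

def polite_fetch(paths, rules, fetch, user_agent="research-bot", delay=2.0):
    """Fetch only the paths that robots.txt allows, pausing between requests."""
    results = {}
    for path in paths:
        if not rules.can_fetch(user_agent, path):
            continue                 # the site asks crawlers to stay away
        results[path] = fetch(path)
        time.sleep(delay)            # spread the load on the server
    return results

# Invented robots.txt content, for illustration only
EXAMPLE_ROBOTS = """User-agent: *
Disallow: /private/
"""
rules = build_robot_rules(EXAMPLE_ROBOTS)
pages = polite_fetch(
    ["/topic/a", "/private/b"], rules, fetch=lambda path: "<html>", delay=0
)
```

A robots.txt check does not replace reading the Terms of Service: the file expresses technical crawling preferences, while the legal conditions of data use are stated elsewhere.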

(p.65) In reference to the API, an example is Reddit.com, an aggregator of comments and news items, which provides a very complex and detailed description of the availability of its data under reddit.com/dev/api. One may use the API also with no programming skills. For example, collecting the posts in the r/science community can be done with the following syntax: reddit.com/r/science/top/.json?t=all&limit=100, where “top” indicates that we are interested in the most popular threads, “t” sets the time period over which popularity is counted (here, all time), and “limit” imposes the limit of threads to be shown; 100 is the maximum value. However, the results need further processing, unlike data obtained through scraping. Also, the cap of 100 results is a serious limitation.
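The further processing that such results need can be sketched in Python. The sample below is a hand-written stand-in for an API response, reduced to the fields used here; the structure (a listing whose data holds children and an after token for requesting the next page) follows Reddit’s API documentation as far as can be stated with confidence, but field names and access conditions should be checked against the current documentation before use. The helper names are hypothetical.

```python
import json
from urllib.parse import urlencode

def listing_url(subreddit, sort="top", period="all", limit=100, after=None):
    """Build the address of one page of results; 'after' requests the next page."""
    params = {"t": period, "limit": limit}
    if after:
        params["after"] = after
    return f"https://www.reddit.com/r/{subreddit}/{sort}/.json?{urlencode(params)}"

def parse_listing(raw_json):
    """Extract titles and scores, plus the token pointing to the next page."""
    listing = json.loads(raw_json)["data"]
    posts = [
        {"title": child["data"]["title"], "score": child["data"]["score"]}
        for child in listing["children"]
    ]
    return posts, listing.get("after")

# Hand-written stand-in for a response, for illustration only
SAMPLE = json.dumps({
    "kind": "Listing",
    "data": {
        "after": "t3_abc",
        "children": [{"kind": "t3", "data": {"title": "Example post", "score": 42}}],
    },
})
posts, next_page = parse_listing(SAMPLE)
next_url = listing_url("science", after=next_page)
```

Requesting consecutive pages with the after token softens the limit of 100 items per query, although the total depth of a listing remains capped by the service.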

Working with the APIs of different websites is facilitated by dedicated tools. Many data processing packages allow API access to data, as with tools like The Digital Methods Initiative Twitter Capture and Analysis Toolset (DMI-TCAT), RapidMiner, Tableau, Condor, or Gephi.

3.1.6 Other Useful Tools

It is beyond the scope of this monograph to cover all topics in detail. Its goal is not to present a complete and deep description of the tools of digital sociological analysis. Every method would deserve its own detailed description. However, people who have not used data mining software before will find it more rewarding to get a sense of its fundamentals, what limitations it imposes, and how to get started with their own data collection.

In this section, I offer concise descriptions of a few more tools, with some examples of their use. I am focusing on programs that do not require programming skills, assuming that readers who are proficient at programming will benefit more from the sections on qualitative studies, culture studies, and research ethics.

ScrapingHub

ScrapingHub.com was developed by people working on Scrapy.org. The latter is an excellent open source program to set up one’s data crawlers, but it requires programming skills. ScrapingHub contains tools that can produce the same results with a visual editor, Portia. In Portia, we may (p.66) choose the website that concerns us and, through an intuitive interface, mark the data we want. Portia works well with websites that do not allow for easy data export. It is a sensible alternative to ParseHub.

OctoParse

OctoParse.com is another service for easy automated collection of text- and image-based data, similar to ParseHub. It works well with several webpage formats, including AJAX, dynamically created, or JavaScript-based ones. The service is cloud-based, with the use of hundreds of servers, thanks to which the risk of the process being blocked by the website is minimized. Data can be scraped or accessed through the API. The free version comes with many limitations but it still allows research to be conducted on simple datasets. OctoParse can be installed on a desktop computer and comes with many pre-loaded templates for scraping popular websites, including Twitter, Yelp, or Amazon.

COSMOS

COSMOS is a Java application for Windows, part of the socialdatalab.net service. It is available for free and may be useful in analyzing Twitter data based on country and demographic information. It allows the construction of custom enquiries, drawing the diagrams of retweets, frequency of occurrence of specific phrases, charts, maps, and word clouds based on frequency data. It is worth visiting the Social Data Science Lab service to read useful guides on hate speech online.

DiscoverText

DiscoverText.com is run by Texifter. It is a complex cloud-based tool combining machine learning, coding by human teams, and a variety of computer text analysis tools, working with data from simple unstructured corpora as well as from poll results and Twitter. Unfortunately, as of September 2018, the use of Twitter data requires the consent of the company behind it, but if such permission is granted, DiscoverText allows for many useful analyses.

Labs.polsys.net

This address hosts a family of simple, yet useful tools, with the added benefit of exporting the data to Gephi. I will describe a few that readers can experiment with.

(p.67) Spotify Artist Network, based on Spotify data, allows analysis of the network of interconnected music artists. It allows exporting of the network to Gephi for further analysis. It can show, for instance, that people listening to Johnny Cash are also interested in country music generally (see Figure 3.14). The structure of the network is homogeneous, but in a more general view there is a separate cluster of Southern US rock (with a high popularity of Lynyrd Skynyrd), and American folk.

Figure 3.14 Example network of interconnected music artists using Spotify Artist Network and Gephi

NetVizz is a Facebook app that allows scraping the data of the service’s public pages and groups for academic research (Rieder, 2013). It originally enabled the scraping of non-public groups but this feature has been removed. Following the Facebook data leak scandals, revealed in 2018, the future of the application is uncertain. It also did not pass Facebook’s audit, as it was considered not to provide added value for Facebook users.9 As of this writing (summer 2019), however, it still works. Thanks to NetVizz, we may analyze friendship networks, and demographic and relation data. It also allows analysis of the relations of likes among Facebook pages, which enables the discovery of clusters of propaganda groups. It will be unfortunate to lose it.

The other tools are a YouTube data extractor and others that require a server-side installation and programming skills.

Chorus

Chorus (chorusanalytics.co.uk) is a free tool to scrape and visualize Twitter data (Brooker, Barnett, & Cribbin, 2016). Twitter is a (p.68) tremendous source of data, as the access is completely public, although one needs to pay for the restricted use of historic data. We may therefore study interactions, communication networks, and connection clusters, with no fear of the deeper structure of interactions being hidden from us because of restricted permissions. Naturally, independent of technical availability, we need to control the issues of legality of access and use of the data.

Webometric Analyst

Webometric Analyst (Thelwall, 2009) is a free Windows tool that enables different kinds of analyses, such as the frequency of phrase occurrence or mentions of websites. It can be downloaded from lexiurl.wlv.ac.uk/index.html. It also allows visualization of the relations between sites or tweets. It uses the API of sites such as Mendeley, Microsoft Academic, or Google Books and allows the study of citations. Finally, it allows the creation and analysis of diagrams of relations based on social networks such as YouTube, Twitter, Tumblr, or Flickr.

Netlytic

Netlytic.org is a tool for text and social network analysis. It uses the APIs of Twitter, Instagram, YouTube, and Facebook’s public groups but it also allows the researcher to study databases and RSS imports. With Netlytic we may identify popular topics and make impressive visualizations, including the geolocation of some data.

DigitalMethods.Net

DigitalMethods.net contains links to miscellaneous small tools, such as a Disqus discussion scraper (Disqus is one of the most popular commercial web forum systems), a tool enabling Amazon book analysis, a script collecting search suggestions on Google, and a scraper of GitHub project-related data.

WikiChron

WikiChron.Science enables convenient comparison of dozens of Wikimedia sites and projects with different criteria (Serrano, Arroyo, & Hassan, 2018) and chronologically, which gives it an advantage over other tools used in similar analyses, available from the repository at (p.69) tools.wmflabs.org. It also allows visualization of query results, making work much easier.

Big Data Tools

Independent work with Big Data requires complex tools. Fortunately, many of these tools are free/open source but still require expertise in IT. Apache Hadoop (hadoop.apache.org) is one of the most popular tools and is capable of quickly processing large structured and unstructured data sets. It has many extra modules, including a scalable machine-learning project (Mahout). An alternative to Hadoop is Lumify (altamiracorp.com/index.php/lumify), which boasts easy 2D and 3D visualizations, dynamic histograms, or interactive geospatially organized dataviews. HPCC (hpccsystems.com) may be used in a similar way; this tool has been praised for its scalability and a well-developed Integrated Development Environment (IDE). All the tools are advanced and will make no sense to users who are not experienced coders, and the experienced coders are aware of them already.

Finally, RapidMiner (rapidminer.com) is an interesting alternative; its platform is easier to use and does not require coding skills. It is available as free/open source software and is also offered as a paid web service. It is available for free to academics who cannot cover the cost from a grant.

Social Network Analysis: Other Tools

Social network analysis does not always require independent data collection but it does always require the data to be processed. Desktop software works well in this respect. The simplest solutions even include an MS Excel plugin, NodeXL (nodexl.codeplex.com), which makes it easy to generate basic graphs. One of the more advanced tools is the Social Network Visualizer (socnetv.org). Like Gephi, this is a free/open source solution that allows complex analyses, extending beyond standard measures. It also offers crawlers that analyze the reference networks based on a single URL, and an intuitive interface allowing an exploratory approach to data.
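To give a sense of what the “standard measures” computed by such programs are, two of the simplest ones, degree centrality and network density, can be calculated in a few lines of Python. This is a sketch for an undirected network given as a list of pairs; the names and the four relations are invented, and dedicated software remains preferable for anything beyond a toy example.

```python
from collections import defaultdict

def degree_centrality(edges):
    """Share of the other nodes that each node is directly connected to."""
    neighbours = defaultdict(set)
    for a, b in edges:
        if a != b:                      # ignore self-loops
            neighbours[a].add(b)
            neighbours[b].add(a)
    n = len(neighbours)
    if n < 2:
        return {}
    return {node: len(adj) / (n - 1) for node, adj in neighbours.items()}

def density(edges):
    """Existing ties divided by all possible ties in an undirected network."""
    unique = {frozenset(edge) for edge in edges if edge[0] != edge[1]}
    nodes = {node for edge in unique for node in edge}
    n = len(nodes)
    return 2 * len(unique) / (n * (n - 1)) if n > 1 else 0.0

# Invented relations, for illustration only
edges = [("Ann", "Bob"), ("Ann", "Cy"), ("Ann", "Di"), ("Bob", "Cy")]
centrality = degree_centrality(edges)
network_density = density(edges)
```

In this toy network Ann is connected to everyone else and therefore has the maximum centrality of 1.0, while Di, with a single tie, has the lowest.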

It is also worth looking at the possibilities offered by Intelligent Collaborative Knowledge Networks (ickn.org). For academic purposes, (p.70) the developers allow the download of Condor, a convenient program allowing the use of data from emails, calendar, Skype, Facebook, Twitter, Wikipedia, and other sources, including user data.

MediaCloud

Media Cloud (mediacloud.org) is a simple yet powerful open source platform developed by the MIT Center for Civic Media and the Berkman-Klein Center for Internet and Society at Harvard University. It allows the study of media ecosystems and tracks how stories and ideas are shared through online media, by analyzing millions of publications.

3.2 Qualitative Research

Qualitative social science research has often been the victim of its own success: it consists of a plethora of approaches, methods, tools, which despite bearing the same name offer diverging, or even conflicting, concepts.

In qualitative research of information systems (Sarker, Xiao, & Beaulieu, 2013), like social research, within one single typology, there are five streams of research (Ciesielska & Jemielniak, 2018; Creswell & Poth, 2017):

  • hermeneutics,

  • case study,

  • grounded theory,

  • ethnography,

  • narrative research.

Despite some similarities, they have considerable differences in the practice of field research and in the collection of material. For instance, the goal of research based on grounded theory is to construct a theory that derives from qualitative data (Charmaz, 2014). For this reason, grounded theory research will be conducted differently from ethnographic research, even when the topic is the same (Prus, 1996). This will be true even if the analytical tools are similar or identical, as the philosophical assumptions as to the relation of the field to the (p.71) researcher will differ (Konecki, 2008b). This is because ethnography is not designed to generate theories (Hammersley, 1990) and generalizations (Payne & Williams, 2005), but rather bases its external validation on thick description. Such discrepancies between offline and online research are consistently similar, both when employing grounded theory and ethnography. The former, in addition, makes a good choice for the use of online data.

In this monograph, I do not consider the nuances of qualitative research. Unlike with computational social sciences, founded on Big Data, whose canon is still in the making, the qualitative social sciences come with excellently described and developed methodologies. The goals of this book are only to present the set of new methods that fit online social research well, and to outline the differences between online and more traditional research methodologies. Those interested in the diversity of qualitative research or the use of grounded theory (Charmaz, Komorowska, & Konecki, 2013) will benefit from the existing literature (Ciesielska & Jemielniak, 2018; Flick, 2014; Hammersley & Atkinson, 1995).

For these reasons, this book will be devoted only to the qualitative approaches that I have used. I will describe my experience with the digital ethnography I used in researching Wikipedia. I will also draw the readers’ attention to specific differences related to case study, narrative analysis (Jemielniak, 2008a, 2008b), and interviews, assuming that these remarks will be sufficient to mark the most important differences between online and traditional social research. I will also describe the issues related to social studies of Internet-based culture.

3.2.1 Digital Ethnography

The subject literature contains many similar terms pertaining to online anthropological research (Jemielniak, 2013). Some are interchangeable and others are not. Each approach has its own way of referencing ethnographic tradition (Domínguez Figaredo et al., 2007). First, there is a skillful wordplay, referring to “netnographic” studies (Kozinets, 2002, 2010). Even though this name fits Internet ethnography well, it has been usurped by a marketing research tool with a very weak link to ethnography, as it is not based on immersion in a culture.

(p.72) The opposite pole is occupied by connective ethnography (Leander & McKim, 2003). It is based on researching a community by combining online and offline analyses, in long-term field studies, and with the use of social network analysis (Dirksen, Huizing, & Smit, 2010). In a similar context, cyberethnography is referenced (Rybas & Gajjala, 2007) as underlining the fact that the actual people behind online messages also take part in offline life, as opposed to virtual ethnography, which may focus on online communities at the cost of a vital social context. As Miller and Slater observe: “ethnography means a long-term involvement amongst people, through a variety of methods, such that any one aspect of their life can be properly contextualized in others” (D. Miller & Slater, 2001, pp. 21–2)—which would suggest that a purely virtual study cannot make valid claims. Even though this approach has some merit, we need to remember that ethnographic research of only online communities is also both possible and reasonable. Moreover, even the conviction that through physical proximity we are able to grasp each aspect of life seems pure fiction.

Above all, it is far from clear that the term “virtual ethnography” excludes conducting real-life research or complementing online research with interviews (Hancock, Crain-Dorough, Parton, & Oescher, 2010; Hine, 2000). In light of the multitude of terms, including “Internet ethnography” (Sade-Beck, 2008), “virtual space ethnography” (Guitton, 2012), or “online ethnography” (Markham, 2008), it seems worthwhile to stick to those that are most familiar. In mid-2019, “virtual ethnography” returned over 9000 hits in English in Google Scholar, “digital ethnography” almost 5000, and “networked ethnography” a mere 65. “Internet ethnography” barely crossed 1000 hits, and “cyberethnography” returned a little more than 600. Taking into account the popularity of digital humanities programs, I think that “digital ethnography” (Murthy, 2008; Underberg & Zorn, 2013) is a safe term for online ethnographic studies that also encompass the offline context (interviews, meetings, field research offline); “virtual ethnography” ought to be reserved for just online research.

Similar categorizations can be applied to digital and virtual sociology. We need to remember that the way the analyses are used is also important: we can imagine research which is conducted with both the traditional methods and online analysis but which will bear (p.73) no traits of digital social sciences simply because the starting point and the goal of the research will be the understanding of real-life communities whose online presence will be merely additional. A list of the most common online research terms and their definitions is provided in Table 3.2.

Table 3.2 Online research terms

Internet sociology: A notion related to both the researching of online communities and to the study of Internet users, as well as products of online culture or human-bot interactions (see Introduction)

Networked sociology: A notion describing online and offline community research, with the possible additional use of quantitative tools

Digital sociology: A notion describing online community research, with the possible additional use of traditional research methods (such as interviews, observations)

Cybersociology: An older notion, replaced by “digital sociology” (Lupton, 2012; Rybas & Gajjala, 2007), also suggesting online research supplemented with offline analysis

Virtual sociology: A notion defining the research of online communities only in their online context (i.e. research of avatars, including bots)

Netnography: A marketing research method based on virtual simplified qualitative analysis, not connected with ethnographic research.

In reference to both digital and virtual ethnography, the goal is to tell a good story that explains social reality, involving the readers in the everyday world and understanding the research subjects, based on long-term field studies as with traditional ethnography (Whyte, 1943/2012).

Some researchers claim that it is pointless to seek similarities between digital and classical ethnography (E. A. Buchanan, 2004). This claim does not hold water, though. Online communities are not less complex in their interactions than their “real” counterparts—in addition, it would be hard to define a prototypical, real community (Paccagnella, 1997). A belief that we are facing something methodologically and subjectively new is nothing more than placing older research methods in a privileged position, which can result from researchers having more experience with immersion in different brick-and-mortar cultures. Such immersion is typical of digital ethnography, where researchers immerse themselves in the culture they study; as (p.74) with classical ethnography, it requires long-term participation in and deep engagement with the field. Similar criticisms were raised in the 1950s against organizational ethnography research in industrialized countries as not alien enough, not sufficiently exotic, or unfit for typical anthropological work (Gaggiotti, Kostera, & Krzyworzeka, 2016; Warner & Low, 1947). Ethnographies of the virtual worlds are simply ethnographies (Randall, Harper, & Rouncefield, 2007). There is no sense in arbitrarily separating the digital from the non-digital forms of human activity and labeling them “radically different” (Ruhleder, 2000). As Hine remarks (Hine, 2000, p. 65):

All forms of interaction are ethnographically valid, not just the face to face. The shaping of the ethnographic object as it is made possible by the available technologies is the ethnography. This is ethnography, in, of, and through the virtual.

Naturally, this does not mean digital ethnography does not require some modification of research tools (Nocera, 2002) but we may assume that it uses the prevailing theoretical and methodological frameworks to cater to a specific new research field. “Qualitative researchers who have thought carefully about internet ethnography accept that it should be employed and understood as part of a commitment to existing theoretical traditions” (Travers, 2009, p. 172). For this book, I will focus on the main discrepancies between traditional and online research. Therefore, I make general remarks on the issues related to ethnographic research while encouraging the readers to seek better and deeper descriptions (Atkinson, Coffey, Delamont, Lofland, & Lofland, 2001; Clifford & Marcus, 1986; Hammersley & Atkinson, 1995).

One of the best-known ethnographers, Geertz, wrote about ethnography (Geertz, 1973/2000, p. 6):

In anthropology, or anyway social anthropology, what the practitioners do is ethnography. And it is in understanding what ethnography is, or more exactly what doing ethnography is, that a start can be made toward grasping what anthropological analysis amounts to as a form of knowledge. From one point of view, that of the textbook, doing ethnography is establishing rapport, selecting informants, transcribing texts, taking genealogies, mapping fields, keeping a diary, and so on. But it is not these things, techniques and received procedures that (p.75) define the enterprise. What defines it is the kind of intellectual effort it is: an elaborate venture in, to borrow a notion from Gilbert Ryle, “thick description.”

The nature of ethnography is, therefore, not a set of tools but an “anthropological frame of mind” (Czarniawska-Joerges, 1992), thanks to which we can use our own reflexivity. It is being “a professional stranger” (Agar, 1980) to create a description that will enable an understanding of the local perspective. We offer a description of the researched culture in a way that gives the readers a feeling of co-participation in the discovery and comprehension of that culture, inviting the readers into the process of interpretation and creating an impression of independent immersion in the described reality (Clifford, 1983). The goal of ethnography is not an objective description of the reality but rather one interpretation, rooted in a reliable reflection of what the researcher considers important in the reality’s hierarchies of domination, power relations, interests, or prejudices (Lichterman, 2017).

Anthropology used to be dominated by an approach typical of the natural sciences: the goal was to be completely impartial and dispassionate, as this was supposed to lead to objective reflections. It was thought possible to see the social world without constantly assigning meanings (Clifford & Marcus, 1986; Weick, 1969/1979).

This functional perception is anachronistic. It is clear that the researcher is an “interpretive lens”: researchers filter and interpret the observed social reality through their experience, knowledge, history, sensitivity, tastes, prejudices, and preferences, constantly negotiating and making sense of it (C. I. Gerstl-Pepin & Gunzenhauser, 2002). They reach under the cover of the social construction of reality (P. L. Berger & Luckman, 1967) without taking part in it. The belief in being able to get a neutral vision is an illusion. It may be a trick, but it interferes with reliable research (Golden-Biddle & Locke, 1997). The goal of ethnography is rather to present a subjective interpretation that will improve and expand our understanding of the world. We tell a credible, authentic, academically rigorous story. The interpretation that is created in the process, naturally, needs to make sense to the researcher, but instead of striving for objectivity, which is impossible, it is fair to present and advance the researcher’s starting position, privileges, and perspectives (p.76) (Haraway, 1988). This does not result in complete relativism but rather in awareness of intersubjectivity (Feinberg, 2007; Madden, 2017) and relying on this awareness as an ethnographic assumption (A. Gillespie & Cornish, 2010).

Ethnographers become academic tools—after absorbing the understanding of the local culture, with the exercise of proper care and diligence in reliable reporting of the perspective of the research subjects, they create interpretations whose main value is the better understanding of the cultures under examination (C. Gerstl-Pepin & Patrizio, 2009). The same researcher may interpret their studies in different ways—Wolf offers three takes on the same observation, and presents three genres of anthropological inquiry, separate in concept and in time from one another; the result is a fourth narrative on the role of an ethnographer (M. Wolf, 1992). For these reasons, attempts at playing a completely detached person and conducting a transparent narrative which excludes the author are de facto detrimental to the final result (Charmaz & Mitchell, 1996), by impoverishing it and stripping it of the key advantages of ethnography. The researcher is a fixture of ethnographic practice; attempts at ignoring and masking their influence on the process destroy the outcomes.

In some streams of anthropology, it is said that researchers should not, even in their own narrative, assume a privileged position, and that doing ethnography benefits when the differences in power and access to voicing one’s opinions are reduced (Lassiter, 2001). Such beliefs are especially strong in the areas of anthropology that are associated with action research (Greenwood, González Santos, & Cantón, 1991; Jemielniak, 2006), where the belief that science serves to describe reality in an impartial and uninvolved way is rejected (Strumińska-Kutra, 2016). It is also characteristic of collaborative ethnography (Pietrowiak, 2014).

Because in ethnography it is vital to minimize the influence of one’s assumptions and stereotypes, it is best started with as few preconceptualizations as possible. The culture is often treated performatively—the researcher assumes that it may not be possible to fit the culture into a standard theoretical model but rather that the unique model of the culture will reveal itself in the course of the research (Latour, 1986). Therefore, the characteristic element of ethnography is not to begin (p.77) with hypotheses but with questions. During the study, the researchers accept what they see, although they wonder at everything and strive to understand even the simplest of things (Fetterman, 2009). This is the “anthropological frame of mind.”

Actual research tools in ethnographic studies are secondary in character. There is a canon but nothing can stop researchers from doing ethnography with the use of less standard methods—or even, on occasion, with elements of quantitative studies, for instance to conduct a pilot study for a later list of research questions. The hallmark of ethnographic studies is the process of longitudinal immersion, attempting to understand the local logic. The most frequently used research tools and methods are observations (participative and non-participative), field notes, interviews, narrative and discourse analysis (Denzin & Lincoln, 1994) and analysis of photos, videos, or cultural artifacts.

The vast majority of ethnographic studies are based on participative observation (Ingold, 2014). Observations may be used to do case analyses, or to understand the event, or simply for wider analysis of how the researched community actually works. Observations are accompanied by a research log, or field notes—it is an essential element of ethnographic work, since it allows for more reflexivity, returning to previous interpretations, externalization of one’s doubts (Emerson, Fretz, & Shaw, 2011; Sanjek, 1990), and a better understanding of one’s own limitations and starting assumptions, including the status of power or private prejudices (Alvesson & Sköldberg, 2017). Without field notes, a researcher claiming to do ethnography may soon appear to be little more than a tourist, telling vacation anecdotes.

Ethnographic research also often uses qualitative interviews to complement observation data, allowing an interpretation voiced by participants in the local culture. Yet another useful supplementary method is narrative analysis. This chapter will be devoted to ethnography, and I will add some remarks on case analysis, narrative analysis, and interviews in the context of online community research.

I will describe the main differences between digital and traditional ethnography. The most important is that, in the virtual world, the subject of research is avatars, not people (R. Schroeder & Axelsson, 2006). Granted, most Internet traffic has migrated to Facebook, Instagram, or (p.78) LinkedIn—on those platforms users often are found under their real names, but other online services such as Twitter, Wikipedia, or internet forums allow users to create “personae,” avatars acting under a pseudonym and having their own style. When researching the behavior of avatars, we must keep in mind that one person may have several avatars, or several people may manage the same one. Moreover, a growing number of avatars are bots—accounts managed by algorithms, often without human supervision (Lokot & Diakopoulos, 2016). We therefore know both very little and a great deal about the people behind the avatars (Golder & Macy, 2014). We know a great deal because we may research their utterances, preferences, and interests. We know very little because we often lack the most basic demographic and geographical data. It is not rare to use different genders online and offline (Pearce & Artemesia, 2009). We may infer the geographical location of users just from the analysis of their network of friends (Compton, Jurgens, & Allen, 2014); however, the apparently simple task of differentiating utterances of actual people from those of bots is actually complicated (Clark et al., 2016), and acquiring potentially identifying data may be ethically problematic.

What practical problems does such differentiation present? In my Wikipedia research, I often encountered the problem of “sock puppetry,”10 where a person establishes several user accounts to give the illusion that an idea is more widespread than it actually is. People who want an entry to appear on Wikipedia or want an editing principle to be changed will resort to sock puppetry. As a result, we may observe a “discussion” among several avatars who are really controlled by the same person. The problem is so common that there is a special group of high-trust functionaries, “checkusers,” who have the tools to detect this kind of fraud and may access individual users’ private data, like IP addresses or browser versions. Wikipedia administrators strictly control this access and do not use it arbitrarily. As a former checkuser, I know that sock puppetry is quite widespread, even though I encountered it only when other users reported suspected sock puppets. Therefore, it is only logical to assume that many more careful perpetrators have gone unnoticed. The practical conclusion for digital ethnographers is that we need to (p.79) differentiate the research of avatars from that of real people. At the same time, in many virtual communities people have an emotional bond with their avatars (Wolfendale, 2007). It is a good practice first to analyze purely virtual data and then contact selected user accounts with a request for a video conversation. Only then can we be sure that our research reaches actual people.

Digital ethnography is also different when it comes to “going native.” In digital social sciences, we may get deeply involved as equals in the researched community. Classical anthropological research rather suggests the role of “marginal natives” (Lobo, 1990; Walsh, 2004) and “professional strangers” (Agar, 1980) and distancing of the researcher from their own culture (Leach, 1982; Narayan, 1993). It is not an absolute rule (Sperschneider & Bagger, 2003; Van Maanen, 1988/2011) and researchers are warned against identifying with the research subjects to a degree that would result in the loss of research perspective (Robson, 2002), but for purely practical reasons, researchers find it problematic to go native, as their outsider status is immediately visible. Still, even though the dream of being “the chameleon field-worker, perfectly self-tuned to his exotic surroundings” (Geertz, 1974, p. 27) in traditional anthropological studies is illusory, in the case of online research it becomes quite viable. Digital natives are not born; each member joined the community as a stranger. What is most valuable in digital ethnography is the experience of going native and slowly coming to understand a community from within (Gatson & Zweerink, 2004). Naturally, this experience usually leads to adopting the logic of the researched community, but it is a fair price for reaching otherwise hermetic knowledge. As a result, unlike traditional ethnographic studies, digital ethnographies much more often have an autoethnographic character (Denzin, 2006; Kamińska, 2014; Rheingold, 1994). Observational research of this type may lead to the temptation of not informing the studied community of our role and goal of participating in the group. It is also a typical problem of offline research (Konecki, 1990) which has a direct impact on the possibility of giving informed consent to participating in the study. I will address this point in the chapter on research ethics. In specific circumstances, research conducted in concealment may be justifiable; however, experience has taught me that the best policy is to declare on my profile that I am an academic and not to conceal my willingness to conduct ethnographic (p.80) research. Moreover, the division that Prensky (2001) makes between “digital natives” and “digital immigrants” has been criticized as inadequate and replaced with the notions of “visitors” and “residents” (White & Le Cornu, 2011).

With the theoretical ease of going native comes the illusion of being able to permeate every community. The illusion arises because one does not need to be born in a given place, belong to a given race, or use a given language. In addition, the flexibility of self-presentation is related to those who conduct research and who have the freedom to manage and present their identity (P. Miller, 2012). Still, fluency in the local cultural code is a prerequisite. The situation is similar to attempts at permeating a fandom. Someone who would like to present as a Bronie (Literat, 2017),11 an adult fan of the cartoon My Little Pony, or as a Trekker, a fan of the Star Trek franchise, would need to know the series really well, should they want to have any credibility within the community. It is similarly hard to gain entry into a motorcycle gang, although it provides excellent chances for fantastic qualitative research (D. R. Wolf, 1991). In fact, the understanding of any organization’s culture requires immersion into that organization.

It is no different with online communities which are similarly characterized by “deep diversity” of culture and context (English-Lueck, 2011). As Hine notes, “although the Internet feels like familiar territory for many of the people we study, it can seem quite strange and dangerous territory for a qualitative researcher” (Hine, 2013, p. 2). For Wikipedia users to consider me a true Wikipedian, I needed to perform tens of thousands of edits, discover where the important discussions were held, and learn the jargon to understand seemingly simple lines like “fails to meet WP:NOTE, but no SD needed, submit to RfD rather” (an article subject is not notable enough but needs to be discussed because it does not fit the criteria for immediate deletion). Granted, because in many communities the records of all the discussions are easily accessible, we may consider enculturation a waste of time. It is definitely the opposite—it is enculturation that gives sense to the analysis of events. Only “insiderness [can be considered] as the key to delving into the hidden crevices of the organization” (p.81) (Labaree, 2002, p. 98). One can imagine para-ethnographic research based on available data, without learning the social dynamics of a community through participation. However, in the radical overflow of information, people with no deep understanding of the community, or at least a trusted guide, will be incapable of understanding what they see and where to look. For instance, external researchers of Wikipedia and new users of the project sometimes have a perception that the community is deeply conflicted. They may have this impression because Wikipedian culture radically rejects hierarchy and lacks the fear of superiors, typical of mainstream organizations, thus encouraging the expression of objections and doubts whenever one holds a different point of view (Jemielniak, 2016b). This leads, naturally, to situations where many people within Wikipedia or other Wikimedia projects are willing to express their opinions, radical points of view, or ask questions publicly, expecting answers, regardless of what role within the Wikimedia Foundation or a social movement is played by their debaters, simply because of the a-hierarchical ideology (Jemielniak, 2015). A lack of awareness of this fact may become a serious impediment to the interpretation of the research.

A problem that is akin to going native is the lack of barrier between the field and home. In traditional ethnographic research, this border is defined by “going into the field,” which is separated from home in space and time. In online research, the differentiation of time and place of research proves difficult. It is hard to separate taking field notes and deliberating on the material from using the Internet for personal or other professional purposes. The problem is similar to those classical ethnographies where the boundaries of work and life are blurred (McLean, 2007). It has serious consequences because it complicates a core component of the ethnographic method: researcher reflexivity (C. A. Davies, 2008). This reflexivity is largely ritualized, because within ethnographic research the way of speaking about one’s doubts, failures, or shyness is rather conventional (Jemielniak & Kostera, 2010; S. Scott, Hinton-Smith, Härmä, & Broome, 2012). There are no good solutions to this problem, although one may imagine symbolic measures like designating separate computers, one for research and one for personal use. In any case, what matters is knowing one’s role.

It cannot go unnoticed that in digital ethnography, we observe a significantly different issue of being in the field (Rutter & Smith, 2005). (p.82) “Being on site” is a typical differentiating factor of classical ethnographic field research. Anthropology, in addition, is based on experiencing the researched cultures with all the senses (Bendix, 2005). The physical removal from home, travel, long-term relocation into a new environment, and round-the-clock participation are undoubtedly factors that influence the researcher’s state of mind. As already stated, it is a key element of the ethnographic interpretation machine, so such an important change needs to be taken into account. If we cannot be on site during our digital ethnographic research, as with traditional or organizational ethnography, we need to replace fixed physical co-presence with long hours of virtual participation and the development of competences related to transmitting and receiving text-based and visual messages (Garcia, Standlee, Bechkoff, & Cui, 2009). As with modern organizational ethnographies, it seems clear that conducting field research without maintaining physical co-presence and shared spatial experience is possible—especially as the participants of the researched communities act similarly within those communities (Burrell, 2009).

Digital ethnography is also different in the character of its interactions. In virtual communities, these are very often asynchronous; not everyone participates in the same discussion at the same time. Depending on the community and the topic, avatars may exchange comments almost synchronously. This is typical of heated discussions on forums, Facebook groups, Twitter, but also on Wikipedia, if the sides of the discussion are deeply involved. In some Internet forums or Wikipedia discussions, however, it is not uncommon for questions or comments to receive an answer after weeks, months, or even years. One needs to be aware of that issue because it shapes the dynamics of discussion. Although messages may resemble exchanges in a regular conversation, they are very different from this mode of communication (Ong, 2002). This mode is the result of interlocutors’ awareness that they do not actually talk to one another; apart from the conversation, they participate in a collective process of establishing a public dialogic narrative or building a knowledge base. In this sense, we may speak of a new form of interpersonal interaction: a “monodialogue.” A monodialogue is a conversation in which the recipient is primarily not the person that we are responding to; the recipient of our reply may never even learn that there is a response.

(p.83) Monodialogism directly influences different methods of observations (Garcia et al., 2009). We observe avatars instead of people (R. Schroeder & Axelsson, 2006; Williams, 2007). And because in many communities we have access to huge archives of older discussions, we may mistakenly assume that there is no difference in observing real-time interactions versus historical research. This is definitely not true. If we observe interactions in real time, we gain insights into the dynamics of the exchange. This awareness is not as easily obtained by a mere analysis of the recorded time of each utterance, although our ability to timestamp each utterance is indeed convenient. With historical analysis, we also lose the context of the current reactions of the community. We must remember that the more important discussions, controversies, and conflicts usually echo in other community-typical communication channels. For instance, typical Wikipedia discussions result in comments on Wikipedia groups on Facebook, mailing lists, IRC channels, and within private messages on different messaging applications. This makes it virtually impossible to recreate all of these comments after the fact. Moreover, wiki technology allows for the insertion of new messages without maintaining a linear flow of the text; one can insert a later message higher on the page to address a specific earlier fragment of communication. Because of this non-linearity, the recreation of the dynamics of a discussion is more time-consuming, even though timestamps and easy access to all versions of a page theoretically enable the reader to follow the chronology, unlike with some other platforms. Finally, having all interactions written down is a great benefit. This does not mean, however, that the researcher is absolved of the need to keep a research diary. Making notes and writing down reflections will launch the interpretive apparatus in the researcher’s mind. Relying purely on archival quotes strips the ethnographic research of one of its greatest advantages—the iterative return to the same observations and events, and the assigning of meanings to them (Weick, 1969/1979). A research journal also allows for more honesty—if ethnography, as a final text, is a narrative, written from a perspective (M. Wolf, 1992), the diary creates a safety valve, where doubts can be aired, where thoughts that we will not necessarily share in the final text will be sketched.
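The remark that timestamps theoretically allow the chronology of a non-linear wiki discussion to be followed can be illustrated with a minimal sketch. It assumes that each comment ends with a signature timestamp in the form commonly seen on English-language Wikipedia talk pages (for example "10:15, 3 March 2019 (UTC)"); the sample comments, the function names, and the pattern are hypothetical illustrations, not a tool used in the research described here. The sketch only recovers the order of utterances; it does nothing to restore the real-time context discussed above.

```python
import re
from datetime import datetime

# English month names mapped to numbers, to avoid locale-dependent parsing.
MONTHS = {name: number for number, name in enumerate(
    ["January", "February", "March", "April", "May", "June", "July",
     "August", "September", "October", "November", "December"], start=1)}

# Assumed signature format: "HH:MM, D Month YYYY (UTC)".
SIGNATURE = re.compile(
    r"(\d{2}):(\d{2}), (\d{1,2}) ([A-Z][a-z]+) (\d{4}) \(UTC\)")


def parse_timestamp(comment):
    """Return the signature time of a comment, or None if none is found."""
    match = SIGNATURE.search(comment)
    if match is None:
        return None
    hour, minute, day, month_name, year = match.groups()
    month = MONTHS.get(month_name)
    if month is None:
        return None
    return datetime(int(year), month, int(day), int(hour), int(minute))


def chronological(comments):
    """Return (time, page_position, text) tuples sorted by signature time."""
    dated = [(parse_timestamp(text), position, text)
             for position, text in enumerate(comments)]
    # Unsigned comments cannot be placed in time and are left out.
    return sorted(item for item in dated if item[0] is not None)


# Hypothetical talk page, listed in the order the comments appear on the page.
page_order = [
    "Is this subject notable? UserA 10:15, 3 March 2019 (UTC)",
    ": Inserted later, but higher on the page. UserC 08:30, 12 June 2019 (UTC)",
    ": I think so, see the sources. UserB 14:02, 3 March 2019 (UTC)",
]

for when, position, text in chronological(page_order):
    print(when.isoformat(), "page position", position, "|", text)
```

Comparing the page position with the chronological order shows which comments were inserted out of sequence, which is a starting point for, rather than a substitute for, interpreting the dynamics of the exchange.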

Furthermore, an important difference in the process of conducting online observation stems from the fact that, in some communities, one (p.84) may perform it without having a user account; that is, in a way hidden from the participants. In other cases, one must enter the virtual world on its own terms and accept the forms of presentation of self in the community through a standardized avatar (Pearce & Artemesia, 2009). In turn, unlike with traditional ethnography, during virtual observations it is much more difficult to trace communication between the observed individuals—it is commonplace in virtual communities to use different communication channels and to conduct discussions in the main thread simultaneously with “social life” discussions (Ducheneaut et al., 2010).

Digital ethnography, to a greater degree than its traditional counterpart, relies on being multi-sited (Marcus, 1995). In this context, it means concurrent research of more than one online community or a combination of online and offline research. This is a consequence of the nature of online communities: they often intermingle and overlap with other online and offline gatherings.

An important difference is that the digital ethnographer is in a less privileged position than the traditional ethnographer. The power of narration and control over communication is an issue that anthropologists have long recognized as important (Fine, 1993). In traditional ethnography, however, we normally are alone in the field or in a team that will later publish observations that are collectively agreed upon, which, on a side note, is a strategy that is best chosen in an informed way, taking into account the pros and cons of ethnographic teamwork (Clerke & Hopwood, 2014). In digital ethnography, however, we never know whether we are crossing paths with other researchers who are analyzing the same events and utterances at the same time, or who may even be treating us as their research subject. In an extreme and purely hypothetical situation, we may envision a community of researchers studying one another, all with the mistaken impression that they are immersed in the local culture. Moreover, it is much simpler for others to verify our observations and thoughts. Unlike in traditional ethnography, where we may assume that the researcher creates the image of the community at a given moment which is inaccessible to others, and has full power over the narration, it is possible to confront the same data in digital ethnography even years later. Also, many online communities, perhaps because they work constantly with the written word, create their own meta-analyses of their culture, mythologies, and (p.85) histories, and are very protective of their monopoly on such artifacts. Inclusion of such native ethnographies into the academic circulation, in some form or another, remains an open issue.

Another characteristic of digital ethnography is the high interculturality of participants and the low homogeneity of the researched group, in contrast to more traditional communities. Usually, the hub of the community is a single common element, such as interests, common projects, or skills in using the same tool. Because of this element, the processes of enculturation and standardization of social norms are less formative and have less influence on the participants.

In digital ethnography, we also observe different social stigmas. Many of the traditional ways of stigmatizing in offline communities are based on race, age, or physical disabilities; but these are easier to mask online. Markedly, gender remains an important category of avatar classification, although hard to identify with certainty. Men dominate many online communities, which discriminates against women or discourages their equal participation. Still, in many other ways the Internet is egalitarian. As the popular 1993 New Yorker cartoon declares, “On the Internet, nobody knows you’re a dog.” On the one hand, everyone with Internet access may present more casually than face to face and in agreement with their tastes, at least in theory, with no demographic or material limitations. On the other hand, online communities are susceptible to other types of social stratification and identity construction (Ward, 2017). During a meeting in a loud disco, appearances, clothes, and body language play major roles; in online communities language competence plays first violin. Vocabulary range, using the lingo of the community, frequency and adequacy of the use of emoticons, and even typing speed may lead to strong assessments of an avatar.

Finally, although this trait is not typical of just online communities, it may be more difficult to address in the online context: ways of building one’s status within the community may differ from our expectations. In traditional business organizations, there are fixed and relatively similar ways of playing out value and dedication, taking into account hierarchy, in addition to access to resources, money, and time,12 but in (p.86) virtual communities their unambiguous identification without being immersed in the field may become problematic. It is similar to traditional anthropological studies, but with online research, we are faced with the feeling that differences from our habits are negligible, which makes it more difficult to comprehend the situation. For example, in the Wikipedia community, in narratives of who is and is not valuable, the merit of writing well-developed encyclopedic articles arises more frequently than merely participating in discussions on the procedures and the whole bureaucracy of the project. However, an analysis of users who are elected the project’s functionaries shows that they are almost always involved with the administration of the project, not only in content creation. Para-organizational structures are often bureaucratic and solidify the status quo (Konieczny, 2009; Shaw & Hill, 2014). Additionally, the actual quality of the articles is often of less importance than the mere number of edits: users becoming administrators on Polish Wikipedia usually have more than 2000 edits, and on English Wikipedia the count runs as high as 10,000. Editcountitis, the obsession with the number of edits one has, is a serious problem within the community (Jemielniak, 2014). Inside Wikipedia, “one’s edit count is a sort of coin of the realm” (Reagle, 2010, p. 157).
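
For readers who wish to see how such edit counts can be retrieved, the sketch below queries the public MediaWiki Action API (`action=query` with `list=users` and `usprop=editcount`). It is a minimal illustration under stated assumptions rather than the procedure used in the cited studies: the function names, the User-Agent string, and the example usernames are hypothetical, and the endpoint shown is that of English Wikipedia.

```python
import json
from urllib.parse import urlencode
from urllib.request import Request, urlopen

# Assumption: the English Wikipedia endpoint; other wikis use their own host.
API = "https://en.wikipedia.org/w/api.php"


def editcount_url(usernames, api=API):
    """Build a MediaWiki API URL asking for the edit counts of `usernames`."""
    params = {
        "action": "query",
        "list": "users",
        "ususers": "|".join(usernames),  # multiple values are pipe-separated
        "usprop": "editcount",
        "format": "json",
    }
    return api + "?" + urlencode(params)


def parse_editcounts(payload):
    """Map user name to edit count; users without a count map to None."""
    users = payload.get("query", {}).get("users", [])
    return {user["name"]: user.get("editcount") for user in users}


def fetch_editcounts(usernames, api=API):
    """Perform the request (requires network access)."""
    request = Request(
        editcount_url(usernames, api),
        # Hypothetical identifier; a descriptive User-Agent is good practice.
        headers={"User-Agent": "edit-count-sketch/0.1 (research example)"},
    )
    with urlopen(request) as response:
        return parse_editcounts(json.load(response))


# Example call (not executed here): fetch_editcounts(["ExampleUserA", "ExampleUserB"])
```

The number obtained this way is exactly the kind of crude indicator that feeds editcountitis. It says nothing about the quality of contributions and should be read together with qualitative insight into the community.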

3.2.2 Case Study

The case study is a standard method of qualitative research, typical in studies of organizational change, when we may focus not only on a specific community but on the flow of an event. The case study is often used in ethnographic studies and for this reason, remarks and reservations from Chapter 2 also apply here. However, the goal of ethnography is to understand the cultural context and the local logic of a community as such, while case studies focus on the description and explanation of a specific event. This may be one reason why case study is perceived as easier than ethnography, as it does not require long-term environmental acculturation. It may be misunderstood as an “easy” way to do (p.87) qualitative and pseudo-qualitative studies—without using the full potential of thick qualitative interpretation and at the same time not having the advantages of clearly defined quantitative requirements.

The method is widespread both among researchers from the post-positivist tradition, for whom it allows the drawing of generalizations, and among academics associated with the interpretive tradition who are attempting to understand the logic of the situation in the local understanding (Hassard & Kelemen, 2010). Because the latter approach is closer to my practice, I will draw the readers’ attention to the specifics of this kind of case study, based on online data, in the scope which supplements the remarks from the section on digital ethnography and assuming that case study also requires a deep understanding of the researched culture.

The idea of case analysis is therefore to comprehensively understand the specific social situation (Stake, 2005), which leads to knowledge situated in local context (Flyvbjerg, 2006). It is an issue of tracing the starting situation, reasons, course, and results of an event or a transformation. It may be an organizational or cultural change, social trend, or an event that visualizes important aspects of the phenomenon of our research interest. Extreme cases work quite well with case studies, as they more accurately visualize processes within the researched community (Eisenhardt, 1989). Therefore, some suggest focusing on extreme situations, and paying particular attention to critical incidents and social dramas (Pettigrew, 1990). In case analysis, we may use all tools at our disposal, such as questionnaires, SNA, interviews, sentiment analysis, observations, or all kinds of secondary data analysis. The differentiating factor of the method is the goal: the explanation of a peculiar or characteristic event or transformation.

During my Wikipedia research, I engaged in many debates. One of my observations was that the social structure of Wikipedia channeled interpersonal conflicts towards cooperation, through the set of rules. Thanks to the combination of this mechanism with the escalation of involvement and the low entry barrier to content creation, Wikipedia uses the motivation of those who want to prove they are right to create the world’s largest encyclopedia.

In my book (Jemielniak, 2014) I used the analysis of a few cases to exemplify Wikipedia-characteristic processes which had been (p.88) instrumental in the development of the communities, and in which I was not directly involved. It was a historical analysis. Unlike with case studies done offline, on Wikipedia all interactions are archived. Naturally, all my remarks related to observations from Chapter 2 also apply here. For someone not familiar with the community, even the selection of cases which could be considered especially important or symbolic would be difficult: we need to remember that community discussions are millions of pages and tens of thousands of words long. Nevertheless, the possibility of tracing, step-by-step, the flow of a discussion which I considered momentous made the study easier.

One of the cases I analyzed was the “Battle of Danzig”: an argument on English Wikipedia over whether the article describing the city ought to be titled with the Polish “Gdańsk” or the German “Danzig.” The case was very old, as the conflict had run from 2001 to 2005, a decade before I took up my research. It could seem that, especially in light of rapid changes on the Internet, such old stories from a community have no value today. However, this conflict, which even the community considers one of the lamest edit wars ever,13 shaped later community regulations and revealed distinct processes that are still observable. The edit war, which did have a substantial background but grew way out of proportion, is still a living thing among Wikipedia veterans, and similar debates still arise, such as whether the Ganges River should be rendered as “Ganga” (the English name of the river among native English speakers from India) or whether Mexico has an official language. Reaching for a historic event made it possible for me to better describe and analyze these two cases; without it, I would not have been able to contextualize the dynamics of the discussion, the references, and the developed rules of reaching consensus. As I was not personally involved in the discussion, I was able to distance myself from it and thus describe the increase of involvement, emotions, and even paranoia on both sides of the barricade.

Yet another possible approach to case studies is to draw purposefully from personal experience. For one of my articles, I described a debate in which I was personally and emotionally involved (Jemielniak, 2016a). I did a case study of controversial edits in the Wikipedia article “Glass ceiling,” where I reacted as a participant, trying to remove a section of (p.89) the article which I considered sexist but which still cited a verifiable source and as such did not fall under the rules of expedited removal. My goal was not complete objectivity. Just the opposite, personal experience and going back to thoughts and feelings that accompanied me in the debate, and which were definitely purely subjective, brought the added value of being able to look into the perception and reactions of an expert Wikipedian. I showed how quickly subject conflicts can arise and how difficult it is, especially for people not accustomed to the rules of Wikipedia, to abstain from a move that would result in their accounts being blocked, regardless of merit.

A short autoethnographic case analysis showed that the extensive bureaucratic ruleset of Wikipedia, as well as lack of skills in reacting to pseudo-academic reasoning, not to mention personal attacks, could easily deter women from editing Wikipedia. Making references to my own emotions and reactions made it easier for me to show how difficult, regardless of one’s experience, it is to keep composed and react calmly in online discussions with people who are well-versed in the community regulations. In this case, personal experience and emotional involvement were therefore used as part of the method. In order, however, to make the best use of the elements of the autoethnographic perspective, reflexivity and a large dose of caution are advised (T. E. Adams & Ellis, 2016), as drawing from one’s own experience needs even more academic consideration, to keep the researcher from falling into the trap of describing their experience in a disorderly fashion, under the pretense of scientific method (Atkinson & Delamont, 2006). The researcher also needs to bear in mind that personal experience is also socially constructed and interpreted post-factum (J. W. Scott, 1991).

Trust is the bedrock of social relations; however, it is visible in different ways in different organizational forms (Latusek & Cook, 2012; Sztompka, 1999). For many online communities, trust in people, including close project associates (Latusek & Jemielniak, 2007, 2008), has been replaced by trust in procedures. When users know each other only in the virtual world, this is mainly the result of the users bearing in mind they hold discussions with avatars, and the identity of the debaters is fluid. Case analysis, as a method, causes the research of community procedures and rules to become valuable here, as well as the discussion surrounding the establishment of the procedures and rules. Even in communities which (p.90) seemingly do not construct complex rules, a major regulatory role is often played by the online platform (which makes specific forms of interaction and social signals possible) and the practice of their use. One example is the use of the period in texting and chat messaging in a way that signals reluctance to continue the interaction or that the message is less honest (Gunraj, Drumm-Hewitt, Dashow, Upadhyay, & Klin, 2016; Houghton, Upadhyay, & Klin, 2018). Similarly, detailed meaning can be assigned to the use of emoticons in specific contexts (Monica A. Riordan, 2017). For these reasons, to make sense of the studied cases, or even to be able to define the start and end of the cases, we need to be well-versed in the rules of the community or to use the services of experienced guides. In other words, if, regardless of our age, we are “digital immigrants” or “guests,” at least within the researched community, we need the support of the “natives,” vel “inhabitants” so that we may be able to tell which case is interesting, how to read through it, and what communication nuances are important (Monica A. Riordan, Kreuz, & Blair, 2018).

3.2.3 Online Interviews

The use of interviews to research online communities is, naturally, possible and useful (Salmons, 2012, 2014). They may be conducted in one of a few different formats, with each having its advantages and disadvantages (Kazmer & Xie, 2008): text-based chat, email or forum interviews, voice messenger interviews, videoconferences, and face-to-face interviews with the representatives of online communities.

Having conducted several interviews with the use of a text chat client, I cannot recommend this method. It appears very attractive because it avoids the tedious transcription of the interview. However, the answers I got were superficial, shorter, and the interviewees could not be convinced to provide longer narratives. It also took longer to establish trust, something that other researchers have confirmed (Shapka, Domene, Khan, & Yang, 2016). This was most likely the consequence of a few factors. The main reason is that most people speak more freely and more easily than they write. In addition, most people write more slowly than they speak. Moreover, the specifics of a text chat interaction (be it IRC, Slack, or Messenger) inspire shorter messages because, especially in (p.91) synchronous interactions, writing longer chunks of text forces the other person to wait for the message to be sent across, so the communication cannot be received on the fly. Even if in some conversations messages are split into single sentences, or even fragments, it can be hard to tell when the message has ended. Finally, chat-based interviews invite the interlocutors to multi-task. The temptation is simply too big; although the interview can be very important for the researcher, for the interviewee it may become simply one of several open tabs or windows, not necessarily a high-priority one. If interviewees are doing other things while giving their answers, it is hard for us to expect that their involvement in the study is high.

From this perspective, it might be safer to conduct the interviews via email (Meho, 2006); but this method comes with its own disadvantages, the main being the need to stick to the list of questions and the inability of ad hoc follow-up, which, of course, is not so much a problem with structured interviews (Al-Saggaf & Williamson, 2004; Gruber, Szmigin, Reppel, & Voss, 2008). The situation is similar with web forum interviews and para-focus forms (Ping & Chee, 2009). It is still worth remembering text-based chat interviews in especially sensitive cases, where visual contact may be an impediment to completing the study (Aupers, Schaap, & de Wildt, 2018; M. Davis, Bolding, Hart, Sherr, & Elford, 2004; Neville, Adams, & Cook, 2016). The same applies to researching people engaged in illegal activities (Barratt & Maddox, 2016). In such situations, the researcher needs to pay special attention to building trust and research relations, and to enlisting the involvement of the interviewee so as to negate the losses of the narrative’s saturation and richness (Hewson, 2016).

The problem of the casual character of messages, resulting from multitasking, is also characteristic of voice-based interviews with the use of voice messengers, but this approach does not require a separate description, as it is not much different from a phone interview. It is worth mentioning the benefits of using software that encrypts the communication, such as Signal, or technology that does not impose the need to install anything on our interviewee’s computer; Jitsi comes to mind as worth recommending. It is a communication platform, based on free/open source software which enables convenient voice- and videoconferences in the browser. Similar functions are also offered by Google Hangouts, AppearIn, Zoom or Bluejeans, but these are commercial projects.

(p.92) These tools are easy to use in video interviews (Deakin & Wakefield, 2014). Among all the ways of connecting remotely, this is the one that provides the best contact. It largely solves the problem of multi-tasking and enriches the interview with the possibility of reading facial expressions. Here, broadband connection is a must. Nothing will replace a live, face-to-face contact, though, as an important part of communication relies on body language and the direct reading thereof. We simply rely on the use of all the senses, and additionally, the building of trust and research relation is also based on co-experiencing the same reality, that is, reacting to the same changes in the environment. Video interviews have an advantage which needs to be addressed here though: they allow us to reach people whose location, when revealed, would put them in a risky position. From my experience in contacting interviewees who were in hiding because of their involvement with the free information movement, it may be the only way of accessing them. In such situations a video software interview could have immense advantages. It may also be the tool of choice for people working a lot and accustomed to corporate videoconferences. In video interviews, recording is also usually easier than with live ones: we have direct access to sound from two microphones, and environmental noise is usually lower than when meeting our interviewees in a public space. An obvious advantage is also the low time and financial costs involved in reaching the interviewee who may as well reside on the other side of the world (Lo Iacono, Symonds, & Brown, 2016).

The classical face-to-face interview is well suited to the research of communities that communicate mainly online. Participation in such communities usually allows determination of what conditions are the most beneficial for an interview. It is worth mentioning here that many online communities organize conferences, retreats, fandom meetings, or hackathons, which are interesting events, allowing us to observe various rituals performed by the participants (Zukin & Papadantonakis, 2017). Additionally, such events allow interviews with people who may not be directly involved in the community but provide its infrastructure, organization of local structures, or perform community-oriented commercial activities. It is also helpful that during one visit, we may conduct multiple interviews. Some of the most interesting interviews in my research of Wikimedia communities were conducted during the (p.93) Wikimania conferences, annual events organized for community members in different parts of the world. The disadvantage that one needs to keep in mind is the preselection of people; the profile of online community members, who are both willing and active enough within the community to want to visit an international event, is very specific. Also, many online communities pay some attention to the anonymity of their participants (McDonald, 2015). During Wikimanias, for instance, people who do not wish to appear on any visual materials from the conference wear ID badges with different color lanyards. Many also appear under online nicknames instead of their real names.

Regardless of the method used to conduct interviews, it is definitely advisable for the process to undergo reflection, and that the reflection is included in the final version of the paper (Sutton, 2011).

3.2.4 Narrative Analysis

Classical narrative analysis or inquiry is used in traditional research, mainly in references to texts, although recorded speech is on occasion treated as narrative, for instance during narrative interviews (Hollway & Jefferson, 2000). These, especially the way of playing the narrative interviews out in a conversation, are directly linked to storytelling (Boje, 2001, 2008, 2014). Social scientists have become interested in the topic as a result of the narrative turn (R. J. Berger & Quinney, 2005). This turn is based on an observation that people make sense of their understanding of the world through their stories, with a defined structure, heroes, turns of events, and they negotiate the stories in an intersubjective way (Gabriel, 2004). Making a narrative is the typical form of social life (MacIntyre, 1981), and a personalized story is more suggestive than statistics (De Wit, Das, & Vet, 2008), which is associated with the perceived crisis of hierarchy of knowledge, described in more detail earlier in this book.

The narrative approach also draws from literary research (Bakhtin, 1984; Barthes, 1977), shifting the focus more to the text as such, and to a lesser extent to the possible intents of the author or the events surrounding the creation of the work (Czarniawska, 2004). The subject of the analysis is the narration in itself, and the source of the material can (p.94) simply be the interviews. The intended purpose of the interviews is important: we search for recurring motifs, ways of constructing the story of oneself and of others, and of creating order in the world (Walzer & Oles, 2003). Persuasive strategies, the weight assigned to specific details, the order of events, the vocabulary, presentation of actors and their role, and the construction of one’s own image and identity all combine to make the important elements that undergo narrative analysis and which are more important than seeking material truths (Czarniawska-Joerges, 1994, 1998). The issue of the research lies in the focus on plot, based on a vision of the world, founded in a defined way within the presented chronology and with some assumptions as to the relations between events.

As Given (2006) remarks, the development of digital technologies has had a transformative impact on sociology, especially on narrative studies. The hallmark of many online communities is that most interactions are written and often archived. For these reasons, in online social research the use of methods associated with narrative inquiry, including literary aesthetics or hermeneutics, comes almost naturally (Das & Pavlíčková, 2014). It does not require radical changes in the method. We need to take into account the issues raised in the previous two chapters, in addition to some more details.

Online conversations resemble persistent conversations (Erickson, 1999). Granted, they can be conducted dynamically, completely or partially synchronously, with the ongoing participation of the interlocutors, but they may also be archived. For this reason they can be analyzed after years with no loss of the message, as long as the context is understood. Naturally, we need to take into account the awareness which the participants have that whatever they say will be written down and that even a spontaneous exchange will have many asynchronous readers. Because of this, many Internet discussions may, or even should, be treated as forms of many-to-many public transmission, or the monodialogue, not private conversations, although this can be no excuse for ignoring ethical considerations on anonymity protection and privacy of the research participants.

Nevertheless, since some conversations are addressed to a mass audience, they can be treated as public discourse. Open discussions on Twitter on global warming (Fownes, Yu, & Margolin, 2018) may be (p.95) treated not as a semi-private conversation but as a state of public debate. It would make sense to perform a purely quantitative study of such a debate, as well as a social network analysis. Still, a qualitative inquiry fits well too, either as a standalone method or as a complementary one. It is easy to imagine that one conducts a quantitative study, and then follows up with a narrative analysis of a selected subset of tweets or tweeters. The goal of such a study could be to focus on the way arguments are shaped, and on how recurring motifs or typical conversation trajectories appear. In this sense, online research can benefit from Foucauldian concepts of discourse as systems of formulating and articulating ideas at a specific time, a great force shaping the world order (Foucault, 1980). Discourse serves to formulate meanings and consolidate social institutions; reaching these mechanisms is important from the viewpoint of social sciences. Inquiry in itself may be based on qualitative text analysis and on a quantitative approach (Elliott, 2005), through the use of sentiment analysis or culturomics.
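
The step from a large quantitative dataset to a subset suitable for narrative analysis can be sketched in code. The example below is only one possible selection strategy and an assumption on my part, not a procedure taken from the cited studies: it combines the most widely shared posts with a reproducible random sample of the remainder, so that close reading is not limited to the loudest voices. The field names (`id`, `text`, `shares`) and the function name are hypothetical.

```python
import random


def select_for_close_reading(posts, n_top=5, n_random=5, seed=0):
    """Pick posts for qualitative reading from a larger collection.

    `posts` is a list of dicts with the hypothetical keys
    'id', 'text', and 'shares'. Returns the `n_top` most shared posts
    followed by up to `n_random` posts drawn at random from the rest.
    A fixed `seed` makes the selection reproducible.
    """
    ranked = sorted(posts, key=lambda post: post["shares"], reverse=True)
    top, rest = ranked[:n_top], ranked[n_top:]
    rng = random.Random(seed)
    sampled = rng.sample(rest, min(n_random, len(rest)))
    return top + sampled


# Toy data standing in for a scraped collection of tweets.
corpus = [{"id": i, "text": f"post {i}", "shares": i * 3} for i in range(50)]
subset = select_for_close_reading(corpus, n_top=3, n_random=4, seed=42)
print([post["id"] for post in subset])
```

Recording the seed and the selection criteria in the research diary allows others to retrace which utterances were read closely and why.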

An important characteristic of online conversations is that they undergo the echo chambers effect: reinforcement of opinions when we stay in the company of other people with views similar to ours (O’Hara & Stevens, 2015). In addition, digital propaganda spreads radical ideas online to convince people, to desensitize them to messages that would previously have been shocking, and to change the perception of what is normal and neutral (Lockie, 2017; Sparkes-Vian, 2018). For this reason, online narratives can be radicalized; para-anonymity makes it easy to use extreme arguments to shift the perceived middle ground. Additionally, because of the efficiency of trolling, discussions which are important from the viewpoint of information wars are often waged by professional, hired disputants who are paid to pretend they are regular users, which is important for the political or business interests of their customers (Aro, 2016). The issues discussed do not always need to be related to politics and may reach into seemingly distant areas such as vaccines or national pride, simply providing wider support for all kinds of anti-establishment and destabilizing tendencies (d’Ancona, 2017; Lewandowsky, Ecker, & Cook, 2017). In a sense, theoretically individual expression in social media, even if spontaneous, is also a specific form of political propaganda (Wojtala, 2018).

(p.96) Online narratives, like offline culture texts, also have a literary character and may be analyzed as signs of cyber-folklore. A specific type, worth mentioning here, is the copypasta (Chess & Newsom, 2015). These are once-popular texts, copied and pasted to be distributed for entertainment (Meder, 2008), as well as chain letters and all kinds of spam. Their role has been taken over by social media posts, which makes tracing them easier.

The differentiating factor of online narratives is that online messages and posts, more often than regular conversations, are a performance. The goal may be to play a narrative as such, in a form similar to artistic expression, or to participate in a ritual of enacting the feeling of community with others (Bar-Lev, 2008). Participation in online communities is often divorced from one’s professional and social identity. It makes posting radical or absurd messages so much easier. The goal is not to convince others to adopt an idea but rather to evoke a reaction. From this, the phenomenon of completely voluntary and sadistic trolling arises, where asocial behaviors or messages are aimed at upsetting the people on the other side of the screen (Buckels, Trapnell, & Paulhus, 2014; March, Grieve, Marrington, & Jonason, 2017). It is also interesting that concurrently with the development of trolling, we may observe the growth of pro-social behavior and attempts at ordering the dialogue and maintaining its level of culture, based on voluntary involvement of moderators; however, these attempts are rarely sufficient to create an aggression-free public discussion space (O’Connor & Mackeogh, 2007). On a side note, because of avatarization, we must constantly remember that the same person may troll from one account while providing support from another.

Trolling can often take the form of solidifying a norm; frequent areas of attack are feminist forums and digital places where feminist attitudes are voiced (S. Herring, Job-Sluder, Scheckler, & Barab, 2002). Internet forums are a frequent place for misogynists, online sexual harassment, gender-based derision, and other kinds of bullying (Moloney & Love, 2018). Online communities are not free from offline social divisions, biases, and stratifications (Rufas & Hine, 2018). Many online conversations lead to the radicalization and solidification of gender stereotypes (Banet-Weiser & Miltner, 2016), and are places of hateful messages aimed at silencing the opposite side (Jane, 2014). Such trolling is often met with defensive strategies (Stroud & Cox, 2018), some of (p.97) which are controversial as they border on mob law (Kosseff, 2016). It is also important that the expressions of discrimination by men and women differ, which makes it difficult to create rules for gender neutrality (Dueñas, Pontón, Belzunegui, & Pastor, 2016).

There is no doubt that the awareness that even academic works, written from feminist viewpoints, are the subject of trolling may lead to self-censorship and withdrawal from discussions (Carter Olson & LaPoe, 2018). The reaction to trolling from the social environment, including professional media, is also an interesting research area. One of the more frequent pieces of advice is “do not feed the trolls” (Figure 3.15) and simply ignore them so that they are denied the pleasure of causing an emotional reaction. This advice taps well into the typical trolling strategy, which may be exemplified by the sentence “Trolling is a art,” posted as a comment, and baiting unwitting disputants to point out the “mistake.” Ignoring trolls, however, leads also to the ignoring of symbolic violence and shifting responsibility to the (p.98) victim (Lumsden & Morgan, 2017). Naturally, this also may shift the perception of what is a common ground view, or a balanced position, toward the ideas expressed by the troll.

Figure 3.15 “Do not feed the trolls”

Source: Pixabay

Trolling in itself can be of para-artistic character, meaning that it could mask deep irony and provide entertainment (Dynel, 2016). Perhaps this is why some forms of trolling are accepted, and online conversations with the participation of trolls are often based on an unwritten social contract, according to which the response to trolling is more trolling and absurd escalation (Coles & West, 2016). It is the reason why, with the state of technology as it is now, trolling cannot be prevented by automatic algorithmic filters alone and why human moderation is necessary (T. Gillespie, 2018). Narrative analysis of these kinds of interactions ought to take their characteristics into account and remember that they form some kind of art or performative acting. Such art is also associated with other online culture products. I used trolling only as an example of a specific narrative and activity; other kinds of narratives may be examined as well, such as support groups, blogs, ways of self-creation and self-narration in discussions, or conspiracy theories.

3.3 Research of Works of Culture

One of the major changes that we experienced as the result of the Internet revolution is the new way of spending free time, consuming and producing media (Livingstone & Das, 2013). Some even invoke the convergence of consumption and production in “prosumerism” (Bruns, 2008). Prosumerism is visible in the large part of the population that used to passively consume films, television, music, and the like, produced by professional teams, and that has now become a co-creator of those media. Naturally, the distribution of people actively involved in actual media production is lopsided: only a small minority is involved in this activity, and only a fraction can compete with professionally created art. Nevertheless, popular involvement in the production of culture has serious consequences for many branches, not only commercially, in the effect of falling prices of stock photos, but also socially. In many cases, competition is based on offering goods of comparable quality but (p.99) at generously discounted prices, as “amateurs” who publish their works online are more concerned with fame than money (Shirky, 2009; Surowiecki, 2004). This transforms the system of perceived values and the spirit of capitalism (Yeritsian, 2017), but also creates new potential for exploitation, inequalities, and abuse (Dusi, 2017).

Some researchers are not happy with this change. Keen laments the “cult of the amateur” (Keen, 2007) and predicts plunging quality as the result of lack of professional standards for production and quality control; in consequence, he also forecasts the doom of culture. The argument is hyperbolic, although, according to the Gresham-Copernicus law, bad money drives out good money, and competition from people who are not subject to quality control standards and procedures or professional ethics may have devastating consequences for cultural production (Helberger, Leurdijk, & de Munck, 2010). We need to remember, however, that the conviction of the growth of the role of amateur work is partially a myth—cutting out the intermediaries in the distribution of works from professionals to final clients is more noticeable (Brabham, 2012). Perhaps the concentration on the dichotomy of professionally produced versus amateur-driven culture ought to be abandoned in favor of the circulation of culture (Jenkins, 2006).

Spontaneous and bottom-up culture is closely connected to open collaboration communities and the gift economy. The idea of the prosumer revolution relies less on the actual mass production of works of culture than on the possibilities offered, and the emergence of ahierarchical online communities focusing on spontaneous creation (Benkler & Nissenbaum, 2006).

Digital communing reshapes what we perceive as ownership (Kostakis, 2018), and leads to other serious cultural changes, within the perception of value, or intellectual property and authorship (Pouwelse, Garbacki, Epema, & Sips, 2008). These notions arose in a world where the entire system of circulation of works of culture was aimed at the separate roles of active creators and passive recipients, and law maintained the business model including major intermediaries (publishers) and enforced the resulting monetary transactions.

From the perspective of sociology, the self-agency of culture participants underwent its own transformation (Sztompka, 1991): it is both increased, as forecast by the enthusiasts of prosumerism (Knott, (p.100) 2013), and limited, through the control of platforms and computer systems (Ritzer, 2015; Ritzer & Jurgenson, 2010; Van Dijck, 2009).

3.3.1 Remix Culture and Politics

Cultural production of online communities has only recently become available to social scientists for research. Professional online communities have been widely researched (Ciesielska, 2010; Coleman, 2013; Dahlander, Frederiksen, & Rullani, 2008; Lakhani & Wolf, 2003); however, movements based on amateur, spontaneous participation and creation of culture are just now becoming objects of interest of the representatives of social sciences (Boellstorff, 2008; Pragnell & Gatzidis, 2011; Steinmetz, 2012). It is surprising, as the movement of free culture and information was initially developed with involvement of social scientists, including those who used qualitative methods (Kelty, 2004).

The phenomenon of spontaneous culture co-creation has major consequences, as it is associated with changes in interpersonal hierarchy and relations. The metamorphosis of culture consumers to producers (Lessig, 2004), also through remix culture (Lessig, 2008) results in a cultural change within the legal (Benkler, 1999; Lessig, 2004), economic (Benkler, 2003, 2013) or social areas (Zittrain, 2008). Portals such as 9gag or Imgur, where people spontaneously share pictures and videos, which are often simply popular pictures of movie frames with added commentary, are more popular than professional services, created by full-time, paid crews. It is even more visible in the social networks geared for media production and sharing, such as TikTok, used by more than half a billion people, or Instagram, with over one billion users. It is also worth mentioning that in the face of the collapse of the job market for the young, the development of “do-it-yourself” careers of bloggers, vloggers, web musicians or even meme creators is useful for the development of professional competences and portfolio associated with the more traditional job market (Bennett, 2018).

Remix culture is based on the strong social acceptance of derivative works—in simple words, remakes of original works (Cheliotis, 2009). A remix is a delicate balance between the original work and skillful combination of popularly recognizable contexts and artistic traces (p.101) (B. M. Hill & Monroy-Hernández, 2013). Contrary to appearances, people participating in the remix culture or the associated fandom culture, despite their casual attitude to copyright law, have their own code of behavior (Hetcher, 2009). They allow, granted, extensive use of existing works, such as the mixing of movie quotes or images, but with simultaneous adherence to the idea of authorship—not so much as formal recognition of the original author in terms of remuneration, but rather in the acknowledgement of and homage to the creative input. As research shows, even children, when using software that enables the use of others’ code, pay attention to whether others notice their work—while not paying attention to whether they will be automatically mentioned as original authors by the algorithm (Monroy-Hernandez, Hill, Gonzalez-Rivero, & boyd, 2011).

Naturally, this approach results in culture clashes with the norms of copyright law and with the expectations of the groups that rely on their creativity for a livelihood. Although online creators often invoke the right to quote, some corporations which are copyright owners do not acknowledge this interpretation (Freund, 2014). Usually the law is on the side of the latter group, although the social feeling of justice is increasingly divergent from it (Chused, 2014; Hergueux & Jemielniak, 2019). Courtroom confrontations are very rare. Derivative works, thanks to their popularity, also increase the popularity of the originals. For example, remix culture contributed to a revival of interest in Lego products (Einwächter & Simon, 2017). The prosumer movement, even though its subcultures have praised rebellion and opposition to corporations on occasion, is also a source of free labor for these corporations. This is so not only in the area of promotion but also by providing content to distribution platforms (Sugihartati, 2017).

In open collaboration communities, this type of creation of content takes place in a networked participatory environment. Produsers (Bruns, 2008), people who use and create at the same time, usually while remaining anonymous, strengthen and grow the common output—Internet content—through continuous improvements to its material. The effects of their work are not products in the classical sense, as produsers are not purely producers. The works continue the activity of others, often very imitative and derivative. The wide availability of editing tools increases the number of people involved in produsage and (p.102) prosumption, i.e. the combination of creative and utilitarian/consumption activity, as well as the Read/Write culture (Lessig, 2004). Even Internet users who are not co-creating anything at the moment may—at any time, with no preparation or the need to acquire competences—become co-producers. This leading role is less often played by those who create and remix the works, in favor of those who disseminate this culture, acting as transmitters (Frank, 2011)—because they also take pains to pore through the works, categorizing, describing, and commenting on them. The creation is very derivative, partly because cyberculture is based on the communal, not the individualistic, aspect of culture. Maybe this is the secret to the popularity of Creative Commons licenses which enable the users to reuse the works for non-commercial purposes, or even with no restrictions at all, as long as they credit the original author (Carroll, 2006).

Potentiality resulting from unlimited access to media—in this case to Internet social media—gives the produsers both nearly absolute freedom of expression and the power of shaping the contents accessible online and returning to the previous works, following the quote according to which “the Internet never forgets.” One function of this radically democratized and pluralist medium is the possibility of expressing one’s opinion and critique, including political protests, and as a result, social involvement (Castells, 2013a, 2013b; Milan, 2013), whose influence on the issues of interest to sociologists, such as political system, national culture, customs, or even demographics, cannot be overestimated.

The Internet grants entry to a new dimension of political involvement, but at the same time does so in ways which used to be reserved for political cartoon satire—press caricature or grassroots street art in public spaces. Partial or complete anonymity of such works of involved art is becoming an option for many Internet users (Mouffe, 2008). They live in a virtual space which, in its fluidity and temporariness, resembles Marc Augé’s idea of non-places (Augé, 1995). Permanent change of the cyberspace and the transience of its functioning allow unlimited social and culture-creation activity (Dahlberg, 2007). The Internet is therefore also an ideal discursive platform for grassroots social-political activity (Jordan & Taylor, 2004), which uses art to propagate ideas.

The agency of anonymous works of digital art results from their placement between the reality that they serve to comment on and potentiality (Agamben, 1999). It is based on how these works of digital art can (p.103) influence the reality through their virtual existence (Leadbeater, 2008; Van Dijck, 2009). An important example of such subjectivity is the potential influence of the Internet and modern technologies on powerful grassroots social movements. For instance, an important area of research is the role of Facebook and Twitter during post-election riots in Iran in 2009, during the revolution in Egypt, and later during the 2011 Arab Spring (Bruns, Highfield, & Burgess, 2013; Christensen, 2011; Khondker, 2011; Lotan, Graeff, Ananny, Gaffney, & Pearce, 2011), or the Spanish Los Indignados movement (Castells, 2013b), or Ukrainian EuroMaidan (Bohdanova, 2014; Onuch, 2015) in 2013 and 2014, as well as the #MeToo phenomenon. Similarly, more technologically advanced groups get involved in hacktivism, which is social activity through hacking (Coleman, 2014), in the form of website cracking, or simply Denial of Service (DoS) attacks—causing websites to stop working because of excessive traffic.

“Twitter revolutions” and the role of online publications in shaping social change are criticized as a fancy of the media (Mejias, 2010; Morozov, 2009), the influence of technology on the increased agency of individuals is far from obvious (Christensen, 2011; Segerberg & Bennett, 2011), and activism often becomes “slacktivism” (Kristofferson, White, & Peloza, 2014; Skoric, 2012), that is, involvement that requires just a few clicks, gives the feeling of having completed a duty, and provides a distraction from actual activities. Even so, spontaneous, online community-created satire, both social and political as well as purely entertainment-oriented, is a cultural phenomenon that requires the attention of the social sciences and that bears on the emerging social reality.

3.3.2 Research of Humor

Ethnographic researchers claim that the true understanding of culture is confirmed if the researchers start to understand the jokes of their interviewees, meaning that they possess similar cultural capital. It is similar to the native knowledge of a language—understanding irony is one of the most difficult competences of a language (Banasik & Podsiadło, 2016). In the words of Dougherty, a cartoon “requires that the viewer be familiar with current issues and debates, savvy about the cultural context, (p.104) and capable of analytical judgments” (Dougherty, 2002, p. 258), and similarly, a joke requires a complex understanding of the cultural context. For researchers of culture, jokes are a source of knowledge about social sentiments, including political views (Virno, 2008), making them the point of interest of historical studies (Granger, 1960; Wood, 1994), sociology, anthropology, or political science (Klumbytė, 2012). We may even state that in many communities, research into jokes and their comic imaginarium—to coin a joke, focusing on “anecdotal evidence”—may be of higher cognitive value for cultural analysis than focusing on the research of pure facts (Jemielniak, Przegalińska, & Stasik, 2018). This is one reason why researching online humor, both in the sense of studying jokes of selected online communities and going deep into the research of the rules of communities focused on cultural production, is worthy of deeper sociological analysis, even if it is underestimated.

Political memes, like graffiti, may be seen in categories of political involvement (Mouffe, 2008) or simply social critique of the activity of the state. Laughter and jokes are some of the most popular techniques of civic resistance—in their democratized form they are a way of negotiating social reality which is accessible to anyone (Friedman, 2012). An important catalyst for the textual and visual political satire is mass media—printed newspapers for political comic strips and caricatures (DeSousa, 1982; Gamson, 1992), and more recently, the online space for older and newer forms. Extreme cases of the increased reach of such works of culture are caricatures of the Prophet Muhammad (Sturges, 2015; Weaver, 2010a), which caused actual physical violence. These channels of communication allow jokes to question the symbolic order: they celebrate its critical function and control, watchdog spheres, allowing for wide circulation of contents.14

(p.105) “Humor appears when people resolve two conflicting images in ways that make sense within distorted systems of logic. The processes by which organization members set up such puzzles for others to solve—and the processes by which these are actually solved—say much about the ways organization members work and play together” (Kahn, 1989, p. 46). Analyses of ludic behaviors in organizations and communities (Hunter, Jemielniak, & Postuła, 2010) have been increasing in popularity in the social sciences.

Similarly, organizational humor is often presented as a weapon of symbolic violence between employees and their superiors (Fleming & Spicer, 2007; Jemielniak, 2007). Totalitarian organizations, including governments, also see humor as a threat (Oring, 2004). There are at least two reasons: irony deconstructs and disarms official organizational propaganda but also allows individuals to see their roles from a distance (Kunda, 1992). The larger the discrepancy of power between individuals and organizations, including the structures of the state, the more humor becomes a defensive weapon of the weakest: examples reach far beyond the obvious, in anti-totalitarian opposition (Benton, 1988) and encompass customer-producer relations, visible in popularity of jokes about Microsoft (Shifman & Blondheim, 2010), the movement of African American emancipation (Weaver, 2010b), and female emancipation (N. A. Walker, 1988). A daily dose of humor allows us to create and make sense of professional roles and builds opposition to managerial control (Lynch, 2009). In a way, organizational rhetoric, used to strengthen the expected behavior and reinforce the hierarchy, is undermined through deconstructive ambivalence of spontaneous employee resistance (Höpfl, 1995) expressed through humor—both within commercial organizations and social movements.

These processes have a carnivalizing character, according to Bakhtin (Bakhtin, 1984) who cited the example of medieval carnivals to show the crucial role of unofficial and spontaneous ludic behavior in maintaining the social contract. Temporary suspension of dominant norms and hierarchies allows people from lower social echelons a moment of (p.106) freedom, while making them aware of the fixed order of things. Jokes and spontaneous humor in organizations and communities, like the carnival, are the realm of temporary freedom from the prevailing discourse and the fixed system of domination. In humorous tales—jokes, drawings—we find messages that escape the control of formal hierarchy, thanks to which they can be of use for socio-cultural analysis.

The goal of ethnography, according to Agar, is to reach the “notion-points,” carriers of cultural topoi and archetypes, and making a specific translation of them, which allows the interpretation of the culture in its context (Agar, 2006). Even though Agar did not reference online research, cyberculture is especially rich in such points. Ironic messages are one of the more interesting areas for researching them.

Analysis of humor, including political satire, is especially useful when new, not yet solidified, cultural changes are studied. For this reason, it is useful to analyze online community phenomena and their cultural works. Online humor is a specific form of creativity in that it makes perfect use of the creative character of participating in culture (prosumerism) with an easy form of participation: all that is needed is paraphrase, deconstruction, or combining an image and a comment to arrive at a comic effect. This is how memes are born.

3.3.3 Online Memes

Apart from blogs, thematic forums, and social media, where discussions can be held and social movements started, the most valuable tool of social critique can be found in memes (Shifman, 2014b). Although it seems impossible to trace the genesis of individual memes, it is easy to pinpoint the creator of the term. In “The Selfish Gene” (Dawkins, 1976), Dawkins introduced the term “meme” to define extra-genetic behaviors and cultural phenomena that spread from one person to another—starting with language norms and ending with sports traditions. With the development of the Internet, the term “meme” started to be used in reference to the processed (remixed) cultural contents that are made available online (Brake, 2014; Knobel & Lankshear, 2007). Internet memes, in their essence, emerge from the world of the anonymous pan-individual network which does not belong to anyone (J. M. Adams, (p.107) 2014); at the same time it forms the quintessence of democratized and pluralized digital culture, created by the widely understood prosumer crowd. The latest and most elegant academic definition of the phenomenon may be ascribed to Davison in “The Language of Internet Memes,” where he writes: “an Internet meme is a product of culture, usually a joke, which increases its influence through online propagation” (Davison, 2012, p. 122).

Socio-cultural researchers have been focusing on individual cases to trace the shaping of Internet memes. They concentrated on meme creation and migration (Shifman, Levy, & Thelwall, 2014), memes’ role in expressing prejudices (Woźniak, 2016), specific relations (Wiggins & Bowers, 2015), cultural logic (Shifman, 2014a), or the importance of memes for specific subcultures and individual identities (Nissenbaum & Shifman, 2017).

A meme, as an element of mass culture, has become a means of commenting on the prevalent socio-cultural reality. In this sense, Internet memes are the direct descendants of the culture of socio-political satire at its peak. The ridiculing online humoristic memes comment on events or messages using text elements with visual and audio-visual ones (Da Silva & Garcia, 2012). Memetic nonsense is based not only on notional deconstruction of intellectual art but also on playing with the social norm (Katz & Shifman, 2017).

Many comments are inappropriate or use very dark humor (Burroughs, 2013) and resemble trolling (Greene, 2019). “Memetic activism,” also known as “snarktivism,” is a defense tactic against the contested actions of politicians, international corporations and non-governmental campaigns that simplify social problems. The best example is the use of the 4chan platform by the anti-capitalist Occupy Wall Street movement in 2011 (Coleman, 2011; Milner, 2013b). The Occupy movement was recognized as a meme by the Know Your Meme portal (Bratich, 2014). 4chan, in turn, is said to have popularized memes in contemporary culture. At the same time, it is a radically anti-systemic community, building identity based on a contemptuous attitude to “normies,” who are people following social norms (Nagle, 2017). One of its most infamous campaigns was convincing gullible users that upon drilling a hole in their iPhones they would be able to use mini-jack head-/earphones with their devices, or that heating a mobile phone in a microwave would (p.108) charge its batteries. 4chan also popularized the “pedobear” (a pedophile-associated mascot) meme, spread rumors about Steve Jobs having a heart attack which caused a momentary plunge in Apple stock prices, and provided the possibility of coordinating large-scale social resistance actions, such as Distributed Denial of Service (DDoS) attacks, which work by sending massive amounts of queries to a server so that the server’s website becomes inaccessible. The Anonymous movement was also established on 4chan (Coleman, 2014). 4chan is also the cradle of Internet memes.

The satirical character of online memes comes from locality—in their form, they definitely represent contemporary Americanized global culture (Shifman & Boxman Shabtai, 2014); however, their content is often of high social-political importance only on a local scale (K. V. Anderson & Sheeler, 2014). The meme’s message is understandable only in a specific socio-cultural context, even though it is composed of signs that are understandable for supranational communities of the Internet (Shifman & Boxman Shabtai, 2014). Creative use of Internet memes as social involvement and the critique of the local political stage is apparent in the audiovisual “Harlem Shake” meme. A joke meant as a dance happening (a group of people listens to a piece of music without moving a muscle just to start a frantic dance at one point; the happening is recorded and edited to expose the contrast between the two states) gained a political dimension when young people in some Middle Eastern countries performed dance moves inspired by African American culture while wearing traditional Muslim clothes. In Egypt, the ruling Muslim Brotherhood arrested the people responsible for the local version of the international fad (Werbner & Modood, 2015). Something similar happened in Russia. Such clashes and transfers of cultural contents are a hallmark of political potential carried by the culture of virtual communities (Tsing, 2011). This use of memes simply begs for social network analysis supplemented by interviews with the participants in and distributors of the memes, and finalized with a socio-political analysis of the context and the reasons for the power of the memes.
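As a minimal sketch of the quantitative half of such a design, the share records of a single meme can be treated as a small directed network. The account names, the `shares` list, and both helper functions below are invented for illustration and do not correspond to any platform's API; real data would first have to be collected and anonymized.

```python
from collections import Counter, defaultdict

# Toy, invented share records: (sharer, account the sharer took the meme from).
shares = [
    ("bea", "ann"),
    ("cal", "ann"),
    ("dee", "bea"),
    ("eli", "bea"),
    ("fay", "dee"),
]


def direct_reach(records):
    """Count how many accounts reshared directly from each source."""
    return Counter(source for _, source in records)


def cascade_size(records, root):
    """Count all accounts that got the meme from root, directly or indirectly."""
    children = defaultdict(list)
    for sharer, source in records:
        children[source].append(sharer)
    seen, stack = set(), [root]
    while stack:
        node = stack.pop()
        for child in children[node]:
            if child not in seen:
                seen.add(child)
                stack.append(child)
    return len(seen)


print(direct_reach(shares).most_common(2))
print(cascade_size(shares, "ann"))
```

Accounts with a large cascade but few direct reshares are candidates for the interviews mentioned above, since their influence travels through intermediaries rather than through their own audience.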

As we can see, the role of political satire, including humorous provocation, which is reflected in slacktivism, snarktivism and trolling cannot be overestimated (Milan, 2013). Carefully tracing memes can (p.109) both help us understand contemporary civic society and understand the way social media spreads information—including politically loaded pop-cultural contents—which prosumer online communities consider socially important. The Internet audience of the contemporary socio-political stage participates in the remixing, processing, and popularization of contents. At the same time, it creates new, efficient channels of distribution whose research is also the domain of contemporary sociology.

Internet memes can be divided into “image (or visual) macros” which are remixes of familiar pictures with a comment (see Figure 3.16), and “reaction Photoshop,” the use of a familiar picture or a symbol in a new context (Shifman, 2014b). An image macro can be “This is bait”:

Figure 3.16 “This is bait”. Example of an image macro

Source: https://knowyourmeme.com/memes/bait-this-is-bait

It became popular on 4chan as a comment signaling that the message leans towards trolling and was, naturally, the starting point for numerous remixes. The picture is one of the most popular memes of all time, although the issue of propagation and popularity of memes is a complex one and therefore worthy of different research approaches (Zannettou et al., 2018). The problem of quantitative research is the lack of clear distinction when a derivative work becomes independent and ought not to be treated as a derivative anymore.

(p.110) Examples of “reaction Photoshop” include a photo of a police officer pepper-spraying seated demonstrators from the Occupy movement, edited into medieval paintings,15 and variations of the “Chubby Bubbles Girl” (Figure 3.17), a girl running away from whatever the creators put in the background:

Figure 3.17 “Chubby Bubbles Girl”

Source: https://knowyourmeme.com/memes/chubby-bubbles-girl/

With memes, we can express an infinite number of ideas in a specific semiotic form which is also characterized by unlimited flexibility (Milner, 2013a). Memes use the structural properties of the given work of culture as a set of templates for free use and reuse in a new context (Massanari, 2015). In 2004, Glen Whitman, a blogger for Agoraphilia, coined the term “snowclone.” It is related to sentences such as “grey is the new black,” where the words grey and black can be replaced with any other nouns (“X is the new Y”). Satirists may therefore use the original photo or a ready-made picture from the resources of Internet portals such as Meme Generator or Rage Comic Builder, and afterwards adorn them with a humorous text in an original or altered form, to create a joke which may become popular. Such “image macros” are easy to produce, and most image repositories even provide trending backgrounds which are recommended when creating a meme. It would be an interesting research question to analyze which pictures are most often recommended and used by the meme generator websites.
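The snowclone idea lends itself to a simple illustration: because the template is fixed and only the slots vary, a regular expression can pull the slot fillers out of a corpus of captions or posts. The sketch below assumes one-word slots and an English-language corpus; the pattern and the function name `find_snowclones` are illustrative assumptions, not a validated research instrument.

```python
import re

# Assumed pattern for the "X is the new Y" snowclone with one-word slots.
SNOWCLONE = re.compile(
    r"\b(?P<x>\w+) is the new (?P<y>\w+)\b",
    re.IGNORECASE,
)


def find_snowclones(texts):
    """Return lowercased (x, y) slot fillers for every match in the given strings."""
    pairs = []
    for text in texts:
        for match in SNOWCLONE.finditer(text):
            pairs.append((match.group("x").lower(), match.group("y").lower()))
    return pairs


captions = [
    "Grey is the new black.",
    "No template in this caption",
    "They say orange is the new black, and data is the new oil",
]
print(find_snowclones(captions))
```

Counting the extracted pairs across a larger corpus would show which slot fillers a community favors, although multi-word fillers and ironic uses would still require manual, qualitative reading.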

Memes are an efficient transmitter of social moods, as they combine surprising forms and concepts (the variations on the British poster from World War II: “Keep Calm and Carry On”, Figure 3.18) (Virno, 2008). (p.111)

Figure 3.18 “Keep Calm and Carry On”

Source: http://knowyourmeme.com/memes/keep-calm-and-carry-on

One of the more interesting examples is the “advice memes”—a variation of image macros, bearing the picture of a person giving “bad advice,” pasted on a colorful spinning background with a repetitive pattern of a duck or bear. At first, the advice came from funny animals; however, the form itself was also used for political critique—in the USA, during the discussion on national debt, the giver of bad advice was Barack Obama or the economist Paul Krugman (Vickery, 2014). Rintel summarizes this in a blog post: “Whatever we call it, internet comment culture is a reinvigoration of an active public voice. It’s a combination of popular culture and folk culture, appropriating and mashing together objects and ideas from media industries and objects and ideas created from whole cloth” (Rintel, 2011).

“Advice memes” can evolve. For instance, a study of the “confession bear” meme shows that the initial use of the image only for humorous (p.112) purposes on Reddit evolved after some time and caused the publication of a series of memes with serious content, also mentioning rape, molestation, and addiction, in a way that contested the dominant discourse of culture. This generated long community discussions, related to both the honesty of confessions and the suitability of memes as the carrier of such confessions, as well as the possible regulations within participatory culture of the portal (Vickery, 2014). Ways of using memes can also be of research interest, to show how communities with seemingly no norms aspire to self-regulation—often returning to those standards of behavior patterns that they themselves contest at the rhetoric level (Gal, Shifman, & Kampf, 2015). Analysis of memes may therefore be based not only on the analysis of images which we recognize from visual sociology but also on the research of readership, contexts of creation and distribution, and expression of meme-related social norms, as well as deeper auto-analyses of memes that are sometimes created by the communities.

Internet memes represent a phenomenal growth of digital culture of social commentary, becoming a new tool of political agency in public opinion (Davison, 2012). It is worthwhile to include meme analysis into research projects, drawing on the achievements and tools of cyber cultural studies (C. W. Anderson & Revers, 2018; Nissenbaum & Shifman, 2017).

According to Google Trends, in the English version of the search tool, memes reached the same level of interest in the USA as Jesus in 2012. They are now four times as popular (Figure 3.19):

Figure 3.19 Google Trends results for “jesus” and “meme”

Source: https://goo.gl/tTLn5R


(4) Warden’s case is interesting in that Facebook’s robots.txt file did not prohibit site indexing, and this is the traditional method websites use to signal whether their information can be processed. It is difficult to envision Google, for instance, requesting written permission to index each website’s contents. Nonetheless, this is an important lesson to anyone using crawlers in social research.

(6) More precisely, committal—the transferring of one’s piece of code into the common repository.

(8) Patrick Moore was a president of Greenpeace Canada, though. Since leaving the organization, he has rejected the consensus of the scientific community on climate change, insisting that there is no proof of a human-caused increase in carbon dioxide. See more: https://en.wikipedia.org/wiki/Patrick_Moore_(environmentalist)

(12) As an example, in my research on software engineers, I noticed that managers viewed the amount of time that their employees spent at work, not the quality of the work done, as indicative of the value of the employee; in other words, time was symbolic, showing loyalty and devotion to the organization (Jemielniak, 2009).

(14) The attitude of different communities to picture culture is interesting in itself. As part of my ethnographic study of the Wikipedia community I participated in a discussion about image filtering. Simply put, the Wikimedia movement community wanted to decide whether logged-in users should have an additional setting at their disposal: upon loading an article that contains photos or pictures that can be considered controversial, the person would see a warning instead of the actual picture. The setting would not even have to be a default one; it would be opt-in, so only people who wanted the option enabled would need to find it and set it. The Wikimedia community, in the movement’s largest vote, with 24,000 participants in 2011, supported this solution, and the Wikimedia Foundation’s Board of Trustees published a resolution encouraging the development of technical means to enable Wikimedia users to set what content they would like to be concealed. Despite this strong support, a group of active Wikimedians considered such solutions as potentially leading to censorship. A few large projects conducted their own polls and concluded, by similarly large majorities, that they did not want image filtering to be enabled (79% on Spanish Wikipedia, 81% on French Wikipedia, 85% on German Wikipedia). As a result, the idea was abandoned, as the risk of forking was too high.
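
Returning to note (4): the robots.txt check mentioned there can be sketched in a few lines of Python using the standard library's `urllib.robotparser`. This is a minimal illustration, not a description of how Warden's crawler or any particular research project worked. The robots.txt content, the `research-crawler` user agent, the `example.org` URLs, and the `is_allowed` helper are all hypothetical and chosen only for the example. As note (4) suggests, a permissive robots.txt does not by itself settle whether collecting the data is acceptable.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content, for illustration only.
# It is NOT the file of any real website.
EXAMPLE_ROBOTS_TXT = """\
User-agent: *
Disallow: /private/
Allow: /
"""


def is_allowed(robots_txt: str, user_agent: str, url: str) -> bool:
    """Return True if the given robots.txt text permits user_agent to fetch url."""
    parser = RobotFileParser()
    # parse() accepts the file as a list of lines, so no network access is needed.
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


if __name__ == "__main__":
    # "research-crawler" is a made-up user agent name.
    for url in (
        "https://example.org/public/page",
        "https://example.org/private/data",
    ):
        verdict = "allowed" if is_allowed(EXAMPLE_ROBOTS_TXT, "research-crawler", url) else "disallowed"
        print(url, "->", verdict)
```

In an actual study, the file would be fetched from the target site (for instance with the parser's `set_url` and `read` methods) before any page is requested, and the site's terms of service would still need to be checked separately.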