The power and promise of social media as a resource and tool for doing social research is widely recognised and much vaunted. Social media data are becoming an increasingly attractive resource for social scientists, but the question remains as to what exactly we might want to do with data like these. This study describes a small-scale interdisciplinary project in medical sociology which instigated the development of an innovative method for making practical use of ‘big data’ drawn from Twitter. What results is a depiction of how a collaboration between software developers, requirements engineers and social scientists demonstrated a need for a new method of data capture, a description of the method by which that need was addressed, and a discussion of the value of the insights that can be drawn through using that method.
By the end of this case, you should be able to:
- Describe the two approaches to social media data collection: (1) query keyword searches and (2) user-following
- Understand the differences between the two approaches to social media data collection, and the different reasons why you might choose one or the other to address different research questions
- Develop a mock research question to be addressed with a user-following strategy and write down details of a research design for this mock project in terms of how you would identify a relevant and appropriate user-group to follow
The present case is about an interdisciplinary collaboration between a group of researchers from computing science and social science, who were seeking to explore the possibilities of social media analytics for their topic of patient experiences of cystic fibrosis (CF). CF is a genetic medical condition which primarily affects sufferers' lungs, most often causing frequent build-ups of mucus in the lungs and airways which lead to infections and make it difficult for sufferers to breathe. CF is very difficult to manage – currently, there is no cure for the condition, although there are numerous strategies for treating CF symptoms, including medication, physiotherapy and lung transplant surgery.
The medical sociologists within our group had an interest in exploring the experiences and lifestyle issues of CF sufferers and those surrounding them (family and friends, etc.) – how they coped with the condition on a day-to-day basis. One avenue of interest was in whether CF sufferers and their friends and families actually talked about this sort of thing on social media, and if so, what did they say? We chose to focus our investigations on the way in which they used Twitter. Twitter is an environment in which users tweet about a huge range of topics, and it was our assumption that one of those topics may well be the experiences of living with (or living with someone who has) CF. Having been intrigued by the possibilities that social media might hold for health research, we were keen to see if we could identify and capture some data which we might explore ourselves.
Our work was conducted with a software package of our own design called Chorus. Chorus is a social media data capture and visual analytic suite comprising two programs: Chorus-TCD (TweetCatcher Desktop) allows users to collect Twitter data, and Chorus-TV (TweetVis) allows users to build an array of visual models to facilitate the analysis of that data. An introductory tutorial to the Chorus package, which outlines the key features and functions of both Chorus-TCD and Chorus-TV, can be found at http://youtu.be/KmCrmiBOOvw (see note 1).
Using Chorus-TCD, we began by conducting a query keyword search for two terms: ‘cystic AND fibrosis’. Although at first glance the data appeared highly relevant and well-suited to our needs, the more we delved into it collaboratively, the more we realised that it was limited in terms of its use as a source of genuine CF experiences. Our dataset consisted primarily of news headlines and CF charity event advertisements such as the following:
- Everybody follow @[CFCampaignAccount] now and help raise awareness for Cystic Fibrosis plz. (See note 2.)
- RT: If you support CYSTIC FIBROSIS awareness buy a CFaces Calendar in aid of @[CFCharity] this year PLEASE Find them at: [URL].
- Child excluded from school because he has gene for cystic fibrosis, via @[OnlineHealthNewsService].
While these kinds of insights may prove relevant to certain research questions – perhaps for a study of information propagation and citizen news broadcasting via Twitter (e.g. see Procter, Vis, & Voss, 2013) – it became clear to us that we were not capturing the CF experiences we had anticipated. Our working hypothesis as to why we weren't able to answer our questions was that, for cystic fibrosis sufferers and their loved ones, talk about CF would probably not involve explicit use of the words ‘cystic AND fibrosis’ (see note 3). Hence, we began to understand that our dataset was highly likely to have omitted a huge amount of CF-related talk on the grounds that it did not explicitly mention the terms we were sifting for, namely ‘cystic’ and ‘fibrosis’. Rather, we mused, CF-related talk may more typically consist of a variety of more mundane and general terms such as ‘coughing’, ‘pain’, ‘treatment’, and so on. These terms lend themselves more to everyday conversational tweeting but are by the same token difficult to locate with a query keyword search, not least because, given the exploratory nature of the project, we did not know the set of terms through which CF-talk was organised and structured by those tweeting about it.
To outline the potential difficulties involved, the only way to capture this CF-related talk comprehensively would be to sift through unwieldy reams of Twitter data collected around the term ‘coughing’ (for instance) to find any mentions of the term that may relate to CF – a good practical example of the proverbial search for needles in haystacks! Having started our search in this way, we would then have found ourselves trying to address several impossible questions:
- on a practical level, how would we know what search terms might capture CF-related talk in the first place?
- how could we boil down potentially millions of tweets into relevant CF-focussed data?
- how would we know which coughs could be attributed specifically to cystic fibrosis?
Then, we would have to repeat this same laborious process for another general and non-CF-specific term (say, ‘pain’, or ‘treatment’), all the while not knowing if our efforts would prove fruitful or if we were missing any key terms that we simply hadn't thought to look into. It was abundantly clear that for our purposes, a query keyword search, however well constructed, was simply not going to provide us with answers to the questions we wished to pose.
Heading back to the drawing board, we began to think about different ways of constructing datasets which might better ensure the presence of relevant data. What we required was some way of identifying a user-group who might possibly tweet about things we were interested in, then capturing their tweet output to identify what it is they actually tweeted about and build an analysis from there. To this end, we developed a distinct method of data capture – we call this user-following or user-driven data collection – whereby we could comprehensively capture the Twitter timelines of a list of tweeters. What user-following data collection does is remove the reliance on keyword searches by instead allowing researchers to follow the changing conversation of a set of Twitter users regardless of what they are tweeting about (and what language, grammar and syntax they may be using). Our initial explorations with this new type of data showed it to be markedly different from that drawn via a query keyword search. Whereas query keywords gave a tightly focussed structure to the data (i.e. all of the tweets captured in this way are around a single topic of your choosing), user-following data reflect the vast array of different topics that each Twitter user tweets about, and we quickly saw that this kind of data may require a very different analytic approach. The remainder of this case will outline exactly what this approach consisted of for our work with CF-related talk.
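The contrast between the two capture strategies can be sketched on a small in-memory list of tweets. This is a minimal illustration only: the record layout (`user`, `text`), the function names and the sample tweets are all hypothetical, and the code does not reproduce how Chorus-TCD or the Twitter API performs either kind of collection.

```python
import re

def keyword_capture(tweets, terms):
    """Query keyword capture: keep tweets whose text contains every term (AND)."""
    wanted = [term.lower() for term in terms]
    captured = []
    for tweet in tweets:
        words = set(re.findall(r"\w+", tweet["text"].lower()))
        if all(term in words for term in wanted):
            captured.append(tweet)
    return captured

def user_following_capture(tweets, users):
    """User-following capture: keep every tweet by a listed user, whatever its topic."""
    wanted = {user.lower() for user in users}
    return [tweet for tweet in tweets if tweet["user"].lower() in wanted]

# Hypothetical sample records, invented for illustration.
sample = [
    {"user": "charity", "text": "Help raise awareness for Cystic Fibrosis"},
    {"user": "follower_a", "text": "Been coughing for 6 hours, need to sleep"},
    {"user": "follower_a", "text": "Happy Easter everyone"},
    {"user": "stranger", "text": "Bad cough today"},
]

by_keyword = keyword_capture(sample, ["cystic", "fibrosis"])
by_user = user_following_capture(sample, ["follower_a"])
```

In this toy sample the keyword capture returns only the awareness tweet, while the user-following capture returns both of `follower_a`'s tweets, including the off-topic one. That mix of relevant and unrelated talk is what the analysis then has to sort through.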
The first step was to identify a list of users we wanted to follow who might possibly tweet about their CF experiences. We decided to compose our list from the followers of one of the more popular and active CF news and update accounts (see note 4) and to track the timelines of each of those followers. Having obtained a list of followers, we computed a mean tweets-per-day value for each user and ordered the list from lowest to highest. This ordering was done to assist in filtering out both users who had never tweeted (tweets-per-day = 0) and users who tweeted so much that they might be spammers or bots (all users whose tweets-per-day value was more than one standard deviation above the mean were skimmed off the top of the dataset). Even after this cleanup process, the number of followers of this account was around 6,500 and the total tweet yield was in excess of 3,000,000 over a roughly 6-month period (14 February 2013 to 23 August 2013). Hence, as a means of breaking the data down into manageable chunks, we have chosen here (for purely demonstrative purposes) to analyse tweets from the lower end of the tweets-per-day spectrum, ranging from 0.01 to 0.61 tweets per day on average. This resulted in a dataset of 282,129 tweets, which we broke down further into two ‘half’ datasets of 141,063 and 141,066 tweets (so as to reduce the time spent processing the visualisations available within the Chorus package).
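The cleanup and selection steps described above can be sketched as follows. This is a minimal sketch under stated assumptions, not the procedure as actually implemented: the function names are hypothetical, the mean and standard deviation are assumed to be taken over users with at least one tweet, and the population standard deviation is assumed, since the case does not specify either detail.

```python
from datetime import date
from statistics import mean, pstdev

# Collection window reported in the case: 14 February 2013 to 23 August 2013.
WINDOW_DAYS = (date(2013, 8, 23) - date(2013, 2, 14)).days

def rank_users(tweet_counts, days=WINDOW_DAYS):
    """Return (user, tweets_per_day) pairs, lowest first, after the cleanup."""
    rates = {user: count / days for user, count in tweet_counts.items()}
    # Drop users who never tweeted (tweets-per-day = 0).
    active = {user: rate for user, rate in rates.items() if rate > 0}
    if not active:
        return []
    # Assumption: mean and SD are computed over the active users only.
    cutoff = mean(active.values()) + pstdev(active.values())
    # Skim off users more than one standard deviation above the mean.
    kept = [(user, rate) for user, rate in active.items() if rate <= cutoff]
    return sorted(kept, key=lambda pair: pair[1])

def select_band(ranked, low=0.01, high=0.61):
    """Keep users whose tweets-per-day value falls within the chosen band."""
    return [(user, rate) for user, rate in ranked if low <= rate <= high]

# Hypothetical tweet counts per user over the collection window.
counts = {"never": 0, "light": 19, "medium": 38, "steady": 57,
          "heavy": 190, "bot": 5700}
ranked = rank_users(counts)
band = select_band(ranked)
```

With these invented counts, `never` is dropped for having no tweets, `bot` is dropped for exceeding the cutoff, and `heavy` survives the cleanup but falls outside the 0.01 to 0.61 band.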
Running these datasets through Chorus-TV, we were able to visually identify topics of interest to our user-group of candidate CF-talkers without knowing how (or even if!) they tweeted about CF on Twitter at all. Our explorations focussed on a term-level cluster map, which plotted the semantic similarity of words on a two-dimensional space such that words that are used similarly appear closer together. In this way, collections of terms cluster together as topics, and these can be followed and traced by the lines that are drawn between them (such that you can see how topic clusters link and relate to each other). The overall term-level cluster maps for each half dataset can be seen in Figure 1.
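To give a feel for what ‘words that are used similarly’ can mean in practice, the sketch below scores pairs of terms by how far the sets of tweets they appear in overlap (a cosine similarity over binary term-by-tweet vectors). This is an assumed stand-in measure chosen for illustration; the case does not describe how Chorus-TV computes semantic similarity or lays terms out in two dimensions, and the function names and sample tweets are hypothetical.

```python
import math
import re
from collections import defaultdict

def term_index(tweets):
    """Map each term to the set of tweet positions in which it appears."""
    index = defaultdict(set)
    for position, text in enumerate(tweets):
        for term in set(re.findall(r"\w+", text.lower())):
            index[term].add(position)
    return index

def cosine_similarity(index, term_a, term_b):
    """Cosine similarity of two terms' binary tweet-occurrence vectors."""
    tweets_a = index.get(term_a, set())
    tweets_b = index.get(term_b, set())
    if not tweets_a or not tweets_b:
        return 0.0
    shared = len(tweets_a & tweets_b)
    return shared / math.sqrt(len(tweets_a) * len(tweets_b))

# Hypothetical sample tweets, invented for illustration.
index = term_index([
    "double lung transplant today",
    "waiting for a lung transplant",
    "happy easter holidays",
])
lung_transplant = cosine_similarity(index, "lung", "transplant")
lung_easter = cosine_similarity(index, "lung", "easter")
```

Terms that always appear together score 1.0 and terms that never share a tweet score 0.0. A cluster map would additionally need a layout step that places high-scoring pairs close together, which is not shown here.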
We spent time exploring the structure of these maps, looking for clusters and topical strands which might possibly be classified as CF-talk, and, as we expected, the majority of topics captured bore no relation to CF at all. We noted clusters related to such topics as the Easter holidays, Mother's Day, Harry Potter, the birth of Prince William and Kate Middleton's baby and so on. We were perversely encouraged by this disparate and vast array of different conversations, in that we considered this kind of talk to be representative of everyday tweeters' interests rather than the irrelevant tweets of spammers or bots. As such, our user-driven strategy appeared to be working.
Navigating around the cluster maps, we eventually alighted on a set of distinct topics with a strong, demonstrable relation to CF, even though they did not necessarily appear in the same cluster as the terms ‘CF’ or ‘cystic’ or ‘fibrosis’. The first of these topics was (in the first half dataset) around the hub term ‘organ’, which resulted in a topical strand pertaining to the terms ‘double’, ‘lung’ and ‘transplant’ (see Figure 2).
These details convey a picture of transplant-talk as having a significant relation to lungs, and specifically double-lung transplant procedures, for the user-group captured, with around one-quarter of tweets explicitly mentioning the term ‘transplant’ also mentioning the term ‘lung’ and between approximately 9% and 12% mentioning the term ‘double’. Drilling down further into this cluster, we discovered that tweeters routinely involve themselves in personal communications with those who are undergoing or who have recently undergone a double-lung transplant surgery, expressing concern and sympathy and well-wishing. For example,
- @ConcernedTweeter thanks:-) Yeah am needing a transplant pretty badly now, still battling everyday though! #organdonation #CysticFibrosis.
- RT: @CFCharityAccount will every cfer please keep @CFSufferer in ur thoughts, she's in theatre right now having her double lung transplant!
- @CFSufferer Hope you are doing unreal since your transplant! Was so pleased to hear the news #wooooohooooo.
- My prayers are with you @CFSufferer, hope you're doing well after your 2nd double lung transplant!
Typically, these tweets appear to be designed to be of interest only to the transplant recipients to whom they are directed (i.e. with an @mention, which is how Twitter allows users to communicate directly and publicly with each other), or possibly also to other CF sufferers, as in the second tweet example above. However, we also noted that, as well as expressing personal communications, tweeters talked about double-lung transplant surgeries in a different way and with a different set of tweeting practices. This user-group also utilised episodes of specific double-lung transplant surgeries (and the CF sufferers undergoing them) to topicalise important issues around transplants and around CF as the condition they were meant to treat (e.g. post-operation aftercare, campaigning to increase the number of people signed up to the organ donor register, and so on). For example,
- @CFCharity it's my bro's 30th bday today. He has CF & had a double lung transplant 1 yr ago which saved him – need more awareness!!!
- RT: @CFSufferer Its transplant week next week. I'm still here because of an Organ Donor. Please sign the organ donor register! #RT.
- RT: @CFCharityAccount: RT@CFSufferer: #fromtheheart I have cf and I need a double lung transplant! sign up as an organ donor it will save lives!
- Heartbreakingly, beautiful, courageous @CFSufferer died on Thursday at 20 yrs old: a transplant could have saved her. #OrganDonation #Neverforget.
In this way, although it is clear that users have a huge sympathy for fellow CF sufferers whose last resort had been to undergo a painful and risky surgical procedure (and expressed those sympathies within the tweeting CF community), these same tweeters actively orient to the seriousness of the surgery as a way of acknowledging and raising awareness of the condition outside of those who already know about it. This is demonstrated by the reliance on several Twitter-specific practices, including ReTweets (which can both propagate information further as well as display an agreement with it), communications (@mentions) directed explicitly towards charity accounts (which are intended as public displays of agreement), and the use of emotive and/or informative hashtags designed to make the information searchable and visible through Twitter's search structure. Hence, we saw people with a vested interest in CF – sufferers themselves as well as family members and friends – using surgery episodes as a way of encouraging positive assistive action in the wider public, such as signing the organ donor register and potentially saving the life of a CF sufferer.
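The Twitter-specific practices mentioned above (ReTweets, @mentions and hashtags) leave visible traces in the tweet text, so they can be picked out with simple pattern matching. The sketch below is illustrative only: the patterns are simplified assumptions about how these markers appear in plain text, and the function name is hypothetical.

```python
import re

# Simplified patterns (assumptions): a retweet starts with "RT", a mention is
# "@" followed by word characters, a hashtag is "#" followed by word characters.
RETWEET = re.compile(r"^\s*RT\b")
MENTION = re.compile(r"@(\w+)")
HASHTAG = re.compile(r"#(\w+)")

def tweet_practices(text):
    """Summarise which Twitter-specific practices a tweet's text displays."""
    return {
        "is_retweet": bool(RETWEET.search(text)),
        "mentions": MENTION.findall(text),
        "hashtags": HASHTAG.findall(text),
    }

appeal = tweet_practices(
    "RT: @CFSufferer Its transplant week next week. "
    "Please sign the organ donor register! #RT."
)
reply = tweet_practices(
    "@ConcernedTweeter thanks:-) Yeah am needing a transplant pretty badly now, "
    "still battling everyday though! #organdonation #CysticFibrosis."
)
```

Counting these features across a cluster gives a rough indication of whether its tweets lean towards personal communication (mentions without retweets) or towards public awareness-raising (retweets and hashtags).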
Turning now to another distinct topical/conversational cluster, we noted a cluster of talk around the terms ‘cough’ and ‘coughing’. These terms are lent an added significance by the fact that our dataset is organised around users who have expressed some interest in CF, a condition in which coughing is highly pertinent as a lay-diagnostic measure of the illness and in which coughing is generally a more serious symptom than it may be for non-CF-sufferers. Following the same topic-based analytic strategy as with our investigations into double-lung-transplant-talk, we performed a rough visual analysis of the clusters occurring around the terms ‘cough’ and ‘coughing’ to identify a list of related terms, and investigated strongly co-occurring terms (see Figure 3 and Table 2).
Our analysis showed talk around the terms ‘cough’ and ‘coughing’ to consist of many diverse sub-topics and to be expressed with markedly different tweeting practices from double-lung-transplant-talk. We found that the terms ‘cough’ and ‘coughing’ were used in a wide variety of ways in our datasets, as demonstrated in the co-occurrences table (Table 2), where even the most strongly related terms – ‘throat’ and ‘[Name]’ – only feature in around 5% of tweets (meaning that the remaining 95% of tweets, which do not feature the terms ‘throat’ and ‘[Name]’, are about a broad array of other small-scale sub-topics). What this means is that for the users captured here, coughing was talked about in a multitude of different ways and different contexts. However, there were enough uses of the terms (136 instances across both datasets) to pique our interest – why were so many people tweeting about coughing specifically? We decided to look further into this.
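A co-occurrence value of this kind can be read as the proportion of tweets containing a root term that also contain another term. The sketch below computes that proportion directly; it is an assumed reading of the statistic for illustration, not the Chorus implementation, and the function name and sample tweets are hypothetical.

```python
import re
from collections import Counter

def cooccurrence(tweets, root):
    """For each term, the share of root-containing tweets that also contain it."""
    root = root.lower()
    tweets_with_root = 0
    counts = Counter()
    for text in tweets:
        terms = set(re.findall(r"\w+", text.lower()))
        if root in terms:
            tweets_with_root += 1
            # Count each co-occurring term once per tweet.
            counts.update(terms - {root})
    if tweets_with_root == 0:
        return {}
    return {term: n / tweets_with_root for term, n in counts.items()}

# Hypothetical sample tweets, invented for illustration.
scores = cooccurrence(
    [
        "cough and sore throat",
        "cough kept me awake",
        "stupid cough again",
        "cough cough",
        "sore throat today",
    ],
    "cough",
)
```

Here ‘throat’ scores 0.25 because it appears in one of the four tweets that mention ‘cough’; the fifth tweet is ignored because it does not contain the root term. Low values across the board, as in the cough data, indicate that the root term is spread over many small sub-topics.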
The coughing-talk we identified as related explicitly to CF showed that for CF sufferers and their friends and family, coughing is a daily issue and is more seriously acknowledged than perhaps for non-CF sufferers. For these tweeters, Twitter was a way of reporting coughing episodes, as well as venting frustration at various aspects of coughing as a symptom of CF (particularly its impact on sleeping) and asking for advice and support from others with a knowledge of CF. Example tweets included the following:
- CFers, ive been coughing for last 6 hours. Tried ventolin, oxygen, physio, brandy, stacking pillows. NEED to sleep. Anyone any other ideas? X.
- Having to sleep with 4 pillows so you're sitting upright because you keep coughing. #cfproblems #soreneckinthemorning #hatingthis.
- stupid bleedin cough, just when I was excited for an decent nights sleep. #CFsucks. dog tired but lungs #rebel. #needcodeine.
Hence, we began to build up a base of CF sufferers' lifestyle experiences by exploring how they talked about a mundane daily symptom: coughing. This gave us an insight into one of the key day-to-day issues for CF sufferers – not being able to sleep – and a selection of strategies by which they handled it at home (e.g. with codeine or other medication, physiotherapy, alcohol, changing their sleeping position and so on).
More than these insights alone, we also noted several terms in the co-occurrences associated with ‘cough’ and ‘coughing’ that we could draw on to further understand how CF impacts on daily lives. For instance, we noticed an abbreviation – ‘pwcf’, which we found stood for Person/People with CF – that co-occurs with the term ‘cough’ in dataset 1, and searched for usages of this term to find out what it was and how it was being used. What we found was a whole new cohort of lifestyle issues that did not necessarily mention the terms ‘cough’ or ‘coughing’, and which was typically expressed through a common structured joke format appropriated by Twitter users with an interest in reporting CF-related issues, for example:
- RT @CFCharityAccount: You know you're a pwCF when you use their cough as a tracking system.
- You know you're a mum of a pwCF when every handbag, rucksack, day bag etc., in the flat has a spare Creon pot in it. #cfaware.
- You know you're a pwCF when a tune-up doesn't have anything to do with your car. #cfaware.
- RT @CFCharityAccount: u know ur a parent of a PWCF when u tell your GP what antibiotics to give ur kid, what strength and 4 how long!!!
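Because these tweets share a recognisable structure, they could in principle be located with a pattern match rather than by reading through the dataset. The regular expression below is a hypothetical sketch built only from the four examples above; it is an assumption about the format's variants (‘you're’/‘ur’, an optional ‘[relation] of a’) and would likely miss other phrasings.

```python
import re

# Hypothetical pattern for the "You know you're a [relation of a] pwCF when ..." format.
JOKE = re.compile(
    r"\b(?:you|u) know (?:you['’]re|ur|you are) an? "
    r"(?:(?P<relation>\w+) of an? )?pwcf when (?P<punchline>.+)",
    re.IGNORECASE,
)

def parse_joke(text):
    """Return the speaker's relation (None for a pwCF) and the punchline, or None."""
    match = JOKE.search(text)
    if match is None:
        return None
    return {
        "relation": match.group("relation"),
        "punchline": match.group("punchline").strip(),
    }

as_parent = parse_joke(
    "RT @CFCharityAccount: u know ur a parent of a PWCF when u tell your GP "
    "what antibiotics to give ur kid, what strength and 4 how long!!!"
)
as_self = parse_joke(
    "You know you're a pwCF when a tune-up doesn't have anything to do with your car. #cfaware."
)
```

The `relation` field distinguishes tweets written from the position of a family member (‘mum’, ‘parent’) from those written by people with CF themselves.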
Our brief look at this talk allowed us to understand that there were many diverse lifestyle experiences beyond double-lung-transplant-talk and coughing-talk which we could now locate and explore in further analyses. Moreover, it was clear that CF-talk on Twitter was structured around Twitter-specific communication practices in multiple ways: users appealed for ReTweets to raise awareness of key CF issues, they shared URL links, they used hashtags to topicalise the content of their tweets, they expressed their sentiments through common joke formats (e.g. ‘You know you're a [parent of a] PWCF when…’), and so on. We now understood that the user-group selected as the basis for our data capture was thoroughly well versed in Twitter practices, and demonstrably used Twitter for a variety of communicative purposes (e.g. reporting symptoms, asking for advice, public appeals, personal communication). We are thus led to the conclusion that Twitter (and possibly other social media) may prove to be a key resource for social research projects looking to find genuine, valid accounts of an array of experiences, both in medical sociology and in the social sciences generally. The key moment in our work has been the development of a method by which those insights could be uncovered in spite of the huge volumes of data that we found ourselves dealing with. It has been the aim of the present case to outline exactly how we put that method to use to move from our simple set of research questions to a bank of numerous varied and valid insights, along with plentiful ideas for how to expand beyond our original aims and into new and interesting analytic areas.
This study has been a story about an interdisciplinary group of researchers who found themselves in the business of looking for needles in haystacks and who gradually worked through some methodological ideas until they found themselves with a collection of needles and a technique for detecting them. Having been left dissatisfied by the limitations of the typical query keyword searches for eliciting social media (Twitter) data, we developed and implemented an alternative data capture strategy – user-following – which was better suited to addressing our (and potentially others') research questions. This method provides a different way of slicing up Twitter data, and one which is more attuned to the everyday use of Twitter for users as opposed to data that is singularly oriented to one or more semantic terms. What this new method of data capture allowed us to do was to uncover issues which were important to the user-group we had identified as relevant to our research interests without knowing beforehand what terms they were using that we could search for. By navigating our way around a set of topical cluster maps, we saw distinct clusters and areas of pertinence to our chosen topic of CF, and digging into these topics, we found a wealth of lifestyle issues and CF experiences which would help us address our research project. Moreover, we also found other related key terms (such as ‘pwcf’) which have the potential to direct us down different interesting and potentially insightful further research avenues. In this way, our user-following investigations into the presence and nature of Twitter conversation around CF and related lifestyle concerns provided another way into the topic which proved well-suited towards uncovering the insights we had aimed for from the outset.
1 We advise readers to watch this video prior to reading the rest of the case, on the grounds that the insights we draw are closely related to the features and functions available as part of the Chorus software suite.
2 All identifying details of tweets and tweeters have been anonymised. Where we have substituted a description for a username or similar, these are understood to have not been valid Twitter usernames as of the date of publication.
3 We put this down to the fact that for CF sufferers and their loved ones, CF was such a pervading feature of their daily lives that it wouldn't require an explicit mention in every single tweet, excepting possibly for cases intended to be broadcast to ‘lay’ people where the term ‘cystic fibrosis’ would provide a key identifying detail of the topic.
4 A lot of the decisions we made (e.g. which CF news and update account specifically shall we use for our follower list?) happened in this way, where although we had attempted to explore options as systematically as possible, the outcome was usually settled on a trial-and-error basis of diving into the data and seeing whether what we had done had worked or not. We stand by this strategy as a means of providing a perfectly adequate demonstrative example of a novel method which we had developed and applied simultaneously, but also on the grounds that it produced valid and insightful results in and of itself. Clearly, more time should ultimately be spent on ‘working out the kinks’, but this kind of discussion is in part what we want to inspire by publishing the goings-on of this initial study of ours.
5 Co-occurrences, also known as collocations or concordances, are a linguistic analytic technique (McEnery & Hardie, 2012) available within Chorus software. For co-occurrence statistics, Chorus computes a value from 0 to 1 for each term used with the chosen root word based on how many times the root word appears with other terms. In this way, co-occurrences can be taken as a local (i.e. specific to your dataset) probability that you will find, for instance, the term ‘lung’ appearing with the term ‘transplant’. The table above indicates that these terms co-occur in between 23.0% and 26.7% of tweets across both datasets.
- We describe our user-following data capture strategy as a way of finding ‘needles in haystacks’. What might we mean by this phrase? Describe how the user-following data capture strategy might be more suited to this kind of research objective than a typical query keyword search.
- The case study we present here is around the topic of cystic fibrosis experiences and lifestyle issues. List three other topics from outside medical sociology which might be best addressed with a user-following data capture strategy.
- Our datasets were compiled from the timelines of Twitter followers of a cystic fibrosis news and update account. Using the three topics you listed in exercise 2, consider how you might go about locating a user-group on Twitter that would help you explore each topic.
- Using Chorus software (see link in ‘Links to Relevant Web Resources’ section) collect a small user-following dataset from between 50 and 100 Twitter users around one of the topics you listed in exercises 2 and 3. Using the analytic techniques outlined in the present case (visual analysis of the term-level cluster map and the co-occurrence table), navigate your way around the dataset and write down any insights you elicit.