  • 00:00


  • 00:10

    QINLAN SHEN: Hi. My name is Qinlan Shen. I'm a third-year PhD student from Carnegie Mellon University, working at the Language Technologies Institute. What I work on, in general, is the intersection between natural language processing and social media analysis, so applying techniques from natural language processing to understand how people behave on social media. [How did you become interested in computational social

  • 00:30

    QINLAN SHEN [continued]: science?] I got interested in it because my background as an undergraduate was actually in linguistics, with a little bit of sociolinguistics and historical linguistics. So I got interested in understanding it from a more computational side after taking a lot of computer science classes.

  • 00:51

    QINLAN SHEN [continued]: I also had a computational background of some sort. So there was less of a transition because I was more computational before my undergraduate, then switched to linguistics and focused a little bit more on linguistics later on in my undergraduate, and then switched back to computer science and more computational approaches. So it was a little bit easier of a transition for me

  • 01:11

    QINLAN SHEN [continued]: than I think it would be for most people. But the transition from a more computational background to linguistics was actually quite interesting. Because before I entered undergraduate, I never knew what linguistics was. And I knew I wanted to do something a little bit computational, but then when I took my first linguistics

  • 01:31

    QINLAN SHEN [continued]: class, I'm like, oh, this is an amazing subject. There was so much to learn. And so I started focusing a little bit more on that. And then only in my later years of undergrad did I realize that I could kind of combine those two fields, because there were people working on things in the intersection between these. [What are the benefits and challenges of studying linguistics and computation together?]

  • 01:56

    QINLAN SHEN [continued]: Sometimes I feel like, in recent years, it's been a little bit difficult integrating them because the field moves quickly. So techniques rise and fall and have very different life cycles. So in the current stage-- well, in the past three years, I feel like linguistics has gotten shifted a little bit

  • 02:17

    QINLAN SHEN [continued]: to the back seat. But it's coming more into prominence now. So it fell a little bit into the back seat because of the rise of deep learning. But now, people are understanding that a lot of the information that you learn from linguistics can inform how you apply the techniques in machine learning to language. [Why is social media such an interesting area to study

  • 02:41

    QINLAN SHEN [continued]: in linguistics?] It's a really great source of data when you're studying linguistics because it's raw data from how people actually speak. So a lot of natural language processing works on news data, which, while important, is not exactly how most people interact with the rest of the world.

  • 03:02

    QINLAN SHEN [continued]: So that's why I'm excited about working with social media data. [What are you currently studying?] What I've been looking at recently is how people interact in political debate. So the focus of my current work was on a surprisingly big debate group called Big Issues Debate,

  • 03:24

    QINLAN SHEN [continued]: which is within a social media community called Ravelry, which is centered around the fiber arts, so knitting, crocheting, and such. But surprisingly enough, one of the biggest groups in this community is about political debate, not knitting. So it receives around 5,000 posts a month. And what's interesting about this group is how transparently you can understand

  • 03:48

    QINLAN SHEN [continued]: how people moderate the debate. So moderation, in conventional wisdom, has been seen as crucial for debate because moderators create and enforce rules that allow users with different viewpoints to engage civilly, in an orderly manner. But a lot of the tension with--

  • 04:10

    QINLAN SHEN [continued]: what a lot of people don't think about when it comes to moderation, especially with sensitive issues like politics or social issues, is that moderation of people speaking about these issues can be seen as censorship under some circumstances. So it's often difficult to maintain order in these groups. So within this group itself, there

  • 04:31

    QINLAN SHEN [continued]: has been a lot of tension between users of the community and the moderation team, where the users perceive the moderators as unfairly targeting specific viewpoints, or specific users who have expressed views against the moderation team in the past. And so what I wanted to do was understand why these users had

  • 04:53

    QINLAN SHEN [continued]: this perception and pick apart whether or not it was actually happening-- and underneath, if it was happening, what was causing the moderators to treat certain users unfairly; if it wasn't happening, what was causing the users to feel that they were being unfairly targeted by the team.

  • 05:13

    QINLAN SHEN [continued]: Moderation should occur when people are exhibiting bad behavior. So we needed to pick apart the bad behavior from everything else that we know about the user, like whether or not they were conservative or liberal, or whether or not they've interacted with the moderation team in the past. So traditional natural language processing techniques for understanding bad behavior often

  • 05:33

    QINLAN SHEN [continued]: rely on word-level information, so slurs, profanity, certain keywords associated with certain targeted groups. And even with more sophisticated machine learning techniques, they often still focus on word-level information. So this works very well when you're trying to detect hate speech because it's very obvious when

  • 05:55

    QINLAN SHEN [continued]: someone says a slur or curses. But in this case, because there are rules in place in the debate on what kinds of things could be said, often the offensive behavior is a lot more subtle and nuanced, and it was very, very difficult to capture with strong keyword information. And initially, it was compounded by the fact

  • 06:15

    QINLAN SHEN [continued]: that in this group, profanity and talking about slurs and how they're used are both common topics of discussion, because they are important social and political issues. So we decided to zoom out and take another view on how to capture notions of offensive language, which is looking at it more in context.
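A purely word-level detector of the kind described here can be sketched as a toy. The keyword list below is invented for illustration; real systems use curated lexicons and learned features, and the point is only to show why such detectors miss subtle rule-breaking:

```python
# Toy word-level "bad behavior" detector, of the kind the interview
# describes as standard for hate speech detection.
# The keyword set is a made-up illustration, not a real lexicon.
KEYWORDS = {"idiot", "stupid"}

def flag_post(text: str) -> bool:
    """Flag a post if it contains any keyword -- the word-level
    approach that misses subtler rule-breaking behavior."""
    words = set(text.lower().split())
    return bool(words & KEYWORDS)

print(flag_post("you are an idiot"))  # prints True
# Subtle hostility with no keywords goes undetected:
print(flag_post("I respectfully disagree, but your claim ignores the data"))  # prints False
```

This is exactly the failure mode described above: in a rule-governed debate group, offensive behavior rarely contains obvious trigger words, so context-based signals are needed instead.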

  • 06:37

    QINLAN SHEN [continued]: So one of the theories that we use in our work is dialogue acts, which capture more sentence-level intent properties of an utterance in context, so you can understand how people transition from dialogue act to dialogue act. So we created an unsupervised graphical model that

  • 07:01

    QINLAN SHEN [continued]: allows us to pick apart dialogue acts from background information, such as the actual topics people are discussing. And then we applied this method to find dialogue acts that were more associated with offensive behavior. And then, using the resulting high-risk behaviors,

  • 07:22

    QINLAN SHEN [continued]: we incorporated that with the features of viewpoint and user history to run a regression analysis to determine whether or not there was evidence that users were being censored in this group. [How did you collect the data?] For gathering the data on Ravelry,

  • 07:44

    QINLAN SHEN [continued]: Ravelry has an API for returning posts and information about the posts. But one of the interesting things that we needed to do was track down moderated posts, which are less obvious because it's not explicitly built into the API. But this group has a very structured way of marking moderated posts, which actually made it more interesting for us to study, and which,

  • 08:05

    QINLAN SHEN [continued]: unlike in other communities, where a post moderated for bad behavior just gets deleted, preserves the text of the post but strikes through it with some markup. And then mods are also expected to edit the post to write their justification. So it's pretty easy to find a pattern for detecting posts

  • 08:27

    QINLAN SHEN [continued]: with strike-throughs followed by the phrase "mod-edited" and some kind of reasoning. So that's how we extracted the moderated posts from the rest of the posts. As for the modeling techniques that we used, the graphical model that we use to detect dialogue acts works a little bit like a topic model.
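The pattern-matching step described here could be sketched roughly as follows. The exact strike-through tag and the justification phrasing are assumptions for illustration, not Ravelry's actual markup:

```python
import re

# Hypothetical sketch: detect moderated posts by strike-through markup
# followed by a "mod edited" justification, as described in the interview.
# The <strike> tag and wording are assumed for illustration.
MODERATED = re.compile(
    r"<strike>.*?</strike>.*?mod[- ]edited",
    flags=re.IGNORECASE | re.DOTALL,
)

def is_moderated(post_html: str) -> bool:
    """Return True if the post looks like a moderator-struck post."""
    return MODERATED.search(post_html) is not None

example = "<strike>original text</strike> Mod edited: personal attack."
print(is_moderated(example))  # prints True
```

In practice a pattern like this would run over post bodies returned by the API, splitting the corpus into moderated and unmoderated posts for the later analysis.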

  • 08:48

    QINLAN SHEN [continued]: But the intuition behind separating out topical information from more stylistic function is that topics transition slowly in conversation. What you expect in a single thread of conversation is that there is just one background topic of discussion. So, for example, in a debate thread, if it's a debate about gun control,

  • 09:09

    QINLAN SHEN [continued]: you'll see a lot of words about guns and laws and such. And we expect that topic to stay the same throughout the thread. But we have another, faster-transitioning distribution that captures a little bit more stylistic information, like, oh, this person just rebutted the last person's argument, or this person made a claim. So using that intuition, the model

  • 09:34

    QINLAN SHEN [continued]: has, basically, two different distributions for drawing out topics-- one faster-moving one for more functional information, and one slower-moving one for content-related, traditional topic information. And then, in terms of implementation,

  • 09:56

    QINLAN SHEN [continued]: we work with a lot of languages in our group. So the model itself, the unsupervised graphical model that we use to actually detect these dialogue acts, was implemented in Java, because it's often easier to implement graphical models there. And it's slightly faster than certain other languages where you could implement models.
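The two-distribution intuition can be illustrated with a toy generative sketch in Python. The vocabularies, act labels, and structure here are invented for illustration and are far simpler than the actual unsupervised graphical model from the research:

```python
import random

random.seed(0)

# Toy sketch of the two-rate intuition: one slow-moving content topic
# shared by a whole thread, and a fast-moving dialogue act resampled
# at every post. Vocabularies are made up for illustration.
TOPIC_WORDS = {"gun_control": ["gun", "law", "amendment", "regulation"]}
ACT_WORDS = {
    "claim": ["believe", "think", "argue"],
    "rebuttal": ["however", "disagree", "actually"],
}
ACTS = list(ACT_WORDS)

def generate_thread(topic: str, n_posts: int):
    """Generate posts sharing one background topic, with the
    dialogue act redrawn for each post (the fast distribution)."""
    posts = []
    for _ in range(n_posts):
        act = random.choice(ACTS)  # fast-moving: changes per post
        words = (random.choices(ACT_WORDS[act], k=2)        # act words
                 + random.choices(TOPIC_WORDS[topic], k=3))  # slow-moving topic words
        posts.append((act, " ".join(words)))
    return posts

thread = generate_thread("gun_control", n_posts=4)
for act, text in thread:
    print(act, "->", text)
```

Inference in the real model runs in the opposite direction: given only the words, it separates which ones belong to the slow thread-level topic and which signal the post-level dialogue act.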

  • 10:17

    QINLAN SHEN [continued]: But a lot of the processing and pre-processing of text is handled in Python, because there are a lot of tools built in for pre-processing text. And then the statistical analysis that we did at the end, with the information that we gathered from the dialogue act model, people's viewpoints, and people's user history,

  • 10:39

    QINLAN SHEN [continued]: was run in Stata, a statistical package. [What challenges did you face during your research?] I think there were a couple of challenges with figuring out how to scope the nature of the project, since we had a lot of trouble with the initial stages

  • 11:03

    QINLAN SHEN [continued]: of the model, actually interpreting a lot of the dialogue acts that were returned. Because with an unsupervised model, you often get very noisy results. And so we had to run a lot of experiments and test different parameters to make sure that our model was working and the results weren't just unexpected.

  • 11:26

    QINLAN SHEN [continued]: So we ran lots of experiments on that. And then, once we actually got the model running and had results, one of the things we found frustrating was that, OK, we detected a little bit of censorship on the side of the mods against users

  • 11:47

    QINLAN SHEN [continued]: with conservative viewpoints. However, the effect of this censorship was so, so tiny, and we didn't know what to make of this information until we realized that maybe we should reframe the problem-- as in, the problem with the group isn't so much that there is such strong censorship that people

  • 12:08

    QINLAN SHEN [continued]: are leaving the group, but rather that there's something about how the group is being run that gives users the impression that they're being censored. And so the second half of the project, the more qualitative part, shifted to understanding why people actually perceived there to be censorship when there was very little actually occurring. [What additional methods did you use to conduct

  • 12:31

    QINLAN SHEN [continued]: your research?] We didn't directly interview the participants. But we looked at a lot of their-- so in this group, there are places for users to talk to mods.

  • 12:51

    QINLAN SHEN [continued]: So we didn't actively interview the participants, but we looked at their responses to moderation comments and tried to understand what they were feeling at the moment when they received the moderation decision, and understand the different kinds of feelings that the users were bringing up to the mods in terms of how they felt they were being unfairly judged in this situation.

  • 13:12

    QINLAN SHEN [continued]: [What advice would you give to someone new to social media research?] One of the interesting things about social media research is you always have to be prepared to be surprised by your results, because people never behave the way you initially think on paper. So sometimes you run a model, and the results you get

  • 13:37

    QINLAN SHEN [continued]: don't work out. That's like what happened with me. Instead of calling it quits and going, oh, this isn't working out, when you're working with social media, it's better to go in at some point and understand why you're seeing the results you're seeing, not just rely on your results

  • 14:00

    QINLAN SHEN [continued]: from your quantitative models. So it's something that I hold very important to me in terms of social media analysis. [What is next for your research?] So this research is still ongoing. So now that we've identified some problems in this group,

  • 14:23

    QINLAN SHEN [continued]: we're actually looking forward to working with the moderation team itself to see if we can design interventions in a way that could improve the relationship between the moderation team and the users. So that's one direction that we're thinking about going in-- addressing some of the issues that we found were causing people to perceive censorship when there was very little occurring.

  • 14:45

    QINLAN SHEN [continued]: The other direction that we're moving in is trying to apply this to other communities beyond the Big Issues Debate community, because, while it's a very active community and very rich in data, not very many people know about it. And so we want to see if a lot of the results--

  • 15:07

    QINLAN SHEN [continued]: and the demographics of this community are very different from elsewhere on the internet, because Ravelry, unusually among social media sites, generally skews female and older. So we wanted to see if a lot of the findings that we had in this research apply to other platforms with different demographics.

Video Info

Publisher: SAGE Publications, Ltd.

Publication Year: 2019

Video Type: Case Study

Methods: Unsupervised learning, Social media analytics, Computational social science

Keywords: censored data; censorship; computer science; data analysis; data visualisation; demographic analysis; internet data collection; linguistics; political debates; quantitative content analysis; research; Social media; unsupervised classification

Segment Info

Segment Num.: 1

Carnegie Mellon University PhD candidate, Qinlan Shen, discusses her online censorship perception and bias research using social media and unsupervised machine learning, including what prompted the research, data collection methods, challenges faced, and research still to come.


Studying Online Censorship Perception & Bias Using Social Media & Unsupervised Machine Learning

