  • 00:00

    [MUSIC PLAYING] [Developing Software to Download & Analyze YouTube Data]

  • 00:09

    MIKE THELWALL: My name is Mike Thelwall. [Mike Thelwall, Professor of Data Science, University of Wolverhampton] I'm a professor of data science at the University of Wolverhampton and head of the Statistical Cybermetrics Research Group there. So I research into developing methods to analyze data from the social web or the general web to address social science research questions. So I write programs to implement the methods

  • 00:30

    MIKE THELWALL [continued]: and put them free on the web. So I like to share research methods. [How did you become interested in this field?] I got into this area by an interest in the social web generally and looking at different sites and trying to think of ways in which it was possible to extract data to research those sites.

  • 00:52

    MIKE THELWALL [continued]: So I got into the area from the perspective of looking at the sites themselves and trying to think of ways of researching them. I've always been interested in the social web since it arrived and curious about how it works, who uses it, why they use it, what are the differences in ways in which it's used. So I've liked to develop methods to analyze that in any way

  • 01:14

    MIKE THELWALL [continued]: possible. [Why did you choose to use YouTube in your research?] I guess I've always been a little bit of a YouTube user. I've never been a huge fan of YouTube. But it is a site that is incredibly popular. It's been the second or the third most popular website in the world for a decade.

  • 01:35

    MIKE THELWALL [continued]: So I think it's not researched very much in academia for the amount of reach it has, for the number of users it has. So I think it's an under-researched site. And I noticed that it allows you to access its data for free for research purposes. So it's one of the few sites which

  • 01:55

    MIKE THELWALL [continued]: provides a good source of data for researchers. So when I noticed that then I thought it was a good idea to try and develop methods to research YouTube and get insights into who uses it, how they use it and also to get insights into the topics that are represented on YouTube--so either study YouTube itself or the topics that have videos on YouTube.

  • 02:17

    MIKE THELWALL [continued]: [What methods did you develop?] So I've developed methods to download YouTube data to start with. So if you use my program, Mozdeh, which is free on the web, you can download the most recent 100 to 150 comments

  • 02:37

    MIKE THELWALL [continued]: on any video in YouTube. And you can do that for one video, you can do it for all the videos in the channel, or you can do it for a large set of videos or a large set of channels. So the starting point is downloading the comments. And then from that, my software will detect sentiment in the comments. So you can see if they're positive or negative.

  • 02:58

    MIKE THELWALL [continued]: And it will attempt to detect the gender of the commenters. So you can see what proportion of the commenters are male and female. And it will also analyze the text in the comments and tell you if there are trends in terms of males using some words more than females or the opposite or one video or video channel using one set of words more or less than another.

  • 03:19

    MIKE THELWALL [continued]: So you can identify large scale trends in the comments on the videos. [How does the data collection process work?] The program Mozdeh is a Windows-based program. And if you download the program, you need to feed it with a list of either YouTube videos

  • 03:42

    MIKE THELWALL [continued]: or a list of channels, YouTube channels, or a set of keyword queries. If you feed it with keyword queries, it will identify matching videos and then download their comments. Otherwise it will download all the comments on the videos you feed it with or the videos in the channel that you feed it with. So you have to start by coming up with either the channels or the videos or the keywords.

  • 04:04

    MIKE THELWALL [continued]: And then when it downloads the comments, it puts them into a simple plain text file. So if you want to analyze them with a spreadsheet, for example, you could. Or if you have a preferred text analysis program, you could easily load the data into that program. But the program itself also comes with a suite of analysis tools.
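
As a rough illustration of loading a plain text comment file into your own analysis code, here is a minimal Python sketch. The transcript does not describe the file layout, so the tab-separated columns named `author`, `date`, and `text` are assumptions made for this example, not Mozdeh's documented format.

```python
import csv
import io


def read_comments(handle, delimiter="\t"):
    """Read a delimited text file with a header row into a list of dicts."""
    reader = csv.DictReader(handle, delimiter=delimiter)
    return [row for row in reader]


# Hypothetical sample standing in for an exported file; with a real file,
# pass open(path, encoding="utf-8", newline="") instead of the StringIO.
sample = io.StringIO(
    "author\tdate\ttext\n"
    "John Smith\t2019-01-05\tGreat video\n"
    "Patricia Jones\t2019-01-06\tLovely dresses\n"
)
rows = read_comments(sample)
for row in rows:
    print(row["author"], row["date"], row["text"])
```

If the real export uses a different delimiter or has no header row, the `delimiter` argument and the column names would need to change to match.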

  • 04:26

    MIKE THELWALL [continued]: So if you go into the analysis tools of the program, then it will annotate the comments with the author gender, the sentiment of the comments. And then it has a set of techniques for analyzing the text of the comments as well. But Mozdeh's a Windows-based piece of software. And you have to start by feeding it with a list of keywords for the videos you're

  • 04:47

    MIKE THELWALL [continued]: interested in or a list of the videos themselves, so the URLs of the videos, or a list of the channels that you're interested in the videos on. And once it has that list, then it will download the comments on the videos. And then you can either analyze them inside Mozdeh with its internal facilities, or you

  • 05:07

    MIKE THELWALL [continued]: can take them to a plain text file and then load them into your own favorite text analysis program. Mozdeh gets its data from YouTube via the YouTube API, which is a service that Google provides to anyone who wants to use it, which gives free limited access to YouTube data. And the terms and conditions allow the data

  • 05:28

    MIKE THELWALL [continued]: to be used for research purposes so long as you read the terms and conditions and make sure that you're not violating what they permit you to do. And so that gives an automatic free source of a large set of comments. [Does your software use text analysis?]

  • 05:50

    MIKE THELWALL [continued]: Mozdeh provides basic text analysis in terms of comparing the frequency of use of individual words between males and females or between different videos or between different channels. So a typical question you could ask Mozdeh is, which words are used in the comments more by males than by females?

  • 06:11

    MIKE THELWALL [continued]: And then Mozdeh will give you a list of the words that are used more by males and by females. And it also gives you a statistical test if you need that for significance. So it'll report how many words are statistically significantly used more by males than females and vice versa or statistically significantly used more by one channel than another, or you can just have a complete list

  • 06:31

    MIKE THELWALL [continued]: without the statistical significance. But it's just based on the words themselves. It doesn't do anything fancy with them. It just compares the individual words used between two sets of the comments. In addition to the text analysis, Mozdeh also gathers the date when the comment was made. So you can also analyze the comments as a time series.
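
The word comparison described above can be sketched in Python as follows. The transcript does not say which statistical test Mozdeh applies, so this example assumes a 2x2 Pearson chi-square test on the number of comments containing each word; the function names and the sample comments are hypothetical.

```python
import math
import re
from collections import Counter


def word_presence(comments):
    """Count, for each word, how many comments contain it at least once."""
    counts = Counter()
    for text in comments:
        counts.update(set(re.findall(r"[a-z']+", text.lower())))
    return counts


def chi_square_2x2(a, b, c, d):
    """Pearson chi-square statistic (no continuity correction) for [[a, b], [c, d]]."""
    n = a + b + c + d
    denom = (a + b) * (c + d) * (a + c) * (b + d)
    return 0.0 if denom == 0 else n * (a * d - b * c) ** 2 / denom


def p_value_1df(chi2):
    """Upper-tail p-value for a chi-square statistic with 1 degree of freedom."""
    return math.erfc(math.sqrt(chi2 / 2))


def compare_words(group_a, group_b):
    """Return (word, share_a, share_b, chi2, p) tuples, largest chi2 first."""
    counts_a, counts_b = word_presence(group_a), word_presence(group_b)
    n_a, n_b = len(group_a), len(group_b)
    rows = []
    for word in set(counts_a) | set(counts_b):
        a, c = counts_a[word], counts_b[word]
        chi2 = chi_square_2x2(a, n_a - a, c, n_b - c)
        rows.append((word, a / n_a, c / n_b, chi2, p_value_1df(chi2)))
    return sorted(rows, key=lambda row: row[3], reverse=True)


# Hypothetical comments standing in for two sets being compared.
set_one = ["tank engine", "tank armour"]
set_two = ["dress fabric", "lovely dress"]
for word, share_one, share_two, chi2, p in compare_words(set_one, set_two):
    print(f"{word}: {share_one:.2f} vs {share_two:.2f}, chi2={chi2:.2f}, p={p:.3f}")
```

Because one test is run per word, a real analysis would probably also need some correction for multiple comparisons; that step is left out of this sketch.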

  • 06:53

    MIKE THELWALL [continued]: So you can see when the most comments were made and compare that between videos or between genders or between channels. And it will also draw networks of the relationships between individual commenters based on whether they reply to each other's comments. So if you have a set of videos with really intense debate--

  • 07:16

    MIKE THELWALL [continued]: although this is rare in YouTube. But if you do have an intense debate in a set of videos, then you can convert this into a picture, a network diagram with Mozdeh. [Tell us more about how your software detects gender] YouTube used to provide the gender of their users as part of their free data, but they

  • 07:38

    MIKE THELWALL [continued]: don't provide this anymore. So now we have to guess the gender of users. And Mozdeh guesses the gender of users from their name, from their YouTube name. So if the YouTube name is John Smith, for example, then Mozdeh will guess this is a person whose first name is John. And John is a male gendered name. So it's probably a male.

  • 07:58

    MIKE THELWALL [continued]: And Mozdeh has a list of thousands of first names that are either usually male or female in the USA as its basic gender list. But if you have comments from a different part of the world, then it has a set of first names from a few other countries, Turkey, India, for example.

  • 08:19

    MIKE THELWALL [continued]: So you can detect gender in different languages. It doesn't work all the time. But it's right more often than it's wrong. So it gives you a reasonable split of probably male and probably female comments. Mozdeh can't detect the gender of all users. So it detects about 30% of users in a typical set of comments. So the remaining 70% of people who

  • 08:41

    MIKE THELWALL [continued]: have names that are either rarer than a few thousand that we have or are from different cultures, or they're short versions of names. So for example, there are lots of Patricks and Patricias. And nearly every Patrick is male, nearly every Patricia is female. But almost exactly 50% of Pats are male and 50% are female.

  • 09:03

    MIKE THELWALL [continued]: So if someone puts their first name as Pat, then we have to throw away their data, because we don't know whether they're male or female. And it does sometimes make a mistake. So the most common mistake last time we manually checked a set of comments was the name Alpha, which is a female name in the USA. But in YouTube, there are quite a few commenters

  • 09:26

    MIKE THELWALL [continued]: who are males that describe themselves as Alpha Male John. And Mozdeh sees alpha as the first word, thinks, OK, that's the first name, first name Alpha, it's a female. So it does make mistakes. And if a person has a non-binary gender then Mozdeh just can't detect that at all, because as far as I know, there aren't lists of names that are non-binary names.
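
The first-name lookup described above can be sketched as follows. This is an illustration only, not Mozdeh's code: the two tiny name sets are hypothetical stand-ins for its lists of thousands of names, and the function name `guess_gender` is invented for this example.

```python
# Hypothetical miniature name lists; a real list would hold thousands of names.
MALE_NAMES = {"john", "patrick"}
FEMALE_NAMES = {"patricia", "alpha"}


def guess_gender(display_name):
    """Guess gender from the first word of a display name.

    Returns "male", "female", or None when the first word is not on
    either list (for example an ambiguous short form such as Pat).
    """
    words = display_name.strip().split()
    if not words:
        return None
    first = words[0].lower()
    if first in MALE_NAMES:
        return "male"
    if first in FEMALE_NAMES:
        return "female"
    return None


def coverage(display_names):
    """Share of names for which some gender guess could be made."""
    if not display_names:
        return 0.0
    guessed = [name for name in display_names if guess_gender(name) is not None]
    return len(guessed) / len(display_names)


names = ["John Smith", "Pat Jones", "Alpha Male John", "xX_gamer_Xx"]
for name in names:
    print(name, "->", guess_gender(name))
print("coverage:", coverage(names))
```

The sketch reproduces both limitations mentioned in the interview: "Pat Jones" is left unclassified because Pat is on neither list, and "Alpha Male John" is labeled female because only the first word is checked.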

  • 09:49

    MIKE THELWALL [continued]: So if you are non-binary, you either pick a name that is male or female or neutral or something completely different. [What research questions can you answer using data from YouTube?] If you're analyzing YouTube comments, then it's good for research questions.

  • 10:10

    MIKE THELWALL [continued]: First of all, if they're about YouTube and YouTube communication, then obviously YouTube comments are perfect. But also, if you've got research questions that are about topics that are naturally represented on YouTube. So if it's a research topic where video is an appropriate medium, then YouTube is a good source for that.

  • 10:30

    MIKE THELWALL [continued]: And specifically for this method, if you're interested in the gender aspect, then it's very good or if you want to compare between different sorts of videos. So you have a research question, which says how is this type of video different from that type of video, then the text analysis works really well for that. [Tell us about an example of when

  • 10:50

    MIKE THELWALL [continued]: you have used your software] As an example of a recent project using Mozdeh, I analyzed 50 museum YouTube channels and investigated the gender of the commenters for the 50 museums to look to see if there was a pattern in terms of which types of museums had mainly male audiences, which

  • 11:15

    MIKE THELWALL [continued]: types had mainly female audiences, and if there were exceptions to the general rule. So this is of interest because museums tend to be quite secretive about their audiences and often just don't collect gender information at all. So if you're lucky, you'll know how many people went through the doors of a museum or an art gallery

  • 11:36

    MIKE THELWALL [continued]: and very little else. But if they have a YouTube presence, then you can analyze the comments and compare the popularity and the gender breakdown of the YouTube videos, if not their online visitors. So I took 50 very large museum YouTube channels--including art galleries with museum, museum or art gallery

  • 11:59

    MIKE THELWALL [continued]: YouTube channels. And I downloaded the comments and the videos and analyzed them for gender and then looked for patterns of types of museum or art gallery that had mainly male or female audiences. And there was an interesting gender difference between these museums.

  • 12:21

    MIKE THELWALL [continued]: So we could categorize the 50 museums by male to female ratio. And the most female museum or set of museums in fact for this case was the set of Liverpool museums, including the Lady Lever Art Gallery. And they'd attracted a huge female audience for a set of really beautiful videos

  • 12:43

    MIKE THELWALL [continued]: about getting dressed in the 18th century. So if you see that video, it's absolutely fantastic. It's really high production values, unusual type of video. And it had a huge audience, mainly female audience, lots of very positive comments. And at the other extreme, the most male museum or art gallery was again in the UK. It was an international study, but again in the UK.

  • 13:06

    MIKE THELWALL [continued]: It was the Bovington Tank Museum. And these had videos which were fantastically popular with males but not females and low production values. But the videos typically involved a man, not high production values but very emotionally engaged in his topic, so really talking

  • 13:27

    MIKE THELWALL [continued]: with a lot of feeling about the tanks that he was very interested in. So that was interesting to see how big the difference was. And the gender difference factor was 96. So women were 96 times more likely compared to men to watch the Lady Lever videos than the tank videos.

  • 13:47

    MIKE THELWALL [continued]: So it was surprising to see the big difference. And also within this, there was a pattern that the art galleries tended to have higher female audiences compared to the museums. So art galleries were more female oriented. But there were some exceptions, including the Los Angeles

  • 14:08

    MIKE THELWALL [continued]: Museum of Modern Art, which had a relatively high male audience. And the videos it produced that attracted more males than typical for an art gallery were about punk rock art in the 1970s and particularly the design elements of the punk rock art, so the visual design elements

  • 14:29

    MIKE THELWALL [continued]: of punk rock art. So that was interesting to see. So although these were still quite female friendly videos, for an art gallery, they were very successful with males. So that was an interesting finding. [Were there any unexpected challenges with this project?]

  • 14:51

    MIKE THELWALL [continued]: So there are a few challenges with YouTube data. So the main challenge is probably that most of the comments can't be gendered. The gender method only works for 30%. Another challenge is that the YouTube API only allows the most recent 100 to 150 comments to be downloaded on each video.
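
For readers who want to see what a request for recent comments might look like, here is a minimal Python sketch against the `commentThreads` endpoint of the YouTube Data API v3. The transcript does not say which calls Mozdeh makes, so the choice of endpoint, the helper names, and the sample response are assumptions for illustration. The sketch only builds the request URL and parses a response; actually sending the request needs your own API key.

```python
import json
from urllib.parse import urlencode

API_URL = "https://www.googleapis.com/youtube/v3/commentThreads"


def build_request_url(video_id, api_key, max_results=100):
    """Build a URL asking for one page of a video's newest top-level comments."""
    params = {
        "part": "snippet",
        "videoId": video_id,
        "maxResults": max_results,  # the API accepts at most 100 per request
        "order": "time",            # newest comments first
        "textFormat": "plainText",
        "key": api_key,
    }
    return API_URL + "?" + urlencode(params)


def parse_comments(response_text):
    """Extract author, text, and date from a commentThreads JSON response."""
    data = json.loads(response_text)
    comments = []
    for item in data.get("items", []):
        snippet = item["snippet"]["topLevelComment"]["snippet"]
        comments.append({
            "author": snippet.get("authorDisplayName"),
            "text": snippet.get("textDisplay"),
            "date": snippet.get("publishedAt"),
        })
    return comments


# Hand-written sample shaped like an API response, used here instead of a live call.
sample_response = json.dumps({
    "items": [
        {"snippet": {"topLevelComment": {"snippet": {
            "authorDisplayName": "John Smith",
            "textDisplay": "Great video",
            "publishedAt": "2019-01-05T10:00:00Z",
        }}}}
    ]
})
parsed = parse_comments(sample_response)
print(build_request_url("VIDEO_ID", "YOUR_API_KEY"))
print(parsed)
```

With a valid key, the response text could be obtained with `urllib.request.urlopen(build_request_url(...)).read()`, subject to the API's quota and terms of service.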

  • 15:12

    MIKE THELWALL [continued]: So if you have one video with thousands of comments, you don't get all the comments. So that's a little bit of a challenge of interpretation. And another big challenge is that when someone leaves a comment on YouTube, they might be playing a game with you. They might be trolling. So you can't be sure that the comment

  • 15:34

    MIKE THELWALL [continued]: should be taken at face value. So you have to be a little bit cautious about interpreting the words that people use in their comments. [What other data sets can be used with your software?] On YouTube, I don't think there is a big problem with bots. So trolls are a problem. But I don't think that bots are a problem.

  • 15:56

    MIKE THELWALL [continued]: So I haven't seen a comment that looks like it's been put there by a bot on YouTube. And I'm not sure why that is. It might be that the YouTube technology prevents bots from getting in, or they're just very good at filtering them out when they arrive, or bots don't see YouTube as a high value site even though it's fantastically popular.

  • 16:18

    MIKE THELWALL [continued]: [What advice would you give to people looking to do similar research?] I've described a little bit about how Mozdeh can be used for YouTube analysis. But it also allows you to gather different types of data and apply the same types of analysis to that data. So for example, you can collect your data

  • 16:40

    MIKE THELWALL [continued]: from Twitter instead of YouTube. So it's just a different button click at the start to go for Twitter rather than YouTube. It will also import data from the TripAdvisor website or academic publication databases. If you have a database of academic publications, you can feed them in, and it will do the same kind of analysis. And there are various other different types

  • 17:01

    MIKE THELWALL [continued]: of more specialist data sets that you can import. And in fact, if you have your own data set which is text, name, and date, you can feed it into Mozdeh if you want to do the gender and sentiment analysis with the program. If you are going to do a research project with Mozdeh, then I think it's a good idea to run a pilot test first.

  • 17:21

    MIKE THELWALL [continued]: Try it out on a small scale with a little bit of data, just to make sure that you will get reasonable data for your topic. So a big problem is you might have a fantastic research idea. And then you download the data on YouTube, and there are two comments on your channels, and you can't do anything. So you really need to make sure that you do have enough data to analyze to start with.

  • 17:43

    MIKE THELWALL [continued]: And it's quite simple to gather the data with Mozdeh. So with my students, I always tell them, get the data first. And then think more about how you're going to analyze the data later. Because if you can't get the data, then there's no point in wasting time planning the project that you can't do. [References] [Thelwall, M. & Mas Bleda, A. (2018). YouTube science channel video presenters and comments--Female friendly or vestiges of sexism?] [Journal of Information Management, 70(1), 28-46. doi 10.1108/AJIM-09-2017-0201] [mozdeh.wlv.ac.uk] [www.youtube.com/watch?v=UpnwWP3fOSA]

  • 18:04

    MIKE THELWALL [continued]: [www.youtube.com/user/TheTankMuseum] [MUSIC PLAYING]

Video Info

Publisher: SAGE Publications Ltd

Publication Year: 2019

Video Type: Video Case

Methods: Data mining

Keywords: data mining; databases; gender; internet data collection; open source software; research methods; research questions; Social media; Social network analysis; Social network analysis and issues; Social science research; text analysis; time series; trends; Twitter; video research; YouTube

Segment Info

Segment Num.: 1


Abstract

Mike Thelwall, PhD, Professor of Data Science at the University of Wolverhampton, discusses his research developing software to download and analyze YouTube data, including why use YouTube data; methods developed; how data was collected; what questions can be answered with YouTube data; unexpected challenges with the project; whether the software analyzes text, how it detects gender, and other types of data it can analyze; and advice for those looking to do similar research.

Video Info

Publication Info

Publisher:
SAGE Publications Ltd
Publication Year:
2019
Product:
SAGE Research Methods Video: Data Science, Big Data Analytics, and Digital Methods
Publication Place:
London, United Kingdom
SAGE Original Production Type:
SAGE Case Studies
ISBN:
9781526496294
DOI
https://dx.doi.org/10.4135/9781526496294
Copyright Statement:
(c) SAGE Publications Ltd., 2019

People

Academic:
Mike Thelwall


Methods Map

Data mining

Data mining refers to the process of discovering useful patterns in very large databases. It uses methods from statistics, machine learning, and database management to restructure and analyze data in ways that permit knowledge or information to be extracted from the material.