  • 00:05

    Studying Algorithmic Management Using Web Scraping & Clustering

  • 00:09

    DR. JAMES ALLEN-ROBERTSON: My name is Dr. James Allen-Robertson. I'm with the University of Essex, Lecturer in Sociology. I'm in the area of media and science technology studies, but shifting into a more kind of computational social scientist role.

  • 00:31

    DR. JAMES ALLEN-ROBERTSON [continued]: The shift came because I've done a lot of work with texts, mainly. So my PhD, I did work on the history of digital piracy. And to do that there's no real kind of documents that you can draw on that are in the traditional documentary role, so you have to use online documents.

  • 00:52

    DR. JAMES ALLEN-ROBERTSON [continued]: So I had a lot of experience gathering lots of documents online and going through them, from a very qualitative approach. And once I finished that project I wanted to do a different one. I wanted to do one on 3D printing and the way in which 3D printing was being expressed online in different news articles and the different kinds of news articles

  • 01:12

    DR. JAMES ALLEN-ROBERTSON [continued]: that are out there. So places like Wired, how mainstream newspapers were expressing it, how trade magazines were expressing it, things like that. And I quickly found that the scale at which articles were being generated was far outpacing my ability to collect them. And actually then going through and qualitatively

  • 01:35

    DR. JAMES ALLEN-ROBERTSON [continued]: coding up these articles was taking forever, as well. I had a research assistant across the summer, we got nowhere. It was a very slow process. And it just happened to be that I went to a conference, I'll name drop the Association of Internet Researchers, which had a very good workshop. It was just two hours: web scraping for social scientists

  • 01:55

    DR. JAMES ALLEN-ROBERTSON [continued]: in Python. And I went and I was there for the two hours. And I went away and that was it. I was hooked. Absolutely. And it took two more years, after that point, to get to where I'm using it in my research, but that was the starting point.

    Uber Versus Taxi Drivers Algorithm

    One of the research projects I've got going on at the moment

  • 02:19

    DR. JAMES ALLEN-ROBERTSON [continued]: is I'm looking at the way in which workers in the gig economy understand the way that they're managed by their algorithmic managers, so the apps. So if you're an Uber driver, the app that tells you who to pick up, where to go, how much you're getting paid, and that kind of thing. And what I'm interested in is kind of how they understand what the algorithm is

  • 02:40

    DR. JAMES ALLEN-ROBERTSON [continued]: doing in the background. And whether they care about the algorithm at all. And I was interested because a lot of articles about Uber drivers use qualitative interview methods. And they go and they talk to the Uber drivers. And when they get there they talk to them and they say, "What do you think about the algorithm?" And then they tell them. But I didn't know whether that was

  • 03:00

    DR. JAMES ALLEN-ROBERTSON [continued]: because they were being asked about the algorithm, or they were actually talking about it, or did they understand the algorithm in a different way. So I wanted to take a different approach. So the method, essentially, is I'm drawing from a couple of forums. There's an Uber drivers forum and a forum for taxi drivers, regular taxi drivers. Together in aggregate it's about 50,000 different forum threads

  • 03:26

    DR. JAMES ALLEN-ROBERTSON [continued]: of discussions of these drivers. And the project essentially is taking all of these threads, and then using text processing and clustering techniques to find out what the key themes are within those threads. So what the key topics of discussion are. So you can understand really what the drivers are talking about as Uber drivers, what the drivers talk about

  • 03:47

    DR. JAMES ALLEN-ROBERTSON [continued]: as taxi drivers. But what's most interesting is where the drivers converge in the topics they talk about and where they diverge. And it's in that divergence that you get a sense of what's different about being managed by an algorithm that makes the job different.

    How do you gain access to the forum data?

  • 04:08

    DR. JAMES ALLEN-ROBERTSON [continued]: So access to the forums. That, in itself, is an ethical question, first of all. So it went through ethical review. And we talked a lot about getting permission from the forum itself, getting permission from the individual drivers. Now once you're doing large scale text collection, getting permission from every single forum user

  • 04:30

    DR. JAMES ALLEN-ROBERTSON [continued]: becomes problematic. And so in this discussion with my institution, really we came up with this view that, well, if the forum is open, if you don't need a username and password to get in, if the drivers are all using pseudonyms, and if that forum itself states that this is a public space and you should be using a pseudonym, it could be considered public data, to some extent.

  • 04:51

    DR. JAMES ALLEN-ROBERTSON [continued]: Now you still have to be careful about what individual pieces you quote because it can all be derived back to at least these pseudonyms. So you still have to have those ethical questions. But in terms of actually getting access, it's, in a sense, freely available there. As long as it's treated properly. In actually getting the data, the extraction process, I program in Python and I used a library called Scrapy.

  • 05:15

    DR. JAMES ALLEN-ROBERTSON [continued]: And Scrapy is designed for kind of large scale web scraper design. And so I built my own scraper that essentially traverses the forum and pulls as many threads as it can and then loads them into a database, which I kind of have running as a back end for all my projects, as well.

    What skills do you need to build a scraper and to use Python?

  • 05:41

    DR. JAMES ALLEN-ROBERTSON [continued]: Making a web scraper with Python is actually relatively easy. It's almost kind of one of the first few projects that you could try and do. Not necessarily with Scrapy, but with another library called Requests. So if you have maybe five pages of something that you wanted to get, you could build a web scraper fairly easily, I would say. When I first started learning Python,

  • 06:02

    DR. JAMES ALLEN-ROBERTSON [continued]: I built a web scraper for a different project in about two weeks. And that was really muddling along and getting errors all over the place, really failing at every turn, but still getting through. To use something like Scrapy requires, probably, something a bit more advanced, because it's very fast.
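
The simple Requests-style "few pages" scraper mentioned above can be sketched in a handful of lines. This is a minimal illustration, not the speaker's actual code: the URL and page structure are hypothetical, and the standard library's urllib stands in for Requests (where `requests.get(url).text` plays the same role).

```python
# Minimal sketch of a "few pages" scraper. All URLs and tag structure are
# hypothetical; urllib stands in for the Requests library mentioned above.
import re
from urllib.request import urlopen

def extract_paragraphs(html: str) -> list[str]:
    # Crude <p>-tag extraction; a real scraper would use an HTML parser.
    return [m.strip() for m in re.findall(r"<p[^>]*>(.*?)</p>", html, re.S)]

def scrape_page(url: str) -> list[str]:
    with urlopen(url, timeout=10) as resp:  # fetch one page
        return extract_paragraphs(resp.read().decode("utf-8", errors="replace"))

def demo() -> None:
    # Not called here: would fetch five pages of a hypothetical site.
    for page in range(1, 6):
        print(scrape_page(f"https://example.com/articles?page={page}"))
```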

  • 06:23

    DR. JAMES ALLEN-ROBERTSON [continued]: It's designed in such a way that it's both going out and collecting data and processing data at the same time, so that you can kind of be doing multiple things at once. So it works very quickly. So it's a little bit more of an advanced tool. In terms of telling it, in a sense, what to do. You don't just point it at the website and say off you go,

  • 06:44

    DR. JAMES ALLEN-ROBERTSON [continued]: it's not that clever yet. You have to really tell it, at every stage, what it's going to see. So you kind of say to it, right, you're going to start at this address. And then when you get there, you're going to see a page that looks like this. And you need to find this kind of link that looks like this. And then when you find that link, I want you to go to it and then get all of the pages from that link.

  • 07:05

    DR. JAMES ALLEN-ROBERTSON [continued]: And then I want you to go to each one of those and I want you to find this bit of text that looks like this, and it's coded up like this and it has these tags attached to it. And I want you to get that and grab it. And then I want you to put it into the database, into this section. So you have to be very explicit. It's like giving instructions to a toddler, you have to make every step very explicit: you stand here,

  • 07:26

    DR. JAMES ALLEN-ROBERTSON [continued]: you walk there, you take five steps. You have to be very, very clear to it. And I suppose one hurdle for that is that you have to understand HTML quite well. Because when you're looking at the pages, you're looking at the page, not as it looks to you in a browser, but how the computer sees it, which is as a kind of a set of HTML code.

  • 07:50

    DR. JAMES ALLEN-ROBERTSON [continued]: But actually there's a lot of tools out there that really help you. I mean, they're even built into a lot of browsers these days. If you right click on something on the page and you go to something like Inspect, a whole different window will open up and it'll show you in the page where that object is and how you can get to it. So there's ways to do it easily.
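
Those explicit, step-by-step instructions look roughly like this in code. In Scrapy they would be written as a Spider class with CSS selectors; the sketch below uses only the standard library, and the `class="thread-link"` marker is a hypothetical stand-in for whatever the real forum's links look like.

```python
# "Find this kind of link that looks like this": step one of the explicit
# instructions described above, using only Python's standard library.
# (Scrapy expresses the same steps as a Spider class with CSS selectors.)
from html.parser import HTMLParser

class LinkCollector(HTMLParser):
    """On the index page, collect every <a> tag carrying a hypothetical
    class="thread-link" marker."""
    def __init__(self):
        super().__init__()
        self.links: list[str] = []

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "a" and attrs.get("class") == "thread-link":
            self.links.append(attrs.get("href", ""))

def thread_links(index_html: str) -> list[str]:
    parser = LinkCollector()
    parser.feed(index_html)
    return parser.links

# The remaining steps would fetch each collected link, extract the post text
# the same way, and insert each record into the database.
```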

  • 08:12

    DR. JAMES ALLEN-ROBERTSON [continued]: How do you manage your data?

    When you start off in these kinds of things, first you start storing your data in just little text files on your computer because that's the simplest way to do it. And then you start finding that, when you're trying to open that folder, your computer starts

  • 08:32

    DR. JAMES ALLEN-ROBERTSON [continued]: grinding to a halt because it's got 10,000 text files in there. So then you think, OK, I need to be a bit more strategic and have better storage. So then you can use things like DataFrames. So in Python, there's a library called pandas, which is a really good data storage and structuring system. And you can save data in a DataFrame

  • 08:54

    DR. JAMES ALLEN-ROBERTSON [continued]: and it's able to then save it to a computer as a single file, very efficiently, and it works very nicely. That can work well up to data where you've got maybe 10, 20, 30,000 instances or so of observations. But once you start collecting at a larger scale, or even once you start collecting at greater speed,

  • 09:17

    DR. JAMES ALLEN-ROBERTSON [continued]: then you need to get into databases, because your DataFrame won't necessarily be able to keep up with the speed at which you're collecting data. So when I implemented the Scrapy web crawler, I had to implement a database as well. And there are Cloud solutions, but I

  • 09:38

    DR. JAMES ALLEN-ROBERTSON [continued]: found them more complicated than actually just implementing one on my desktop computer. So I have a database running on my desktop at work. And it's kind of my general purpose database. It uses the Mongo framework, which is kind of an emerging NoSQL framework for databases.

  • 10:03

    DR. JAMES ALLEN-ROBERTSON [continued]: It's really simple. It's much more simple than actually trying to use the classic SQL database system. And I recommend it because, really, it's much more flexible than SQL itself. And it acts as a general purpose store for everything. So I've got a Twitter monitor set up on my desktop right now collecting for another project and it's feeding the Mongo database. And I've got another project that's collecting data

  • 10:24

    DR. JAMES ALLEN-ROBERTSON [continued]: and it goes into that database in a different collection. So it took me a long time to really, A, work out how to set up this database, and B, work out how to use it. But once I did it really paid dividends because it allowed me to really use it as my central hub for all data. And it, itself, could be a kind of an analytical tool as well. You can do some analysis on it to get descriptive statistics

  • 10:47

    DR. JAMES ALLEN-ROBERTSON [continued]: out of it really fast. So the largest dataset I've got in there at the moment is 4 million tweets from the Catalonia referendum. And you can ask it a query: find me every tweet from this particular user. And it will give you an answer within 5 to 10 seconds. It's much faster than any other system you can use.
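
That kind of query is a one-line filter document in Mongo's query language. A sketch with pymongo follows; the database, collection, and field names are hypothetical, and `demo()` (not called here) assumes a MongoDB instance running locally.

```python
# "Find me every tweet from this particular user", sketched as a Mongo query.
# Field, database, and collection names are hypothetical examples.
def tweets_by_user(screen_name: str) -> dict:
    # Build the MongoDB filter document for one user's tweets.
    return {"user.screen_name": screen_name}

def demo() -> None:
    # Not called here: assumes the pymongo package and a local MongoDB.
    from pymongo import MongoClient
    coll = MongoClient("mongodb://localhost:27017")["research"]["tweets"]
    coll.create_index("user.screen_name")  # an index is what keeps this fast
    for tweet in coll.find(tweets_by_user("some_user")):
        print(tweet.get("text"))
```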

  • 11:08

    DR. JAMES ALLEN-ROBERTSON [continued]: How do you prepare your data for analysis?

    Really, analysis of data at this scale requires you to work in subsets, to really draw what you need. So all the data sits generally in this database

  • 11:31

    DR. JAMES ALLEN-ROBERTSON [continued]: until I need it. And then depending on what kind of operation I want to do, I then pull, selectively, different fields from the database, so different parts. So maybe, for example, from a database of tweets, I'll get it to give me all of the dates that the tweets were tweeted and the user names, perhaps.

  • 11:53

    DR. JAMES ALLEN-ROBERTSON [continued]: And then I'll pull that down and load that into a pandas DataFrame. And then you can use pandas, which is not just about storage and structuring, but is also an analytical tool, to get it to rearrange the data so it's in order, get it to re-sample the data so that it gives you a count of how many tweets per day, per month, per year, whatever.
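
The pull-and-resample step can be sketched with pandas. The records below are hand-made stand-ins for the date and username fields pulled from the database:

```python
# Load pulled fields into a DataFrame, then resample to tweets per day.
import pandas as pd

records = [  # stand-ins for {date, username} fields pulled from the database
    {"created_at": "2017-10-01 09:00", "user": "a"},
    {"created_at": "2017-10-01 17:30", "user": "b"},
    {"created_at": "2017-10-02 11:15", "user": "a"},
]

df = pd.DataFrame(records)
df["created_at"] = pd.to_datetime(df["created_at"])
per_day = df.set_index("created_at").resample("D").size()  # count per day
print(per_day)
```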

  • 12:13

    DR. JAMES ALLEN-ROBERTSON [continued]: And then what I found, actually, is that in terms of analysis, when you're working at this kind of scale, you can get statistical measures, which provide you a nice round number to give you a general idea, like the mean number of likes or the mean number of re-tweets. But what's really more useful for me is visualization. So when I got to the analysis stage in a lot of my projects,

  • 12:36

    DR. JAMES ALLEN-ROBERTSON [continued]: I then had to move into actually learning about visualization libraries. Visualizing in two dimensions, in three dimensions, visualizing time series data. Because as a human, you can't really grasp it until you can see it sometimes. So that was a big challenge for me, is that I got the data and I knew what I could do with it.

  • 12:57

    DR. JAMES ALLEN-ROBERTSON [continued]: But you can't look at 10,000 rows of data by hand. You have to find a way of aggregating it somehow, and visualization is useful for that.

    How do you analyze your data?

    I'm in the middle of the analysis stage for the Uber

  • 13:17

    DR. JAMES ALLEN-ROBERTSON [continued]: project. And it is a really good example, I think, of where these techniques give you new insights, but also you need to go back to the traditional foundations of what it is we do. So to analyze this Uber data, what I've essentially done is used text analysis techniques,

  • 13:43

    DR. JAMES ALLEN-ROBERTSON [continued]: which render the text for every single forum thread into a numerical representation which expresses what makes that document more different or more similar to the others. And what you get when you do that is essentially like a massive spreadsheet. So every row is a forum thread. And then there is a column for every possible word that

  • 14:04

    DR. JAMES ALLEN-ROBERTSON [continued]: could be in that piece of text. So I think it was something like 50,000 rows and maybe like 70,000 columns, or something like that, a huge kind of thing. And then what you need to do, if you actually want to turn it into a visualization, is actually, I think, conceptually quite

  • 14:27

    DR. JAMES ALLEN-ROBERTSON [continued]: simple: you use what's called dimensionality reduction techniques. Common ones are PCA, but there's some newer ones like t-SNE, which are built for really big datasets. And what they do is they essentially take those 70,000 columns and shrink them down to three, or two,

  • 14:48

    DR. JAMES ALLEN-ROBERTSON [continued]: and try and retain that variation between the rows. And then you can use those three or two as essentially your x, y, and/or z positions on a plot. And so what I did was, I essentially did this dimensionality reduction and plotted it into a 3D space. And what I've now got is this kind of big ball of dots in three dimensions.
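
The vectorize-then-reduce pipeline he describes can be sketched as a toy example. The library (scikit-learn) and the four stand-in threads are assumptions of this sketch; PCA is used here, and t-SNE (`sklearn.manifold.TSNE`) slots into the same spot for larger corpora.

```python
# Toy sketch: forum threads -> document-term matrix -> 3-D coordinates.
# scikit-learn assumed; the four threads are invented stand-ins.
from sklearn.decomposition import PCA
from sklearn.feature_extraction.text import TfidfVectorizer

threads = [
    "surge pricing dropped again this weekend",
    "surge pricing is unpredictable lately",
    "airport rank queue took two hours",
    "best routes to the airport rank",
]

matrix = TfidfVectorizer().fit_transform(threads)   # rows: threads, cols: words
coords = PCA(n_components=3).fit_transform(matrix.toarray())
print(coords.shape)  # one (x, y, z) point per thread, ready to plot
```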

  • 15:11

    DR. JAMES ALLEN-ROBERTSON [continued]: And the more similar the documents, the closer they are together. Then what you can do is run a clustering algorithm, which essentially takes all those coordinates and works out which ones are closer to one another than others, looks at where the divisions are likely to be. And then if you take those labels and color the dots by those labels, you can then see

  • 15:32

    DR. JAMES ALLEN-ROBERTSON [continued]: that this is one cluster over here, this is one cluster over here. And then you can use another technique to get out the words that are most significant for the documents within each cluster. And at that point you've kind of got a labeling system that has been derived by the computer. Without you ever really telling it anything about the data, it tells you, right, these are the words that are most significant for all the documents

  • 15:53

    DR. JAMES ALLEN-ROBERTSON [continued]: within this group over here. So that gives you a really good big macro analysis. But sometimes it can be the case that the top words aren't clear about why those are the top words. Or you want to understand, well, OK, those are the top words, but why is that interesting?

  • 16:13

    DR. JAMES ALLEN-ROBERTSON [continued]: And so the stage I'm back at, at the moment, is that I'm actually taking random samples from those clusters, exporting them out, and putting them into a regular kind of qualitative data analysis software. And going through them by hand looking to see what the themes are. Now, you get a random sample of 50 or so, and the cluster is represented by maybe 10,000.
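
The per-cluster random sample can be drawn in one line with pandas. The column names and sizes below are scaled-down stand-ins for the 50-per-cluster samples described:

```python
# Draw a fixed-size random sample from every cluster for hand coding.
import pandas as pd

df = pd.DataFrame({
    "thread_id": range(100),
    "cluster":   [i % 4 for i in range(100)],  # pretend cluster labels
})

# Up to 5 random threads per cluster here (50 in the described project).
sample = df.groupby("cluster").sample(n=5, random_state=0)
print(sample.shape)
```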

  • 16:36

    DR. JAMES ALLEN-ROBERTSON [continued]: But that 50 will give you a decent enough insight into starting to understand. That's where I am now. And it remains to be seen what we find.

    What challenges have you come across?

    I would say in any kind of project like this, using these kinds of methods, we're importing them.

  • 16:57

    DR. JAMES ALLEN-ROBERTSON [continued]: We're importing these methods from computer science, from these large companies. And one of the algorithms that I use for determining the similarity of documents was developed by Google, you know, [INAUDIBLE] big algorithm that's really useful. But these companies don't necessarily design these things with research in mind. They design them with predictive text in mind or search

  • 17:19

    DR. JAMES ALLEN-ROBERTSON [continued]: results in mind. So it can be quite difficult to understand what the technique is doing, and whether that is meaningful for research. Or are you just essentially producing another version of Google on a very small scale? Is it something that we can derive meaning from as social scientists? I think people find that quite hard when they're

  • 17:41

    DR. JAMES ALLEN-ROBERTSON [continued]: learning this stuff because all the tutorials you find online, all of the online courses, they're all geared to these computer science applications, rather than social science research. So it's up to you to really make that decision of whether this is a meaningful technique, or whether it can be made meaningful. I mean, for example, a lot of the techniques, whenever

  • 18:02

    DR. JAMES ALLEN-ROBERTSON [continued]: you look at tutorials online, they stress so much that before you use your model, you've got to train it on Wikipedia because that will give you a highly representative picture of the English language. Well, actually, if you don't train it on Wikipedia first, if you just train it on your really reduced, very domain specific subset of data, say Uber forums,

  • 18:24

    DR. JAMES ALLEN-ROBERTSON [continued]: that means that you may not be getting a representative picture of the English language. But you're getting a picture of the English language as specific to the domain of Uber drivers, which can tell you something about the way Uber drivers talk and the kinds of words that they tend to associate together and things like that. So you have to be willing to go against the grain. Another challenge is ethics, I think.

  • 18:46

    DR. JAMES ALLEN-ROBERTSON [continued]: And it's a really important one because it really isn't considered in these tutorials. It really is a case of, if you can work out how to do it, then go ahead and do it. But I think as social scientists, we have to hold ourselves to a higher standard: just because we can do something, we shouldn't necessarily do it before considering the consequences for us, for the people

  • 19:09

    DR. JAMES ALLEN-ROBERTSON [continued]: that we're harvesting this data from, the consequences of publishing about this data. It's really important. And it's not just about getting things through your university ethics panel, because your university ethics panel may not necessarily understand what you're doing either. So it's up to you to really be on the ball about what is acceptable and what is ethical in this domain

  • 19:30

    DR. JAMES ALLEN-ROBERTSON [continued]: and collaborate with people to really get a better understanding of what is right to do with these techniques.


Dr. James Allen-Robertson, PhD, Lecturer in Sociology at the University of Essex, discusses his research using web scraping and clustering to study algorithmic management, including his interest in this type of research, the Uber versus taxi driver algorithm project, accessing data, building a web scraper, using Python, managing the data, preparing the data for analysis, and challenges faced.
