  • 00:05

    [Revealing the Content of the Edu Blogosphere]

  • 00:11

    SARAH HEWITT: So I'm in my final year of my PhD. [Sarah Hewitt, Postgraduate Student, University of Southampton] I'm going to submit, I hope, at the end of August. So what I'm going to be talking to you about and showing you today is effectively my PhD but without all the tedious explanation and all the kind of wordy bits. And like so many social scientists,

  • 00:34

    SARAH HEWITT [continued]: you have to get to grips with computer science because, increasingly, our data is sitting on the web in various different places. And you have to get your hands dirty. If you're a PhD researcher, you have to do it yourself. If you're not, if you're lucky enough to have gone through that process, then you can probably collaborate with somebody and they can do it for you.

  • 00:54

    SARAH HEWITT [continued]: But you do have to engage with computer science. [How Can Social Media Research Provide Insight Into Research Questions? What topics have been discussed by the Edu community? Why is this even important? How can I find out? "This project is built on the acknowledgement that policymakers do not often consider teachers' voices in the policymaking process, but also on the hope that if enough voices are heard, they will have no choice but to listen."] So, as I said, I used to be a secondary schoolteacher before I did my PhD. And I was having a bit of a rough time, to be honest, as a teacher. That's one of the reasons why I took a career break. But I spent some time on social media, on Twitter,

  • 01:16

    SARAH HEWITT [continued]: and I discovered there's a really big edu-community on Twitter. They write blogs, they use Twitter to promote their blogs. I found them and realized that some of the issues that I was facing were exactly the same as issues that other people were facing, and that the difficulties that my school had were shared by other schools up and down

  • 01:38

    SARAH HEWITT [continued]: the country. So I realized that I wasn't alone. And I know that when people think about social media--we've talked today about Facebook, Twitter, Instagram, Reddit--but blogging is also a form of social media. And of course, because you can link, you can comment on other people's blogs, there is a social network community there.

  • 02:00

    SARAH HEWITT [continued]: So I'm going to start with a second highlighted question which is, why this-- or my research, I think--is even important? And I think we can all agree that education is treated pretty much like political football in the UK, and of course in other countries. And the one voice that seems to be excluded

  • 02:21

    SARAH HEWITT [continued]: is the voice of teachers, the voice of edu-professionals. We are the people that don't appear to be heard very often. And I left teaching when Michael Gove was the Secretary of State for Education. And he was, of course, in charge of or directing Ofsted, who come in and investigate schools and give them a judgment as to how well they think they're doing.

  • 02:44

    SARAH HEWITT [continued]: And also, of course, there was a background at the time of schools being encouraged to become academies. And what happened was that some people made the connection between a school being given a judgment that it was failing and then being asked to convert to an academy, which fitted in with the government's policy. As a result, of course, there was an awful lot of criticism of that policy, or what

  • 03:06

    SARAH HEWITT [continued]: was perceived to be a policy. And some of that criticism was in the form of blogs written by some very important people. And there's some evidence to suggest that those blog posts were read, possibly by Michael Gove but certainly by other people. And as a result, Ofsted were encouraged to change their inspection policy and not to--

  • 03:29

    SARAH HEWITT [continued]: if you like-- demand to see the things that were causing schools to fail. So there is some evidence that teacher voice was heard at that point. The quote in blue actually comes from a researcher in the US, Kirsten Greene. I found her PhD and she was looking at very similar issues. And all I can say is that through the research that I've

  • 03:50

    SARAH HEWITT [continued]: done, we may think it's bad in the UK but it's nothing compared to the situation that they face in the US. And of course they've got other problems now as well. [The Edu-blogosphere--A History] ["The increasing availability of digitized text presents enormous opportunities for social scientists. Yet hand coding many blogs, speeches, government records, newspapers, or other sources of unstructured text is not infeasible."] Teachers expressing their view, of course, isn't new. The Times Educational Supplement was launched in 1910. And, of course, teachers have been writing to the TES and writing articles for TES ever since then.

  • 04:13

    SARAH HEWITT [continued]: The big thing happened in 1999 when Blogger came along. And Blogger meant that you could write your own blog without having to write all the code to create your web page. So that made life a lot easier. But the real breakthrough, I think, was in 2004 when Blogger were able to give bloggers a unique URL so they had their own unique address

  • 04:34

    SARAH HEWITT [continued]: on the internet, on the web. And that meant that if you're writing a blog, you could put a link in to someone else's blog, someone else could put a link into yours, when you were commenting on a blog you could create that link. And so there is the social network and I think that's what really changed the landscape. [How Can I Find Out?] So, what I want to go on to talk about now

  • 04:55

    SARAH HEWITT [continued]: is where I can start to find out what teachers talk about. And the answer is that it's all on web pages. So if you are on Google Chrome and you right click on any web page and you click on Inspect, you'll see all of that code. So that's all of the code that's used to create the web page that your browser is showing you. And everything I need is somewhere on those pages.
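
As a rough illustration of pulling data out of a page's HTML and writing one spreadsheet row per blog post, here is a minimal sketch using only the Python standard library. It is not the speaker's code: the class name `PostParser`, the helper `page_to_row`, the sample page, and the example URL are all hypothetical, and it assumes the post text sits in plain `<p>` tags, which real blog themes often do not guarantee.

```python
import csv
import io
from html.parser import HTMLParser


class PostParser(HTMLParser):
    """Collect the <title> text and the text of every <p> on one page."""

    def __init__(self):
        super().__init__()
        self._current = None      # "title" or "p" while we are inside one
        self.title = ""
        self.paragraphs = []

    def handle_starttag(self, tag, attrs):
        if tag in ("title", "p"):
            self._current = tag
            if tag == "p":
                self.paragraphs.append("")

    def handle_endtag(self, tag):
        if tag == self._current:
            self._current = None

    def handle_data(self, data):
        if self._current == "title":
            self.title += data
        elif self._current == "p":
            self.paragraphs[-1] += data


def page_to_row(url, html):
    """Turn one page into a [url, title, text] row."""
    parser = PostParser()
    parser.feed(html)
    body = " ".join(p.strip() for p in parser.paragraphs if p.strip())
    return [url, parser.title.strip(), body]


# Hypothetical page standing in for a downloaded blog post.
sample_html = """<html><head><title>My first lesson</title></head>
<body><p>It went well.</p><p>Next time I will try groups.</p></body></html>"""

buffer = io.StringIO()            # in-memory stand-in for a .csv file
writer = csv.writer(buffer)
writer.writerow(["url", "title", "text"])
writer.writerow(page_to_row("https://example.org/post-1", sample_html))
csv_text = buffer.getvalue()
```

Each extra blog platform or theme would need its own rule for where the post text lives, which is where most of the effort in a real harvester tends to go.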

  • 05:19

    SARAH HEWITT [continued]: And what I have to do is find a piece of code that will collect all that data for me and write it into a spreadsheet. That's what they call HTML, HyperText Markup Language. My code, then, will take all of that data and write it into a spreadsheet so I end up with one row per blog post. I've actually got a list of about somewhere near 2,000 blog

  • 05:44

    SARAH HEWITT [continued]: URLs. There is a member of the edu-community on Twitter, who's been there a long time, who made it his mission to create a spreadsheet and list all of these URLs. So he is obviously going to get an acknowledgment on my PhD because he saved me a huge amount of work. I've added a few of my own, but Andrew L did all of the real heavy lifting for me

  • 06:06

    SARAH HEWITT [continued]: in terms of identifying the people that I'm looking for. [How Much Data? A total of 7,786 blog posts from a small sample of 200 blogs (some short posts deleted). How many from a total of nearly 2,000 blogs? Also, 53,305 unique terms.] So, how much data? I'm going to talk a bit about big data, but obviously some of the things that we've heard today, the data set is huge. The data set I'm talking about today is just a sample. This is only 7,786 blog posts written by about 200 bloggers,

  • 06:33

    SARAH HEWITT [continued]: I think. To be honest, I lost count. But once I've got all of my data set, then it is going to be really huge. And there's the real challenge of big data. The other thing to point out here is that within just the small sample that I've been using to develop my methodology, there are over 53,000 unique words.

  • 06:54

    SARAH HEWITT [continued]: And one of the first things that I have to get rid of is I have to reduce those number of words. Computer scientists call this dimensionality reduction. [How Many Lines of Code?] First of all, though, I just want to give you a little--have a little look at that some of my code. So I'm harvesting data from web pages,

  • 07:15

    SARAH HEWITT [continued]: and you'd think that if everybody's writing on WordPress then it's easy, all the data's in the same place on a WordPress site, but it's not. Every WordPress theme puts the data in different places. And then there's Blogger and there's Weebly and there's all the others. So I ended up writing goodness knows how many lines of code

  • 07:36

    SARAH HEWITT [continued]: to try and cover every eventuality to get all the data that I wanted. To be honest, it took me nearly a year of my PhD just to do this particular job and make sure that it was working properly. And I come from a humanities background. I've got no--I had no background in computer science, or even math. So I've had to do this from the ground up. And I have to say I'm quite proud that I've

  • 07:57

    SARAH HEWITT [continued]: managed to do that. So that's the kind of snippet of the code that I've had to write to get the data that I need. People have talked about the kind of tools they've used. Well, I used a lovely little tool just to play with my sample set called Orange, which is sitting in the Anaconda suite of tools,

  • 08:17

    SARAH HEWITT [continued]: which is all Python-based. But the lovely thing about Orange is that you have this workspace, you drag and drop your icons, and you link them together, and it does all the work for you. But it will only do it on a relatively small data set. [Preprocessing and Processing Data] So you can see on the left-hand side there, there's my little blue icon for my corpus. And the next thing I did was I shuffled to randomize that set,

  • 08:42

    SARAH HEWITT [continued]: and then I just took off a 25% sample and did some work on that. If you put a big data set in, Orange just either takes forever to process what you want or it just stops altogether. So it's great for playing around and developing a methodology. But when I actually come to do it properly, then I'll be doing something slightly different.
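
The shuffle-then-sample step described here was done with Orange's drag-and-drop widgets. A plain Python equivalent might look like the sketch below, where the function name `shuffled_sample`, the fixed seed, and the 20 stand-in post IDs are assumptions made purely for illustration.

```python
import random


def shuffled_sample(items, fraction, seed=0):
    """Shuffle a copy of `items` and keep the first `fraction` of them."""
    rng = random.Random(seed)      # fixed seed so the same sample comes back each run
    shuffled = list(items)         # copy, so the original order is left untouched
    rng.shuffle(shuffled)
    keep = int(len(shuffled) * fraction)
    return shuffled[:keep]


post_ids = list(range(1, 21))      # stand-in for 20 blog posts
sample = shuffled_sample(post_ids, 0.25)   # a 25% sample, i.e. 5 posts
```

Fixing the seed is a design choice rather than a requirement: it makes the sample reproducible while a methodology is still being developed.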

  • 09:02

    SARAH HEWITT [continued]: But what I'm going to talk about next is the preprocessing bit. So you can see those green lines. And this is where I'm going to talk about dimensionality reduction and how to get rid of some of the words that I've got in my data set. [Dimensionality Reduction] So what I did was I generated a word cloud. But before I did that, there were about 300 odd words

  • 09:22

    SARAH HEWITT [continued]: that you can get rid of straight away. There's plenty of research to suggest that there are about 300 odd words, like "and" and "but" and "so", that you can get rid of without having any meaningful impact on the accuracy of the results that you're looking for. So I got rid of those. And then I did this word cloud and it just gives you a visual representation of the words

  • 09:45

    SARAH HEWITT [continued]: that are used most often. Again, there's research for people to help them to pick out the next lot of words that they can safely get rid of. But because this is my domain, I can bring my own expertise and experience to this. And so I was able to make a judgment that the words on the right-hand side could also come out--

  • 10:05

    SARAH HEWITT [continued]: because they're not really adding any value, they're just noise--for what I'm looking for. I did talk to a couple of people about this at lunchtime. So, basically, words don't mean anything to a machine. It's numbers. That's how machines work, that's how they analyze data. And so the next step is to reduce the words, if you like,

  • 10:27

    SARAH HEWITT [continued]: to numbers. And this is done by counting the number of times the word is used in a document and then adjusting that count to take into account the length of the document. So if you have a word used a lot in a small document, that will give a particular number. But if that word is used the same number of times in a longer document, then obviously there

  • 10:48

    SARAH HEWITT [continued]: has to be some adjustment, the word is less important. So, effectively my data set is reduced to a load of scores. And you can see there are lots of zeros there because a lot of the words aren't in all of those documents. That's another reason why dimensionality reduction is so important. Because at the end, we're doing a huge mass calculation.
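
The word-to-number step can be sketched as follows: drop common words, count what remains, and divide each count by the document's length. The talk does not name the exact weighting scheme used, so plain length-normalized frequency is an assumption here, as are the tiny `STOP_WORDS` list (the talk mentions roughly 300 such words), the function names, and the two made-up example posts.

```python
from collections import Counter

# Tiny illustrative stop-word list; a real one would be far longer.
STOP_WORDS = {"and", "but", "so", "the", "a", "to", "of", "in", "is", "it"}


def tokenize(text):
    """Lower-case the text, strip punctuation, and drop stop words."""
    words = (w.strip(".,;:!?\"'()").lower() for w in text.split())
    return [w for w in words if w and w not in STOP_WORDS]


def term_scores(text):
    """Score each word as its count divided by the document's word count."""
    tokens = tokenize(text)
    counts = Counter(tokens)
    total = len(tokens)
    return {w: c / total for w, c in counts.items()} if total else {}


def score_matrix(docs):
    """One row per document, one column per word in the shared vocabulary."""
    scored = [term_scores(d) for d in docs]
    vocab = sorted(set().union(*scored))
    rows = [[s.get(w, 0.0) for w in vocab] for s in scored]
    return vocab, rows


docs = [
    "Marking and marking and more marking.",
    "The lesson plan is in the shared folder so marking can wait.",
]
vocab, rows = score_matrix(docs)
```

The word "marking" scores 0.75 in the short first post but only 1/7 in the longer second one, and most other cells are zero because most words appear in only one post. With tens of thousands of unique words, nearly every cell would be zero, which is why trimming the vocabulary matters.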

  • 11:08

    SARAH HEWITT [continued]: The more numbers you can get rid of, the easier that whole process becomes. [Possible Document Space] What we can do then is we can convert--and I'm covering a bit of ground that Josh did here--but we can convert those numbers into coordinates. It's a bit of a math jump I'm making here. But using something called cosine similarity,

  • 11:28

    SARAH HEWITT [continued]: we can represent those documents as points in space. And once we can do that, we can then see, if you like, how those documents might be grouped together. And the closer they are in space, the more likely it is that those documents are talking about the same thing. Or for my example, or for me, it's more likely that those blog posts are talking

  • 11:50

    SARAH HEWITT [continued]: about the same types of things. This is what's called unsupervised learning. You're clustering documents, you're just letting the algorithm get on with the job. The problem is that you need to tell the algorithm how many clusters you're looking for or how many topics you're looking for. And if you've got no idea, there are some algorithms that will give you an approximation,

  • 12:10

    SARAH HEWITT [continued]: but an approximation is all they are. They're of limited use. [Categorization-- Unsupervised] In fact, there is an algorithm that will tell you how many clusters you've got in your set, which I ran. And it told me I had 792 clusters or groups or topics within my data set. I can't deal with that, that's too big.

  • 12:31

    SARAH HEWITT [continued]: I can't sit down and go through 792 groups and decide whether actually they should be a group or not. That's just not feasible. Again, another kind of classic big data problem. [Categorization-- Semi supervised] So the way to do it is--and I think this is, again, the kind of thing that you were talking about-- there's a semi supervised approach. And this is where you say, right, I'm

  • 12:51

    SARAH HEWITT [continued]: going to label some documents and I'm going to give them a category. I don't have to label very many, as long as I've got a few in each category, that will do. And then the algorithm represents the documents as nodes in a network. So we've been looking at lots of network diagrams for Twitter users. Well, it's exactly the same principle but the node is-- instead of being a person-- is a document.
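
A minimal sketch of the semi-supervised idea: a handful of documents carry labels, and each unlabeled document takes the label of the labeled document it is most similar to, with similarity measured by the cosine similarity mentioned earlier in the talk. This nearest-labeled-neighbour rule is a deliberate simplification, not the graph-based algorithm the speaker ran. The function names, the three-number vectors, and the post names are hypothetical; the two category names come from the slide.

```python
import math


def cosine(a, b):
    """Cosine similarity: 1.0 for same direction, 0.0 for no shared terms."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0                 # an empty document is similar to nothing
    return dot / (norm_a * norm_b)


def assign_labels(vectors, known):
    """Give each unlabeled document the label of its most similar labeled one.

    vectors: dict mapping document name -> list of word scores
    known:   dict mapping document name -> label, for the pre-labeled few
    """
    result = dict(known)
    for name, vec in vectors.items():
        if name in known:
            continue
        nearest = max(known, key=lambda k: cosine(vec, vectors[k]))
        result[name] = known[nearest]
    return result


# Made-up word-score vectors for four posts (three words in the vocabulary).
vectors = {
    "post1": [0.9, 0.1, 0.0],
    "post2": [0.0, 0.2, 0.8],
    "post3": [0.8, 0.2, 0.0],
    "post4": [0.1, 0.1, 0.8],
}
known = {"post1": "Resources", "post2": "Reflective practice"}
labels = assign_labels(vectors, known)
```

Here `post3` ends up with "Resources" and `post4` with "Reflective practice", because those are the labeled posts they sit closest to.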

  • 13:12

    SARAH HEWITT [continued]: And the algorithm already knows some of the nodes because I've prelabeled them. And then it just simply says, well, the ones that are closest to the ones you've labeled, those are probably all talking about the same thing, so we can put them in a group. [Semi supervised Learning--Create a "Training Set"] [1. Continued professional development/training 2. Positioning 3. Professional concern 4. Reflective practice 5. Resources 6. Soapboxing 7. Other] So, I labeled 316 posts. And I got my categories from the research

  • 13:33

    SARAH HEWITT [continued]: that's already been done. There has been some research into what teachers and other edu-professionals blog about. But the research is either stuck to one platform, which was specifically designed for the purpose, or it's the manual hand-coding, so just a very small set. But nevertheless, that gave me my categories 1 to 6,

  • 13:57

    SARAH HEWITT [continued]: and I labeled 316 posts that fitted into those, plus I added the 7th, Other, to catch all the other ones. And the next slide I'm going to show you is my results on my sample set. And I'm really pleased with this because it's really pretty. [Results, 2017 is not a complete year, The 2010s were good years for blogging--either more bloggers blogged or the existing bloggers blogged more often, "Drift"-- these are almost certainly not the same bloggers each year, As a teacher, it is heartening to see continued professional development and resources featuring highly] And so you can see how over in my sample

  • 14:18

    SARAH HEWITT [continued]: set over each year how the blogs split out into the categories. And you can also see that the 2010s were key years. There are some provisos here. 2017-- I started gathering this at the beginning of 2017,

  • 14:38

    SARAH HEWITT [continued]: that's why there's only 135 blogs in there. 2004 is the earliest year, only six blogs in that set, of the ones that I gathered. Like I said, the 2010s were good years for blogging. What I don't know is whether it's the same number of bloggers blogging more, or more bloggers

  • 14:59

    SARAH HEWITT [continued]: blogging. And bloggers come and bloggers go. They start off enthusiastically and then they drop off, and then you get other people coming in. I don't know who these people are, all I know is this is my set of blogs and these are the kind of topics that they were talking about. Drift is that idea of people come and people go.

  • 15:20

    SARAH HEWITT [continued]: But as a teacher, it's really interesting to see what people talk about. One of my most interesting ones was resources, because as a teacher you're desperate for ideas as to how you're going to teach whatever it is you want to teach to your class. And it was really nice to see that one of the biggest categories there is resources.

  • 15:41

    SARAH HEWITT [continued]: But what I don't know is what resources, what key stage, what subject they're all about. So it would be really nice to have a look at that in a bit more detail. The other one that worries me-- well, not worries me but could do with more investigation--is the CPD training one. Now, I don't know whether that's teachers writing

  • 16:02

    SARAH HEWITT [continued]: about a training event that they've been to, or whether it's people offering ideas for continued professional development or training, or whether it's blogs that are specifically written to offer those kind of opportunities, to talk about the things that teachers would be interested in knowing. I don't know that. I suspect it may be reporting of events.

  • 16:24

    SARAH HEWITT [continued]: The brief look that I had when I was scrolling through these blog posts and when I was doing my categorization made me think that it actually probably isn't as valuable as I think it might be. But we'll see. The other one I like is the reflective practice. So this is teachers talking about what went well

  • 16:44

    SARAH HEWITT [continued]: in a lesson, what didn't go so well, what they're going to try next time. And it doesn't matter whether you're a trainee teacher, a newly qualified teacher, or you've been doing it for a long time. It's always interesting to read what other people's experience and the kind of things that they found worked. So that's a really nice one. [Future Work, Get all the blogs (or as many as I can), Apply my methodology, Create a timeline of important events (including speeches and media pieces) in Education in the United Kingdom (done), Represent the blog/category count proportionally, Is there a correlation between topics of conversation "inside" and events "outside"?, Create an interactive visualization to illustrate the results]

  • 17:05

    SARAH HEWITT [continued]: So, future work. So, get all the blogs, obviously. I've got pretty much all of them, I'm just filling in the gaps now of the ones that my code missed the first time around. And I've tweaked my code and I can pick up all of those others, which is really good. Obviously I'm going to split my data set up into years and then I'll apply my methodology to each data set in turn.

  • 17:28

    SARAH HEWITT [continued]: I have already created a timeline of all the important events in education. Because what I really want to see next is whether there's some kind of correlation between what happens outside--so the kind of decisions that are made, education acts, even speeches and things that are said by, for example, the Secretary of State

  • 17:48

    SARAH HEWITT [continued]: for Education or somebody from Ofsted--to see whether that has any influence on what teachers blog about. Of course, my diagram doesn't represent blogs proportionally. So that's simply a straightforward example of x blogs this year separate out into these categories.

  • 18:09

    SARAH HEWITT [continued]: And it would be really nice to see whether--I did produce a diagram, but it wasn't anywhere near as pretty as this, so it's not in there, that's one of the reasons. But it would be really nice to see whether actually reflective practices has more blogs proportionally in this year than maybe the previous year or the year that comes after. So I'd like to do that again.
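
Representing the categories proportionally amounts to dividing each category's count by that year's total, so that years with very different numbers of posts can be compared. A small sketch follows; the function name `category_shares` is hypothetical, and the six (year, category) pairs are invented for illustration and are not results from the study.

```python
from collections import Counter, defaultdict


def category_shares(posts):
    """posts: iterable of (year, category). Returns {year: {category: share}}."""
    per_year = defaultdict(Counter)
    for year, category in posts:
        per_year[year][category] += 1
    shares = {}
    for year, counts in per_year.items():
        total = sum(counts.values())
        shares[year] = {cat: n / total for cat, n in counts.items()}
    return shares


# Invented labels for illustration only.
posts = [
    (2015, "Resources"), (2015, "Resources"),
    (2015, "Reflective practice"), (2015, "Other"),
    (2016, "Reflective practice"), (2016, "Resources"),
]
shares = category_shares(posts)
```

In this toy data, "Reflective practice" rises from a quarter of the 2015 posts to half of the 2016 posts even though the raw count stays at one, which is the kind of shift raw counts alone would hide.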

  • 18:30

    SARAH HEWITT [continued]: And the final thing I've got to do is create an interactive visualization to illustrate the results. So, literally a timeline on the bottom. You scroll along the timeline and you see the topics appear, and they'd be represented in size as to how big or important those topics were as we go through time. Some of you might know, I think it's called Gapminder software.

  • 18:51

    SARAH HEWITT [continued]: And there's a really lovely video on YouTube where that kind of idea has been done to plot poverty and birthrate across countries across the world. The guy who did that, whose name escapes me right now, only died last year.

  • 19:04


  • 19:05

    SARAH HEWITT: Thank you, that's the guy. Yep. So I want to do what he did with my data. Yeah. And I'm hoping that I'll be able to do all that and submit by the end of August. Thank you. [APPLAUSE]


PhD candidate, Sarah Hewitt, discusses her thesis—whether it is possible to discover the topics being discussed within the education community using social media, specifically blogs—outlining the challenges she faces sifting through big data, pre-processing and processing the data into something workable, and doing future work on the topic.
