  • 00:11

    I'm Scott Hale, a data scientist at the Oxford Internet Institute in the University of Oxford. I'm going to talk about my research today looking at multilingualism online. When you go to a website, particularly outside of the United States, often one of the first things you're confronted with is a screen

  • 00:32

    asking you to select your region and language. And even if it's not the first screen, often there are many versions of sites that have been localized to different regions and languages. On Wikipedia, for instance, you have particular language editions, and you only can search one of those language editions at a time.

  • 00:52

    Even where it's not explicit, there's often implicit signals of your language being sent. So unbeknownst to most users, your browser's always sending a ranked list of the languages you prefer, and many sites use this information to prioritize content into the language that your browser says is your preferred language. Now all of these divisions between languages

  • 01:16

    actually stand in contrast to the offline findings of linguistics that across the world the norm is actually multilingualism. Most individuals speak more than one language, yet primarily our websites are designed with users of one language in mind.
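
The ranked language list mentioned above travels in the HTTP `Accept-Language` request header, where each language tag can carry a `q` weight between 0 and 1 (a missing weight counts as 1). As a minimal sketch of how a site might recover the preference order, the function name `parse_accept_language` and the sample header below are illustrative assumptions, not something taken from the talk:

```python
def parse_accept_language(header):
    """Return language tags ordered from most to least preferred."""
    ranked = []
    for position, item in enumerate(header.split(",")):
        item = item.strip()
        if not item:
            continue
        # Split "de;q=0.9" into the tag and its optional parameters.
        tag, _, params = item.partition(";")
        quality = 1.0  # a tag without an explicit q weight defaults to 1
        for param in params.split(";"):
            name, _, value = param.strip().partition("=")
            if name.strip().lower() == "q":
                try:
                    quality = float(value)
                except ValueError:
                    quality = 0.0  # treat a malformed weight as "not wanted"
        ranked.append((tag.strip(), quality, position))
    # Highest weight first; ties keep the order in which the browser sent them.
    ranked.sort(key=lambda entry: (-entry[1], entry[2]))
    # A weight of 0 marks a language as not acceptable, so drop it.
    return [tag for tag, quality, _ in ranked if quality > 0]


# Hypothetical header for a user who prefers Swiss German, then German, English, French.
print(parse_accept_language("de-CH, de;q=0.9, en;q=0.8, fr;q=0.7"))
# → ['de-CH', 'de', 'en', 'fr']
```

A site that serves only the first entry hides the rest of the list, which is one way a multilingual user ends up being treated as a monolingual one.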

  • 01:38

    As we move towards websites that increasingly are using user-generated content, that is, content created by their users, we find that there's actually a large difference in the amount and the type of information available in different languages. So if we take Wikipedia as an example, the largest edition, English, is about three times larger

  • 02:00

    than the second largest language, German. But it's not the case that the English edition simply has all the articles in the German edition plus more. In fact, if we look at the overlap in content between these two editions, we find that only about half of the articles in the German edition of Wikipedia have an equivalent article in the English edition.

  • 02:23

    And of course, it's a much smaller percentage of articles-- given the size differences-- of English articles that have an equivalent in German. So very quickly we find that different content is available in different languages. And in general, there's a sort of self-focused bias, so more information and content is

  • 02:43

    available about regions that speak the language, and less information is available about regions where the language is not spoken. In some cases, this means users even feel compelled to access content in another language, despite not having strong foreign language skills, as demonstrated in the study of Uzbek users using a survey methodology.
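
The English and German overlap figures quoted earlier follow from simple arithmetic: the same set of shared articles is a large share of the smaller edition and a small share of the larger one. The sketch below works through that step; the article counts are hypothetical round numbers chosen only to match the "three times larger" and "about half" ratios from the talk, and `coverage` is an illustrative helper name.

```python
def coverage(shared, size_a, size_b):
    """Share of each edition's articles that have an equivalent in the other."""
    return shared / size_a, shared / size_b


# Hypothetical counts: English is three times the size of German,
# and half of the German articles have an English equivalent.
german_articles = 1_000
english_articles = 3 * german_articles
shared_articles = german_articles // 2

german_share, english_share = coverage(shared_articles, german_articles, english_articles)
print(f"German articles with an English equivalent: {german_share:.0%}")
print(f"English articles with a German equivalent: {english_share:.0%}")
```

Under these assumed ratios, roughly one in six English articles would have a German counterpart, which is the "much smaller percentage" the talk refers to.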

  • 03:06

    So my work here is to look at some of this previous work that had been done, but then to apply new methods using large-scale data analysis to understand the extent to which language really is a barrier in user-generated content platforms, and the extent to which multilingual users might be

  • 03:27

    serving as bridges helping content move from one language to another. To do this, I took a month's sample of data from Twitter and also a month of data from Wikipedia. It's worth pausing here for a moment to say that actually a lot of my time was spent collecting this data, cleaning the data,

  • 03:49

    and removing bots and vandals, and things of this nature. I don't have time to address this today, but a large part of large-scale data analysis, or data science, is often in this cleaning aspect of data. So once I've cleaned my data, I want to run through five basic findings I've had.

  • 04:12

    So the very first finding is that language indeed is having a large role structuring platforms. This is most apparent in Twitter. So if we take each Twitter user and represent them as a node or a circle in this diagram, and then draw arrows between them to represent the mentions and retweets between these users, we can see that dense areas of the network

  • 04:34

    emerge where users are well connected to one another, but not so well connected to the rest of the network. So these clusters emerge naturally in the network, and if we compare that to language-- we can see in this toy example, for instance, that there are two clusters where all the users have the same common language and one cluster where there's

  • 04:54

    two languages being spoken. If we look at the real Twitter network, we can find about 20,000 clusters. Most of these have about 50 users in them, although a number have many more users. There's about seven clusters with over 10,000 users in them. Now, the majority of these clusters-- about 73% of them--

  • 05:15

    all the users have a single most common language among them. And this is much higher than we would expect if language were just randomly distributed across this network. So if we took the same distribution of languages, but randomly swapped languages between users, we'd only expect this percentage to be about 0.12%,

  • 05:36

    instead of the 73% that we observed. So language really is having a large role structuring this network. Now, the second major finding is that these multilingual users-- users who are contributing content in multiple languages-- do play a role in helping bridge these language divides.
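
The shuffling baseline described for the first finding (keep the same distribution of languages, but swap them randomly between users) can be sketched as a small permutation test. This is a toy illustration under stated assumptions, not the analysis from the talk: the cluster assignments and languages are invented, a cluster counts as single-language only when every user in it carries the same language label, and the function names are hypothetical.

```python
import random
from collections import defaultdict


def share_single_language(cluster_of, language_of):
    """Fraction of clusters in which every user has the same language."""
    languages_by_cluster = defaultdict(set)
    for user, cluster in cluster_of.items():
        languages_by_cluster[cluster].add(language_of[user])
    single = sum(1 for langs in languages_by_cluster.values() if len(langs) == 1)
    return single / len(languages_by_cluster)


def shuffled_baseline(cluster_of, language_of, trials=1000, seed=0):
    """Average single-language share after randomly reassigning languages."""
    rng = random.Random(seed)
    users = list(language_of)
    labels = [language_of[user] for user in users]
    total = 0.0
    for _ in range(trials):
        rng.shuffle(labels)  # same language counts, random owners
        total += share_single_language(cluster_of, dict(zip(users, labels)))
    return total / trials


# Toy network mirroring the slide: two single-language clusters, one mixed.
cluster_of = {"a": 0, "b": 0, "c": 1, "d": 1, "e": 2, "f": 2}
language_of = {"a": "en", "b": "en", "c": "ja", "d": "ja", "e": "es", "f": "pt"}

observed = share_single_language(cluster_of, language_of)
baseline = shuffled_baseline(cluster_of, language_of)
print(f"observed: {observed:.2f}, shuffled baseline: {baseline:.2f}")
```

An observed share far above the shuffled baseline is the kind of gap the talk reports, with 73% observed against roughly 0.12% expected by chance.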

  • 05:56

    I don't have time to discuss this in depth today, but I refer you to my website, or to the papers that I've mentioned in this tutorial, to look for further details. So the question then remaining is where are these multilingual users coming from? And what languages are they editing?

  • 06:16

    Now consistent with this study I mentioned of Uzbek users using a survey methodology, users in that study were feeling compelled to go to foreign language content because they were aware of the limitations in their language. And indeed, if we look at the data on Wikipedia, we can see a strong correlation between the number of users

  • 06:36

    primarily editing a language edition, and the percentage of users who also edit a second edition beyond that first language edition. And so English is at one extreme. It's the largest edition, but also having a relatively low percentage of users who edit multiple languages. And on the other side, we have languages like Esperanto, which are much smaller in size,

  • 06:59

    but have a much larger percentage of individuals editing multiple languages. On Twitter, the correlation isn't as clear. So on Wikipedia, the correlation is about minus 0.7. On Twitter, we see a correlation of only about minus 0.25. This may be due to users using the platform in different ways.
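
To make the size-versus-multilingualism correlation concrete, here is a minimal sketch of computing a correlation coefficient between edition size and the share of multilingual editors. The talk does not say which coefficient or scaling was used, so the choice of Pearson's r on log-scaled sizes is an assumption, and the four editions and their numbers are invented placeholders rather than real Wikipedia figures.

```python
import math


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    covariance = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    spread_x = sum((x - mean_x) ** 2 for x in xs)
    spread_y = sum((y - mean_y) ** 2 for y in ys)
    return covariance / math.sqrt(spread_x * spread_y)


# Hypothetical editions: (number of primary editors, share who edit a second edition).
editions = {
    "edition_a": (100_000, 0.10),
    "edition_b": (10_000, 0.20),
    "edition_c": (1_000, 0.35),
    "edition_d": (100, 0.50),
}

log_sizes = [math.log10(size) for size, _ in editions.values()]
multilingual_shares = [share for _, share in editions.values()]

r = pearson(log_sizes, multilingual_shares)
print(f"correlation between size and multilingual share: {r:.2f}")
```

A negative coefficient means larger editions tend to have a smaller share of multilingual editors, which is the direction of the minus 0.7 and minus 0.25 values quoted above.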

  • 07:21

    Of course, Twitter and Wikipedia are very different types of platforms. It also may be due to some of the inherent difficulties in assessing the languages used on Twitter. It's very clear on Wikipedia whether you edit a language edition or not. On Twitter, whether or not you use a language has some error in that detection because users

  • 07:43

    may mix multiple languages together, or one language may be detected incorrectly. So primarily, users in smaller size languages are more likely to be multilingual than users in larger size languages. But when they cross to a second language, what language do they edit? In this diagram of Wikipedia editions, each edition is sized by the number of users

  • 08:05

    primarily editing that edition, and the editions are connected together based on the number of users who edit both those editions. And only the strongest edges are shown here. And we can see that English plays a very central role in this network, but also that there are connections between the Romance languages, say Portuguese, Spanish,

  • 08:25

    French, Italian, and some other regional connections as well. Again, on Twitter we see a very similar pattern. Again, English is playing a very central role, but we also see the Romance languages connected, as well as Korean and Japanese. Again, these are just the strongest connections in the data. So these two prior findings-- the correlation

  • 08:49

    between language size and amount of multilingualism, and indeed, which languages users would cross to in a second language-- were predicted by some of the previous studies using different methodologies. And here, with large-scale data, we were able to confirm those findings. The last finding that I want to tell you about

  • 09:10

    was something that hadn't been predicted by any other methodology, and so this gives us new insight to confirm and further investigate this finding using other approaches. And this finding was that multilingualism was also correlated with activity. And this can be seen most clearly in Wikipedia.

  • 09:30

    So on this graph, we see a distribution of edits by multilingual users who edit more than one edition and by monolingual users who edit only one edition. And on average, the multilingual users are making nearly 2 and 1/2 times as many edits as the monolingual users. Now there could be many different reasons for this

  • 09:53

    and it will be up to these other methodologies to further dig into this finding to understand the reasons behind this. I should note that much of this increased activity is actually in their primary language, and it's still only about 2 and 1/2% of the activity that these multilingual users do in a second language.

  • 10:14

    So overall, using large-scale data on Twitter and Wikipedia, we've been able to show that language has a strong role structuring these platforms, but that users who contribute content in multiple languages can form bridges between these languages. Overall, there's a correlation where users from smaller size languages are more likely to contribute content

  • 10:35

    in multiple languages. And when they cross language divides, they're more likely to cross to larger languages, such as English, but also to regionally related languages, such as the cluster we saw with Romance languages. And finally, we saw an unexpected correlation between activity and multilingualism.

  • 10:58

    So what are the broader implications of this? Well, offline, linguistics tells us that multilingualism is the norm, but online, we're seeing only about 10% of Twitter users, or 15% of Wikipedia users, contribute content in multiple languages. Even in regions such as Catalonia, where we would expect high levels of multilingualism,

  • 11:21

    we're not necessarily seeing that online. So only 35% of the users who primarily edit the Catalan edition of Wikipedia also edit any other language edition at all. So unlike offline environments, of course online, we're free to change the design of sites in different ways. And my ongoing work is looking at how these different design

  • 11:44

    changes affect user behavior on the platforms. In particular, one option to increase multilingualism is to identify starter tasks and suggest these to multilinguals. Easy, low-barrier entry tasks might be something to do with manipulating images, for instance, which can be more easily transferred across languages.

  • 12:04

    We also need to make users aware of the information that's available in a language that isn't their primary language. Of course, user choice is a huge part of this. We don't want to bombard users with content in a language they don't understand. So we need to have personalization in the same way that we have it now, but we need to make users aware

  • 12:26

    that this personalization is happening, and that there is additional content in another language, should they wish to interact with that. And we can make tools available to ease those interactions. I've talked in this study about multilingualism online, and I'll leave you with one closing thought, which is a challenge, particularly if you're

  • 12:47

    an English speaker, or a speaker of another large language, to think about not only what content is available in your primary language, but also the content which isn't available in your primary language that might be available in another language.


Dr. Scott Hale discusses his research into multilingualism on the internet. Even though more than half of the global population is bilingual, a lot of internet content is available in only one language. Hale found that most users function in only one language online. He speculates about the role of browser prioritization based on language preferences, and offers ideas for increasing multilingualism online.


Researching Multilingualism Online: Big Data Methods