  • 00:06

    [Creating Synthetic Data for Replication & Privacy Protection Using Generative Adversarial Networks]

  • 00:10

    CHRISTIAN ARNOLD: Hi. I'm Chris Arnold. I work at Cardiff University. [Christian Arnold, PhD, Lecturer in Politics, Cardiff University] I do research on applied problems around developing countries, and at the same time I try to use really cutting-edge research methods to solve my substantive problems. Recently, I have been using different kinds of deep learning models to find out more about my substantive research.

  • 00:33

    CHRISTIAN ARNOLD [continued]: [How did you become interested in machine learning methods?] First of all, I have a very solid training in research methods from my graduate school. I got my PhD at the University of Mannheim, where the focus is very much on statistics. And after my postdoc, I worked as a data scientist

  • 00:55

    CHRISTIAN ARNOLD [continued]: for a couple of years, working a lot with computer scientists. That's when I got a bit more interested in machine learning methods. At this conference we are presenting a paper on generative adversarial nets, also called GANs.

  • 01:16

    CHRISTIAN ARNOLD [continued]: These have been around for maybe four or five years now in computer science. They were originally created to come up with synthetic photos. So the original idea was: you have a data set with thousands of photos, you show it to the algorithm, and the algorithm creates completely artificial photos

  • 01:38

    CHRISTIAN ARNOLD [continued]: that look real. There is a famous database of celebrity photos, and the algorithm creates pictures that look like real persons based on this celebrity data set. But as a matter of fact, these pictures are completely made up.

  • 01:59

    CHRISTIAN ARNOLD [continued]: We take these image-processing algorithms and want to apply them to ordinary data sets. Our idea is that these days, as social scientists, we very often have to deal with sensitive data. And when you want to make your studies replicable, you have to protect the privacy of the people who

  • 02:21

    CHRISTIAN ARNOLD [continued]: are in your data set. You would have to give away your data so that others can replicate your study, but since you cannot do that, there needs to be a way around it. We want to use these methods to come up with completely synthetic data sets on the basis of the original data.

  • 02:43

    CHRISTIAN ARNOLD [continued]: So what we will be able to do in the end, the result that we are actually showing here, is that you show the GAN your original data set and train it, so that it becomes an expert in creating copies of that data set. You then have a synthetic copy that contains the same statistical information

  • 03:03

    CHRISTIAN ARNOLD [continued]: as the original data set. But the individuals in that data set aren't really there, right? They are synthetic people, so to speak. And that's a great way of protecting privacy and a great way of making your studies replicable, once you manage to do the training.
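
As an illustration of what "the same statistical information" can mean in practice, here is a minimal Python sketch that compares the marginal means and the correlation matrix of an original table with those of a synthetic copy. The column names and both data frames are hypothetical placeholders, not the data or code from the paper.

    import numpy as np
    import pandas as pd

    rng = np.random.default_rng(0)
    cols = ["age", "income", "trust"]  # hypothetical variables

    # Stand-ins for the sensitive original data and the GAN's synthetic copy.
    real = pd.DataFrame(rng.normal(size=(1000, 3)), columns=cols)
    synthetic = pd.DataFrame(rng.normal(size=(1000, 3)), columns=cols)

    # If the GAN has captured the joint distribution, both differences
    # should be close to zero.
    print((real.mean() - synthetic.mean()).round(3))
    print((real.corr() - synthetic.corr()).round(3))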

  • 03:26

    CHRISTIAN ARNOLD [continued]: [How do GANs work in practice, and how will your work improve replication?] I think it's easier than you would think, because image processing, at the end of the day, is putting photos into numbers and then processing the numbers. Right? We simply leave out that step and take the numbers right away. So the actual transfer of the algorithms,

  • 03:47

    CHRISTIAN ARNOLD [continued]: that's pretty straightforward. We got the idea, actually, from a footnote in a paper that was dealing with Facebook data. You see these papers floating around quite often these days: a very prominent paper presents awesome results on, I don't know, social network analysis on Facebook, large data sets, very sensitive data. And then at the end it says, hey,

  • 04:08

    CHRISTIAN ARNOLD [continued]: if you want to replicate our study, please go to Facebook, to that super-secure data vault, and you have to, I don't know, give them two weeks' advance notice or something. And that goes totally against the idea of science, right? We need to make our studies replicable; otherwise it's not science. Right? And the idea behind all our attempts

  • 04:31

    CHRISTIAN ARNOLD [continued]: to make synthetic data differentially private (that's the data privacy protection mechanism that we put into these GANs) is that you can really safely share your data, really give it away to other people, that is, the statistical information that is within the data, while at the same time

  • 04:53

    CHRISTIAN ARNOLD [continued]: having an absolutely fireproof guarantee for the privacy of the people who are in the original data set. The analogy I use for this is definitely not mine. We borrowed it because it's really beautiful for explaining the method, and expanded it with the part that explains our innovation here. [How can we understand GANs in a more intuitive way?]

  • 05:16

    CHRISTIAN ARNOLD [continued]: GANs, at the end of the day, are a competition between two camps. You have a generator and a discriminator. Let's think of them, for that purpose, as criminals who want to counterfeit money on the one hand, and on the other side the police, who know what real money should look like.

  • 05:36

    CHRISTIAN ARNOLD [continued]: Right? At the beginning, both of them are amateurs. The police know a little bit about what a bank note usually looks like because they've seen one a couple of times. And the criminals are beginners as well, so they produce a green and white piece of paper. They give it to the police, and the police say, hmm,

  • 05:57

    CHRISTIAN ARNOLD [continued]: let me take a look at it next to the real bank note. And no, it doesn't quite look like it, and they give it back. OK. And in this very first moment, both sides learn something. Right? The police learn what a bank note looks like, so they update their priors, given that they now know better what a bank note looks like.

  • 06:18

    CHRISTIAN ARNOLD [continued]: And the criminals update their parameters as well. So, the next round: they try again, give it to the police, the police say yes or no and give it back. And this game goes back and forth, back and forth, until at the end of the day you have really expert counterfeiters on the one hand. Right? Expert criminals who know how to falsify money.

  • 06:39

    CHRISTIAN ARNOLD [continued]: And on the other side you have a really expert police force who understand how to distinguish an original bank note from a fake one. That's the whole idea behind these GANs, how they were invented. You have a system of two networks that mutually train each other up to a really expert level. That's the starting point.
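
The counterfeiter-and-police game translates almost line by line into code. Below is a minimal PyTorch sketch of the two-network training loop for tabular data; the network sizes, learning rates, and the stand-in data set are illustrative assumptions, not the authors' implementation.

    import torch
    import torch.nn as nn

    latent_dim, n_features = 16, 8  # assumed sizes for a small tabular data set

    # Generator: the "counterfeiter" that turns random noise into fake rows.
    G = nn.Sequential(nn.Linear(latent_dim, 64), nn.ReLU(), nn.Linear(64, n_features))
    # Discriminator: the "police" that scores rows as real (1) or fake (0).
    D = nn.Sequential(nn.Linear(n_features, 64), nn.ReLU(), nn.Linear(64, 1))

    opt_G = torch.optim.Adam(G.parameters(), lr=2e-4)
    opt_D = torch.optim.Adam(D.parameters(), lr=2e-4)
    bce = nn.BCEWithLogitsLoss()

    real_data = torch.randn(512, n_features)  # placeholder for the sensitive data

    for step in range(1000):
        real = real_data[torch.randint(0, len(real_data), (64,))]
        fake = G(torch.randn(64, latent_dim))

        # Police update: learn to tell real rows from counterfeits.
        opt_D.zero_grad()
        d_loss = (bce(D(real), torch.ones(64, 1))
                  + bce(D(fake.detach()), torch.zeros(64, 1)))
        d_loss.backward()
        opt_D.step()

        # Counterfeiter update: learn to fool the police.
        opt_G.zero_grad()
        g_loss = bce(D(fake), torch.ones(64, 1))
        g_loss.backward()
        opt_G.step()

    synthetic_rows = G(torch.randn(1000, latent_dim)).detach()  # the synthetic copy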

  • 06:59

    CHRISTIAN ARNOLD [continued]: And our contribution here is to introduce a privacy mechanism on top of what has been done already. The way we think about it is this: imagine you really want to protect the privacy of the original data, and you don't want to give too much information away

  • 07:20

    CHRISTIAN ARNOLD [continued]: about the people who are in that data set. With our approach, you can say, hey, I want a lot of protection. That would be like making the police wear very strong, very blurry glasses. The police are then almost blind.

  • 07:40

    CHRISTIAN ARNOLD [continued]: They can only get a rough idea of what the bank note looks like. So the whole training process begins but stalls at some point, because the police can only pass so much information about the real data into the game. On the other side, if you give the police really good glasses through which they see a lot, they can become much better, and both the criminals and the police

  • 08:01

    CHRISTIAN ARNOLD [continued]: can bring this training process to much greater heights, if you will. And the cool thing about our approach is that you select the kind of glasses that the police should wear. That's a good way of protecting the privacy of the people who are in the original data set.
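
The "blurry glasses" can be thought of as noise injected into the police update. One common way to do this, in the spirit of differentially private SGD (clip the gradient, then add calibrated Gaussian noise), is sketched below. This is a simplified, batch-level illustration (proper DP-SGD clips per-example gradients), and the constants are assumptions, not values from the paper.

    import torch

    def noisy_discriminator_step(D, d_loss, opt_D, clip_norm=1.0, noise_multiplier=1.5):
        """One 'blurry glasses' police update: clip, then noise the gradient.

        A larger noise_multiplier means blurrier glasses: more privacy,
        but slower and cruder learning.
        """
        opt_D.zero_grad()
        d_loss.backward()
        # Bound how much any single batch can reveal about the real data.
        torch.nn.utils.clip_grad_norm_(D.parameters(), clip_norm)
        with torch.no_grad():
            for p in D.parameters():
                if p.grad is not None:
                    # Add Gaussian noise scaled to the clipping bound.
                    p.grad += noise_multiplier * clip_norm * torch.randn_like(p.grad)
        opt_D.step()

Swapping a step like this in for the plain police update in the loop above gives the generator only a noisy view of the real data, which is what would make the released synthetic copy privacy-preserving.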

  • 08:22

    CHRISTIAN ARNOLD [continued]: [What kind of questions have you been able to answer using GANs?] We're at an initial stage, where we are showing that it is possible, and we use synthetic, made-up data. It's not a published paper yet, so this is ongoing research. But the idea would be to, yeah, I don't know, maybe take a study that was not

  • 08:44

    CHRISTIAN ARNOLD [continued]: replicable so far, because nobody could access the data. Or, all of a sudden, you are able to release some variables that so far had to be kept top secret. But it will also help future collaborations between researchers, on the one hand, and, I don't know, government entities or private actors,

  • 09:05

    CHRISTIAN ARNOLD [continued]: like companies that have lots of data about people. Because so far the company or the government agency would say, hey, no, sorry, I know you guys would love to do research here, but we can't give our data away. And we hope that with our tool the agency will actually

  • 09:25

    CHRISTIAN ARNOLD [continued]: be convinced that, yes, there is a way to protect the privacy of the people who are in the original data set, because all the researchers actually need is a synthetic copy of it. And the agency determines ex ante how much privacy it wants to give away with its data. So yeah, we think it's quite neat. And we still have to roll that idea out.

  • 09:46

    CHRISTIAN ARNOLD [continued]: But as it stands currently, it looks quite OK and quite promising. [What would you recommend to students willing to explore this research method further?] At my university, we're thinking a lot about how to renovate the methods curriculum and how to bring it more up to speed. And the big, big buzzword that has been floating around here

  • 10:09

    CHRISTIAN ARNOLD [continued]: is data science, of course. So on the one hand, obviously, you need to understand the statistics behind all of it, and that's what has been taught for the last couple of decades. But now, with these massive new data sets, you need to know a bit about how to actually do data warehousing: how do you manage larger databases that

  • 10:29

    CHRISTIAN ARNOLD [continued]: are beyond the scope of the computer on your desktop? And you need to know a little bit about real programming; there is no way around that. I'm not saying you need to become a computer scientist, but if you want to play with these things, that's what you need to do. So I think these two competencies

  • 10:50

    CHRISTIAN ARNOLD [continued]: will be woven together even more in the coming years. [Why should social scientists work with big data and data science methods?] I really, really believe that these are exciting times for social scientists. For the first time in human history,

  • 11:10

    CHRISTIAN ARNOLD [continued]: we are collecting data about humans at a scale that is really incomprehensible. And this holds a lot of promise, but also a lot of dangers. I mean, I'm German. Think of the Stasi in East Germany having the data that Facebook or Twitter are collecting about individuals.

  • 11:31

    CHRISTIAN ARNOLD [continued]: They could only have dreamt of that kind of situation. Nobody guarantees that we are doing good with the massive amount of data that we're collecting. So far, we are just beginning to understand how to use that data, how to understand societies through their data, and how all of that even transforms societies. Think of how Twitter has transformed

  • 11:52

    CHRISTIAN ARNOLD [continued]: the landscape of elections in the last couple of years. So there is a lot going on. I think if you ever wanted to become a computational social scientist, now is the time to do it. And it's super exciting. There are heaps of opportunities for research.

  • 12:14

    CHRISTIAN ARNOLD [continued]: And at the same time, it's very important that we as social scientists actually turn to these questions, because only we as social scientists can answer questions about how societies work, how they should work, how they are governed, and how they should be governed. And all of that against this whole new backdrop of collecting

  • 12:36

    CHRISTIAN ARNOLD [continued]: heaps of data. Yeah, I don't know. I think this is totally exciting, and that's why I really love being a researcher these days.

Abstract

Christian Arnold, PhD, Lecturer in Politics at Cardiff University, discusses his research using generative adversarial networks (GANs) to create synthetic data for replication and privacy protection, including how GANs work, issues addressed by GANs, recommendations to students interested in research using GANs, and why social scientists should be working with big data and using data science methods.
