  • 00:00

    [MUSIC PLAYING] Big data means a lot of things to a lot of people. I think for big data, you need you

  • 00:22

    need a lot of data about a complex system. [MUSIC PLAYING]

  • 00:30

    JIM KARKANIAS: Arguably, all of the machinery of life--biology--is an information process. It's what do cells do in what way, in what order, and why, and how? And understanding that is a computational problem. Understanding the parts and how they fit together and why they fit together in a particular way

  • 00:51

    JIM KARKANIAS [continued]: helps us understand what health looks like when the parts work well together. And what disease looks like when the parts don't work well together, and why, and what to do about it. So I lead the computation effort for Biohub. I like to call it the computation effort because it encompasses everything from data science, math, statistics, algorithms, to the technology that underpins

  • 01:13

    JIM KARKANIAS [continued]: that development, and even IT infrastructure that makes that all work. The group is an eclectic group. We have mathematicians and experts in graph theory, set theory, bioinformatic people, developers, technologists, imaging experts, machine learning experts.

  • 01:35

    JIM KARKANIAS [continued]: Everyone's becoming a machine learning expert. It's sort of a foundational skill. And everybody brings their own special magic and approach to the situation.

  • 01:43

    SPEAKER 1: It's essentially doing the same thing that other tools do. But what you're pointing out is, you're pointing out that that information, which is in 150 base pair reads, is actually in 27-mers. That's incredible, right? That's a huge-- to me, that's a huge piece of information. Now practical applications of it? Sure, I mean, that's a big open question. But it's basic science.
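
For readers unfamiliar with the term, a k-mer is a length-k substring of a sequencing read, so a 27-mer is any run of 27 consecutive bases. The sketch below only illustrates that definition; it is not the tool being discussed, and the function name `kmers` and the example read are made up for illustration.

```python
def kmers(read, k=27):
    """Return every length-k substring (k-mer) of a read, in order."""
    return [read[i:i + k] for i in range(len(read) - k + 1)]


# Hypothetical 150-base read built from a repeating pattern, for illustration only.
read = ("ACGTTGCA" * 19)[:150]

read_kmers = kmers(read)
print(len(read), len(read_kmers))  # 150 124
```

A 150-base read contains 150 - 27 + 1 = 124 overlapping 27-mers, which is why a single read can be re-expressed as a collection of much shorter units.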

  • 02:03

    ANGELA PISCO: As a data scientist here at the Biohub, we are embedded in the teams that we are working with. So my case in particular, I work really close with the single cell transcriptomic group. I am aware of the projects they are working on. They are aware of the methods and the tools

  • 02:24

    ANGELA PISCO [continued]: that I'm developing, and it works both ways. So we both benefit with this constant interaction of expertise.

  • 02:31

    JOSH BATSON: We want to understand the way different cells work. It's a cooperative system, very complex. Everybody has their own sort of job. The biologist wants to understand some biological system. Maybe it's all the cells in the body. Maybe it's the evolution of pathogens as they're spreading through the human population. So I think mathematicians like interesting problems.

  • 02:51

    JOSH BATSON [continued]: And biologists generate a lot of them. Now I would say that in biological research in particular, the data is much richer and that requires more sophisticated analysis. It's a blessing, right? It means biology is extremely complex. And so you need complex measurements if you want to characterize what's going on.

  • 03:15

    JOSH BATSON [continued]: It's one of many paradigm shifts.

  • 03:16

    SPEAKER 2: After finishing alignment, I use this tool from the Integrative Genomics Viewer to visualize my alignments. And so I think this tool will be really great when we have a genome. But right now, I have to go through transcript by transcript and see how my alignments map to that particular transcript.

  • 03:36

    JOSH BATSON: There are a bunch of research groups at the Biohub that have their own things they're trying to do. And for some of these, there is a computational component that is pretty complex. And so either they couldn't do the experiment they need to, or they couldn't understand that without computational help, or they could do a lot more if they

  • 03:57

    JOSH BATSON [continued]: had someone with computational expertise involved. I'm working on a machine learning method for microscopy and realized that some of the data that Manu, a leader on the cell biology team--his group is gathering would be a perfect testing ground for this. Biohub has two major domains. One is infectious disease, and one is Cell Atlas, or cell biology.

  • 04:19

    JOSH BATSON [continued]: And that's the long-term scientific goal of understanding how cells work.

  • 04:23

    JIM KARKANIAS: The Cell Atlas program is, let's catalog all the parts of every cell, in every tissue, in every species across time and figure out what they all do.

  • 04:35

    SPYROS DARMANIS: There is a shift in what you spend your hours doing as a molecular biologist, right? And if you think 20, 30 years ago, you probably not even had to interact with a computer. You really didn't have to. And today, both personally and my team, most of the time

  • 04:55

    SPYROS DARMANIS [continued]: you spend is actually working with data and not so much generating the data. And it's very easy today to do an experiment where you measure 25,000 genes across 10,000 different cells. We're just going to use these to visualize things, OK? So this is the don't explode part. So the emphasis becomes, how can I analyze my data?

  • 05:18

    SPYROS DARMANIS [continued]: Not so much, how can I acquire my data?

  • 05:20

    JOSH BATSON: You want to say, can I get a picture where every cell is maybe a dot? And cells that are doing similar things are nearby. Right. And so, that may be some way of seeing all the cells in one atlas. The other thing you might want to do is ask, are there clusters of cells? All of these cells are sort of doing the same thing. We grabbed the chunk of tissue.

  • 05:41

    JOSH BATSON [continued]: Of course, every cell's unique in its own special way, but they're up to the same business. And the question is, what are the kinds of liver cells? What do they do? How abundant are they? How are they related? And so for that, you want to ask questions about ensembles. What's going on with this group? And for that, you need to find the groups. And as you can see, there are some little groups.

  • 06:02

    JOSH BATSON [continued]: These are the clusters of cells. Sort of that population, this population, this population, and this big one. And they're colored based on the tissue they came from. It was clear that if we wanted to get the biology out of these clusters--what is this group of cells doing?--we needed the biologists in. And so at that point, we needed to make all of them

  • 06:23

    JOSH BATSON [continued]: into data scientists, in a way. And so they could actually be able to analyze their own data, to be able to tune parameters, be able to zoom in on one part, be able to check things out. We really needed to have the computational team, who knew the effects of the algorithm we're using, together with the biologists, who knew what made sense biologically.

  • 06:44

    ANGELA PISCO: Some of the tools that I'm developing are, for example, methods that enable the research assistants here to analyze the data that they are generating in the lab. So one thing that I'm very passionate about

  • 07:04

    ANGELA PISCO [continued]: is to give people ability to analyze their own data. So I don't want to be seen as a core facility. That people just kind of like, see me as a person that they say, oh the data is here. Please go download it, and please analyze it for me, and provide me the results. I want them to be empowered to do their own analysis.

  • 07:24

    ANGELA PISCO [continued]: So I'm writing them code that allows them, with some guidance, to do it. So I do tutorials that, for example, helps them clustering cell types. So clustering is one of the techniques that is heavily used in single cell RNA sequencing analysis.

  • 07:45

    ANGELA PISCO [continued]: So what basically clustering does is, it breaks your data--so your collection of cells--into small groups in a two dimensional space. You can imagine you have a square, and then you have 1,000 points. And then those points, we use machine learning algorithms for clustering.

  • 08:05

    ANGELA PISCO [continued]: They are like, standard methods in the field. And they will break your data into clouds in that 2D box. So now that we have done the cell ontologies annotated, what is the next experiment that we're going to be doing with it?
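
The picture of 1,000 points in a square being broken into clouds can be sketched with k-means, one standard clustering method. The transcript does not say which algorithm the team uses, so k-means is an assumption here, as are the synthetic points, the three cluster centers, and the farthest-point starting rule (chosen only to keep the example deterministic).

```python
import random


def kmeans(points, k, iterations=20):
    """Lloyd's algorithm for 2D points with a deterministic farthest-point start."""
    def dist2(a, b):
        return (a[0] - b[0]) ** 2 + (a[1] - b[1]) ** 2

    # Start from the first point, then repeatedly add the point that is
    # farthest from all centroids chosen so far.
    centroids = [points[0]]
    while len(centroids) < k:
        centroids.append(max(points, key=lambda p: min(dist2(p, c) for c in centroids)))

    labels = []
    for _ in range(iterations):
        # Assignment step: each point joins its nearest centroid.
        labels = [min(range(k), key=lambda j: dist2(p, centroids[j])) for p in points]
        # Update step: each centroid moves to the mean of its members.
        for j in range(k):
            members = [p for p, lab in zip(points, labels) if lab == j]
            if members:
                centroids[j] = (sum(p[0] for p in members) / len(members),
                                sum(p[1] for p in members) / len(members))
    return labels, centroids


# Hypothetical data: 1,000 points in the unit square, drawn around three centers.
random.seed(0)
true_centers = [(0.2, 0.2), (0.8, 0.2), (0.5, 0.8)]
points = []
for i in range(1000):
    cx, cy = true_centers[i % 3]
    points.append((random.gauss(cx, 0.03), random.gauss(cy, 0.03)))

labels, centroids = kmeans(points, k=3)
sizes = sorted(labels.count(j) for j in range(3))
print(sizes)
```

In real single cell data the two plotted dimensions usually come from a dimensionality reduction of thousands of genes, and the number of groups is not known in advance, which is part of why the parameter tuning mentioned in the interviews matters.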

  • 08:23

    SPEAKER 3: One of the ways that we can use the data moving forward is, by using the cell ontology, we can compare to other existing data sets, given that they do conform to the same vocabulary.

  • 08:35

    SPYROS DARMANIS: When you're trying to describe a whole system--like an organ or tissue--by just mashing it all together and describing it as one number, you are prone into losing some of the details of the system. So single cell genomics basically aims to overcome this problem by doing the following exercise. You say, I have a piece of tissue.

  • 08:58

    SPYROS DARMANIS [continued]: So let's say the liver. And one way of understanding what the liver is doing is to take the whole tissue, mush it up, get the RNA collectively from every cell in that tissue, and then sequence it and say, OK. The liver as a whole is expressing these genes by looking at one cell at the time.

  • 09:19

    SPYROS DARMANIS [continued]: Right? So in the crazy case where you have one type of cells in the liver that is highly expressing a gene and another set of cells that is not expressing it at all, you're actually going to see these two populations instead of assuming that, oh, the entire liver is expressing

  • 09:40

    SPYROS DARMANIS [continued]: this gene at kind of so-and-so levels.
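
The contrast between a bulk measurement and a single cell measurement can be shown with a few lines of arithmetic. The numbers below are invented for illustration (1,000 cells, half expressing a gene at an arbitrary level of 100 and half not at all); they are not data from the study.

```python
from statistics import mean

# Hypothetical expression values for one gene in 1,000 liver cells:
# half the cells express it strongly, half not at all.
expression = [100.0] * 500 + [0.0] * 500

# Bulk view: the tissue is mashed together, so only the average survives.
bulk_level = mean(expression)

# Single cell view: keep each cell's value and count the two populations.
expressing = sum(1 for x in expression if x > 0)
silent = len(expression) - expressing

print(bulk_level)          # 50.0
print(expressing, silent)  # 500 500
```

The bulk number suggests every cell expresses the gene at a middling level, while the per-cell values show that no individual cell actually sits at that level.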

  • 09:42

    JOSH BATSON: And these two groups of hepatocytes from the liver--that group and that group--are the male and the female hepatocytes. They're separate enough in gene expression space that in this embedding, they actually appear as two separate groups. The liver, as you may know, cleans the blood.

  • 10:04

    JOSH BATSON [continued]: The blood passes through and it breaks down toxins, breaks down molecules you don't want around, breaks down drugs, right? That's where drugs get metabolized. And for a long time, people have known that women and men process drugs differently. They have different responses to some medications, sometimes dramatically. And that has to be because of what's going on in the body

  • 10:26

    JOSH BATSON [continued]: biologically. And so sex differences in the liver are really important. And in this study, we had three female mice, I think, and four male mice--possibly vice versa. It's very common in biology to just experiment on male mice, and that has downstream consequences for health. And so in this experiment, you can

  • 10:47

    JOSH BATSON [continued]: start to get at these sex differences in the biology of the liver that are part of a much bigger story that we're kind of adding some fuel for. We had to switch to doing things on the big machines in the cloud instead of our laptops because the laptops couldn't handle it. I mean, this wasn't possible 10 years ago.

  • 11:08

    JOSH BATSON [continued]: And it's gone from-- you could get one cell, 10 cells, 100 cells. I mean, orders of magnitude increases every few years. And we have a machine in there called the NovaSeq, which is capable of producing like, a billion sequencing reads. Then you get a billion data points out, and those are sort of strung out across the genome of the organisms. Maybe actually, it's metagenomic.

  • 11:29

    JOSH BATSON [continued]: A lot of the software which had been made for big data breaks on this stuff.

  • 11:36

    JIM KARKANIAS: And what's energizing me is the technology has advanced far enough that at the hardware level, we can simulate these things. We have vast amounts of computing at our fingertips at fractions of the cost. Software is advanced enough to exploit these things. You've probably heard-- everyone has heard-- about machine learning, and deep networks, and so on.

  • 11:57

    JIM KARKANIAS [continued]: And that makes it possible to emulate parts of what the brain does, such that it can recognize faces, and this is a cat, and this is a dog. Famously, that's the start of the--I would say-- current machine learning era, where it was possible to distinguish what was thought of as only a human task. And if you could take a machine and have

  • 12:19

    JIM KARKANIAS [continued]: it think about information, you can imagine a day where it starts to think about the information that describes our biology and correlate all this genetic information, and sequence information, and cellular parts, proteins and their functions and their locations, and put together a picture. Think about the biology and what it does.

  • 12:42

    JIM KARKANIAS [continued]: And in order to do that, you have to assemble a group of experts and people who are willing to work in creative, unknown ways for a long period of time. And that's what the Biohub is.

  • 12:56

    ANGELA PISCO: Five years ago, the technology that we are using now hadn't been developed yet. But the 10x Genomics, that really revolutionized the field with the amount of cells that you can now analyze was only recently available. And in terms of computational analysis,

  • 13:17

    ANGELA PISCO [continued]: one of the most powerful single cell RNA seq libraries for data analysis for Python had not been developed yet. Back in 2012, people hadn't really started thinking about how we're going to be analyzing these data because the amount of data was not available.

  • 13:38

    ANGELA PISCO [continued]: But now, we are really at the era of big data.So computationally, this is the most exciting time to join.

  • 13:48

    JOSH BATSON: Data science isn't like, a field of scientific inquiry, right? It's a kind of collection of methods and expertise that is meant to be applied to help understand the world. So everyone who's graduating from Stanford or Berkeley or UCSF right now in a biology department has some basic programming skills.

  • 14:09

    JOSH BATSON [continued]: And they're going to have more and more, because every undergrad is learning something about how to code, just because they want to manipulate the world around them.

  • 14:18

    ANGELA PISCO: I had the opportunity of being deeply involved in the analysis. So I got to use all my computational background to help on the beginning of the analysis at identifying cell types. But I could also develop new methods. So I created a new machine learning method that basically

  • 14:42

    ANGELA PISCO [continued]: takes a step further in using classical machine learning techniques--such as the random forest--and provides you a classification system for the cell types. And this is such a unique environment, that it couldn't have been done otherwise. So for this project to become alive,

  • 15:04

    ANGELA PISCO [continued]: it's really important to have the collaboration of so many backgrounds. And the people that I work with is just incredible.
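
The method described above is not specified in the transcript, so the following is only a rough sketch of the general idea behind random-forest-style classification: many small classifiers, each trained on a bootstrap sample and a random subset of features, vote on a label. To stay short it uses one-split decision stumps rather than full trees, which makes it a heavily simplified stand-in for a real random forest. The gene values, the labels `type_A` and `type_B`, and all function names are hypothetical.

```python
import random
from collections import Counter


def majority(labels):
    """Most common label in a list."""
    return Counter(labels).most_common(1)[0][0]


def fit_stump(rows, labels, features):
    """Find the one-feature threshold split with the fewest training errors."""
    best = None
    for f in features:
        for t in sorted({r[f] for r in rows}):
            left = [lab for r, lab in zip(rows, labels) if r[f] <= t]
            right = [lab for r, lab in zip(rows, labels) if r[f] > t]
            if not left or not right:
                continue
            l_lab, r_lab = majority(left), majority(right)
            errors = (sum(lab != l_lab for lab in left)
                      + sum(lab != r_lab for lab in right))
            if best is None or errors < best[0]:
                best = (errors, f, t, l_lab, r_lab)
    if best is None:  # no usable split: always predict the majority label
        m = majority(labels)
        return (features[0], float("inf"), m, m)
    return best[1:]


def fit_ensemble(rows, labels, n_stumps=25, n_features=2, seed=0):
    """Train stumps on bootstrap samples, each seeing a random feature subset."""
    rng = random.Random(seed)
    stumps = []
    for _ in range(n_stumps):
        idx = [rng.randrange(len(rows)) for _ in rows]       # bootstrap sample
        feats = rng.sample(range(len(rows[0])), n_features)  # random feature subset
        stumps.append(fit_stump([rows[i] for i in idx],
                                [labels[i] for i in idx], feats))
    return stumps


def predict(stumps, row):
    """Majority vote over all stumps."""
    return majority([l_lab if row[f] <= t else r_lab for f, t, l_lab, r_lab in stumps])


# Hypothetical cells: four marker genes, two made-up cell types.
low, high = [1.0, 2.0, 3.0], [11.0, 12.0, 13.0]
rows, labels = [], []
for i in range(20):
    rows.append([high[i % 3], high[(i + 1) % 3], low[i % 3], low[(i + 2) % 3]])
    labels.append("type_A")
    rows.append([low[i % 3], low[(i + 1) % 3], high[i % 3], high[(i + 2) % 3]])
    labels.append("type_B")

stumps = fit_ensemble(rows, labels)
print(predict(stumps, [13.0, 13.0, 1.0, 1.0]))  # type_A
print(predict(stumps, [1.0, 1.0, 13.0, 13.0]))  # type_B
```

In practice one would reach for an established library implementation with full decision trees; the point here is only the combination of resampling, feature subsetting, and voting.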

  • 15:15

    JIM KARKANIAS: Some of it is because the time is right, the technology and hardware and software is prime and ready to address these problems. And then the last piece of the puzzle is people. There are people who are really motivated to work in this way and drive us into the future.

  • 15:38

    JIM KARKANIAS [continued]: We always talk at Biohub about if we need a tool and it doesn't exist, we'll make it. And that's the spirit here of figuring out how to address the future. [MUSIC PLAYING]

Video Info

Publisher: SAGE Publications Ltd

Publication Year: 2019

Video Type: In Practice

Methods: Data science, Big data, Machine learning, Clustering, Classification

Keywords: algorithms; biological sciences; biology; cell biology; cluster analysis; collaboration; computational biology; computer models; computer science; data analysis; data management; gene mapping; microbiology; molecular biology; research methods; sex differences

Segment Info

Segment Num.: 1

Abstract

Chan Zuckerberg Biohub's Jim Karkanias, PhD, Vice-President, Angela Pisco, PhD, Data Scientist, Josh Batson, PhD, Senior Data Scientist, and Spyros Darmanis, PhD, group leader Cell Atlas initiative, discuss how big data and computational science are changing biological research, including working with biological data, some specifics of the Cell Atlas project, new tools to analyze cellular data, and single cell genomics.

Applying Data Science to Biological Big Data: Chan Zuckerberg Biohub