  • 00:00

    [MUSIC PLAYING]

  • 00:15

    STUART LYNN: CARTO is a company that produces tools and a platform for really helping people understand and make better business decisions through geospatial data and geospatial information. I think our kind of major understanding is that the world is full of decisions that we make,

  • 00:37

    STUART LYNN [continued]: it's full of information, it's full of context that's all geospatial. So you behave differently when you're at home than when you're abroad on holiday. You behave differently when you're commuting to work compared to hanging out on the weekends. And all of that impacts decisions you make about how you spend money, how you use transit, how you even just

  • 00:57

    STUART LYNN [continued]: interact with the world in a lot of different ways. And so for us, that kind of geospatial context, the context of where you are really impacts how anything from trades to economic activity to transit really has to be thought about and decisions to be made with that. So CARTO's a company that really tries to bring together data about location with data from clients and customers

  • 01:20

    STUART LYNN [continued]: to help them make better decisions by bringing that context in their information. This will really help with the bottom line of us getting new sales opportunities that will really help us with the end of quarter push. So my role here is the head of data science. I basically manage a team of four data scientists. We work a lot with clients. We also work internally on the kind of products we produce to help advise them and produce methods

  • 01:41

    STUART LYNN [continued]: for doing data science on these products. All right. So let's just go through our boards, just look at the various different projects. Let's start with marketing. We've got back pressure and pushing these stories out. And we work kind of as well with our marketing team to showcase the kind of analysis that can be done using geospatial data and how that can really impact the world and how we make better decisions.

  • 02:02

    STUART LYNN [continued]: There's a lot which is about knowing the right mathematical techniques, the right statistical techniques, the right languages to program, and how you implement all this stuff. There's also a lot of skills around how do you balance the requirements of business with requirements of data science.

  • 02:17

    WENFEI XU: There aren't that many, I think, good and open source GIS tools out there. And so in addition to kind of all the data manipulation that you can do in CARTO in the back end, that was one of the huge benefits of kind of CARTO's front end platform. And now I think slowly what we're doing is we're evolving from just being

  • 02:39

    WENFEI XU [continued]: a platform to doing a lot more. So we have an API where you can connect to CARTO's tools and connect to CARTO's database and plug that into your own web app that you're developing or your own notebook. There's one that's analysis, like kind of more general explanatory stuff and one that was more

  • 03:01

    WENFEI XU [continued]: looking at dual-time functions. My role at CARTO, it tends to vary, because everyone on the data science team wears a lot of different hats. In particular, I do a lot of case studies for different kinds of industries that are little examples of how you would go about solving a problem. The basic kind of question that he wants us to answer

  • 03:23

    WENFEI XU [continued]: is a site selection question. More and more, we're building out these tools. And what our company's focusing on right now is we're focusing on different kinds of specific web applications or solutions that we can develop for logistics, optimization, site selection, and territory management.

  • 03:47

    WENFEI XU [continued]: What these kinds of stores actually do is find the commercial hotspots by just clustering POIs. Maybe companies themselves have their own kind of spatial data. And they're finding that they want to develop these kind of customized solutions to help them solve spatial problems, like territory management. So if I'm a company that has salespeople,

  • 04:11

    WENFEI XU [continued]: how do I use my spatial data to optimize where to send those salespeople and how to kind of most efficiently allocate territories to each of those salespeople such that their accounts are balanced, for instance? And we're increasingly finding that there's just a lot more interest in spatial data science.

  • 04:30

    JEFF FERZOCO: My title here is senior customer success manager. And I interface directly with customers who have already begun using our platform. And I take them from initial kickoff where they know very little about the project or the product and what they want to do with it. I talk to them through their goals, their needs

  • 04:51

    JEFF FERZOCO [continued]: and help them figure out what steps to take to learn the product, to understand what they want to make, and how we can help make that a little bit better. The Million Walks project kind of emerged out of a lot of conversations and a good data set that we had. We came across a city agency that was looking to measure what was going on in their parks.

  • 05:14

    JEFF FERZOCO [continued]: And the team here, the data science team here had been already kind of developing a little bit of that.

  • 05:19

    WENFEI XU: We know that we have some data from New York and Seattle. So I might just start with those cities for now. We are using taxi data, pickup drop-off data to make a visualization about different communities in Williamsburg and how community boundaries tend to be kind of fuzzy. And so basically, they saw that project, and they were like,

  • 05:39

    WENFEI XU [continued]: we collected this data. But we don't really have the in-house human resources. Can you take this data and make an interesting story out of it? We chose public parks because we understand that there are privacy issues that at that point we had yet to really sink our teeth into.

  • 06:00

    WENFEI XU [continued]: And so we wanted to pick a space that was kind of public and neutral that we knew wouldn't really violate anyone's privacy. So for us, that was an important component of the project. And also, me being an urban planner and having studied how people use public space in general,

  • 06:21

    WENFEI XU [continued]: I knew that it was an interesting opportunity to understand a space that doesn't typically-- you can't typically get usage, park usage data very easily just because it's open public space. How would you actually count number of people going to the park?

  • 06:41

    WENFEI XU [continued]: You'd have to hire somebody with the clicker and kind of count every single person.

  • 06:46

    STUART LYNN: Do you think there's a minimum park size we're going to yield the target with this?

  • 06:48

    WENFEI XU: There's a lot of zeros. I'm guessing this is maybe a lot of zeros where parks they're super small. And I think probably what we're going to have a problem with is the fact that you have parks that are too small. The first step of any of these types of big data projects is actually to get the data. And this is not an arbitrary task at all,

  • 07:10

    WENFEI XU [continued]: because I think our data was something like 300 gigabytes from a wide variety of different apps. It was large enough that we couldn't really process it on our local computers. There's some kind of attraction here, because I'm seeing more cities. But obviously, this is just the first step. There's some processing.

  • 07:31

    STUART LYNN: So that's something we want to verify with some clustering.

  • 07:33

    WENFEI XU: So we've got the data. We thought it was super interesting. So coming from academia, getting this type of data is really rare, because first of all, it's quite big. In academia, you just have, I think, fewer connections to industry partners. So we knew that it was a really great opportunity for us

  • 07:53

    WENFEI XU [continued]: to do analysis on a big data set that's at the same time fine-grain, high resolution. This is cell phone GPS data from a bunch of different apps. So they could be like exercise apps or weather apps. And these apps can ping very frequently.

  • 08:14

    WENFEI XU [continued]: From this data, we had the potential to develop a really fine-grained understanding of the spatial temporal patterns of human mobility. This green is walking. Red is running. And so you can see that during, you know, what, five o'clock is when you see most of--

  • 08:34

    STUART LYNN: --It looks as if the running tends to happen a little bit early as well. These are all tightly clustered around late in the afternoon.

  • 08:41

    WENFEI XU: Yeah, yeah.

  • 08:42

    STUART LYNN: If we aggregate this across all the parks, we may be able to see a larger pattern, like build up the number statistics by combining parks together.

  • 08:48

    WENFEI XU: So getting the data meant retrieving the data from the server that it lived on to our own servers. And so I think the data lived on our clients' Amazon S3 server. And so Stuart created a script that allowed us to transfer it on to Microsoft Azure

  • 09:10

    WENFEI XU [continued]: blobs on our servers. Cleaning the data had a couple of different steps. The first was to create a geofence around all the New York City parks. And what that means is literally making a cut out of all the data around the parks themselves. And so we wanted to do that in order to decrease the total size of our data,

  • 09:31

    WENFEI XU [continued]: because, again, this data is generally so big that it's hard to kind of just play around with it very easily. So we wanted to cut down the size of our data, only work with what was essential for us, which is basically the data in the parks. The second aspect of cleaning the data was to remove all of the noise.
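
The geofencing step described above can be sketched in a few lines. This is only an illustration and not the team's actual script: it assumes each park boundary is a simple polygon given as a list of (longitude, latitude) pairs, that each ping is one such pair, and the names `point_in_polygon`, `geofence`, `parks`, and `pings` are hypothetical. It uses the standard ray-casting test to keep only the pings that fall inside a park.

```python
from typing import Dict, List, Tuple

Point = Tuple[float, float]  # (longitude, latitude); an assumed layout


def point_in_polygon(point: Point, polygon: List[Point]) -> bool:
    """Ray-casting test: cast a ray east from the point and count edge crossings."""
    x, y = point
    inside = False
    n = len(polygon)
    for i in range(n):
        x1, y1 = polygon[i]
        x2, y2 = polygon[(i + 1) % n]
        # Does this edge straddle the horizontal line through the point?
        if (y1 > y) != (y2 > y):
            # x-coordinate where the edge crosses that horizontal line
            x_cross = x1 + (y - y1) * (x2 - x1) / (y2 - y1)
            if x < x_cross:
                inside = not inside  # each crossing flips inside/outside
    return inside


def geofence(pings: List[Point], parks: Dict[str, List[Point]]):
    """Keep only pings that fall inside some park, tagged with the park name."""
    kept = []
    for ping in pings:
        for name, polygon in parks.items():
            if point_in_polygon(ping, polygon):
                kept.append((name, ping))
                break  # a ping is assigned to the first park that contains it
    return kept


# Hypothetical rectangular "park" and three pings (two inside, one outside).
parks = {"example_park": [(0.0, 0.0), (4.0, 0.0), (4.0, 2.0), (0.0, 2.0)]}
pings = [(1.0, 1.0), (5.0, 1.0), (3.5, 0.5)]
inside = geofence(pings, parks)
```

At the data volumes mentioned in the talk, a spatial index or a database with spatial support would likely replace the nested loop; the loop is kept here only to make the cut-out logic visible.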

  • 09:51

    WENFEI XU [continued]: I think what I'm seeing here basically is a lot of, again, that ambient noise.

  • 09:56

    STUART LYNN: Yes, it's like a lot of it's the edges, right? And that might be partly like-- the trees going through the park, because like Fort Greene has a lot of trees, so it's pretty ambient. I mean, yeah, it's pretty ambient.

  • 10:04

    WENFEI XU: I think this park is close to Pratt, and so there might be just like a lot of kind of ambient activity-- When we get this data, first of all, we have probably more than 100 mobile apps in our dataset. We wanted to make sure that we only use the ones that provided kind of useful information. So maybe there are some apps where, you know, like a handful of users use them,

  • 10:28

    WENFEI XU [continued]: so we removed those. There are also, you know, points-- there are also data points that were very inaccurate. So sometimes they had-- so the way that kind of cell phone GPS data accuracy works is it's triangulated by two satellites,

  • 10:48

    WENFEI XU [continued]: and that determines where your location is. Now if you have, kind of, physical barriers-- for instance, like a building or trees-- that might decrease the accuracy of your point. If you're inside, you know, that might decrease the accuracy. And so, we had an accuracy measure for all of these data points, and we decided to kind of look at the distribution

  • 11:10

    WENFEI XU [continued]: and accuracy, see what was normal, and then remove those data points that were kind of abnormal, or they were kind of outlier data points. And so, there are kind of many ways of looking at what an outlier could be. So that could be maybe apps that had very few users, or maybe that could be pings-- you know, points that had, like, low accuracy.
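
The two noise filters described above (dropping apps with very few users, and dropping pings whose accuracy is an outlier) might look roughly like the following sketch. The record fields (`app`, `user`, `accuracy_m`), the thresholds, and the choice of a percentile cut-off are all assumptions made for illustration; the talk does not say how "abnormal" was defined.

```python
import math
from collections import defaultdict


def percentile(values, q):
    """Nearest-rank percentile (q between 0 and 100) of a non-empty list."""
    ordered = sorted(values)
    rank = max(1, math.ceil(len(ordered) * q / 100))
    return ordered[rank - 1]


def remove_noise(pings, min_users_per_app=2, accuracy_percentile=80):
    """Drop pings from rarely used apps, then pings with unusually poor accuracy."""
    # 1. Count distinct users per app and drop apps below the threshold.
    users_by_app = defaultdict(set)
    for p in pings:
        users_by_app[p["app"]].add(p["user"])
    kept = [p for p in pings if len(users_by_app[p["app"]]) >= min_users_per_app]
    if not kept:
        return []
    # 2. Look at the distribution of the accuracy radius (metres; larger is
    #    worse) and drop everything above the chosen percentile.
    cutoff = percentile([p["accuracy_m"] for p in kept], accuracy_percentile)
    return [p for p in kept if p["accuracy_m"] <= cutoff]


# Hypothetical records: one app with a single user, and one very imprecise ping.
pings = [
    {"app": "weather", "user": "u1", "accuracy_m": 5.0},
    {"app": "weather", "user": "u2", "accuracy_m": 8.0},
    {"app": "weather", "user": "u3", "accuracy_m": 10.0},
    {"app": "weather", "user": "u1", "accuracy_m": 12.0},
    {"app": "weather", "user": "u2", "accuracy_m": 500.0},
    {"app": "rare", "user": "u9", "accuracy_m": 6.0},
]
clean = remove_noise(pings)
```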

  • 11:32

    WENFEI XU [continued]: But those are kind of all the small things that we needed to figure out about our data and then kind of remove all that noise from the data. After we did the analysis, and after we did what is called kind of like the back end analysis-- doing the clustering, doing the data analysis, we need to kind of visualize it and present it

  • 11:52

    WENFEI XU [continued]: in a way that would kind of be understandable to people who are not data scientists. That was ultimately the goal for this project. And so, we conceived of making, like, a web app, or a web tool, that would allow you to explore the data. So the first section of the tool is a map and kind

  • 12:16

    WENFEI XU [continued]: of exploration tool that is built on CARTO's kind of bread and butter platform, which is called Builder. What I did was I basically took that data. I kind of used CARTO's front end platform to then visualize the data. The second and third pieces of this data required me to build visualizations outside of CARTO's platform, so we used a JavaScript software

  • 12:38

    WENFEI XU [continued]: package called D3. That is very popular for making kind of interactive data visualizations. And we built some tools that would allow you to explore specific parks. You can find the park by its different categories and its park name. And then you can understand, OK, how many visits

  • 12:58

    WENFEI XU [continued]: do I have in the park per week? How many visitors? What are the, kind of, temporal patterns that I see? When are the, kind of, most popular times to visit the park? And then finally, you have a kind of a map that shows either the different travel modes for that particular park-- are people running, walking, biking, staying-- and then kind of where the hot spots are for that park.

  • 13:20

    WENFEI XU [continued]: A lot of the running and walking are basically around the paths.

  • 13:23

    STUART LYNN: Yeah, that makes sense.

  • 13:24

    WENFEI XU: The last section is kind of comparing all the different parks. So how do, you know, parks scale? For every kind of square foot area park that I have, how many visits, and how does that relationship exist on a log-log scale? And then these yellow spots are basically stairs,

  • 13:44

    WENFEI XU [continued]: so you can see like around here. So I think the ethical, privacy, and data bias challenges of using this kind of data are actually extremely important. When we talk about cell phone data, it's coming from people. It's coming from you and me. And so, we want to, as much as possible, protect the privacy of people who are

  • 14:05

    WENFEI XU [continued]: producing this type of data. Those are kind of individual points that you see in a map. What we ultimately will do with this type of data is aggregate it, so we'll aggregate it to a scale that properly kind of anonymizes the people within a particular cell, for instance. I think another issue that is really important for us-- that we're kind of still doing some exploration on--

  • 14:27

    WENFEI XU [continued]: is where is the bias in this data? When we get mobile application data we know that we're only getting data from people who are actually using these apps. And so, we have to understand those biases. Even kind of within typical kind of age ranges, we need to understand who is using these apps.

  • 14:49

    WENFEI XU [continued]: Who's not? Are we properly kind of capturing, you know-- is this data, like, representative of the actual population? And if it's not, we want to make sure that we properly kind of adjust for that.
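
The idea of aggregating individual points into cells so that no single person stands out can be illustrated with a small sketch. Everything specific here is an assumption rather than something stated in the talk: a square grid in degrees, the `cell_size` value, and the rule of suppressing any cell with fewer than `min_users` distinct users.

```python
import math
from collections import defaultdict


def aggregate_to_grid(pings, cell_size=0.01, min_users=3):
    """Count distinct users per grid cell and suppress sparsely populated cells.

    pings: iterable of (user_id, longitude, latitude) tuples (assumed layout).
    Returns {(cell_x, cell_y): distinct_user_count} for cells meeting min_users.
    """
    users_by_cell = defaultdict(set)
    for user, lon, lat in pings:
        # Snap each coordinate to the index of the grid cell that contains it.
        cell = (math.floor(lon / cell_size), math.floor(lat / cell_size))
        users_by_cell[cell].add(user)
    # Publish only cells with enough distinct users to hide any individual.
    return {
        cell: len(users)
        for cell, users in users_by_cell.items()
        if len(users) >= min_users
    }


# Hypothetical pings: three users share one cell, a fourth is alone in another.
pings = [
    ("a", -73.9751, 40.6952),
    ("a", -73.9753, 40.6954),  # repeat ping from the same user, counted once
    ("b", -73.9755, 40.6958),
    ("c", -73.9759, 40.6951),
    ("d", -73.9612, 40.6873),
]
cells = aggregate_to_grid(pings)
```

Counting distinct users rather than raw pings keeps one very active phone from making a cell look busy, and the lone user's cell is dropped from the output entirely.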

  • 15:01

    STUART LYNN: If we aggregate this across all of the parks, we may be able to see like a larger pattern. Just like build up the number statistics by combining parks together.

  • 15:07

    WENFEI XU: I actually think that's the most kind of difficult part of using this type of data-- is really understanding, kind of, who we're dealing with, who we're missing out, how to anonymize people? And then, how to kind of properly adjust for kind of all of these biases and kind of privacy questions? I think that there are many different aspects

  • 15:28

    WENFEI XU [continued]: of this project that would not have been possible without A, data science or B, kind of like the tools that we use in data science. The data is very big, and so without these kind of distributed computing tools, kind of Cloud computing tools, we wouldn't have been able to even get the data onto our computers. Number two, there are various techniques and algorithms

  • 15:51

    WENFEI XU [continued]: that we use in this project to understand where people gathered. And for instance, the park is this really variable space, where maybe sometimes there are like a lot of people. Sometimes, like, a hot spot could be just where people picnic. And so, the scale and the density of how people gather

  • 16:14

    WENFEI XU [continued]: is very different. And so, without kind of some of the tools that we are using to kind of properly cluster-- so in this case, I'm using an algorithm called hierarchical DBSCAN clustering algorithm. Without some of these tools, we wouldn't really be able to kind of find these hot spots in this nuanced manner.
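
To give a feel for density-based hot-spot detection, here is a minimal pure-Python sketch of plain DBSCAN. It is not the hierarchical variant named in the talk: plain DBSCAN uses a single distance scale (`eps`), whereas the hierarchical version exists precisely to cope with gatherings of different densities. The sketch assumes points are already in planar coordinates (for example metres), and the `eps` and `min_samples` values are arbitrary.

```python
import math


def dbscan(points, eps, min_samples):
    """Plain DBSCAN. Returns one label per point: 0, 1, ... for clusters, -1 for noise."""
    UNVISITED, NOISE = None, -1
    labels = [UNVISITED] * len(points)

    def neighbors(i):
        # Brute-force neighbourhood query; includes the point itself.
        return [j for j in range(len(points))
                if math.dist(points[i], points[j]) <= eps]

    cluster = -1
    for i in range(len(points)):
        if labels[i] is not UNVISITED:
            continue
        seeds = neighbors(i)
        if len(seeds) < min_samples:
            labels[i] = NOISE  # may later be relabelled as a border point
            continue
        cluster += 1  # i is a core point, so it starts a new cluster
        labels[i] = cluster
        queue = [j for j in seeds if j != i]
        while queue:
            j = queue.pop()
            if labels[j] == NOISE:
                labels[j] = cluster  # border point reached from a core point
            if labels[j] is not UNVISITED:
                continue
            labels[j] = cluster
            j_neighbors = neighbors(j)
            if len(j_neighbors) >= min_samples:
                queue.extend(j_neighbors)  # j is also core: keep expanding
    return labels


# Hypothetical pings: two tight groups and one isolated point.
points = [(0, 0), (1, 0), (0, 1), (1, 1),
          (50, 50), (51, 50), (50, 51),
          (200, 200)]
labels = dbscan(points, eps=1.5, min_samples=3)
```

The two groups come out as separate clusters and the isolated point is marked as noise. With a single `eps`, a dense picnic spot and a sparse running loop cannot both be captured well, which is the limitation the hierarchical variant addresses.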

  • 16:32

    JEFF FERZOCO: The way we take these projects further is that once we see one customer using it successfully-- using it, being a map, or a dashboard, or something that answers a question for one city or customer-- we can then take that example, turn that into a use case, and then help other customers understand how that could help them. Where I'd like to see A Million Walks in a Park go

  • 16:54

    JEFF FERZOCO [continued]: is to a level where cities can understand a little bit more about what's going on in parks. But more specifically help make them a little bit more equitable, so that we are identifying where parks are leaning heavily towards a specific segment of the community and ignoring another-- and how to sort of redistribute access to the people

  • 17:18

    JEFF FERZOCO [continued]: that may or may not have access to a park, or may not know that they could have access. That might lead to them increasing amenities, or increasing services, or changing the hours of the park-- or actually adding another manager, so they can have more community interaction. There are a lot of different results that can come from these dashboards that we're making for people.

Abstract

CARTO's Head of Data Science, Stuart Lynn, Data Scientist, Wenfei Xu, and Senior Customer Success Manager, Jeff Ferzoco, discuss the use of spatial data science to study usage of public parks in New York City, including the types of analysis that can be done with geospatial data; the kinds of questions geospatial data can answer; collecting geospatial data; cleaning, processing and analyzing geospatial data; challenges presented working with this data; and future projects using spatial data science.


Understanding Usage of NYC Public Parks with Spatial Data Science: CARTO

