  • 00:00

    [MUSIC PLAYING] [An Introduction to Data Science and Spatial Big Data]

  • 00:09

    DR. ZHAO YANG: Hi. My name is Dr. Zhao Yang. I'm a research assistant professor at the computer science department of American University. [Dr. Zhao Yang, Research Assistant Professor, Computer Science, American University] My research interest is about the spatial big data analytics, high performance computing, and statistical machine learning series. [What is big data?]

  • 00:34

    DR. ZHAO YANG [continued]: Big data is a hot topic in recent years, you know, especially for the spatial big data. We now have the GPS device, and it's generated a large volume of spatial big data. So to understand the big data, we need to understand the four V's of the big data.

  • 00:55

    DR. ZHAO YANG [continued]: [The four Vs of big data] The first V is the volume. [Volume] That means that big data has a very large size. For example, your laptop is about GB [INAUDIBLE] hard disk, but for the big data, it's maybe in the TB and the PB. The TB is 1,000 GBs and the PB is 1,000 TBs of data.
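
The unit arithmetic above can be sketched in a few lines of Python. This is only an illustration: the helper names `convert` and `disks_needed` are made up for the example, and the 500 GB disk in the usage lines is a hypothetical size, since the laptop figure in the recording is inaudible.

```python
import math

# Decimal (SI) storage units, as stated in the talk:
# 1 TB = 1,000 GB and 1 PB = 1,000 TB.
BYTES_PER_UNIT = {"GB": 10**9, "TB": 10**12, "PB": 10**15}


def convert(value, src, dst):
    """Convert a storage size from unit `src` to unit `dst`."""
    return value * BYTES_PER_UNIT[src] / BYTES_PER_UNIT[dst]


def disks_needed(dataset_size, dataset_unit, disk_size_gb):
    """Whole disks of `disk_size_gb` gigabytes needed to hold the dataset."""
    return math.ceil(convert(dataset_size, dataset_unit, "GB") / disk_size_gb)


if __name__ == "__main__":
    print(convert(1, "TB", "GB"))       # gigabytes in one terabyte
    print(convert(1, "PB", "TB"))       # terabytes in one petabyte
    # 500 GB is a hypothetical disk size, not a figure from the talk.
    print(disks_needed(1, "PB", 500))
```

Seen this way, a single petabyte would fill thousands of laptop-sized disks, which is why volume alone pushes big data off a single machine.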

  • 01:21

    DR. ZHAO YANG [continued]: [Variety] The second V is the variety of the big data. That means the data comes from different sources, like, for example, your cell phone has a tracking data, and also we have the satellite images. And we have the surveillance drones.

  • 01:41

    DR. ZHAO YANG [continued]: It generates the video clips. All the data are in different formats. [Velocity] The third V is the velocity of the big data. For example, in those data, it grows very fast. We have drones called MQ-9's, Reaper drones.

  • 02:02

    DR. ZHAO YANG [continued]: It's 14-hour flight mission. We will generate over 70 TB, the video data. [Veracity] The fourth V is the veracity of the data. It means the data comes with uncertainty. For example, your cell phone records your location every second. But those GPS signal has some uncertainty.
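
To put positional uncertainty of this kind in concrete terms, here is a minimal Python sketch, not a method from the talk. It computes the great-circle distance between two GPS fixes with the haversine formula and checks whether they are closer than the error radius. The spherical Earth radius is an approximation, the function names are hypothetical, and the 10-meter default is the error figure Dr. Yang quotes.

```python
import math

EARTH_RADIUS_M = 6_371_000.0  # mean Earth radius; a spherical approximation


def haversine_m(lat1, lon1, lat2, lon2):
    """Great-circle distance in meters between two (lat, lon) points in degrees."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlmb = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(phi1) * math.cos(phi2) * math.sin(dlmb / 2) ** 2
    return 2 * EARTH_RADIUS_M * math.asin(math.sqrt(a))


def indistinguishable(fix_a, fix_b, error_m=10.0):
    """True if two fixes lie within the assumed GPS error radius of each other."""
    return haversine_m(*fix_a, *fix_b) <= error_m


if __name__ == "__main__":
    # Illustrative coordinates only; they are not taken from the talk.
    a = (38.93700, -77.08900)
    b = (38.93705, -77.08900)
    print(round(haversine_m(*a, *b), 2), indistinguishable(a, b))
```

Two fixes a few meters apart cannot be told apart at this accuracy, so an analysis built on raw coordinates would need to treat them as the same place.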

  • 02:25

    DR. ZHAO YANG [continued]: It's about 10 meters. The error is about 10 meters. It's not accurate, the locations. [What are the three types of knowledge required to be a data scientist?] The data science is including three expertises-- the domain expertise, the mathematics, and the computer science skills.

  • 02:49

    DR. ZHAO YANG [continued]: The domain expertise, especially for the spatial big data, means you know something about the characteristics of the spatial data. And the mathematics means you know the statistical skills to analyze the spatial big data. And the computer science skills means

  • 03:10

    DR. ZHAO YANG [continued]: you can program as your purpose, means you can write the Python or the R program to analyze your spatial big data. Let's talk about the skill side of a modern-day scientist.

  • 03:31

    DR. ZHAO YANG [continued]: So you are not only mathematician, you are not only a programmer, you are not only a business analyst. You need to understand your data, and you need to design the statistical model, and then you need to write the program yourself and generate and visualize the report.

  • 03:52

    DR. ZHAO YANG [continued]: Your client is not the programmer or not the mathematician. They may be the CEO. They may be business owners. You need to visualize-- generate the graphic report they can understand. So the report is about the business logic. It's not about any single part of the mathematics.

  • 04:16

    DR. ZHAO YANG [continued]: [What is the difference between spatial big data and other types of big data?] We always talk about the "spatial is special." That means the spatial data is not the traditional big data. It has its own characteristics. We have, it's called, structured data.

  • 04:38

    DR. ZHAO YANG [continued]: We have the vector data or the raster data, and then we have the semi-structured data like the GML or KML. The people who write the application for the Google Maps or Google Earth, we are familiar with the GML or KML. And then, we also have the unstructured data

  • 05:00

    DR. ZHAO YANG [continued]: like the satellite image or the video clips. [What is geoinformatics?] The second term is geoinformatics. So it's maybe not familiar with most of the people, but I think you have heard of bioinformatics, using

  • 05:24

    DR. ZHAO YANG [continued]: information science to analyze the biological data. The geoinformatics, it's a similar term. We are using information science to analyze the geographical data. So it's including the category, the GPS-- Global Positioning System-- and the global GRS--

  • 05:48

    DR. ZHAO YANG [continued]: Global Remote Sensing-- and the GIS-- Geographical Information System-- to analyze the spatial big data. Let's talk about the sample of the spatial big data. For example, you know that there is a missing airplane, the Malaysia Airlines MH370.

  • 06:09

    DR. ZHAO YANG [continued]: We cannot find this airplane, because it didn't provide the tracking data to the airline companies. And so the GPS tracking data is one kind of the big data. Another kind of data is satellite image. You know that you have the remote-sensing satellite,

  • 06:31

    DR. ZHAO YANG [continued]: like the Landsat. It generates TB-level satellite image per day. And we have the drones. The US army has the MQ-9, the Reaper drones. A 14-hour flight mission will generate over 70

  • 06:53

    DR. ZHAO YANG [continued]: TB-level spatial big data. We have a perfect example about the importance of the spatial big data. In those areas of the missing airplane, the Malaysia Airlines MH370 Series 70 and the airplane company, they just shut down the tracking. It appears the tracking device on that airplane,

  • 07:14

    DR. ZHAO YANG [continued]: they only cost $15 per month. But when the airplane is missing, we cannot find it. So they cost about, I think, a billion dollars to locate the airplanes. We have the cell phone tracking data. There's about 6 billion cell phones on the world.

  • 07:37

    DR. ZHAO YANG [continued]: How much tracking data were generated for these kind of cell phones? So what's the motivation for the spatial big data analysis? For example, we can track your cell phones, and we can analyze your spatial patterns. So we can try to find your behavior patterns. So for example, the cops can do some crime analysis

  • 08:02

    DR. ZHAO YANG [continued]: or some behavior analysis for the high-risk people to find the terrorists. The statistical analysis software we use is R. And there is some traditional statistical software, like the SAS or SPSS. They are a commercial product, and they

  • 08:23

    DR. ZHAO YANG [continued]: are popular in the pharmaceutical and the financial industry. But the licensing was very expensive, and you cannot define your own analysis functions. But R is open source, and it's popular in academia. You can develop your own statistical analysis functions and share with others.

  • 08:45

    DR. ZHAO YANG [continued]: Yes. And R has an open source community called CRAN. Yes. It's a very good community to communicate with other spatial big data users. To analyze the spatial big data, we need a distributed computing

  • 09:10

    DR. ZHAO YANG [continued]: environment. For example, you have a very small Excel CSV file you can analyze on your laptop. But if the data size is very big, like, it's over 1 TB, you cannot analyze it on your laptop. You need a distributed cluster to upload, and process,

  • 09:31

    DR. ZHAO YANG [continued]: and analyze that spatial big data. So a popular solution is called Hadoop or Spark. The Hadoop is developed from the Apache Foundation. It's an open source solution for the big data. And the Spark is the next generation,

  • 09:52

    DR. ZHAO YANG [continued]: the distributed environment. It's faster than Hadoop, but it's a little more difficult than the Hadoop. With Hadoop, the only thing you need to know is the Java programming. But for Spark, you need to understand and learn

  • 10:12

    DR. ZHAO YANG [continued]: a new language called Scala. Yes. It's popular, I think, in the internet companies, but may not be popular in some traditional industries. The spatial data warehouse is the reality of the traditional data warehouse. It's designed for the spatial data. And it's helped to store, to process,

  • 10:36

    DR. ZHAO YANG [continued]: and to analyze the very large scale spatial big data. For the traditional data warehouse, we have the IBM DB2 and the Teradata. But for the spatial data warehouse, most of the users they design their spatial big data warehouse based on an open source solution like Hadoop.
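
Hadoop's processing model, MapReduce, splits work into a map step that runs independently on each chunk of data and a reduce step that merges the partial results. As a rough illustration only, the plain-Python sketch below counts GPS points per grid cell in that style. It does not use any Hadoop API. The names `cell_of`, `map_partition`, and `reduce_counts`, and the one-degree cell size, are hypothetical choices for the example.

```python
import math
from collections import Counter
from functools import reduce


def cell_of(lat, lon, cell_deg=1.0):
    """Map a coordinate to the integer index of its grid cell."""
    return (math.floor(lat / cell_deg), math.floor(lon / cell_deg))


def map_partition(points, cell_deg=1.0):
    """Map step: count the points per grid cell within one chunk of data."""
    return Counter(cell_of(lat, lon, cell_deg) for lat, lon in points)


def reduce_counts(partials):
    """Reduce step: merge the per-chunk counts into one global count."""
    return reduce(lambda left, right: left + right, partials, Counter())


if __name__ == "__main__":
    # Two chunks standing in for data held on two cluster nodes.
    # The coordinates are illustrative, not taken from the talk.
    chunk_a = [(39.9, 116.4), (40.1, 116.4)]
    chunk_b = [(39.5, 116.7)]
    partials = [map_partition(chunk) for chunk in (chunk_a, chunk_b)]
    print(dict(reduce_counts(partials)))
```

Because each chunk is counted without reference to the others, the map step could run on separate machines, which is the property that lets this pattern scale to data too large for one laptop.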

  • 10:58

    DR. ZHAO YANG [continued]: For the statistical computing environment, we're using R as the open source platform. And so traditional R is running on the single desktop. But we have an extension to let R run on distributed clusters.

  • 11:20

    DR. ZHAO YANG [continued]: Yes. That's the spatial big data analytics solutions. The spatial big data mining technology is a little bit different to traditional data mining, because the spatial data is different with traditional data. Spatial data is the data with the coordinates--

  • 11:42

    DR. ZHAO YANG [continued]: so with longitude and latitude. And also, the two adjacent spatial data, they also have the autocorrelation with each other. So to do the spatial data mining you need to understand the feature of the spatial data. All this work is based on the statistical prediction models

  • 12:03

    DR. ZHAO YANG [continued]: for the spatial big data. This is a very cool example about spatial big data analytics-- it's about the ocean wind on the global level. So it's an interactive and 3D website. You can zoom in, and click, and get the detailed information about ocean wind.

  • 12:25

    DR. ZHAO YANG [continued]: The backend of this project is a supercomputer from NOAA, the National Oceanic and Atmospheric Administration, and the front end is an interactive web. So you definitely want to try this, because it's very cool. The second project is the cyber attack maps. It demonstrates the cyber attacks

  • 12:47

    DR. ZHAO YANG [continued]: on the internet in real-time. So you can see the origin of the attackers and the target of the attackers. So probably, you can see that the attacker is from the North Korea, and the target of the attacks is maybe in Washington DC. And you also can trace and analyze the cyber attacks

  • 13:11

    DR. ZHAO YANG [continued]: that-- one specific cyber attacks. The third project is the taxi tracking. And this project is by the Microsoft of Beijing. And you install the GPS tracking device on 10,000 taxis. And you can see that this is the taxi tracker 3.

  • 13:31

    DR. ZHAO YANG [continued]: In one week, you see that the different taxis have different behavior patterns. Some taxi may pick up the customers at the airport, but some taxi drivers may prefer to stay at the downtown areas. So this is a very interesting projects. I like it.

  • 13:58

    DR. ZHAO YANG [continued]: Big data is pervasive data type. So almost everything is carrying with the location and the timestamp. And the spatial data is very special data. It's not the traditional data. The spatial data has its own characteristics and features.

  • 14:20

    DR. ZHAO YANG [continued]: To analyze the spatial big data, you should have some domain knowledge of the geographies. You also need to understand some statistical results to analyze the spatial big data. Also, you need to have some computer science skills like the R or Python to write a program

  • 14:41

    DR. ZHAO YANG [continued]: to implement the spatial statistical algorithms. The traditional data mining technology usually doesn't apply to the spatial data. Spatial data is carried with the locations-- the longitude and latitude. So the traditional data mining algorithm

  • 15:04

    DR. ZHAO YANG [continued]: doesn't work for the spatial data, where for the spatial data mining community, we developed some specific algorithms about the spatial data mining. Also, you can see one important way to analyze the spatial data is the visualization. All the three user cases we just demonstrate

  • 15:26

    DR. ZHAO YANG [continued]: the spatial big data on the map. Yes. That's the difference between the spatial big data and the traditional big data. Well, they must do everything on the map. We need three kinds of skills you have. You need the domain expertise. Like, you need to understand the what's the GPS signal?

  • 15:46

    DR. ZHAO YANG [continued]: What's the specifics of the geography knowledge? And then you need some statistical knowledge about the basic spatial data mining algorithms. Also, you need some basic computer science skills, like you need to write the R or Python program to implement the spatial data mining algorithms.
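
One basic spatial statistic such a program might implement is Moran's I, a standard measure of the spatial autocorrelation mentioned earlier in the talk. The talk does not name a specific algorithm, so this choice is an assumption, and the sketch below is a minimal pure-Python version rather than a production implementation. The helper `chain_weights` is hypothetical and builds the simplest possible neighbor structure, sites along a line.

```python
def chain_weights(n):
    """Binary adjacency weights for n sites on a line: site i neighbors i-1 and i+1."""
    return {(i, j): 1.0 for i in range(n) for j in (i - 1, i + 1) if 0 <= j < n}


def morans_i(values, weights):
    """Moran's I for `values` given a dict of pairwise weights {(i, j): w_ij}.

    I = (n / W) * sum_ij w_ij (x_i - mean)(x_j - mean) / sum_i (x_i - mean)^2,
    where W is the sum of all weights. Values must not all be equal.
    """
    n = len(values)
    mean = sum(values) / n
    dev = [v - mean for v in values]
    total_weight = sum(weights.values())
    cross = sum(w * dev[i] * dev[j] for (i, j), w in weights.items())
    variance_sum = sum(d * d for d in dev)
    return (n / total_weight) * (cross / variance_sum)


if __name__ == "__main__":
    w = chain_weights(4)
    print(morans_i([1, 2, 3, 4], w))    # smooth trend: neighbors are similar
    print(morans_i([1, -1, 1, -1], w))  # alternating: neighbors are dissimilar
```

A positive value indicates that neighboring sites tend to carry similar values and a negative value that they tend to differ. That dependence between neighbors is what ordinary data mining methods, which assume independent observations, do not account for.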

  • 16:08

    DR. ZHAO YANG [continued]: I think the spatial big data is very, very interesting. You see that we are not the boring mathematics when we have some interactive maps that you can interact with the data on the map.


Dr. Zhao Yang, PhD, Research Assistant Professor in Computer Science at American University, discusses data science and spatial big data, including how big data is defined, the three types of knowledge required to be a data scientist, other skills a modern data scientist needs, the difference between spatial big data and other types of data, what geoinformatics is, analyzing spatial big data, and recommendations for students interested in working with spatial big data.


An Introduction to Data Science & Spatial Big Data
