  • 00:01

    [MUSIC PLAYING][An introduction to Data Mining and Complexity Science]

  • 00:10

    BRIAN CASTELLANI: My name is Brian Castellani. My master's degree is in clinical psychology. And my doctorate is in sociology. I was formerly at Kent State University in the United States. And I am now at Durham University here in the UK. [What is data mining?]

  • 00:31

    BRIAN CASTELLANI [continued]: Well, there's lots of definitions of data mining. I think probably the easiest way to think about data mining is that conventional statistics did a fantastic job. They've helped us to understand a lot of phenomena sociologically, in terms of health, and even in the natural sciences. But as things have become increasingly complex--

  • 00:56

    BRIAN CASTELLANI [continued]: and we live now in a world of these massive, quote unquote, "big data" databases-- new techniques are needed. And data mining is really about creating those techniques for doing that type of data analysis. Now, that doesn't mean the conventional things that students learn in school aren't still used

  • 01:18

    BRIAN CASTELLANI [continued]: like regression or different things. But it also means that there's a new set or a new repertoire of techniques. And what's exciting about those techniques, I think, particularly for people just coming into academia now, is that these techniques resonate better with the world in which they live, or the worlds, in terms of living in the world of social media

  • 01:42

    BRIAN CASTELLANI [continued]: and Twitter and Instagram and geospatial sort of analyses, that we have a find-your-friends and what's going on and how do you turn something viral. Well, companies are interested in those same questions. My nephew, he has an Apple Watch, for example. And he had-- we were sitting there talking about data.

  • 02:02

    BRIAN CASTELLANI [continued]: And he said, well, what's big data? I said, well, bring up your steps daily on your Apple Watch. And there he had five years of every single day of his life, every single step that he had taken and all this information. And so the question was, well, how would you analyze that? It's taking place in real time. It's taking place across a pretty considerable amount

  • 02:25

    BRIAN CASTELLANI [continued]: of time, and then you times that by, say, 200 million people or two billion people. How would you actually go about doing that? So that's the challenge data mining is up against. Again, to make that point though, it doesn't replace conventional statistics. It adds to them. So as far as the actual act of doing data mining, which

  • 02:49

    BRIAN CASTELLANI [continued]: is a bit different from what I just said in terms of just the general purpose or focus, data mining is of one of two types. It's either working with new data based on ideas that we already have. So for example, I already know, say, that a particular shopper likes a particular type of crisp.

  • 03:12

    BRIAN CASTELLANI [continued]: And so I know that that shopper goes daily to the store or weekly to the store and buys those crisps. So then my question is, what if I put a slightly different crisp next to that? Would they buy it? So that's having pre-existing ideas about things and then seeing if you can use that to predict new behavior. And that's obviously important when

  • 03:33

    BRIAN CASTELLANI [continued]: you think about predicting elections, predicting sales, predicting health trends, Google Trends, for example, looking at how flu or different types of disease progressed. And then it gets quite sophisticated, say, in my field with community health or the Centers for Disease Control in the States or the World Health Organization, where now you're very, very concerned

  • 03:55

    BRIAN CASTELLANI [continued]: with pandemic outbreaks. And how could you follow what you know to where you don't, or predict? The other side is exploratory. And that's where you have no clue. You don't know what's going on in the data. And so you're scraping or crawling. These are the sorts of terms you hear a lot.

  • 04:17

    BRIAN CASTELLANI [continued]: Scraping data is taking data that you know is out there for a particular topic. Like for example, you might be interested in views people have on a particular politician or a particular social topic. And so you scrape the internet, existing websites, for that. Or you can crawl to see how things are moving across time.

  • 04:40

    BRIAN CASTELLANI [continued]: So somebody posted something here and you want to follow it through. And then again, it gets more concrete in terms of the type of work that people like myself do, which is how, in the area of, say, medical informatics, for example, how are physicians treating a particular disease? And how is that disease treatment trending across time?

  • 05:03

    BRIAN CASTELLANI [continued]: What sorts of results are coming from that? And can we identify best practices from that? And so that extends out to anything-- the airline industry to community health to environmental protection, species, and so forth. So, again, one is having existing knowledge and trying to predict in very large databases

  • 05:24

    BRIAN CASTELLANI [continued]: or trying to work through very large databases to hopefully come into or come across patterns that aren't expected or easily identified. [What other techniques are used in data mining?] I talked about what data mining is, then what data mining does.

  • 05:50

    BRIAN CASTELLANI [continued]: And then it gets down in terms of the actual techniques themselves-- which range across an entire compendium of tools and possibilities and methods. There's machine intelligence. There's artificial neural nets. There's genetic algorithms, geospatial modeling.

  • 06:11

    BRIAN CASTELLANI [continued]: Things most people are probably more familiar with is social networking, complex social network analysis. But then also agent-based modeling. What's nice about agent-based modeling is you can then take these big data, data mining insights that you've had and simulate them, which is obviously extremely important. We see this every day with the weather.

  • 06:32

    BRIAN CASTELLANI [continued]: We have these models of what's supposed to happen. But we see that even with massive, massive data, it's still incredibly difficult to predict-- if it's going to rain in three hours. It's not-- so there's wonderful techniques. But it's not always as easy to achieve the goals that we're after.

  • 06:53

    BRIAN CASTELLANI [continued]: So what I encourage people constantly is you have to use a repertoire of techniques, which is new to the social sciences. In the hard sciences and the computational sciences they're used to picking tools based on the research question being asked. In the social sciences, we tend to ask research questions that fit the tools we've learned. And that in my mind is probably the biggest cultural shift

  • 07:16

    BRIAN CASTELLANI [continued]: that has to take place in the social sciences if these sorts of data mining techniques are going to be embraced. [What is big data?] Nobody really knows. It's a neologism in the sense that if you went back

  • 07:38

    BRIAN CASTELLANI [continued]: to the first computer I had, which ran on floppy drives, big data was two megabytes of data, which shut my machine down. So I think in many ways, big data means for a lot of people just data larger than what they're normally used to dealing with. And that's not in my mind what the real definition of big data

  • 08:01

    BRIAN CASTELLANI [continued]: is. In my mind big data has to do with complexity. And it has to do with the reach. So big data in my mind is best viewed as complex data in that it's multi-level. You might have an individual user. Then you might have a group level, then a state level, then a regional level to the global level.

  • 08:24

    BRIAN CASTELLANI [continued]: Variables are no longer singular. You have to study variables in terms of how they relate to one another. And then you've also got this issue of time, which is very difficult for social science to deal with. When you put all that together-- and then the last piece, which is that a lot of this data

  • 08:46

    BRIAN CASTELLANI [continued]: is what we call unstructured. That means data that isn't already in an existing format for analysis. Text data is a great example of that. So big data is really about all those things. It's about time. It's about complexity. It's about the degree of interdependence between the variables.

  • 09:07

    BRIAN CASTELLANI [continued]: And it also has to do with the structure the data comes in. And then I would say lastly it really is then-- to go back to what most people say-- the sheer volume of information. I think the big challenge right now is what's the difference between big data versus good data? And that's probably the biggest struggle in the field.

  • 09:30

    BRIAN CASTELLANI [continued]: [What are the limitations of conventional methods and statistics of data mining?] In terms of struggle, it's not-- the struggle doesn't just exist in big data. In a book that I'm presently completing for Sage called Data Mining Big Data: Complex and Critical Inquiry, we take

  • 09:50

    BRIAN CASTELLANI [continued]: to task not just big data and data mining but also the conventions of social science research. The problem with the current conventions is really a manifestation of what I was just talking about. The challenges of time and interdependence and complexity

  • 10:10

    BRIAN CASTELLANI [continued]: and data flowing in real time and lacking structure-- all those sorts of things are a major challenge to the conventions of statistical social scientific analysis. Historically, statistics was created in the field of statistical mechanics

  • 10:32

    BRIAN CASTELLANI [continued]: to get averages, to get a sense of, say, where a gas molecule was in a jar. Well, you can't pinpoint the gas molecule. But you could take an average and a guesstimate. And then we had good ideas of how molecules distributed within some, say, system, like a glass. Then we found, well, you could try that with voting behavior,

  • 10:53

    BRIAN CASTELLANI [continued]: for example, the average vote. But we see now in the world of data mining and big data that oftentimes elections are coming down to very particular areas within a very particular kind of country within a particular setting and a particular group of people based on a particular set of factors. Conventional statistics-- they're

  • 11:13

    BRIAN CASTELLANI [continued]: not designed to do that. And we've seen that-- say, for example, in recent elections throughout the United States and in the UK and throughout Europe, where the survey researchers said, oh, this is going to happen. And it didn't happen. So there is this sense that conventional statistics

  • 11:34

    BRIAN CASTELLANI [continued]: are very good. But when it comes to predicting differences and predicting particulars, conventional statistics is not very good at that. Because they're looking for the aggregate, the average. So say, for example, a patient comes in to a doctor's office.

  • 11:55

    BRIAN CASTELLANI [continued]: And the patient presents with a certain set of symptoms. And then the physician says, well, based on this I think you have this. And then this is the course of treatment. That would be a conventional approach based on statistics. The average person with these symptoms would therefore most likely have this.

  • 12:16

    BRIAN CASTELLANI [continued]: And then this is the best option for treatment. Then after that doesn't work, they suddenly say, maybe this person isn't the average. And then they kick in a different form of treatment. Well, if it's just a cold, that's not that significant. However, if it's cancer or if it's some very serious disease,

  • 12:38

    BRIAN CASTELLANI [continued]: that kind of time doesn't always exist. The physician doesn't have that level of luxury. What conventional statistics needs to do better is to say, rather than there's just a patient average, it needs to say there are multiple trajectories, multiple trends. And the physician's job is to figure out which trend the patient is on, not just based on the symptoms

  • 13:02

    BRIAN CASTELLANI [continued]: they present with-- which is what everybody does-- but the particulars of that person. So if somebody comes in and they're, say, 10 and they have chest pain, and they're palpitating, and they're sweating, I'm less likely to think that they're having a heart attack than if somebody was 90. So that would be an obvious thing. But when you get into more nuanced sorts of distinctions,

  • 13:28

    BRIAN CASTELLANI [continued]: say, for example, reaction to a medication or something-- and the way I-- when I explain this to people and they say, well, Brian, OK, that's a good and fine argument but-- well, then I refer them to what in the States is called the Physicians' Desk Reference manual. And it's about that thick. The pages here, which are lined in red,

  • 13:48

    BRIAN CASTELLANI [continued]: are the actual medications. And the rest of the book, which is tissue paper thin, are the side effects because of differences. It's because a medicine is averaged on a particular person or a particular group of people. And then this is what happens to everybody else that takes that medicine who's not the average. So the challenge that data mining has put forward

  • 14:10

    BRIAN CASTELLANI [continued]: is to say maybe we need to get rid of the word average. Maybe there is really no such thing as an average. Maybe there are just differences. And then in those differences, an average would be a smaller type of thing. So I would be much more interested in the average response to a medication amongst the smaller group of people

  • 14:31

    BRIAN CASTELLANI [continued]: than I would to just take an aggregate average of, say, millions. That takes a lot of difference in thinking, because conventional statistics has built a very, very good repertoire of validity and reliability techniques. Once you start breaking populations into so many small subgroups, the first and immediate

  • 14:54

    BRIAN CASTELLANI [continued]: question is how do you know that's correct? And I think that's where data mining has its own limitations. So this is where moving from the limitations of conventional statistics to the limitations of data mining-- data mining still struggles with issues of validity and reliability in that respect. And so back to, say, for example, Pandora, the radio website.
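
[Illustrative sketch, not part of the interview] The contrast drawn above between one aggregate average and averages within smaller groups, and the validity question that follows once groups get small, can be made concrete with a few lines of Python. Everything here is an assumption for illustration: the group names and response scores are made up, and the standard error is used only as one simple stand-in for "how do you know that's correct?".

```python
from math import sqrt
from statistics import mean, stdev

# Hypothetical medication-response scores (higher = better) for three made-up subgroups.
responses = {
    "group_a": [8.0, 9.0, 10.0, 9.0],
    "group_b": [1.0, 2.0, 1.0, 2.0],
    "group_c": [4.0, 6.0],  # deliberately tiny subgroup
}


def summarize(values):
    """Return (mean, standard error of the mean) for a list of scores."""
    m = mean(values)
    # The standard error needs at least two observations; otherwise report None.
    se = stdev(values) / sqrt(len(values)) if len(values) > 1 else None
    return m, se


# One aggregate average over everybody hides the fact that the groups differ sharply.
all_values = [v for values in responses.values() for v in values]
aggregate_mean, aggregate_se = summarize(all_values)

# Subgroup averages expose the differences, but small groups give shakier estimates.
subgroup_summary = {name: summarize(values) for name, values in responses.items()}

print("aggregate:", aggregate_mean, aggregate_se)
for name, (m, se) in subgroup_summary.items():
    print(name, "n =", len(responses[name]), "mean =", m, "standard error =", se)
```

In this toy data the aggregate mean describes almost nobody, while the two-person subgroup has a mean that looks reasonable but rests on very little evidence, which is the trade-off the speaker describes.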

  • 15:20

    BRIAN CASTELLANI [continued]: So I go on Pandora and I'll say, oh, I like this particular band, Genesis. And then it says, oh, you like Genesis? So you're like, yes. OK, great. Then it's like, maybe you'll like ELO. And then after about a month I'm bored. Because the algorithm just keeps giving me the same stuff over and over again. How do you get something new in there? How do you get something more interesting?

  • 15:41

    BRIAN CASTELLANI [continued]: So the danger is you can say, oh, yes, data mining does a better job than conventional statistics because it breaks things up better. But in turn, that doesn't necessarily mean that that breaking apart is sufficiently valid or reliable. And people oftentimes, as we know, myself included,

  • 16:04

    BRIAN CASTELLANI [continued]: we're not that singular in any particular way of being. We're complex and contradictory people. So all of these things have limitations. And I think it's important for people watching this video to take a critical approach.

  • 16:26

    BRIAN CASTELLANI [continued]: David Byrne and Gill Callaghan wrote a book in 2013 on social complexity, which I strongly recommend. And that's their argument, is you can't just-- OK, great. Data mining's wonderful. And big data is interesting. And this is the world we live in. But if you're not critical and thoughtful about what you're doing and how you're doing it,

  • 16:47

    BRIAN CASTELLANI [continued]: you risk being arrogant in the opposite direction. Oh, I know all these new techniques. And conventional statistics has all of these problems. But that doesn't mean the new doesn't have its problems as well. So you see this with Facebook recently in terms of how things can be easily manipulated without people really understanding what's happening.

  • 17:08

    BRIAN CASTELLANI [continued]: So it's very important to take a critical perspective on all of it. And to not just assume that a limitation means something isn't any good. [What do people need to consider when working with more complex data mining methods?] So probably, for those thinking about how would they

  • 17:29

    BRIAN CASTELLANI [continued]: critically move from, say, what they're doing to embracing these new ideas but in a thoughtful way that actually helps them do their work-- be it an undergraduate student or doctoral student or a researcher, professor-- there's a map of the complexity sciences that I created. You can Google it.

  • 17:50

    BRIAN CASTELLANI [continued]: And it really shows a nice overview of the different methods and areas that have been created. And it takes me to a larger point, which is the emergence of this field called the complexity sciences. Alternatively talked about as complexity theory. The complexity sciences have, I think,

  • 18:12

    BRIAN CASTELLANI [continued]: provided a good model for how to proceed-- which is, the goal in scientific analysis should not be to fit data to method. But it should be to fit method to data. And the hard sciences and the computational sciences have always just done a better job of that.

  • 18:33

    BRIAN CASTELLANI [continued]: You don't see the type of divides that you see in the social sciences between quantitative and qualitative or these sorts of philosophical arguments about how one method is better than the other. You're trained in engineering or in physics or in applied mathematics to simply look at the problem at hand and then ask what

  • 18:54

    BRIAN CASTELLANI [continued]: is the tool that best works-- and more importantly, which combination of tools works well? And I think that that's what I would strongly recommend both students and my colleagues to start doing more of and do better, is to actually start thinking, what is my research question?

  • 19:14

    BRIAN CASTELLANI [continued]: And what set of techniques would best fit analyzing that question or modeling that question or exploring that question? I think the other thing that students and colleagues need to do is they need to get much more interdisciplinary in their work.

  • 19:34

    BRIAN CASTELLANI [continued]: I think that's already happening. I see that amongst my students-- certainly, a much more open-minded view on, well, there might be these other areas that influence what I'm doing. But you see in academia in the UK, for example, and the States, there's this strong emphasis on specifying and focusing.

  • 19:56

    BRIAN CASTELLANI [continued]: And you do social science or you do maths, or you do this or that. I think that's somewhat problematic, because it silos people into thinking they've learned what they need to know and that this works for everything. And I think getting people to be more interdisciplinary--

  • 20:17

    BRIAN CASTELLANI [continued]: start that again. I think getting people to be more interdisciplinary and getting people to think about what method best fits the question, although those things sound simple, they're a lot harder to do than they actually sound. Because you've got to move out of your comfort zone. And that would be the third thing I would say. We ran an ESRC workshop on interdisciplinary mixed methods

  • 20:44

    BRIAN CASTELLANI [continued]: at University of Warwick. And in the room we had artists, physicists, mathematicians, sociologists, psychologists, policy evaluators, nursing, social work, the entire-- we just argued for three years. It was wonderful. We had to challenge each other on, well, when you say the word non-linear, what does that mean?

  • 21:05

    BRIAN CASTELLANI [continued]: The physicist has a very specific idea of what that means versus, say, a policy analyst. And so if you really want to move critically into these new areas, you have to be willing to take the risk of being wrong. You have to take the risk of not knowing everything, which we're unfortunately educated into thinking we shouldn't admit,

  • 21:29

    BRIAN CASTELLANI [continued]: and being willing to work in teams and be open to ways of thinking that you haven't been educated in or haven't been trained in. And those are really, I think, significant limitations that are above and beyond the methods themselves.

  • 21:50

    BRIAN CASTELLANI [continued]: When you start dealing with those, then the methods aren't so difficult. Because oftentimes if you do that well you have a colleague who can help you. [Tell us about your case-based complexity project.] So as concerns this issue of figuring out

  • 22:10

    BRIAN CASTELLANI [continued]: which method or tool or technique best fits with the question being asked or the question one's interested in exploring, my colleagues and I have developed a particular repertoire of techniques that go under the name of case-based complexity. This starts with the case-based methods tradition that Ragin and others started in the States with Ragin

  • 22:33

    BRIAN CASTELLANI [continued]: and Becker's famous book What Is a Case? What made that so radical was that the social sciences and health sciences, when it came to doing actual research, always studied variables-- race, class, gender, occupation. But human beings aren't variables. They're profiles.

  • 22:54

    BRIAN CASTELLANI [continued]: They're configurations of variables. So as I'm sitting here, a series of different factors describe who I am and how I act and how I do the things that I do. It's not a particular variable. And so Ragin challenged the field by saying we should be studying cases. Now, what's ironic about this is a field like medicine

  • 23:18

    BRIAN CASTELLANI [continued]: or community health or social work or psychology or any sort of discipline that's focused on caring for people or caring for communities is those communities, those people are cases-- the case in front of me. This particular person with this particular problem suffering this particular poverty.

  • 23:39

    BRIAN CASTELLANI [continued]: And yet when we go to analyze those cases we analyze variables instead. And then we look at the relative impact that a variable has on others. So you get ridiculous types of questions as, which is more important in the usage of, say, new computer technology, educational background or occupation? It's obviously both, in the varying degrees

  • 24:01

    BRIAN CASTELLANI [continued]: in which they intersect. So I think, say, for example, Patricia Hill Collins and the work that Sylvia Walby and others are doing in this field known as intersectionality is really starting to move people closer to what, on a methodological level, case-based methods are trying to accomplish as well, which is to start thinking

  • 24:22

    BRIAN CASTELLANI [continued]: about the way in which things intersect and how those particular configurations manifest themselves as different types of cases. So you see that takes this back to what I originally started talking about, which is data mining being focused on differences, subsetting populations and breaking them apart.
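
[Illustrative sketch, not part of the interview] The idea of treating people as configurations of variables rather than weighing variables one at a time can be shown with a simple tabulation of profiles. This is only a toy in the spirit of the computer-technology example above, not an implementation of any published case-based method: the six cases, the condition names `degree` and `office_job`, and the outcome `uses_tech` are all hypothetical.

```python
from collections import defaultdict

# Hypothetical cases: each person is a profile of binary conditions plus an outcome.
cases = [
    {"degree": 1, "office_job": 1, "uses_tech": 1},
    {"degree": 1, "office_job": 1, "uses_tech": 1},
    {"degree": 1, "office_job": 0, "uses_tech": 1},
    {"degree": 1, "office_job": 0, "uses_tech": 0},
    {"degree": 0, "office_job": 1, "uses_tech": 0},
    {"degree": 0, "office_job": 0, "uses_tech": 0},
]
conditions = ("degree", "office_job")
outcome = "uses_tech"


def profile_table(cases, conditions, outcome):
    """Group cases by their configuration of conditions.

    Returns {profile: (number_of_cases, share_of_cases_showing_the_outcome)}.
    """
    grouped = defaultdict(list)
    for case in cases:
        profile = tuple(case[name] for name in conditions)
        grouped[profile].append(case[outcome])
    return {profile: (len(vals), sum(vals) / len(vals)) for profile, vals in grouped.items()}


table = profile_table(cases, conditions, outcome)
for profile, (n, share) in sorted(table.items(), reverse=True):
    print(dict(zip(conditions, profile)), "cases:", n, "share with outcome:", share)
```

Reading the table by row, rather than asking which single column matters more, is the shift from variables to cases: the outcome differs by combination of education and occupation, not by either one alone.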

  • 24:42

    BRIAN CASTELLANI [continued]: Well, that's what a case-based approach is trying to do. So moving from Ragin forward, a lot of people have developed a lot of different techniques and styles. Probably the most notable is Ragin's own, which is called QCA, also known as Qualitative Comparative Analysis. And Qualitative Comparative Analysis

  • 25:03

    BRIAN CASTELLANI [continued]: is attempting to do this: identify the dominant major/minor profiles in a particular data set. And then explore how causality or complexity amongst those variables explains some particular outcome-- maybe educational success. What my colleagues and I did, starting with David Byrne

  • 25:24

    BRIAN CASTELLANI [continued]: and really taking our lead from him and others, is to try to take what Ragin and colleagues were doing and integrate it into this data mining, big data, complexity science framework. So to take those traditional methods and then, again, with this idea of a mixed methods perspective, intersect them with these more recent developments

  • 25:48

    BRIAN CASTELLANI [continued]: in computational thinking and so forth. So what we've been doing is trying to study cases by using a lot of the techniques that I talked about earlier. So as far as this idea of case-based methods goes and taking the ideas that Ragin had developed and then integrating them, we're still

  • 26:09

    BRIAN CASTELLANI [continued]: confronted with this issue of how do you actually do that? And it goes back to this issue of how does one start, say, for example, from a qualitative tradition or from a conventional statistical position and move into these methods? So to come around to this issue of then, well, how does one actually get about doing this? And how does one move, say, based

  • 26:31

    BRIAN CASTELLANI [continued]: on what one has been trained in, whether it be qualitative, historical, or conventional statistics, to actually using these techniques, even with a willingness and an open-mindedness towards these and doing the things I was talking about-- being interdisciplinary, working with colleagues from other fields and being open to new ideas and being wrong

  • 26:52

    BRIAN CASTELLANI [continued]: or not knowing everything? You still get down to the practical issue of using these methods. And the reality is a lot of them are quite difficult to use-- say, for example, deep learning, like with machine intelligence. So what my colleagues and I have been doing

  • 27:12

    BRIAN CASTELLANI [continued]: is trying to develop a software app that can help people easily make use of a lot of these techniques. So I received a fellowship through CECAN, which is the Centre for the Evaluation of Complexity Across the Nexus. It's housed at the University of Surrey. Nigel Gilbert is head of project.

  • 27:35

    BRIAN CASTELLANI [continued]: And this has been our task, is to create a more seamless, user-friendly environment. And we've put it in this package called RStudio, which is a freeware package. And so that means people can go to GitHub and download this for free. And use it, modify it. Do what they want with it.

  • 27:56

    BRIAN CASTELLANI [continued]: Presently, we have in there cluster analysis, topographical neural net, some machine learning, data visualization techniques. And we're now working on a tab that deals with simulating different scenarios, with the eventual goal of integrating agent-based modeling into it as well.
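
[Illustrative sketch, not part of the interview] For readers who have not met cluster analysis, the first technique listed above, here is a minimal k-means routine in plain Python. It is a generic textbook version under stated assumptions, not the code of the software the speaker describes (which is R-based): the six two-dimensional points are made up, and the starting centroids are chosen by hand so the run is deterministic.

```python
from math import dist


def kmeans(points, centroids, iterations=10):
    """Plain k-means: assign each point to its nearest centroid, then move
    each centroid to the mean of its assigned points, until nothing changes."""
    centroids = [tuple(c) for c in centroids]
    labels = []
    for _ in range(iterations):
        # Assignment step: index of the closest centroid for every point.
        labels = [
            min(range(len(centroids)), key=lambda k: dist(p, centroids[k]))
            for p in points
        ]
        # Update step: recompute each centroid as the mean of its members.
        new_centroids = []
        for k in range(len(centroids)):
            members = [p for p, label in zip(points, labels) if label == k]
            if members:
                new_centroids.append(tuple(sum(axis) / len(members) for axis in zip(*members)))
            else:
                new_centroids.append(centroids[k])  # keep an empty cluster's centroid
        if new_centroids == centroids:
            break  # converged
        centroids = new_centroids
    return labels, centroids


# Hypothetical cases described by two numeric variables each.
points = [(1.0, 1.0), (1.0, 2.0), (2.0, 1.0), (8.0, 8.0), (9.0, 8.0), (8.0, 9.0)]
labels, centroids = kmeans(points, centroids=[points[0], points[3]])
print(labels)
print(centroids)
```

Each resulting cluster can be read as a candidate type of case, and the centroid as that type's typical profile, though whether such a grouping is valid is exactly the critical question raised earlier in the interview.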

  • 28:17

    BRIAN CASTELLANI [continued]: It's a great place, I think, for people to start. Because it's not particularly difficult to use. And it is a good way to get in and just sort of explore basic Excel files or basic databases, and try to start using those techniques. And then graduate up from that to maybe taking a stab at MATLAB or something like that in programming.

  • 28:40

    BRIAN CASTELLANI [continued]: But I don't think the field will proceed if everybody thinks that after having listened to what I said that they've got to go out and now become experts in all these different techniques. I can't do that myself. But you can learn them enough to make use of them. And that's what complexity is about.

  • 29:05

    BRIAN CASTELLANI [continued]: [MUSIC PLAYING]

Abstract

Professor Brian Castellani, PhD, discusses data mining and big data, including the limitations of conventional methods and the statistics of data mining and considerations when working with complex data mining methods, and concludes with an overview of his case-based complexity project.


An Introduction to Data Mining & Complexity Science
