  • 00:01

    [MUSIC PLAYING] My name's Louise Corti. I'm one of the associate directors at the UK Data Archive here at Essex. And my main role is as director of collections development and producer relations. Collections development means getting in high quality data

  • 00:23

    from depositors, in various domains, and then making data available here. Producer relations is really about supporting those creators of data to give us the highest quality data. OK. So now, we've negotiated with our depositor. We've got them to sign a license, which means we can legally distribute the data. They've sent us in data files and adequate documentation

  • 00:46

    to describe the data. And now, once it comes in, it goes through to the ingest team, or the processing team. The ingest team is also called processing, and they're the ones who get the data in, and they kind of tidy up file names, they package the data into different folders, they convert data into different formats to make it available to users, and they assemble

  • 01:06

    lots of bits of documentation. For example, you might have a questionnaire, you may have some show cards, you might have some interview instructions. They package them into a PDF and they bookmark them, so you have like a book, a kind of report that people can look through. So they spend a few weeks sometimes on data sets, packaging this stuff. But the other thing they do is they go back to depositors

  • 01:27

    if there's a query at all. So if something's not clear, or something's missing, they go back and make sure that they have everything they need. We have a room right at the end of the corridor where we have seven or eight people who process the data, who bring different data formats in, and prepare them and create documentation and catalog records, so that users can find the data and use the data.

  • 01:49

    And I'll introduce you to Melanie in here. The first thing we do is that we look at the data and the documentation for the study, and we develop a processing plan for the data. And this plan makes an assessment of what the probable use of the data is, and therefore, how much effort we want to put into preparing the data for researchers.

  • 02:10

    If it's going to get a lot of views, we'll put a lot more effort into it. But it's also an assessment of what state the data and documentation are in, and what kind of work will have to be done to bring them up to a standard where they're easily used. And another part of that initial assessment is looking at the level of confidentiality of the data, and making sure that the appropriate access

  • 02:34

    mechanism matches up with the sensitivity level of the data. For data that are extremely sensitive, they'll need to go into our secure lab and be prepared for that. Whereas data that are less sensitive might be available as an open download. Some data may be disclosive, or they may be personal pieces of information

  • 02:54

    before it gets to us. And it may be removed when it comes here. But we have what's called kind of secure transfer of data, where we have a service that's encrypted. So people upload data to an encrypted site and then we download it. So depositors should never send data via an email. Another way is to come to us with a memory stick, an encrypted memory stick, and give it to us.

  • 03:16

    So it's all really about transferring data securely. It's not really that difficult. It sounds more difficult than it is, but it's really quite straightforward. So once a processing plan is complete, we apply standard processing procedures to all of our data, so that we have a consistent approach. But we also have a flexible approach,

  • 03:37

    because we get all kinds of different data into the data archive. Now, it's important to remember that when we're processing data, we are ingesting it for use. We're not just doing it for the hell of it. We're doing it so that researchers can easily use the data. And therefore, we do things like produce

  • 03:58

    multiple different software formats, so that researchers who are used to using, say, SPSS or Stata, or want a tab-delimited format that they can read into R, for example, are all accommodated. But in addition to that, we also produce a platform independent, software independent version, for long term preservation.
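As a rough illustration of the tab-delimited export mentioned above, the following minimal Python sketch writes a handful of records to tab-delimited text and reads them back. It is not the archive's actual tooling; the function names and the sample survey records are hypothetical, and only the standard library `csv` module is used.

```python
import csv
import io


def to_tab_delimited(rows, fieldnames):
    """Serialize a list of dict records as tab-delimited text with a header row."""
    buffer = io.StringIO()
    writer = csv.DictWriter(
        buffer, fieldnames=fieldnames, delimiter="\t", lineterminator="\n"
    )
    writer.writeheader()
    writer.writerows(rows)
    return buffer.getvalue()


def from_tab_delimited(text):
    """Parse tab-delimited text (with a header row) back into a list of dicts."""
    return list(csv.DictReader(io.StringIO(text), delimiter="\t"))


# Hypothetical survey records, invented for illustration only.
records = [
    {"respondent_id": "1", "age": "34", "region": "East"},
    {"respondent_id": "2", "age": "51", "region": "North"},
]

text = to_tab_delimited(records, ["respondent_id", "age", "region"])
round_tripped = from_tab_delimited(text)
```

A plain-text layout like this can be opened without any particular statistics package, which is the property the speaker points to for both user access and long term preservation.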

  • 04:22

    And that means that no matter what software has come in and out of fashion over the years, the data and the documentation will still be usable and readable for future generations. Now, a big part of data being usable is that they have to be findable, and for that to happen, we have an extensive and really detailed data catalog called Discover.

  • 04:45

    And in Discover, you can find lots and lots of information about individual data sets. And these are all presented in a standardized format, and it's entirely searchable. In addition to that, there are a number of keywords that we assign from a controlled vocabulary, a thesaurus, that we assign to make sure that data are findable all across the spectrum.

  • 05:08

    And that the searches will come up, not like a Google search, where you have thousands and thousands of hits, most of which are irrelevant, but that the search results will be extremely relevant to what you are interested in. To compile the catalog record, we use information that's given to us by the depositor, but it is presented in a very standardized format that makes

  • 05:29

    it easy to read, easy to use. From the catalog, users can go ahead and order data directly, or apply for access to data if they can't get it directly from the download server. Processing is important because sometimes the data don't come to us in a kind of perfect package. And what we like to try to do is assemble data

  • 05:50

    in a fairly rigorous way, so that when the user comes to use it, they see a familiar format. So you'll see a catalog record, and all our collections, our 7,000 collections, have the same kind of catalog record, with the same fields, the same descriptions. The documentation, like the questionnaires, are packaged in the same way. They're all in PDF format. So you come to the shop window.

  • 06:11

    And you choose the data set, and it's packaged in the way that you would expect to find it. So that's the main thing that we do. We kind of assemble the bits of data and documentation to make it into something that is easily digestible and downloadable, and human readable as well. So that's just a broad overview of what we do to the data to prepare it for your use.

  • 06:34

    So now we have the data prepared and documented by the ingest team. The next step is to make sure that we store the digital data safely for the longer term, and that involves the curation and preservation of data. So we're going to go and talk to John, who's head of IT. And he will tell us about how we back up data, store it, keep it for the longer term. We're going to go and talk to him outside the server room.

  • 06:54

    Unfortunately, we can't go into the server room, because it's a locked down zone. And it's safe and secure, and it has restricted access to very few people.

  • 07:17

    OK, John, can you tell us a bit about how we store and preserve the data for the longer term? Yes, our aim is to make sure that the data is identical and accessible, so that we can always get back to exactly what the depositor gave us, no matter how many years later. So you talked a bit about the data being accessible for the longer term. What does that mean? Well, two things really.

  • 07:39

    It means that we have some recommended file formats for audio, video, image, et cetera. And we expect depositors to give us data in those formats. And it also means that we monitor the availability of software tools to open and read those formats. So every so often, we might have to migrate tens or hundreds

  • 08:01

    of files from one format to another, just to make sure that today's applications can work with them. We have a difficult task in that we're supposed to keep data forever. And if you think about 20 years ago, there were many word processing packages that are not available now. So if you try to read a format from the year 1995, you probably can't. So we assume that Word and Excel are going to be around forever,

  • 08:24

    but they're likely not. And we have to make sure that data are available in what's called open formats, so XML. Or I mean, Excel is an open format, but make sure the data are converted into an open format. And then they are migrated over time. So if a piece of software, or a format, goes out of date, we will need to migrate it to the next readable format.
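The migration step described here can be sketched as a simple planning pass over a list of file names. This is only an illustration under stated assumptions: the `MIGRATION_TARGETS` table and the `plan_migration` function are hypothetical, and the extension pairs in the table are invented examples, not the archive's actual format policy.

```python
from pathlib import PurePosixPath

# Hypothetical migration table: at-risk extension -> assumed open target extension.
# These pairs are illustrative assumptions, not a statement of archive policy.
MIGRATION_TARGETS = {
    ".wpd": ".odt",
    ".xls": ".csv",
    ".sav": ".tab",
}


def plan_migration(filenames):
    """Return (source, target) name pairs for files whose format is listed as at risk."""
    plan = []
    for name in filenames:
        path = PurePosixPath(name)
        target_ext = MIGRATION_TARGETS.get(path.suffix.lower())
        if target_ext is not None:
            plan.append((name, str(path.with_suffix(target_ext))))
    return plan


holdings = ["study/report.WPD", "study/data.csv", "study/survey.sav"]
migration_plan = plan_migration(holdings)
```

The sketch only decides which files need attention and what they would be renamed to; the actual conversion would depend on format-specific tools that are outside its scope.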

  • 08:45

    So it's quite a lot of work in keeping your collection alive, but also making sure that the media don't die. So media don't last forever. CDs probably only last for 10 years, so we refresh the media on a regular basis, to make sure that the tapes are still readable. How do we store data safely and securely?

  • 09:08

    Well, as you know, Louise, all digital media are inherently untrustworthy over the long term. So what we do is we use different types of media, tape, disc, et cetera, and we have multiple copies. So we have copies here in the server room, we have copies on site, we have copies on the campus,

  • 09:29

    and we have copies off site as well. All these copies are encrypted, and they're on a variety of storage media, to make sure that if we have a catastrophic failure of one set of copies, we have more copies we can go to. So we never lose the data. So John, because this is a digital archive, and it's digital access, how do we make sure that users can access our service all the time?
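One common way to check that several stored copies still match what the depositor supplied is to compare checksums. The transcript does not say which method the archive uses, so the following is a minimal sketch under that assumption: the function names and the copy locations are hypothetical, and SHA-256 from the standard library `hashlib` module is chosen purely for illustration.

```python
import hashlib


def sha256_digest(data: bytes) -> str:
    """Return the hexadecimal SHA-256 digest of a byte string."""
    return hashlib.sha256(data).hexdigest()


def find_damaged_copies(reference_digest, copies):
    """Return the sorted names of locations whose copy no longer matches the reference."""
    return sorted(
        location
        for location, data in copies.items()
        if sha256_digest(data) != reference_digest
    )


# Hypothetical deposited file and storage locations, invented for illustration.
original = b"respondent_id\tage\n1\t34\n"
reference = sha256_digest(original)  # recorded when the data first arrive

copies = {
    "server_room": original,
    "campus": original,
    "off_site": b"respondent_id\tage\n1\t3",  # simulated truncated copy
}

damaged = find_damaged_copies(reference, copies)
```

Any location reported as damaged could then be repaired from one of the copies that still matches the reference digest.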

  • 09:50

    Well, in order to be able to receive, process, store, and make available the data, we have many systems and applications running here. And we've got to make sure that they're always available. So we back them up both incrementally and on a disaster recovery basis, and keep copies going back many days.

  • 10:11

    So what that means is, if we have a fault or a loss of some configuration, we can roll back the clock to yesterday's version of the system, or the day before, or the day before, et cetera. And that means that we can keep our systems up and running, so our depositors and our users can access our services. I like working here because of the challenges of data.

  • 10:34

    I mean, when I started in this job, data archiving was a very geeky job, and nobody really knew what it meant. But now, everywhere you go, there's worries about data loss. We know we've got to look after our digital assets. Everything, all our photos are on digital media. So looking after data well and describing it well are all the more important.

  • 10:54

    We've got many more sources of data, digital data around, like social media data coming, transactional data, kind of surveillance data. There's so many sources of social data that could be used. So there's kind of a challenge in seeing what researchers want to use, and they pretty much want to use anything that's digital. So that's why it's exciting, because it's ever changing.

Video Info

Publisher: SAGE Publications Ltd.

Publication Year: 2017

Video Type: In Practice

Methods: Data archives, Data management

Keywords: accessibility; cataloging; encryption; packaging; preservation; restricted zones; Software ...

Segment Info

Segment Num.: 1


The staff of the UK Data Archive explain how research data is incorporated into their system and made accessible to other researchers. Data ingestion, cataloging, preservation, and migration are all discussed.

Processing Data: The UK Data Archive
