
Data compression is the process by which statistical structure in data is used to obtain a compact representation of the data. Structure can exist in data in various ways. If there is a correlation between neighboring symbols, this correlation can be used to remove the predictable portion of the data and encode only what remains. If patterns exist in the data, they can be replaced by indices to a dictionary of patterns. Even when samples of a data sequence are independent of each other, they might show bias, with some symbols occurring more often than others. This bias can also be used to provide compression. Sometimes, it is easier to focus on what is not present rather than what is present in the data. For example, the low-pass nature of certain data can be taken advantage of by processing the data in the spectral domain and discarding the higher frequency coefficients. In brief, the characteristics of the data guide the compression process.
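As an illustration of removing the predictable portion of correlated data, the following Python sketch delta-encodes a slowly varying sequence, leaving small residuals that are easier to code compactly. The function names are hypothetical, not from any particular library:

```python
def delta_encode(samples):
    """Replace each sample with its difference from the previous one."""
    prev = 0
    residuals = []
    for s in samples:
        residuals.append(s - prev)
        prev = s
    return residuals

def delta_decode(residuals):
    """Invert delta_encode by accumulating the residuals."""
    prev = 0
    samples = []
    for r in residuals:
        prev += r
        samples.append(prev)
    return samples

smooth = [100, 101, 103, 104, 104, 106]   # neighboring values are correlated
residuals = delta_encode(smooth)           # [100, 1, 2, 1, 0, 2] -- small, skewed values
assert delta_decode(residuals) == smooth   # the original is exactly recoverable
```

The residuals cluster near zero, so a subsequent entropy coder can represent them in fewer bits than the raw samples would need.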

Depending on the requirements of the user, data compression techniques can be classified as lossless or lossy. Lossless data compression techniques allow the exact recovery of the original. Lossy data compression permits the introduction of distortion in a controlled fashion to provide greater compression. Lossy techniques are used only in situations where the user can tolerate distortion. We will discuss some commonly used data compression techniques in the following sections.
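The lossless/lossy distinction can be demonstrated in a few lines of Python. The zlib roundtrip below recovers the original exactly; the quantization step, a purely illustrative stand-in for lossy coding, does not:

```python
import zlib

# Lossless: the original bytes are recovered exactly after compression.
original = b"abracadabra" * 100
compressed = zlib.compress(original)
assert zlib.decompress(compressed) == original
assert len(compressed) < len(original)

# Lossy (illustrative): coarse quantization trades exactness for compactness.
samples = [0.12, 0.48, 0.51, 0.97]
quantized = [round(s * 4) for s in samples]   # keep only a few levels of precision
reconstructed = [q / 4 for q in quantized]    # close to, but not equal to, the input
assert reconstructed != samples
```

The controlled distortion introduced by the quantizer is what buys lossy schemes their greater compression.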

Application Areas

Data compression is used in a wide variety of applications. WinZip and Gzip are commonly used file compression utilities on computers. Images on the Internet and in many cameras are compressed using the JPEG algorithm. Video conferencing is conducted using compressed video. Cell phones use compression techniques to provide service under bandwidth limitations. Digital television broadcasts would not be feasible without compression. In fact, compression is the enabling technology for the multimedia revolution.

Compression Approaches

Compression can be viewed, and compression techniques classified, in terms of the models used in the compression process and how those models are obtained. We can focus on the data, examining the different kinds of structures that exist in the data without reference to the source of the data. We will call these approaches data modeling approaches. We can try to understand how the data are generated and exploit the source model for the development of data compression algorithms. Finally, we can examine the properties of the data user because these properties will impose certain constraints on the data. We begin by looking at techniques based on properties gleaned from the data.

Data Modeling Approaches

With different applications, we get different kinds of structure in the data that can be used by the compression algorithm. The simplest form of structure occurs when there is no symbol-to-symbol dependence; however, the data symbols take on different values with differing probabilities. Compression schemes that make use of this statistical skew include Huffman coding and arithmetic coding.

Huffman coding, developed as a class project by David Huffman, assigns short codewords to symbols that occur more often and long codewords to symbols that occur less often. Let's look at the example in Table 1. There are five symbols in the original file. If we were to represent them using a fixed-length code, we would need three binary digits to represent each symbol. However, if we assign codewords of different lengths to each symbol according to its probability, as shown in Table 1, the average number of binary digits (l) needed to represent a symbol will be

l = p1l1 + p2l2 + p3l3 + p4l4 + p5l5

where pi is the probability of the ith symbol and li is the length of its codeword.
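A minimal Huffman construction can be sketched in Python. The five-symbol distribution below is hypothetical, chosen for illustration rather than taken from Table 1, whose actual values are not reproduced here:

```python
import heapq

def huffman_code(probs):
    """Build a Huffman code for a {symbol: probability} map (a minimal sketch)."""
    # Each heap entry is (probability, tiebreak, {symbol: partial codeword});
    # the unique tiebreak integer keeps the tuples comparable.
    heap = [(p, i, {sym: ""}) for i, (sym, p) in enumerate(probs.items())]
    heapq.heapify(heap)
    count = len(heap)
    while len(heap) > 1:
        p1, _, c1 = heapq.heappop(heap)   # the two least probable subtrees
        p2, _, c2 = heapq.heappop(heap)
        merged = {s: "0" + w for s, w in c1.items()}            # merge them,
        merged.update({s: "1" + w for s, w in c2.items()})      # extending each
        heapq.heappush(heap, (p1 + p2, count, merged))          # codeword by one bit
        count += 1
    return heap[0][2]

# Hypothetical five-symbol distribution (not Table 1's values).
probs = {"a": 0.4, "b": 0.2, "c": 0.2, "d": 0.1, "e": 0.1}
code = huffman_code(probs)
avg_len = sum(probs[s] * len(code[s]) for s in probs)
assert avg_len < 3  # about 2.2 bits/symbol here, versus 3 for a fixed-length code
```

Frequent symbols end up with 2-bit codewords and rare ones with 3-bit codewords, so the weighted average falls below the 3 bits a fixed-length code would require.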

...
