
Data compression is the process by which statistical structure in data is used to obtain a compact representation of the data. Structure can exist in data in various ways. If there is a correlation between neighboring symbols, this correlation can be used to remove the predictable portion of the data and encode only what remains. If patterns exist in the data, they can be replaced by indices to a dictionary of patterns. Even when samples of a data sequence are independent of each other, they might show bias, with some symbols occurring more often than others. This bias can also be used to provide compression. Sometimes, it is easier to focus on what is not present rather than what is present in the data. For example, the low-pass nature of certain data can be taken advantage of by processing the data in the spectral domain and discarding the higher frequency coefficients. In brief, the characteristics of the data guide the compression process.
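As an illustration of removing the predictable portion of correlated data, the following Python sketch delta-encodes a slowly varying sequence, leaving small residuals that are easier to code compactly. The function names are hypothetical, not from any particular library:

```python
def delta_encode(samples):
    """Replace each sample with its difference from the previous one."""
    prev = 0
    residuals = []
    for s in samples:
        residuals.append(s - prev)
        prev = s
    return residuals

def delta_decode(residuals):
    """Invert delta_encode by accumulating the residuals."""
    prev = 0
    samples = []
    for r in residuals:
        prev += r
        samples.append(prev)
    return samples

smooth = [100, 101, 103, 104, 104, 106]   # neighboring values are correlated
residuals = delta_encode(smooth)           # [100, 1, 2, 1, 0, 2] -- small, skewed values
assert delta_decode(residuals) == smooth   # the original is exactly recoverable
```

The residuals cluster near zero, so a subsequent entropy coder can represent them in fewer bits than the raw samples would need.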

Depending on the requirements of the user, data compression techniques can be classified as lossless or lossy. Lossless data compression techniques allow the exact recovery of the original. Lossy data compression permits the introduction of distortion in a controlled fashion to provide greater compression. Lossy techniques are used only in situations where the user can tolerate distortion. We will discuss some commonly used data compression techniques in the following sections.
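The lossless/lossy distinction can be demonstrated in a few lines of Python. The zlib roundtrip below recovers the original exactly; the quantization step, a purely illustrative stand-in for lossy coding, does not:

```python
import zlib

# Lossless: the original bytes are recovered exactly after compression.
original = b"abracadabra" * 100
compressed = zlib.compress(original)
assert zlib.decompress(compressed) == original
assert len(compressed) < len(original)

# Lossy (illustrative): coarse quantization trades exactness for compactness.
samples = [0.12, 0.48, 0.51, 0.97]
quantized = [round(s * 4) for s in samples]   # keep only a few levels of precision
reconstructed = [q / 4 for q in quantized]    # close to, but not equal to, the input
assert reconstructed != samples
```

The controlled distortion introduced by the quantizer is what buys lossy schemes their greater compression.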

Application Areas

Data compression is used in a wide variety of applications. WinZip and Gzip are commonly used file compression utilities on computers. Images on the Internet and in many cameras are compressed using the JPEG algorithm. Video conferencing is conducted using compressed video. Cell phones use compression techniques to provide service under bandwidth limitations. Digital television broadcasts would not be feasible without compression. In fact, compression is the enabling technology for the multimedia revolution.

Compression Approaches

Compression can be viewed, and compression techniques classified, in terms of the models used in the compression process and how those models are obtained. We can focus on the data, examining the different kinds of structures that exist in the data without reference to the source of the data. We will call these approaches data modeling approaches. We can try to understand how the data are generated and exploit the source model for the development of data compression algorithms. Finally, we can examine the properties of the data user because these properties will impose certain constraints on the data. We begin by looking at techniques based on properties gleaned from the data.

Data Modeling Approaches

With different applications, we get different kinds of structure in the data that can be used by the compression algorithm. The simplest form of structure occurs when there is no symbol-to-symbol dependence; however, the data symbols take on different values with differing probabilities. Compression schemes that make use of this statistical skew include Huffman coding and arithmetic coding.

Huffman coding, developed as a class project by David Huffman, assigns short codewords to symbols that occur more often and long codewords to symbols that occur less often. Let's look at the example in Table 1. There are five symbols in the original file. If we were to represent them using a fixed-length code, we would need three binary digits to represent each symbol. However, if we assign codewords of different lengths to each symbol according to its probability, as shown in Table 1, the average number of binary digits (l) needed to represent a symbol will be

l = p1l1 + p2l2 + p3l3 + p4l4 + p5l5

where pi is the probability of the ith symbol and li is the length of its codeword.
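A minimal Huffman construction can be sketched in Python. The five-symbol distribution below is hypothetical, chosen for illustration rather than taken from Table 1, whose actual values are not reproduced here:

```python
import heapq

def huffman_code(probs):
    """Build a Huffman code for a {symbol: probability} map (a minimal sketch)."""
    # Each heap entry is (probability, tiebreak, {symbol: partial codeword});
    # the unique tiebreak integer keeps the tuples comparable.
    heap = [(p, i, {sym: ""}) for i, (sym, p) in enumerate(probs.items())]
    heapq.heapify(heap)
    count = len(heap)
    while len(heap) > 1:
        p1, _, c1 = heapq.heappop(heap)   # the two least probable subtrees
        p2, _, c2 = heapq.heappop(heap)
        merged = {s: "0" + w for s, w in c1.items()}            # merge them,
        merged.update({s: "1" + w for s, w in c2.items()})      # extending each
        heapq.heappush(heap, (p1 + p2, count, merged))          # codeword by one bit
        count += 1
    return heap[0][2]

# Hypothetical five-symbol distribution (not Table 1's values).
probs = {"a": 0.4, "b": 0.2, "c": 0.2, "d": 0.1, "e": 0.1}
code = huffman_code(probs)
avg_len = sum(probs[s] * len(code[s]) for s in probs)
assert avg_len < 3  # about 2.2 bits/symbol here, versus 3 for a fixed-length code
```

Frequent symbols end up with 2-bit codewords and rare ones with 3-bit codewords, so the weighted average falls below the 3 bits a fixed-length code would require.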

...
