The curse of dimensionality refers to the rapid increase in volume associated with adding extra dimensions to a mathematical space. In the behavioral and social sciences, the space in question is the multidimensional space spanned by the set of V variables collected by the researcher. Simply put, simultaneously analyzing large sets of variables requires large numbers of observations because, as the number of variables increases, the multidimensional space becomes increasingly sparse. This problem manifests itself in several analytical techniques (such as multiple regression and finite mixture modeling) in which difficulties arise because the variance-covariance matrix becomes singular (i.e., noninvertible) when the number of variables, V, exceeds the number of observations, N. Additionally, as V approaches N, the parameter estimates of the aforementioned models become increasingly unstable, causing statistical inference to become less precise.

For a mathematical example, consider multiple regression in which we are predicting y from a matrix of explanatory variables, X. For ease of presentation, assume that the data are mean centered; then the unbiased estimate of the covariance matrix of X is given by

$$\Sigma = \frac{1}{N - 1}\, X'X.$$

Furthermore, the general equation for multiple regression is

$$y = X\beta + \epsilon,$$

where

y is the N × 1 vector of responses,

X is the N × V matrix of predictor variables,

β is the V × 1 vector of parameter estimates corresponding to the predictor variables, and

ε is the N × 1 vector of residuals.

It is well known that the estimate of β is given by

$$\hat{\beta} = (X'X)^{-1} X'y.$$

It is easily seen that (X′X)⁻¹ is proportional to the inverse of Σ, because for mean-centered data X′X = (N – 1)Σ. Thus, if there are any redundancies in Σ (i.e., Σ is not of full rank or, in regression terms, multicollinearity exists), it will not be possible to take the inverse of Σ and, consequently, it will not be possible to estimate β. One way multicollinearity can enter Σ is when V exceeds N.
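The rank argument can be checked numerically. The following is a minimal sketch, not part of the original entry: it builds a small integer-valued design matrix with more variables than observations and computes the exact rank of X′X using rational arithmetic. The helper names (`matrix_rank`, `gram`, `random_design`) and the choice of N = 3, V = 5 are illustrative assumptions.

```python
import random
from fractions import Fraction


def matrix_rank(rows):
    """Exact rank of a matrix (list of rows) via Gaussian elimination over the rationals."""
    m = [[Fraction(v) for v in row] for row in rows]
    n_rows = len(m)
    n_cols = len(m[0]) if m else 0
    rank = 0
    for col in range(n_cols):
        # Look for a row at or below position `rank` with a nonzero entry in this column.
        pivot = next((r for r in range(rank, n_rows) if m[r][col] != 0), None)
        if pivot is None:
            continue
        m[rank], m[pivot] = m[pivot], m[rank]
        # Eliminate this column from every row below the pivot row.
        for r in range(rank + 1, n_rows):
            factor = m[r][col] / m[rank][col]
            m[r] = [a - factor * b for a, b in zip(m[r], m[rank])]
        rank += 1
        if rank == n_rows:
            break
    return rank


def gram(X):
    """Return X'X for a matrix X stored as N rows of V values."""
    V = len(X[0])
    return [[sum(row[i] * row[j] for row in X) for j in range(V)] for i in range(V)]


def random_design(N, V, seed):
    """Hypothetical N x V data matrix with small integer entries (illustration only)."""
    rng = random.Random(seed)
    return [[rng.randint(-9, 9) for _ in range(V)] for _ in range(N)]


N, V = 3, 5  # more variables than observations
rank_wide = matrix_rank(gram(random_design(N, V, seed=0)))
print(f"N={N}, V={V}: rank(X'X) = {rank_wide}, but invertibility needs rank {V}")
```

Because the rank of X′X can never exceed the number of rows of X, the V × V matrix here has rank at most 3 and therefore cannot be inverted, whatever data are drawn.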

Related to this general problem is the fact that, as V increases, the multidimensional space becomes more and more sparse. To illustrate, consider the Euclidean distance between any two points x and y,

$$d(x, y) = \sqrt{\sum_{v=1}^{V} (x_v - y_v)^2},$$

the square root of the sum of squared differences across all V dimensions. To begin, consider the two points x = (1, 3) and y = (4, 7), which results in the Euclidean distance of d(x,y) = [(1 – 4)² + (3 – 7)²]^(1/2) = [9 + 16]^(1/2) = 5. Now assume that K additional, albeit meaningless, dimensions are added to each observation by sampling from a uniform distribution with lower bound of 0 and upper bound of 1 (denoted by U(0,1)). The new Euclidean distance, d(x,y)∗, is given by

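$$d(x, y)^{*} = \left[\, 5^2 + \sum_{k=1}^{K} \bigl(U_{x,k} - U_{y,k}\bigr)^2 \right]^{1/2}, \qquad U_{x,k},\ U_{y,k} \sim U(0,1),$$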

where the 5 represents the original Euclidean distance and the remainder of d(x,y)∗ represents the additional distance that is due to random noise alone. Clearly, as K → ∞, then d(x,y)∗ → ∞, indicating that as more dimensions are added, the two points become farther and farther apart. In the extreme, an infinite amount of random noise results in the two points being infinitely far apart.
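The growth of d(x,y)∗ with K can be simulated directly. This is a minimal sketch, assuming independent U(0,1) draws for each added coordinate of each point; the function name `noisy_distance`, the seed, and the values of K are illustrative choices, not taken from the entry.

```python
import math
import random


def noisy_distance(x, y, K, rng):
    """Euclidean distance after appending K independent U(0,1) coordinates to each point."""
    x_aug = list(x) + [rng.random() for _ in range(K)]
    y_aug = list(y) + [rng.random() for _ in range(K)]
    return math.dist(x_aug, y_aug)


rng = random.Random(0)  # fixed seed so the run is repeatable
x, y = (1, 3), (4, 7)   # the two points from the text; original distance is 5

for K in (0, 10, 100, 1000, 10000):
    # K = 0 reproduces the original distance; larger K adds pure noise.
    print(K, round(noisy_distance(x, y, K, rng), 3))
```

Since the expected squared difference of two independent U(0,1) draws is 1/6, the printed distances should track roughly the square root of 25 + K/6, so the noise term overtakes the original 25 once K reaches the low hundreds.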

...
