
This entry discusses statistical models involving mixture distributions. As well as being useful in identifying and describing subpopulations within a mixed population, mixture models are useful data analytic tools, providing flexible families of distributions to fit to unusually shaped data. Theoretical advances in the past 30 years, as well as advances in computing technology, have led to the wide use of mixture models in applications as varied as ecology, machine learning, genetics, medical research, psychology, reliability, and survival analysis.

Suppose that F = {F_θ : θ ∈ S} is a parametric family of distributions on a sample space X, and let Q denote a probability distribution defined on the parameter space S. The distribution

F_Q(x) = ∫_S F_θ(x) dQ(θ)

is a mixture distribution. An observation X drawn from F_Q can be thought of as being obtained in a two-step procedure: first a random Θ is drawn from the distribution Q, and then, conditional on Θ = θ, X is drawn from the distribution F_θ. Suppose we have a random sample X_1, …, X_n from F_Q. We can view this as a missing data problem in that the “full data” consist of pairs (X_1, Θ_1), …, (X_n, Θ_n), with Θ_i ∼ Q and X_i | Θ_i = θ ∼ F_θ, but then only the first member X_i of each pair is observed; the labels Θ_i are hidden.
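The two-step procedure above can be sketched directly in code. In this minimal illustration, Q is taken to be discrete with two mass points and each F_θ is taken to be a normal distribution with mean θ; these particular choices are assumptions for the example, not part of the entry.

```python
import random

random.seed(0)

theta_values = [0.0, 5.0]   # mass points of Q (here, component means)
weights = [0.7, 0.3]        # probabilities Q assigns to each mass point

def draw_from_mixture():
    # Step 1: draw the hidden label Theta from Q.
    theta = random.choices(theta_values, weights=weights, k=1)[0]
    # Step 2: conditional on Theta = theta, draw X from F_theta,
    # illustratively N(theta, 1). Only X is returned; theta is hidden.
    return random.gauss(theta, 1.0)

sample = [draw_from_mixture() for _ in range(1000)]
```

Discarding `theta` inside `draw_from_mixture` is exactly the missing-data view: the full data are the pairs (X_i, Θ_i), but only the X_i are observed.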

If the distribution Q is discrete with a finite number k of mass points θ_1, …, θ_k, then we can write

F_Q(x) = ∑_{j=1}^{k} q_j F_{θ_j}(x)

where q_j = Q({θ_j}). The distribution F_Q is called a finite mixture distribution, the distributions F_{θ_j} are the component distributions, and the q_j are the component weights.
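Evaluating a finite mixture is just forming the weighted sum of the component distributions. A short sketch, assuming normal component cdfs with means θ_j (an illustrative choice not specified by the entry):

```python
import math

def normal_cdf(x, mean, sd=1.0):
    # Standard normal cdf shifted/scaled, via the error function.
    return 0.5 * (1.0 + math.erf((x - mean) / (sd * math.sqrt(2.0))))

def mixture_cdf(x, weights, thetas):
    # F_Q(x) = sum over j of q_j * F_{theta_j}(x)
    return sum(q * normal_cdf(x, th) for q, th in zip(weights, thetas))

q = [0.7, 0.3]
thetas = [0.0, 5.0]
value = mixture_cdf(0.0, q, thetas)  # ≈ 0.7*0.5 + 0.3*Φ(-5) ≈ 0.35
```

The same weighted-sum structure applies to densities: replace `normal_cdf` with a density and the result is the finite mixture density.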

There are several reasons why mixture distributions, and in particular finite mixture distributions, are of interest. First, there are many applications where the mechanism generating the data is truly of a mixture form: we sample from a population that we know or suspect is made up of several relatively homogeneous subpopulations, within each of which the data of interest follow one of the component distributions. We may wish to draw inferences, based on such a sample, about characteristics of the component subpopulations (the parameters θ_j), about the relative proportions (the parameters q_j) of the population in each subpopulation, or both. Even the precise number of subpopulations may be unknown to us. An example is a population of fish, where the subpopulations are the yearly spawnings. Interest may focus on the relative abundances of each spawning, an unusually low proportion possibly corresponding to unfavorable conditions in that year.

Second, even when there is no a priori reason to anticipate a mixture distribution, families of mixture distributions, in particular finite mixtures, provide particularly flexible families of probability distributions and densities that can be fitted to unusually shaped (skewed, long-tailed, multimodal) data that would be difficult to describe with a more conventional parametric family of densities. Such a fit is often comparable in flexibility to a fully nonparametric estimate but structurally simpler, and it often requires less subjective input, for example in choosing smoothing parameters. As an illustration, it has been shown that the strongly skewed log-normal density can often be well approximated by a two- or three-component mixture of normals, each component possibly having a different mean and variance.
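The log-normal example can be made concrete by fitting a two-component normal mixture to simulated log-normal data. The fitting procedure used here, the EM algorithm for univariate normal components, and all starting values are assumptions for this sketch; the entry itself does not prescribe a fitting method at this point.

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.lognormal(mean=0.0, sigma=0.5, size=2000)  # skewed target data

# Initial guesses for the weights q_j, means, and standard deviations.
q = np.array([0.5, 0.5])
mu = np.array([x.min(), x.max()])
sd = np.array([x.std(), x.std()])

def normal_pdf(x, m, s):
    return np.exp(-0.5 * ((x - m) / s) ** 2) / (s * np.sqrt(2 * np.pi))

for _ in range(200):
    # E-step: posterior probability that each point came from component j,
    # i.e. an estimate of the hidden label Theta_i.
    dens = np.stack([qj * normal_pdf(x, m, s) for qj, m, s in zip(q, mu, sd)])
    resp = dens / dens.sum(axis=0)
    # M-step: responsibility-weighted updates of q_j and (mu_j, sd_j).
    n_j = resp.sum(axis=1)
    q = n_j / len(x)
    mu = (resp @ x) / n_j
    sd = np.sqrt((resp * (x - mu[:, None]) ** 2).sum(axis=1) / n_j)
```

After fitting, the mixture density `q[0]*N(mu[0], sd[0]) + q[1]*N(mu[1], sd[1])` gives a smooth, skewed approximation to the log-normal shape, with far fewer tuning decisions than a kernel-type nonparametric estimate.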

...
