
Modern researchers in various fields are confronted with an unprecedented wealth and complexity of data. Traditional data analysis techniques, however, offer only limited solutions in such complex situations. The growing demand for the analysis and interpretation of these complex data is addressed under the name of data mining, or knowledge discovery. Data mining is defined as the process of extracting useful information from large data sets through the use of any relevant data analysis techniques developed to help people make better decisions. These data mining techniques are themselves defined and categorized according to their underlying statistical theories and computing algorithms. This entry discusses these various data mining methods and their applications.

Types of Data Mining

In general, data mining methods can be separated into three categories: unsupervised learning, supervised learning, and semisupervised learning methods. Unsupervised methods rely solely on the input variables (predictors) and do not take into account output (response) information. In unsupervised learning, the goal is to extract implicit patterns and elicit the natural groupings within the data set without using any information from the output variable. Supervised learning methods, on the other hand, use information from both the input and output variables to generate models that classify or predict the output values of future observations. Semisupervised methods combine the unsupervised and supervised approaches to generate an appropriate classification or prediction model.

Unsupervised Learning Methods

Unsupervised learning methods attempt to extract important patterns from a data set without using any information from the output variable. Clustering analysis, which is one of the unsupervised learning methods, systematically partitions the data set by minimizing within-group variation and maximizing between-group variation. These variations can be measured on the basis of a variety of distance metrics between observations in the data set. Clustering analysis includes hierarchical and nonhierarchical methods.

Hierarchical clustering algorithms provide a dendrogram that represents the hierarchical structure of clusters. At the highest level of this hierarchy is a single cluster that contains all the observations, while at the lowest level are clusters containing a single observation. Examples of hierarchical clustering algorithms are single linkage, complete linkage, average linkage, and Ward's method.
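The hierarchy described above can be sketched with SciPy's hierarchical clustering routines. The data set, the choice of Ward's method, and the cut into three clusters below are illustrative assumptions, not part of the original entry:

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

# Five toy observations in two dimensions: two tight pairs and one outlier
X = np.array([[0.0, 0.0], [0.2, 0.1],
              [5.0, 5.0], [5.1, 4.9],
              [10.0, 0.0]])

# Ward's method: at each step, merge the pair of clusters that least
# increases the total within-cluster variance
Z = linkage(X, method="ward")

# Cut the dendrogram to obtain three flat clusters
labels = fcluster(Z, t=3, criterion="maxclust")
# scipy.cluster.hierarchy.dendrogram(Z) would draw the full tree,
# from single-observation leaves up to the one all-inclusive cluster
```

Replacing `method="ward"` with `"single"`, `"complete"`, or `"average"` gives the single, complete, and average linkage variants named above.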

Nonhierarchical clustering algorithms achieve the purpose of clustering analysis without building a hierarchical structure. The k-means clustering algorithm is one of the most popular nonhierarchical clustering methods. A brief summary of the k-means clustering algorithm is as follows: Given k seed (or starting) points, each observation is assigned to the closest of the k seed points, which creates k clusters. The seed points are then replaced with the means of the currently assigned clusters. This procedure is repeated with the updated seed points until the assignments no longer change. The results of the k-means clustering algorithm depend on the distance metric, the number of clusters (k), and the location of the seed points. Other nonhierarchical clustering algorithms include k-medoids and self-organizing maps.
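The k-means procedure summarized above can be written out directly in NumPy. The toy data, the seed locations, and the Euclidean distance metric are illustrative assumptions:

```python
import numpy as np

def kmeans(X, k, seeds, n_iter=100):
    """Plain k-means: assign each observation to its nearest seed,
    then replace each seed with the mean of its assigned cluster."""
    centers = np.asarray(seeds, dtype=float)
    for _ in range(n_iter):
        # Euclidean distance from every observation to every center
        d = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=2)
        labels = d.argmin(axis=1)          # assignment step
        new_centers = np.array([X[labels == j].mean(axis=0)
                                for j in range(k)])  # update step
        if np.allclose(new_centers, centers):
            break                          # assignments stable; stop
        centers = new_centers
    return labels, centers

# Two well-separated groups of three points each; one seed near each group
X = np.array([[0.0, 0.0], [0.1, 0.2], [0.2, 0.1],
              [5.0, 5.0], [5.1, 4.9], [4.9, 5.2]])
labels, centers = kmeans(X, k=2, seeds=[[0.0, 0.0], [5.0, 5.0]])
```

As the entry notes, the outcome depends on the seeds: starting both seeds inside the same group would converge to a different, poorer partition.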

Principal components analysis (PCA) is another unsupervised technique and is widely used, primarily for dimension reduction and visualization. PCA is based on the covariance matrix of the original variables, from which eigenvalues and eigenvectors are obtained. Projecting the data matrix onto the eigenvector corresponding to the largest eigenvalue yields the first principal component (PC), which captures the maximum variance in the data set. The second PC is then obtained via the eigenvector corresponding to the second largest eigenvalue, and this process is repeated N times to obtain N PCs, where N is the number of variables in the data set. The PCs are uncorrelated with each other, and generally the first few PCs are sufficient to account for most of the variation. Thus, a plot of the observations on these first few PC axes facilitates visualization of high-dimensional data sets.
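The eigendecomposition route to PCA described above can be sketched in a few lines of NumPy. The synthetic data set (with one variable deliberately made a near-copy of another) is an illustrative assumption:

```python
import numpy as np

rng = np.random.default_rng(0)
# 100 observations, 3 variables; variable 1 is strongly tied to variable 0,
# so most of the variance lies along one direction
X = rng.normal(size=(100, 3))
X[:, 1] = 2.0 * X[:, 0] + rng.normal(scale=0.1, size=100)

Xc = X - X.mean(axis=0)                 # center each variable
cov = np.cov(Xc, rowvar=False)          # covariance matrix of the variables
eigvals, eigvecs = np.linalg.eigh(cov)  # eigh returns ascending eigenvalues
order = np.argsort(eigvals)[::-1]       # reorder largest-first
eigvals, eigvecs = eigvals[order], eigvecs[:, order]

scores = Xc @ eigvecs                   # column j of scores is the j-th PC
explained = eigvals / eigvals.sum()     # proportion of variance per PC
```

Plotting the first two columns of `scores` against each other gives the two-dimensional PCA visualization the entry refers to; here `explained[0]` alone accounts for most of the variance, reflecting the built-in correlation.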

...
