
Cross validation refers to a procedure in which an analysis is performed on one dataset and the resulting parameters are then applied to a second dataset. At the same time, the analysis is performed on the second dataset and its parameters are applied to the first dataset. A successful cross validation occurs when the estimation process is equally accurate for both datasets. The argument is that the equation or estimation process generated from one set of data will cross validate, that is, continue to be accurate, when applied to an entirely new set of data. Similarly, going from the second set of data to the first should retain the same level of accuracy.

The procedure is useful when the aim is to predict a score from a combination of predictors. The goal is to generate an equation that maintains a high level of prediction when used on different samples. If the cross validation succeeds, the generated equation should predict with the same level of accuracy in future samples. The technique is particularly useful when the original sample is very large and the future samples are much smaller. A large original sample should yield parameter estimates that are more accurate than those that the smaller samples could provide on their own.

This entry provides a detailed example that further explains cross validation and justifies its use. Next, it examines the aspect of cross validation concerned with whether an equation predicted by a theory can remain equally predictive in all contexts. Finally, the value of cross validation is reviewed.
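The two-way procedure described above can be sketched in Python. This is a minimal illustration rather than a definitive implementation: it assumes ordinary least squares as the estimation method and the correlation between predicted and observed scores as the accuracy measure, and the function names (fit_ols, predict, multiple_r, double_cross_validate) and the simulated data are hypothetical, introduced only for this example.

```python
import numpy as np

def fit_ols(X, y):
    """Estimate least-squares coefficients (intercept first)."""
    design = np.column_stack([np.ones(len(X)), X])
    coef, *_ = np.linalg.lstsq(design, y, rcond=None)
    return coef

def predict(coef, X):
    """Apply previously estimated coefficients to a set of predictors."""
    design = np.column_stack([np.ones(len(X)), X])
    return design @ coef

def multiple_r(y, y_hat):
    """Correlation between observed and predicted scores."""
    return float(np.corrcoef(y, y_hat)[0, 1])

def double_cross_validate(X_a, y_a, X_b, y_b):
    """Fit on each dataset, then score each equation on both datasets."""
    coef_a = fit_ols(X_a, y_a)
    coef_b = fit_ols(X_b, y_b)
    return {
        "a_on_a": multiple_r(y_a, predict(coef_a, X_a)),  # equation A, own sample
        "a_on_b": multiple_r(y_b, predict(coef_a, X_b)),  # equation A, new sample
        "b_on_b": multiple_r(y_b, predict(coef_b, X_b)),  # equation B, own sample
        "b_on_a": multiple_r(y_a, predict(coef_b, X_a)),  # equation B, new sample
    }

# Simulated data (an assumption for illustration, not real measurements).
rng = np.random.default_rng(0)
X = rng.normal(size=(400, 3))
y = X @ np.array([1.0, -0.5, 0.8]) + rng.normal(scale=1.0, size=400)

result = double_cross_validate(X[:200], y[:200], X[200:], y[200:])
for name, value in result.items():
    print(f"{name}: {value:.3f}")
```

Under these assumptions, the cross validation would be judged successful when "a_on_b" is close to "a_on_a" and "b_on_a" is close to "b_on_b", that is, when each equation predicts about as well in the other sample as in its own.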

Justification for Cross Validation

The justification for cross validation is the assumption that using a sample to generate an estimate is valuable only if the resulting equation can be applied to other samples. If the equation generated for one sample does not generalize to other samples, its value for prediction is seriously restricted and it loses practical utility. The key to the use of any equation is the degree to which it works on additional samples. The process essentially becomes a test of how well the prediction generalizes. Generating such an equation is beneficial because a successful cross validation indicates that the indicators can be used to provide accurate predictions.

An example is the use of a multiple regression equation to predict an outcome. Suppose one were to use gender (1 = male, 2 = female), age (measured in years), and communication apprehension (scale score from 24 to 120) to predict the level of argumentativeness (scored from 50 to 100). Suppose the results demonstrate that all three predictor variables (gender, age, and communication apprehension) are significant predictors of argumentativeness and that the overall equation generates a Multiple R of .70. The Multiple R is the correlation between the predicted argumentativeness score (estimated by the equation from the combination of the three predictor variables) and the actual score that was measured. The equation generated using this dataset is the

...
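The Multiple R described in the example above can be computed directly as the correlation between predicted and observed scores. The sketch below is illustrative only: the data are simulated, the generating coefficients are arbitrary assumptions, and nothing here is meant to reproduce the .70 value or the equation from the entry's example. The variable names (gender, age, apprehension, argumentativeness) simply mirror the predictors and outcome named in the text.

```python
import numpy as np

# Simulated scores (assumed values for illustration, not real data).
rng = np.random.default_rng(1)
n = 300
gender = rng.integers(1, 3, size=n)            # 1 = male, 2 = female
age = rng.uniform(18, 65, size=n)              # years
apprehension = rng.uniform(24, 120, size=n)    # scale score, 24 to 120
argumentativeness = (
    95.0 - 2.0 * gender - 0.05 * age - 0.2 * apprehension
    + rng.normal(scale=5.0, size=n)            # arbitrary generating equation
)

# Fit the multiple regression equation by least squares.
design = np.column_stack([np.ones(n), gender, age, apprehension])
coef, *_ = np.linalg.lstsq(design, argumentativeness, rcond=None)
predicted = design @ coef

# Multiple R: correlation between predicted and actual scores.
multiple_r = float(np.corrcoef(argumentativeness, predicted)[0, 1])

# For a model with an intercept, Multiple R squared equals R squared.
ss_res = float(np.sum((argumentativeness - predicted) ** 2))
ss_tot = float(np.sum((argumentativeness - argumentativeness.mean()) ** 2))
r_squared = 1.0 - ss_res / ss_tot

print(f"Multiple R = {multiple_r:.3f}, R squared = {r_squared:.3f}")
```

Computing Multiple R this way makes the cross validation step straightforward: the same coefficients can be applied to a second sample and the correlation recomputed there.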
