Phi Correlation Coefficient

The phi correlation coefficient (phi) is one of a number of correlation statistics developed to measure the strength of association between two variables. The phi is a nonparametric statistic used with cross-tabulated data where both variables are dichotomous. Dichotomous means that a variable has only two possible values. For example, a variable recording life status has only two levels, “alive” and “not alive” (or dead). If a public health department were researching the proportion of newborns born alive versus born dead, each baby could be born alive or born dead; there are no other possibilities. Typically, such data are coded numerically for the computer: one level of the variable is assigned the number 0 and the other level is assigned the number 1. To use the phi, both variables must be measured with only two levels. The symbol for the statistic is the lowercase Greek letter phi: ϕ.

The phi is the effect size statistic of choice for 2 × 2 (read two-by-two) table statistics such as the Fisher’s exact test or a 2 × 2 chi-square. The data in columns and rows should be nominal, although the phi is frequently used with two-level variables measured at the ordinal level and with collapsed interval/ratio data. After providing background on the phi correlation coefficient, this entry reviews its assumptions, explains how to calculate and interpret it, and then concludes with a worked example.

Background

The phi was developed by Karl Pearson, who was one of the mathematicians involved in the development of the theory of general linear models. Pearson had a particular interest in correlation and developed a variety of measures, including the phi and the Pearson product moment correlation coefficient, better known today as the Pearson r. The phi is also a product moment correlation, and its coefficient and significance results are similar to those of the Pearson r. Many statistical computer programs (e.g., Stata, SPSS, SAS) compute the phi statistic and provide a significance level for the result.
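The link between the phi and the Pearson r can be illustrated with a minimal sketch. It assumes each dichotomous variable is coded 0/1, expands a 2 × 2 table into one pair of codes per subject, and computes an ordinary Pearson r on those codes. The cell labels a, b, c, d, the function names, and the counts are hypothetical and not taken from this entry.

```python
import math

def pearson_r(x, y):
    """Plain product moment correlation between two equal-length lists."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    cov = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    ss_x = sum((xi - mean_x) ** 2 for xi in x)
    ss_y = sum((yi - mean_y) ** 2 for yi in y)
    return cov / math.sqrt(ss_x * ss_y)

def expand_table(a, b, c, d):
    """Turn 2 x 2 counts into two 0/1 coded lists, one entry per subject.

    Assumed layout: a = (x=1, y=1), b = (x=1, y=0),
                    c = (x=0, y=1), d = (x=0, y=0).
    """
    x = [1] * a + [1] * b + [0] * c + [0] * d
    y = [1] * a + [0] * b + [1] * c + [0] * d
    return x, y

# Hypothetical counts, chosen only for illustration.
a, b, c, d = 10, 20, 30, 15
x, y = expand_table(a, b, c, d)
r = pearson_r(x, y)

# Cross-product form of the phi for the same table.
phi = (a * d - b * c) / math.sqrt((a + b) * (c + d) * (a + c) * (b + d))

print(f"Pearson r on 0/1 codes: {r:.4f}")
print(f"Cross-product phi:      {phi:.4f}")
```

Under this coding the two values agree up to floating-point rounding, which is one way to see why the phi counts as a product moment correlation.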

As a correlation statistic, the phi measures the strength of an association between two variables. Correlation statistics provide four items of information:

  • They answer the question, “Do these two variables covary?” That is, does one variable change when the other changes?
  • When two variables do covary, these statistics describe the direction of the association, which can be positive or negative. A positive correlation means that as one variable increases, the other also increases. A negative correlation means that as one variable increases, the other decreases.
  • Correlations describe the strength of the association, that is, how closely the two variables change together. In a perfect correlation, for every one-level rise in one variable, the other variable would change by exactly one level, either rising (positive correlation) or falling (negative correlation). The phi value can range from 0 to +1.0. (Because the standard formula takes the square root of a number, its result cannot be negative. Some other methods of calculation can return a negative number.)
  • The significance of an obtained phi value can be determined when the statistic is calculated by hand, and the statistical programs that produce the phi also provide a significance level.
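The remark about the range of the phi can be made concrete with a short sketch. It assumes the square-root form is ϕ = √(χ²/n), computed from a Pearson chi-square without continuity correction, and compares it with the cross-product form ϕ = (ad − bc)/√((a+b)(c+d)(a+c)(b+d)), which keeps a sign. The cell labels a, b, c, d, the function names, and the counts are hypothetical.

```python
import math

def chi_square_2x2(a, b, c, d):
    """Pearson chi-square (no continuity correction) for [[a, b], [c, d]]."""
    n = a + b + c + d
    row_totals = (a + b, c + d)
    col_totals = (a + c, b + d)
    observed = ((a, b), (c, d))
    chi2 = 0.0
    for i in range(2):
        for j in range(2):
            expected = row_totals[i] * col_totals[j] / n
            chi2 += (observed[i][j] - expected) ** 2 / expected
    return chi2

def phi_from_chi_square(a, b, c, d):
    """Square-root form: never negative."""
    n = a + b + c + d
    return math.sqrt(chi_square_2x2(a, b, c, d) / n)

def phi_signed(a, b, c, d):
    """Cross-product form: negative when b*c exceeds a*d."""
    numerator = a * d - b * c
    denominator = math.sqrt((a + b) * (c + d) * (a + c) * (b + d))
    return numerator / denominator

# Hypothetical counts, chosen only for illustration.
counts = (10, 20, 30, 15)
print(f"square-root phi:   {phi_from_chi_square(*counts):.4f}")
print(f"cross-product phi: {phi_signed(*counts):.4f}")
```

For a given table the two forms have the same magnitude; only the cross-product form reports the direction of the association.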

Assumptions

The phi coefficient, like virtually all inferential statistics not specifically designed to test matched pairs or other related measures, assumes that the sample was randomly selected from a defined population. It also assumes that subjects were independently sampled from the population; that is, selection of one subject is unrelated to selection of any other subject. As with the chi-square, the sample size must be adequate for the computed phi statistic to be useful. The chi-square requires that 80% or more of the expected cell values be at least 5, and if this assumption is violated, neither the chi-square nor a phi calculated on the basis of that chi-square can be relied upon. It should be noted that samples smaller than 30 are considered to be very small samples, and small samples are less likely to be representative of the population of interest than larger samples. A sample size of 30 will, in most studies, provide a minimum of 5 for the expected values in all four cells.
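The expected-value requirement can be checked before computing the phi. The sketch below assumes the usual expected count for a cell, (row total × column total) / n, and applies the 80% rule stated above. The function names, the cell labels a, b, c, d, and the example counts are hypothetical.

```python
def expected_counts(a, b, c, d):
    """Expected cell counts for the 2 x 2 table [[a, b], [c, d]]."""
    n = a + b + c + d
    row_totals = (a + b, c + d)
    col_totals = (a + c, b + d)
    return [[row_totals[i] * col_totals[j] / n for j in range(2)]
            for i in range(2)]

def meets_expected_count_rule(a, b, c, d, minimum=5, share=0.80):
    """True when at least `share` of the expected counts reach `minimum`."""
    cells = [e for row in expected_counts(a, b, c, d) for e in row]
    adequate = sum(1 for e in cells if e >= minimum)
    return adequate / len(cells) >= share

# Hypothetical tables, chosen only for illustration.
print(meets_expected_count_rule(10, 20, 30, 15))  # balanced table, n = 75
print(meets_expected_count_rule(1, 2, 3, 24))     # skewed table, n = 30
```

With only four cells, three adequate cells amount to 75%, so in a 2 × 2 table the 80% rule effectively requires all four expected values to be at least 5. A heavily skewed table can fail the check even when the total sample reaches 30.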

...
