Clustering subjects based on multiple measurements (SAS)

Objective: Individuals are measured for several characteristics, and we want to group the individuals based on similarities across all characteristics. Statistical options are cluster analysis, principle components (PCA), and biplots. PCA has the advantage of combining correlated variables together, reducing the complexity of explaining why individuals cluster together. And biplots adds some nice features to PCA. This post compares these choices.
Example: 10 farms measured for nutrients in grass, and the primary question is to see if/which farms have similar nutrient profiles.

Create a random dataset with 4 nutrients measured on 4 pastures in each of 10 farms.
Run the SAS code for producing biplots. Here we restrict the number of PC to 2 (n=2), in general you would use PCA to decide how many components are needed. Prinqual requires all variables to be processed by a Transform statement, here we use the identity transformation so the variables are unchanged. Id statement specifies which variable labels the individuals to be clustered. plots=mdpref requests the biplot.


data one;
do farm=1 to 10;
do pasture=1 to 4;
  adf=60+10*ranuni(3443);
  ndf=30+5*ranuni(3444);
  cp=10+2*ranuni(3445);
  ash=10*ranuni(3446);
  output;
end;end;
run;
proc prinqual data=one plots(only)=mdpref n=2 out=sss; 
  transform identity(adf ndf cp ash );  id farm;
run;

Resulting biplot shows points labeled by farm, with 4 per farm from the 4 pastures. Since example data are random, it is not surprising that pastures within farm generally show no clustering, and this makes clustering of farms very difficult to visualize. The arrows provide information about the measured variables, and from a clustering perspective, arrows going in similar directions indicate similarity in values. NDF and ash are correlated, while NDF and ADF are uncorrelated (90 degree angle means completely different directions), and CP and NDF are almost perfectly negatively correlated since they go in opposite directions. Combining farm and variable information, we see that farm 10 tends to be located in the direction of ADF, thus it must have more extreme values for ADF as compared to other farms. See Forrest Young for a quick interpretation guide, and full details can be found in textbooks.

Unless we are interested in clustering of samples within individuals, the messy scatter above would normally be removed by running the analysis on farm means.


proc means data=one noprint; by farm;
 var adf ndf cp ash;
 output out=mmm mean=;
run;
data mmm; set mmm; drop _type_ _freq_;
run;

proc prinqual data=mmm plots(only)=mdpref n=2 out=sss; 
  transform identity(adf ndf cp ash );  id farm;
run;

Now we have a clear view of farm clustering, with 9 being different, 7 and 2 clustering, and the rest forming another, or possibly 2, clusters. Using means, the 4 variables are found to be independent of each other (all 90 degree angles), except for pairs going in opposite directions.

So the biplot approach provides a nice solution to the research question. For comparison purposes only, let's look at PCA.


proc princomp data=mmm plots=all n=2;
  id farm; var adf ndf cp ash;
run;

Clustering information is given by two plots:

We see exactly the same clustering as the biplot, just separated into two graphs. The only benefit of running PCA is that it shows a 3rd PC is probably needed, although we could have guessed that were true by adding the percents displayed as axis labels in the biplot, and finding that only 69% of the information is captured by the first two PCs.
And finally, here is typical code for running a cluster analysis. SAS has multiple procedures for clustering, with multiple methods in each. Read the documentation for hints on appropriate choices.


proc cluster data=mmm method=ward outtree=tree ;

  id farm; var adf ndf cp ash;
run;
proc tree data=tree noprint out=clust ncl=4;
   copy adf ndf cp ash farm;
run;
proc sgplot data=clust;
 scatter y=adf x=cp/group=cluster datalabel=farm markerattrs=(symbol=circlefilled size=12);
run;

Proc Cluster produces a tree cluster. The X-axis is a measure of distance among the clusters. If we choose the 0.1 value, there are 4 lines, indicating 4 clusters. Agreeing with PCA, Farm 9 is in its own cluster, but other clusters show different membership. Disagreement is due to Cluster using the original variables, whereas PCA creates PC combinations of variables, then uses those to cluster.

If we want a scatterplot of the clusters, the above Tree and Sgplot code is needed. Note we must specify the number of clusters (ncl=4), and must choose 2 variables for the scatterplot axes. Thus the resulting plot may not show clear separation of clusters, as is the case here (clusters 1 and 2 intermix). But an advantage of not using PCs is we can interpret scatterplots directly. For example we see that Farm 10 had high ADF values, whereas with the biplot we knew it was extreme, but nothing else. We also get some direct evidence for why Farm 9 is unusual. To get a complete picture we would need additional scatterplots involving the other nutrients. In this example, with only 4 nutrients, it would be tempting to use this clustering approach, as the number of scatterplots would be manageable (worst case is there are 6 pairs of the 4 variables), and thus avoid the potential difficulty interpreting PCs versus the actual measurements. As the number of variables increases, biplot approach is preferred.