Skip to main content

Clustering subjects based on multiple measurements (SAS)

Objective:  Individuals are measured for several characteristics, and we want to group the individuals based on similarities across all characteristics. Statistical options are cluster analysis, principle components (PCA), and biplots.  PCA has the advantage of combining correlated variables together, reducing the complexity of explaining why individuals cluster together.  And biplots adds some nice features to PCA.  This post compares these choices.
Example:  10 farms measured for nutrients in grass, and the primary question is to see if/which farms have similar nutrient profiles.

Create a random dataset with 4 nutrients measured on 4 pastures in each of 10 farms.
Run the SAS code for producing biplots.  Here we restrict the number of PC to 2 (n=2), in general you would use PCA to decide how many components are needed.  Prinqual requires all variables to be processed by a Transform statement, here we use the identity transformation so the variables are unchanged.  Id statement specifies which variable labels the individuals to be clustered.  plots=mdpref requests the biplot.

data one; do farm=1 to 10; do pasture=1 to 4; adf=60+10*ranuni(3443); ndf=30+5*ranuni(3444); cp=10+2*ranuni(3445); ash=10*ranuni(3446); output; end;end; run; proc prinqual data=one plots(only)=mdpref n=2 out=sss; transform identity(adf ndf cp ash ); id farm; run;

Resulting biplot shows points labeled by farm, with 4 per farm from the 4 pastures.  Since example data are random, it is not surprising that pastures within farm generally show no clustering, and this makes clustering of farms very difficult to visualize.  The arrows provide information about the measured variables, and from a clustering perspective, arrows going in similar directions indicate similarity in values.  NDF and ash are correlated, while NDF and ADF are uncorrelated (90 degree angle means completely different directions), and CP and NDF are almost perfectly negatively correlated since they go in opposite directions.  Combining farm and variable information, we see that farm 10 tends to be located in the direction of ADF, thus it must have more extreme values for ADF as compared to other farms.  See Forrest Young for a quick interpretation guide, and full details can be found in textbooks.

Unless we are interested in clustering of samples within individuals, the messy scatter above would normally be removed by running the analysis on farm means.
proc means data=one noprint; by farm; var adf ndf cp ash; output out=mmm mean=; run; data mmm; set mmm; drop _type_ _freq_; run; proc prinqual data=mmm plots(only)=mdpref n=2 out=sss; transform identity(adf ndf cp ash ); id farm; run;

Now we have a clear view of farm clustering, with 9 being different, 7 and 2 clustering, and the rest forming another, or possibly 2, clusters.  Using means, the 4 variables are found to be independent of each other (all 90 degree angles), except for pairs going in opposite directions.

So the biplot approach provides a nice solution to the research question.  For comparison purposes only, let's look at PCA.
proc princomp data=mmm plots=all n=2; id farm; var adf ndf cp ash; run;
Clustering information is given by two plots:

We see exactly the same clustering as the biplot, just separated into two graphs.  The only benefit of running PCA is that it shows a 3rd PC is probably needed, although we could have guessed that were true by adding the percents displayed as axis labels in the biplot, and finding that only 69% of the information is captured by the first two PCs.
And finally, here is typical code for running a cluster analysis.  SAS has multiple procedures for clustering, with multiple methods in each.  Read the documentation for hints on appropriate choices.
proc cluster data=mmm method=ward outtree=tree ;
id farm; var adf ndf cp ash; run; proc tree data=tree noprint out=clust ncl=4; copy adf ndf cp ash farm; run; proc sgplot data=clust; scatter y=adf x=cp/group=cluster datalabel=farm markerattrs=(symbol=circlefilled size=12); run;

Proc Cluster produces a tree cluster.  The X-axis is a measure of distance among the clusters.  If we choose the 0.1 value, there are 4 lines, indicating 4 clusters.  Agreeing with PCA, Farm 9 is in its own cluster, but other clusters show different membership.  Disagreement is due to Cluster using the original variables, whereas PCA creates PC combinations of variables, then uses those to cluster.
If we want a scatterplot of the clusters, the above Tree and Sgplot code is needed.  Note we must specify the number of clusters (ncl=4), and must choose 2 variables for the scatterplot axes.  Thus the resulting plot may not show clear separation of clusters, as is the case here (clusters 1 and 2 intermix).  But an advantage of not using PCs is we can interpret scatterplots directly.  For example we see that Farm 10 had high ADF values, whereas with the biplot we knew it was extreme, but nothing else.  We also get some direct evidence for why Farm 9 is unusual.  To get a complete picture we would need additional scatterplots involving the other nutrients.  In this example, with only 4 nutrients, it would be tempting to use this clustering approach, as the number of scatterplots would be manageable (worst case is there are 6 pairs of the 4 variables), and thus avoid the potential difficulty interpreting PCs versus the actual measurements.  As the number of variables increases, biplot approach is preferred.



Comments

Post a Comment

Popular posts from this blog

DANDA - A macro collection for easier SAS statistical analysis

Objective :  You are running ANOVAs or regressions in SAS, and wish there was a way to avoid writing the dozens of commands needed to conduct the analysis and generate recommended diagnostics and summary of results, not to mention the hundreds of possible options that might be needed to access recommended methods.  A possible solution is to download a copy of danda.sas below, and use this macro collection to run the dozens of commands with one statement.  We will also have future posts covering various uses of danda.sas, giving examples as always. danda.sas is under continued development, check this page for updates. Date                       Version               Link 2021/03/15             2.12.030          danda.sas 2021/03/15       ...

Reporting results from transformed analyses

Objective :  Transformed data, for example log(y), is analyzed to correct normality or equal variance requirements.  But we want to report means and standard errors in the original units. SAS example : data one;  do treat=1 to 3;  do rep=1 to 5;    y=10 + treat+ exp(rannor(111));    logy=log(y);    output;  end;end; run; proc mixed plots=all;   class treat;   model y=treat;   lsmeans treat/pdiff; run; proc mixed plots=all;   class treat;   model logy=treat;   lsmeans treat/pdiff; run; The original data, variable y, might have units of pounds.  If a transformation is needed, we simply calculate a new variable by applying a mathematical function known to improve normality or equal variance, and run the same analysis on the new variable.  Commonly used choices are listed in the second table below. However, looking at the results for both analyses we see treat Mean Y S...

UTF character data, encoding of text

Objective and Background :  You have text data that is UTF encoded and need SAS/R to read and write datasets with that encoding.  If you have ever printed or viewed text information, and seen something like Giuffr?Ÿ’e?ƒe?Ÿƒ?ÿ?›ƒ?ªƒ?›?Ÿ’e›ƒ?ª­?Ÿƒeee, then you are running into this encoding issue.  Computers store text using numbers, with each number assigned to a particular character.  See  https://en.wikipedia.org/wiki/ASCII  to find that the character & is stored as 38 when using the ASCII encoding.  Unicode is popular internationally because it encodes special characters such as accented letters, and UTF-8 is a widely used version ( https://en.wikipedia.org/wiki/UTF-8 ).  In UTF-8 the & character is stored as 26, and you can imagine how the jumbled example above arises from the confusion of what letters are being stored. Solution 1 :  Use options to request that individual datasets be read and written in a particular encodin...