Cluster analysis is a tool for finding groups of similar observations in a dataset. In this article, we will learn how to use PROC CLUSTER, the SAS procedure for hierarchical clustering. We will introduce the basic idea, work through a practical example, and show how to turn the results into cluster assignments that can be used in further analysis. By the end of this article, you will know how to use PROC CLUSTER in SAS for your own datasets.

Clustering is a technique in machine learning that attempts to find clusters of observations within a dataset.

The goal is to find clusters such that the observations within each cluster are quite similar to each other, while observations in different clusters are quite different from each other.

The easiest way to perform clustering in SAS is to use PROC CLUSTER.

The following example shows how to use PROC CLUSTER in practice.

Example: How to Use PROC CLUSTER in SAS

Suppose we have the following dataset that contains information about points, assists and rebounds for 20 different basketball players:

/*create dataset*/
data my_data;
    input points assists rebounds;
    player_ID = _n_;   /*number the players 1-20 so each row can be identified later*/
    datalines;
18 3 15
20 3 14
19 4 14
14 5 10
14 4 8
15 7 14
20 8 13
28 7 9
30 6 5
31 9 4
35 12 11
33 14 6
29 9 5
25 9 5
25 4 3
27 3 8
29 4 12
30 12 7
19 5 6
23 11 5
;
run;

/*view dataset*/
proc print data=my_data;
run;

Suppose we would like to perform clustering to attempt to identify “clusters” of players that have similar stats to each other.

The following code shows how to use PROC CLUSTER in SAS to perform clustering:

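/*perform clustering using points, assists and rebounds variables*/
/*outtree= saves the cluster tree so that PROC TREE can read it later*/
proc cluster data=my_data method=average outtree=clustd;
    var points assists rebounds;
    id player_ID;
run;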

The first tables in the output (the eigenvalues of the covariance matrix and the cluster history) provide information about how the clustering was performed.

A dendrogram is also produced so that we can visually inspect the similarity between observations in the dataset.

The y-axis shows the individual observations and the x-axis shows the average distance between clusters.

From looking at this dendrogram, it appears that the observations naturally group themselves into three clusters.
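
Reading the number of clusters off a dendrogram is a judgment call. As a rough cross-check, PROC CLUSTER can also print the cubic clustering criterion (CCC) and the pseudo F and pseudo t-squared statistics in the cluster history. The sketch below assumes the same my_data dataset; print=10 is an arbitrary choice that limits the history to the last 10 merges, and with only 20 observations these statistics should be treated as a loose guide rather than a rule:

/*add statistics that can help choose the number of clusters*/
proc cluster data=my_data method=average ccc pseudo print=10;
    var points assists rebounds;
run;

A candidate number of clusters is one where the CCC and pseudo F values reach a local peak and the pseudo t-squared value is small but jumps at the next merge.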

We can then run PROC TREE on the tree dataset with ncl=3 to tell SAS to assign each observation in the original dataset to one of three clusters:

/*assign each observation to one of three clusters*/
proc tree data=clustd noprint ncl=3 out=clusts;
    copy points assists rebounds;
    id player_ID;
run;
proc sort data=clusts;
   by cluster;
run;

/*view cluster assignments*/
proc print data=clusts;
    id player_ID;
run;

The resulting dataset shows each of the original observations along with the cluster they belong to.

For example, we can see that the players with IDs 2, 3, 1, 4, 5, 7, 6 and 19 all belong to cluster 1.

This tells us that these eight players are “similar” across the points, assists and rebounds variables.
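
One way to see what makes the clusters different is to summarize the variables within each cluster. The sketch below assumes the clusts dataset created by PROC TREE above, which stores the assignment in a variable named cluster:

/*summarize points, assists and rebounds within each cluster*/
proc means data=clusts n mean min max maxdec=1;
    class cluster;
    var points assists rebounds;
run;

Comparing the cluster means side by side gives a quick profile of each group, for example whether one cluster is made up of high scorers or strong rebounders.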

Note: For this example we chose to use average as the linkage method for clustering. Refer to the SAS documentation for PROC CLUSTER for a complete list of other linkage methods you can use.
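
Switching linkage methods only requires changing the method= option. As an illustration, the sketch below reruns the analysis with Ward's minimum-variance method, which tends to produce compact clusters of similar size; the output dataset name ward_tree is a placeholder chosen for this example:

/*repeat the clustering with Ward's method instead of average linkage*/
proc cluster data=my_data method=ward outtree=ward_tree;
    var points assists rebounds;
run;

Different linkage methods can produce different dendrograms from the same data, so it is worth checking whether the three-cluster grouping holds up under more than one method.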
