PA3 - Clustering

Last updated: 10/23 2:45 pm (fixed typo in the “Aside”)

There are two primary categories of machine learning techniques. Supervised learning involves using labelled data to train a model that can later apply labels to new data. For example, suppose I had a bunch of photos of Labradors (a dog breed) that I wanted to separate into black labs, yellow labs, and chocolate labs (varieties of Labrador that differ only by coat color). If I manually pre-labelled these photos with the correct variety, then I could use them to train a supervised learning algorithm to label future photos with the correct variety of Labrador.

An unsupervised model, on the other hand, does not require any pre-labelling of the training data. Instead, unsupervised models attempt to find patterns in the training data. For example, if I had photos of Labradors, we might be able to plot these photos as points in some Cartesian space (as a simple example, we might have a 3-dimensional point per picture representing the RGB value of the coat color). With these points, an unsupervised model might identify clusters of points that are similar to one another and conclude that points in the same cluster must be the same Labrador variety. The advantage of these “clustering” algorithms is that we do not need to spend the effort to pre-label; one disadvantage is that they do not tell us which cluster represents which variety. There are other advantages and disadvantages as well, but this is sufficient for now.

Problem Description

For this problem set we will write a clustering algorithm. Suppose we have \(n\) items in our data set. The input to your algorithm will be an \(n \times n\) array of distances and an integer \(k\). Cell \((i, j)\) of the array represents the distance from item \(i\) to item \(j\); we guarantee that cell \((i, j)\) equals cell \((j, i)\) and that each cell \((i, i)\) is 0, so you may think of the input as an undirected, weighted, complete graph.

This clustering algorithm will form the items into \(k\) clusters in a way that provides the best separation. How’s that defined? We define the spacing of a given \(k\)-clustering to be the minimum distance between any pair of points lying in different clusters. From among all possible \(k\)-clusterings, we seek the one with the maximum possible spacing (i.e., the maximum, over all \(k\)-clusterings, of this minimum cross-cluster distance). The intuition is that this maximizes the distance between the closest two clusters.
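
Written formally (this is just a restatement of the definition above), if \(\mathcal{C}\) is a \(k\)-clustering of the items and \(d(i, j)\) is the given distance between items \(i\) and \(j\), then

\[ \text{spacing}(\mathcal{C}) = \min \{\, d(i, j) : i \text{ and } j \text{ lie in different clusters of } \mathcal{C} \,\} \]

and among all \(k\)-clusterings we seek \(\mathcal{C}^* = \arg\max_{\mathcal{C}} \text{spacing}(\mathcal{C})\).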

For example, suppose we have 5 items that we would like to cluster into 3 clusters, and the item distances are those in the 2-d array below.

0 18 21 23 5
18 0 54 30 31
21 54 0 15 32
23 30 15 0 15
5 31 32 15 0

In this case the optimal clustering would be for the first cluster to contain items 0 and 4, the second to contain items 2 and 3, and the third to contain item 1. To see the maximum spacing value of this clustering, we look at the closest pair of items for each pair of clusters, and then take the smallest of those distances. So in this case the maximum spacing value of the clustering would be 15. The following explains why:

  • The distance between the first cluster and the second is 15 because the closest pair of points that crosses these clusters is items 4 and 3 which have distance 15.
  • The distance between the first cluster and the third is 18 because the closest pair of points that crosses these clusters is items 0 and 1 which have distance 18.
  • The distance between the second cluster and the third is 30 because the closest pair of points that crosses these clusters is items 1 and 3 which have distance 30.
  • The maximum spacing value of the clustering is the smallest of these distances, and so it is 15.

The goal of the clustering is to maximize the cluster spacing value.

It may help you understand this better if we look at a non-optimal clustering. Say the first cluster contained items 0 and 1, the second contained items 3 and 4, and the third contained item 2.

  • The distance between the first cluster and the second is 5 because the closest pair of points that crosses these clusters is items 0 and 4 which have distance 5.
  • The distance between the first cluster and the third is 21 because the closest pair of points that crosses these clusters is items 0 and 2 which have distance 21.
  • The distance between the second cluster and the third is 15 because the closest pair of points that crosses these clusters is items 3 and 2 which have distance 15.
  • The maximum spacing value of the clustering is the smallest of these distances, and so it is 5.

But this is smaller than the spacing value of the clustering discussed earlier. You can think of this as meaning that one pair of clusters in this clustering is not separated as much, so overall the clustering is considered worse. This is why we talked about finding the maximum of the smallest distances between clusters.
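
To make the computation in these two examples concrete, here is a short Python sketch (illustrative only; it is not part of the wrapper code, and the function name is our own) that evaluates the spacing of a given clustering by checking every cross-cluster pair:

from itertools import combinations

def spacing(dist, clusters):
    # Minimum distance between any two items in different clusters.
    best = float("inf")
    for a, b in combinations(clusters, 2):    # every pair of clusters
        for i in a:
            for j in b:
                best = min(best, dist[i][j])  # closest cross-cluster pair
    return best

dist = [[0, 18, 21, 23, 5],
        [18, 0, 54, 30, 31],
        [21, 54, 0, 15, 32],
        [23, 30, 15, 0, 15],
        [5, 31, 32, 15, 0]]

print(spacing(dist, [{0, 4}, {2, 3}, {1}]))  # 15 (the optimal clustering)
print(spacing(dist, [{0, 1}, {3, 4}, {2}]))  # 5  (the non-optimal clustering)

Note that this only evaluates one fixed clustering; your algorithm must find the best spacing over all possible \(k\)-clusterings.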

Input

Your algorithm will receive as input the contents of a test file called input.txt. The input takes the following form:

3
5
0 18 21 23 5
18 0 54 30 31
21 54 0 15 32
23 30 15 0 15
5 31 32 15 0

Where:

  • The first line is the number of clusters, \(k\)
  • The second line is the number of points, \(n\)
  • The rest of the file is the \(n \times n\) distance matrix of floating point numbers, with one row per line and values separated by spaces.

Unlike prior PAs, we’ve provided code in Main to parse the input file and pass to your compute() method the number \(k\) and the \(n \times n\) array of doubles that indicates the pairwise distances of all items.
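
You will not need to write that parsing yourself, but as a rough sketch of what it might look like in Python (the provided Main code may differ in its details):

with open("input.txt") as f:
    lines = f.read().splitlines()

k = int(lines[0])  # number of clusters
n = int(lines[1])  # number of points
# remaining n lines: one space-separated row of the distance matrix each
dist = [[float(x) for x in lines[2 + i].split()] for i in range(n)]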

Output

Your output will be the maximum spacing value of the clustering, so in the above example your algorithm should return 15.0.

Implementation

Your program should have the following properties:

  • Your algorithm must be written in Python 3.10.6 or Java 19.0.2 (OpenJDK)
  • You must download the appropriate wrapper code based on the language you choose: Main.py and Clustering.py for Python, Main.java and Clustering.java for Java.
  • The compute() method in Clustering receives as input the number \(k\) and the \(n \times n\) array of distances, as parsed by Main. You must have compute() return the result described in the Output section above. (A hypothetical skeleton appears after this list.)
  • You may modify the Main files to test your algorithm, but they will not be used during grading.
  • You must submit your Clustering file to Gradescope. Do not submit Main.java, Main.py, or any test files.
  • You may submit your code a maximum of 15 times to Gradescope. After the 15th submission, Gradescope will not run your code.
  • A few other notes:
    • Your code will be run as:
      python3 Main.py for Python, or javac *.java && java Main for Java.
    • You may upload multiple files if you need additional classes for Java, but do not assign packages to the files.
    • You may use any editor or IDE you prefer as long as the code can be executed as stated above.
    • We strongly encourage you to practice good development as you write code: use meaningful variable names, break your code into functions, “code a little then test a little”, use the debugger, etc.
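
For orientation only, a hypothetical Python skeleton of the Clustering file might look like the sketch below. The provided Clustering.py defines the actual structure and signature, so defer to it; the parameter names here are our own.

class Clustering:
    def compute(self, k, distances):
        # k:         the number of clusters to form (an int)
        # distances: the n x n array of pairwise distances (floats)
        # TODO: compute and return the maximum spacing value as a float
        pass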

Rules on Collaboration and Outside Sources

You may discuss the problem and the overall strategy with up to 4 other students, but you must list those people in your submission under collaborators. You may not share code, look at others’ code, or help others debug their code. You must write your own code: not just type it yourself (though you need to do that too), but compose it yourself as your own original work. Please read the syllabus carefully regarding its coding policies. Do not seek published or online solutions for any assignment. If you use any published or online resources (which may not include solutions) when completing this assignment, be sure to cite them. Do not submit a solution that you are unable to explain orally to a member of the course staff.

Use of Generative AI

For PA3, you are not allowed to use generative AI tools to help you solve this problem.

Helpful Hints

To help you to get started with thinking about this problem, keep these in mind:

  • If we consider the points and their distances as a weighted, undirected graph, then the distance that ends up being the final spacing value must be the weight of an edge in a minimum spanning tree of the graph. To see why, suppose the spacing value of the clustering is the weight of edge \((x, y)\). Define a cut in the graph such that the cluster containing \(x\) is on one side and all other nodes are on the other side. Any edge going from a node in \(x\)’s cluster to a node on the other side must cross this cut. The closest such pair of points therefore corresponds to the lightest edge crossing the cut, and by the cut property that edge is part of a minimum spanning tree.
  • Being a spanning tree, a minimum spanning tree is connected and acyclic and contains \(n - 1\) edges. If any edge is removed, the graph is no longer connected; instead it has two separate components. If any two edges are removed, it has three separate components. (A small sketch of one standard way to track components appears after this list.)
  • If you are confused about clustering, we suggest you draw the 5 nodes from the examples above along with the given clusterings. Make sure you understand what the smallest distance is between each pair of clusters, and how the spacing value we gave you is determined from these smallest distances. You might also consider how the other hints given in this section relate to these examples.
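
As mentioned in the hints, it can help to track which connected component each item belongs to. Purely as an illustration, and not a prescription of any particular solution, here is a minimal union-find (disjoint-set) sketch, a standard structure for maintaining components as edges are added:

class UnionFind:
    def __init__(self, n):
        self.parent = list(range(n))  # each item starts in its own component

    def find(self, x):
        while self.parent[x] != x:
            self.parent[x] = self.parent[self.parent[x]]  # path halving
            x = self.parent[x]
        return x

    def union(self, x, y):
        rx, ry = self.find(x), self.find(y)
        if rx == ry:
            return False              # already in the same component
        self.parent[rx] = ry          # merge the two components
        return True

Each successful union reduces the number of components by one; compare this with the hint about how removing edges from a spanning tree increases the number of components.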

Copyright © 2023 John Hott and Thomas Horton.
Released under the CC-BY-NC-SA 4.0 license.