# PA3 - Clustering

There are two primary categories of machine learning techniques. Supervised learning uses labelled data to train a model that can later apply labels to new data. For example, suppose I had a collection of photos of Labradors (a dog breed) that I wanted to separate into black labs, yellow labs, and chocolate labs (varieties of Labrador that differ only by coat color). If I manually pre-labelled these photos with the correct variety, I could use them to train a supervised learning algorithm to label future photos with the correct variety of Labrador.

An unsupervised model, on the other hand, does not require any pre-labelling of the training data. Instead, unsupervised models attempt to find patterns in the training data. For example, if I had photos of Labradors, we might be able to plot these photos as points in some Cartesian space (as a simple example, we might have a 3-dimensional point per picture representing the RGB values of the coat color). With these points, an unsupervised model might identify clusters of points that are similar to one another and conclude that points in the same cluster must be the same Labrador variety. The advantage of these “clustering” algorithms is that we do not need to spend the effort to pre-label; one disadvantage is that they do not tell us which cluster represents which variety. There are other advantages and disadvantages as well, but this is sufficient for now.

## Problem Description

For this problem set we will write a clustering algorithm. Suppose we have \(n\) items in our data set. The input to your algorithm is an \(n \times n\) array of distances, where cell \((i, j)\) represents the distance from item \(i\) to item \(j\) (we guarantee that cell \((i, j)\) equals cell \((j, i)\) and that each cell \((i, i)\) is 0, so you may think of this as an undirected, weighted, complete graph), and an integer \(k\).

This clustering algorithm will form the items into \(k\) clusters in a way that provides the best separation. *How’s that defined?* We define the *spacing* of a given \(k\)-clustering to be the minimum distance between any pair of points lying in different clusters. From among all possible \(k\)-clusterings, we seek the one with the **maximum** possible spacing (i.e., the maximum of a set of minimum distances between pairs of clusters). The intuition is that this maximizes the distance between the closest two clusters.
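In symbols, writing \(d(p, q)\) for the given distance between items \(p\) and \(q\), and \(C_1, \dots, C_k\) for the clusters:

\[
\operatorname{spacing}(C_1, \dots, C_k) \;=\; \min_{\substack{i \neq j \\ p \in C_i,\; q \in C_j}} d(p, q)
\]

and the goal is to choose the \(k\)-clustering that maximizes this quantity.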

For example, suppose we have 5 items that we would like to cluster into 3 clusters, and the item distances are those in the 2-d array below.

```
0 18 21 23 5
18 0 54 30 31
21 54 0 15 32
23 30 15 0 15
5 31 32 15 0
```

In this case the optimal clustering would be for the first cluster to contain items 0 and 4, the second to contain items 2 and 3, and the third to contain item 1. To see the maximum spacing value of this clustering, we look at the closest pair of items for each pair of clusters, and then save the smallest. So in this case the maximum spacing value of the clustering would be 15. The following explains why:

- The distance between the first cluster and the second is 15 because the closest pair of points that crosses these clusters is items 4 and 3 which have distance 15.
- The distance between the first cluster and the third is 18 because the closest pair of points that crosses these clusters is items 0 and 1 which have distance 18.
- The distance between the second cluster and the third is 30 because the closest pair of points that crosses these clusters is items 1 and 3 which have distance 30.
- The maximum spacing value of the clustering is the smallest of these distances, and so it is 15.

The goal of the clustering is to maximize the cluster spacing value.

It may help you understand this better if we look at a non-optimal clustering. Say the first cluster contained items 0 and 1, the second contained items 3 and 4, and the third contained item 2.

- The distance between the first cluster and the second is 5 because the closest pair of points that crosses these clusters is items 0 and 4 which have distance 5.
- The distance between the first cluster and the third is 21 because the closest pair of points that crosses these clusters is items 0 and 2 which have distance 21.
- The distance between the second cluster and the third is 15 because the closest pair of points that crosses these clusters is items 3 and 2 which have distance 15.
- The maximum spacing value of the clustering is the smallest of these distances, and so it is 5.
This is smaller than the spacing value for the clustering discussed earlier. You can think of this as meaning that one pair of clusters in this clustering is not separated as much, so overall the clustering is considered worse. This is why we talked about finding the *maximum* of the smallest distances between clusters.
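The spacing calculation in the two worked examples can be checked directly by brute force. The sketch below (the helper name `spacing` is just illustrative, not part of the assignment interface) computes the minimum cross-cluster distance for both clusterings above:

```python
from itertools import combinations

def spacing(dist, clusters):
    """Spacing of a clustering: the minimum distance between any
    pair of points that lie in different clusters."""
    best = float("inf")
    for a, b in combinations(clusters, 2):   # every pair of clusters
        for i in a:
            for j in b:                      # every cross-cluster pair of points
                best = min(best, dist[i][j])
    return best

dist = [
    [0, 18, 21, 23, 5],
    [18, 0, 54, 30, 31],
    [21, 54, 0, 15, 32],
    [23, 30, 15, 0, 15],
    [5, 31, 32, 15, 0],
]

print(spacing(dist, [{0, 4}, {2, 3}, {1}]))  # optimal clustering -> 15
print(spacing(dist, [{0, 1}, {3, 4}, {2}]))  # non-optimal clustering -> 5
```

Brute force over all clusterings is of course far too slow in general; this is only a way to sanity-check the definition on small examples.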

## Requirements

Construct a **greedy** algorithm that is given the number of clusters and an adjacency matrix (described above) and returns the maximum spacing value of the clustering. You **must** implement a minimum spanning tree algorithm to solve this problem.

### Input

Your algorithm will receive as input the contents of a test file called `input.txt`. The input takes the following form:

```
3
5
0 18 21 23 5
18 0 54 30 31
21 54 0 15 32
23 30 15 0 15
5 31 32 15 0
```

Where:

- The first line is the number of clusters, \(k\)
- The second line is the number of points, \(n\)
- The rest of the file is an adjacency matrix of floating point numbers, space separated per line.
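A minimal sketch of parsing this format from the file's lines (the helper name `parse_input` is illustrative and not part of the provided wrapper code):

```python
def parse_input(lines):
    """Parse the format above: k on line 1, n on line 2, then an
    n x n distance matrix with one space-separated row per line."""
    k = int(lines[0])
    n = int(lines[1])
    dist = [[float(x) for x in lines[2 + i].split()] for i in range(n)]
    return k, n, dist

sample = """3
5
0 18 21 23 5
18 0 54 30 31
21 54 0 15 32
23 30 15 0 15
5 31 32 15 0""".splitlines()

k, n, dist = parse_input(sample)
print(k, n, dist[3][4])  # prints: 3 5 15.0
```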

Your `compute()` method will accept as parameters: the number \(k\) and the \(n \times n\) array of **double**s that indicates the pairwise distances of all items.

### Output

Your output will be the maximum spacing value of the clustering, so in the above example your algorithm should return `15.0`.

### Running Time Requirements

The worst-case asymptotic running time of your program should belong to \(O(n^2 \log n)\).

## Implementation

Your program should have the following properties:

- Your algorithm must be written in Python 3.10.6 or Java 19.0.2 (OpenJDK).
- You must download the appropriate wrapper code based on the language you choose: `Main.py` and `Clustering.py` for Python, `Main.java` and `Clustering.java` for Java.
  - Java package: pa3-java.zip
  - Python package: pa3-python.zip
- The `compute()` method in `Clustering` receives as input the body of the input file as a list of strings. You must have `compute()` return the results described above.
- You *may* modify the `Main` files to test your algorithm, but they **will not** be used during grading.
- You must submit your `Clustering` file to Gradescope. Do **not** submit `Main.java`, `Main.py`, or any test files.
- You may submit your code a **maximum of 15 times** to Gradescope. After the 15th submission, Gradescope will not run your code.
- A few other notes:
  - Your code will be run as `python3 Main.py` for Python, or `javac *.java && java Main` for Java.
  - You may **not** use any graph packages for this assignment.
  - You may upload multiple files if you need additional classes for Java, but **do not assign packages** to the files.
  - You may use any editor or IDE you prefer as long as the code can be executed as stated above.
  - Please note that you are responsible for analyzing the running time of any algorithm you use and ensuring that it satisfies the runtime requirements for this assignment.
  - We strongly encourage you to practice good development as you write code: use meaningful variable names, break your code into functions, “code a little then test a little”, use the debugger, etc.


### Rules on Collaboration and Outside Sources

You must follow the rules about Collaboration and Outside Sources in the syllabus.

### Use of Generative AI

For PA3, you are not allowed to use generative AI tools to help you solve this problem.

## Helpful Hints

To help you to get started with thinking about this problem, keep these in mind:

- If we consider the points and their distances as a weighted, undirected graph, then the distance that ends up being the final spacing value must be the weight of an edge in a minimum spanning tree of the graph. Suppose that the spacing of the optimal clustering is the weight of edge \((x, y)\). Define a cut in the graph such that the cluster containing \(x\) is on one side of the cut and all other nodes are on the other side. Any edge going from a node in \(x\)’s cluster to a node on the other side must cross the cut. The closest such pair is therefore joined by the lightest edge crossing the cut, and by the cut property the lightest edge crossing a cut is part of a minimum spanning tree.
- Being a spanning tree, a minimum spanning tree is connected and acyclic and contains \(|V| - 1\) edges. If any one edge is removed, the graph is no longer connected; instead it has two separate components. If any two edges are removed, it has three separate components, and in general removing \(k - 1\) edges leaves \(k\) components.
- If you are confused about clustering, we suggest you draw the 5 nodes for the examples given above with the clusterings given. Make sure you understand what the smallest distance is between each pair of clusters. Also, make sure you understand how the spacing value we gave you is determined from these smallest distances. You might also consider how the other hints given in this section relate to these examples.
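Putting these hints together, one possible greedy sketch is a Kruskal-style loop with union-find. This is illustrative only: the function name and structure are assumptions, not the required `compute()` interface, and a real submission would still need to handle input parsing and verify the runtime bound.

```python
def max_spacing(k, dist):
    """Kruskal-style sketch: sort all edges, merge components from
    lightest to heaviest until only k remain, then return the weight
    of the lightest edge that still crosses two different components."""
    n = len(dist)
    parent = list(range(n))

    def find(x):                    # union-find with path halving
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    # all n*(n-1)/2 undirected edges, sorted by weight
    edges = sorted((dist[i][j], i, j)
                   for i in range(n) for j in range(i + 1, n))
    components = n
    for w, i, j in edges:
        ri, rj = find(i), find(j)
        if ri == rj:
            continue                # edge inside an existing component
        if components == k:
            return w                # lightest remaining cross-cluster edge
        parent[ri] = rj             # merge the two components
        components -= 1

print(max_spacing(3, [
    [0, 18, 21, 23, 5],
    [18, 0, 54, 30, 31],
    [21, 54, 0, 15, 32],
    [23, 30, 15, 0, 15],
    [5, 31, 32, 15, 0],
]))  # 15
```

Sorting the \(O(n^2)\) edges costs \(O(n^2 \log n^2) = O(n^2 \log n)\), which fits the stated running time requirement.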