Characterization of the k-means algorithm for spectral profiles
Author
Sola Viladesau, Eva
Date
2023
Abstract
The k-means algorithm is a machine learning clustering method that has gained
popularity for both its scalability and its simplicity. The output of this method
consists of a distribution of the input data into k groups, together with k
representative examples (the cluster centroids).
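As an illustration of the two outputs described above (a group label for every input and k representative examples), the following is a minimal sketch of the standard Lloyd iteration in plain Python. It is not the implementation used in the thesis: the function names `kmeans` and `squared_distance`, the random initialisation, and the two five-point toy profile shapes are all assumptions made for this example only.

```python
import random


def squared_distance(a, b):
    """Squared Euclidean distance between two equal-length profiles."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def kmeans(profiles, k, max_iter=100, seed=0):
    """Plain Lloyd iteration; returns (labels, centroids)."""
    rng = random.Random(seed)
    # Initialise the centroids with k profiles drawn at random (an assumption;
    # other initialisation schemes exist).
    centroids = [list(p) for p in rng.sample(profiles, k)]
    labels = None
    for _ in range(max_iter):
        # Assignment step: each profile joins its nearest centroid.
        new_labels = [
            min(range(k), key=lambda j: squared_distance(p, centroids[j]))
            for p in profiles
        ]
        if new_labels == labels:
            break  # assignments are stable, so the centroids are too
        labels = new_labels
        # Update step: each centroid becomes the mean of its members.
        for j in range(k):
            members = [p for p, lab in zip(profiles, labels) if lab == j]
            if members:  # keep the old centroid if a cluster is empty
                centroids[j] = [sum(col) / len(members) for col in zip(*members)]
    return labels, centroids


# Hypothetical toy "profiles": a single-peaked and a double-peaked shape,
# each repeated with slightly different intensity scalings.
single_peak = [0.0, 0.5, 1.0, 0.5, 0.0]
double_peak = [0.0, 1.0, 0.4, 1.0, 0.0]
scales = (0.98, 1.00, 1.02)
profiles = [[s * v for v in single_peak] for s in scales]
profiles += [[s * v for v in double_peak] for s in scales]

labels, centroids = kmeans(profiles, k=2)
```

Here `labels` holds one group index per input profile and `centroids` holds the k representative profiles. Which group receives index 0 or 1 depends on the initialisation, so only the grouping itself is meaningful.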
The aim of this Bachelor’s Thesis is to test k-means clustering results under
controlled conditions by means of an artificial dataset. The data mimic solar
observations from the Interface Region Imaging Spectrograph (IRIS) in the Mg II
h&k lines. The scenario is made incrementally more complex, and the impact
on the clustering is studied on a case-by-case basis. The goal is to consistently
obtain a distribution that accurately separates the different profiles in the dataset.
Furthermore, the results are compared to those of hierarchical clustering methods
and the effect of two common preprocessing schemes is analyzed.
The k-means final results are considered satisfactory, given that the main goal of
discerning between spectral behavior patterns is achieved with very low error rates,
even when the data are purposefully contaminated with defective profiles and noise.
Nevertheless, when these impediments become too widespread, masking becomes
necessary; it allows the previous statistics to be recovered. The hierarchical
methods are deemed equal or inferior to k-means in terms of performance, depending
on the specific criterion.
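Because an artificial dataset comes with known group memberships, an error rate can be computed directly from the clustering output. The abstract does not state how the thesis defines its error rates, so the sketch below shows only one common convention, assumed here for illustration: cluster indices are arbitrary, so the best one-to-one relabelling of clusters is searched for before counting mismatches. The function name `clustering_error_rate` and the example labels are hypothetical.

```python
from itertools import permutations


def clustering_error_rate(true_labels, cluster_labels, k):
    """Fraction of misassigned items under the best cluster relabelling.

    Tries every one-to-one mapping of cluster index to true class, which is
    only practical for small k (the search costs k! evaluations).
    """
    n = len(true_labels)
    best_matches = 0
    for perm in permutations(range(k)):
        # perm[c] is the true class that cluster c is mapped to.
        matches = sum(
            1 for t, c in zip(true_labels, cluster_labels) if perm[c] == t
        )
        best_matches = max(best_matches, matches)
    return 1.0 - best_matches / n


# Hypothetical example: eight profiles from three known classes, with one
# profile of class 1 placed in the cluster that mostly holds class 2.
true_labels = [0, 0, 0, 1, 1, 1, 2, 2]
cluster_labels = [2, 2, 2, 0, 0, 1, 1, 1]
error = clustering_error_rate(true_labels, cluster_labels, k=3)
print(error)  # 0.125, i.e. one of eight profiles misassigned
```

For larger k, the exhaustive search over permutations would normally be replaced by an assignment solver, but the brute-force form keeps the definition explicit.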