Lasso and silhouette score

Emma_Smith · February 10, 2024, 4:10pm

Seems like a simple task. I'm clustering a dataframe of assets. That works well, but I would like to know how well, so I'm trying to measure the silhouette score.

#%%from sklearn import cluster, covariance, manifoldimport yfinance as yfi - Pastebin.com

But I'm getting an error while trying to compute the silhouette score. Any help appreciated.

Akshay_Choudhary · February 14, 2024, 4:37am

Hi Emma,

It looks like there is a mismatch in the number of samples between the input data X and the labels passed to the metrics.silhouette_score function. To address this issue, you need to ensure that the number of samples (rows) in your input data X matches the number of labels. Ensure that the clustering algorithm is correctly assigning labels to each sample in your input data. The number of unique labels should match the number of clusters found.

Hope this helps!

Emma_Smith · February 15, 2024, 12:36pm

Given the code, how can that be? The rows are time series price data. Is the algorithm wrong? I don't think the clustering is incorrect, but the inputs going into the silhouette score may be wrong.

Akshay_Choudhary · February 16, 2024, 3:15pm

Hi Emma,

There can be multiple reasons for this. I would suggest you debug the code by adding multiple print statements. This will help you to understand where the issue is.

Emma_Smith · February 16, 2024, 4:18pm

I know the code well. I'm asking for help because I have tried. How can anyone know how many labels exist before they are produced? The rows are time series stock data.

varun_kumar_pothula · February 19, 2024, 1:29pm

Hello Emma,

The number of labels (or clusters) in a clustering algorithm like AffinityPropagation is not predetermined but is rather derived from the data during the clustering process. This can make it challenging to know the exact number of clusters beforehand.
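A small sketch of that point, using synthetic blobs rather than the asset data from the thread (the dataset and variable names here are illustrative assumptions): the cluster count only becomes known after fitting, but the label array still has one entry per sample.

```python
import numpy as np
from sklearn.cluster import AffinityPropagation
from sklearn.datasets import make_blobs

# Synthetic data standing in for the real input (assumption for illustration).
X, _ = make_blobs(n_samples=60, centers=3, cluster_std=0.5, random_state=0)

# No n_clusters argument: the algorithm decides how many clusters to form.
ap = AffinityPropagation(random_state=0).fit(X)
labels = ap.labels_

# The cluster count is read off the result, not set in advance.
n_clusters = len(np.unique(labels))
print("clusters found:", n_clusters)
print("labels per sample:", labels.shape[0], "for", X.shape[0], "samples")
```

Whatever the number of clusters turns out to be, `labels` has the same length as the number of rows that were clustered, and that length is what `silhouette_score` checks.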

After running the code provided by you and checking the output, I have observed that the error message indicates that the number of samples in the input data X is not consistent with the number of labels assigned by the clustering algorithm. This inconsistency could occur due to various reasons, such as incorrect data preprocessing, label assignment, or clustering algorithm parameters.

Given that the number of samples in X is 260, and the number of labels is 24, there seems to be a discrepancy. To diagnose and resolve this issue, here are some potential areas to investigate:

Data Preprocessing: Ensure that the data preprocessing steps (e.g., handling missing values, scaling) are performed correctly and consistently for both X and labels. Make sure there are no missing values or NaNs in the input data, as they can cause inconsistencies.

Clustering Algorithm Parameters: Review the parameters used for the clustering algorithm (AffinityPropagation in this case). Adjusting parameters such as preference may affect the number of clusters generated by the algorithm. Ensure that the parameters are set appropriately for your data.

Label Assignment: Verify how labels are assigned to data points after clustering. Ensure that each data point has a corresponding label and that the label array has the correct shape.

Debugging: Print out intermediate results such as the shapes of X and labels, as well as the unique values in the labels array. This can help identify any inconsistencies or unexpected values.
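The debugging step above can be sketched as follows. This is only a guess at the cause, assuming the script follows the scikit-learn stock-market example, where rows of X are trading days and columns are assets, and the clustering runs on an assets-by-assets matrix. In that case 260 would be the number of days and 24 the number of assets, so the labels describe columns of X rather than rows. The data below is synthetic and the correlation matrix stands in for the Lasso-estimated covariance.

```python
import numpy as np
from sklearn import cluster
from sklearn.metrics import silhouette_score

rng = np.random.default_rng(0)
n_days, n_assets = 260, 24

# Synthetic stand-in for the price variations: three hidden factors,
# eight assets loading on each one, plus noise (assumption, not real data).
factors = rng.normal(size=(n_days, 3))
X = np.repeat(factors, 8, axis=1) + 0.3 * rng.normal(size=(n_days, n_assets))

# Cluster the assets from an assets-by-assets similarity matrix.
similarity = np.corrcoef(X, rowvar=False)
_, labels = cluster.affinity_propagation(similarity, random_state=0)
n_clusters = len(np.unique(labels))

print("X shape:", X.shape)            # (260, 24): days x assets
print("labels shape:", labels.shape)  # (24,): one label per asset

# Passing X directly pairs 260 rows with 24 labels and raises ValueError.
try:
    silhouette_score(X, labels)
    mismatch_error = None
except ValueError as exc:
    mismatch_error = str(exc)
print("error with X:", mismatch_error)

# Transposing makes each row an asset, so rows and labels line up.
# The silhouette score is only defined for 2 to n_samples - 1 clusters.
score = None
if 2 <= n_clusters <= n_assets - 1:
    score = silhouette_score(X.T, labels)
print("silhouette on X.T:", score)
```

If the shapes printed from the real script show the same pattern, passing `X.T` (or a precomputed asset-to-asset distance matrix) is the change to try.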

I hope this helps!

Emma_Smith · June 9, 2024, 9:51am

I tried. Can you fix this code?

Emma_Smith · June 9, 2024, 10:02am

Silhouette score is not typically meaningful for covariance matrix clustering but is there a better way to do this?

Here is code that demonstrates how I get it done.

from sklearn import cluster, covarianceimport yfinance as yfimport pandas as - Pastebin.com

Thank you in advance for your help.

Akshay_Choudhary · June 10, 2024, 9:44am

Hi Emma,

Here are some approaches for clustering covariance matrices:

Divergence-Based Measures: Covariance matrices represent distributions, so it's more appropriate to use measures that consider the geometry of distributions, for example Kullback-Leibler (KL) divergence.
Riemannian Manifold-Based Methods: Covariance matrices can be seen as points on a Riemannian manifold, and clustering can be performed in this space.
Model-Based Methods: Gaussian Mixture Models (GMMs) can naturally handle covariance matrices by modelling the data as a mixture of Gaussian distributions.
Graph-Based Methods: By representing covariance matrices as nodes in a graph, where edges represent some similarity measure, graph clustering algorithms can be used.
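One possible reading of the graph-based option, as a rough sketch only: here the nodes are the assets themselves, the edge weights are absolute correlations, and the result is scored with a silhouette computed on a precomputed distance matrix. The synthetic data, the choice of `SpectralClustering`, the fixed `n_clusters=3`, and the `1 - |correlation|` distance are all assumptions made for illustration, not something taken from the code in the thread.

```python
import numpy as np
from sklearn.cluster import SpectralClustering
from sklearn.metrics import silhouette_score

rng = np.random.default_rng(0)
n_days, n_assets = 260, 24

# Synthetic returns: three hidden factors, eight assets on each, plus noise.
factors = rng.normal(size=(n_days, 3))
returns = np.repeat(factors, 8, axis=1) + 0.3 * rng.normal(size=(n_days, n_assets))

# Edge weights of the asset graph: absolute correlation between assets.
affinity = np.abs(np.corrcoef(returns, rowvar=False))

# Graph clustering on the precomputed affinity matrix.
model = SpectralClustering(n_clusters=3, affinity="precomputed", random_state=0)
labels = model.fit_predict(affinity)

# Turn similarity into a distance; the diagonal must be exactly zero
# for metric="precomputed".
distance = np.clip(1.0 - affinity, 0.0, None)
np.fill_diagonal(distance, 0.0)

score = silhouette_score(distance, labels, metric="precomputed")
print("labels:", labels)
print("silhouette (precomputed distance):", score)
```

Scoring on an asset-to-asset distance keeps the evaluation in the same space the clustering used, which sidesteps the question of which axis of the price matrix holds the samples.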

Emma_Smith · June 12, 2024, 3:13pm

What about the Adjusted Rand Index?

_Rushda_Ansari · June 13, 2024, 9:37am

Hi Emma,

Adjusted Rand Index is not a clustering tool in itself. However, you can use it to evaluate the similarity between two clustering results: it considers all pairs of samples and counts the pairs that are assigned to the same or to different clusters in the predicted and true clusterings. It's particularly useful for validating clustering algorithms when a reference labelling is available to compare against.
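A minimal illustration with hand-written labels (the three label lists are made up for this example): the index compares two labelings of the same samples, and it ignores how the clusters happen to be numbered.

```python
from sklearn.metrics import adjusted_rand_score

reference = [0, 0, 0, 1, 1, 1]    # reference (or first) clustering
relabelled = [1, 1, 1, 0, 0, 0]   # same grouping, cluster names swapped
different = [0, 1, 0, 1, 0, 1]    # a genuinely different grouping

# Identical partitions score 1.0 even though the label values differ.
same_partition = adjusted_rand_score(reference, relabelled)

# A partition that disagrees with the reference scores lower.
other_partition = adjusted_rand_score(reference, different)

print("same grouping:", same_partition)
print("different grouping:", other_partition)
```

Because it only needs two label arrays, the same call works on the output of any clustering method, as long as there is a second labelling of the same samples to compare it with.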

Emma_Smith · June 13, 2024, 1:54pm

I know what it is for. I'm asking if I can use it with any clustering tool (Lasso, DBSCAN, etc.).

_Rushda_Ansari · June 14, 2024, 1:12pm

Hi Emma,

Yes, you can use it to evaluate the performance of clustering results. Here are a couple of resources that you might find helpful:

1. Unsupervised Learning K-means Clustering and DBSCAN

2. DBSCAN Clustering in ML | Density based clustering

Thanks

Rushda