python - AgglomerativeClustering on a correlation matrix
I have a correlation matrix of typical structure, of size 288x288, defined by:
    from sklearn.cluster import AgglomerativeClustering

    df = read_returns()
    correl_matrix = df.corr()
where read_returns gives me a dataframe with a date index and columns of the returns of assets.
Now I want to cluster these correlations to reduce the population size.
By doing some reading and experimenting I discovered AgglomerativeClustering, and at first pass it appears to be an appropriate solution for my problem.
I define a distance metric as ((.5*(1-correl_matrix))**.5)
and have:
    cluster = AgglomerativeClustering(n_clusters=40, linkage='average')
    cluster.fit(((.5*(1-correl_matrix))**.5).values)
    label_groups = cluster.labels_
To observe some of the data and cross-check my work, I pick out cluster 1
and observe the pairwise correlations, then find the minimum correlation between any two items of that group in my dataset:
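    # Collect the names of all assets assigned to cluster 1
    single_cluster = []
    for i in range(0, correl_matrix.shape[0]):
        if label_groups[i] == 1:
            single_cluster.append(correl_matrix.index[i])

    # Find the smallest pairwise correlation inside that cluster
    min_correl = 1.0
    for x in single_cluster:
        for y in single_cluster:
            if x != y:
                if correl_matrix[x][y] < min_correl:
                    min_correl = correl_matrix[x][y]
    print(min_correl)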
and I get a min pairwise correlation of .20
To me this seems quite low, although "low based off what?" is a fair question to which I have no answer.
I would like to anticipate/enforce that each pairwise correlation within a cluster is >= .7 or something like this.
Is this possible in AgglomerativeClustering?
Am I accidentally going down the wrong path?
Hierarchical clustering supports different "linkage" strategies.
- single-link: connects points based on the minimum distance to the others in the cluster
- complete-link: connects based on the maximum distance to the cluster
- ...
If you want a high minimum correlation = small maximum distance, this calls for complete linkage.
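A minimal sketch of that idea, under some assumptions: it uses SciPy's hierarchical clustering rather than the AgglomerativeClustering class from the question (SciPy takes a precomputed distance matrix directly), the returns are synthetic stand-ins for read_returns(), and min_pairwise_corr is a hypothetical helper name. With complete linkage, cutting the tree at a distance threshold bounds every pairwise distance inside a cluster, so the .7 target from the question translates into a cut height for the question's metric.

```python
import numpy as np
import pandas as pd
from scipy.cluster.hierarchy import fcluster, linkage
from scipy.spatial.distance import squareform

# Synthetic stand-in for read_returns(): 3 hidden factors drive 12 assets.
rng = np.random.default_rng(0)
factors = rng.normal(size=(500, 3))
loadings = np.repeat(np.eye(3), 4, axis=0)  # 12 x 3, four assets per factor
returns = factors @ loadings.T + 0.3 * rng.normal(size=(500, 12))
df = pd.DataFrame(returns, columns=[f"asset_{i}" for i in range(12)])

correl_matrix = df.corr()

# Same metric as in the question; clip guards against tiny negative values.
dist = np.sqrt(np.clip(0.5 * (1.0 - correl_matrix.values), 0.0, None))
np.fill_diagonal(dist, 0.0)

# Complete linkage on the condensed (upper-triangle) distance vector.
Z = linkage(squareform(dist, checks=False), method="complete")

# corr >= 0.7  <=>  dist <= sqrt(0.5 * (1 - 0.7))
min_corr = 0.7
max_dist = np.sqrt(0.5 * (1.0 - min_corr))
labels = fcluster(Z, t=max_dist, criterion="distance")


def min_pairwise_corr(corr, labels, label):
    """Smallest off-diagonal correlation among members of one cluster."""
    members = np.flatnonzero(labels == label)
    if len(members) < 2:
        return 1.0  # a singleton has no pairs to violate the bound
    block = corr[np.ix_(members, members)]
    return block[~np.eye(len(members), dtype=bool)].min()


worst = min(min_pairwise_corr(correl_matrix.values, labels, k)
            for k in np.unique(labels))
print("clusters:", len(np.unique(labels)), "worst pairwise corr:", worst)
```

Note that the number of clusters is no longer chosen up front here; it falls out of the threshold. Fixing n_clusters=40 instead gives no guarantee on the minimum correlation.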
You may want to treat negative correlations as "good", too, i.e. use dist = 1 - abs(corr).
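To illustrate the difference between the two distances, here is a small sketch on made-up data (the column names a, a_inverse and unrelated are purely illustrative). Under the question's metric a strongly negatively correlated pair is about as far apart as possible, while under 1 - abs(corr) it is nearly identical.

```python
import numpy as np
import pandas as pd

# Illustrative data: "a_inverse" moves almost exactly opposite to "a".
rng = np.random.default_rng(1)
base = rng.normal(size=300)
df = pd.DataFrame({
    "a": base + 0.1 * rng.normal(size=300),
    "a_inverse": -base + 0.1 * rng.normal(size=300),
    "unrelated": rng.normal(size=300),
})
corr = df.corr()

# Metric from the question: negative correlation means large distance.
signed_dist = (0.5 * (1.0 - corr)).clip(lower=0.0) ** 0.5

# Suggested alternative: strong correlation of either sign means small distance.
abs_dist = 1.0 - corr.abs()

print(signed_dist.round(2))
print(abs_dist.round(2))
```

Whether this is desirable depends on the goal: if an asset that mirrors another counts as redundant, the absolute version groups them; if the sign matters, keep the signed metric.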
Make sure to use the dendrogram. If you have outliers in your data, you will want to cut into (n_clusters+n_outliers) partitions.
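A rough sketch of the outlier point, again with SciPy and synthetic data (two groups of five assets plus one column named outlier, all assumptions for illustration). Asking for n_clusters + n_outliers partitions lets the outlier end up in its own singleton instead of being forced into a real cluster.

```python
import numpy as np
import pandas as pd
from scipy.cluster.hierarchy import dendrogram, fcluster, linkage
from scipy.spatial.distance import squareform

# Synthetic data: 2 factors, 5 assets each, plus one unrelated "outlier".
rng = np.random.default_rng(2)
factors = rng.normal(size=(500, 2))
loadings = np.repeat(np.eye(2), 5, axis=0)
returns = factors @ loadings.T + 0.3 * rng.normal(size=(500, 10))
df = pd.DataFrame(returns, columns=[f"asset_{i}" for i in range(10)])
df["outlier"] = rng.normal(size=500)

dist = 1.0 - np.abs(df.corr().values)
np.fill_diagonal(dist, 0.0)
Z = linkage(squareform(dist, checks=False), method="complete")

# no_plot=True only computes the layout; drop it to draw with matplotlib.
tree = dendrogram(Z, no_plot=True, labels=list(df.columns))
print("leaf order:", tree["ivl"])

# Cut into n_clusters + n_outliers partitions.
n_clusters, n_outliers = 2, 1
labels = fcluster(Z, t=n_clusters + n_outliers, criterion="maxclust")
sizes = pd.Series(labels, index=df.columns).value_counts()
print(sizes)
```

In practice the number of outliers is not known in advance; inspecting the dendrogram for leaves that join the tree very late is one way to estimate it.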