Logic.core.clustering package#

Logic.core.clustering.clustering_metrics module#

class ClusteringMetrics#

Bases: object

adjusted_rand_score(true_labels: List, cluster_labels: List) float#

Calculate the adjusted Rand index for the given cluster assignments and ground truth labels.

Parameters:
  • true_labels (List) – A list of ground truth labels for each data point (Genres).

  • cluster_labels (List) – A list of cluster assignments for each data point.

Returns:

The adjusted Rand index, ranging from -1 to 1, where a higher value indicates better clustering.

Return type:

float
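
As a minimal illustration of the metric (a sketch, not this class's implementation), the adjusted Rand index can be computed from the contingency counts of (true label, cluster) pairs:

```python
from collections import Counter
from math import comb

def adjusted_rand_index(true_labels, cluster_labels):
    # Contingency counts: how often each (true label, cluster) pair co-occurs.
    pair_counts = Counter(zip(true_labels, cluster_labels))
    row_sums = Counter(true_labels)
    col_sums = Counter(cluster_labels)
    n = len(true_labels)

    sum_pairs = sum(comb(c, 2) for c in pair_counts.values())
    sum_rows = sum(comb(c, 2) for c in row_sums.values())
    sum_cols = sum(comb(c, 2) for c in col_sums.values())

    expected = sum_rows * sum_cols / comb(n, 2)
    max_index = (sum_rows + sum_cols) / 2
    if max_index == expected:  # degenerate case, e.g. everything in one cluster
        return 1.0
    return (sum_pairs - expected) / (max_index - expected)
```

Note that the index is invariant to cluster relabeling: a clustering that matches the ground truth with permuted cluster ids still scores 1.0.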

purity_score(true_labels: List, cluster_labels: List) float#

Calculate the purity score for the given cluster assignments and ground truth labels.

Parameters:
  • true_labels (List) – A list of ground truth labels for each data point (Genres).

  • cluster_labels (List) – A list of cluster assignments for each data point.

Returns:

The purity score, ranging from 0 to 1, where a higher value indicates better clustering.

Return type:

float
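
A sketch of the underlying computation (not this class's implementation): purity assigns each cluster its majority ground-truth label and measures the fraction of points covered by those majorities.

```python
from collections import Counter

def purity(true_labels, cluster_labels):
    # For each cluster, count its most common true label among members,
    # then divide the total of those counts by the number of data points.
    total = 0
    for cluster in set(cluster_labels):
        members = [t for t, c in zip(true_labels, cluster_labels) if c == cluster]
        total += Counter(members).most_common(1)[0][1]
    return total / len(true_labels)
```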

silhouette_score(embeddings: List, cluster_labels: List) float#

Calculate the average silhouette score for the given cluster assignments.

Parameters:
  • embeddings (List) – A list of vectors representing the data points.

  • cluster_labels (List) – A list of cluster assignments for each data point.

Returns:

The average silhouette score, ranging from -1 to 1, where a higher value indicates better clustering.

Return type:

float
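
For intuition, a self-contained sketch of the silhouette computation (hypothetical helper, not this class's implementation): for each point, a is the mean distance to its own cluster and b the mean distance to the nearest other cluster.

```python
from math import dist  # Euclidean distance, Python 3.8+

def mean_silhouette(embeddings, cluster_labels):
    clusters = {}
    for idx, label in enumerate(cluster_labels):
        clusters.setdefault(label, []).append(idx)

    scores = []
    for i, label in enumerate(cluster_labels):
        own = clusters[label]
        if len(own) == 1:
            scores.append(0.0)  # common convention for singleton clusters
            continue
        # a: mean distance to the other points in the same cluster
        a = sum(dist(embeddings[i], embeddings[j])
                for j in own if j != i) / (len(own) - 1)
        # b: mean distance to the nearest other cluster
        b = min(sum(dist(embeddings[i], embeddings[j]) for j in other) / len(other)
                for lab, other in clusters.items() if lab != label)
        scores.append((b - a) / max(a, b))
    return sum(scores) / len(scores)
```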

Logic.core.clustering.clustering_utils module#

class ClusteringUtils#

Bases: object

cluster_hierarchical_average(emb_vecs: List) List#

Clusters input vectors using the hierarchical clustering method with average linkage.

Parameters:

emb_vecs (List) – A list of vectors to be clustered.

Returns:

A list containing the cluster index for each input vector.

Return type:

List

cluster_hierarchical_complete(emb_vecs: List) List#

Clusters input vectors using the hierarchical clustering method with complete linkage.

Parameters:

emb_vecs (List) – A list of vectors to be clustered.

Returns:

A list containing the cluster index for each input vector.

Return type:

List

cluster_hierarchical_single(emb_vecs: List) List#

Clusters input vectors using the hierarchical clustering method with single linkage.

Parameters:

emb_vecs (List) – A list of vectors to be clustered.

Returns:

A list containing the cluster index for each input vector.

Return type:

List

cluster_hierarchical_ward(emb_vecs: List) List#

Clusters input vectors using hierarchical clustering with Ward linkage.

Parameters:

emb_vecs (List) – A list of vectors to be clustered.

Returns:

A list containing the cluster index for each input vector.

Return type:

List
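
The four methods above differ only in the linkage criterion. A naive agglomerative sketch (hypothetical helper, not this class's implementation; Ward linkage is omitted for brevity since it requires variance bookkeeping) shows how single, complete, and average linkage plug into the same merge loop:

```python
from math import dist

def agglomerative(vectors, n_clusters, linkage="average"):
    # Start with one cluster per point; repeatedly merge the closest pair.
    clusters = [[i] for i in range(len(vectors))]

    def cluster_distance(c1, c2):
        d = [dist(vectors[i], vectors[j]) for i in c1 for j in c2]
        if linkage == "single":
            return min(d)       # closest pair of points
        if linkage == "complete":
            return max(d)       # farthest pair of points
        return sum(d) / len(d)  # average of all pairwise distances

    while len(clusters) > n_clusters:
        i, j = min(((i, j) for i in range(len(clusters))
                    for j in range(i + 1, len(clusters))),
                   key=lambda p: cluster_distance(clusters[p[0]], clusters[p[1]]))
        clusters[i].extend(clusters.pop(j))

    labels = [0] * len(vectors)
    for k, members in enumerate(clusters):
        for i in members:
            labels[i] = k
    return labels
```

Production implementations typically delegate to a library routine rather than the O(n³) loop above.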

cluster_kmeans(emb_vecs: List, n_clusters: int, max_iter: int = 100) Tuple[List, List]#

Clusters input vectors using the K-means method.

Parameters:
  • emb_vecs (List) – A list of vectors to be clustered.

  • n_clusters (int) – The number of clusters to form.

  • max_iter (int, optional) – The maximum number of iterations to run. Default is 100.

Returns:

Two lists:
  1. A list containing the cluster centers.
  2. A list containing the cluster index for each input vector.

Return type:

Tuple[List, List]
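
A self-contained sketch of the K-means loop this method describes (hypothetical helper names, not this class's implementation), alternating assignment and centroid-update steps until convergence:

```python
import random
from math import dist

def kmeans(emb_vecs, n_clusters, max_iter=100, seed=0):
    rng = random.Random(seed)
    # Initialize centers by sampling distinct input vectors.
    centers = [list(v) for v in rng.sample(emb_vecs, n_clusters)]
    labels = [0] * len(emb_vecs)
    for _ in range(max_iter):
        # Assignment step: each vector goes to its nearest center.
        labels = [min(range(n_clusters), key=lambda k: dist(v, centers[k]))
                  for v in emb_vecs]
        # Update step: each center becomes the mean of its assigned vectors.
        new_centers = []
        for k in range(n_clusters):
            members = [v for v, l in zip(emb_vecs, labels) if l == k]
            new_centers.append([sum(c) / len(members) for c in zip(*members)]
                               if members else centers[k])
        if new_centers == centers:  # converged
            break
        centers = new_centers
    return centers, labels
```

Because initialization is random, results can land in a local optimum; after convergence, though, every point is guaranteed to be nearest to its own center.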

cluster_kmeans_WCSS(emb_vecs: List, n_clusters: int) Tuple[List, List, float]#

Performs K-means clustering on a list of input vectors and calculates the Within-Cluster Sum of Squares (WCSS) for the resulting clusters, returning the cluster centroids, the cluster assignment for each input vector, and the WCSS value.

The WCSS is a measure of the compactness of the clustering, and it is calculated as the sum of squared distances between each data point and its assigned cluster centroid. A lower WCSS value indicates that the data points are closer to their respective cluster centroids, suggesting a more compact and well-defined clustering.

The K-means algorithm works by iteratively updating the cluster centroids and reassigning data points to the closest centroid until convergence or a maximum number of iterations is reached. This function uses a random initialization of the centroids and runs the algorithm for a maximum of 100 iterations.

Parameters:
  • emb_vecs (List) – A list of vectors to be clustered.

  • n_clusters (int) – The number of clusters to form.

Returns:

Three elements:
  1. A list containing the cluster centers.
  2. A list containing the cluster index for each input vector.
  3. The Within-Cluster Sum of Squares (WCSS) value for the clustering.

Return type:

Tuple[List, List, float]
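
The WCSS computation itself is compact. Given the centers and assignments produced by a K-means run, a sketch (hypothetical helper, not this module's implementation):

```python
from math import dist

def wcss(emb_vecs, centers, labels):
    # Sum of squared distances from each point to its assigned centroid.
    return sum(dist(v, centers[l]) ** 2 for v, l in zip(emb_vecs, labels))
```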

get_most_frequent_words(documents: List[str], top_n: int = 10) List[Tuple[str, int]]#

Finds the most frequent words in a list of documents.

Parameters:
  • documents (List[str]) – A list of documents, where each document is a string containing the document’s words.

  • top_n (int, optional) – The number of most frequent words to return. Default is 10.

Returns:

A list of tuples, where each tuple contains a word and its frequency, sorted in descending order of frequency.

Return type:

List[Tuple[str, int]]
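
A minimal sketch of this kind of word counting (hypothetical helper; it assumes documents are whitespace-tokenizable strings):

```python
from collections import Counter

def most_frequent_words(documents, top_n=10):
    counts = Counter()
    for doc in documents:
        counts.update(doc.split())  # assumes whitespace-separated words
    return counts.most_common(top_n)  # (word, frequency) pairs, most frequent first
```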

plot_kmeans_cluster_scores(embeddings: List, true_labels: List, k_values: List[int], project_name=None, run_name=None)#

Using the metrics implemented in clustering_metrics, this function calculates purity and silhouette scores for various numbers of clusters, then plots the scores for each k value to wandb, each metric in a different color.

Parameters:
  • embeddings (List) – A list of vectors representing the data points.

  • true_labels (List) – A list of ground truth labels for each data point.

  • k_values (List[int]) – A list containing the values of ‘k’ (number of clusters) for which the scores will be calculated, e.g. range(2, 9) to calculate scores for k values from 2 to 8.

  • project_name (str) – Your wandb project name. If None, the plot will not be logged to wandb. Default is None.

  • run_name (str) – Your wandb run name. If None, the plot will not be logged to wandb. Default is None.

Return type:

None

visualize_elbow_method_wcss(embeddings: List, k_values: List[int], project_name: str, run_name: str)#

This function implements the elbow method to determine the optimal number of clusters for K-means clustering based on the Within-Cluster Sum of Squares (WCSS).

The elbow method is a heuristic used to determine the optimal number of clusters in K-means clustering. It involves plotting the WCSS values for different values of K (number of clusters) and finding the “elbow” point in the curve, where the marginal improvement in WCSS starts to diminish. This point is considered as the optimal number of clusters.

The function performs the following steps:
  1. Iterate over the specified range of K values.
  2. For each K value, perform K-means clustering using the cluster_kmeans_WCSS function and store the resulting WCSS value.
  3. Create a line plot of WCSS values against the number of clusters (K).
  4. Log the plot to Weights & Biases (wandb) for visualization and tracking.

Parameters:
  • embeddings (List) – A list of vectors representing the data points to be clustered.

  • k_values (List[int]) – A list of K values (number of clusters) to explore for the elbow method.

  • project_name (str) – The name of the wandb project to log the elbow method plot.

  • run_name (str) – The name of the wandb run to log the elbow method plot.

Return type:

None
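
The elbow is usually read off the plot by eye, but it can also be picked heuristically. One common sketch (a hypothetical helper, not part of this module) chooses the k with the sharpest bend, i.e. the largest second difference of the WCSS curve:

```python
def pick_elbow_k(k_values, wcss_values):
    # The elbow is where the marginal WCSS improvement drops most sharply,
    # i.e. where the second difference of the WCSS curve is largest.
    curvature = [wcss_values[i - 1] - 2 * wcss_values[i] + wcss_values[i + 1]
                 for i in range(1, len(wcss_values) - 1)]
    return k_values[curvature.index(max(curvature)) + 1]
```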

visualize_kmeans_clustering_wandb(data, n_clusters, project_name, run_name)#

This function performs K-means clustering on the input data and visualizes the resulting clusters by logging a scatter plot to Weights & Biases (wandb).

This function applies the K-means algorithm to the input data and generates a scatter plot in which each data point is colored according to its assigned cluster. Use convert_to_2d_tsne to reduce the data to two dimensions so it can be plotted.

The function performs the following steps:
  1. Initialize a new wandb run with the provided project and run names.
  2. Perform K-means clustering on the input data with the specified number of clusters.
  3. Obtain the cluster labels for each data point from the K-means model.
  4. Create a scatter plot of the data, coloring each point according to its cluster label.
  5. Log the scatter plot as an image to the wandb run, allowing visualization of the clustering results.
  6. Close the plot display window to conserve system resources (optional).

Parameters:
  • data (np.ndarray) – The input data to perform K-means clustering on.

  • n_clusters (int) – The number of clusters to form during the K-means clustering process.

  • project_name (str) – The name of the wandb project to log the clustering visualization.

  • run_name (str) – The name of the wandb run to log the clustering visualization.

Return type:

None

wandb_plot_hierarchical_clustering_dendrogram(data, project_name, linkage_method, run_name)#

This function performs hierarchical clustering on the provided data and generates a dendrogram plot, which is then logged to Weights & Biases (wandb).

The dendrogram is a tree-like diagram that visualizes the hierarchical clustering process. It shows how the data points (or clusters) are progressively merged into larger clusters based on their similarity or distance.

The function performs the following steps:
  1. Initialize a new wandb run with the provided project and run names.
  2. Perform hierarchical clustering on the input data using the specified linkage method.
  3. Create a linkage matrix, which represents the merging of clusters at each step of the hierarchical clustering process.
  4. Generate a dendrogram plot using the linkage matrix.
  5. Log the dendrogram plot as an image to the wandb run.
  6. Close the plot display window to conserve system resources.

Parameters:
  • data (np.ndarray) – The input data to perform hierarchical clustering on.

  • linkage_method (str) – The linkage method for hierarchical clustering. It can be one of the following: “average”, “ward”, “complete”, or “single”.

  • project_name (str) – The name of the wandb project to log the dendrogram plot.

  • run_name (str) – The name of the wandb run to log the dendrogram plot.

Return type:

None

Logic.core.clustering.dimension_reduction module#

class DimensionReduction#

Bases: object

convert_to_2d_tsne(emb_vecs)#

Converts each raw embedding vector to a 2D vector.

Parameters:

emb_vecs (list) – A list of embedding vectors.

Returns:

A list of 2D vectors.

Return type:

list

pca_reduce_dimension(embeddings, n_components)#

Performs dimensionality reduction using PCA, keeping n_components components.

Parameters:
  • embeddings (list) – A list of embedding vectors.

  • n_components (int) – The number of principal components to keep.

Returns:

A list of reduced embeddings.

Return type:

list
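
To illustrate what PCA reduction does, a NumPy sketch (hypothetical helper, not this class's implementation): center the data, take the top eigenvectors of the covariance matrix, and project onto them.

```python
import numpy as np

def pca_reduce(embeddings, n_components):
    X = np.asarray(embeddings, dtype=float)
    X = X - X.mean(axis=0)                    # center the data
    cov = np.cov(X, rowvar=False)             # covariance matrix
    eigvals, eigvecs = np.linalg.eigh(cov)    # eigenvalues in ascending order
    top = eigvecs[:, ::-1][:, :n_components]  # top principal components
    return (X @ top).tolist()
```

For points lying on a line, a single component preserves all the variance; the sign of each projected coordinate is arbitrary, since eigenvector orientation is.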

wandb_plot_2d_tsne(data, project_name, run_name)#

This function performs t-SNE (t-Distributed Stochastic Neighbor Embedding) dimensionality reduction on the input data and visualizes the resulting 2D embeddings by logging a scatter plot to Weights & Biases (wandb).

t-SNE is a widely used technique for visualizing high-dimensional data in a lower-dimensional space, typically 2D. It aims to preserve the local structure of the data points while capturing the global structure as well. This function applies t-SNE to the input data and generates a scatter plot of the resulting 2D embeddings, allowing for visual exploration and analysis of the data’s structure and potential clusters.

The scatter plot is a useful way to visualize the t-SNE embeddings, as it shows the spatial distribution of the data points in the reduced 2D space.

The function performs the following steps:
  1. Initialize a new wandb run with the provided project and run names.
  2. Perform t-SNE dimensionality reduction on the input data, obtaining 2D embeddings.
  3. Create a scatter plot of the 2D embeddings using matplotlib.
  4. Log the scatter plot as an image to the wandb run, allowing visualization of the t-SNE embeddings.

Parameters:
  • data (np.ndarray) – The input data to perform t-SNE dimensionality reduction on.

  • project_name (str) – The name of the wandb project to log the t-SNE scatter plot.

  • run_name (str) – The name of the wandb run to log the t-SNE scatter plot.

Return type:

None

wandb_plot_explained_variance_by_components(data, project_name, run_name)#

This function plots the cumulative explained variance ratio against the number of components for a given dataset and logs the plot to Weights & Biases (wandb).

The cumulative explained variance ratio is a metric used in dimensionality reduction techniques, such as Principal Component Analysis (PCA), to determine the amount of information (variance) retained by the selected number of components. It helps in deciding how many components to keep while balancing the trade-off between retaining valuable information and reducing the dimensionality of the data.

The function performs the following steps:
  1. Fit a PCA model to the input data and compute the cumulative explained variance ratio.
  2. Create a line plot using Matplotlib, where the x-axis represents the number of components and the y-axis represents the corresponding cumulative explained variance ratio.
  3. Initialize a new wandb run with the provided project and run names.
  4. Log the plot as an image to the wandb run, allowing visualization of the explained variance by components.

Parameters:
  • data (np.ndarray) – The input data for which the explained variance by components will be computed and plotted.

  • project_name (str) – The name of the wandb project to log the explained variance plot.

  • run_name (str) – The name of the wandb run to log the explained variance plot.

Return type:

None

Logic.core.clustering.main module#