Quantum-Powered Asset Clustering: Correlation Clustering with GCS-Q
A step-by-step guide to clustering financial assets using signed correlation graphs and quantum annealing with the GCS-Q algorithm.
Introduction
Portfolio managers and quantitative analysts spend a great deal of effort trying to understand how financial assets move together. If two stocks go up and down in near lockstep, holding both does little for diversification. Conversely, assets that move in opposite directions can hedge one another. The standard tool for quantifying these co-movements is the Pearson correlation coefficient, which yields values in $[-1, 1]$: a value near $+1$ means the two assets are highly correlated, near $-1$ means they are anticorrelated, and near $0$ means they are essentially uncorrelated.
A natural next step is to cluster assets into groups such that intra-cluster correlations are high (the assets in each group move together) and inter-cluster correlations are low or negative (different groups capture different market factors). This is the asset clustering problem, and it lies at the heart of:
- Portfolio optimization — Markowitz mean-variance optimization [5] benefits from well-separated clusters, since each cluster represents a distinct risk-return profile.
- Statistical arbitrage — pairs or basket trading strategies rely on identifying groups of cointegrated or highly correlated assets [6].
- Risk management — understanding correlation structures helps quantify systemic risk and contagion.
Why Classical Methods Struggle
Classical clustering algorithms such as $k$-Means, $k$-Medoids (PAM), hierarchical clustering, and even spectral methods like SPONGE [7] face four fundamental limitations when applied to signed correlation graphs:
(i) Number of clusters $k$ is not known in advance
Most classical methods require the user to specify $k$ before running the algorithm. In practice, the “right” number of clusters in a correlation matrix is unknown and changes over time as market regimes shift. Heuristics like the elbow method, silhouette analysis, or the eigengap criterion are expensive, data-dependent, and do not generalize across datasets.
(ii) Reformulation is lossy
Since classical methods require a positive-definite distance metric, Pearson correlations ($\rho \in [-1, 1]$) must be transformed into non-negative distances via mappings such as:
\[d_{ij}^{(\alpha)} = \sqrt{\alpha(1 - \rho_{ij})}, \quad \alpha > 0\]
While this preserves ranking, it loses semantic fidelity. For example, a zero correlation $\rho_{ij} = 0$ (no relationship) maps to $\sqrt{\alpha}$, falsely implying a fixed non-zero dissimilarity. The rich signed structure of the original correlation graph is irreversibly discarded.
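For instance, with $\alpha = 2$ the mapping sends $\rho = 1 \mapsto 0$, $\rho = 0 \mapsto \sqrt{2} \approx 1.41$, and $\rho = -1 \mapsto 2$: anti-correlation is encoded only as a larger distance, not as a qualitatively different (hedging) relationship.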
(iii) Optimization objective is different
Centroid-based methods (e.g., $k$-Means, PAM) optimize a proxy objective — minimizing distances to cluster centroids — rather than the true objective of maximizing intra-cluster correlations and minimizing inter-cluster correlations. Even when the distance transformation is applied, optimally solving the transformed problem does not recover the optimal clustering under the original signed graph.
(iv) Not robust for varying-sized clusters
Spectral methods like SPONGE perform well on sparse graphs with discrete weights $\in \{-1, 0, +1\}$ and approximately equal cluster sizes, but degrade significantly on dense financial correlation graphs with continuous weights and heterogeneous cluster structures. Similarly, $k$-Means assumes spherical, equally-sized clusters — an assumption that rarely holds in practice.
Enter GCS-Q
The Graph-based Coalition Structure Generation algorithm (GCS-Q) [3] was originally developed for cooperative game theory but is a natural fit for signed graph clustering. It operates directly on the signed, weighted correlation graph, avoiding lossy distance transformations. Starting with all assets in a single cluster, it recursively bipartitions each subgraph by solving a minimum-cut formulated as a QUBO (Quadratic Unconstrained Binary Optimization) problem, which is solved using a D-Wave quantum annealer. The algorithm automatically determines $k$, terminating when no further split improves intra-cluster agreement.
This post walks through the entire pipeline, from downloading stock data to running GCS-Q on D-Wave hardware, step by step.
Overview of the GCS-Q approach for correlation clustering of financial assets [1]
Problem Formulation
From Returns to a Signed Graph
Consider a set of financial assets $\mathcal{A} = \{a_1, a_2, \dots, a_n\}$ whose historical returns over a rolling window are denoted by $r_i(t)$ for asset $a_i$ at time $t$. The pairwise Pearson correlation coefficients are computed as:
\[\rho_{ij} = \frac{\text{Cov}(r_i, r_j)}{\sigma_{r_i} \sigma_{r_j}}, \quad \rho_{ij} \in [-1, 1]\]
These correlations define a signed, weighted, undirected graph $G = (V, E, w)$ where:
- Each vertex $v_i \in V$ corresponds to asset $a_i$.
- Each edge $(v_i, v_j) \in E$ carries the weight $w_{ij} = \rho_{ij}$.
Positive edges indicate co-movement; negative edges indicate anti-correlation. Unlike classical methods, we retain the full signed structure of this graph.
The Clustering Objective
We seek a partition $\Pi = \{C_1, C_2, \dots, C_k\}$ of $V$ into disjoint clusters that maximizes intra-cluster agreement:
\[\max_{\Pi} \sum_{C \in \Pi} \sum_{\substack{i,j \in C \\ i < j}} w_{ij}\]
This objective promotes grouping assets with strong positive correlations while penalizing the inclusion of negatively correlated pairs within the same cluster. Importantly, the number of clusters $k$ is not an input — it is determined by the algorithm.
The Penalty Metric
Since ground-truth clusters are unavailable for real financial data, we evaluate clustering quality using the structural balance penalty, which penalizes two types of violations:
- Negative intra-cluster edges: negatively correlated assets placed in the same cluster.
- Positive inter-cluster edges: positively correlated assets placed in different clusters.
A lower penalty means better clustering, with zero representing a perfectly structurally balanced partition. Clusters with low penalty are internally cohesive and externally distinct, exactly the properties desired for portfolio diversification.
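Concretely, for a partition $\Pi$ the penalty described above can be written (counting each unordered pair once) as:
\[P(\Pi) = \sum_{\substack{i<j \\ \text{same cluster}}} |w_{ij}|\, \mathbb{I}[w_{ij} < 0] \;+\; \sum_{\substack{i<j \\ \text{different clusters}}} w_{ij}\, \mathbb{I}[w_{ij} > 0]\]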
The GCS-Q Algorithm
How It Works
GCS-Q follows a top-down hierarchical divisive strategy:
- Start with all $n$ assets in a single cluster (the “grand coalition”).
- Bipartition the current cluster by solving a QUBO that computes the minimum cut of the subgraph.
- Check the stopping criterion: if the cut value is non-positive (i.e., further splitting does not improve intra-cluster agreement), keep the current cluster intact. Otherwise, enqueue both partitions for further splitting.
- Repeat until no more beneficial splits exist.
Algorithm: GCS-Q for Correlation Clustering
Input: Weighted graph $G = (V, E, w)$
Output: Clustering $\mathcal{C} = \{C_1, C_2, \dots, C_k\}$
- $\mathcal{C} \leftarrow \emptyset$, $\text{Queue} \leftarrow \{V\}$
- While Queue is not empty:
- $S \leftarrow \text{Queue.pop}()$, $G_S \leftarrow$ subgraph induced by $S$
- Solve QUBO: $\max \sum_{i,j \in S} w_{ij} \cdot \mathbb{I}[x_i = x_j]$
- $C \leftarrow \{i \in S \mid x_i = 1\}$, $\bar{C} \leftarrow S \setminus C$
- If $\text{cut}(C, \bar{C}) \leq 0$ or $C = \emptyset$ or $\bar{C} = \emptyset$:
- $\mathcal{C} \leftarrow \mathcal{C} \cup \{S\}$
- Else:
- $\text{Queue} \leftarrow \text{Queue} \cup \{C, \bar{C}\}$
- Return $\mathcal{C}$
How the QUBO Is Constructed
It is important to emphasize that the entire clustering task is not formulated as a single QUBO. Instead, GCS-Q solves a sequence of independent QUBO problems, one at each iteration of the recursive algorithm. Each QUBO corresponds to the minimum-cut problem on the current subgraph — i.e., finding the best way to split the nodes of that subgraph into two groups.
The minimum cut on a weighted graph with $n$ nodes is itself a combinatorially hard problem. An exhaustive brute-force search would need to evaluate all $2^n$ possible binary assignments to determine which bipartition minimizes the cut. For even moderately sized subgraphs, this becomes infeasible on classical hardware — beyond roughly $n \approx 25$ nodes, brute-force enumeration is computationally intractable. Commercial solvers like Gurobi and CPLEX use branch-and-bound and cutting-plane heuristics to avoid full enumeration, but they too struggle on fully dense graphs (where every pair of nodes is connected), which is exactly the structure of financial correlation matrices.
Given the weighted adjacency matrix $W$ of a subgraph, the QUBO matrix $Q$ for the minimum-cut problem is:
\[Q_{ii} = \sum_{j} W_{ij} \quad (\text{diagonal: degree of node } i)\]
\[Q_{ij} = -W_{ij} \quad (\text{off-diagonal: negative adjacency})\]
Binary variables $x_i \in \{0, 1\}$ indicate which side of the cut each node belongs to. Minimizing $\mathbf{x}^T Q \mathbf{x}$ yields the minimum cut, which is then used to decide whether to split the subgraph.
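To make the construction concrete, here is a minimal illustrative sketch (a hypothetical three-asset example, not taken from the paper) that builds $Q$ from a small signed matrix and brute-forces all $2^3$ assignments (feasible only at this toy scale):
import numpy as np
from itertools import product

# Hypothetical 3-asset signed correlation matrix:
# assets 0 and 1 co-move strongly; asset 2 is anti-correlated with asset 0.
W = np.array([[ 0.0,  0.8, -0.5],
              [ 0.8,  0.0,  0.2],
              [-0.5,  0.2,  0.0]])

# QUBO as defined above: weighted degree on the diagonal, negated adjacency off it.
Q = np.diag(W.sum(axis=1)) - W

# Exhaustive search over all binary assignments (only viable for tiny n).
best = min(product([0, 1], repeat=3),
           key=lambda x: np.array(x) @ Q @ np.array(x))
print(best)  # (0, 0, 1): asset 2 is split off; the cut weight is -0.5 + 0.2 = -0.3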
This QUBO formulation is natively compatible with quantum annealers (D-Wave) and variational quantum algorithms (QAOA) [4].
Why Quantum?
Since each minimum-cut QUBO involves a solution space of $2^n$ binary assignments, and GCS-Q may solve up to $\mathcal{O}(n)$ such QUBOs (one per recursive split, as $k$ ranges over $[1, n]$), the overall classical complexity is $\mathcal{O}(n \cdot 2^n)$.
The quantum annealer addresses the hard inner loop: rather than enumerating $2^n$ candidate bipartitions sequentially, it explores all possible assignments simultaneously via quantum superposition and tunneling. This is particularly advantageous for financial correlation graphs, which are typically fully dense (most pairwise correlations are non-zero), making classical heuristics and pruning strategies less effective.
In other words, quantum annealing does not speed up the outer recursion (which is at most linear in $n$), but it tackles the exponentially hard minimum-cut subproblem at each step — the true computational bottleneck.
Implementation Guide
Prerequisites
- Python 3.10+
- D-Wave account with an API token (store it in dwave-api-token.txt)
Clone the Repository
git clone https://github.com/supreethmv/Quantum-Asset-Clustering.git
cd Quantum-Asset-Clustering
Install Dependencies
Use the provided requirements.txt to install all necessary libraries with tested, compatible versions:
pip install -r requirements.txt
Store your D-Wave API token in a file named dwave-api-token.txt in the project root.
Step 1: Import Dependencies
We begin by importing the libraries we need. These fall into four logical groups:
# --- Data & Numerics ---
# yfinance: download historical stock price data directly from Yahoo Finance
# pandas/numpy: data manipulation and linear algebra
import yfinance as yf
import pandas as pd
import numpy as np
# --- Graph & Visualization ---
# networkx: represent the correlation matrix as a weighted signed graph
# matplotlib/seaborn: plot penalty comparison charts
import networkx as nx
import time
import matplotlib.pyplot as plt
import seaborn as sns
# --- Quantum (D-Wave) ---
# DWaveSampler: interface to the physical quantum annealer
# EmbeddingComposite: automatically maps our problem graph onto the QPU topology
# BinaryQuadraticModel / dimod: construct and manage QUBO formulations
from dwave.system import DWaveSampler, EmbeddingComposite
from dimod import BinaryQuadraticModel
import dimod
# --- Linear Algebra ---
# eigh: compute eigenvalues of symmetric matrices (used for spectral gap estimation)
from scipy.linalg import eigh
import warnings
warnings.filterwarnings('ignore')
If you are using the Gurobi solver instead of D-Wave, you can skip the dwave.system and dimod imports and instead add import gurobipy as gp (shown in Option B below).
Step 2: Implement the GCS-Q Algorithm
2a. Graph Construction
The first step is to convert the Pearson correlation matrix into a graph that GCS-Q can operate on. Each stock becomes a node, and each pairwise correlation becomes a weighted edge. Positive weights represent assets that tend to move together; negative weights represent assets that move in opposite directions.
This graph is the foundation for everything that follows — the QUBO, the minimum cut, and ultimately the cluster assignments.
def construct_graph(adj_matrix):
    """Constructs a NetworkX graph from a signed adjacency matrix."""
    G = nx.Graph()
    num_nodes = len(adj_matrix)
    for i in range(num_nodes):
        for j in range(i + 1, num_nodes):
            # Only add an edge if there is a non-zero correlation.
            # For financial data, this is almost always the case
            # (most stocks are correlated), resulting in a dense graph.
            if adj_matrix[i][j] != 0:
                G.add_edge(i, j, weight=adj_matrix[i][j])
    return G
Use-case context: For a portfolio of 50 stocks, this creates a graph with 50 nodes and up to $\binom{50}{2} = 1225$ edges. In practice, nearly all pairwise correlations are non-zero, so the graph is fully connected (dense). This density is precisely what makes the minimum-cut problem hard for classical solvers.
2b. QUBO Construction and Bipartitioning
This is the heart of the formulation. The QUBO encodes the minimum-cut objective for the current subgraph. Given the weighted adjacency matrix $W$, the QUBO matrix $Q$ is constructed, and then submitted to a solver to find the optimal bipartition.
We provide two solver options: the D-Wave quantum annealer (remote, requires API key) and the Gurobi classical solver (local, requires license).
def get_qubo_matrix(W):
    """
    Computes the QUBO matrix for the minimum cut problem.
    Given the weighted adjacency matrix W of a subgraph:
    - Diagonal: Q[i,i] = sum of weights on all edges touching node i (its degree)
    - Off-diagonal: Q[i,j] = -W[i,j] (negate the adjacency weight)
    The resulting Q encodes the minimum-cut objective:
    minimizing x^T Q x over binary x finds the partition that
    cuts through the least total positive edge weight.
    """
    n = W.shape[0]
    Q = np.zeros((n, n))
    for i in range(n):
        # Diagonal entry: the weighted degree of node i.
        # This represents the "cost" of placing node i on one side of the cut.
        Q[i, i] = np.sum(W[i])
        for j in range(n):
            if i != j:
                # Off-diagonal entry: the negative of the edge weight.
                # When x_i ≠ x_j (nodes on different sides), this contributes
                # to the objective, penalizing cutting through strong positive edges.
                Q[i, j] = -W[i, j]
    return Q
Why this works: Consider the quadratic form $\mathbf{x}^T Q \mathbf{x}$ with binary variables $x_i \in \{0, 1\}$. When expanded, the terms involving $w_{ij}\, x_i (1 - x_j)$ accumulate the total weight of edges crossing the cut. Minimizing this is exactly the minimum cut — the partition that keeps strongly correlated assets together and separates anti-correlated ones.
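To see the identity explicitly, expand the quadratic form using the symmetry of $W$ and the fact that $x_i^2 = x_i$ for binary variables:
\[\mathbf{x}^T Q \mathbf{x} = \sum_i x_i \sum_j W_{ij} - \sum_{i \neq j} W_{ij} x_i x_j = \sum_{i<j} W_{ij}\left(x_i + x_j - 2 x_i x_j\right) = \sum_{i<j} W_{ij}\, \mathbb{I}[x_i \neq x_j],\]
which is precisely the total weight of the edges crossing the cut.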
In financial terms: the QUBO is asking “what is the best way to split this group of stocks into two sub-groups such that stocks within each sub-group are as positively correlated as possible?” The quantum annealer (or Gurobi) answers this question at each step of the algorithm.
Option A: Solving on D-Wave Quantum Annealer
Note: This requires you to have an account on D-Wave Leap. Upon logging in, the D-Wave Leap dashboard is displayed — you can find your private token on the left column under Solver API Token. D-Wave offers limited free access to their quantum processing units (QPUs) for new users, which is sufficient to run the experiments in this tutorial. Store your token in a file named dwave-api-token.txt in the project root.
def bipartition(graph):
    """
    Bipartitions a graph using D-Wave quantum annealing.
    Returns two partitions and the QPU access time.
    This function is called ONCE per recursive step of GCS-Q.
    Each call submits a separate QUBO to the quantum annealer,
    corresponding to the minimum-cut of the current subgraph.
    """
    # Base case: a single node cannot be split further
    if len(graph.nodes()) == 1:
        return [], [0], 0
    # Extract the adjacency matrix of the current subgraph
    # and build the QUBO matrix from it
    w = nx.adjacency_matrix(graph).todense()
    qubo = get_qubo_matrix(W=w)
    # Convert the numpy QUBO matrix into D-Wave's BinaryQuadraticModel format.
    # This is the standard input format for D-Wave samplers.
    bqm = BinaryQuadraticModel.from_qubo(qubo)
    # Connect to the D-Wave cloud service.
    # EmbeddingComposite handles the mapping from our logical QUBO variables
    # onto the physical qubit topology (Pegasus) of the Advantage system.
    # This embedding step is crucial: our problem graph is fully connected,
    # but the QPU has limited physical connectivity, so multiple physical
    # qubits may represent a single logical variable ("chain").
    sampler = EmbeddingComposite(
        DWaveSampler(
            token=open('dwave-api-token.txt', 'r').read().strip(),
            solver={'topology__type': 'pegasus'}
        )
    )
    # Submit the QUBO to the quantum annealer.
    # num_reads=1000 means the annealer runs 1000 independent annealing
    # cycles (~20 microseconds each) and returns the best solution found.
    # More reads improve robustness against hardware noise.
    sampleset = sampler.sample(bqm, num_reads=1000)
    # Track QPU access time for benchmarking (in microseconds)
    qpu_access_time = sampleset.info['timing']['qpu_access_time']
    # The best solution is a dictionary mapping node index → binary value.
    # Nodes assigned 1 go to partition1, nodes assigned 0 go to partition2.
    solution = sampleset.first.sample
    partition1 = [node for node in solution if solution[node] == 1]
    partition2 = [node for node in solution if solution[node] == 0]
    return partition1, partition2, qpu_access_time
Note: The num_reads=1000 parameter results in 1000 independent annealing cycles, each taking roughly 20 microseconds on the QPU. The total QPU time per call is therefore about 20 ms, but network latency to the D-Wave cloud adds additional overhead.
Option B: Solving with Gurobi (Classical)
If you do not wish to run the problem on the remote quantum annealer, or do not have a D-Wave API key, you can also use the Gurobi solver which runs locally on your machine.
import gurobipy as gp
from gurobipy import GRB

def gurobi_qubo_solver(qubo_matrix):
    """
    Solves a QUBO problem using Gurobi's classical optimizer.
    Gurobi uses branch-and-bound with cutting planes to solve
    the binary quadratic program to provable optimality (or
    near-optimality for larger instances).
    """
    n = qubo_matrix.shape[0]
    # Create a new Gurobi optimization model
    model = gp.Model()
    # Add n binary decision variables — one per node.
    # x[i] = 1 means node i is in partition 1; x[i] = 0 means partition 2.
    x = model.addVars(n, vtype=GRB.BINARY)
    # Build the quadratic objective: minimize x^T Q x
    # This is the same QUBO matrix we would send to D-Wave.
    obj_expr = gp.quicksum(
        qubo_matrix[i, j] * x[i] * x[j] for i in range(n) for j in range(n)
    )
    model.setObjective(obj_expr)
    # Suppress solver output to keep logs clean
    model.setParam('OutputFlag', 0)
    # Solve — Gurobi will use branch-and-bound internally
    model.optimize()
    if model.status == GRB.OPTIMAL:
        # Extract the binary solution vector
        solution = [int(x[i].X) for i in range(n)]
        binary_string = ''.join(str(bit) for bit in solution)
        return binary_string, model.objVal
    else:
        return None, None

def bipartition_gurobi(graph):
    """
    Bipartitions a graph using the Gurobi classical solver.
    Drop-in replacement for the D-Wave bipartition function.
    """
    if len(graph.nodes()) == 1:
        return [], [0]
    # Same QUBO construction as the D-Wave path
    w = nx.adjacency_matrix(graph).todense()
    qubo = get_qubo_matrix(W=w)
    # Solve the QUBO classically
    solution_str, objective_value = gurobi_qubo_solver(qubo)
    # Convert binary string to partition assignments
    solution = {idx: int(bit) for idx, bit in enumerate(solution_str)}
    partition1 = [node for node in solution if solution[node] == 1]
    partition2 = [node for node in solution if solution[node] == 0]
    return partition1, partition2
Gurobi is a commercial optimization solver — it is licensed and not open source. The free version allows problems with up to 200 variables, which in our case corresponds to the number of nodes (assets). However, students, researchers, and faculty members at educational institutions can obtain a free academic license for non-commercial use; see Gurobi's tutorial on obtaining an academic license.
2c. The Iterative GCS-Q Algorithm
The main loop processes a queue of subgraphs, recursively splitting until no beneficial cut remains. Set qubo_solver="dwave" to use the quantum annealer, or qubo_solver="gurobi" to use the classical Gurobi solver.
Notice how this is a breadth-first process: we start with all 50 stocks in one big cluster, split it into two sub-clusters, then examine each sub-cluster to see if it should be split further, and so on. Each split solves a fresh QUBO — the algorithm may solve anywhere from 1 (if no split is beneficial) to $\mathcal{O}(n)$ QUBOs total:
def gcs_q_algorithm(adj_matrix, qubo_solver="dwave"):
    """
    Iterative GCS-Q algorithm for correlation clustering.
    Starts with all nodes in one cluster (the "grand coalition")
    and recursively bipartitions until no further improvement is possible.
    The number of clusters k is determined AUTOMATICALLY — this is one
    of the key advantages over classical methods like PAM and SPONGE.
    Args:
        adj_matrix: Signed adjacency (correlation) matrix (n × n).
        qubo_solver: "dwave" for quantum annealing, "gurobi" for classical.
    Returns:
        CS_star: A list of lists, where each inner list contains the
            node indices belonging to one cluster.
    """
    # Build the weighted signed graph from the correlation matrix
    G = construct_graph(adj_matrix)
    # Initialize: all nodes start in one cluster (the grand coalition).
    # For 50 stocks, this means one group of 50.
    grand_coalition = list(G.nodes)
    queue = [grand_coalition]  # BFS queue of subgraphs to process
    CS_star = []               # Final clustering result
    while queue:
        # Take the next subgraph from the queue
        C = queue.pop(0)
        subgraph = G.subgraph(C).copy()
        # ---------------------------------------------------------------
        # CORE STEP: Solve one minimum-cut QUBO for this subgraph.
        # This is where quantum annealing (or Gurobi) is invoked.
        # For a subgraph with m nodes, this QUBO has m binary variables
        # and a solution space of 2^m — the hard part.
        # ---------------------------------------------------------------
        if qubo_solver == "dwave":
            partition1, partition2, qpu_access_time = bipartition(subgraph)
        else:
            partition1, partition2 = bipartition_gurobi(subgraph)
        # The bipartition function returns indices relative to the subgraph.
        # Map them back to the original node indices (stock indices).
        partition1 = [C[idx] for idx in partition1]
        partition2 = [C[idx] for idx in partition2]
        # ---------------------------------------------------------------
        # STOPPING CRITERION (implicit in the QUBO solution):
        # If the minimum-cut solver returns a trivial partition (one side
        # empty), it means splitting would NOT improve intra-cluster
        # agreement. The subgraph is finalized as a cluster.
        # Otherwise, both halves are enqueued for further splitting.
        # ---------------------------------------------------------------
        if not partition2:    # No meaningful split → finalize cluster
            CS_star.append(partition1)
        elif not partition1:  # Edge case (symmetric)
            CS_star.append(partition2)
        else:                 # Beneficial split → enqueue both halves
            queue.append(partition1)
            queue.append(partition2)
    return CS_star
Key insight: The stopping criterion is implicit in the minimum-cut QUBO. If the minimum cut value is non-positive (all inter-partition edges are negative or zero), the solver returns a trivial partition (one side empty), and the subgraph is finalized as a cluster. This means the algorithm automatically determines $k$ — it keeps splitting only when doing so improves the clustering quality.
Use-case example: On a given trading day, GCS-Q might start with all 50 stocks, split them into tech+finance vs. healthcare+energy (first QUBO), then split tech+finance into tech vs. finance (second QUBO), and so on. The algorithm terminates when every sub-cluster is internally cohesive (e.g., all energy stocks are positively correlated with each other). The final number of clusters varies by day, typically between 2 and 11.
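Before moving on, a quick sanity check can help. The sketch below is a hypothetical toy example (not part of the original experiments): a small block-structured correlation matrix with two obvious groups, run with qubo_solver="gurobi" so it executes entirely locally:
# Hypothetical toy matrix: assets 0-2 co-move, assets 3-4 co-move,
# and the two groups are anti-correlated with each other.
toy_corr = np.array([
    [ 1.0,  0.8,  0.7, -0.4, -0.3],
    [ 0.8,  1.0,  0.6, -0.5, -0.2],
    [ 0.7,  0.6,  1.0, -0.3, -0.4],
    [-0.4, -0.5, -0.3,  1.0,  0.9],
    [-0.3, -0.2, -0.4,  0.9,  1.0],
])
clusters = gcs_q_algorithm(toy_corr, qubo_solver="gurobi")
print(clusters)  # Expected: [[0, 1, 2], [3, 4]] (cluster order may vary)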
Step 3: Implement Classical Baselines
To benchmark GCS-Q, we implement two classical baselines: PAM ($k$-Medoids) and SPONGE. Unlike GCS-Q, both classical methods require the number of clusters $k$ as input, so we also implement a spectral method to estimate $k$ from the data (Step 3c).
3a. PAM (Partitioning Around Medoids)
PAM is the classical $k$-Medoids algorithm. Unlike $k$-Means (which uses cluster centroids), PAM selects actual data points as cluster representatives (“medoids”) and iteratively swaps them to minimize total within-cluster distance.
However, PAM requires a distance matrix, not a correlation matrix. We must transform correlations into distances, losing signed information in the process:
def assign_clusters(distance_matrix, medoids):
    """
    Assign each point to the nearest medoid.
    For each stock, we check which medoid (representative stock)
    it is closest to and assign it to that cluster.
    """
    clusters = [[] for _ in range(len(medoids))]
    for i in range(distance_matrix.shape[0]):
        distances = [distance_matrix[i, m] for m in medoids]
        clusters[np.argmin(distances)].append(i)
    return clusters

def calculate_total_cost(distance_matrix, medoids, clusters):
    """Total within-cluster distance (PAM's objective to minimize)."""
    total_cost = 0
    for medoid, cluster in zip(medoids, clusters):
        total_cost += np.sum(distance_matrix[cluster][:, medoid])
    return total_cost

def pam(distance_matrix, k, max_iter=100):
    """
    Partitioning Around Medoids (PAM) clustering.
    1. Randomly pick k initial medoids from the dataset.
    2. Assign every point to its nearest medoid.
    3. Try swapping each medoid with each non-medoid; keep the swap
       if it reduces total cost.
    4. Repeat until no swap improves the cost (convergence).
    This is a greedy local search — it finds a local optimum.
    """
    # Initialize with k random medoids
    medoids = np.random.choice(distance_matrix.shape[0], k, replace=False)
    best_medoids = medoids.copy()
    clusters = assign_clusters(distance_matrix, medoids)
    best_cost = calculate_total_cost(distance_matrix, medoids, clusters)
    for _ in range(max_iter):
        for medoid_idx in range(k):
            # Consider swapping this medoid with every non-medoid
            non_medoids = [i for i in range(distance_matrix.shape[0]) if i not in medoids]
            for new_medoid in non_medoids:
                new_medoids = medoids.copy()
                new_medoids[medoid_idx] = new_medoid
                new_clusters = assign_clusters(distance_matrix, new_medoids)
                new_cost = calculate_total_cost(distance_matrix, new_medoids, new_clusters)
                if new_cost < best_cost:
                    best_cost = new_cost
                    best_medoids = new_medoids.copy()
                    clusters = new_clusters
        # If no medoid changed, we've converged
        if np.array_equal(best_medoids, medoids):
            break
        medoids = best_medoids.copy()
    return best_medoids, clusters
Caveat: PAM operates on a distance matrix, computed from correlations as $d_{ij} = \sqrt{2(1 - \rho_{ij})}$. This transformation discards the signed semantics: a zero correlation maps to the fixed distance $\sqrt{2}$, and anti-correlated pairs simply appear "farther apart" rather than structurally opposed, so PAM cannot exploit the distinction between "weakly similar" and "weakly dissimilar" stocks. This is a fundamental limitation when working with signed data.
3b. SPONGE (Signed Positive Over Negative Generalized Eigenproblem)
SPONGE [7] is a spectral method specifically designed for signed graphs — unlike PAM, it can natively handle both positive and negative edge weights. It decomposes the adjacency matrix into positive ($A^+$) and negative ($A^-$) parts and solves a generalized eigenproblem to find cluster assignments.
The idea is elegant: SPONGE finds the embedding that simultaneously maximizes within-cluster positive connections and maximizes between-cluster negative connections.
from signet.cluster import Cluster
from scipy.sparse import csc_matrix

def sponge_clustering(adj_matrix, k, method='SPONGE'):
    """
    Runs SPONGE or SPONGE_sym clustering on a signed adjacency matrix.
    Steps:
    1. Separate the correlation matrix into positive (Ap) and negative (An) parts.
    2. Feed them into the signet library's Cluster class.
    3. Solve the generalized eigenproblem to find k cluster assignments.
    SPONGE and SPONGE_sym differ in their Laplacian normalization:
    - SPONGE: asymmetric normalization (handles unbalanced clusters)
    - SPONGE_sym: symmetric normalization (assumes roughly equal-sized clusters)
    Args:
        adj_matrix: Signed adjacency (correlation) matrix.
        k: Number of clusters (must be provided externally).
        method: 'SPONGE' or 'SPONGE_sym'.
    """
    if not isinstance(adj_matrix, np.ndarray):
        adj_matrix = np.array(adj_matrix)
    # Separate into positive and negative parts.
    # Ap contains only the positive correlations (co-moving stocks).
    # An contains only the magnitudes of negative correlations (opposing stocks).
    Ap = csc_matrix(np.maximum(0, adj_matrix))   # Positive edges
    An = csc_matrix(np.maximum(0, -adj_matrix))  # Negative edges (sign flipped)
    # The signet library expects sparse matrices for Ap and An
    cluster_model = Cluster((Ap, An))
    if method == 'SPONGE':
        predictions = cluster_model.SPONGE(k=k)
    elif method == 'SPONGE_sym':
        predictions = cluster_model.SPONGE_sym(k=k)
    # Convert prediction labels into a list-of-lists format
    # matching the output format of GCS-Q and PAM
    output_clusters = [[] for _ in range(len(np.unique(predictions)))]
    for idx, cid in enumerate(predictions):
        output_clusters[cid].append(idx)
    return output_clusters
Limitation: SPONGE was originally designed for sparse graphs with edge weights in $\{-1, 0, +1\}$ and balanced cluster sizes. Financial correlation matrices violate both assumptions — they are fully dense with continuous weights in $[-1, +1]$, and sector sizes are inherently unequal (e.g., 8 tech stocks vs. 3 utility stocks). This explains why SPONGE underperforms GCS-Q on financial data.
3c. Estimating $k$
As noted above, both PAM and SPONGE require $k$ as input but cannot determine it on their own. We estimate it using the eigengap heuristic applied to the dual Laplacian of the signed graph. The dual Laplacian combines information from both positive and negative subgraphs, and the largest gap between consecutive eigenvalues indicates the most natural number of clusters.
In practice, the dual Laplacian captures the block structure of the correlation matrix: if there are $k$ well-separated clusters, the first $k$ eigenvalues will be small and closely spaced, followed by a sharp jump (the “spectral gap”) to the $(k+1)$-th eigenvalue.
def basic_laplacian(A, normalize=True):
    """
    Compute the (normalized) graph Laplacian L = D - A.
    D is the diagonal degree matrix (sum of each row).
    Normalization: L_norm = D^{-1/2} L D^{-1/2}, which scales
    the Laplacian so that eigenvalues fall in [0, 2].
    """
    D = np.diag(np.sum(A, axis=1))
    L = D - A
    if normalize:
        with np.errstate(divide='ignore'):
            D_inv_sqrt = np.diag(1.0 / np.sqrt(np.sum(A, axis=1)))
        D_inv_sqrt[np.isinf(D_inv_sqrt)] = 0.0
        L = D_inv_sqrt @ L @ D_inv_sqrt
    return L

def dual_laplacian(adj_matrix):
    """
    Compute a combined (dual) Laplacian for signed graphs.
    Separates the signed adjacency matrix into:
    - A_pos: only the positive edges (stocks that move together)
    - A_neg: only the negative edges (stocks that move oppositely)
    The dual Laplacian L_pos + L_neg encodes both types of structure,
    enabling spectral methods to detect clusters in signed graphs.
    """
    A_pos = np.clip(adj_matrix, 0, 1)    # Keep only positive correlations
    A_neg = -np.clip(adj_matrix, -1, 0)  # Flip sign of negatives (→ positive)
    L_pos = basic_laplacian(A_pos, normalize=True)
    L_neg = basic_laplacian(A_neg, normalize=True)
    return L_pos + L_neg

def spectral_gap_method(eigenvalues, max_k, skip=2):
    """
    Finds k by locating the largest gap in the sorted eigenvalue spectrum.
    We skip the first few eigenvalues (typically noise/trivially small).
    The position of the largest gap tells us how many natural clusters exist.
    Example: eigenvalues = [0.01, 0.03, 0.05, 0.52, 0.9, ...]
    Largest gap is between 0.05 and 0.52 → k = 3.
    """
    diffs = np.diff(eigenvalues[skip:max_k + skip])
    return np.argmax(diffs) + 1 + skip
Note: This step is only needed for the classical baselines. GCS-Q determines $k$ automatically as part of its optimization.
Step 4: Define the Evaluation Metric
How do we objectively compare clustering results across different algorithms? We use the structural balance penalty — a metric directly aligned with the correlation clustering objective. A perfect clustering would have:
- All positive correlations (co-moving stocks) within the same cluster
- All negative correlations (opposing stocks) between different clusters
Any violation of these two rules incurs a penalty:
from itertools import combinations

def penalty_metric(adj_matrix, clusters):
    """
    Computes structural balance penalty for a given clustering.
    Two types of violations are penalized:
    1. NEGATIVE intra-cluster edges: Two stocks in the SAME cluster have
       a negative correlation. This is bad — they move in opposite
       directions and should not be grouped together.
       Penalty contribution: |w_ij| (magnitude of the negative correlation)
    2. POSITIVE inter-cluster edges: Two stocks in DIFFERENT clusters have
       a positive correlation. This is also bad — they co-move and should
       be in the same cluster.
       Penalty contribution: w_ij (the positive correlation value)
    Lower is better. Zero = perfectly balanced clustering.
    """
    penalty = 0
    # Check for negative edges WITHIN each cluster
    # (stocks that are anti-correlated but grouped together)
    for cluster in clusters:
        for i, j in combinations(cluster, 2):
            if adj_matrix[i, j] < 0:
                penalty += abs(adj_matrix[i, j])
    # Check for positive edges BETWEEN different clusters
    # (stocks that co-move but are placed in separate clusters)
    for i in range(len(adj_matrix)):
        for j in range(len(adj_matrix)):
            if i != j and adj_matrix[i, j] > 0:
                in_same_cluster = any(i in c and j in c for c in clusters)
                if not in_same_cluster:
                    penalty += adj_matrix[i, j]
    return penalty
Interpretation: If GCS-Q achieves a penalty of 5.2 and PAM achieves 18.7 on the same day, GCS-Q’s clustering has far fewer structural violations — it does a better job of keeping correlated stocks together and separating anti-correlated ones. For portfolio construction, this means GCS-Q’s clusters more faithfully represent the true co-movement structure of the market.
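As an illustration (a hypothetical toy example, not taken from the experiments), the sketch below compares a grouping that respects the signed structure against one that deliberately mixes anti-correlated assets; the first should receive the lower penalty:
# Hypothetical 4-asset signed correlation matrix: assets {0, 1} co-move,
# assets {2, 3} co-move, and the two pairs are anti-correlated.
toy = np.array([
    [ 1.0,  0.9, -0.6, -0.5],
    [ 0.9,  1.0, -0.4, -0.7],
    [-0.6, -0.4,  1.0,  0.8],
    [-0.5, -0.7,  0.8,  1.0],
])
good = [[0, 1], [2, 3]]  # respects the signed structure
bad  = [[0, 2], [1, 3]]  # groups anti-correlated assets together
print(penalty_metric(toy, good), penalty_metric(toy, bad))  # good < bad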
Step 5: Fetch Data and Run the Pipeline
5a. Define Stock Tickers
We select 50 stocks spanning diverse sectors to ensure a rich and realistic correlation structure. The diversity is key — stocks within the same sector (e.g., all tech stocks) tend to be positively correlated, while stocks across sectors (e.g., tech vs. energy) may be negatively correlated. This creates the signed graph structure that GCS-Q is designed to exploit:
tickers = [
    # Technology (8 stocks)
    'AAPL', 'MSFT', 'GOOGL', 'AMZN', 'META', 'NVDA', 'INTC', 'CSCO',
    # Finance (6 stocks)
    'JPM', 'GS', 'BAC', 'WFC', 'C', 'MS',
    # Healthcare (6 stocks)
    'JNJ', 'PFE', 'MRK', 'ABBV', 'MRNA', 'BMY',
    # Energy (6 stocks)
    'XOM', 'CVX', 'COP', 'SLB', 'HAL', 'OXY',
    # Consumer Discretionary (6 stocks)
    'HD', 'MCD', 'NKE', 'SBUX', 'TGT', 'DIS',
    # Consumer Staples (6 stocks)
    'PG', 'KO', 'PEP', 'WMT', 'COST', 'MDLZ',
    # Industrials (6 stocks)
    'BA', 'GE', 'UNP', 'MMM', 'CAT', 'HON',
    # Utilities & Telecom (6 stocks)
    'NEE', 'DUK', 'SO', 'EXC', 'VZ', 'T'
]
5b. Download and Process Returns
We use a rolling daily window approach: for each business day in January 2025, we download hourly intraday price data, compute log returns, and build that day’s correlation matrix. This simulates how a portfolio manager would re-cluster assets each day based on the latest intraday co-movement patterns:
from datetime import datetime, timedelta

def generate_days(global_start, global_end):
    """Generate a list of consecutive calendar dates.
    Weekends and holidays are handled later by checking data availability."""
    days = []
    current = global_start
    while current <= global_end:
        days.append(current)
        current += timedelta(days=1)
    return days

# Define the evaluation period: January 2025
global_start = datetime(2025, 1, 1)
global_end = datetime(2025, 2, 1)
days = generate_days(global_start, global_end)
5c. Run All Algorithms Over a Rolling Window
This loop downloads hourly return data for each business day, computes the correlation matrix, and runs all four algorithms. Each iteration represents one trading day — we build a fresh correlation matrix from that day’s intraday data and cluster accordingly:
algo_names = ["gcsq", "pam", "sponge", "sponge_sym"]
penalty_results = {algo: [] for algo in algo_names}
for algo in algo_names:
    penalty_results[algo + "_k"] = []  # Track the number of clusters per algo
penalty_results["date"] = []           # Track which dates were processed

interval = '1h'  # Hourly data gives us ~7 data points per day per stock

for i, date in enumerate(days[:-1]):
    try:
        start_date = date.strftime('%Y-%m-%d')
        end_date = days[i + 1].strftime('%Y-%m-%d')
        print(f"Processing: {start_date} to {end_date}")
        # ---------------------------------------------------------------
        # STEP A: Download hourly closing prices from Yahoo Finance.
        # We download one day at a time to build a daily correlation snapshot.
        # ---------------------------------------------------------------
        df = yf.download(tickers, start=start_date, end=end_date, interval=interval)['Close']
        df = df.replace([np.inf, -np.inf], np.nan).dropna(axis=1, how='any')
        if df.shape[1] < 10:
            continue  # Skip weekends/holidays (insufficient data points)
        # ---------------------------------------------------------------
        # STEP B: Compute percentage returns and the Pearson correlation matrix.
        # This correlation matrix IS the signed adjacency matrix for our graph.
        # Values range from -1 (perfectly anti-correlated) to +1 (perfectly correlated).
        # ---------------------------------------------------------------
        returns_df = df.pct_change().dropna()
        corr_matrix_np = returns_df.corr().values
        adj_matrix = corr_matrix_np.copy()
        # ---------------------------------------------------------------
        # STEP C: Estimate k for the classical baselines.
        # GCS-Q does NOT need this — it determines k automatically.
        # ---------------------------------------------------------------
        L_dual = dual_laplacian(adj_matrix)
        eigvals, _ = eigh(L_dual)
        k = spectral_gap_method(eigvals, adj_matrix.shape[0])
        # ---------------------------------------------------------------
        # STEP D: Run GCS-Q (quantum annealer).
        # This is the main algorithm — it does NOT need k as input.
        # ---------------------------------------------------------------
        clusters_gcsq = gcs_q_algorithm(adj_matrix, qubo_solver="dwave")
        penalty_results["gcsq"].append(penalty_metric(adj_matrix, clusters_gcsq))
        penalty_results["gcsq_k"].append(len(clusters_gcsq))
        # ---------------------------------------------------------------
        # STEP E: Run PAM (k-Medoids) with the estimated k.
        # We must first convert correlations to distances.
        # d_ij = sqrt(2 * (1 - rho_ij)), with alpha=2 as scaling factor.
        # ---------------------------------------------------------------
        alpha = 2
        dist_matrix = np.sqrt(alpha * (1 - adj_matrix.clip(min=-1, max=1)))
        _, clusters_pam = pam(dist_matrix, k=k)
        penalty_results["pam"].append(penalty_metric(adj_matrix, clusters_pam))
        penalty_results["pam_k"].append(len(clusters_pam))
        # ---------------------------------------------------------------
        # STEP F: Run SPONGE and SPONGE_sym with the estimated k.
        # Both are spectral methods for signed graphs.
        # ---------------------------------------------------------------
        clusters_sp = sponge_clustering(adj_matrix, k, method='SPONGE')
        penalty_results["sponge"].append(penalty_metric(adj_matrix, clusters_sp))
        penalty_results["sponge_k"].append(len(clusters_sp))
        clusters_ss = sponge_clustering(adj_matrix, k, method='SPONGE_sym')
        penalty_results["sponge_sym"].append(penalty_metric(adj_matrix, clusters_ss))
        penalty_results["sponge_sym_k"].append(len(clusters_ss))
        penalty_results["date"].append(start_date)
    except Exception as e:
        print(f"Error on {start_date}: {e}")
        continue
What this produces: After the loop completes, penalty_results contains the structural balance penalty and cluster count for each algorithm on each trading day. This lets us directly compare how well each method respects the signed correlation structure.
Step 6: Visualize the Results
Finally, we plot the penalty over time for all four algorithms. Each point on the x-axis is one business day; the y-axis shows how much each algorithm’s clustering violates the ideal structural balance. Lower curves indicate better-quality clusterings:
penalty_df = pd.DataFrame(penalty_results)

plt.figure(figsize=(14, 7))
sns.set(style="whitegrid", font_scale=1.3)

# Distinct markers and colors for each algorithm
markers = {'gcsq': 'X', 'pam': 's', 'sponge': 'o', 'sponge_sym': '^'}
colors = {'gcsq': '#2ca02c', 'pam': '#ff7f0e', 'sponge': '#1f77b4', 'sponge_sym': '#d62728'}

for algo in algo_names:
    formatted_dates = pd.to_datetime(penalty_df['date']).dt.strftime('%d')
    sns.lineplot(
        x=formatted_dates, y=penalty_df[algo],
        label=algo.upper(), marker=markers[algo],
        linewidth=4, alpha=0.85, color=colors[algo]
    )
    plt.setp(plt.gca().lines[-1], markersize=14)

plt.xlabel('Dates (January 2025)', fontsize=24)
plt.ylabel('Penalty', fontsize=28)  # Lower = better clustering
plt.xticks(rotation=0, fontsize=22)
plt.yticks(fontsize=22)
plt.tight_layout()
plt.legend(loc='upper center', bbox_to_anchor=(0.5, -0.15), ncol=4, fontsize=28, frameon=True)
plt.savefig('clustering_penalties.png', bbox_inches='tight', dpi=300)
plt.show()
Penalty comparison across clustering algorithms over business days in January 2025 [1]
Results and Discussion
The results demonstrate clear advantages of GCS-Q over classical baselines:
- GCS-Q consistently achieves the lowest penalty across nearly all trading days, often by a large margin. On some days, GCS-Q's penalty is an order of magnitude lower than PAM's.
- PAM performs worst overall, which is expected since it relies on the lossy distance transformation and a centroid-based objective that does not align with the correlation clustering goal.
- SPONGE and SPONGE_sym perform moderately, benefiting from their ability to handle signed graphs. However, they struggle because they are designed for sparse graphs with weights in $\{-1, 0, +1\}$ and balanced cluster sizes — conditions rarely met in financial correlation data with continuous weights and heterogeneous cluster structures.
- GCS-Q determines $k$ dynamically, typically finding 2–11 clusters depending on the day's correlation structure, while the classical methods are constrained to the $k$ estimated by the spectral gap heuristic.
Why Does GCS-Q Win?
- No information loss: GCS-Q works directly on the signed correlation matrix without any distance transformation.
- Exponential search space: At each bipartition step, the quantum annealer explores $\mathcal{O}(2^n)$ possible binary assignments simultaneously.
- Autonomous $k$ selection: The stopping criterion automatically identifies when further splitting would reduce clustering quality.
- Robustness to heterogeneous cluster sizes: Unlike classical methods that assume approximately equal-sized clusters, GCS-Q handles skewed distributions naturally.
Hardware Details
The QUBO subproblems are solved on the D-Wave Advantage_system5.4 annealer, which comprises 5,614 physical qubits and 40,050 couplers arranged in a Pegasus topology. Despite hardware noise and cloud latency (the dominant bottleneck due to remote access), GCS-Q delivers superior solution quality. Clustering 50 assets takes a few minutes end to end, with per-QUBO solve times capped at 1 second.
Scalability
GCS-Q has been validated on up to 170 fully connected assets with $k=10$ — the practical limit for fully dense graphs on current D-Wave hardware. Sparser correlation structures would allow scaling to even larger portfolios. Future hardware topologies (e.g., D-Wave Zephyr with higher connectivity) are expected to push this limit further.
Broader Context
This work is presented in two complementary papers:
QUEST-IS 2025 [1]: Focuses on the financial application — asset clustering using real Yahoo Finance data. Demonstrates that GCS-Q outperforms PAM, SPONGE, and SPONGE_sym in terms of structural balance penalty on real stock data.
IEEE QAI 2025 [2]: Provides the general correlation clustering framework, adapting GCS-Q for arbitrary signed graphs. Benchmarks against $k$-Means, PAM, DIANA, Spectral, and Hierarchical clustering on synthetic balanced graphs and real hyperspectral remote sensing data.
Together, these works demonstrate that GCS-Q achieves quantum advantage in solution quality using real quantum hardware, validated on both synthetic and real-world datasets. This marks a concrete step toward near-term quantum utility for graph-based unsupervised learning.
Conclusion
Classical clustering methods were designed for a world of positive-definite distances and spherical clusters. Financial correlation matrices do not live in that world — they are signed, dense, and rarely structurally balanced. GCS-Q bridges this gap by formulating clustering as recursive minimum-cut optimization on the native signed graph, leveraging quantum annealing to explore the exponentially large solution space efficiently.
If you work with correlation-based data — whether in finance, bioinformatics, or remote sensing — this pipeline offers a principled, quantum-enhanced alternative to classical heuristics. The code is open-source and ready to run.