Quantum Office Hours: From idea to quantum circuit with AI agents

Quantum Computing for Finance: How High-Level Quantum Programming and GPU Acceleration Are Changing the Game

Vincent van Wingerden
Esperanza Cuenca Gomez
Date
22 Jun 2026
Share this article
Topics
Date
22 June 2026
Our Library
Share this article

By: the Classiq and NVIDIA Teams

The financial industry runs onoptimization and uncertainty. Every day, portfolio managers balance competing assets, risk desks price derivatives under volatile conditions, and trading algorithms make decisions in milliseconds. Classical computers handle these problems well enough, until they don't. As portfolios grow, as derivative structures become more exotic, and as risk models demand higher fidelity, the computational cost explodes. Quantum computing promises a fundamentally different way to attack these hard problems. But the path from quantum theory to practical financial tooling has, until now, been full of friction.

This post shows how Classiq's high-level quantum modeling language and NVIDIA's CUDA-Q platform remove that friction. We'll walk through two of the most promising quantum algorithms for finance, QAOA for portfolio optimization, and IQAE for derivative pricing, and we’ll show how the two platforms work together to accelerate iterative workflows on GPU hardware.

The Finance Problem Space: Where Quantum Has an Edge

Before we write a single line of quantum code, it's worth being precise about where quantum computing actually helps infinance as not every problem benefits. The strongest cases today are:

Portfolio optimization — Finding the optimal allocation across N assets subject to constraints (budget, risk, sector exposure) is a combinatorial problem. With N assets, the search space grows as 2^N. The best classical solvers like Gurobi, CPLEX, branch-and-bound variants are formidably good and have decades of engineering behind them. Quantum approximate optimization (QAOA) has not yet demonstrated a clear advantage over these solvers on real hardware. What it has demonstrated is that the problem maps naturally onto quantum circuits, that those circuits run on today's devices, and that the algorithmic framework scales in the right direction as hardware improves. We are building toward advantage, not claiming it today.

Derivative pricing and risk estimation — Monte Carlo simulation is the industry workhorse for pricing options, computing Value-at-Risk (VaR), and estimating CVA. It converges at O(1/√N) and you need 100× more samples to get 10× more precision. Quantum Amplitude Estimation (QAE) achieves O(1/N) convergence, a quadratic speedup that translates into dramatic runtime reductions for high-precision pricing.

Credit risk modeling — Correlated default probabilities in loan portfolios are computationally expensive to model. Quantum approaches to distribution loading and amplitude estimation apply here as well.

The common thread: quadratic or exponential classical complexity that quantum algorithms address directly.

The Programming Challenge: From Math to Quantum Circuit

Here's where most quantum finance explorations stall. Implementing QAOA or QAE correctly requires deep knowledge of quantum circuit construction such as qubit encoding schemes, parameterized rotation gates, ancilla management, entanglement patterns, and transpilation to hardware-native gate sets. A financial engineer shouldn't need a PhD in quantum information theory to run a portfolio optimization.

Classiq's quantum modeling language flips the paradigm. Instead of specifying how to build a circuit gate by gate, you describe what the circuit should compute. Classiq's synthesis engine handles the translation to an efficient, hardware-ready implementation so quantum programs look like financial models, not physics experiments.

QAOA for Portfolio Optimization

The Problem

We want to select a portfolio of k assets from a universe of N candidates that maximizes expected return while keeping risk below a threshold. This is a constrained combinatorial problem: with N binary variables (invest or don't), it maps naturally onto the space  algorithms are built to navigate.

Given an expected return vector μ,a covariance matrix Σ, and a risk aversion parameter q, the objective is:

maximize:  μᵀx - q · xᵀΣx

subject to: Σᵢ xᵢ = k,  xᵢ ∈ {0, 1}

A word on expectations: QAOA on current hardware does not beat Gurobi or other best-in-class classical solvers on this problem. The honest value of running it now is different — it validates the end-to-end quantum workflow, it stress-tests the circuit synthesis and execution stack, and it puts teams in a position to benefit when hardware improves.

Expressing It in Classiq

What Classiq provides here is meaningfu lregardless of that hardware caveat: the problem is expressed entirely at the financial level. Classiq uses Pyomo, a standard optimization modeling framework that operations researchers already know as its interface for combinatorial problems. The QAOA circuit structure, the cost Hamiltonian encoding, and the mixer layers are all synthesized automatically from the model.

Step 1 — Define the financial problem in Pyomo:

import numpy as np
import pyomo.core as pyo
from classiq import *
from classiq.applications.combinatorial_optimization import (
    CombinatorialProblem,
    pyo_model_to_pauli_operator,
)
import cudaq

# Problem parameters: 3 assets, integer allocation weights
returns = np.array([3, 4, -1])
covariances = np.array(
    [
        [ 0.9,  0.5, -0.7],
        [ 0.5,  0.9, -0.2],
        [-0.7, -0.2,  0.9],
    ]
)
total_budget = 6

# Pyomo model — pure finance, no quantum yet
portfolio_model = pyo.ConcreteModel("portfolio_optimization")
num_assets = len(returns)
portfolio_model.w = pyo.Var(range(num_assets), domain=pyo.Integers, bounds=(0, 7))
w_array = list(portfolio_model.w.values())
portfolio_model.budget_rule = pyo.Constraint(expr=(sum(w_array) <= total_budget))
portfolio_model.expected_return = returns @ w_array
portfolio_model.risk = w_array @ covariances @ w_array
portfolio_model.cost = pyo.Objective(
    expr=portfolio_model.risk - portfolio_model.expected_return,
    sense=pyo.minimize,
)

This is a standard intege rprogramming model: minimize risk minus expected return, subject to a total budget constraint. No quantum concepts yet.

Step 2 — Synthesize the QAOA circuit with Classiq:

NUM_LAYERS = 3
PENALTY    = 10  # constraint penalty — must match penalty_energy in the CUDA-Q Hamiltonian

combi = CombinatorialProblem(pyo_model=portfolio_model, num_layers=NUM_LAYERS, penalty_factor=PENALTY)
qprog = combi.get_qprog()
show(qprog)  # inspect the synthesized circuit in the Classiq IDE

One call to CombinatorialProblem — Classiq reads the Pyomo objective and constraints, constructs the cost Hamiltonian, and synthesizes the full depth-optimized QAOA circuit. show(qprog) opens it in the Classiq platform where you can inspect gate counts, qubit layout, and depth interactively.

The Variational Loop and the CUDA-Q Bridge

QAOA is a variational algorithm: the circuit has trainable parameters (γ, β angles, one pair per layer) that aclassical optimizer tunes iteratively to minimize the cost expectation value. This outer loop can require hundreds to thousands of circuit evaluations which is exactly where GPU acceleration makes the difference between a research prototype and something practical.

The Classiq-to-CUDA-Q bridge is two function calls:

# Build the Hamiltonian using the same penalty as CombinatorialProblem,
# so both paths optimise the identical cost landscape
classiq_hamiltonian = pyo_model_to_pauli_operator(portfolio_model, penalty_energy=PENALTY)
hamiltonian = pauli_operator_to_cudaq_spin_op(classiq_hamiltonian)
# Convert the synthesized Classiq circuit to a native CUDA-Q kernel
kernel = qprog_to_cudaq_kernel(qprog, is_main_kernel=True)

The kernel is now a native CUDA-Q object runnable on any CUDA-Q backend by simply changing a single set_target() call. The circuit itself hasn't changed; it's the same gate sequence Classiq synthesized. The PENALTY constant is the critical shared parameter: it controls how heavily constraint violations are penalised in the cost Hamiltonian, and both CombinatorialProblem and pyo_model_to_pauli_operator must use the same value, otherwise the two methods are literally optimising different objective functions and their results will diverge.

Step 3 — Run the Classiq native optimizer:

combi.optimize() must run first as it initialises the execution session (_es) that combi.sample() depends on for decoding. We time it here so the comparison is straightforward.

import time
from classiq.execution import ExecutionPreferences, ClassiqBackendPreferences
execution_preferences = ExecutionPreferences(
    backend_preferences=ClassiqBackendPreferences(backend_name="simulator"),
)
t0 = time.perf_counter()
optimized_params = combi.optimize(
    execution_preferences,
    maxiter=70,
    quantile=0.7,  # CVaR alpha: focus on the best 30% of samples
)
classiq_time = time.perf_counter() - t0
classiq_result  = combi.sample(optimized_params)
best_classiq    = classiq_result.solution[classiq_result.cost.idxmin()]
classiq_weights = np.array(best_classiq["w"])
classiq_cost    = classiq_result.cost.min()
print(f"Classiq  allocation: {classiq_weights.tolist()}  cost: {classiq_cost:.4f}  time: {classiq_time:.2f}s")

Step 4 — Run the CUDA-Q GPU optimizer:

The CUDA-Q objective now uses the same quantile=0.7 as combi.optimize() — meaning both methods minimise the expected cost over the best 30% of sampled bitstrings rather than the plain expectation value. This is what aligns the two optimisers on the same landscape.

def initial_qaoa_params(NUM_LAYERS) -> np.ndarray:
    initial_gammas = math.pi * np.linspace(
        1 / (2 * NUM_LAYERS), 1 - 1 / (2 * NUM_LAYERS), NUM_LAYERS
    )
    initial_betas = math.pi * np.linspace(
        1 - 1 / (2 * NUM_LAYERS), 1 / (2 * NUM_LAYERS), NUM_LAYERS
    )
    initial_params = []
    for i in range(NUM_LAYERS):
        initial_params.append(initial_gammas[i])
        initial_params.append(initial_betas[i])
    return np.array(initial_params)
initial_params = initial_qaoa_params(NUM_LAYERS)

cudaq.set_target("nvidia")
QUANTILE     = 0.7   # must match combi.optimize(quantile=...)
SHOTS        = 1000  # samples per evaluation
parameter_count = 2 * NUM_LAYERS
def cvar_objective(parameters):
    """QAOA objective: mean cost of the best (1-QUANTILE) fraction of shots."""
    counts = cudaq.sample(kernel, *parameters, shots_count=SHOTS)
    # Evaluate cost for every observed bitstring
    costs = []
    for bitstring, count in counts.items():
        result = combi.sample(list(parameters))  # decode via combi
        cost   = result.cost.min()
        costs.extend([cost] * count)
    costs.sort()
    cutoff = max(1, int(len(costs) * (1 - QUANTILE)))
    return float(np.mean(costs[:cutoff]))
optimizer = cudaq.optimizers.COBYLA()
optimizer.max_iterations = 70
optimizer.initial_parameters = initial_params
t0 = time.perf_counter()
optimal_cvar, optimal_parameters = optimizer.optimize(
    dimensions=parameter_count, function=cvar_objective
)
cudaq_time = time.perf_counter() - t0
# Decode the final allocation using the optimal parameters
cudaq_result  = combi.sample(optimal_parameters)
best_cudaq    = cudaq_result.solution[cudaq_result.cost.idxmin()]
cudaq_weights = np.array(best_cudaq["w"])
cudaq_cost    = cudaq_result.cost.min()
print(f"CUDA-Q   allocation: {cudaq_weights.tolist()}  cost: {cudaq_cost:.4f}  time: {cudaq_time:.2f}s")

Comparing the Two Results

Both methods optimise the same Hamiltonian over the same synthesised circuit and decode through the same combi.sample() function, so their allocations and costs should agree. The timing column is the point of interest where combi.optimize() runs on Classiq's cloud simulator while the CUDA-Q path runs on a local NVIDIA GPU.

print(f"{'Method':<10} {'Allocation':>20} {'Cost':>10} {'Time (s)':>12} {'Speedup':>10}")
print("-" * 66)
print(f"{'Classiq':<10} {str(classiq_weights.tolist()):>20} {classiq_cost:>10.4f} {classiq_time:>12.2f} {classiq_time/cudaq_time:>9.1f}x")
print(f"{'CUDA-Q':<10} {str(cudaq_weights.tolist()):>20} {cudaq_cost:>10.4f} {cudaq_time:>12.2f} {'1.0x':>10}")

Comparing these methods we get these results:

Method               Allocation       Cost    Time (s)    Speedup

------------------------------------------------------------------

Classiq               [2, 2, 1]    -4.5000       160.13       2.5x

CUDA-Q                [2, 2, 1]    -4.5000        64.35       1.0x

The GPU speedup here is not about beating classical portfolio solvers yet, it's about making the quantum research loop fast enough to be useful by exploring layer depths, tuning penalty factors and benchmarking ansatz structures. GPU simulation is the development environment that makes quantum finance work today while hardware matures.

IQAE for Derivative Pricing

The Problem

Pricing a European option requires estimating an expectation value over a probability distribution of future asset prices:

Price = e^(-rT) · E[max(S_T - K, 0)]

where S_T is the asset price at maturity, K is the strike, and the expectation is taken over a log-normal distribution. Classical Monte Carlo estimates this by averaging over N random samples, converging at O(1/√N). Quantum Amplitude Estimation achieves O(1/N) convergence which is a quadratic speedup in oracle calls.

Iterative QAE (IQAE) is the near-term variant of choice: it avoids the deep controlled-phase circuits of textbook QAE, instead adaptively querying a Grover-like oracle with increasing repetition counts to home in on the true amplitude. Each iteration uses a small number of shots and narrows a confidence interval via Clopper-Pearson statistics which is practical on today's hardware.

Unlike QAOA, the quantum advantage case for IQAE is more concrete. The quadratic speedup in oracle queries is proven, and the question is whether it survives circuit overhead on real hardware. GPU-accelerated simulation lets us run the full IQAE loop at scale today and measure where the crossover point lies.

Step 1 — Define the Distribution and Payoff

The financial parameters map directly onto the Classiq model. We use 3 qubits, giving 2³ = 8 discretisation points over the log-normal distribution. The scaling_factor normalises the payoff into[0, 1] so it can be loaded as a quantum amplitude.

import numpy as np
import scipy
import cudaq
from classiq import *
from classiq.applications.iqae.iqae import IQAE
from classiq.qmod.symbolic import ceiling

# --- Financial parameters ---
num_qubits = 3
mu         = 0.7   # log-normal mean
sigma      = 0.13  # log-normal std dev
K          = 1.9   # strike price

# --- Discretise the log-normal distribution over 2^num_qubits points ---
def get_log_normal_probabilities(mu, sigma, num_points):
    mean     = np.exp(mu + sigma**2 / 2)
    variance = (np.exp(sigma**2) - 1) * np.exp(2 * mu + sigma**2)
    stddev   = np.sqrt(variance)
    low      = np.maximum(0, mean - 3 * stddev)
    high     = mean + 3 * stddev
    x        = np.linspace(low, high, num_points)
    return x, scipy.stats.lognorm.pdf(x, s=sigma, scale=np.exp(mu))
grid_points, probs = get_log_normal_probabilities(mu, sigma, 2**num_qubits)
probs          = probs / np.sum(probs)  # normalise to a valid probability distribution
scaling_factor = max(grid_points) - K
grid_step      = (max(grid_points) - min(grid_points)) / (len(grid_points) - 1)
def scale(val):
    return val * grid_step + min(grid_points)
def descale(val):
    return (val - min(grid_points)) / grid_step

Step 2 — Build the State Preparation Circuit in Classiq

The quantum algorithm needs two components: a circuit that loads the price distribution and a payoff function that marks in-the-money outcomes. Classiq synthesises both from high-level descriptions.

@qfunc
def load_distribution(asset: QNum):
    inplace_prepare_state(probabilities=probs.tolist(), bound=0, target=asset)

@qfunc
def payoff_linear(asset: Const[QNum], ind: QBit):
    # Load payoff amplitude: sqrt((S - K) / scaling_factor) for S > K
    assign_amplitude_table(
        lookup_table(lambda n: np.sqrt(abs((scale(n) - K) / scaling_factor)), asset),
        asset,
        ind,
    )

@qfunc
def payoff(asset: Const[QNum], ind: QBit):
    # Only apply the payoff when asset price exceeds the strike
    control(asset >= ceiling(descale(K)), lambda: payoff_linear(asset, ind))

@qfunc
def european_call_state_preparation(asset: QNum, ind: QBit):
    load_distribution(asset)
    payoff(asset, ind)

Step 3 — Synthesise Both Circuits in One Block

Classiq's synthesize() operates on a function named main. We define and synthesise main twice, once for state preparation, once for the oracle, and export each immediately to a CUDA-Q kernel. Both syntheses and exports must happen in the same cell, back to back: CUDA-Q names internal functions after Pytho nobject memory addresses, and if an address is recycled between two synthesis calls, the two MLIR modules will contain different functions with the same name but different signatures, causing a silent merge failure. Running them inimmediate sequence prevents this.

The stateprep main allocates the asset register and indicator qubit, applies the financial model, then drops the asset register — leaving only the indicator as an output. The oracle main applies the full Grover operator over the combined register.

# --- State preparation: synthesise and export immediately ---
@qfunc
def main(indicator: Output[QBit]):
    asset = QNum()
    allocate(num_qubits, asset)
    allocate(1, indicator)
    european_call_state_preparation(asset, indicator)
    drop(asset)

qprog_stateprep  = synthesize(main, preferences=Preferences(), constraints=Constraints(max_width=15))
print(f"State prep width: {qprog_stateprep.data.width} qubits")
stateprep_kernel = qprog_to_cudaq_kernel(qprog_stateprep, is_main_kernel=False)

# --- Oracle: redefine main, synthesise and export immediately ---
@qfunc
def space_transform(est_reg: QArray) -> None:
    european_call_state_preparation(est_reg[0 : est_reg.len - 1], est_reg[est_reg.len - 1])

@qperm
def oracle(est_reg: Const[QArray]) -> None:
    Z(est_reg[est_reg.len - 1])

@qfunc
def main(indicator: Output[QBit]) -> None:
    est_reg:      QArray = QArray("est_reg")
    problem_vars: QArray = QArray("problem_vars", length=num_qubits)
    allocate(problem_vars)
    allocate(indicator)
    within_apply(
        lambda: bind([problem_vars, indicator], est_reg),
        lambda: grover_operator(oracle, space_transform, est_reg),
    )
    drop(problem_vars)

qprog_oracle  = synthesize(main, preferences=Preferences(), constraints=Constraints(max_width=11))
print(f"Oracle width: {qprog_oracle.data.width} qubits")
oracle_kernel = qprog_to_cudaq_kernel(qprog_oracle, is_main_kernel=False)

Step 4 — Validate and Find the Indicator Qubit

Before building the parametric kernel we run a quick sanity check: both circuits must place the indicator qubit at the same index in the register, otherwise the parametric kernel measures the wrong qubit. Classiq's qubit_mapping exposes this directly, so we can assert consistency and extract measure_index without guessing.

def validate_qprogs(qprog_stateprep, qprog_oracle):
    sp_idx  = qprog_stateprep.data.qubit_mapping.logical_outputs['indicator'][0]
    orc_idx = qprog_oracle.data.qubit_mapping.logical_outputs['indicator'][0]
    assert sp_idx == orc_idx, (
        f"Indicator qubit index mismatch: stateprep={sp_idx}, oracle={orc_idx}. "
        "Both circuits must place the indicator at the same physical qubit."
    )
    total_qubits = max(qprog_stateprep.data.width, qprog_oracle.data.width)
    return total_qubits, sp_idx
total_qubits, measure_index = validate_qprogs(qprog_stateprep, qprog_oracle)
print(f"Total qubits: {total_qubits}, measure index: {measure_index}")

Step 5 — Build the Parametric CUDA-Q Kernel

This is where the CUDA-Q integration pays off beyond raw GPU speed. A standard QASM-based workflow would need to unroll the full oracle circuit k times per IQAE iteration, thereby producing a new, longer circuit for every value of k. CUDA-Q's for_loop emits a cc.loop in the Quake IR instead: the oracle is described once and the loop bound is a runtime integer. The kernel is O(1) in size regardless of k.

One critical detail: qubit refs for both apply_call invocations must be resolved in the outer scope before for_loop is entered. Inside the loop body, CUDA-Q traces the function once to generate the loop IR — extract_ref ops created during that trace land inside the loop construct where func.call lowering cannot resolve them. Extracting all refs outside the loop avoids this.

def build_parametric_iqae_kernel(
    stateprep_kernel: cudaq.Kernel,
    oracle_kernel: cudaq.Kernel,
    total_qubits: int,
    measure_index: int,
) -> cudaq.Kernel:
    iqae_kernel, k_var = cudaq.make_kernel(int)
    q = iqae_kernel.qalloc(total_qubits)
    # Resolve all qubit refs in the outer scope — not inside the loop body
    sp_qubits     = [q[i] for i in range(qprog_stateprep.data.width)]
    oracle_qubits = [q[j] for j in range(qprog_oracle.data.width)]
    # Apply state preparation once (the A operator)
    iqae_kernel.apply_call(stateprep_kernel, *sp_qubits)
    # Apply the Grover oracle k times — k is a runtime integer, not an unrolled copy
    def oracle_loop_body(i: cudaq.QuakeValue):
        iqae_kernel.apply_call(oracle_kernel, *oracle_qubits)
    iqae_kernel.for_loop(start=0, stop=k_var, function=oracle_loop_body)
    iqae_kernel.mz(q[measure_index])
    return iqae_kernel
iqae_kernel = build_parametric_iqae_kernel(
    stateprep_kernel=stateprep_kernel,
    oracle_kernel=oracle_kernel,
    total_qubits=total_qubits,
    measure_index=measure_index,
)

Step 6 — Run IQAE on GPU and Compare with Classiq

The IQAEOptionPricer drives the iterative loop: at each iteration it samples the parametric kernel with the current k, computes a Clopper-Pearson confidence interval on the measured probability, inverts back to theta-space, and contracts the interval. The Classiq IQAE class runs the same algorithm internally for the baseline.

from scipy.stats import beta as scipy_beta
class IQAEOptionPricer:
    def __init__(self, epsilon: float, alpha: float, scaling_factor: float):
        self.epsilon        = epsilon
        self.alpha          = alpha
        self.scaling_factor = scaling_factor
        self.shots          = 1000
    def _clopper_pearson(self, count, n, alpha):
        lower = 0.0 if count == 0 else scipy_beta.ppf(alpha / 2, count, n - count + 1)
        upper = 1.0 if count == n  else scipy_beta.ppf(1 - alpha / 2, count + 1, n - count)
        return lower, upper
    def _find_next_k(self, k_curr, theta_l, theta_u):
        K_max = int(np.floor(np.pi / (theta_u - theta_l)))
        K     = K_max - (K_max - 2) % 4
        while K >= 4 * (4 * k_curr + 2):
            K = int(K / 4) - (int(K / 4) - 2) % 4
        return max((K - 2) // 4, k_curr + 1)
    def solve(self, kernel: cudaq.Kernel):
        theta_l, theta_u = 0.0, np.pi / 2
        k = 0
        total_shots = 0
        print(f"{'Iter':<5} | {'k':<5} | {'Interval':<26} | {'Error':>8}")
        print("-" * 55)
        iteration = 0
        while (theta_u - theta_l) > (2 * self.epsilon / self.scaling_factor):
            iteration += 1
            result      = cudaq.sample(kernel, k, shots_count=self.shots)
            count_1     = result.count('1')
            total_shots += self.shots
            a_min, a_max = self._clopper_pearson(count_1, self.shots, self.alpha)
            theta_meas   = np.arcsin(np.sqrt(count_1 / self.shots))
            theta_center = (theta_l + theta_u) / 2
            scaling      = 2 * k + 1
            n_           = round((scaling * theta_center - theta_meas) / np.pi)
            theta_est    = (n_ * np.pi + theta_meas) / scaling
            d_a   = (a_max - a_min) / 2
            deriv = abs(scaling * np.sin(2 * scaling * theta_est))
            if deriv < 1e-9:
                deriv = 1e-9
            delta_theta = d_a / deriv
            theta_l   = max(theta_l, theta_est - delta_theta)
            theta_u   = min(theta_u, theta_est + delta_theta)
            est_error = (theta_u - theta_l) * self.scaling_factor
            print(f"{iteration:<5} | {k:<5} | [{theta_l:.4f}, {theta_u:.4f}] | {est_error:>8.4f}")
            if (theta_u - theta_l) > (2 * self.epsilon / self.scaling_factor):
                k = self._find_next_k(k, theta_l, theta_u)
        theta_final     = (theta_l + theta_u) / 2
        estimated_price = np.sin(theta_final) ** 2 * self.scaling_factor
        return estimated_price, total_shots
# --- Classiq baseline ---
iqae = IQAE(
    state_prep_op=european_call_state_preparation,
    problem_vars_size=num_qubits,
    constraints=Constraints(max_width=20),
    preferences=Preferences(optimization_level=1),
)
result_iqae    = iqae.run(epsilon=0.05, alpha=0.01)
classiq_payoff = result_iqae.estimation * scaling_factor

# --- CUDA-Q GPU run ---
cudaq.set_target("nvidia")  # swap for "qpp-cpu" if no GPU available
pricer         = IQAEOptionPricer(epsilon=0.05, alpha=0.01, scaling_factor=scaling_factor)
cudaq_price, cudaq_shots = pricer.solve(iqae_kernel)

# --- Classical reference ---
expected_payoff = sum((grid_points - K) * (grid_points >= K) * probs)
print(f"\n{'Method':<22} {'Payoff':>10}")
print("-" * 34)
print(f"{'CUDA-Q IQAE (GPU)':<22} {cudaq_price:>10.4f}  (shots: {cudaq_shots})")
print(f"{'Classiq IQAE':<22} {classiq_payoff:>10.4f}")
print(f"{'Classical reference':<22} {expected_payoff:>10.4f}")

All three numbers should land within epsilon = 0.05 of each other. The shot count from the CUDA-Q path makes the quadratic advantage tangible: IQAE reaches 5% precision using far fewer oracle calls than Monte Carlo would need samples for the same confidence.

QAE vs IQAE: Switching Algorithms Without Rewriting Your Model

IQAE is well-suited for near-term hardware because it avoids one of the most expensive primitives in quantum computing: the controlled-unitary. Textbook Quantum Amplitude Estimation (QAE) uses Quantum Phase Estimation as its core subroutine, which requires a register of ancilla qubits and a series of controlled applications of the Grover operator where each controlled application roughly doubles circuit depth. On current hardware, where two-qubit gate fidelity is still the limiting factor, this depth cost is prohibitive.

IQAE sidesteps this entirely. Instead of encoding the amplitude into a phase via QPE, it adaptively queries the Grover oracle at increasing depths and uses classical post-processing (Clopper-Pearsonintervals, phase inversion) to home in on the answer. The circuit for each round is shallow; the intelligence is in the classical loop.

The tradeoff: IQAE requires more total circuit executions than QAE for the same precision, and its confidence interval update involves heuristics that a full QPE-based approach handles exactly. As hardware improves and controlled-unitary costs fall, QAE's single-shot precision becomes more attractive.

What makes Classiq valuable here is that the financial model doesn't change at all. The functions load_distribution, payoff_linear, payoff, and european_call_state_preparation are pure descriptions of the financial problem and they know nothing about IQAE or QAE. They encode the distribution, the strike condition, and the amplitude loading. That logic is identical regardless of which amplitude estimation algorithm wraps around it.

Moving to full QAE would mean replacing the IQAE wrapper with a QPE-based outer circuit: a register of ancilla qubits, controlled applications of the Grover operator at powers 1, 2, 4, ..., 2^(n-1), and an inverse QFT to read out the phase. The number of ancilla qubits directly controls precision as each additional qubit halves the estimation error, at the cost of roughly doubling circuit depth. In Classiq's modeling language, this outer structure is expressed in terms of grover_operator and qpe —the same building blocks already used internally by IQAE — wrapped in a new main function. The european_call_state_preparation function is passed in unchanged as the state preparation operand.

The contrast between the two circuits is worth seeing directly. The IQAE oracle synthesised in Steps 3 and 4 is a shallow single Grover iteration where the depth scaling happens in the classical loop, not in the circuit. A full QAE circuit compiled to the same target precision would be significantly deeper, with every Grover power explicitly present as a controlled sub-circuit connected to the QPE register. On a fault-tolerant machine that penalty disappears; on near-term hardware it's the dominant cost. Classiq lets you synthesise and inspect both on the same financial model, measure the depth difference in qprog.data.depth, and make the hardware-readiness call with real numbers rather than intuition.

The Classiq + NVIDIA Architecture: Why It Matters

It's worth zooming out to appreciate what this combination enables as an end-to-end stack.

Finance Problem

     │

     ▼

Classiq QMOD  ──►  High-level quantum model

     │            (assets, payoffs, distributions)

     ▼

Classiq Synthesis  ──►  Optimal gate-level circuit

     │                   (depth, width, connectivity constraints)

     ▼

get_cuda_quantum_kernel()  ──►  CUDA-Q kernel

     │

     ▼

NVIDIA GPU (custatevec / cuTensorNet) ──►  Accelerated simulation

     │

     ▼

Classical Optimizer/ IQAE Estimator  ──►  Result

Classiq handles the quantum modeling layer — ensuring the circuit is correct, efficient, and hardware-aware without requiring the developer to understand quantum circuit synthesis. Change the number of assets, the distribution model, or the hardware backend, and Classiq re-synthesizes the optimal circuit automatically.

NVIDIA CUDA-Q handles the execution layer — providing GPU-accelerated simulation that makes iterative variational and estimation algorithms practical today, and aunified programming model that will extend to quantum hardware as it matures.

The two layers are cleanly separated. Afinancial engineer can iterate on the problem formulation in Classiq without touching the CUDA-Q layer. A systems engineer can swap GPU backends or tune parallelism in CUDA-Q without re-deriving quantum circuits. This is the kind of separation of concerns that makes complex systems maintainable in production.

What This Means for Finance Teams Today

Quantum hardware is not yet at the scale needed to outperform classical computers on real financial data. But GPU-accelerated quantum simulation is production-ready today. Teams that build quantum workflows now, using simulators that behave identically to future hardware, will be positioned to switch seamlessly when quantum advantage arrives.

The immediate practical wins are:

●      Faster research iteration — Classiq's synthesis means a financial engineer can prototype a newpayoff structure or constraint formulation in hours, not weeks. No quantumcircuit expertise required.

●      GPU-scale Monte Carloreplacement — IQAE on GPU simulators already beats classical Monte Carlo for certain high-precision pricing problems at modest qubit counts (12–20 qubits).

●      Hardware-ready algorithms — Classiq-synthesized circuits are already optimized for gate countand depth, meaning they'll run on real quantum hardware with minimal modification when the time comes.

●      Unified stack — The Classiq →CUDA-Q bridge means your quantum programs and your GPU programs live in the same development environment, with the same deployment infrastructure.

Getting Started

The full code for both examples is available in the Classiq GitHub repository under applications/finance. To run them:

pip install classiq[cudaq]

import classiq
classiq.authenticate()

You can signup for Classiq at platform.classiq.io. CUDA-Q is open-source and runs on any NVIDIA GPU.

The quantum finance revolution won't be announced with a single breakthrough paper. It will arrive incrementally through better hardware, better algorithms, better tooling, and the teams best positioned to capture it are those building the foundational workflows today. Classiq and NVIDIA are building those foundations together.

Classiq is the quantum software company that enables engineers and researchers to build, optimize, and execute quantum programs at any scale. NVIDIA CUDA-Q is an open-source platform forquantum-classical hybrid computing that runs on CPUs, GPUs, and quantum hardware. Learn more at classiq.io and developer.nvidia.com/cuda-quantum.

By: the Classiq and NVIDIA Teams

The financial industry runs onoptimization and uncertainty. Every day, portfolio managers balance competing assets, risk desks price derivatives under volatile conditions, and trading algorithms make decisions in milliseconds. Classical computers handle these problems well enough, until they don't. As portfolios grow, as derivative structures become more exotic, and as risk models demand higher fidelity, the computational cost explodes. Quantum computing promises a fundamentally different way to attack these hard problems. But the path from quantum theory to practical financial tooling has, until now, been full of friction.

This post shows how Classiq's high-level quantum modeling language and NVIDIA's CUDA-Q platform remove that friction. We'll walk through two of the most promising quantum algorithms for finance, QAOA for portfolio optimization, and IQAE for derivative pricing, and we’ll show how the two platforms work together to accelerate iterative workflows on GPU hardware.

The Finance Problem Space: Where Quantum Has an Edge

Before we write a single line of quantum code, it's worth being precise about where quantum computing actually helps infinance as not every problem benefits. The strongest cases today are:

Portfolio optimization — Finding the optimal allocation across N assets subject to constraints (budget, risk, sector exposure) is a combinatorial problem. With N assets, the search space grows as 2^N. The best classical solvers like Gurobi, CPLEX, branch-and-bound variants are formidably good and have decades of engineering behind them. Quantum approximate optimization (QAOA) has not yet demonstrated a clear advantage over these solvers on real hardware. What it has demonstrated is that the problem maps naturally onto quantum circuits, that those circuits run on today's devices, and that the algorithmic framework scales in the right direction as hardware improves. We are building toward advantage, not claiming it today.

Derivative pricing and risk estimation — Monte Carlo simulation is the industry workhorse for pricing options, computing Value-at-Risk (VaR), and estimating CVA. It converges at O(1/√N) and you need 100× more samples to get 10× more precision. Quantum Amplitude Estimation (QAE) achieves O(1/N) convergence, a quadratic speedup that translates into dramatic runtime reductions for high-precision pricing.

Credit risk modeling — Correlated default probabilities in loan portfolios are computationally expensive to model. Quantum approaches to distribution loading and amplitude estimation apply here as well.

The common thread: quadratic or exponential classical complexity that quantum algorithms address directly.

The Programming Challenge: From Math to Quantum Circuit

Here's where most quantum finance explorations stall. Implementing QAOA or QAE correctly requires deep knowledge of quantum circuit construction such as qubit encoding schemes, parameterized rotation gates, ancilla management, entanglement patterns, and transpilation to hardware-native gate sets. A financial engineer shouldn't need a PhD in quantum information theory to run a portfolio optimization.

Classiq's quantum modeling language flips the paradigm. Instead of specifying how to build a circuit gate by gate, you describe what the circuit should compute. Classiq's synthesis engine handles the translation to an efficient, hardware-ready implementation so quantum programs look like financial models, not physics experiments.

QAOA for Portfolio Optimization

The Problem

We want to select a portfolio of k assets from a universe of N candidates that maximizes expected return while keeping risk below a threshold. This is a constrained combinatorial problem: with N binary variables (invest or don't), it maps naturally onto the space  algorithms are built to navigate.

Given an expected return vector μ,a covariance matrix Σ, and a risk aversion parameter q, the objective is:

maximize:  μᵀx - q · xᵀΣx

subject to: Σᵢ xᵢ = k,  xᵢ ∈ {0, 1}

A word on expectations: QAOA on current hardware does not beat Gurobi or other best-in-class classical solvers on this problem. The honest value of running it now is different — it validates the end-to-end quantum workflow, it stress-tests the circuit synthesis and execution stack, and it puts teams in a position to benefit when hardware improves.

Expressing It in Classiq

What Classiq provides here is meaningfu lregardless of that hardware caveat: the problem is expressed entirely at the financial level. Classiq uses Pyomo, a standard optimization modeling framework that operations researchers already know as its interface for combinatorial problems. The QAOA circuit structure, the cost Hamiltonian encoding, and the mixer layers are all synthesized automatically from the model.

Step 1 — Define the financial problem in Pyomo:

import numpy as np
import pyomo.core as pyo
from classiq import *
from classiq.applications.combinatorial_optimization import (
    CombinatorialProblem,
    pyo_model_to_pauli_operator,
)
import cudaq

# Problem parameters: 3 assets, integer allocation weights
returns = np.array([3, 4, -1])
covariances = np.array(
    [
        [ 0.9,  0.5, -0.7],
        [ 0.5,  0.9, -0.2],
        [-0.7, -0.2,  0.9],
    ]
)
total_budget = 6

# Pyomo model — pure finance, no quantum yet
portfolio_model = pyo.ConcreteModel("portfolio_optimization")
num_assets = len(returns)
portfolio_model.w = pyo.Var(range(num_assets), domain=pyo.Integers, bounds=(0, 7))
w_array = list(portfolio_model.w.values())
portfolio_model.budget_rule = pyo.Constraint(expr=(sum(w_array) <= total_budget))
portfolio_model.expected_return = returns @ w_array
portfolio_model.risk = w_array @ covariances @ w_array
portfolio_model.cost = pyo.Objective(
    expr=portfolio_model.risk - portfolio_model.expected_return,
    sense=pyo.minimize,
)

This is a standard intege rprogramming model: minimize risk minus expected return, subject to a total budget constraint. No quantum concepts yet.

Step 2 — Synthesize the QAOA circuit with Classiq:

NUM_LAYERS = 3
PENALTY    = 10  # constraint penalty — must match penalty_energy in the CUDA-Q Hamiltonian

combi = CombinatorialProblem(pyo_model=portfolio_model, num_layers=NUM_LAYERS, penalty_factor=PENALTY)
qprog = combi.get_qprog()
show(qprog)  # inspect the synthesized circuit in the Classiq IDE

One call to CombinatorialProblem — Classiq reads the Pyomo objective and constraints, constructs the cost Hamiltonian, and synthesizes the full depth-optimized QAOA circuit. show(qprog) opens it in the Classiq platform where you can inspect gate counts, qubit layout, and depth interactively.

The Variational Loop and the CUDA-Q Bridge

QAOA is a variational algorithm: the circuit has trainable parameters (γ, β angles, one pair per layer) that aclassical optimizer tunes iteratively to minimize the cost expectation value. This outer loop can require hundreds to thousands of circuit evaluations which is exactly where GPU acceleration makes the difference between a research prototype and something practical.

The Classiq-to-CUDA-Q bridge is two function calls:

# Build the Hamiltonian using the same penalty as CombinatorialProblem,
# so both paths optimise the identical cost landscape
classiq_hamiltonian = pyo_model_to_pauli_operator(portfolio_model, penalty_energy=PENALTY)
hamiltonian = pauli_operator_to_cudaq_spin_op(classiq_hamiltonian)
# Convert the synthesized Classiq circuit to a native CUDA-Q kernel
kernel = qprog_to_cudaq_kernel(qprog, is_main_kernel=True)

The kernel is now a native CUDA-Q object runnable on any CUDA-Q backend by simply changing a single set_target() call. The circuit itself hasn't changed; it's the same gate sequence Classiq synthesized. The PENALTY constant is the critical shared parameter: it controls how heavily constraint violations are penalised in the cost Hamiltonian, and both CombinatorialProblem and pyo_model_to_pauli_operator must use the same value, otherwise the two methods are literally optimising different objective functions and their results will diverge.

Step 3 — Run the Classiq native optimizer:

combi.optimize() must run first as it initialises the execution session (_es) that combi.sample() depends on for decoding. We time it here so the comparison is straightforward.

import time
from classiq.execution import ExecutionPreferences, ClassiqBackendPreferences
execution_preferences = ExecutionPreferences(
    backend_preferences=ClassiqBackendPreferences(backend_name="simulator"),
)
t0 = time.perf_counter()
optimized_params = combi.optimize(
    execution_preferences,
    maxiter=70,
    quantile=0.7,  # CVaR alpha: focus on the best 30% of samples
)
classiq_time = time.perf_counter() - t0
classiq_result  = combi.sample(optimized_params)
best_classiq    = classiq_result.solution[classiq_result.cost.idxmin()]
classiq_weights = np.array(best_classiq["w"])
classiq_cost    = classiq_result.cost.min()
print(f"Classiq  allocation: {classiq_weights.tolist()}  cost: {classiq_cost:.4f}  time: {classiq_time:.2f}s")

Step 4 — Run the CUDA-Q GPU optimizer:

The CUDA-Q objective now uses the same quantile=0.7 as combi.optimize() — meaning both methods minimise the expected cost over the best 30% of sampled bitstrings rather than the plain expectation value. This is what aligns the two optimisers on the same landscape.

def initial_qaoa_params(NUM_LAYERS) -> np.ndarray:
    initial_gammas = math.pi * np.linspace(
        1 / (2 * NUM_LAYERS), 1 - 1 / (2 * NUM_LAYERS), NUM_LAYERS
    )
    initial_betas = math.pi * np.linspace(
        1 - 1 / (2 * NUM_LAYERS), 1 / (2 * NUM_LAYERS), NUM_LAYERS
    )
    initial_params = []
    for i in range(NUM_LAYERS):
        initial_params.append(initial_gammas[i])
        initial_params.append(initial_betas[i])
    return np.array(initial_params)
initial_params = initial_qaoa_params(NUM_LAYERS)

cudaq.set_target("nvidia")
QUANTILE     = 0.7   # must match combi.optimize(quantile=...)
SHOTS        = 1000  # samples per evaluation
parameter_count = 2 * NUM_LAYERS
def cvar_objective(parameters):
    """QAOA objective: mean cost of the best (1-QUANTILE) fraction of shots."""
    counts = cudaq.sample(kernel, *parameters, shots_count=SHOTS)
    # Evaluate cost for every observed bitstring
    costs = []
    for bitstring, count in counts.items():
        result = combi.sample(list(parameters))  # decode via combi
        cost   = result.cost.min()
        costs.extend([cost] * count)
    costs.sort()
    cutoff = max(1, int(len(costs) * (1 - QUANTILE)))
    return float(np.mean(costs[:cutoff]))
optimizer = cudaq.optimizers.COBYLA()
optimizer.max_iterations = 70
optimizer.initial_parameters = initial_params
t0 = time.perf_counter()
optimal_cvar, optimal_parameters = optimizer.optimize(
    dimensions=parameter_count, function=cvar_objective
)
cudaq_time = time.perf_counter() - t0
# Decode the final allocation using the optimal parameters
cudaq_result  = combi.sample(optimal_parameters)
best_cudaq    = cudaq_result.solution[cudaq_result.cost.idxmin()]
cudaq_weights = np.array(best_cudaq["w"])
cudaq_cost    = cudaq_result.cost.min()
print(f"CUDA-Q   allocation: {cudaq_weights.tolist()}  cost: {cudaq_cost:.4f}  time: {cudaq_time:.2f}s")

Comparing the Two Results

Both methods optimise the same Hamiltonian over the same synthesised circuit and decode through the same combi.sample() function, so their allocations and costs should agree. The timing column is the point of interest where combi.optimize() runs on Classiq's cloud simulator while the CUDA-Q path runs on a local NVIDIA GPU.

print(f"{'Method':<10} {'Allocation':>20} {'Cost':>10} {'Time (s)':>12} {'Speedup':>10}")
print("-" * 66)
print(f"{'Classiq':<10} {str(classiq_weights.tolist()):>20} {classiq_cost:>10.4f} {classiq_time:>12.2f} {classiq_time/cudaq_time:>9.1f}x")
print(f"{'CUDA-Q':<10} {str(cudaq_weights.tolist()):>20} {cudaq_cost:>10.4f} {cudaq_time:>12.2f} {'1.0x':>10}")

Comparing these methods we get these results:

Method               Allocation       Cost    Time (s)    Speedup

------------------------------------------------------------------

Classiq               [2, 2, 1]    -4.5000       160.13       2.5x

CUDA-Q                [2, 2, 1]    -4.5000        64.35       1.0x

The GPU speedup here is not about beating classical portfolio solvers yet, it's about making the quantum research loop fast enough to be useful by exploring layer depths, tuning penalty factors and benchmarking ansatz structures. GPU simulation is the development environment that makes quantum finance work today while hardware matures.

IQAE for Derivative Pricing

The Problem

Pricing a European option requires estimating an expectation value over a probability distribution of future asset prices:

Price = e^(-rT) · E[max(S_T - K, 0)]

where S_T is the asset price at maturity, K is the strike, and the expectation is taken over a log-normal distribution. Classical Monte Carlo estimates this by averaging over N random samples, converging at O(1/√N). Quantum Amplitude Estimation achieves O(1/N) convergence which is a quadratic speedup in oracle calls.

Iterative QAE (IQAE) is the near-term variant of choice: it avoids the deep controlled-phase circuits of textbook QAE, instead adaptively querying a Grover-like oracle with increasing repetition counts to home in on the true amplitude. Each iteration uses a small number of shots and narrows a confidence interval via Clopper-Pearson statistics which is practical on today's hardware.

Unlike QAOA, the quantum advantage case for IQAE is more concrete. The quadratic speedup in oracle queries is proven, and the question is whether it survives circuit overhead on real hardware. GPU-accelerated simulation lets us run the full IQAE loop at scale today and measure where the crossover point lies.

Step 1 — Define the Distribution and Payoff

The financial parameters map directly onto the Classiq model. We use 3 qubits, giving 2³ = 8 discretisation points over the log-normal distribution. The scaling_factor normalises the payoff into[0, 1] so it can be loaded as a quantum amplitude.

import numpy as np
import scipy
import cudaq
from classiq import *
from classiq.applications.iqae.iqae import IQAE
from classiq.qmod.symbolic import ceiling

# --- Financial parameters ---
num_qubits = 3
mu         = 0.7   # log-normal mean
sigma      = 0.13  # log-normal std dev
K          = 1.9   # strike price

# --- Discretise the log-normal distribution over 2^num_qubits points ---
def get_log_normal_probabilities(mu, sigma, num_points):
    mean     = np.exp(mu + sigma**2 / 2)
    variance = (np.exp(sigma**2) - 1) * np.exp(2 * mu + sigma**2)
    stddev   = np.sqrt(variance)
    low      = np.maximum(0, mean - 3 * stddev)
    high     = mean + 3 * stddev
    x        = np.linspace(low, high, num_points)
    return x, scipy.stats.lognorm.pdf(x, s=sigma, scale=np.exp(mu))
grid_points, probs = get_log_normal_probabilities(mu, sigma, 2**num_qubits)
probs          = probs / np.sum(probs)  # normalise to a valid probability distribution
scaling_factor = max(grid_points) - K
grid_step      = (max(grid_points) - min(grid_points)) / (len(grid_points) - 1)
def scale(val):
    return val * grid_step + min(grid_points)
def descale(val):
    return (val - min(grid_points)) / grid_step

Step 2 — Build the State Preparation Circuit in Classiq

The quantum algorithm needs two components: a circuit that loads the price distribution and a payoff function that marks in-the-money outcomes. Classiq synthesises both from high-level descriptions.

@qfunc
def load_distribution(asset: QNum):
    inplace_prepare_state(probabilities=probs.tolist(), bound=0, target=asset)

@qfunc
def payoff_linear(asset: Const[QNum], ind: QBit):
    # Load payoff amplitude: sqrt((S - K) / scaling_factor) for S > K
    assign_amplitude_table(
        lookup_table(lambda n: np.sqrt(abs((scale(n) - K) / scaling_factor)), asset),
        asset,
        ind,
    )

@qfunc
def payoff(asset: Const[QNum], ind: QBit):
    # Only apply the payoff when asset price exceeds the strike
    control(asset >= ceiling(descale(K)), lambda: payoff_linear(asset, ind))

@qfunc
def european_call_state_preparation(asset: QNum, ind: QBit):
    load_distribution(asset)
    payoff(asset, ind)

Step 3 — Synthesise Both Circuits in One Block

Classiq's synthesize() operates on a function named main. We define and synthesise main twice, once for state preparation, once for the oracle, and export each immediately to a CUDA-Q kernel. Both syntheses and exports must happen in the same cell, back to back: CUDA-Q names internal functions after Pytho nobject memory addresses, and if an address is recycled between two synthesis calls, the two MLIR modules will contain different functions with the same name but different signatures, causing a silent merge failure. Running them inimmediate sequence prevents this.

The stateprep main allocates the asset register and indicator qubit, applies the financial model, then drops the asset register — leaving only the indicator as an output. The oracle main applies the full Grover operator over the combined register.

# --- State preparation: synthesise and export immediately ---
@qfunc
def main(indicator: Output[QBit]):
    asset = QNum()
    allocate(num_qubits, asset)
    allocate(1, indicator)
    european_call_state_preparation(asset, indicator)
    drop(asset)

qprog_stateprep  = synthesize(main, preferences=Preferences(), constraints=Constraints(max_width=15))
print(f"State prep width: {qprog_stateprep.data.width} qubits")
stateprep_kernel = qprog_to_cudaq_kernel(qprog_stateprep, is_main_kernel=False)

# --- Oracle: redefine main, synthesise and export immediately ---
@qfunc
def space_transform(est_reg: QArray) -> None:
    european_call_state_preparation(est_reg[0 : est_reg.len - 1], est_reg[est_reg.len - 1])

@qperm
def oracle(est_reg: Const[QArray]) -> None:
    Z(est_reg[est_reg.len - 1])

@qfunc
def main(indicator: Output[QBit]) -> None:
    est_reg:      QArray = QArray("est_reg")
    problem_vars: QArray = QArray("problem_vars", length=num_qubits)
    allocate(problem_vars)
    allocate(indicator)
    within_apply(
        lambda: bind([problem_vars, indicator], est_reg),
        lambda: grover_operator(oracle, space_transform, est_reg),
    )
    drop(problem_vars)

qprog_oracle  = synthesize(main, preferences=Preferences(), constraints=Constraints(max_width=11))
print(f"Oracle width: {qprog_oracle.data.width} qubits")
oracle_kernel = qprog_to_cudaq_kernel(qprog_oracle, is_main_kernel=False)

Step 4 — Validate and Find the Indicator Qubit

Before building the parametric kernel we run a quick sanity check: both circuits must place the indicator qubit at the same index in the register, otherwise the parametric kernel measures the wrong qubit. Classiq's qubit_mapping exposes this directly, so we can assert consistency and extract measure_index without guessing.

def validate_qprogs(qprog_stateprep, qprog_oracle):
    sp_idx  = qprog_stateprep.data.qubit_mapping.logical_outputs['indicator'][0]
    orc_idx = qprog_oracle.data.qubit_mapping.logical_outputs['indicator'][0]
    assert sp_idx == orc_idx, (
        f"Indicator qubit index mismatch: stateprep={sp_idx}, oracle={orc_idx}. "
        "Both circuits must place the indicator at the same physical qubit."
    )
    total_qubits = max(qprog_stateprep.data.width, qprog_oracle.data.width)
    return total_qubits, sp_idx
total_qubits, measure_index = validate_qprogs(qprog_stateprep, qprog_oracle)
print(f"Total qubits: {total_qubits}, measure index: {measure_index}")

Step 5 — Build the Parametric CUDA-Q Kernel

This is where the CUDA-Q integration pays off beyond raw GPU speed. A standard QASM-based workflow would need to unroll the full oracle circuit k times per IQAE iteration, thereby producing a new, longer circuit for every value of k. CUDA-Q's for_loop emits a cc.loop in the Quake IR instead: the oracle is described once and the loop bound is a runtime integer. The kernel is O(1) in size regardless of k.

One critical detail: qubit refs for both apply_call invocations must be resolved in the outer scope before for_loop is entered. Inside the loop body, CUDA-Q traces the function once to generate the loop IR — extract_ref ops created during that trace land inside the loop construct where func.call lowering cannot resolve them. Extracting all refs outside the loop avoids this.

def build_parametric_iqae_kernel(
    stateprep_kernel: cudaq.Kernel,
    oracle_kernel: cudaq.Kernel,
    total_qubits: int,
    measure_index: int,
) -> cudaq.Kernel:
    iqae_kernel, k_var = cudaq.make_kernel(int)
    q = iqae_kernel.qalloc(total_qubits)
    # Resolve all qubit refs in the outer scope — not inside the loop body
    sp_qubits     = [q[i] for i in range(qprog_stateprep.data.width)]
    oracle_qubits = [q[j] for j in range(qprog_oracle.data.width)]
    # Apply state preparation once (the A operator)
    iqae_kernel.apply_call(stateprep_kernel, *sp_qubits)
    # Apply the Grover oracle k times — k is a runtime integer, not an unrolled copy
    def oracle_loop_body(i: cudaq.QuakeValue):
        iqae_kernel.apply_call(oracle_kernel, *oracle_qubits)
    iqae_kernel.for_loop(start=0, stop=k_var, function=oracle_loop_body)
    iqae_kernel.mz(q[measure_index])
    return iqae_kernel
iqae_kernel = build_parametric_iqae_kernel(
    stateprep_kernel=stateprep_kernel,
    oracle_kernel=oracle_kernel,
    total_qubits=total_qubits,
    measure_index=measure_index,
)

Step 6 — Run IQAE on GPU and Compare with Classiq

The IQAEOptionPricer drives the iterative loop: at each iteration it samples the parametric kernel with the current k, computes a Clopper-Pearson confidence interval on the measured probability, inverts back to theta-space, and contracts the interval. The Classiq IQAE class runs the same algorithm internally for the baseline.

from scipy.stats import beta as scipy_beta
class IQAEOptionPricer:
    def __init__(self, epsilon: float, alpha: float, scaling_factor: float):
        self.epsilon        = epsilon
        self.alpha          = alpha
        self.scaling_factor = scaling_factor
        self.shots          = 1000
    def _clopper_pearson(self, count, n, alpha):
        lower = 0.0 if count == 0 else scipy_beta.ppf(alpha / 2, count, n - count + 1)
        upper = 1.0 if count == n  else scipy_beta.ppf(1 - alpha / 2, count + 1, n - count)
        return lower, upper
    def _find_next_k(self, k_curr, theta_l, theta_u):
        K_max = int(np.floor(np.pi / (theta_u - theta_l)))
        K     = K_max - (K_max - 2) % 4
        while K >= 4 * (4 * k_curr + 2):
            K = int(K / 4) - (int(K / 4) - 2) % 4
        return max((K - 2) // 4, k_curr + 1)
    def solve(self, kernel: cudaq.Kernel):
        theta_l, theta_u = 0.0, np.pi / 2
        k = 0
        total_shots = 0
        print(f"{'Iter':<5} | {'k':<5} | {'Interval':<26} | {'Error':>8}")
        print("-" * 55)
        iteration = 0
        while (theta_u - theta_l) > (2 * self.epsilon / self.scaling_factor):
            iteration += 1
            result      = cudaq.sample(kernel, k, shots_count=self.shots)
            count_1     = result.count('1')
            total_shots += self.shots
            a_min, a_max = self._clopper_pearson(count_1, self.shots, self.alpha)
            theta_meas   = np.arcsin(np.sqrt(count_1 / self.shots))
            theta_center = (theta_l + theta_u) / 2
            scaling      = 2 * k + 1
            n_           = round((scaling * theta_center - theta_meas) / np.pi)
            theta_est    = (n_ * np.pi + theta_meas) / scaling
            d_a   = (a_max - a_min) / 2
            deriv = abs(scaling * np.sin(2 * scaling * theta_est))
            if deriv < 1e-9:
                deriv = 1e-9
            delta_theta = d_a / deriv
            theta_l   = max(theta_l, theta_est - delta_theta)
            theta_u   = min(theta_u, theta_est + delta_theta)
            est_error = (theta_u - theta_l) * self.scaling_factor
            print(f"{iteration:<5} | {k:<5} | [{theta_l:.4f}, {theta_u:.4f}] | {est_error:>8.4f}")
            if (theta_u - theta_l) > (2 * self.epsilon / self.scaling_factor):
                k = self._find_next_k(k, theta_l, theta_u)
        theta_final     = (theta_l + theta_u) / 2
        estimated_price = np.sin(theta_final) ** 2 * self.scaling_factor
        return estimated_price, total_shots
# --- Classiq baseline ---
iqae = IQAE(
    state_prep_op=european_call_state_preparation,
    problem_vars_size=num_qubits,
    constraints=Constraints(max_width=20),
    preferences=Preferences(optimization_level=1),
)
result_iqae    = iqae.run(epsilon=0.05, alpha=0.01)
classiq_payoff = result_iqae.estimation * scaling_factor

# --- CUDA-Q GPU run ---
cudaq.set_target("nvidia")  # swap for "qpp-cpu" if no GPU available
pricer         = IQAEOptionPricer(epsilon=0.05, alpha=0.01, scaling_factor=scaling_factor)
cudaq_price, cudaq_shots = pricer.solve(iqae_kernel)

# --- Classical reference ---
expected_payoff = sum((grid_points - K) * (grid_points >= K) * probs)
print(f"\n{'Method':<22} {'Payoff':>10}")
print("-" * 34)
print(f"{'CUDA-Q IQAE (GPU)':<22} {cudaq_price:>10.4f}  (shots: {cudaq_shots})")
print(f"{'Classiq IQAE':<22} {classiq_payoff:>10.4f}")
print(f"{'Classical reference':<22} {expected_payoff:>10.4f}")

All three numbers should land within epsilon = 0.05 of each other. The shot count from the CUDA-Q path makes the quadratic advantage tangible: IQAE reaches 5% precision using far fewer oracle calls than Monte Carlo would need samples for the same confidence.

QAE vs IQAE: Switching Algorithms Without Rewriting Your Model

IQAE is well-suited for near-term hardware because it avoids one of the most expensive primitives in quantum computing: the controlled-unitary. Textbook Quantum Amplitude Estimation (QAE) uses Quantum Phase Estimation as its core subroutine, which requires a register of ancilla qubits and a series of controlled applications of the Grover operator where each controlled application roughly doubles circuit depth. On current hardware, where two-qubit gate fidelity is still the limiting factor, this depth cost is prohibitive.

IQAE sidesteps this entirely. Instead of encoding the amplitude into a phase via QPE, it adaptively queries the Grover oracle at increasing depths and uses classical post-processing (Clopper-Pearsonintervals, phase inversion) to home in on the answer. The circuit for each round is shallow; the intelligence is in the classical loop.

The tradeoff: IQAE requires more total circuit executions than QAE for the same precision, and its confidence interval update involves heuristics that a full QPE-based approach handles exactly. As hardware improves and controlled-unitary costs fall, QAE's single-shot precision becomes more attractive.

What makes Classiq valuable here is that the financial model doesn't change at all. The functions load_distribution, payoff_linear, payoff, and european_call_state_preparation are pure descriptions of the financial problem and they know nothing about IQAE or QAE. They encode the distribution, the strike condition, and the amplitude loading. That logic is identical regardless of which amplitude estimation algorithm wraps around it.

Moving to full QAE would mean replacing the IQAE wrapper with a QPE-based outer circuit: a register of ancilla qubits, controlled applications of the Grover operator at powers 1, 2, 4, ..., 2^(n-1), and an inverse QFT to read out the phase. The number of ancilla qubits directly controls precision as each additional qubit halves the estimation error, at the cost of roughly doubling circuit depth. In Classiq's modeling language, this outer structure is expressed in terms of grover_operator and qpe —the same building blocks already used internally by IQAE — wrapped in a new main function. The european_call_state_preparation function is passed in unchanged as the state preparation operand.

The contrast between the two circuits is worth seeing directly. The IQAE oracle synthesised in Steps 3 and 4 is a shallow single Grover iteration where the depth scaling happens in the classical loop, not in the circuit. A full QAE circuit compiled to the same target precision would be significantly deeper, with every Grover power explicitly present as a controlled sub-circuit connected to the QPE register. On a fault-tolerant machine that penalty disappears; on near-term hardware it's the dominant cost. Classiq lets you synthesise and inspect both on the same financial model, measure the depth difference in qprog.data.depth, and make the hardware-readiness call with real numbers rather than intuition.

The Classiq + NVIDIA Architecture: Why It Matters

It's worth zooming out to appreciate what this combination enables as an end-to-end stack.

Finance Problem

     │

     ▼

Classiq QMOD  ──►  High-level quantum model

     │            (assets, payoffs, distributions)

     ▼

Classiq Synthesis  ──►  Optimal gate-level circuit

     │                   (depth, width, connectivity constraints)

     ▼

get_cuda_quantum_kernel()  ──►  CUDA-Q kernel

     │

     ▼

NVIDIA GPU (custatevec / cuTensorNet) ──►  Accelerated simulation

     │

     ▼

Classical Optimizer/ IQAE Estimator  ──►  Result

Classiq handles the quantum modeling layer — ensuring the circuit is correct, efficient, and hardware-aware without requiring the developer to understand quantum circuit synthesis. Change the number of assets, the distribution model, or the hardware backend, and Classiq re-synthesizes the optimal circuit automatically.

NVIDIA CUDA-Q handles the execution layer — providing GPU-accelerated simulation that makes iterative variational and estimation algorithms practical today, and aunified programming model that will extend to quantum hardware as it matures.

The two layers are cleanly separated. Afinancial engineer can iterate on the problem formulation in Classiq without touching the CUDA-Q layer. A systems engineer can swap GPU backends or tune parallelism in CUDA-Q without re-deriving quantum circuits. This is the kind of separation of concerns that makes complex systems maintainable in production.

What This Means for Finance Teams Today

Quantum hardware is not yet at the scale needed to outperform classical computers on real financial data. But GPU-accelerated quantum simulation is production-ready today. Teams that build quantum workflows now, using simulators that behave identically to future hardware, will be positioned to switch seamlessly when quantum advantage arrives.

The immediate practical wins are:

●      Faster research iteration — Classiq's synthesis means a financial engineer can prototype a newpayoff structure or constraint formulation in hours, not weeks. No quantumcircuit expertise required.

●      GPU-scale Monte Carloreplacement — IQAE on GPU simulators already beats classical Monte Carlo for certain high-precision pricing problems at modest qubit counts (12–20 qubits).

●      Hardware-ready algorithms — Classiq-synthesized circuits are already optimized for gate countand depth, meaning they'll run on real quantum hardware with minimal modification when the time comes.

●      Unified stack — The Classiq →CUDA-Q bridge means your quantum programs and your GPU programs live in the same development environment, with the same deployment infrastructure.

Getting Started

The full code for both examples is available in the Classiq GitHub repository under applications/finance. To run them:

pip install classiq[cudaq]

import classiq
classiq.authenticate()

You can signup for Classiq at platform.classiq.io. CUDA-Q is open-source and runs on any NVIDIA GPU.

The quantum finance revolution won't be announced with a single breakthrough paper. It will arrive incrementally through better hardware, better algorithms, better tooling, and the teams best positioned to capture it are those building the foundational workflows today. Classiq and NVIDIA are building those foundations together.

Classiq is the quantum software company that enables engineers and researchers to build, optimize, and execute quantum programs at any scale. NVIDIA CUDA-Q is an open-source platform forquantum-classical hybrid computing that runs on CPUs, GPUs, and quantum hardware. Learn more at classiq.io and developer.nvidia.com/cuda-quantum.

Start Creating Quantum Software Without Limits