This is an introductory guide for UW-Madison students (and others who have access to Nvidia GPUs) to do beyond-laptop-scale quantum circuit simulation efficiently using Qiskit and/or cuQuantum with GPUs on the CHTC platform.
Qiskit used to rely on QuantumInstance to run circuits and algorithms, with AerSimulator serving as its backend for classical simulation. As of writing this guide, Qiskit is transitioning to the Estimator and Sampler primitives, and QuantumInstance is being deprecated. The default qiskit.primitives are capable of running simple tasks, but for high-performance and feature-rich simulation, Qiskit Aer implements its own Estimator and Sampler using AerSimulator. The arguments for AerSimulator, QuantumInstance, and the QuantumInstance transpiler can now be passed to the Aer primitives as backend_options, run_options, and transpile_options, respectively. Below we focus on how to do CPU and GPU simulation using AerSimulator.
The AerSimulator doc page serves as a good reference for its options, albeit a bit cluttered. The most used arguments are method, noise_model, and device. Other arguments are used when you want to fine-tune the behavior of a method or really push the limits (e.g. blocking_enable and blocking_qubits).
For method, 99% of the time statevector is the correct choice. The remaining 1% is stabilizer for running Clifford circuits. The other methods are not very useful. (In my past experience, extended_stabilizer is extremely slow even with only a few T gates, and matrix_product_state is neither very accurate nor fast. Their implementations may have been improved since.)
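For example, a Clifford-only circuit far too large for statevector simulation runs easily with the stabilizer method. A minimal sketch (the 50-qubit GHZ circuit is just for illustration):

from qiskit import QuantumCircuit, transpile
from qiskit_aer import AerSimulator

# H and CX are Clifford gates, so the stabilizer method handles 50 qubits
# easily, while statevector would need 2^50 amplitudes.
qc = QuantumCircuit(50)
qc.h(0)
for i in range(49):
    qc.cx(i, i + 1)
qc.measure_all()

backend = AerSimulator(method='stabilizer')
result = backend.run(transpile(qc, backend), shots=1024).result()
print(result.get_counts())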
For noise_model, see Building Noise Models.
For device, nothing needs to be specified for CPU simulation. For GPU simulation, see below.
Qiskit Aer’s native GPU support requires CUDA at least 10.1 (sorry, but “AMD No!”) and installing qiskit-aer-gpu in addition to the qiskit-aer that comes with qiskit. The former will overwrite the latter. (See https://github.com/Qiskit/qiskit-aer#installing-gpu-support.) Side note: I’ve run into problems with this overwriting behavior on Google Colab, so if you use Colab, I recommend installing it after the installation of qiskit is complete, i.e.
pip install qiskit
pip install qiskit-aer-gpu
Then setting device to "GPU" is all you need. AerSimulator.available_devices() provides a list of devices that can be used. Note that only the statevector, density_matrix, and unitary methods are natively supported.
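A quick sanity check (a minimal sketch) to confirm that Aer can see the GPU:

from qiskit_aer import AerSimulator

sim = AerSimulator()
print(sim.available_devices())  # expect something like ('CPU', 'GPU')
print(sim.available_methods())  # simulation methods supported by this build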
To sum up, a typical usage would be
from qiskit.utils import QuantumInstance
from qiskit_aer import AerSimulator
from qiskit_aer.noise import NoiseModel

noise_model = NoiseModel()
noise_model.add_all_qubit_quantum_error(...)

backend = AerSimulator(
    method='statevector',
    device='GPU',
    noise_model=noise_model
)

quantum_instance = QuantumInstance(
    backend=backend,
    # shots=1024,
    seed_simulator=42,
    seed_transpiler=42
)
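A hypothetical run, assuming qc is a QuantumCircuit with measurements:

result = quantum_instance.execute(qc)  # transpiles and runs qc on the backend
print(result.get_counts())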
Or using the new primitives APIs:
from qiskit_aer.primitives import Estimator, Sampler

estimator = Estimator(
    backend_options={
        'method': 'statevector',
        'device': 'GPU',
        'noise_model': noise_model
    },
    run_options={
        # 'shots': 1024,
        # 'seed': 42,
    },
    transpile_options={
        # 'seed_transpiler': 42
    }
)
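A hypothetical call, assuming qc is a QuantumCircuit and op is a qiskit.quantum_info.SparsePauliOp observable:

job = estimator.run(qc, op)
print(job.result().values[0])  # estimated expectation value of op in qc's output state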
You may have noticed that there is a tensor_network method and a cuStateVec_enable option. They are for the cuTensorNet and cuStateVec integrations, but are only available by building Aer from source with cuQuantum support. If you want to use cuStateVec with a minimal amount of modification to your Qiskit code, the other option is cuQuantum Appliance.
cuQuantum is Nvidia’s GPU quantum circuit simulation SDK, written in C/C++ with a Python interface, cuQuantum Python. It consists of two libraries: cuTensorNet, based on the tensor network approach, and cuStateVec, based on the state vector approach.
CUDA Quantum is Nvidia’s quantum SDK that natively works with cuQuantum (like Qiskit does with Qiskit Aer) and is probably the most natural way to utilize cuQuantum. It is written in C/C++ with a Python interface. However, learning to program with a new library can be painful, especially if the library is relatively young and there are limited learning materials available. Fortunately, cuTensorNet provides a function to convert Qiskit and Cirq circuits to tensors that work with cuTensorNet APIs.
cuquantum.CircuitToEinsum is a class for converting a qiskit.QuantumCircuit or cirq.Circuit to an einsum expression and the corresponding tensor operands that can be used by cuquantum.contract. This conversion only works for cuTensorNet; there is no conversion function for cuStateVec. The type of the tensors is specified by the backend argument; numpy, cupy, and torch are supported. CircuitToEinsum has different functions for the different operations that can be performed on the state represented by the circuit, and the resulting einsum expression and tensors will differ accordingly. Available operations are:
state_vector(): calculate the state vector represented by the circuit. Note that since the exponentially large state vector is still formed in this calculation, the tensor network approach will likely not have an advantage over the state vector approach.
expectation(): take the expectation value of an observable represented as a Pauli string, which can be obtained with qiskit.quantum_info.SparsePauliOp.to_list(). As of writing, different terms in the observable need to be computed separately, making the overhead quite significant for complex observables, but each individual expectation calculation can take advantage of the lightcone technique to contract only the relevant part of the tensor network. This and the following operations can avoid forming the exponentially large state vector in the calculation, thus gaining a potential advantage over state vector simulation. A sample usage would be
from cuquantum import contract, CircuitToEinsum

converter = CircuitToEinsum(circuit)
exp_val = 0
for pauli_str, coeff in operator.to_list():
    # lightcone=True contracts only the causally relevant part of the network
    expression, operands = converter.expectation(pauli_str, lightcone=True)
    exp_val += coeff * contract(expression, *operands)
amplitude() and batched_amplitude(): calculate the probability amplitude of one or a batch of bitstrings (a sketch follows the list).
reduced_density_matrix(): given a list of qubits to keep, calculate their reduced density matrix (tracing out the rest), with the option of fixing some other qubits to state 0 or 1.
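A minimal sketch of the remaining conversions, assuming circuit is a 4-qubit qiskit.QuantumCircuit:

from cuquantum import contract, CircuitToEinsum

converter = CircuitToEinsum(circuit, backend='cupy')

# Full state vector (exponentially large, see the caveat above)
expression, operands = converter.state_vector()
sv = contract(expression, *operands)

# Probability amplitude of the bitstring '0110'
expression, operands = converter.amplitude('0110')
amp = contract(expression, *operands)

# Reduced density matrix of the first two qubits (the rest are traced out)
expression, operands = converter.reduced_density_matrix(converter.qubits[:2])
rho = contract(expression, *operands)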
After the conversion, the contract() computation can be further configured with the various options available. Notably, the contraction path has a major influence on the contraction cost, though finding the optimal one is an NP-hard problem. cuTensorNet finds contraction paths using a hyper-optimizer that adjusts the configuration of a graph partitioning method. Multiple aspects of this process can be tuned via the options provided, and it is often beneficial to customize these options based on the properties of the network being contracted rather than using the default configuration. More details can be found in the cuQuantum paper and the cotengra paper.
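For instance, a hedged sketch of tuning the hyper-optimizer via cuquantum.OptimizerOptions (the field values here are arbitrary; see the cuQuantum Python docs for the full option list):

from cuquantum import contract, OptimizerOptions

# Draw more hyper-optimizer samples and fix the seed for reproducible paths
options = OptimizerOptions(samples=32, seed=42)
exp_val = contract(expression, *operands, optimize=options)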
cuQuantum Appliance is a Docker container that comes with an adjusted and tested cuQuantum, Qiskit, and Cirq environment. In particular, cusvaer, a distributed cuStateVec-based simulator that can be used as a Qiskit Aer StatevectorSimulator, is only available in cuQuantum Appliance. In addition, its Qiskit Aer has been modified to support cuStateVec and cusvaer.
CHTC (Center for High Throughput Computing) is UW-Madison’s high-performance and high-throughput computing platform. There are two components: the HPC (High Performance Computing) cluster and the HTC (High Throughput Computing) cluster. If your code needs CPU parallel computing, you should use the former; other use cases, including needs for high-memory nodes and GPUs, should all go to the latter. We will focus on the HTC cluster, since GPU simulation there is relatively easier to set up than CPU-parallelized quantum circuit simulation on the HPC cluster.
To get access to CHTC, apply by filling out this form. Within a few days, a CHTC staff member will contact you to set up a meeting to further clarify your needs and usage.
Any connection to CHTC needs to be done through the campus network. Off-campus access requires the use of WiscVPN. If you live in university apartments, the residential network is part of the campus network.
The login nodes can be conveniently SSH’ed into with VS Code. After entering your password, click the “details” button in the lower-right notification to open the two-factor pop-up, then input “1” for a Duo push.
Unlike the HPC cluster, which uses Slurm and a shared file system, the HTC cluster uses the UW-Madison-developed HTCondor workload manager and independent file systems between login nodes and compute nodes. HTCondor jobs are specified and submitted with submit files. For a GPU job, a typical submit file may look like this:
universe = vanilla
executable = ./my_executable.sh
arguments = arg1_value arg2_value $(arg3)
log = joblog/job_$(Cluster)_$(Process)_$(arguments).log
error = joblog/job_$(Cluster)_$(Process)_$(arguments).err
output = joblog/job_$(Cluster)_$(Process)_$(arguments).out
should_transfer_files = YES
when_to_transfer_output = ON_EXIT
transfer_input_files = ../my_code, ../additional_files
transfer_output_files = my_code/data
+WantGPULab = true
+GPUJobLength = "short"
require_gpus = (Capability >= 8.0)
request_gpus = 1
request_cpus = 4
request_memory = 64GB
request_disk = 20GB
queue arg3 from seq 0 15 |
The above example assumes you run the command condor_submit my_submit_file.sub in the my_code directory and store the computed results in my_code/data. Note that after a job finishes, existing files in the transfer_output_files directories (my_code/data in this case) on the login node will be overwritten by files with the same path on the compute node.
Here we have 3 arguments for our executable (more on that later), where the third argument is looped from 0 to 15 with the queue command. After job submission, there will be 16 jobs in the batch, each with a different arg3. If you only want to run a single job, simply write queue. Submitted jobs (queued or running) can be viewed with the condor_q command.
Any variable defined in the submit file can be referenced with the $() syntax. $(Cluster) and $(Process) are additional variables HTCondor provides, among many others.
In this example, the +WantGPULab, +GPUJobLength, require_gpus, and request_gpus variables are for GPU jobs only. Details on the CHTC GPUs are given in this guide. More examples can be found at this repo.
More submit file examples are available in the HTCondor doc.
A newly-developed official HTCondor VS Code extension is now available, providing features like syntax highlighting, autocomplete suggestions, documentation on hover, and submit file snippets.
The executable variable in the submit file specifies the file condor will run after the job gets assigned a compute node and all listed files are transferred. It receives the arguments specified in the submit file as command-line arguments. I use a shell script that first prepares the environment and then calls the command-line interface of my research code, but it can be anything that does the job:
#!/usr/bin/sh
# Download and install Miniconda into the job's scratch directory
wget https://repo.anaconda.com/miniconda/Miniconda3-latest-Linux-x86_64.sh
sh Miniconda3-latest-Linux-x86_64.sh -b -p ./miniconda3
source ./miniconda3/bin/activate ""
conda init
# Install the packages the research code needs
conda install numpy networkx matplotlib scipy -y
pip install qiskit qiskit_optimization mitiq qiskit-ibm-provider qiskit-nature qiskit-aer-gpu
conda install -c conda-forge cuquantum-python -y
module load CMake # CMake is available as a module if the package you use needs it to build
pip install git+https://github.com/someone/some_package.git
# Run the actual computation, forwarding the three submit-file arguments
cd my_code
python src/run.py --arg1 $1 --arg2 $2 --arg3 $3
It is good practice to do a supervised test run of your workflow in an interactive session by adding the -i flag to the condor_submit command, e.g. condor_submit -i my_submit_file.sub. This is particularly useful if your environment is tricky to build.
In this example, we rely on the internet to download the Miniconda installer and install the required packages. If your environment is small in size, you can also prepare it on your login node and transfer it to the compute node. The third option is to start from a Docker container, as shown below.
If you want to load a Docker container as your starting environment, simply change the universe = vanilla in your submit file to

universe = docker
container_image = docker://user/repo:tag

and supply the container_image argument as shown. Everything else can stay the same, except that the environment preparation part of your executable can be reduced correspondingly.
The aforementioned cuQuantum Appliance Docker image can be loaded this way:
universe = docker
container_image = nvcr.io/nvidia/cuquantum-appliance:23.03