Architecture

How CosmicAC's components connect to your Kubernetes cluster and run each job type.

CosmicAC is a self-hosted platform that runs GPU workloads on your own Kubernetes cluster. This page explains the components involved, how they connect to your cluster, and how each job type runs. For deployment steps, see Installation.

Deployment architecture

Setting up your cluster is separate from deploying CosmicAC. You bring a Kubernetes cluster that already has its GPU nodes and KubeVirt configured. The CosmicAC components then connect to that cluster and run your workloads on it.

wrk-server-k8s-nvidia connects to your cluster's Kubernetes API and creates the resources each job needs. For each job request it builds a graph of Kubernetes resources and applies them through the API server. For every workload it provisions a GPU VMI, a KubeVirt Virtual Machine Instance whose root disk CDI imports from a registry image and which claims one or more whole GPUs through PCI passthrough. Multi-node workloads also claim InfiniBand.

The cluster the worker drives has two requirement tiers.

Base platform — the Kubernetes, GPU, virtualization, storage, and registry foundation that every workload needs.
Overlay networking — an optional add-on tier, needed only when a workload requires a per-instance isolated network, an OVS/VXLAN subcluster reachable through a WireGuard gateway. This capability is independent of the workload type, and we expect it to apply more broadly over time.

CosmicAC documents the cluster requirements, not the steps to build the cluster. See Requirements for those requirements.

CosmicAC components

These components make up CosmicAC. Most run outside your cluster as part of the self-hosted platform, and the per-job agents run inside each job's VMI.

app-ui — web interface. A browser dashboard for creating and managing jobs.
cosmicac-cli — command-line interface. Submits jobs, manages resources, and connects to containers from your terminal.
app-node — application server. Serves the HTTP API, authenticates requests, and routes commands to the orchestrator.
wrk-ork — orchestrator. Allocates resources, distributes jobs across the cluster, and routes requests to the workers.
wrk-server-k8s-nvidia — Kubernetes server worker. Connects to your cluster's Kubernetes API and provisions the GPU VMs.
proxy-inference — inference proxy. Authenticates Managed Inference requests, balances load, and routes them to model servers.
wrk-agent-instance — GPU Container agent. Runs inside a GPU Container Job's VMI and accepts shell sessions over hyperswarm-ssh.
wrk-agent-inference — Managed Inference agent. Runs inside a Managed Inference Job's VMI, serves the model with vLLM, and registers itself in the DHT table.
redis — in-memory data store. app-node uses it for caching and runtime state, with persistence enabled.
caddy — web entry point. Serves the UI and reverse-proxies API and inference traffic on port 5173.

Caddy is the external entry point on port 5173. The cosmicac-ui image builds the single-page application (SPA) into a shared ui_dist volume, and Caddy serves it while proxying /api to app-node and /inference to proxy-inference. Caddy and the inference HTTP worker share app-node's network namespace because the upstream HTTP services bind to container loopback.

Holepunch stack

Inside CosmicAC, the components connect to each other over the Holepunch peer-to-peer (p2p) stack rather than through a central server. Components address each other directly, so there's no central broker to route, bottleneck, or expose internal traffic:

Hyperswarm — peer-to-peer networking. Components find and connect to each other directly, without a central broker.
HRPC — Hyperswarm RPC. Carries internal calls between app-node, wrk-ork, and the workers.
hyperswarm-ssh — SSH over Hyperswarm. Lets cosmicac-cli shell directly into a running GPU Container Job.
DHT table — distributed hash table. Managed Inference model servers register here, and proxy-inference discovers them by topic.
HyperDB + Autobase — distributed database. Stores usage metrics and job metadata.

GPU Container architecture

A GPU Container Job runs your workload inside a KubeVirt VMI with a GPU and shell access.

How a job starts. When you submit a job from app-ui or cosmicac-cli, it travels through the CosmicAC components to your cluster.

app-node authenticates the request and forwards it to wrk-ork.
wrk-ork routes the job to wrk-server-k8s-nvidia.
wrk-server-k8s-nvidia instructs the Kubernetes control plane to schedule the workload.
Kubernetes creates a pod containing a VMI, with wrk-agent-instance running inside it.

How a shell connects. Once the VMI is running, cosmicac-cli connects directly to wrk-agent-instance over hyperswarm-ssh. Your commands reach the VMI over the Holepunch p2p stack rather than through app-node, so the interactive session doesn't depend on the control path that submitted the job.

Managed Inference architecture

A Managed Inference Job runs an open-source language model with vLLM inside a VMI, and exposes it through proxy-inference as an OpenAI-compatible endpoint, which authenticates requests and balances load. You reach the model through that endpoint from any OpenAI-compatible client, or by running inference directly with cosmicac-cli.

How the job starts. When you create a Managed Inference Job from app-ui, the request flows through app-node and wrk-ork to wrk-server-k8s-nvidia, which schedules a pod with a VMI running wrk-agent-inference (vLLM). On spin-up, wrk-agent-inference registers itself in the DHT table so the proxy can find it.

How a request is served. Serving traffic follows a separate path from job creation:

A client sends a request to the inference endpoint over the OpenAI-compatible API, or you run inference from cosmicac-cli.
proxy-inference authenticates the request, searches the DHT table by topic to discover a model server, and balances load across the running servers.
wrk-agent-inference runs the request with vLLM and returns the response.

Job lifecycle

A job moves through these states from when you create it until you delete it.

Stopping a job pauses it, and you can start it again later. Deleting a job removes it and its allocated resources.

Isolation and security

VM-level isolation — each job runs in its own KubeVirt VMI inside a non-privileged pod, with Kubernetes security controls applied.
Secure GPU access — CosmicAC exposes GPUs to the VMIs without privileged containers.

Deployment architecture

CosmicAC components

Holepunch stack

GPU Container architecture

Managed Inference architecture

Job lifecycle

Isolation and security

Next steps

GPU Container Job

Managed Inference

Installation

On this page