ClusterCockpit Metric Store
The cc-metric-store provides a simple in-memory time series database for storing metrics of cluster nodes at preconfigured intervals. It is meant to be used as part of the ClusterCockpit suite. As all data is kept in-memory, accessing it is very fast. It also provides topology aware aggregations over time and nodes/sockets/cpus.
The storage engine is provided by the cc-backend package (`cc-backend/pkg/metricstore`); this repository provides the HTTP API wrapper.
The NATS.io-based write endpoint and the HTTP write endpoint both consume messages in the InfluxDB line protocol format.
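An incoming message is a standard line protocol line: measurement (the metric name), a tag set, a field set, and a timestamp. The tag names shown here (`cluster`, `hostname`, `type`) follow cc-metric-collector conventions and are illustrative:

```
cpu_load,cluster=testcluster,hostname=host1,type=node value=1.23 1700000000
```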
Building
cc-metric-store can be built using the provided Makefile.
It supports the following targets:
- `make`: Build the application, copy an example configuration file and generate checkpoint folders if required.
- `make clean`: Clean the Go build cache and the application binary
- `make distclean`: In addition to the `clean` target, also remove the `./var` folder and `config.json`
- `make swagger`: Regenerate the Swagger files from the source comments.
- `make test`: Run tests and basic checks (`go build`, `go vet`, `go test`).
Running
./cc-metric-store -config /path/to/config.json
./cc-metric-store -dev # Enable Swagger UI at /swagger/
./cc-metric-store -loglevel debug # debug|info|warn (default)|err|crit
./cc-metric-store -logdate # Add date and time to log messages
./cc-metric-store -version # Show version information and exit
./cc-metric-store -gops # Enable gops agent for debugging
REST API Endpoints
The REST API is documented in swagger.json. You can
explore and try the REST API using the integrated SwaggerUI web
interface (requires the -dev flag).
For more information on the cc-metric-store REST API have a look at the
ClusterCockpit documentation website.
All endpoints support both trailing-slash and non-trailing-slash variants:
| Method | Path | Description |
|---|---|---|
| GET | `/api/query/` | Query metrics with selectors |
| POST | `/api/write/` | Write metrics (InfluxDB line protocol) |
| POST | `/api/free/` | Free buffers up to a timestamp |
| GET | `/api/debug/` | Dump internal state |
| GET | `/api/healthcheck/` | Check node health status |
If `jwt-public-key` is set in `config.json`, all endpoints require JWT
authentication using an Ed25519 key (`Authorization: Bearer <token>` header).
Run tests
Some benchmarks concurrently access the MemoryStore, so enabling the
Race Detector can be useful. The benchmarks also double as tests, since they
check that the returned values are as expected.
go test -v ./...
# Benchmarks as well
go test -bench=. -race -v ./...
What are these selectors mentioned in the code?
The cc-metric-store works as a time-series database and uses the InfluxDB line
protocol as its input format. Unlike InfluxDB, the data is indexed by a single,
strictly hierarchical tree structure. A selector is built from the tags in the
InfluxDB line protocol and can be used to select a node (not necessarily in the
sense of a compute node; it can also be a socket, cpu, ...) in that tree. The
implementation calls those nodes levels to avoid confusion. It is impossible to
access data by knowing only the socket or cpu tag -- all levels higher up have
to be specified as well.
This is what the hierarchy currently looks like:
- cluster1
  - host1
    - socket0
    - socket1
    - ...
    - cpu1
    - cpu2
    - cpu3
    - cpu4
    - ...
    - gpu1
    - gpu2
  - host2
  - ...
- cluster2
  - ...
Example selectors:
- `["cluster1", "host1", "cpu0"]`: Select only cpu0 of host1 in cluster1
- `["cluster1", "host1", ["cpu4", "cpu5", "cpu6", "cpu7"]]`: Select only CPUs 4-7 of host1 in cluster1
- `["cluster1", "host1"]`: Select the complete node. If querying for a CPU-specific metric such as flops, all CPUs are implied
Config file
The config file is a JSON document with four top-level sections.
main
{
  "addr": "0.0.0.0:8082",
  "https-cert-file": "",
  "https-key-file": "",
  "jwt-public-key": "<base64-encoded Ed25519 public key>",
  "user": "",
  "group": "",
  "backend-url": ""
}
- `addr`: Address and port to listen on (default: `0.0.0.0:8082`)
- `https-cert-file` / `https-key-file`: Paths to TLS certificate/key for HTTPS
- `jwt-public-key`: Base64-encoded Ed25519 public key for JWT authentication. If empty, no auth is required.
- `user` / `group`: Drop privileges to this user/group after startup
- `backend-url`: Optional URL of a cc-backend instance used as node provider
metrics
Per-metric configuration. Each key is the metric name:
{
  "cpu_load": { "frequency": 60, "aggregation": null },
  "flops_any": { "frequency": 60, "aggregation": "sum" },
  "cpu_user": { "frequency": 60, "aggregation": "avg" }
}
- `frequency`: Sampling interval in seconds
- `aggregation`: How to aggregate sub-level data: `"sum"`, `"avg"`, or `null` (no aggregation)
metric-store
{
  "checkpoints": {
    "file-format": "wal",
    "directory": "./var/checkpoints"
  },
  "memory-cap": 100,
  "retention-in-memory": "24h",
  "num-workers": 0,
  "cleanup": {
    "mode": "archive",
    "directory": "./var/archive"
  },
  "nats-subscriptions": [
    { "subscribe-to": "hpc-nats", "cluster-tag": "fritz" }
  ]
}
- `checkpoints.file-format`: Checkpoint format: `"json"` (human-readable) or `"wal"` (binary WAL, crash-safe). See Checkpoint formats below.
- `checkpoints.directory`: Root directory for checkpoint files, organized per cluster and host below this directory
- `memory-cap`: Approximate memory cap in MB for metric buffers
- `retention-in-memory`: How long to keep data in memory (e.g. `"48h"`)
- `num-workers`: Number of parallel workers for checkpoint/archive I/O (0 = auto, capped at 10)
- `cleanup.mode`: What to do with data older than `retention-in-memory`: `"archive"` (write Parquet) or `"delete"`
- `cleanup.directory`: Root directory for Parquet archive files (required when `mode` is `"archive"`)
- `nats-subscriptions`: List of NATS subjects to subscribe to, with associated cluster tag
Checkpoint formats
The checkpoints.file-format field controls how in-memory data is persisted to disk.
`"json"` -- human-readable JSON snapshots written periodically. Each snapshot
contains the full metric hierarchy. Easy to inspect and recover manually, but
larger on disk and slower to write.
`"wal"` -- binary Write-Ahead Log format designed for crash safety. Two file
types are used per host:
- `current.wal` -- append-only binary log. Every incoming data point is appended immediately (magic `0xCC1DA7A1`, 4-byte CRC32 per record). Truncated trailing records from unclean shutdowns are silently skipped on restart.
- `.bin` snapshot -- binary snapshot written at each checkpoint interval (magic `0xCC5B0001`). Contains the complete hierarchical metric state column-by-column. Written atomically via a `.tmp` rename.
On startup the most recent .bin snapshot is loaded, then any remaining WAL
entries are replayed on top. The WAL is rotated (old file deleted, new one
started) after each successful snapshot.
The `"wal"` option is the default and will be the only supported option in the
future. The `"json"` checkpoint format is still provided to allow migration
from previous cc-metric-store versions.
Parquet archive
When cleanup.mode is "archive", data that ages out of the in-memory
retention window is written to Apache Parquet
files before being freed. Files are organized as:
<cleanup.directory>/
  <cluster>/
    <timestamp>.parquet
One Parquet file is produced per cluster per cleanup run, consolidating all hosts. Rows use a long (tidy) schema:
| Column | Type | Description |
|---|---|---|
| `cluster` | string | Cluster name |
| `hostname` | string | Host name |
| `metric` | string | Metric name |
| `scope` | string | Hardware scope (node, socket, core, hwthread, accelerator, ...) |
| `scope_id` | string | Numeric ID within the scope (e.g. `"0"`) |
| `timestamp` | int64 | Unix timestamp (seconds) |
| `frequency` | int64 | Sampling interval in seconds |
| `value` | float32 | Metric value |
Files are compressed with Zstandard and sorted by (cluster, hostname, metric, timestamp) for efficient columnar reads. The `cpu` prefix in the tree is
treated as an alias for the `hwthread` scope.
nats
{
  "address": "nats://0.0.0.0:4222",
  "username": "root",
  "password": "root"
}
NATS connection is optional. If not configured, only the HTTP write endpoint is available.
For more information see the ClusterCockpit documentation website.
Test the complete setup (excluding cc-backend itself)
There are two ways to send data to the cc-metric-store, both of which are supported by the cc-metric-collector. This example uses NATS; the alternative is HTTP. First, start a local NATS server:
docker pull nats:latest
# Start the NATS server
docker run -p 4222:4222 -ti nats:latest
Second, build and start the cc-metric-collector using the following as Sink-Config:
{
  "type": "nats",
  "host": "localhost",
  "port": "4222",
  "database": "updates"
}
Third, build and start the metric store. For this example here, the
config.json file already in the repository should work just fine.
cd cc-metric-store
make
./cc-metric-store
And finally, use the API to fetch some data. The API is protected by JWT based
authentication if jwt-public-key is set in config.json. You can use this JWT
for testing:
eyJ0eXAiOiJKV1QiLCJhbGciOiJFZERTQSJ9.eyJ1c2VyIjoiYWRtaW4iLCJyb2xlcyI6WyJST0xFX0FETUlOIiwiUk9MRV9BTkFMWVNUIiwiUk9MRV9VU0VSIl19.d-3_3FZTsadPjDEdsWrrQ7nS0edMAR4zjl-eK7rJU3HziNBfI9PDHDIpJVHTNN5E5SlLGLFXctWyKAkwhXL-Dw
# If the collector, the store, and the NATS server have been running for at least 60 seconds on the same host:
curl -H "Authorization: Bearer $JWT" \
"http://localhost:8082/api/query/" \
-d '{
"cluster": "testcluster",
"from": '"$(expr $(date +%s) - 60)"',
"to": '"$(date +%s)"',
"queries": [{ "metric": "cpu_load", "host": "'"$(hostname)"'" }]
}'
For debugging, the debug endpoint dumps the current content to stdout:
# Dump everything
curl -H "Authorization: Bearer $JWT" "http://localhost:8082/api/debug/"
# Dump a specific selector (colon-separated path)
curl -H "Authorization: Bearer $JWT" "http://localhost:8082/api/debug/?selector=testcluster:host1"