TheJJ / ceph-cheatsheet Public

Notifications You must be signed in to change notification settings
Fork 78
Star 334

All(tm) you ever wanted to know about operating a Ceph cluster!

334 stars 78 forks Branches Tags Activity

TheJJ/ceph-cheatsheet

Folders and files

Name		Name	Last commit message	Last commit date
Latest commit History 52 Commits
README.md		README.md

Repository files navigation

Ceph Cheatsheet

Released under GPLv3 or any later version.

Table of Contents

Ceph Cheatsheet
- What?
- Components
- Principle
  - Data placement
- Setup
  - Architecture
  - Setup Links
  - Monitor Setup
    - Settings
    - Monmap
  - Manager Setup
    - Crash dump collection
  - Object Storage Devices
    - OSD Memory Amount
    - Add BlueStore OSD
      - Automatic Discovery and Startup
      - Adding many OSDs at once
      - Reformat an OSD
    - Separate OSD Database and Bulk Storage
      - Migrate of OSD journal and database
  - Metadata Server
  - Balancer
    - Shard sizes
    - JJ's Ceph Balancer
    - Ceph's built-in Balancer
  - Erasure Coding
  - Pools
    - Placement group autoscaling
  - Crushmap
  - CephFS
    - CephFS Setup
      - Add Users
        
        CephFS Quotas and Layouts
        
        Snapshot Permissions
      - Subdatapool
      - RBD datapool for client
      - Namespace
      - Quota Config
      - CephFS Snapshots
        
        Finding CephFS Snapshots
    - CephFS Inodes
    - CephFS Status
  - MDS Online Scrub
    - Tuning CephFS
    - MDS Slow ops
  - RADOS Block Devices RBD
    - Kernel RBD client
      - Tuning KRBD
    - Useful RBD commands
    - Automatic mapping
    - Manual mapping
    - LVM on RBD
    - RBD Status
  - Cluster Performance
    - Slow OSD commits
    - Messenger logging
    - Fast Pool reads
    - Recovery Speed
    - Crush parameter tuning
  - Random infos
  - Minimal client config
- Operation
  - Cluster status
  - Utilization
  - Daemon control and info
  - OSD adding and removing
  - Device control
  - Device performance
    - Raw device speed
    - OSD speed
    - RADOS speed
    - RBD speed
    - CephFS speed
  - PG Autoscale
  - PGs not starting
    - Stuck after adding new OSDs
  - Recovery
  - Data Corruption
    - OSD crashes
    - Incomplete PGs
  - Decrypt OSDs
- Kernel Feature List
- Large Clusters
  - OSDMap Cache
- Tricks
  - Combine Multiple RBDs
  - RBD client local cache
  - Available Space
  - Tips

What?

Ceph is a distributed storage cluster.

In this file, I try to compress my knowledge and recommendations about operating Ceph clusters.

I'm sorry for the chaos in this cheatsheet, it basically serve(s/d) as my searchable command reference... If you think it's worth a shot, please submit pullrequests.

How to debug.

Components

Component	Description
Client	Something that connects to the Ceph cluster to access data.
Cluster	All Ceph components as a whole, providing storage (RADOS)
Object Store Device (OSD)	Actually stores data on single drive (key-value db)
Monitor (MON)	Coordinates the cluster (odd number, >=1), stores state on local disk
Metadata Server (MDS)	Handles filesystem inode transactions, stores its state on OSDs
Manager (MGR)	Collects statistics, balances, hosts webui, collects crashes, ...

Thing	Description
Object	Data stored under a key (name), like a C++ `unordered_map` or Python `dict`
Pool	Group of objects to store in the same way (redundancy, placement), access realm
Namespace	Object name prefix to create access realms
Placement Group	Partition of a pool by object key (name) hashes

Principle

The gist of how Ceph works:

All services store their data as "objects", usually 4MiB size. A huge file or a block device is thus split up into 4MiB pieces.

An object is "randomly" placed on some OSDs, depending on placement rules to ensure desired redundancy.

Ceph provides basically 4 services to clients:

Block device (RBD)
Network filesystem (CephFS)
Object gateway (RGW, S3, Swift)
Raw key-value storage via (lib)rados

A ceph-client is e.g.

Linux kernel that has CephFS mounted, e.g. at /srv/hugestorage
- which provides you a mounted directory where you can store whatever you want
Linux kernel that has a RBD mapped as /dev/rbd42, on top of which can be LVM, a filesystem, ...
- which you can then use like a regular block device/filesystem for whatever Linux service you want
QEMU that uses a RBD as virtual disk for the VM
NFS Ganesha that converts a CephFS directory to NFS
Samba that uses vfs_ceph to provide CephFS directories via CIFS
The ceph, rbd, rados, ... commandline tools (yes, they're just clients).

Data placement

Why does Ceph scale? Why is it secure(tm) and safe(tm)?

Basically, data is stored on multiple OSDs, while respecting constraints. When OSDs fail, the missing copy is automatically recreated somewhere else.

Cluster partitioning in pools, PGs and shards:

A cluster consists of pools, each can have custom redundancy and placement settings
A pool is partitioned in 2^x placement groups (PG) - an object is stored in one of its pools PGs
The chosen pool redundancy now affects each PG e.g. "3 copies on separate servers" (3 OSDs) or "raid 6 on 12 servers" (12 OSDs)
Each OSD participating in serving a PG is called a PG shard. Which OSD should serve what PG shard is determined by CRUSH, respecting the desired redundancy and topology constraints

To read/write an object:

Needed: Pool of the object, object's key ("name")
Hash the object's key
Take the last x bits of the hash and choose the placement group (that's why we have 2^x PGs)
Use the CRUSH algorithm (and upmap updates) to find the primary OSD id of this placement group
Look in the "OSDMap" to figure out the for the IP address of the PG's primary OSD
Establish a connection and talk to the primary OSD of this placement group and retrieve the object data
- In case the pool stores data as erasure code ("RAID"), the primary OSD contacts the remaining needed OSDs in its PG for reconstructing the data
- When writing, the PG's primary OSD contacts all other OSDs in the same PG to let them write, and then waits until they have acknowledged, and then confirms the write to the client

Since all data is chunked into (usually) 4MiB blocks, each of the blocks of a file is in a different PG, i.e. on different OSDs, thus we talk to a different primary OSD for each block.

-> All requests are spread accross all OSDs of the whole cluster

Recovery: When the desired object redundancy is no longer met (due to unavailable OSDs - drive-, server-, rack-failure), Ceph recreates the missing data automatically
Adding OSDs: move ("backfill") some of the existing placement groups to the new OSDs so every OSD hopefully stores roughly the same amount of data.
Removing OSDs: move ("backfill") the placement groups existing on the to-be-removed OSDs to others that are not removed.

Setup

Ceph generally works very well if you use it like others are using it. Once you deviate from the default, prepare for unforseen consequences.

But I've never lost a bit of data so far, I had to apply extensive massage a few times, but I got it to recover every time (so far).

Architecture

Ceph is pretty flexible, and things work "better" the more uniform your setup is. All in all you have to run Ceph's components on the machines you have, so storage is created magically.

As a reminder, the components:

client: e.g. a Linux kernel, a QEMU process, a samba or NFS server to provide storage for a VM, a webserver, BigBlueButton recordings, whatever.
MON: cluster synchronizer, you should have 2n+1 many of them, usually 3 or 5 so you can tolerate 1 or 2 outages (or rather: maintenances).
OSD: a single key-value storage device, which actually stores all data
MGR: statistics collector, where e.g. OSDs report to
MDS: CephFS inode metadata service, which uses OSDs to store its metadata, and clients write and retrieve data pointed to by the MDS metadata directly from OSDs.

How you create it, is up to you, but know this:

OSD:
- Each OSD requires >1GiB RAM, I recommend 4GiB. It's mainly for caching, configurable with osd_memory_target.
- It needs a quite capable CPU (e.g. the erasure coding, compression, and of course the regular storage request path).
MON:
- The CPU doesn't need to be too crazy, but should be good enough.
- The MON storage should be fast: All management actions are quicker then.
MDS:
- The CPU should be quite capable when handling lots of requests, otherwise not super important.
- The MDS metadata-pool (the CephFS metadata) should be very fast.
- The MDS itself needs lots of ram, depending on your filesystem open file count (usually >4GiB, but can be >32GiB for millions of files).
MGR:
- Doesn't need any storage, just a medium-grade CPU and 2G RAM maybe.

You can run these services on any device that's suitable, combine them, provide the actual storage drives over FibreChannel to a Linux server running the Ceph OSDs, whatever.

For example:

1 Server. Run one MON, one MGR and OSDs, and two MDS (if you use CephFS).
- I'd say if you want at least some performance, have at least 6 OSDs
- You can store data in replicated or erasure coded pools
- In such a "cluster" you can still store >500T, I know such a thing...
3 Servers. All have a MON and a couple of OSDs. You store data by in a replicated pool, each data copy on a separate server.
Many Servers. Each has some OSDs (4 to 64) and all are clustered together.
- Separate servers for MONs and OSDS.
- Ideally also independent MDS servers (if you use CephFS), but you can run them on the MON server with enough RAM.
- You can store data in erasure coded (EC, a bit like a RAID5/6/...) pools, but you need sufficiently many servers then, or else it has to behave like one big server and things go down whenever you restart only one of them.
- That's the "standard" setup.

Each server should have 2x10GiB bond/link aggregation or at least 10GiB connectivity.

I wouldn't create separate "public" and "cluster" networks, since having one big link provides more peak performance for both scenarios - more internal and external traffic bandwidth.

Setup Links

What hardware?

Ceph needs an odd number of >=1 MONs to get a quorum.

For better understanding the setup, I recommend the manual method.

For config options, see the upstream documentation.

Monitor Setup

Monitor config documentaion

Follow the 'manual method' above to add a ceph-$monid monitor, where $monid usually is a letter from a-z, but we use creative names (the host name).

Monitors don't use ceph.conf for their addressing, they store it internally. IP changing guide.

For more hosts, monitors can be added.

Settings

http://docs.ceph.com/docs/mimic/rados/configuration/mon-osd-interaction/

# if too many osds are out, don't operate. mon osd min in ratio = 100

Monmap

To edit the monitor map (e.g. to change names and IP addresses of monitors):

ceph-mon --extract-monmap --name mon.monname /tmp/monmap monmaptool --print /tmp/monmap monmaptool --help # edit the monmap ceph-mon --inject-monmap --name mon.monname /tmp/monmap

Manager Setup

Manager config documentation

You should run one manager for each monitor, but having more doesn't hurt. Offloads work from MONs and allows scaling beyond 1000 OSDs (statistics and other unimportant stuff like disk usage) One MGR is active, all others are on standby.

This creates a access key for cephx for the manager.

sudo -u ceph mkdir -p /var/lib/ceph/mgr/ceph-$mgrid ceph auth get-or-create mgr.$mgrid mon 'allow profile mgr' osd 'allow *' mds 'allow *' -o /var/lib/ceph/mgr/ceph-$mgrid/keyring # test it: sudo -u ceph ceph-mgr -i $mgrid -d # actually activate it sudo systemctl enable --now ceph-mgr@$mgrid.service

The manager provides a shiny dashboard and other plugins (e.g. the balancer)

Crash dump collection

The manager has a crash collection module.

Enable it:

ceph mgr module enable crash

Setup crash collection with ceph-crash.service:

On each of your servers with OSDs etc, deploy /etc/ceph/ceph.client.crash.keyring, containing the key for client.crash, created like this:

ceph auth get-or-create client.crash mon 'profile crash' mgr 'profile crash' # or, restrict the crash reports to a specific subnet! ceph auth get-or-create client.crash mon 'profile crash' mgr 'profile crash network 1337.42.42.0/24'

Then enable and start ceph-crash.service. ceph-crash.service runs on on each server periodically looks inside /var/lib/ceph/crash for new-to-report crashes, and then uses ceph crash post to send them to the MGR. In order to post, it tries client.crash as username, and to submit we need both mgr and mon crash-caps, otherwise the upload will fail with [errno 13] error connecting to the cluster and Error EACCES: access denied: does your client key have mgr caps?.

The profile crash allows you running ceph crash post (which the ceph-crash uses to actually report stuff).

New crashes appear in ceph status. Details: ceph crash

Object Storage Devices

How to add devices.

Before adding OSDs, you can do ceph osd set noin so new disks are not filled automatically. Once you want it to be filled, do ceph osd in $osdid.

OSD Memory Amount

Ceph since v15 supports global cluster configuration, so you don't need to mess with distributing a ceph.conf any more.

To configure the amount of memory each OSD should occupy (in total: workmem + rest for cache):

ceph config set osd osd_memory_target $size_in_byte

1073741824 = 1GiB etc.
I'd recommend 2GiB or if you can spare it 4GiB. 1GiB works pretty ok, though.
The rest of the RAM will be used by the Linux block cache anyway

Configure the per-osd memory differently by masking the config to a host name:

ceph config set osd/host:yourspecialcheapservername osd_memory_target $size_in_byte

To disable automatic resizing of the OSD memory:

ceph config set osd/host:yourspecialcheapservername bluestore_cache_autotune false # set the cache size with: bluestore_cache_size, bluestore_cache_size_hdd, bluestore_cache_size_ssd

This can also be set in a host's ceph.conf to test without updating the cluster config:

[osd] osd_memory_target = 4294967296 # 4GiB

Or set it non-permanently at runtime (use osd.somenumber to target just one OSD instead of all):

ceph tell 'osd.*' injectargs '--osd_memory_target=2147483648'

Add BlueStore OSD

Add a BlueStore device, with the help of LVM. In the LVM tags, metainfo is stored.

The data and journal (WAL) and keyvalue-DB can be placed on different devices (HDD and SSD).

Use --dmcrypt to encrypt the HDD. This just uses LUKS! The key are stored in the MONs.

Use --crush-device-class somename to assign a device class (any name is possible), autodetected are hdd, ssd and nvme.

sudo ceph-volume lvm create --dmcrypt --data /dev/partition

If you use /dev/partition, an LVM LV is created:

The VG is ceph-$(uuidgen) (generated with python3 -c "import uuid; print(uuid.uuid4())")
The LV is osd-block-$(uuidgen)

If you use vg/lvname instead of /dev/partition, the lv is luksFormated and then initialized.

SSDs should be set up with noop ioscheduler, HDDs with deadline. These are settings of Linux.

Adding/removing OSDs.

Failed OSDs can be removed with ceph osd purge-new $id.

Information about the drive is placed in LVM in its tags:

sudo lvs --readonly --separator=" " -o lv_tags,lv_path,lv_name,vg_name,lv_uuid,lv_size [optional_path_to_lv]

Device setup (opening, decryption, ...) is done with ceph-volume and ceph-volume-systemd.

The secret hdd keys for --dmcrypt are stored in the config-key database in the mons.

ceph config-key dump | grep dm-crypt

Automatic Discovery and Startup

ceph-volume can enumerate all attached disks and start up the OSDs. This will create the systemd service files for starting the OSDs at the next boot, too.

sudo ceph-volume lvm activate --all

This allows your OSD hosts to be completely stateless! You can even boot your OSD system over network with PXE that way, and just start all the OSDs system-independently.

Adding many OSDs at once

Ensure the cluster is healthy (HEALTH_OK)
ceph osd set norebalance nobackfill
Add the OSDs with normal procedure as above
Let all OSDs peer, this might take a few minutes
ceph osd unset norebalance nobackfill
- now the cluster fills up the new OSDs
Everything's done once cluster is on HEALTH_OK again

Reformat an OSD

In case a OSD is corrupted somehow, and you want to re-initialize it's on-disk data structures (i.e. delete everything), you can wipe the blockdevice and create a new BlueStore FS.

This keeps all the ceph-volum and encryption things in-place, and just resets the OSD itself.

# observe and decide that OSD $osdid needs an reset # go to the osd directory cd /var/lib/ceph/osd/ceph-$osdid # clear the bluestore header dd if=/dev/zero of=./block count=1 bs=100MB # create a new bluestore filesystem sudo ceph-osd -f --id $osdid --setuser ceph --setgroup ceph --mkfs # start the service again sudo systemctl start ceph-osd@$osdid.service

Now the OSD is fresh and clean and rejoins the cluster.

Separate OSD Database and Bulk Storage

You can store the metadata and data of an OSD on different devices.

Usually, you store the database and journal on a fastdevice:

ceph-volume lvm create --data /dev/slowdevice --block.db /dev/fastdevice

If your fastdevice is too small, you can store only the journal on it:

ceph-volume lvm create --data /dev/slowdevice --block.wal /dev/fastdevice

You can even store on three different blockdevices: data, block.db and block.wal.

Because a fast device (SSD, NVMe is usually so much faster than a HDD), you can use multiple partitions and place one DB on each.

If the "external" DB is full, the data-device will be used to store the remaining information. BlueStore will automatically relocate often-used data to the fast device then.

Migrate of OSD journal and database

With ceph-bluestore-tool, you can create, migrate expand and merge OSD block devices.

To move the block.wal from an all-in-one OSD to a new device with target full partition size:

ceph-bluestore-tool --command bluefs-bdev-new-wal --dev-target /dev/system/osdwal$id --path /var/lib/ceph/osd/ceph-$id

The same is possible with DB (i.e. wal + most hot rocksdb data):

ceph-bluestore-tool --command bluefs-bdev-new-db --dev-target /dev/system/osdwal$id --path /var/lib/ceph/osd/ceph-$id

For example, view the sizes of all involved BlueFS block devices:

ceph-bluestore-tool --command bluefs-stats --path /var/lib/ceph/osd/ceph-$i

You can pass some arguments via env-variables if needed: CEPH_ARGS="--bluestore-block-db-size 2147483648" ceph-bluestore-tool ...
To resize a block.db, use bluefs-bdev-expand (e.g. when the underlying partition size was increased)
To merge separate block.db or block.wal drives onto the slow disk, use bdev-migrate
Details for all the commands are in the manpage

Metadata Server

Metadata servers (MDS) are needed for the CephFS.

sudo -u ceph mkdir -p /var/lib/ceph/mds/ceph-$mdsid sudo -u ceph ceph-authtool --create-keyring /var/lib/ceph/mds/ceph-$mdsid/keyring --gen-key -n mds.$mdsid sudo -u ceph ceph auth add mds.$mdsid osd "allow rwx" mds "allow" mon "allow profile mds" -i /var/lib/ceph/mds/ceph-$mdsid/keyring # test with sudo ceph-mds -f --cluster ceph --id $mdsid --setuser ceph --setgroup ceph sudo systemctl enable --now ceph-mds@$mdsid.service

Multiple MDS servers can be active, they distribute the inode workload. Kernel clients support this [since Linux 4.14](#Kernel feature list).

To assign a hot-standby to every active MDS:

ceph fs set $fsname allow_standby_replay

Balancer

This is super-important to use when you have different-sized OSDs!

Basically we improve the placement of PGs on top of the CRUSH algorithm's distribution.

It distributes PGs on OSDs such that available space is maximized and/or load is evenly distributed

Also, make sure for right balancing that big pools have enough PGs, otherwise each shard ("piece") of the PG is very big already.

I recommend a PG shard should be around 10-50 GiB. When balancing, this is the amount that can be moved from one OSD to another.

Shard sizes

To see the shard size of your pools, use:

# github.com/TheJJ/ceph-balancer ./placementoptimizer.py show

Or calculate it by hand:

# shard size calculation if pool is replica: shardsize = pool_size / pg_num elif pool is erasurecoded(n+m): shardsize = pool_size / (pg_num * n)

JJ's Ceph Balancer

I recommend using my own balancer: https://github.com/TheJJ/ceph-balancer It's scientifically approved(tm) and yields great(r) results :)

# generate 100 balancing movements ./placementoptimizer.py -v balance -m 100 > /tmp/balance-instructions # after you are happy with the results: bash /tmp/balance-instructions # repeat (and/or generate more at once) if you want :)

If this works for you, please send a mail to jj -at- sft.lol so I can collect samples and improve the algorithm even more.

Ceph's built-in Balancer

Ceph also has a built-in balancer which can also produce good results, it just considers even PG distribution (by id), but it does not respect device fill levels or pool/shard sizes.

# to see what the mgr is doing internally with # tail -f ceph-mgr.*.log | grep balancer ceph tell 'mgr.*' injectargs -- --debug_mgr=4/5 # balancer commands: ceph balancer status ceph balancer mode upmap # upmap items as movement method, not reweighting. ceph balancer eval # evaluate current score ceph balancer optimize myplan # create a plan, don't run it yet ceph balancer eval myplan # evaluate score after myplan. optimal is 0 ceph balancer show myplan # display what plan would do ceph balancer execute myplan # run plan, this misplaces the objects ceph balancer rm myplan # view auto-balancer status and durations ceph tell 'mgr.$activemgrid' balancer status

Use upmap mode: relocate single PGs as "override" to CRUSH. Needs Ceph Luminous and [Linux kernel 4.13](#Kernel feature list).

# If you have Luminous and kernel >=4.13 it may still complain about a too old client, # but we know what we're doing :) ceph osd set-require-min-compat-client luminous --yes-i-really-mean-it

To further optimize placement and really adjust the equal PG-count to be within a bound of 1.

ceph config set mgr mgr/balancer/upmap_max_deviation 1

Erasure Coding

RAID6 with Ceph.

Create a new profile, standard_8_2 is the arbitrary name.

ceph osd erasure-code-profile set standard_8_2 k=8 m=2 crush-failure-domain=osd

# show the how pools are configured, expecially which ec-profile was assigned to a pool ceph osd pool ls detail --format json | jq -C .

crush-failure-domain ensures that no two chunks of an object are stored on the same osd.

Pools

To run CephFS on an erasure coded (ec) pool, we need allow_ec_overrides.

# erasure coding pool (for data) ceph osd pool create lol_data 32 32 erasure standard_8_2 ceph osd pool set lol_data allow_ec_overwrites true # replicated pools (for metadata) ceph osd pool create lol_root 32 replicated ceph osd pool create lol_metadata 32 replicated # min_size: minimal osd count (per PG) before a PG goes offline ceph osd pool set lol_root size 3 ceph osd pool set lol_root min_size 2 ceph osd pool set lol_metadata size 3 ceph osd pool set lol_metadata min_size 2 # for lol_data the size and min_size are determined by the ec profile

In a CephFS volume, you can have multiple storage pools "mounted" at any directory.

# for example, this 8+3 pool can be to store some directories 'more safe' ceph osd erasure-code-profile set backup_8_3 k=8 m=3 crush-failure-domain=osd ceph osd pool create lol_backup 64 64 erasure backup_8_3 ceph osd pool set lol_backup allow_ec_overwrites true # in the cephfs, assign it to a directory and all its _new_ content: setfattr -n ceph.dir.layout.pool -v lol_backup your-backup-directory-name

Pool quotas

# set max storage bytes to 1TiB (uses shell-calculation) ceph osd pool set-quota funny_pool_name max_bytes $((1 * 1024 ** 4)) # limit number of objects ceph osd pool set-quota funny_pool_name max_objects 1000000

Placement group autoscaling

nautilus support automatic creation and pruing of placement groups.

ceph mgr module enable pg_autoscaler # view autoscale information and what the autoscaler would do ceph osd pool autoscale-status # policy for newly created pools # I recommend setting warn, and _not_ on. ceph config set global osd_pool_default_pg_autoscale_mode # policy per-pool # warn, on or off. recommended by me: warn. ceph osd pool set $pool pg_autoscale_mode

# help the autoscaler by providing target_ratio: # the fraction of total cluster size this pool is expected to consume. ceph osd pool set foo target_size_ratio .2 # only warn that the pg-count is suboptimal ceph osd pool set $poolname pg_autoscale_mode warn # enable automatic pg adjustments on the given pool ceph osd pool set $poolname pg_autoscale_mode on

Crushmap

# get and decompile ceph osd getcrushmap > /tmp/map crushtool -d /tmp/map -o /tmp/map.txt # remember the returned crushmap version # edit /tmp/map.txt # compile and update crushmap crushtool -c /tmp/map.txt -o /tmp/map_new ceph osd setcrushmap -i /tmp/map_new previous_crushmap_version

compare crushmaps:

crushtool -i crushmap --compare crushmap.new

In a rule, device classes can be used to select OSDs for a CRUSH rule. You can define and assign arbitrary device classes, e.g. use huge, fast... ssd, hdd and nvme detected automatically, but any class can be assigned:

ceph osd crush set-device-class $deviceclass $osdid

To select just OSDs of a given class in a CRUSH rule:

step take default => step take default class hdd

Assign pools to placement rules.

ceph osd pool set crush_rule

CRUSH rule commands:

rule rulename { # unique id id 1 type replicated # which pools can use this rule? min_size 1 # -> all pools max_size 10 # now the device selection steps # start at the default bucket, but only for ssd devices step take default class ssd # chooseleaf firstn: recursively explore bucket to look for single devices # choose firstn: select bucket for next step # 0: choose as many buckets as needed for copies (-1: one less than needed, 3: exactly three) # host: bucket type to choose for the next step step chooseleaf firstn 0 type host # the set of osds was selected step emit }

To test where CRUSH will place PGs, use crushtool:

crushtool -i crushmap.bin --test --rule $ruleid --num-rep $reps --show-mappings # other options: --show-bad-mappings

CephFS

On top of RADOS, Ceph provides a POSIX filesystem.

CephFS Setup

ceph fs new lolfs lol_metadata lol_root mount -v -t ceph -o name=lolroot,secretfile=lolfs.root.secret 10.0.18.1:6789:/ mnt/

newer way:

ceph fs volume create ...

Add Users

Create an user that has full access to /

ceph fs authorize lolfs client.lolroot / rw

You can restict access to a path (this is only for Metadata, i.e. inode infos! use namespaces to restrict data access).

ceph fs authorize lolfs client.some.weird_name /lol/ho_ho_ho rw /stuff r

These commands just create regular auth keys, you can view its (effective) permissions with:

ceph auth get client.some.weird_name

CephFS Quotas and Layouts

To allow this client to configure quotas and layouts, it needs the p flag. Beware that this overwrites existing caps, so make sure to auth get first, and then update.

ceph auth caps client.lolroot mon 'allow r' mds 'allow rwp' osd 'allow rw tag cephfs data=cephfsname'

Snapshot Permissions

To allow a key to manage snapshots, it needs the s permission flag for MDS.

ceph auth caps client.blabla ... mds 'allow rw path=/some/dir, allow rws path=/some/dir/snapshots_allowed, ...' ...

This client can now access /some/dir and can manage snapshots in /some/dir/snapshots_allowed or any folder below.

Subdatapool

New data can be written to a different pool (e.g. an ec-pool).

Any folder can be assigned a new custom pool. New file content is then written to the new pool (new == file has had size 0 before). Existing file content will stay in their old pool. Only when the content is written again from 0 (e.g. copy), the new pool is used.

# this automatically sets the application tag for 'cephfs' to data=$cephfsname # so a client automatically has access! ceph fs add_data_pool fsname poolname

To assign this pool to a folder:

setfattr -n ceph.dir.layout.pool -v poolname foldername

The client needs access to this pool. The "simple" way is matching to the cephfs name in the auth key: Clients will automatically have access to all pools through the data=cephfsname tag, if the access key has the osd 'allow rw tag cephfs data=cephfsname' capability. You can set the tag manually (but why would you do that?) with: ceph osd pool application set cephfs .

Alternatively, you can grant access to pools explicitly: allow rw pool=poolname, allow r pool=someotherpool, ....

Access is better restricted by namespace, though: Namespaces are pool-independent and there can be many namespaces per pool.

RBD datapool for client

Some client tools don't support specifying an RBD data pool.

With this "trick", you can set a RBD data pool with erasure coding in OpenStack Cinder, Proxmox, ...

By selecting one ceph.conf and a client name, we set the data pool name as default. If needed, you can have more ceph.conf files, as long as your client tool allows specifying at least the config file name...

[global] fsid = your-fs-id mon_host = mon1.rofl.lol, mon2.rofl.lol, mon3.rofl.lol [client.your-awesome-user] rbd default data pool = your-ec-poolname

Namespace

OSD access would allow reading (and writing!) CephFS objects directly in the pool, even though the MDS prevents mounts through the path restriction.

To actually restrict access, objects can be prefixed with a namespace so the OSDs can check access through namespace restriction.

Set the namespace on a directory of the CephFS

sudo setfattr -n ceph.dir.layout.pool_namespace -v $namespacename /mnt/cephfs/directory/name

Change auth caps for a client to only mount /directory/name (mds restriction) and read osd data from namespace $namespacename on cephfs lolfs.

ceph auth caps client.somename mds 'allow rw path=/directory/name' mon 'allow r' osd 'allow rw namespace=$namespacename tag cephfs data=lolfs'

You can grant access to multiple namespaces to a CephFS client with allow rw namespace=$ns1, allow r namespace=$ns2, allow rw .....

CephFS namespaces are supported on kernel clients since [Linux 4.8](#Kernel feature list) (https://github.com/torvalds/linux/commit/72b5ac54d620b29cae23d25f0405f2765b466f72).

Quota Config

To set a quota on a CephFS subdirectory, use:

setfattr -n ceph.quota.max_bytes -v 20971520 /a/directory # 20 MiB setfattr -n ceph.quota.max_files -v 5000 /another/dir # 5000 files

To remove the quota, set the value to 0.

CephFS quotas work since Linux 4.17.

CephFS Snapshots

A client with the s permission for MDS can manage snapshots. A snapshot is created by creating a directory: dir/to/backup/.snap/snapshot_name.

Finding CephFS Snapshots

As snapshots are directories in hidden .snap directories, but it's tedious to find them since they can be anywhere and are hidden.

To figure out where they are, you can:

Option 1: Ask the active MDS server for the snapshot inodes:
- ceph tell mds.$yourmds dump snaps
- Then look up the inode, see below.
Option 2: Get snaps directly from metadata pool. This skips figuring out which MDS is now correct to ask:

rados -p get mds_snaptable - | ceph-dencoder type SnapServer skip 8 import - decode dump_json

{ "snapserver": { "last_snap": 6776, "last_created": 6776, "last_destroyed": 6737, "pending_noop": [], "snaps": [ { "snapid": 6369, "ino": 6597072470362, "stamp": "2023-11-08T01:00:05.596850+0100", "name": "best-snapshot-name-ever" }, { "snapid": 6370, "ino": 1099511699630, "stamp": "2023-11-08T01:00:05.612619+0100", "name": "really-terrible-snapshot-name" }, ...

These inode numbers can then also be resolved.

CephFS Inodes

Getting a file name and path for an inode number in CephFS:

# have a inode number of a file with e.g. `ls -li ` rados -p getxattr $(printf %x ).00000000 parent | ceph-dencoder type inode_backtrace_t import - decode dump_json # for directories, you need the instead!

{ "ino": 1099511627850, "ancestors": [ { "dirino": 1099511627847, "dname": "cute_kitten.mkv", "version": 902 }, { "dirino": 1099511627864, "dname": "really_shady_stuff", "version": 913 }, { "dirino": 1, "dname": "tax_evasion_plans", "version": 10413 } ], "pool": 14, "old_pools": [ 2 ] }

In the parent xattr of an inode object, CephFS stores a inode_backtrace_t structure (with a list of inode_backpointer_t in ancestors as the file path). Decoding it gives you the file path.

CephFS Status

# show status overview ceph fs status # dump all filesystem info ceph fs dump # get info for specific fs ceph fs get lolfs

Show connected CephFS clients and their IPs

ceph tell mds.$mdsid client ls

MDS Online Scrub

# flush journal ceph daemon mds.$id flush journal # online scrub ceph daemon mds.$id scrub_path /path/on/fs recursive # tell ceph that mds $0 has been repaired ceph mds repaired $cephfs_name:0

Tuning CephFS

Enable fast_read on pools - all OSDs in EC pools will be queried instead of only the first n.
Enable inline data: Store content for (default <4KB) in inode: ceph fs set lolfs inline_data yes. "small files" are then stored in the metadatapool without a datapool access needed.

MDS Slow ops

mds.mds1 [WRN] slow request 30.633018 seconds old, received at 2020-09-12 17:38:03.970677: client_request(client.148012229:9530909 getattr AsLsXsFs #0x100531c49cf caller_uid=3860, caller_gid=3860{}) currently failed to rdlock, waiting

Here 100531c49cf is the file inode number.

Then you can get the path of the file by looking up the inode

RADOS Block Devices RBD

Create pools: One non-EC pool for metadata first, optionally more data pools
If a data pool is an EC-Pool, allow ec_overwrites on it
- ceph osd pool set lol_pool allow_ec_overwrites true
- [Linux 4.11](#Kernel feature list) is required if the Kernel should map the RBD
Prepare all pools for RBD usage: rbd pool init $pool_name
If you have a pool named rbd, it's the default (metadata) rbd pool
You can store rbd data and metadata on separate pools, see the --data-pool option below

Now, images can be created in that pool:

Optionally, create a rbd namespace to restrict access to that image (supported since Nautilus 14.0 and [Kernel 4.19](#Kernel feature list)):
- rbd --pool $poolname --namespace $namespacename namespace create
Create an image: rbd create --pool $metadata_pool_name --data-pool $storage_pool_name --namespace $namespacename --size 20G $imagename
Display namespaces: rbd --pool $metadata_pool_name namespace ls
Display images in a namespace: rbd --pool $metadata_pool_name --namespace $namespacename ls
Create access key for whole pools: ceph auth get-or-create client.$name mon 'profile rbd' osd 'profile rbd pool=$metadata_pool_name, profile rbd pool=$storage_pool_name, profile rbd-read-only pool=$someotherpoolname'
Create access key for a specific rbd namespace: ceph auth get-or-create client.$name mon 'profile rbd' osd 'profile rbd pool=$metadata_pool_name namespace=$namespacename, profile rbd pool=$storage_pool_name namespace=$namespacename'
- Important: restrict the namespace on both the storage and metadata pool! Otherwise this key can read other images' data.

To map an image on a client to /dev/rbdxxx (using monitor addresses from /etc/ceph/ceph.conf):

sudo rbd device map --name client.$name -k keyring [$metadata_pool_name/[$namespacename/]]$imagename[@$snapshotname]
- -t nbd to mount as nbd-device
- --namespace $namespacename to specify the rbd namespace (as an alternative to the image "path" above)

Kernel RBD client

Some Linux kernel rbd clients ("krbd") don't support all features of rbd images.

Available rbd features declared here and listed here and defined in the kernel code.

krbd kernel features: [see in the kernel feature list](#Kernel feature list)

# if you get this dmesg-output: rbd: image $imagename: image uses unsupported features: 0x38 # view enabled features of this image: rbd --pool $meta_data_pool --namespace $namespacename info $imagename # then disable the unsupported features: rbd --pool $meta_data_pool --namespace $namespacename feature disable $imagename $unavailable_feature_name $anotherfeature...

Tuning KRBD

You get the most out of your RBD if you start tweaking some knobs.

Generally: Your client is in control - it has to begin every request to the Ceph cluster

writes: writes can be parallelized easily (write-cache), even if they occur sequentially from your actual program
reads: reads can't be parallelized easily, since many programs just read in a sequential fashion (if they read in parallel, it works just like parallel writes)
- One solution is to "guess" that the program will want to read more data: configure read_ahead
- Another solution is to do read-caching with lvmcache on a client-local SSD
have a look at your metrics: ceph_osd_op_r_process_latency (and w), ceph_rbd_read_latency (and write), and ceph_osd_commit_latency_ms
- when sequential read ops are requested, you won't get more than 1/ceph_osd_op_r_process_latency in seconds ops per second (say 15ms -> max 66 iops)
- given a HDD-pool with read-latency of 10ms average (which is very good), you'll get max 100 random read ops/s from one client.

suggested /etc/udev/rules.d/80-rbd.rules for tweaking read and write

# parallel requests - this only works if the rbd was mapped with at least queue_depth=512! ACTION=="add", KERNEL=="rbd*", ATTR{queue/nr_requests}="512" # read up to 8 ceph objects in advance when the fs on the rbd decides for it ACTION=="add", KERNEL=="rbd*", ATTR{bdi/read_ahead_kb}="32768"

read tuning: read ahead
- set the maximum amount of bytes that can be pre-read (i.e. read in parallel!)
- the actual number is determined on the fly e.g. by the filesystem depending on the read pattern.
- configurable in: /sys/block/rbdxxx/queue/read_ahead_kb
  - to read up to 8 Ceph RBD objects in advance, we have to read 8 * 4096KiB = 32768 KiB
  - echo 32768 > /sys/block/rbdxxx/queue/read_ahead_kb
- you can't get faster than the OSDs - but lvmcache on SSDs and the client's page cache can help drastically with read performance
write tuning: IO queue optimizations
- once your filesystem/... is done (it merges tiny-files into bigger blocks, does journaling and metadata), somewhere the raw io requests will be submitted for the block device
- these requests are then placed in the block device queue, which is handled by your io scheduler (usually mq-deadline)
- this block device IO operation queue has a depth, configurable in /sys/block/rbdxxx/queue/nr_requests and is 256 by default (this number allocates slots for the block io queue). I think this is a good default. If you have lots of small IOPS, increase this to 512.
  - the Linux documentation says "the total allocated number may be twice this amount, since it applies only to reads or writes". My experiment on Linux 5.4 with mq-deadline RBD has shown fio with rw direct=1 sync=0 iodepth=512 with nr_requests=256 produces only 256 ceph ops (mixed r and w). Setting nr_requests to 512 yields 512 ceph ops (mixed r and w) Hence, nr_requests seems to be the total amount of pending IOPS for the mq-deadline Ceph RBD.
- in the block-io-queue, operations are selected by some algorithm (e.g. deadline) and merged with each other because they are adjacent. The resulting 'optimized' operations are then given to the block driver (rbd in our case)
- the RBD driver converts the IO requests to Ceph ops, i.e. selects which OSDs to send the ops over network
- the RBD has its own ceph op queue, which is allocated when the device is mapped, set by queue_depth (e.g. in /etc/ceph/rbdmap or rbd device map your/namespaced/rbd -o queue_depth=256. It's 128 by default, but I would set it to 256 (since 256 slots are allocated for the io queue above (nr_requests). Use 512 if you increased nr_requests to 512)
- the RBD driver then submits the operations, you can see those in tcpdump or /sys/kernel/debug/ceph/$fsid-$yourclientid/osdc (get the clientid by rbd status your/namespaced/rbd)
- since kernel 5.7 krbd processes the mq-deadline queue with multi-threading (i.e. picking ops from the io-queue and sending/receiving ops over network) for even more speed
ext4 optimizations:
- usually your rbd objects will be 4 MiB (with 64KiB minmal allocations). so ext4 can respect this and distribute allocations across RADOS objects if you create the fs the following way:
  - mkfs -E nodiscard,stride=1024,stripe_width=1024 /dev/rbd/your/rbd
  - why 1024? the rbd has 4096 byte blocks, i.e. 1024 * 4096 = 4MiB = the object size
  - it makes no sense also considering the EC sharding sizes, since all requests end up at the primary OSD of a PG anyway
- you can also specify the stripe width when mounting the fs (it will take stripe_width by default):
  - mount -o defaults,noauto,stripe=1024 /dev/rbd/your/rbd /your/mountpoint
- to see the 'default' stripe width chosen when you don't specify it as mount option, you can inspect dumpe2fs /dev/rbd/your/rbd and look for RAID stripe width;

Tests with fio on a mounted filesystem on a RBD:

make sure
- io scheduler's queue size is 512 (nr_requests)
- you allocated 512 RBD "hardware" queue slots (queue_depth)
- the ext4 uses 1024-block stripe size (stripe)
- open another terminal where you do watch -n 0.5 cat osdc to see the osdc contents quickly
non-cached queued writes:
- since we use O_DIRECT, Linux's write cache and io scheduler (mq-deadline) queue is bypassed
- You should see around 512 outstanding ops in the osdc file (limited by nr_requests, queue_depth and the fio iodepth)
- fio --filename=/mnt/rbd-fs/file --size=20G --direct=1 --sync=0 --iodepth=512 --runtime=15 --ioengine=libaio --time_based --rw=rw --bs=64K --numjobs=1 --group_reporting --name=test
cached queued writes
- we don't use O_DIRECT here and thus use Linux's write cache
- you should see much less ops in osdc, but a bit more performance in fio since the the cache and io scheduler merges all the ops
- in iostat -xm 2 you can see how many ops the io scheduler merges (wrqm/rrqm)
- fio --filename=/mnt/rbd-fs/file --size=20G --direct=0 --sync=0 --iodepth=1024 --runtime=15 --ioengine=libaio --time_based --rw=rw --bs=64K --numjobs=1 --group_reporting --name=test
cached but synched writes
- now we make sure every single operation is synced to disk (sync=1), but we still use the io scheduler (direct=0).
- we allow to submit 512 operations from fio in the io scheduler queue, but we sync the ext4 after each one, forcing the io scheduler queue to flush - thus we only allow 1 op at a time.
- this is really slow, since it causes ext4 and the io scheduler to guarantee a transaction for every 64KiB write
- fio --filename=/mnt/rbd-fs/file --size=20G --direct=0 --sync=1 --iodepth=1 --runtime=15 --ioengine=libaio --time_based --rw=rw --bs=64K --numjobs=1 --group_reporting --name=test
non-cached and synched writes
- we now bypass the cache (direct=1) and submit 512 operations, and sync after each one (sync=1)
- i'm really not sure why, but we see max 512 ops at a time in osdc - the cache bypass seems to enable parallel synching somehow?
- fio --filename=/mnt/rbd-fs/file --size=20G --direct=1 --sync=1 --iodepth=1 --runtime=15 --ioengine=libaio --time_based --rw=rw --bs=64K --numjobs=1 --group_reporting --name=test
non-cached read test
- we don't use the cache and io scheduler and submit 512 read requests
- we will see up to 512 parallel reads in osdc (sync doesn't matter for reads)
- fio --filename=/mnt/rbd-fs/file --size=20G --direct=1 --sync=0 --iodepth=512 --runtime=15 --ioengine=libaio --time_based --rw=read --bs=64K --numjobs=1 --group_reporting --name=test
cached read test
- we use the io scheduler and cache and submit 512 read requests
- interestingly, and I can't explain this, this is much slower than with direct=1, and in osdc there's a masimum of 3 read ops only?
- fio --filename=/mnt/rbd-fs/file --size=20G --direct=0 --sync=0 --iodepth=512 --runtime=15 --ioengine=libaio --time_based --rw=read --bs=64K --numjobs=1 --group_reporting --name=test
you can also use --rw=randrw/randread for random access, since --rw=rw tests sequentially

Useful RBD commands

https://docs.ceph.com/en/latest/rbd/rados-rbd-cmds/

When you have a fresh rbd device, use mkfs.ext4 -E nodiscard to skip the discard step
List namespaces of pool: rbd namespace ls $pool
List images in namespace: rbd ls $pool/$namespace
List images without namespace: rbd ls $pool
Show info: rbd info $pool/$namespace/$image
Show clients (IP, id) who have mapped the RBD: rbd status $pool/$namespace/$image
Show pending deleted images: rbd trash ls [$pool]
Size increase: rbd resize --size 9001T $pool/$img
Size decrease: rbd resize --size 20M $pool/$img --allow-shrink
QoS on RBD pool, restrict to 1000 ops: rbd config pool set $pool rbd_qos_iops_limit 1000
Remove QoS on RBD: rbd config pool rm $pool rbd_qos_iops_limit
QoS on RBD image, restrict to 1000 ops: rbd config image set $pool/$img rbd_qos_iops_limit 1000
Remove QoS on RBD images: rbd config image rm $pool/$img rbd_qos_iops_limit
Show performance of RBD pool images: rbd perf image iotop --pool $pool

Automatic mapping

Automatic RBD mapping with rbdmap.service, configured in /etc/ceph/rbdmap:

$poolname/$namespacename/$imagename name=client.$username,keyring=/etc/ceph/ceph.client.$username.keyring

I really recommend putting LVM on the RBD, because then you can decide to do local caching with lvmcache someday.

In /etc/fstab, note the noauto - we need it because the rbdmap tool and not systemd shall mount the filesystem. When you use ext4, add option stripe=1024 (see above for explaination).

either mount the RBD directly

/dev/rbd/$metadata_pool_name/$namespacename/$imagename /your/$mountpoint $filesystem defaults,noatime,noauto 0 0

or if you used LVM

/dev/yourvg/yourlv /your/$mountpoint $filesystem defaults,noatime,noauto 0 0

These entries (due to the noauto) won't mount at boot. This is good, since the RBD is not mapped at that point anyway.

To actually mount when the RBD was mapped with rbdmap.service, create a mount script (don't forget to mark it executable):

/etc/ceph/rbd.d/$rbd_metadata_pool/$namespacename/$imagename

#!/bin/bash mountpoint -q /your/mount || mount /your/mount

Manual mapping

Instead of rbdmap.service we can directly use sysfs to map:

echo $ceph_monitor_ip name=$username,secret=$secret,queue_depth=512 $pool $image > /sys/bus/rbd/add

LVM on RBD

If you want to create a lvm pv directly on the rbd, you may need to add its device type in /etc/lvm/lvm.conf:

types=["rbd", 255]

If you did not configure the above, pvcreate will complain Device /dev/rbd... excluded by a filter. If the pv is already on the RBD, pvscan just doesn't find anything.

RBD Status

Show connected RBD clients and their IPs

rbd status $pool/$namespace/$image

Cluster Performance

Each OSD should serve 50 to 250 placement groups in total (see with ceph osd df tree). SSDs can take more (100 to 500), and this can also enhance throughput.

Slow OSD commits

a single device can degrade cluster performance significantly. the slowest 15 OSD commit times:

watch -n 1 'ceph osd perf | sort -n -k3 | tail -n 15'