drbd.conf - DRBD Configuration Files
DRBD implements block devices which replicate their data to all nodes of a
cluster. The actual data and associated metadata are usually stored
redundantly on "ordinary" block devices on each cluster node.
Replicated block devices are called
/dev/drbd<minor> by
default. They are grouped into resources, with one or more devices per
resource. Replication among the devices in a resource takes place in
chronological order. With DRBD, we refer to the devices inside a resource as
volumes.
In DRBD 9, a resource can be replicated between two or more cluster nodes. The
connections between cluster nodes are point-to-point links, and use TCP or a
TCP-like protocol. All nodes must be directly connected.
DRBD consists of low-level user-space components which interact with the kernel
and perform basic operations (
drbdsetup,
drbdmeta), a
high-level user-space component which understands and processes the DRBD
configuration and translates it into basic operations of the low-level
components (
drbdadm), and a kernel component.
The default DRBD configuration consists of
/etc/drbd.conf and of
additional files included from there, usually
global_common.conf and
all
*.res files inside
/etc/drbd.d/. It has turned
out to be useful to define each resource in a separate
*.res file.
The configuration files are designed so that each cluster node can contain an
identical copy of the entire cluster configuration. The host name of each node
(as given by
uname -n) determines which parts of the configuration apply. It is
highly recommended to keep the cluster configuration on all nodes in sync by
manually copying it to all nodes, or by automating the process with
csync2 or a similar tool.
global {
usage-count yes;
udev-always-use-vnr;
}
resource r0 {
net {
cram-hmac-alg sha1;
shared-secret "FooFunFactory";
}
volume 0 {
device /dev/drbd1;
disk /dev/sda7;
meta-disk internal;
}
on alice {
node-id 0;
address 10.1.1.31:7000;
}
on bob {
node-id 1;
address 10.1.1.32:7000;
}
connection {
host alice port 7000;
host bob port 7000;
net {
protocol C;
}
}
}
This example defines a resource
r0 which contains a single replicated
device with volume number 0. The resource is replicated among hosts
alice and
bob, which have the IPv4 addresses
10.1.1.31
and
10.1.1.32 and the node identifiers 0 and 1, respectively. On both
hosts, the replicated device is called
/dev/drbd1, and the actual data
and metadata are stored on the lower-level device
/dev/sda7. The
connection between the hosts uses protocol C.
Please refer to the
DRBD User's Guide[1] for more examples.
DRBD configuration files consist of sections, which contain other sections and
parameters depending on the section types. Each section consists of one or
more keywords, sometimes a section name, an opening brace (“{”),
the section's contents, and a closing brace (“}”). Parameters
inside a section consist of a keyword, followed by one or more keywords or
values, and a semicolon (“;”).
Some parameter values have a default scale which applies when a plain number is
specified (for example Kilo, or 1024 times the numeric value). Such default
scales can be overridden by using a suffix (for example,
M for Mega).
The common suffixes
K = 2^10 = 1024,
M = 1024 K, and
G =
1024 M are supported.
Comments start with a hash sign (“#”) and extend to the end of the
line. In addition, any section can be prefixed with the keyword
skip,
which causes the section and any sub-sections to be ignored.
Additional files can be included with the
include
file-pattern statement (see
glob(7) for the
expressions supported in
file-pattern). Include statements are only
allowed outside of sections.
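For example, a stock /etc/drbd.conf often consists only of include statements
along the following lines (the paths shown are the common packaged defaults and
may differ on your installation):
    include "drbd.d/global_common.conf";
    include "drbd.d/*.res";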
The following sections are defined (indentation indicates in which context):
common
   [disk]
   [handlers]
   [net]
   [options]
   [startup]
global
resource
   connection
      path
      net
      volume
         peer-device-options
      [peer-device-options]
   connection-mesh
      net
   [disk]
   floating
   handlers
   [net]
   on
      volume
         disk
         [disk]
   options
   stacked-on-top-of
   startup
Sections in brackets affect other parts of the configuration: inside the
common section, they apply to all resources. A
disk section
inside a
resource or
on section applies to all volumes of that
resource, and a
net section inside a
resource section applies to
all connections of that resource. This makes it possible to avoid repeating identical
options for each resource, connection, or volume. Options can be overridden in
a more specific
resource,
connection,
on, or
volume section.
peer-device-options are
resync-rate,
c-plan-ahead,
c-delay-target,
c-fill-target,
c-max-rate and
c-min-rate. For backward compatibility, they can also be specified in any
disk options section. They are inherited into all relevant
connections. If they are given at the
connection level, they are inherited
by all volumes on that connection. A
peer-device-options section is
started with the
disk keyword.
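As a sketch (host names taken from the example at the top of this page, values
purely illustrative), peer-device-options given at the connection level look
like this:
    connection {
        host alice port 7000;
        host bob   port 7000;
        disk {
            c-plan-ahead 20;
            c-max-rate   100M;   # example values only
        }
    }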
common
This section can contain one each of a
disk,
handlers,
net,
options, and
startup section. All resources inherit the
parameters in these sections as their default values.
connection [name]
Define a connection between two hosts. This section must contain two
host
parameters or multiple
path sections. The optional
name is used
to refer to the connection in the system log and in other messages. If no name
is specified, the peer's host name is used instead.
path
Define a path between two hosts. This section must contain two
host
parameters.
connection-mesh
Define a connection mesh between multiple hosts. This section must contain a
hosts parameter, which has the host names as arguments. This section is
a shortcut to define many connections which share the same network
options.
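A minimal sketch of a three-node mesh (alice and bob come from the example
above; charlie is a hypothetical third node):
    connection-mesh {
        hosts alice bob charlie;
        net {
            protocol C;
        }
    }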
disk
Define parameters for a volume. All parameters in this section are
optional.
floating [address-family]
addr:port
Like the
on section, except that instead of the host name a network
address is used to determine if it matches a
floating section.
The
node-id parameter in this section is required. If the
address
parameter is not provided, no connections to peers will be created by default.
The
device,
disk, and
meta-disk parameters must be
defined in, or inherited by, this section.
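A hedged sketch of a floating section, reusing the address and volume
parameters from the example at the top of this page (all values are
placeholders):
    floating 10.1.1.31:7000 {
        node-id   0;
        device    /dev/drbd1;
        disk      /dev/sda7;
        meta-disk internal;
    }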
global
Define some global parameters. All parameters in this section are optional. Only
one
global section is allowed in the configuration.
handlers
Define handlers to be invoked when certain events occur. The kernel passes the
resource name in the first command-line argument and sets the following
environment variables depending on the event's context:
•For events related to a particular
device: the device's minor number in DRBD_MINOR, the device's volume
number in DRBD_VOLUME.
•For events related to a particular
device on a particular peer: the connection endpoints in
DRBD_MY_ADDRESS, DRBD_MY_AF, DRBD_PEER_ADDRESS, and
DRBD_PEER_AF; the device's local minor number in DRBD_MINOR, and
the device's volume number in DRBD_VOLUME.
•For events related to a particular
connection: the connection endpoints in DRBD_MY_ADDRESS,
DRBD_MY_AF, DRBD_PEER_ADDRESS, and DRBD_PEER_AF; and, for
each device defined for that connection: the device's minor number in
DRBD_MINOR_volume-number.
•For events that identify a device, if
a lower-level device is attached, the lower-level device's device name is
passed in DRBD_BACKING_DEV (or
DRBD_BACKING_DEV_volume-number).
All parameters in this section are optional. Only a single handler can be
defined for each event; if no handler is defined, nothing will happen.
net
Define parameters for a connection. All parameters in this section are
optional.
on host-name [...]
Define the properties of a resource on a particular host or set of hosts.
Specifying more than one host name can make sense in a setup with IP address
failover, for example. The
host-name argument must match the Linux host
name (
uname -n).
Usually contains or inherits at least one
volume section. The
node-id and
address parameters must be defined in this section.
The
device,
disk, and
meta-disk parameters must be
defined in, or inherited by, this section.
A normal configuration file contains two or more
on sections for each
resource. Also see the
floating section.
options
Define parameters for a resource. All parameters in this section are
optional.
resource name
Define a resource. Usually contains at least two
on sections and at least
one
connection section.
stacked-on-top-of resource
Used instead of an
on section for configuring a stacked resource with
three to four nodes.
Starting with DRBD 9, stacking is deprecated. It is advised to use resources
which are replicated among more than two nodes instead.
startup
The parameters in this section determine the behavior of a resource at startup
time.
volume volume-number
Define a volume within a resource. The volume numbers in the various
volume sections of a resource define which devices on which hosts form
a replicated device.
host name [
address [address-family]
address] [
port port-number]
Defines an endpoint for a connection. Each
host statement refers to an
on section in a resource. If a port number is defined, this endpoint
will use the specified port instead of the port defined in the
on
section. Each
connection section must contain exactly two
host
parameters. Instead of two
host parameters the connection may contain
multiple
path sections.
host name [
address [address-family]
address] [
port port-number]
Defines an endpoint for a connection. Each
host statement refers to an
on section in a resource. If a port number is defined, this endpoint
will use the specified port instead of the port defined in the
on
section. Each
path section must contain exactly two
host
parameters.
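A hedged sketch of a connection with two redundant paths between the hosts from
the example above (the 10.1.2.x addresses stand for a hypothetical second
network):
    connection {
        path {
            host alice address 10.1.1.31:7000;
            host bob   address 10.1.1.32:7000;
        }
        path {
            host alice address 10.1.2.31:7000;
            host bob   address 10.1.2.32:7000;
        }
        net {
            protocol C;
        }
    }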
hosts name...
Defines all nodes of a mesh. Each
name refers to an
on
section in a resource. The port that is defined in the
on section will
be used.
al-extents extents
DRBD automatically maintains a "hot" or "active" disk area
likely to be written to again soon based on the recent write activity. The
"active" disk area can be written to immediately, while
"inactive" disk areas must be "activated" first, which
requires a meta-data write. We also refer to this active disk area as the
"activity log".
The activity log saves meta-data writes, but the whole log must be resynced upon
recovery of a failed node. The size of the activity log is a major factor of
how long a resync will take and how fast a replicated disk will become
consistent after a crash.
The activity log consists of a number of 4-Megabyte segments; the
al-extents parameter determines how many of those segments can be
active at the same time. The default value for
al-extents is 1237, with
a minimum of 7 and a maximum of 65536.
Note that the effective maximum may be smaller, depending on how you created the
device metadata; see also
drbdmeta(8). The effective maximum is 919 *
(available on-disk activity-log ring-buffer area / 4 kB - 1); the default 32 kB
ring buffer yields a maximum of 6433 (which covers more than 25 GiB of data). We
recommend keeping this well within the amount your backend storage and
replication link are able to resync within about 5 minutes.
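For example, a larger activity log can be configured in a disk section (the
value is only illustrative and must stay within the limits described above):
    disk {
        al-extents 3389;
    }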
al-updates {yes | no}
With this parameter, the activity log can be turned off entirely (see the
al-extents parameter). This will speed up writes because fewer
meta-data writes will be necessary, but the entire device needs to be
resynchronized upon recovery of a failed primary node. The default value for
al-updates is
yes.
disk-barrier,
disk-flushes,
disk-drain
DRBD has three methods of handling the
ordering of dependent write requests:
disk-barrier
Use disk barriers to make sure that requests
are written to disk in the right order. Barriers ensure that all requests
submitted before a barrier make it to the disk before any requests submitted
after the barrier. This is implemented using 'tagged command queuing' on SCSI
devices and 'native command queuing' on SATA devices. Only some devices and
device stacks support this method. The device mapper (LVM) only supports
barriers in some configurations.
Note that on systems which do not support disk barriers, enabling this option
can lead to data loss or corruption. Until DRBD 8.4.1,
disk-barrier was
turned on if the I/O stack below DRBD did support barriers. Kernels since
linux-2.6.36 (or 2.6.32 RHEL6) no longer allow DRBD to detect if barriers are
supported. Since drbd-8.4.2, this option is off by default and needs to be
enabled explicitly.
disk-flushes
Use disk flushes between dependent write
requests, also referred to as 'force unit access' by drive vendors. This
forces all data to disk. This option is enabled by default.
disk-drain
Wait for the request queue to
"drain" (that is, wait for the requests to finish) before submitting
a dependent write request. This method requires that requests are stable on
disk when they finish. Before DRBD 8.0.9, this was the only method
implemented. This option is enabled by default. Do not disable in production
environments.
From these three methods, drbd will use the first that is enabled and supported
by the backing storage device. If all three of these options are turned off,
DRBD will submit write requests without bothering about dependencies.
Depending on the I/O stack, write requests can be reordered, and they can be
submitted in a different order on different cluster nodes. This can result in
data loss or corruption. Therefore, turning off all three methods of
controlling write ordering is strongly discouraged.
A general guideline for configuring write ordering is to use disk barriers or
disk flushes when using ordinary disks (or an ordinary disk array) with a
volatile write cache. On storage without cache or with a battery backed write
cache, disk draining can be a reasonable choice.
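As a hedged sketch, a setup with a battery-backed write cache might rely on
draining only (verify your hardware guarantees before disabling flushes):
    disk {
        disk-barrier no;
        disk-flushes no;    # only safe with a non-volatile (battery-backed) write cache
        disk-drain   yes;
    }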
disk-timeout
If the lower-level device on which a DRBD
device stores its data does not finish an I/O request within the defined
disk-timeout, DRBD treats this as a failure. The lower-level device is
detached, and the device's disk state advances to Diskless. If DRBD is
connected to one or more peers, the failed request is passed on to one of
them.
This option is
dangerous and may lead to kernel panic!
"Aborting" requests, or force-detaching the disk, is intended for
completely blocked/hung local backing devices which no longer complete
requests at all, not even with error completions. In this situation, usually a
hard-reset and failover is the only way out.
By "aborting", basically faking a local error-completion, we allow for
a more graceful switchover by cleanly migrating services. Still, the affected
node has to be rebooted "soon".
By completing these requests, we allow the upper layers to re-use the associated
data pages.
If later the local backing device "recovers", and now DMAs some data
from disk into the original request pages, in the best case it will just put
random data into unused pages; but typically it will corrupt meanwhile
completely unrelated data, causing all sorts of damage.
This means that a delayed successful completion, especially of READ requests, is a
reason to panic(). We assume that a delayed *error* completion is OK, though
we still will complain noisily about it.
The default value of
disk-timeout is 0, which stands for an infinite
timeout. Timeouts are specified in units of 0.1 seconds. This option is
available since DRBD 8.3.12.
md-flushes
Enable disk flushes and disk barriers on the
meta-data device. This option is enabled by default. See the
disk-flushes parameter.
on-io-error handler
Configure how DRBD reacts to I/O errors on a lower-level device. The following
policies are defined:
pass_on
Change the disk status to Inconsistent, mark
the failed block as inconsistent in the bitmap, and retry the I/O operation on
a remote cluster node.
call-local-io-error
Call the local-io-error handler (see
the handlers section).
detach
Detach the lower-level device and continue in
diskless mode.
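For example, to select the detach policy:
    disk {
        on-io-error detach;
    }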
read-balancing policy
Distribute read requests among cluster nodes
as defined by
policy. The supported policies are
prefer-local
(the default),
prefer-remote,
round-robin,
least-pending,
when-congested-remote,
32K-striping,
64K-striping,
128K-striping,
256K-striping,
512K-striping and
1M-striping.
This option is available since DRBD 8.4.1.
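For example (choosing least-pending purely for illustration):
    disk {
        read-balancing least-pending;
    }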
resync-after res-name/volume
Define that a device should only resynchronize after the specified other device.
By default, no order between devices is defined, and all devices will
resynchronize in parallel. Depending on the configuration of the lower-level
devices, and the available network and disk bandwidth, this can slow down the
overall resync process. This option can be used to form a chain or tree of
dependencies among devices.
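For example, a device can be told to wait for volume 0 of the example resource
r0 before resyncing (a sketch; the resource and volume names depend on your
configuration):
    disk {
        resync-after r0/0;
    }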
rs-discard-granularity byte
When
rs-discard-granularity is set to a
non-zero, positive value, DRBD tries to do resync operations in requests
of this size. In case such a block contains only zero bytes on the sync source
node, the sync target node will issue a discard/trim/unmap command for the
area.
The value is constrained by the discard granularity of the backing block device.
In case
rs-discard-granularity is not a multiple of the discard
granularity of the backing block device, DRBD rounds it up. The feature only
becomes active if the backing block device reads back zeroes after a discard
command.
The default value is 0. This option is available since 8.4.7.
discard-zeroes-if-aligned {yes | no}
There are several aspects to discard/trim/unmap support on linux block devices.
Even if discard is supported in general, it may fail silently, or may
partially ignore discard requests. Devices also announce whether reading from
unmapped blocks returns defined data (usually zeroes), or undefined data
(possibly old data, possibly garbage).
If on different nodes, DRBD is backed by devices with differing discard
characteristics, discards may lead to data divergence (old data or garbage
left over on one backend, zeroes due to unmapped areas on the other backend).
Online verify would now potentially report tons of spurious differences. While
probably harmless for most use cases (fstrim on a file system), DRBD cannot
have that.
To play safe, we have to disable discard support, if our local backend (on a
Primary) does not support "discard_zeroes_data=true". We also have
to translate discards to explicit zero-out on the receiving side, unless the
receiving side (Secondary) supports "discard_zeroes_data=true",
thereby allocating areas that were supposed to be unmapped.
There are some devices (notably the LVM/DM thin provisioning) that are capable
of discard, but announce discard_zeroes_data=false. In the case of DM-thin,
discards aligned to the chunk size will be unmapped, and reading from unmapped
sectors will return zeroes. However, unaligned partial head or tail areas of
discard requests will be silently ignored.
If we now add a helper to explicitly zero-out these unaligned partial areas,
while passing on the discard of the aligned full chunks, we effectively
achieve discard_zeroes_data=true on such devices.
Setting
discard-zeroes-if-aligned to
yes will allow DRBD to use
discards, and to announce discard_zeroes_data=true, even on backends that
announce discard_zeroes_data=false.
Setting
discard-zeroes-if-aligned to
no will cause DRBD to always
fall-back to zero-out on the receiving side, and to not even announce discard
capabilities on the Primary, if the respective backend announces
discard_zeroes_data=false.
We used to ignore the discard_zeroes_data setting completely. To not break
established and expected behaviour, and suddenly cause fstrim on
thin-provisioned LVs to run out-of-space instead of freeing up space, the
default value is
yes.
This option is available since 8.4.7.
Please note that you open the section with the
disk keyword.
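A hedged sketch combining the two discard-related options (the values are
illustrative; the granularity is rounded to the backend's discard granularity
as described above):
    disk {
        rs-discard-granularity    65536;   # 64 KiB, example value
        discard-zeroes-if-aligned yes;
    }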
c-delay-target delay_target,
c-fill-target fill_target,
c-max-rate max_rate,
c-plan-ahead plan_time
Dynamically control the resync speed. This
mechanism is enabled by setting the
c-plan-ahead parameter to a
positive value. The goal is to either fill the buffers along the data path
with a defined amount of data if
c-fill-target is defined, or to have a
defined delay along the path if
c-delay-target is defined. The maximum
bandwidth is limited by the
c-max-rate parameter.
The
c-plan-ahead parameter defines how fast drbd adapts to changes in the
resync speed. It should be set to five times the network round-trip time or
more. Common values for
c-fill-target for "normal" data paths
range from 4K to 100K. If drbd-proxy is used, it is advised to use
c-delay-target instead of
c-fill-target. The
c-delay-target parameter is used if the
c-fill-target parameter
is undefined or set to 0. The
c-delay-target parameter should be set to
five times the network round-trip time or more. The
c-max-rate option
should be set to either the bandwidth available between the DRBD-hosts and the
machines hosting DRBD-proxy, or to the available disk bandwidth.
The default values of these parameters are:
c-plan-ahead = 20 (in units
of 0.1 seconds),
c-fill-target = 0 (in units of sectors),
c-delay-target = 1 (in units of 0.1 seconds), and
c-max-rate =
102400 (in units of KiB/s).
Dynamic resync speed control is available since DRBD 8.3.9.
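A hedged sketch of a dynamic resync controller configuration in a disk
(peer-device-options) section; all values are examples and should be tuned to
your network and storage:
    disk {
        c-plan-ahead  20;     # 2 seconds, in units of 0.1 seconds
        c-fill-target 100k;   # within the 4K-100K range mentioned above
        c-max-rate    100M;   # example upper limit
        c-min-rate    4M;     # example value (matches the default of 4096 KiB/s)
    }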
c-min-rate min_rate
A node which is primary and sync-source has to
schedule application I/O requests and resync I/O requests. The
c-min-rate parameter limits how much bandwidth is available for resync
I/O; the remaining bandwidth is used for application I/O.
A
c-min-rate value of 0 means that there is no limit on the resync I/O
bandwidth. This can slow down application I/O significantly. Use a value of 1
(1 KiB/s) for the lowest possible resync rate.
The default value of
c-min-rate is 4096, in units of KiB/s.
resync-rate rate
Define how much bandwidth DRBD may use for resynchronizing. DRBD allows
"normal" application I/O even during a resync. If the resync takes
up too much bandwidth, application I/O can become very slow. This parameter
helps to avoid that. Please note that this option only works when the dynamic
resync controller is disabled.
dialog-refresh time
The DRBD init script can be used to configure and start DRBD devices, which can
involve waiting for other cluster nodes. While waiting, the init script shows
the remaining waiting time. The
dialog-refresh defines the number of
seconds between updates of that countdown. The default value is 1; a value of
0 turns off the countdown.
disable-ip-verification
Normally, DRBD verifies that the IP addresses
in the configuration match the host names. Use the
disable-ip-verification parameter to disable these checks.
usage-count {yes | no | ask}
As explained on DRBD's
Online Usage
Counter[2] web page, DRBD includes a mechanism for anonymously counting
how many installations are using which versions of DRBD. The results are
available on the web page for anyone to see.
This parameter defines if a cluster node participates in the usage counter; the
supported values are
yes,
no, and
ask (ask the user, the
default).
We would like to ask users to participate in the online usage counter as this
provides us with valuable feedback for steering the development of DRBD.
udev-always-use-vnr
When udev asks drbdadm for a list of device
related symlinks, drbdadm would suggest symlinks with differing naming
conventions, depending on whether the resource has explicit volume VNR { }
definitions, or only one single volume with the implicit volume number 0:
# implicit single volume without "volume 0 {}" block
DEVICE=drbd<minor>
SYMLINK_BY_RES=drbd/by-res/<resource-name>
SYMLINK_BY_DISK=drbd/by-disk/<backing-disk-name>
# explicit volume definition: volume VNR { }
DEVICE=drbd<minor>
SYMLINK_BY_RES=drbd/by-res/<resource-name>/VNR
SYMLINK_BY_DISK=drbd/by-disk/<backing-disk-name>
If you define this parameter in the global section, drbdadm will always add the
.../VNR part, regardless of whether the volume definition was implicit
or explicit.
For legacy backward compatibility, this is off by default, but we recommend
enabling it.
after-resync-target cmd
Called on a resync target when a node state changes from
Inconsistent to
Consistent when a resync finishes. This handler can be used for
removing the snapshot created in the
before-resync-target
handler.
before-resync-target cmd
Called on a resync target before a resync begins. This handler can be used for
creating a snapshot of the lower-level device for the duration of the resync:
if the resync source becomes unavailable during a resync, reverting to the
snapshot can restore a consistent state.
before-resync-source cmd
Called on a resync source before a resync begins.
out-of-sync cmd
Called on all nodes after a
verify finishes and out-of-sync blocks were
found. This handler is mainly used for monitoring purposes. An example would
be to call a script that sends an alert SMS.
quorum-lost cmd
Called on a Primary that lost quorum. This handler is usually used to reboot the
node if it is not possible to restart the application that uses the storage on
top of DRBD.
fence-peer cmd
Called when a node should fence a resource on a particular peer. The handler
should not use the same communication path that DRBD uses for talking to the
peer.
unfence-peer cmd
Called when a node should remove fencing constraints from other nodes.
initial-split-brain cmd
Called when DRBD connects to a peer and detects that the peer is in a
split-brain state with the local node. This handler is also called for
split-brain scenarios which will be resolved automatically.
local-io-error cmd
Called when an I/O error occurs on a lower-level device.
pri-lost cmd
The local node is currently primary, but DRBD believes that it should become a
sync target. The node should give up its primary role.
pri-lost-after-sb cmd
The local node is currently primary, but it has lost the after-split-brain auto
recovery procedure. The node should be abandoned.
pri-on-incon-degr cmd
The local node is primary, and neither the local lower-level device nor a
lower-level device on a peer is up to date. (The primary has no device to read
from or to write to.)
split-brain cmd
DRBD has detected a split-brain situation which could not be resolved
automatically. Manual recovery is necessary. This handler can be used to call
for administrator attention.
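A hedged sketch of a handlers section; the scripts shown are shipped with
drbd-utils on many distributions, but paths and availability may differ:
    handlers {
        split-brain          "/usr/lib/drbd/notify-split-brain.sh root";
        out-of-sync          "/usr/lib/drbd/notify-out-of-sync.sh root";
        before-resync-target "/usr/lib/drbd/snapshot-resync-target-lvm.sh";
        after-resync-target  "/usr/lib/drbd/unsnapshot-resync-target-lvm.sh";
    }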
after-sb-0pri policy
Define how to react if a split-brain scenario
is detected and none of the two nodes is in primary role. (We detect
split-brain scenarios when two nodes connect; split-brain decisions are always
between two nodes.) The defined policies are:
disconnect
No automatic resynchronization; simply
disconnect.
discard-younger-primary,
discard-older-primary
Resynchronize from the node which became
primary first (discard-younger-primary) or last
(discard-older-primary). If both nodes became primary independently,
the discard-least-changes policy is used.
discard-zero-changes
If only one of the nodes wrote data since the
split brain situation was detected, resynchronize from this node to the other.
If both nodes wrote data, disconnect.
discard-least-changes
Resynchronize from the node with more modified
blocks.
discard-node-nodename
Always resynchronize to the named node.
after-sb-1pri policy
Define how to react if a split-brain scenario
is detected, with one node in primary role and one node in secondary role. (We
detect split-brain scenarios when two nodes connect, so split-brain decisions
are always between two nodes.) The defined policies are:
disconnect
No automatic resynchronization, simply
disconnect.
consensus
Discard the data on the secondary node if the
after-sb-0pri algorithm would also discard the data on the secondary
node. Otherwise, disconnect.
violently-as0p
Always take the decision of the
after-sb-0pri algorithm, even if it causes an erratic change of the
primary's view of the data. This is only useful if a single-node file system
(i.e., not OCFS2 or GFS) with the allow-two-primaries flag is used.
This option can cause the primary node to crash, and should not be used.
discard-secondary
Discard the data on the secondary node.
call-pri-lost-after-sb
Always take the decision of the
after-sb-0pri algorithm. If the decision is to discard the data on the
primary node, call the pri-lost-after-sb handler on the primary
node.
after-sb-2pri policy
Define how to react if a split-brain scenario
is detected and both nodes are in primary role. (We detect split-brain
scenarios when two nodes connect, so split-brain decisions are always between
two nodes.) The defined policies are:
disconnect
No automatic resynchronization, simply
disconnect.
violently-as0p
See the violently-as0p policy for
after-sb-1pri.
call-pri-lost-after-sb
Call the pri-lost-after-sb helper
program on one of the machines unless that machine can demote to secondary.
The helper program is expected to reboot the machine, which brings the node
into a secondary role. Which machine runs the helper program is determined by
the after-sb-0pri strategy.
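As a sketch, one possible combination of automatic split-brain recovery
policies (whether these are appropriate depends entirely on your tolerance for
discarding data):
    net {
        after-sb-0pri discard-zero-changes;
        after-sb-1pri discard-secondary;
        after-sb-2pri disconnect;
    }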
allow-two-primaries
The most common way to configure DRBD devices is to allow only one node to be
primary (and thus writable) at a time.
In some scenarios it is preferable to allow two nodes to be primary at once; a
mechanism outside of DRBD then must make sure that writes to the shared,
replicated device happen in a coordinated way. This can be done with a
shared-storage cluster file system like OCFS2 and GFS, or with virtual machine
images and a virtual machine manager that can migrate virtual machines between
physical machines.
The
allow-two-primaries parameter tells DRBD to allow two nodes to be
primary at the same time. Never enable this option when using a
non-distributed file system; otherwise, data corruption and node crashes will
result!
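A sketch for a dual-primary setup (only safe together with a cluster file
system or another coordination mechanism, as explained above):
    net {
        protocol C;
        allow-two-primaries yes;
    }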
always-asbp
Normally the automatic after-split-brain
policies are only used if current states of the UUIDs do not indicate the
presence of a third node.
With this option you request that the automatic after-split-brain policies are
used as long as the data sets of the nodes are somehow related. This might
cause a full sync, if the UUIDs indicate the presence of a third node. (Or
double faults led to strange UUID sets.)
connect-int time
As soon as a connection between two nodes is configured with
drbdsetup
connect, DRBD immediately tries to establish the connection. If this
fails, DRBD waits for
connect-int seconds and then repeats. The default
value of
connect-int is 10 seconds.
cram-hmac-alg hash-algorithm
Configure the hash-based message authentication code (HMAC) or secure hash
algorithm to use for peer authentication. The kernel supports a number of
different algorithms, some of which may be loadable as kernel modules. See the
shash algorithms listed in /proc/crypto. By default,
cram-hmac-alg is
unset. Peer authentication also requires a
shared-secret to be
configured.
csums-alg hash-algorithm
Normally, when two nodes resynchronize, the sync target requests a piece of
out-of-sync data from the sync source, and the sync source sends the data.
With many usage patterns, a significant number of those blocks will actually
be identical.
When a
csums-alg algorithm is specified, when requesting a piece of
out-of-sync data, the sync target also sends along a hash of the data it
currently has. The sync source compares this hash with its own version of the
data. It sends the sync target the new data if the hashes differ, and tells it
that the data are the same otherwise. This reduces the network bandwidth
required, at the cost of higher cpu utilization and possibly increased I/O on
the sync target.
The
csums-alg can be set to one of the secure hash algorithms supported
by the kernel; see the shash algorithms listed in /proc/crypto. By default,
csums-alg is unset.
csums-after-crash-only
Enabling this option (and csums-alg, above) makes it possible to use the
checksum based resync only for the first resync after primary crash, but not
for later "network hickups".
In most cases, block that are marked as need-to-be-resynced are in fact changed,
so calculating checksums, and both reading and writing the blocks on the
resync target is all effective overhead.
The advantage of checksum based resync is mostly after primary crash recovery,
where the recovery marked larger areas (those covered by the activity log) as
need-to-be-resynced, just in case. Introduced in 8.4.5.
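For example (sha1 is just one of the algorithms that may be available in your
kernel's /proc/crypto):
    net {
        csums-alg              sha1;
        csums-after-crash-only yes;
    }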
data-integrity-alg alg
DRBD normally relies on the data integrity
checks built into the TCP/IP protocol, but if a data integrity algorithm is
configured, it will additionally use this algorithm to make sure that the data
received over the network match what the sender has sent. If a data integrity
error is detected, DRBD will close the network connection and reconnect, which
will trigger a resync.
The
data-integrity-alg can be set to one of the secure hash algorithms
supported by the kernel; see the shash algorithms listed in /proc/crypto. By
default, this mechanism is turned off.
Because of the CPU overhead involved, we recommend not to use this option in
production environments. Also see the notes on data integrity below.
fencing fencing_policy
Fencing is a preventive measure to avoid situations where both nodes are
primary and disconnected. This is also known as a split-brain situation. DRBD
supports the following fencing policies:
dont-care
No fencing actions are taken. This is the
default policy.
resource-only
If a node becomes a disconnected primary, it
tries to fence the peer. This is done by calling the fence-peer
handler. The handler is supposed to reach the peer over an alternative
communication path and call 'drbdadm outdate minor' there.
resource-and-stonith
If a node becomes a disconnected primary, it
freezes all its IO operations and calls its fence-peer handler. The fence-peer
handler is supposed to reach the peer over an alternative communication path
and call 'drbdadm outdate minor' there. In case it cannot do that, it
should stonith the peer. IO is resumed as soon as the situation is resolved.
In case the fence-peer handler fails, I/O can be resumed manually with
'drbdadm resume-io'.
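A hedged sketch combining a fencing policy with fence handlers; the
crm-fence-peer scripts are shipped with drbd-utils for Pacemaker clusters, but
the exact paths may differ on your system:
    net {
        fencing resource-and-stonith;
    }
    handlers {
        fence-peer   "/usr/lib/drbd/crm-fence-peer.9.sh";
        unfence-peer "/usr/lib/drbd/crm-unfence-peer.9.sh";
    }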
ko-count number
If a secondary node fails to complete a write request in
ko-count times
the
timeout parameter, it is excluded from the cluster. The primary
node then sets the connection to this secondary node to Standalone. To disable
this feature, you should explicitly set it to 0; defaults may change between
versions.
max-buffers number
Limits the memory usage per DRBD minor device on the receiving side, or for
internal buffers during resync or online-verify. Unit is PAGE_SIZE, which is 4
KiB on most systems. The minimum possible setting is hard coded to 32 (=128
KiB). These buffers are used to hold data blocks while they are written
to/read from disk. To avoid possible distributed deadlocks on congestion, this
setting is used as a throttle threshold rather than a hard limit. Once more
than max-buffers pages are in use, further allocation from this pool is
throttled. You want to increase max-buffers if you cannot saturate the IO
backend on the receiving side.
max-epoch-size number
Define the maximum number of write requests DRBD may issue before issuing a
write barrier. The default value is 2048, with a minimum of 1 and a maximum of
20000. Setting this parameter to a value below 10 is likely to decrease
performance.
on-congestion policy,
congestion-fill threshold,
congestion-extents threshold
By default, DRBD blocks when the TCP send
queue is full. This prevents applications from generating further write
requests until more buffer space becomes available again.
When DRBD is used together with DRBD-proxy, it can be better to use the
pull-ahead on-congestion policy, which can switch DRBD into
ahead/behind mode before the send queue is full. DRBD then records the
differences between itself and the peer in its bitmap, but it no longer
replicates them to the peer. When enough buffer space becomes available again,
the node resynchronizes with the peer and switches back to normal replication.
This has the advantage of not blocking application I/O even when the queues fill
up, and the disadvantage that peer nodes can fall behind much further. Also,
while resynchronizing, peer nodes will become inconsistent.
The available congestion policies are
block (the default) and
pull-ahead. The
congestion-fill parameter defines how much data
is allowed to be "in flight" in this connection. The default value
is 0, which disables this mechanism of congestion control, with a maximum of
10 GiBytes. The
congestion-extents parameter defines how many bitmap
extents may be active before switching into ahead/behind mode, with the same
default and limits as the
al-extents parameter. The
congestion-extents parameter is effective only when set to a value
smaller than
al-extents.
Ahead/behind mode is available since DRBD 8.3.10.
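A hedged sketch for a DRBD-proxy setup (the values are illustrative; remember
that congestion-extents only takes effect when smaller than al-extents):
    net {
        on-congestion      pull-ahead;
        congestion-fill    2G;      # example amount of in-flight data
        congestion-extents 1000;    # must be smaller than al-extents (default 1237)
    }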
ping-int interval
When the TCP/IP connection to a peer is idle for more than
ping-int
seconds, DRBD will send a keep-alive packet to make sure that a failed peer or
network connection is detected reasonably soon. The default value is 10
seconds, with a minimum of 1 and a maximum of 120 seconds. The unit is
seconds.
ping-timeout timeout
Define the timeout for replies to keep-alive packets. If the peer does not reply
within
ping-timeout, DRBD will close and try to reestablish the
connection. The default value is 0.5 seconds, with a minimum of 0.1 seconds
and a maximum of 3 seconds. The unit is tenths of a second.
socket-check-timeout timeout
In setups involving a DRBD-proxy and
connections that experience a lot of buffer-bloat it might be necessary to set
ping-timeout to an unusually high value. By default, DRBD uses the same
value to wait if a newly established TCP-connection is stable. Since the
DRBD-proxy is usually located in the same data center such a long wait time
may hinder DRBD's connect process.
In such setups
socket-check-timeout should be set to at least the
round-trip time between DRBD and DRBD-proxy, i.e. in most cases to 1.
The default unit is tenths of a second, the default value is 0 (which causes
DRBD to use the value of
ping-timeout instead). Introduced in
8.4.5.
protocol name
Use the specified protocol on this connection.
The supported protocols are:
A
Writes to the DRBD device complete as soon as
they have reached the local disk and the TCP/IP send buffer.
B
Writes to the DRBD device complete as soon as
they have reached the local disk, and all peers have acknowledged the receipt
of the write requests.
C
Writes to the DRBD device complete as soon as
they have reached the local and all remote disks.
rcvbuf-size size
Configure the size of the TCP/IP receive buffer. A value of 0 (the default)
causes the buffer size to adjust dynamically. This parameter usually does not
need to be set, but it can be set to a value up to 10 MiB. The default unit is
bytes.
rr-conflict policy
This option helps to solve the cases when the
outcome of the resync decision is incompatible with the current role
assignment in the cluster. The defined policies are:
disconnect
No automatic resynchronization, simply
disconnect.
violently
Resync to the primary node is allowed,
violating the assumption that data on a block device are stable for one of the
nodes. Do not use this option, it is dangerous.
call-pri-lost
Call the pri-lost handler on one of the
machines. The handler is expected to reboot the machine, which puts it into
secondary role.
shared-secret secret
Configure the shared secret used for peer authentication. The secret is a string
of up to 64 characters. Peer authentication also requires the
cram-hmac-alg parameter to be set.
sndbuf-size size
Configure the size of the TCP/IP send buffer. Since DRBD 8.0.13 / 8.2.7, a value
of 0 (the default) causes the buffer size to adjust dynamically. Values below
32 KiB are harmful to the throughput on this connection. Large buffer sizes
can be useful especially when protocol A is used over high-latency networks;
the maximum value supported is 10 MiB.
tcp-cork
By default, DRBD uses the TCP_CORK socket
option to prevent the kernel from sending partial messages; this results in
fewer and bigger packets on the network. Some network stacks can perform worse
with this optimization. On these, the tcp-cork parameter can be used to
turn this optimization off.
timeout time
Define the timeout for replies over the network: if a peer node does not send an
expected reply within the specified
timeout, it is considered dead and
the TCP/IP connection is closed. The timeout value must be lower than
connect-int and lower than
ping-int. The default is 6 seconds;
the value is specified in tenths of a second.
use-rle
Each replicated device on a cluster node has a separate bitmap for each of its
peer devices. The bitmaps are used for tracking the differences between the
local and peer device: depending on the cluster state, a disk range can be
marked as different from the peer in the device's bitmap, in the peer device's
bitmap, or in both bitmaps. When two cluster nodes connect, they exchange each
other's bitmaps, and they each compute the union of the local and peer bitmap
to determine the overall differences.
Bitmaps of very large devices are also relatively large, but they usually
compress very well using run-length encoding. This can save time and bandwidth
for the bitmap transfers.
The
use-rle parameter determines if run-length encoding should be used.
It is on by default since DRBD 8.4.0.
verify-alg hash-algorithm
Online verification (
drbdadm verify)
computes and compares checksums of disk blocks (i.e., hash values) in order to
detect if they differ. The
verify-alg parameter determines which
algorithm to use for these checksums. It must be set to one of the secure hash
algorithms supported by the kernel before online verify can be used; see the
shash algorithms listed in /proc/crypto.
We recommend to schedule online verifications regularly during low-load periods,
for example once a month. Also see the notes on data integrity below.
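For example, with the following setting a verification run could then be
started with 'drbdadm verify r0' (sha1 is just one possible algorithm):
    net {
        verify-alg sha1;
    }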
address [address-family]
address: port
Defines the address family, address, and port of a connection endpoint.
The address families
ipv4,
ipv6,
ssocks (Dolphin
Interconnect Solutions' "super sockets"),
sdp (Infiniband
Sockets Direct Protocol), and
sci are supported (
sci is an
alias for
ssocks). If no address family is specified,
ipv4 is
assumed. For all address families except
ipv6, the address is specified
in IPv4 address notation (for example, 1.2.3.4). For
ipv6, the address
is enclosed in brackets and uses IPv6 address notation (for example,
[fd01:2345:6789:abcd::1]). The port is always specified as a decimal number
from 1 to 65535.
On each host, the port numbers must be unique for each address; ports cannot be
shared.
node-id value
Defines the unique node identifier for a node in the cluster. Node identifiers
are used to identify individual nodes in the network protocol, and to assign
bitmap slots to nodes in the metadata.
Node identifiers can only be reassigned in a cluster when the cluster is down.
It is essential that the node identifiers in the configuration and in the
device metadata are changed consistently on all hosts. To change the metadata,
dump the current state with
drbdmeta dump-md, adjust the bitmap slot
assignment, and update the metadata with
drbdmeta restore-md.
The
node-id parameter exists since DRBD 9. Its value ranges from 0 to 16;
there is no default.
auto-promote bool-value
A resource must be promoted to primary role
before any of its devices can be mounted or opened for writing.
Before DRBD 9, this could only be done explicitly ("drbdadm primary").
Since DRBD 9, the
auto-promote parameter allows a resource to be automatically
promoted to primary role when one of its devices is mounted or
opened for writing. As soon as all devices are unmounted or closed with no
more remaining users, the role of the resource changes back to secondary.
Automatic promotion only succeeds if the cluster state allows it (that is, if an
explicit
drbdadm primary command would succeed). Otherwise, mounting or
opening the device fails as it already did before DRBD 9: the
mount(2)
system call fails with errno set to EROFS (Read-only file system); the
open(2) system call fails with errno set to EMEDIUMTYPE (wrong medium
type).
Irrespective of the
auto-promote parameter, if a device is promoted
explicitly (
drbdadm primary), it also needs to be demoted explicitly
(
drbdadm secondary).
The
auto-promote parameter is available since DRBD 9.0.0, and defaults to
yes.
cpu-mask cpu-mask
Set the cpu affinity mask for DRBD kernel threads. The cpu mask is specified as
a hexadecimal number. The default value is 0, which lets the scheduler decide
which kernel threads run on which CPUs. CPU numbers in
cpu-mask which
do not exist in the system are ignored.
on-no-data-accessible policy
Determine how to deal with I/O requests when
the requested data is not available locally or remotely (for example, when all
disks have failed). The defined policies are:
io-error
System calls fail with errno set to EIO.
suspend-io
The resource suspends I/O. I/O can be resumed
by (re)attaching the lower-level device, by connecting to a peer which has
access to the data, or by forcing DRBD to resume I/O with drbdadm resume-io
res. When no data is available, forcing I/O to resume will
result in the same behavior as the io-error policy.
This setting is available since DRBD 8.3.9; the default policy is
io-error.
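A hedged sketch of a resource options section combining the parameters
described above (the cpu-mask value is purely illustrative):
    options {
        auto-promote          yes;
        cpu-mask              3;            # example: restrict DRBD kernel threads to CPUs 0 and 1
        on-no-data-accessible suspend-io;
    }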
peer-ack-window value
On each node and for each device, DRBD maintains a bitmap of the differences
between the local and remote data for each peer device. For example, in a
three-node setup (nodes A, B, C) each with a single device, every node
maintains one bitmap for each of its peers.
When nodes receive write requests, they know how to update the bitmaps for the
writing node, but not how to update the bitmaps between themselves. In this
example, when a write request propagates from node A to B and C, nodes B and C
know that they have the same data as node A, but not whether or not they both
have the same data.
As a remedy, the writing node occasionally sends peer-ack packets to its peers
which tell them which state they are in relative to each other.
The
peer-ack-window parameter specifies how much data a primary node may
send before sending a peer-ack packet. A low value causes increased network
traffic; a high value causes less network traffic but higher memory
consumption on secondary nodes and higher resync times between the secondary
nodes after primary node failures. (Note: peer-ack packets may be sent due to
other reasons as well, e.g. membership changes or expiry of the
peer-ack-delay timer.)
The default value for
peer-ack-window is 2 MiB, the default unit is
sectors. This option is available since 9.0.0.
peer-ack-delay expiry-time
If after the last finished write request no new write request gets issued for
expiry-time, then a peer-ack packet is sent. If a new write request is
issued before the timer expires, the timer gets reset to
expiry-time.
(Note: peer-ack packets may be sent due to other reasons as well, e.g.
membership changes or the
peer-ack-window option.)
This parameter may influence resync behavior on remote nodes. Peer nodes need to
wait until they receive a peer-ack before releasing a lock on an AL-extent.
Resync operations between peers may need to wait for these locks.
The default value for
peer-ack-delay is 100 milliseconds, the default
unit is milliseconds. This option is available since 9.0.0.
quorum value
When activated, a cluster partition requires quorum in order to modify the
replicated data set. That means a node in the cluster partition can only be
promoted to primary if the cluster partition has quorum. Every node with a
disk directly connected to the node that should be promoted counts. If a
primary node should execute a write request, but the cluster partition has
lost quorum, it will freeze IO or reject the write request with an error
(depending on the
on-no-quorum setting). Upon losing quorum, a primary
always invokes the
quorum-lost handler. The handler is intended for
notification purposes; its return code is ignored.
The option's value might be set to
off,
majority,
all or a
numeric value. If you set it to a numeric value, make sure that the value is
greater than half of your number of nodes. Quorum is a mechanism to avoid data
divergence; it can be used instead of fencing when there are more than two
replicas. It defaults to
off.
If all missing nodes are marked as outdated, a partition always has quorum, no
matter how small it is. That is, if you disconnect all secondary nodes gracefully,
a single primary continues to operate. The moment a single secondary is
lost, however, it has to be assumed that it forms a partition with all the missing
outdated nodes. Since the primary's own partition might then be smaller than that
other partition, quorum is lost at this moment.
In case you want to allow permanently diskless nodes to gain quorum, it is
recommended not to use
majority or
all. It is recommended to
specify an absolute number, since DRBD's heuristic to determine the complete
number of diskful nodes in the cluster is unreliable.
The quorum implementation is available starting with the DRBD kernel driver
version 9.0.7.
quorum-minimum-redundancy value
This option sets the minimal required number of nodes with an UpToDate disk to
allow the partition to gain quorum. This is a different requirement than the
plain
quorum option expresses.
The option's value might be set to
off,
majority,
all or a
numeric value. If you set it to a numeric value, make sure that the value is
greater than half of your number of nodes.
In case you want to allow permanently diskless nodes to gain quorum, it is
recommended not to use
majority or
all. It is recommended to
specify an absolute number, since DRBD's heuristic to determine the complete
number of diskful nodes in the cluster is unreliable.
This option is available starting with the DRBD kernel driver version
9.0.10.
on-no-quorum {io-error | suspend-io}
By default, DRBD freezes IO on a device that has lost quorum. By setting
on-no-quorum to
io-error, it completes all IO operations with an
error if quorum is lost.
The
on-no-quorum option is available starting with the DRBD kernel
driver version 9.0.8.
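A hedged sketch of a quorum configuration for a resource replicated among three
or more nodes:
    options {
        quorum       majority;
        on-no-quorum io-error;
    }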
The parameters in this section define the behavior of DRBD at system startup
time, in the DRBD init script. They have no effect once the system is up and
running.
degr-wfc-timeout timeout
Define how long to wait until all peers are connected in case the cluster
consisted of a single node only when the system went down. This parameter is
usually set to a value smaller than
wfc-timeout. The assumption here is
that peers which were unreachable before a reboot are less likely to be
reachable after the reboot, so waiting is less likely to help.
The timeout is specified in seconds. The default value is 0, which stands for an
infinite timeout. Also see the
wfc-timeout parameter.
outdated-wfc-timeout timeout
Define how long to wait until all peers are connected if all peers were outdated
when the system went down. This parameter is usually set to a value smaller
than
wfc-timeout. The assumption here is that an outdated peer cannot
have become primary in the meantime, so we don't need to wait for it as long
as for a node which was alive before.
The timeout is specified in seconds. The default value is 0, which stands for an
infinite timeout. Also see the
wfc-timeout parameter.
stacked-timeouts
On stacked devices, the wfc-timeout and
degr-wfc-timeout parameters in the configuration are usually ignored,
and both timeouts are set to twice the connect-int timeout. The
stacked-timeouts parameter tells DRBD to use the wfc-timeout and
degr-wfc-timeout parameters as defined in the configuration, even on
stacked devices. Only use this parameter if the peer of the stacked resource
is usually not available, or will not become primary. Incorrect use of this
parameter can lead to unexpected split-brain scenarios.
wait-after-sb
This parameter causes DRBD to continue waiting
in the init script even when a split-brain situation has been detected, and
the nodes therefore refuse to connect to each other.
wfc-timeout timeout
Define how long the init script waits until all peers are connected. This can be
useful in combination with a cluster manager which cannot manage DRBD
resources: when the cluster manager starts, the DRBD resources will already be
up and running. With a more capable cluster manager such as Pacemaker, it
makes more sense to let the cluster manager control DRBD resources. The
timeout is specified in seconds. The default value is 0, which stands for an
infinite timeout. Also see the
degr-wfc-timeout parameter.
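A hedged sketch of a startup section (the timeouts are examples only):
    startup {
        wfc-timeout      120;   # wait at most two minutes for peers
        degr-wfc-timeout  60;
    }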
device /dev/drbd<minor-number>
Define the device name and minor number of a replicated block device. This is
the device that applications are supposed to access; in most cases, the device
is not used directly, but as a file system. This parameter is required and the
standard device naming convention is assumed.
In addition to this device, udev will create
/dev/drbd/by-res/<resource>/<volume>
and
/dev/drbd/by-disk/<lower-level-device> symlinks to the
device.
disk {[disk] |
none}
Define the lower-level block device that DRBD will use for storing the actual
data. While the replicated drbd device is configured, the lower-level device
must not be used directly. Even read-only access with tools like
dumpe2fs(8) and similar is not allowed. The keyword
none
specifies that no lower-level block device is configured; this also overrides
inheritance of the lower-level device.
meta-disk internal,
meta-disk device,
meta-disk device [index]
Define where the metadata of a replicated block device resides: it can be
internal, meaning that the lower-level device contains both the data
and the metadata, or on a separate device.
When the
index form of this parameter is used, multiple replicated
devices can share the same metadata device, each using a separate index. Each
index occupies 128 MiB of data, which corresponds to a replicated device size
of at most 4 TiB with two cluster nodes. We recommend not to share metadata
devices anymore, and to instead use the lvm volume manager for creating
metadata devices as needed.
When the
index form of this parameter is not used, the size of the
lower-level device determines the size of the metadata. The size needed is 36
KiB + (size of lower-level device) / 32K * (number of nodes - 1). If the
metadata device is bigger than that, the extra space is not used.
This parameter is required if a
disk other than
none is specified,
and ignored if
disk is set to
none. A
meta-disk parameter
without a
disk parameter is not allowed.
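A hedged sketch of a volume using external metadata with the index form (all
device names are placeholders):
    volume 1 {
        device    /dev/drbd2;
        disk      /dev/sdb1;
        meta-disk /dev/sdc1[0];   # index 0 on a shared metadata device
    }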
DRBD supports two different mechanisms for data integrity checking: first, the
data-integrity-alg network parameter makes it possible to add a checksum to the
data sent over the network. Second, the online verification mechanism (
drbdadm verify and the
verify-alg parameter) makes it possible to check for
differences in the on-disk data.
Both mechanisms can produce false positives if the data is modified during I/O
(i.e., while it is being sent over the network or written to disk). This does
not always indicate a problem: for example, some file systems and applications
do modify data under I/O for certain operations. Swap space can also undergo
changes while under I/O.
Network data integrity checking tries to identify data modification during I/O
by verifying the checksums on the sender side after sending the data. If it
detects a mismatch, it logs an error. The receiver also logs an error when it
detects a mismatch. Thus, an error logged only on the receiver side indicates
an error on the network, and an error logged on both sides indicates data
modification under I/O.
The most recent example of systematic data corruption was identified as a bug in
the TCP offloading engine and driver of a certain type of GBit NIC in 2007:
the data corruption happened on the DMA transfer from core memory to the card.
Because the TCP checksums were calculated on the card, the TCP/IP protocol
checksums did not reveal this problem.
This document was revised for version 9.0.0 of the DRBD distribution.
Written by Philipp Reisner <philipp.reisner@linbit.com> and Lars Ellenberg
<lars.ellenberg@linbit.com>.
Report bugs to <drbd-user@lists.linbit.com>.
Copyright 2001-2018 LINBIT Information Technologies, Philipp Reisner, Lars
Ellenberg. This is free software; see the source for copying conditions. There
is NO warranty; not even for MERCHANTABILITY or FITNESS FOR A PARTICULAR
PURPOSE.
drbd(8),
drbdsetup(8),
drbdadm(8),
DRBD User's
Guide[1],
DRBD Web Site[3]
1. DRBD User's Guide
   http://www.drbd.org/users-guide/
2. Online Usage Counter
   http://usage.drbd.org
3. DRBD Web Site
   http://www.drbd.org/