stacd.conf — stacd(8) configuration file
/etc/stas/stacd.conf
stacd.conf is a plain text file divided into sections, with
configuration entries in the style key=value. Spaces immediately
before or after the "=" are ignored. Empty lines are ignored as well
as lines starting with "#", which may be used for commenting.
The following options are available in the "[Global]" section:
tron=
Trace ON. Takes a boolean argument. If true, enables full code
tracing. The trace will be displayed in the system log such as
systemd's journal. Defaults to false.
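To illustrate the file syntax and a boolean option such as tron=, here is a minimal sketch that reads a "[Global]" section with Python's standard configparser module. The sample content and the load_global() helper are hypothetical, and this is not claimed to be how stacd itself parses the file.

```python
import configparser

# Hypothetical file content using only options documented on this page.
SAMPLE = """
# Lines starting with '#' are comments.
[Global]
tron = true
kato=30
ip-family = ipv4+ipv6
"""

def load_global(text):
    """Return the [Global] section of a stacd.conf-style string as a dict."""
    parser = configparser.ConfigParser(
        delimiters=("=",),        # entries are written key=value
        comment_prefixes=("#",),  # '#' starts a comment line
        interpolation=None,       # take values literally
    )
    parser.read_string(text)
    return dict(parser["Global"])

options = load_global(SAMPLE)
print(options["tron"])  # spaces around '=' have been stripped
print(options["kato"])
```

Note that a strict ConfigParser rejects repeated keys, so this sketch would not handle a section with several controller= lines.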
hdr-digest=
Enable Protocol Data Unit (PDU) Header Digest. Takes a boolean
argument. NVMe/TCP supports an optional PDU Header Digest. Digests
are calculated using the CRC32C algorithm. If true, Header Digests
are inserted in PDUs and checked for errors. Defaults to false.
data-digest=
Enable Protocol Data Unit (PDU) Data Digest. Takes a boolean
argument. NVMe/TCP supports an optional PDU Data Digest. Digests
are calculated using the CRC32C algorithm. If true, Data Digests
are inserted in PDUs and checked for errors. Defaults to false.
kato=
Keep Alive Timeout (KATO). Takes an unsigned integer specifying the timeout, in seconds, of the Keep Alive feature. Defaults to 30 seconds for Discovery Controller connections and 120 seconds for I/O Controller connections.
ip-family=
Takes a string argument. Specifies whether IPv4, IPv6, or both are
supported when connecting to a Controller. Connections will not be
attempted to IP addresses (whether discovered or manually configured
with controller=) whose address family is disabled by this option.
If an invalid value is entered, the default (see below) applies.
Choices are ipv4, ipv6, or ipv4+ipv6.
Defaults to ipv4+ipv6.
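The filtering described for ip-family= can be sketched as follows. The function name allowed_by_ip_family() is hypothetical, and letting non-literal addresses (host names) pass through unfiltered is an assumption of this sketch rather than documented behavior.

```python
import ipaddress

CHOICES = ("ipv4", "ipv6", "ipv4+ipv6")
DEFAULT = "ipv4+ipv6"

def allowed_by_ip_family(traddr, ip_family=DEFAULT):
    """Return True if a connection to traddr may be attempted."""
    if ip_family not in CHOICES:
        ip_family = DEFAULT  # invalid value: the default applies
    families = set(ip_family.split("+"))
    try:
        version = ipaddress.ip_address(traddr).version
    except ValueError:
        # Not a literal IP address (e.g. a host name).
        # Assumption: this sketch does not filter such values.
        return True
    return f"ipv{version}" in families

print(allowed_by_ip_family("192.168.1.10", "ipv4"))
print(allowed_by_ip_family("2001:db8::370:7334", "ipv4"))
```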
nr-io-queues=
Takes a value in the range 1...N. Overrides the default number of I/O queues created by the driver.
Note: This parameter is identical to that provided by nvme-cli.
Default: Depends on kernel and other run time factors (e.g. number of CPUs).
nr-write-queues=
Takes a value in the range 1...N. Adds additional queues that will be used for write I/O.
Note: This parameter is identical to that provided by nvme-cli.
Default: Depends on kernel and other run time factors (e.g. number of CPUs).
nr-poll-queues=
Takes a value in the range 1...N. Adds additional queues that will be used for polling latency sensitive I/O.
Note: This parameter is identical to that provided by nvme-cli.
Default: Depends on kernel and other run time factors (e.g. number of CPUs).
queue-size=
Takes a value in the range 16...1024.
Overrides the default number of elements in the I/O queues created by the driver. This option will be ignored for discovery, but will be passed on to the subsequent connect call.
Note: This parameter is identical to that provided by nvme-cli.
Defaults to 128.
reconnect-delay=
Takes a value in the range 1 to N seconds.
Overrides the default delay before reconnect is attempted after a connect loss.
Note: This parameter is identical to that provided by nvme-cli.
Defaults to 10, that is, a reconnect is attempted every 10 seconds.
ctrl-loss-tmo=
Takes a value in the range -1, 0, ..., N seconds. -1 means retry forever. 0 means do not retry.
Overrides the default controller loss timeout period (in seconds).
Note: This parameter is identical to that provided by nvme-cli.
Defaults to 600 seconds (10 minutes).
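As a rough illustration of how reconnect-delay= and ctrl-loss-tmo= relate, the sketch below estimates how many reconnect attempts fit in the controller loss timeout. It assumes attempts are evenly spaced by reconnect-delay until ctrl-loss-tmo expires; the function name is hypothetical and the exact behavior is up to the kernel driver.

```python
import math

def max_reconnect_attempts(ctrl_loss_tmo=600, reconnect_delay=10):
    """Estimate the number of reconnect attempts; None means retry forever."""
    if reconnect_delay < 1:
        raise ValueError("reconnect-delay must be at least 1 second")
    if ctrl_loss_tmo < 0:
        return None  # -1 means retry forever
    # 0 means do not retry; otherwise round up to whole attempts.
    return math.ceil(ctrl_loss_tmo / reconnect_delay)

print(max_reconnect_attempts())        # with the documented defaults
print(max_reconnect_attempts(-1, 10))  # retry forever
```

With the documented defaults (600 and 10), this estimate works out to 60 attempts.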
duplicate-connect=
Takes a boolean argument. Allows duplicate connections between the same transport host and subsystem port.
Note: This parameter is identical to that provided by nvme-cli.
Defaults to false.
disable-sqflow=
Takes a boolean argument. Disables SQ flow control to omit head doorbell update for submission queues when sending NVMe completions.
Note: This parameter is identical to that provided by nvme-cli.
Defaults to false.
ignore-iface=
Takes a boolean argument. This option controls how connections with I/O Controllers (IOC) are made.
There is no guarantee that there will be a route to reach that IOC. However, we can use the socket option SO_BINDTODEVICE to force the connection to be made on a specific interface instead of letting the routing tables decide where to make the connection.
This option determines whether stacd will use SO_BINDTODEVICE to
force connections on an interface or just rely on the routing tables.
The default is to use SO_BINDTODEVICE, in other words, stacd does
not ignore the interface.
BACKGROUND:
By default, stacd will connect to IOCs on the same interface that
was used to retrieve the discovery log pages. If stafd discovers a
DC on an interface using mDNS, and stafd connects to that DC and
retrieves the log pages, it is expected that the storage subsystems
listed in the log pages are reachable on the same interface where
the DC was discovered.
For example, let's say a DC is discovered on interface ens102. Then all the subsystems listed in the log pages retrieved from that DC must be reachable on interface ens102. If this doesn't work (for example, "ping -I ens102 [storage-ip]" fails), then the most likely explanation is that proxy ARP is not enabled on the switch that the host is connected to on interface ens102. Whatever you do, resist the temptation to manually set up the routing tables or to add alternate routes going over a different interface than the one where the DC is located. That simply won't work. Make sure proxy ARP is enabled on the switch first.
Setting routes won't work because, by default, stacd uses the
SO_BINDTODEVICE socket option when it connects to IOCs. This option
is used to force a socket connection to be made on a specific
interface instead of letting the routing tables decide where to
connect the socket. Even if you were to manually configure an
alternate route on a different interface, the connections (i.e. host
to IOC) will still be made on the interface where the DC was
discovered by stafd.
Defaults to false.
udev-rule=
Takes a string argument, enabled or disabled. This option determines
whether nvme-cli's udev rules for TCP connections will be executed
or ignored.
A set of udev rules gets installed with nvme-cli that tells the udev
daemon (udevd) to look for Asynchronous Event Notifications (AEN)
indicating a change of Discovery Log Page Entries (DLPE). These udev
rules are typically installed as:
/usr/lib/udev/rules.d/70-nvmf-autoconnect.rules
When an AEN is detected, udevd instructs systemd to start a service
that invokes nvme-cli's connect-all command. This command retrieves
the DLPEs from the Discovery Controller (DC) that sent the AEN and
connects to all the I/O Controllers (IOC) listed in the DLPEs.
In parallel, stafd and stacd react to the AEN in the same way. This
results in a race condition between udevd and nvme-stas. nvme-stas
is written in Python and runs slower than nvme-cli, which is written
in C. In other words, nvme-stas usually loses the race.
This can be a problem for TCP connections because nvme-cli
traditionally doesn't specify the interface (host-iface) when making
TCP connections and leaves it to the kernel (and the routing table)
to select the best interface. nvme-stas, on the other hand, always
tries to make connections on a specific interface (per
configuration). Note that a fix was added to nvme-cli so that TCP
connections to IOCs will now be made with host-iface specified. That
fix, however, is only available in versions of nvme-cli later than
2.1.2.
To add insult to injury, when a connection is made without
specifying the host-iface, and therefore the kernel decides which
interface to use, there is no way to tell from user space (i.e. by
nvme-stas) which interface the kernel actually used. A fix was made
to the kernel to make a TCP connection's interface available to user
space applications, but that will only be available in Linux 6.1 (or
later).
Being able to identify the interface (host-iface) is important to
nvme-stas. That's because it uses a Transport Identifier (TID)
containing all the parameters (including the host-iface) needed to
make connections (see table below). The parameters that compose the
TID can be retrieved from sysfs under /sys/class/nvme/.
Table 1. Transport Identifier
trtype | Transport type (tcp, rdma, fc, loop) |
traddr | Transport address (e.g. IP address) |
trsvcid | Transport service ID (e.g. IP port) |
subnqn | Subsystem NQN |
host-traddr | Host transport address (e.g. source IP address) |
host-iface | Host interface (e.g. eth1) |
When nvme-stas makes a connection, it first looks for an existing
connection that matches the TID (including a matching host-iface).
Since connections made by nvme-cli lack the host-iface, nvme-stas
does not find a match. Therefore, nvme-stas will try to make a new
connection, which will often be refused by the kernel because a
connection already exists.
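The mismatch described above can be illustrated with a small sketch. The TID class below simply mirrors the fields of Table 1, and find_existing() is a hypothetical exact-match lookup; the addresses and NQN are made-up sample values, and the matching logic actually used by nvme-stas may be more nuanced.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class TID:
    """Transport Identifier with the fields listed in Table 1."""
    trtype: str
    traddr: str
    trsvcid: str
    subnqn: str
    host_traddr: str = ""
    host_iface: str = ""

def find_existing(wanted, existing):
    """Return the existing connection whose TID equals wanted, or None."""
    for tid in existing:
        if tid == wanted:  # every field, including host_iface, must match
            return tid
    return None

# Sample values for illustration only.
made_by_nvme_cli = TID("tcp", "192.168.1.10", "4420",
                       "nqn.2014-08.org.example:subsys1")
wanted_by_stas = TID("tcp", "192.168.1.10", "4420",
                     "nqn.2014-08.org.example:subsys1",
                     host_iface="ens102")

# The connection made without host-iface is not recognized as a match.
print(find_existing(wanted_by_stas, [made_by_nvme_cli]))
```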
Suffice it to say that issues may arise when both nvme-stas and
nvme-cli operate in parallel. These issues may vary depending on
your version of Linux, nvme-cli, nvme-stas, and/or libnvme.
These issues will often result in messages printed by the kernel to
the syslog. Typical error messages from the kernel may look
something like these:
"[...] failed to connect controller, error 1006".
"[...] failed to connect socket: -111".
"[...] failed to write to nvme-fabrics device".
"[...] Failed to write to /dev/nvme-fabrics: Connection refused".
The udev-rule option allows a user to disable nvme-cli's udev rule
for TCP connections. Only TCP connections rely on the host-iface
parameter, and therefore the udev rule need only be disabled for
this type of transport.
Defaults to disabled.
Connectivity between hosts and subsystems in a fabric is controlled by Fabric Zoning. Entities that share a common zone (i.e., are zoned together) are allowed to discover each other and establish connections between them. Fabric Zoning is configured on Discovery Controllers (DC). Users can add/remove controllers and/or hosts to/from zones.
Hosts have no direct knowledge of the Fabric Zoning configuration that is active on a given DC. As a result, if a host is impacted by a Fabric Zoning configuration change, it will be notified of the connectivity configuration change by the DC via Asynchronous Event Notifications (AEN).
Table 2. List of terms used in this section:
Term | Description |
---|---|
AEN | Asynchronous Event Notification. A CQE (Completion Queue Entry) for an Asynchronous Event Request that was previously transmitted by the host to a Discovery Controller. AENs are used by DCs to notify hosts that a change (e.g., a connectivity configuration change) has occurred. |
DC | Discovery Controller. |
DLP | Discovery Log Page. A host will issue a Get Log Page command to retrieve the list of controllers it may connect to. |
DLPE | Discovery Log Page Entry. The response to a Get Log Page command contains a list of DLPEs identifying each controller that the host is allowed to connect with.
Note that DLPEs may contain both I/O Controllers (IOCs)
and Discovery Controllers (DCs). DCs listed in DLPEs
are called referrals. |
IOC | I/O Controller. |
Manual Config | Refers to manually adding entries to stacd.conf with the controller= parameter. |
Automatic Config | Refers to receiving configuration from a DC as DLPEs |
External Config | Refers to configuration done outside of the nvme-stas framework, for example using nvme-cli commands |
DCs notify hosts of connectivity configuration changes by sending AENs indicating a "Discovery Log" change. The host uses these AENs as a trigger to issue a Get Log Page command. The response to this command is used to update the list of DLPEs containing the controllers the host is allowed to access. Upon reception of the current DLPEs, the host will determine whether DLPEs were added and/or removed, which will trigger the addition and/or removal of controller connections. This happens in real time and may affect active connections to controllers including controllers that support I/O operations (IOCs). A host that was previously connected to an IOC may suddenly be told that it is no longer allowed to connect to that IOC and should disconnect from it.
IOC connection creation. There are 3 ways to configure IOC connections on a host:
Manual Config by adding controller= entries to the "[Controllers]" section (see below).
Automatic Config received in the form of DLPEs from a remote DC.
External Config using nvme-cli (e.g. "nvme connect").
IOC connection removal/prevention. There are 3 ways to remove (or prevent) connections to an IOC:
Manual Config:
by adding exclude= entries to the "[Controllers]" section (see below).
by removing controller= entries from the "[Controllers]" section.
Automatic Config. As explained above, a host gets a new list of
DLPEs upon connectivity configuration changes. On DLPE removal, the
host should remove the connection to the IOC matching that DLPE.
This behavior is configurable using the disconnect-scope= parameter
described below.
External Config using nvme-cli (e.g. "nvme disconnect" or "nvme disconnect-all").
The decision by the host to automatically disconnect from an IOC
following connectivity configuration changes is controlled by two
parameters: disconnect-scope and disconnect-trtypes.
disconnect-scope=
Takes one of: only-stas-connections,
all-connections-matching-disconnect-trtypes, or no-disconnect.
In theory, hosts should only connect to IOCs that have been zoned for them. Connections to IOCs that a host is not zoned to have access to should simply not exist. In practice, however, users may not want hosts to disconnect from IOCs (or at least not from some of them) in reaction to connectivity configuration changes.
Some users may prefer for IOC connections to be "sticky" and only be
removed manually (nvme-cli or exclude=) or removed by a system
reboot. Specifically, they don't want IOC connections to be removed
unexpectedly on DLPE removal. These users may want to set
disconnect-scope to no-disconnect.
It is important to note that when IOC connections are removed, ongoing I/O transactions will be terminated immediately. There is no way to tell what happens to the data being exchanged when such an abrupt termination happens. If a host was in the middle of writing to a storage subsystem, there is a chance that outstanding I/O operations may not successfully complete.
only-stas-connections
Only remove connections previously made by stacd.
In this mode, when a DLPE is removed as a result of connectivity
configuration changes, the corresponding IOC connection will be
removed by stacd.
Connections to IOCs made externally, e.g. using nvme-cli, will not
be affected, unless they happen to be duplicates of connections made
by stacd. It's simply not possible for stacd to tell that a
connection was previously made with nvme-cli (or any other external
tool). So, it's good practice to avoid duplicating configuration
between stacd and external tools.
Users wanting to persist some of their IOC connections regardless of
connectivity configuration changes should not use nvme-cli to make
those connections. Instead, they should hard-code them in stacd.conf
with the controller= parameter. Using the controller= parameter is
the only way for a user to tell stacd that a connection must be made
and not be deleted "no-matter-what".
all-connections-matching-disconnect-trtypes
All connections that match the transport type specified by
disconnect-trtypes=, whether they were made automatically by stacd
or externally (e.g., nvme-cli), will be audited and are subject to
removal on DLPE removal.
In this mode, as DLPEs are removed as a result of connectivity
configuration changes, the corresponding IOC connections will be
removed by the host immediately whether they were made by stacd,
nvme-cli, or any other way. Basically, stacd audits all IOC
connections matching the transport type specified by
disconnect-trtypes=.
NOTE.
This mode implies that stacd will only allow Manually Configured or
Automatically Configured IOC connections to exist. Externally
Configured connections using nvme-cli (or other external mechanism)
that do not match any Manual Config (stacd.conf) or Automatic Config
(DLPEs) will get deleted immediately by stacd.
no-disconnect
stacd does not disconnect from IOCs when a DLPE is removed or a
controller= entry is removed from stacd.conf. All IOC connections
are "sticky".
Instead, users can remove connections by issuing the nvme-cli
command "nvme disconnect", by adding an exclude= entry to
stacd.conf, or by waiting until the next system reboot, at which
time all connections will be removed.
Defaults to only-stas-connections.
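The three disconnect-scope= modes can be summarized with a small decision sketch. The function and its parameters are hypothetical; in particular, made_by_stacd and still_wanted (the connection still matches a controller= entry or a current DLPE) are inputs this sketch takes for granted, and it follows the wording of this page literally rather than the actual stacd implementation.

```python
def should_disconnect(scope, made_by_stacd, trtype, still_wanted,
                      disconnect_trtypes=("tcp",)):
    """Decide whether an IOC connection is removed during an audit."""
    if still_wanted or scope == "no-disconnect":
        return False  # wanted connections and "sticky" mode: keep
    if scope == "only-stas-connections":
        return made_by_stacd  # external connections are left alone
    if scope == "all-connections-matching-disconnect-trtypes":
        # Origin does not matter; only the transport type does.
        return trtype in disconnect_trtypes
    raise ValueError(f"unknown disconnect-scope: {scope}")

# A connection made with nvme-cli whose DLPE has just been removed:
for scope in ("only-stas-connections",
              "all-connections-matching-disconnect-trtypes",
              "no-disconnect"):
    print(scope, should_disconnect(scope, False, "tcp", False))
```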
disconnect-trtypes=
This parameter only applies when disconnect-scope is set to
all-connections-matching-disconnect-trtypes. It limits the scope of
the audit to specific transport types.
Can take the values tcp, rdma, fc, or a combination thereof by
separating them with a plus (+) sign. For example: tcp+fc. No spaces
are allowed between values and the plus (+) sign.
Defaults to tcp.
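A minimal sketch of how a disconnect-trtypes= value could be validated is shown below. The function name is hypothetical, and raising an error on an invalid value is an assumption of this sketch; this page does not say how stacd reacts to one.

```python
VALID_TRTYPES = ("tcp", "rdma", "fc")

def parse_disconnect_trtypes(value="tcp"):
    """Split a value such as 'tcp+fc' into a set of transport types."""
    parts = value.split("+")
    # Spaces are not stripped, so 'tcp + fc' is rejected, as documented.
    bad = [part for part in parts if part not in VALID_TRTYPES]
    if bad:
        raise ValueError(f"invalid transport type(s): {bad}")
    return set(parts)

print(sorted(parse_disconnect_trtypes("tcp+fc")))
```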
connect-attempts-on-ncc=
The NCC bit (Not Connected to CDC) is a bit returned by the CDC in the EFLAGS field of the DLPE. Only CDCs will set the NCC bit. DDCs will always clear NCC to 0. The NCC bit is a way for the CDC to let hosts know that the subsystem is currently not reachable by the CDC. This may indicate that the subsystem is currently down or that there is an outage on the section of the network connecting the CDC to the subsystem.
If a host is currently failing to connect to an I/O controller and if the NCC bit associated with that I/O controller is asserted, the host can decide to stop trying to connect to that subsystem until connectivity is restored. This will be indicated by the CDC when it clears the NCC bit.
The parameter connect-attempts-on-ncc= controls whether stacd will
take the NCC bit into account when attempting to connect to an I/O
Controller. Setting connect-attempts-on-ncc= to 0 means that stacd
will ignore the NCC bit and will keep trying to connect. Setting
connect-attempts-on-ncc= to a non-zero value indicates the number of
connection attempts that will be made before stacd gives up trying.
Note that this value should be set to a value greater than 1. In
fact, when set to 1, stacd will automatically use 2 instead. The
reason for this is simple. It is possible that a first connect
attempt may fail, especially if nvme-cli's udev rule is enabled (see
race condition discussion under the udev-rule= parameter above).
Defaults to 0.
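The rules for connect-attempts-on-ncc= can be sketched as follows. Both function names are hypothetical, and the way failed attempts are counted is an assumption made for illustration only.

```python
def effective_connect_attempts(configured=0):
    """Return the attempt limit actually used; None means no limit."""
    if configured < 0:
        raise ValueError("connect-attempts-on-ncc must not be negative")
    if configured == 0:
        return None  # NCC bit is ignored, keep trying
    return max(configured, 2)  # a value of 1 is bumped to 2

def keep_trying(failed_attempts, ncc_asserted, configured=0):
    """Return True if another connection attempt should be made."""
    limit = effective_connect_attempts(configured)
    if limit is None or not ncc_asserted:
        return True
    return failed_attempts < limit

print(effective_connect_attempts(1))  # 1 is treated as 2
print(keep_trying(2, True, 1))        # limit reached while NCC is set
```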
The following options are available in the "[Controllers]" section:
controller=
Controllers are specified with the controller option. This option
may be specified more than once to specify more than one controller.
The format is one line per Controller composed of a series of fields
separated by semi-colons as follows:
controller=transport=[trtype];traddr=[traddr];trsvcid=[trsvcid];host-traddr=[traddr];host-iface=[iface];nqn=[nqn]
transport=
This is a mandatory field that specifies the network fabric being
used for an NVMe-over-Fabrics network. Current trtype values
understood are:
Table 3. Transport type
trtype | Definition |
---|---|
rdma | The network fabric is an RDMA network (RoCE, iWARP, InfiniBand, basic RDMA, etc.) |
fc | The network fabric is a Fibre Channel network. |
tcp | The network fabric is a TCP/IP network. |
loop | Connect to an NVMe over Fabrics target on the local host. |
traddr=
This is a mandatory field that specifies the network address of the Controller. For transports using IP addressing (e.g. rdma, tcp) this should be an IP-based address (e.g. IPv4, IPv6). It could also be a resolvable host name (e.g. localhost).
trsvcid=
This is an optional field that specifies the transport service id. For transports using IP addressing (e.g. rdma, tcp) this field is the port number.
Depending on the transport type, this field will default to either 8009 or 4420 as follows.
UDP port 4420 and TCP port 4420 have been assigned by IANA for use by NVMe over Fabrics. NVMe/RoCEv2 controllers use UDP port 4420 by default. NVMe/iWARP controllers use TCP port 4420 by default.
TCP port 4420 has been assigned for use by NVMe over Fabrics and TCP port 8009 has been assigned by IANA for use by NVMe over Fabrics discovery. TCP port 8009 is the default TCP port for NVMe/TCP discovery controllers. There is no default TCP port for NVMe/TCP I/O controllers; the Transport Service Identifier (TRSVCID) field in the Discovery Log Entry indicates the TCP port to use.
The TCP ports that may be used for NVMe/TCP I/O controllers include TCP port 4420, and the Dynamic and/or Private TCP ports (i.e., ports in the TCP port number range from 49152 to 65535). NVMe/TCP I/O controllers should not use TCP port 8009. TCP port 4420 shall not be used for both NVMe/iWARP and NVMe/TCP at the same IP address on the same network.
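One possible reading of the defaulting rule for trsvcid= is sketched below. The mapping (8009 for tcp, 4420 for rdma, no default for other transports) is an assumption drawn from the port assignments described above, and effective_trsvcid() is a hypothetical helper, not part of nvme-stas.

```python
# Assumed defaults, based on the port assignments described above.
DEFAULT_TRSVCID = {"tcp": "8009", "rdma": "4420"}

def effective_trsvcid(transport, trsvcid=None):
    """Return the configured trsvcid, or an assumed per-transport default."""
    if trsvcid:
        return str(trsvcid)
    # Transports without IP ports (e.g. fc) get no default here.
    return DEFAULT_TRSVCID.get(transport)

print(effective_trsvcid("tcp"))
print(effective_trsvcid("rdma"))
print(effective_trsvcid("tcp", 4420))
```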
nqn=
This field specifies the Controller's NVMe Qualified Name. This
field is mandatory for I/O Controllers, but is optional for
Discovery Controllers (DC). For the latter, the NQN will default to
the well-known DC NQN: "nqn.2014-08.org.nvmexpress.discovery" if
left undefined.
host-traddr=
This is an optional field that specifies the network address used on the host to connect to the Controller. For TCP, this sets the source address on the socket.
host-iface=
This is an optional field that specifies the network interface used on the host to connect to the Controller (e.g. eth1, enp2s0, enx78e7d1ea46da). This forces the connection to be made on a specific interface instead of letting the system decide.
Examples:
controller = transport=tcp;traddr=localhost;trsvcid=8009
controller = transport=tcp;traddr=2001:db8::370:7334;host-iface=enp0s8
controller = transport=fc;traddr=nn-0x204600a098cbcac6:pn-0x204700a098cbcac6
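The semicolon-separated field format used by controller= can be parsed along these lines. This is a sketch only: parse_controller() is a hypothetical helper, and the checks reflect the mandatory fields named on this page (transport and traddr), not the validation stacd actually performs.

```python
FIELDS = {"transport", "traddr", "trsvcid", "host-traddr", "host-iface", "nqn"}

def parse_controller(value):
    """Parse the right-hand side of a controller= entry into a dict."""
    fields = {}
    for item in value.split(";"):
        item = item.strip()
        if not item:
            continue
        # Split on the first '=' only; values may contain ':' (IPv6, FC).
        key, sep, val = item.partition("=")
        key = key.strip()
        if not sep or key not in FIELDS:
            raise ValueError(f"unrecognized field: {item!r}")
        fields[key] = val.strip()
    for required in ("transport", "traddr"):
        if required not in fields:
            raise ValueError(f"missing mandatory field: {required}")
    return fields

print(parse_controller(
    "transport=tcp;traddr=2001:db8::370:7334;host-iface=enp0s8"))
```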
exclude=
Controllers that should be excluded can be specified with the
exclude= option. Using mDNS to automatically discover and connect to
controllers can result in unintentional connections being made. This
keyword allows configuring the controllers that should not be
connected to.
The syntax is the same as for "controller", except that the
parameter host-traddr does not apply. Multiple exclude= keywords may
appear in the config file to specify more than 1 excluded
controller.
Note 1: A minimal match approach is used to eliminate unwanted
controllers. That is, you do not need to specify all the parameters
to identify a controller. Just specifying the host-iface, for
example, can be used to exclude all controllers on an interface.
Note 2: exclude= takes precedence over controller. A controller
specified by the controller keyword can be eliminated by the
exclude= keyword.
Examples:
exclude = transport=tcp;traddr=fe80::2c6e:dee7:857:26bb  # Eliminate a specific address
exclude = host-iface=enp0s8                              # Eliminate everything on this interface
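The minimal match approach of Note 1 can be sketched as a subset comparison: a controller is excluded as soon as every field given in some exclude= entry matches it. The is_excluded() helper and the dict representation are hypothetical, and ignoring empty entries is an assumption of this sketch.

```python
def is_excluded(controller, exclude_entries):
    """Return True if any exclude entry matches on all the fields it gives."""
    for entry in exclude_entries:
        # Assumption: an entry with no fields excludes nothing.
        if entry and all(controller.get(key) == value
                         for key, value in entry.items()):
            return True
    return False

controller = {
    "transport": "tcp",
    "traddr": "fe80::2c6e:dee7:857:26bb",
    "host-iface": "enp0s8",
}

# Specifying only host-iface is enough to exclude this controller.
print(is_excluded(controller, [{"host-iface": "enp0s8"}]))
```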