Managing GlusterFS Volumes
This chapter describes how to perform common volume management operations on GlusterFS volumes.
Configuring Volume Options
Note
Volume options can be configured while the trusted storage pool is online.
The current settings for a volume can be viewed using the following command:
# gluster volume info VOLNAME
Volume options can be configured using the following command:
# gluster volume set VOLNAME OPTION PARAMETER
For example, to specify the performance cache size for test-volume:
# gluster volume set test-volume performance.cache-size 256MB
Set volume successful
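When several options need to be applied to a volume, the set command is simply repeated per option. As a minimal sketch, the hypothetical helper below only prints the `gluster volume set` commands that would be run for a batch of option=value pairs, so the batch can be reviewed before anything touches glusterd; the option values shown are illustrative only.

```shell
# plan_volume_set: hypothetical dry-run helper. It prints, but does not run,
# one 'gluster volume set' command per option=value pair.
plan_volume_set() {
  volname=$1; shift
  for pair in "$@"; do
    key=${pair%%=*}   # text before the first '='
    val=${pair#*=}    # text after the first '='
    echo "gluster volume set $volname $key $val"
  done
}

plan_volume_set test-volume performance.cache-size=256MB network.ping-timeout=42
```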
The following table lists available volume options along with their description and default value.
Note
The default values are subject to change, and may not be the same for all versions of GlusterFS.
Option | Description | Allowed Values | Default Value
---|---|---|---
auth.allow | IP addresses or hostnames of the clients which are allowed to access the volume. | Valid hostnames or IP addresses, which can include wildcard patterns. | * (allow all)
auth.reject | IP addresses or hostnames of the clients which are denied access to the volume. | Valid hostnames or IP addresses, which can include wildcard patterns. | none (reject none)
changelog | Enables the changelog translator to record all the file operations. | on, off | off
client.event-threads | Specifies the number of network connections to be handled simultaneously by the client processes accessing a GlusterFS node. | 1 - 32 | 2
server.event-threads | Specifies the number of network connections to be handled simultaneously by the server processes hosting a GlusterFS node. | 1 - 32 | 2
cluster.consistent-metadata | If set to on, the readdirp function in the Automatic File Replication feature always fetches metadata from its read child as long as it holds the good copy (the copy that does not need healing) of the file or directory. However, this could reduce performance where readdirp is involved. | on, off | off
cluster.min-free-disk | Specifies the percentage of disk space that must be kept free. This may be useful for non-uniform bricks. | Percentage of required minimum free disk space. | 10%
cluster.op-version | Allows you to set the operating version of the cluster. The op-version number cannot be downgraded and is set for all the volumes. The op-version does not appear in the output of the gluster volume info command. | 3000z, 30703, 30706 | 3000z after an upgrade from GlusterFS 3.0; 30703 after an upgrade from RHGS 3.1.1; 30706 for a new cluster deployment.
cluster.self-heal-daemon | Specifies whether proactive self-healing on replicated volumes is activated. | on, off | on
cluster.background-self-heal-count | The maximum number of heal operations that can occur simultaneously. Requests in excess of this number are stored in a queue whose length is defined by cluster.heal-wait-queue-leng. | 0 - 256 | 8
cluster.heal-wait-queue-leng | The maximum number of requests for heal operations that can be queued when heal operations equal to cluster.background-self-heal-count are already in progress. | 0 - 10000 | 128
cluster.server-quorum-type | If set to server, this option enables the specified volume to participate in the server-side quorum. | none, server | none
cluster.server-quorum-ratio | Sets the quorum percentage for the trusted storage pool. | 0 - 100 | >50%
cluster.quorum-type | Determines client-side quorum for a replica set. If set to fixed, writes are allowed only when at least cluster.quorum-count bricks are online; if set to auto, writes are allowed only when more than half of the bricks are online. | fixed, auto, none | none
cluster.quorum-count | The minimum number of bricks that must be active in a replica set to allow writes. This option is used in conjunction with cluster.quorum-type=fixed. | 1 - replica count | 0
cluster.lookup-optimize | If set to on, enables optimization of negative lookups by skipping lookups on non-hashed subvolumes when the hashed subvolume does not return any result. | on, off | off
cluster.read-freq-threshold | Specifies the number of reads, in a promotion/demotion cycle, that marks a file as hot for promotion. | 0 - 20 | 0
cluster.write-freq-threshold | Specifies the number of writes, in a promotion/demotion cycle, that marks a file as hot for promotion. | 0 - 20 | 0
cluster.tier-promote-frequency | Specifies how frequently the tier daemon must check for files to promote. | 1 - 172800 seconds | 120 seconds
cluster.tier-demote-frequency | Specifies how frequently the tier daemon must check for files to demote. | 1 - 172800 seconds | 3600 seconds
cluster.tier-mode | If set to cache mode, promotes or demotes files based on whether the cache is full or not, as specified with watermarks. If set to test mode, periodically demotes or promotes files automatically based on access. | test, cache | cache
cluster.tier-max-mb | Specifies the maximum number of MB that may be migrated in any direction from each node in a given cycle. | 1 - 100000 (100 GB) | 4000 MB
cluster.tier-max-files | Specifies the maximum number of files that may be migrated in any direction from each node in a given cycle. | 1 - 100000 files | 10000
cluster.watermark-hi | Upper percentage watermark for promotion. If the hot tier fills above this percentage, no promotion happens and demotion happens with high probability. | 1 - 99% | 90%
cluster.watermark-low | Lower percentage watermark. If the hot tier is less full than this, promotion happens and demotion does not. If fuller than this, promotion and demotion happen at a probability relative to how full the hot tier is. | 1 - 99% | 75%
cluster.shd-max-threads | Specifies the number of entries that can be self-healed in parallel on each replica by the self-heal daemon. | 1 - 64 | 1
cluster.shd-wait-qlength | Specifies the number of entries that must be kept in the queue for self-heal daemon threads to take up as soon as any thread is free to heal. This value should be tuned based on how much memory the self-heal daemon process can use for keeping the next set of entries that need to be healed. | 1 - 655536 | 1024
config.transport | Specifies the transport type(s) the volume supports for communication. | tcp OR rdma OR tcp,rdma | tcp
diagnostics.brick-log-level | Changes the log-level of the bricks. | INFO, DEBUG, WARNING, ERROR, CRITICAL, NONE, TRACE | info
diagnostics.client-log-level | Changes the log-level of the clients. | INFO, DEBUG, WARNING, ERROR, CRITICAL, NONE, TRACE | info
diagnostics.brick-sys-log-level | Log messages at and above the defined level are generated in the syslog and the brick log files. | INFO, WARNING, ERROR, CRITICAL | CRITICAL
diagnostics.client-sys-log-level | Log messages at and above the defined level are generated in the syslog and the client log files. | INFO, WARNING, ERROR, CRITICAL | CRITICAL
diagnostics.client-log-format | Allows you to configure the log format to log either with a message ID or without one on the client. | no-msg-id, with-msg-id | with-msg-id
diagnostics.brick-log-format | Allows you to configure the log format to log either with a message ID or without one on the brick. | no-msg-id, with-msg-id | with-msg-id
diagnostics.brick-log-flush-timeout | The length of time for which log messages are buffered before being flushed to the logging infrastructure (gluster or syslog files) on the bricks. | 30 - 300 seconds (30 and 300 included) | 120 seconds
diagnostics.brick-log-buf-size | The maximum number of unique log messages that can be suppressed until the timeout or buffer overflow, whichever occurs first, on the bricks. | 0 - 20 (0 and 20 included) | 5
diagnostics.client-log-flush-timeout | The length of time for which log messages are buffered before being flushed to the logging infrastructure (gluster or syslog files) on the clients. | 30 - 300 seconds (30 and 300 included) | 120 seconds
diagnostics.client-log-buf-size | The maximum number of unique log messages that can be suppressed until the timeout or buffer overflow, whichever occurs first, on the clients. | 0 - 20 (0 and 20 included) | 5
disperse.eager-lock | Before a file operation starts, a lock is placed on the file and remains in place until the operation completes. If eager-lock is on, the lock then remains in place either until lock contention is detected, or for 1 second, in case there is another request for that file from the same client. If eager-lock is off, locks are released immediately after file operations complete, improving performance for some operations but reducing access efficiency. | on, off | on
features.ctr-enabled | Enables the Change Time Recorder (CTR) translator for a tiered volume. This option is used in conjunction with features.record-counters. | on, off | on
features.ctr_link_consistency | Enables a crash-consistent way of recording hardlink updates by the Change Time Recorder translator. When recording in a crash-consistent way, data operations experience more latency. | on, off | off
features.quota-deem-statfs | When set to on, quota limits are taken into consideration while estimating the filesystem size. The limit is treated as the total size instead of the actual size of the filesystem. | on, off | on
features.record-counters | If set to on, records the read and write heat counters used by cluster.write-freq-threshold and cluster.read-freq-threshold. | on, off | on
features.read-only | Specifies whether to mount the entire volume as read-only for all the clients accessing it. | on, off | off
features.shard | Enables or disables sharding on the volume. Affects files created after volume configuration. | enable, disable | disable
features.shard-block-size | Specifies the maximum size of the file pieces when sharding is enabled. Affects files created after volume configuration. | 512MB | 512MB
geo-replication.indexing | Enables the marker translator to track changes in the volume. | on, off | off
performance.quick-read | Enables or disables the quick-read translator in the volume. | on, off | on
network.ping-timeout | The time the client waits for a response from the server. If a timeout occurs, all resources held by the server on behalf of the client are cleaned up. When the connection is re-established, all resources must be reacquired before the client can resume operations on the server, and locks are reacquired and the lock tables updated. A reconnect is a very expensive operation and should be avoided. | 42 seconds | 42 seconds
nfs.acl | Disabling nfs.acl removes support for the NFSACL sideband protocol. Enabled by default. | enable, disable | enable
nfs.enable-ino32 | For NFS clients or applications that do not support 64-bit inode numbers, use this option to make NFS return 32-bit inode numbers instead. Disabled by default, so NFS returns 64-bit inode numbers. | enable, disable | disable
nfs.export-dir | By default, all NFS volumes are exported as individual exports. This option allows you to export specified subdirectories on the volume. | An absolute path. Along with the path, a list of IP addresses or hostnames can be associated with each subdirectory. | None
nfs.export-dirs | By default, all NFS sub-volumes are exported as individual exports. This option allows any directory on a volume to be exported separately. | on, off | on
nfs.export-volumes | Enables or disables exporting entire volumes. If disabled and used in conjunction with nfs.export-dir, this allows setting up only subdirectories as exports. | on, off | on
nfs.mount-rmtab | Path to the cache file that contains a list of NFS clients and the volumes they have mounted. Change the location of this file to a mounted volume (with glusterfs-fuse, on all storage servers) to gain a trusted-pool-wide view of all NFS clients that use the volumes. The contents of this file provide the information that the showmount command can obtain. | Path to a directory | /var/lib/glusterd/nfs/rmtab
nfs.mount-udp | Enables UDP transport for the MOUNT sideband protocol. By default, UDP is not enabled and MOUNT can only be used over TCP. Some NFS clients (certain Solaris, HP-UX, and others) do not support MOUNT over TCP; enabling nfs.mount-udp makes it possible for them to use NFS exports provided by GlusterFS. | enable, disable | disable
nfs.nlm | By default, the Network Lock Manager (NLMv4) is enabled. Use this option to disable NLM. Disabling this option is not recommended. | on, off | on
nfs.rpc-auth-allow IP_ADDRESSES | A comma-separated list of IP addresses allowed to connect to the server. By default, all clients are allowed. | Comma-separated list of IP addresses | accept all
nfs.rpc-auth-reject IP_ADDRESSES | A comma-separated list of addresses not allowed to connect to the server. By default, all connections are allowed. | Comma-separated list of IP addresses | reject none
nfs.ports-insecure | Allows client connections from unprivileged ports. By default, only privileged ports are allowed. This is a global setting for allowing insecure ports for all exports using a single option. | on, off | off
nfs.addr-namelookup | Specifies whether to look up names for incoming client connections. In some configurations, the name server can take too long to reply to DNS queries, resulting in timeouts of mount requests. This option can be used to disable name lookups during address authentication. Note that disabling name lookups prevents you from using hostnames in nfs.rpc-auth-* options. | on, off | on
nfs.port | Associates GlusterFS NFS with a non-default port. | 1025 - 65535 | 38465 - 38467
nfs.disable | Specifies whether to disable NFS exports of individual volumes. | on, off | off
nfs.server-aux-gids | When enabled, the NFS server resolves the groups of the user accessing the volume. NFSv3 is restricted by the RPC protocol (AUTH_UNIX/AUTH_SYS header) to 16 groups. By resolving the groups on the NFS server, this limit can be bypassed. | on, off | off
nfs.transport-type | Specifies the transport used by the GlusterFS NFS server to communicate with bricks. | tcp OR rdma | tcp
open-behind | Improves the application's ability to read data from a file by sending success notifications to the application whenever it receives an open call. | on, off | on
performance.io-thread-count | The number of threads in the IO threads translator. | 0 - 65 | 16
performance.cache-max-file-size | Sets the maximum file size cached by the io-cache translator. Can be specified using the normal size descriptors KB, MB, GB, TB, or PB (for example, 6GB). | Size in bytes, or specified using size descriptors. | 2^64 - 1 bytes
performance.cache-min-file-size | Sets the minimum file size cached by the io-cache translator. Can be specified using the normal size descriptors KB, MB, GB, TB, or PB (for example, 6GB). | Size in bytes, or specified using size descriptors. | 0
performance.cache-refresh-timeout | The number of seconds cached data for a file is retained. After this timeout, data re-validation is performed. | 0 - 61 seconds | 1 second
performance.cache-size | Size of the read cache. | Size in bytes, or specified using size descriptors. | 32 MB
performance.md-cache-timeout | The time period in seconds that controls when the metadata cache is refreshed. If the age of the cache is greater than this period, it is refreshed. Every time the cache is refreshed, its age is reset to 0. | 0 - 60 seconds | 1 second
performance.use-anonymous-fd | This option requires open-behind to be on. | Yes, No | Yes
performance.lazy-open | This option requires open-behind to be on. | Yes, No | Yes
rebal-throttle | The rebalance process is multithreaded to migrate multiple files in parallel for better performance. Because migrating multiple files can severely impact storage system performance, this throttling mechanism is provided to manage the impact. | lazy, normal, aggressive | normal
server.allow-insecure | Allows client connections from unprivileged ports. By default, only privileged ports are allowed. This is a global setting for allowing insecure ports for all exports using a single option. | on, off | off
server.root-squash | Prevents root users from having root privileges, and instead assigns them the privileges of nfsnobody. This squashes the power of the root users, preventing unauthorized modification of files on the GlusterFS servers. | on, off | off
server.anonuid | Value of the UID used for the anonymous user when root-squash is enabled. When root-squash is enabled, all requests received from the root UID (that is, 0) are changed to have the UID of the anonymous user. | 0 - 4294967295 | 65534 (this UID is also known as nfsnobody)
server.anongid | Value of the GID used for the anonymous user when root-squash is enabled. When root-squash is enabled, all requests received from the root GID (that is, 0) are changed to have the GID of the anonymous user. | 0 - 4294967295 | 65534 (this GID is also known as nfsnobody)
server.gid-timeout | The time period in seconds that controls when cached groups expire. This is the cache that contains the groups (GIDs) that a specified user (UID) belongs to. This option is used only when server.manage-gids is enabled. | 0 - 4294967295 seconds | 2 seconds
server.manage-gids | Resolve groups on the server side. By enabling this option, the groups (GIDs) a user (UID) belongs to are resolved on the server, instead of using the groups that were sent in the RPC call by the client. This makes it possible to apply permission checks for users that belong to larger group lists than the protocol supports (approximately 93). | on, off | off
server.statedump-path | Specifies the directory in which statedump files are stored. | Path to a directory | /var/run/gluster (for a default installation)
storage.health-check-interval | Sets the time interval in seconds for a filesystem health check. You can set it to 0 to disable. The POSIX translator on the bricks performs a periodic health check. If this check fails, the filesystem exported by the brick is no longer usable and the brick process (glusterfsd) logs a warning and exits. | 0 - 4294967295 seconds | 30 seconds
storage.owner-uid | Sets the UID for the bricks of the volume. This option may be required when some applications need the brick to have a specific UID to function correctly. Example: for QEMU integration, the UID/GID must be qemu:qemu, that is, 107:107 (107 is the UID and GID of qemu). | Any integer greater than or equal to -1. | -1 (the UID of the bricks is not changed)
storage.owner-gid | Sets the GID for the bricks of the volume. This option may be required when some applications need the brick to have a specific GID to function correctly. Example: for QEMU integration, the UID/GID must be qemu:qemu, that is, 107:107 (107 is the UID and GID of qemu). | Any integer greater than or equal to -1. | -1 (the GID of the bricks is not changed)
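The cluster.watermark-hi and cluster.watermark-low options split hot-tier fullness into three regimes. As a minimal sketch of that decision (an assumed interpretation of the table text, not gluster code), the hypothetical helper below maps a fullness percentage to the regime:

```shell
# hot_tier_action: given hot-tier fullness and the low/high watermarks (all in
# percent), report which regime applies: below low -> promote, at or above
# high -> demote, in between -> probabilistic promotion/demotion.
hot_tier_action() {
  used=$1; low=$2; high=$3
  if [ "$used" -lt "$low" ]; then
    echo "promote"
  elif [ "$used" -ge "$high" ]; then
    echo "demote"
  else
    echo "probabilistic"
  fi
}

hot_tier_action 50 75 90   # -> promote
hot_tier_action 95 75 90   # -> demote
hot_tier_action 80 75 90   # -> probabilistic
```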
Configuring Transport Types for a Volume
A volume can support one or more transport types for communication between clients and brick processes. Three transport configurations are supported: tcp, rdma, and tcp,rdma.
To change the supported transport types of a volume, follow the procedure:
-
Unmount the volume on all the clients using the following command:
# umount mount-point
-
Stop the volumes using the following command:
# gluster volume stop volname
-
Change the transport type. For example, to enable both tcp and rdma, execute the following command:
# gluster volume set volname config.transport tcp,rdma OR tcp OR rdma
-
Mount the volume on all the clients. For example, to mount using rdma transport, use the following command:
# mount -t glusterfs -o transport=rdma server1:/test-volume /mnt/glusterfs
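The steps above can be collected into a script. The sketch below is a dry run only: it prints the commands in order rather than executing them, and it assumes a particular server name and mount point. It also assumes the volume must be started again after the transport change and before remounting, a step the procedure above leaves implicit.

```shell
# plan_transport_change: hypothetical dry-run helper that prints the commands
# for switching a volume's transport type, in the order described above.
plan_transport_change() {
  volname=$1; transport=$2; mnt=$3
  echo "umount $mnt"
  echo "gluster volume stop $volname"
  echo "gluster volume set $volname config.transport $transport"
  echo "gluster volume start $volname"   # assumed: restart before remounting
  echo "mount -t glusterfs -o transport=rdma server1:/$volname $mnt"
}

plan_transport_change test-volume tcp,rdma /mnt/glusterfs
```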
Expanding Volumes
Volumes can be expanded while the trusted storage pool is online and available. For example, you can add a brick to a distributed volume, which increases distribution and adds capacity to the GlusterFS volume. Similarly, you can add a group of bricks to a replicated or distributed replicated volume, which increases the capacity of the GlusterFS volume.
Note
When expanding replicated or distributed replicated volumes, the number of bricks being added must be a multiple of the replica count. For example, to expand a distributed replicated volume with a replica count of 2, you need to add bricks in multiples of 2 (such as 4, 6, 8, etc.).
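The multiple-of-replica-count rule from the note above can be checked before running add-brick. The helper below is a hypothetical sketch, not a gluster command: it only verifies that the number of new bricks is a positive multiple of the replica count.

```shell
# check_expansion: validate a planned brick addition against the replica count.
check_expansion() {
  replica=$1; shift
  count=$#   # number of brick arguments
  if [ "$count" -gt 0 ] && [ $((count % replica)) -eq 0 ]; then
    echo "ok: adding $count bricks (replica $replica)"
  else
    echo "error: $count bricks is not a multiple of replica $replica"
  fi
}

check_expansion 2 server5:/rhgs5 server6:/rhgs6   # valid for replica 2
check_expansion 2 server5:/rhgs5                  # invalid: one brick only
```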
Important
Converting an existing distribute volume to replicate or distribute-replicate volume is not supported.
-
From any server in the trusted storage pool, use the following command to probe the server on which you want to add a new brick:
# gluster peer probe HOSTNAME
For example:
# gluster peer probe server5
Probe successful
# gluster peer probe server6
Probe successful
-
Add the bricks using the following command:
# gluster volume add-brick VOLNAME NEW_BRICK
For example:
# gluster volume add-brick test-volume server5:/rhgs5 server6:/rhgs6
Add Brick successful
-
Check the volume information using the following command:
# gluster volume info
The command output displays information similar to the following:
Volume Name: test-volume
Type: Distribute-Replicate
Status: Started
Number of Bricks: 6
Bricks:
Brick1: server1:/rhgs/brick1
Brick2: server2:/rhgs/brick2
Brick3: server3:/rhgs/brick3
Brick4: server4:/rhgs/brick4
Brick5: server5:/rhgs/brick5
Brick6: server6:/rhgs/brick6
-
Rebalance the volume to ensure that files will be distributed to the new brick. Use the rebalance command as described in Rebalancing Volumes.
The add-brick command should be followed by a rebalance operation to ensure better utilization of the added bricks.
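The add-brick/rebalance pairing recommended above can be sketched as a dry run; the hypothetical helper below prints the two commands rather than executing them, with the brick names as placeholders:

```shell
# plan_expand: print the add-brick command followed by the rebalance start
# command for the same volume, as recommended above.
plan_expand() {
  volname=$1; shift
  echo "gluster volume add-brick $volname $*"
  echo "gluster volume rebalance $volname start"
}

plan_expand test-volume server5:/rhgs5 server6:/rhgs6
```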
Expanding a Tiered Volume
You can add a group of bricks to a cold tier volume and to the hot tier volume to increase the capacity of the GlusterFS volume.
Expanding a Cold Tier Volume
Expanding a cold tier volume is the same as expanding a non-tiered volume. If you are reusing a brick, ensure that you perform the steps listed in the “Formatting and Mounting Bricks” section.
-
Detach the tier by performing the steps listed in Detaching a Tier from a Volume
-
From any server in the trusted storage pool, use the following command to probe the server on which you want to add a new brick:
# gluster peer probe HOSTNAME
For example:
# gluster peer probe server5
Probe successful
# gluster peer probe server6
Probe successful
-
Add the bricks using the following command:
# gluster volume add-brick VOLNAME NEW_BRICK
For example:
# gluster volume add-brick test-volume server5:/rhgs5 server6:/rhgs6
-
Rebalance the volume to ensure that files will be distributed to the new brick. Use the rebalance command as described in Rebalancing Volumes.
The add-brick command should be followed by a rebalance operation to ensure better utilization of the added bricks.
-
Reattach the tier to the volume with both old and new (expanded) bricks:
# gluster volume tier VOLNAME attach [replica COUNT] NEW-BRICK…
Important
When you reattach a tier, an internal process called fix-layout starts to prepare the hot tier for use. This process takes time, and there will be a delay before tiering activities start.
If you are reusing a brick, be sure to completely wipe the existing data before attaching it to the tiered volume.
Expanding a Hot Tier Volume
You can expand a hot tier volume by attaching and adding bricks for the hot tier.
-
Detach the tier by performing the steps listed in Detaching a Tier from a Volume
-
Reattach the tier to the volume with both old and new (expanded) bricks:
# gluster volume tier VOLNAME attach [replica COUNT] NEW-BRICK…
For example,
# gluster volume tier test-volume attach replica 2 server1:/rhgs5/tier5 server2:/rhgs6/tier6 server1:/rhgs7/tier7 server2:/rhgs8/tier8
Important
When you reattach a tier, an internal process called fix-layout starts to prepare the hot tier for use. This process takes time, and there will be a delay before tiering activities start.
If you are reusing a brick, be sure to completely wipe the existing data before attaching it to the tiered volume.
Shrinking Volumes
You can shrink volumes while the trusted storage pool is online and available. For example, you may need to remove a brick that has become inaccessible in a distributed volume because of a hardware or network failure.
Note
When shrinking distributed replicated volumes, the number of bricks being removed must be a multiple of the replica count. For example, to shrink a distributed replicated volume with a replica count of 2, you need to remove bricks in multiples of 2 (such as 4, 6, 8, etc.). In addition, the bricks you are removing must be from the same sub-volume (the same replica set). In a non-replicated volume, all bricks must be available in order to migrate data and perform the remove brick operation. In a replicated volume, at least one of the bricks in the replica must be available.
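The same-sub-volume requirement in the note above depends on knowing which replica set each brick belongs to. The sketch below assumes the common layout where bricks form consecutive groups of <replica> in listing order (as in the gluster volume info output), and labels each brick with its replica set so removals can be chosen from a single set; it is an illustrative helper, not a gluster command.

```shell
# list_replica_sets: label each brick with its replica set index, assuming
# consecutive groups of <replica> bricks in listing order form one set.
list_replica_sets() {
  replica=$1; shift
  i=0
  for brick in "$@"; do
    echo "replica-set $((i / replica)): $brick"
    i=$((i + 1))
  done
}

list_replica_sets 2 server1:/rhgs/brick1 server2:/rhgs/brick2 \
                    server3:/rhgs/brick3 server4:/rhgs/brick4
```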
-
Remove a brick using the following command:
# gluster volume remove-brick VOLNAME BRICK start
For example:
# gluster volume remove-brick test-volume server2:/rhgs/brick2 start
Remove Brick start successful
Note
If the remove-brick command is run with force or without any option, the data on the brick that you are removing will no longer be accessible at the glusterFS mount point. When using the start option, the data is migrated to other bricks, and on a successful commit the removed brick's information is deleted from the volume configuration. Data can still be accessed directly on the brick.
-
You can view the status of the remove brick operation using the following command:
# gluster volume remove-brick VOLNAME BRICK status
For example:
# gluster volume remove-brick test-volume server2:/rhgs/brick2 status
Node         Rebalanced-files   size       scanned   failures   status
---------    ---------------    --------   -------   --------   -----------
localhost    16                 16777216   52        0          in progress
192.168.1.1  13                 16723211   47        0          in progress
-
When the data migration shown in the previous status command is complete, run the following command to commit the brick removal:
# gluster volume remove-brick VOLNAME BRICK commit
For example,
# gluster volume remove-brick test-volume server2:/rhgs/brick2 commit
-
After the brick removal, you can check the volume information using the following command:
# gluster volume info
The command displays information similar to the following:
# gluster volume info
Volume Name: test-volume
Type: Distribute
Status: Started
Number of Bricks: 3
Bricks:
Brick1: server1:/rhgs/brick1
Brick3: server3:/rhgs/brick3
Brick4: server4:/rhgs/brick4
Shrinking a Geo-replicated Volume
-
Remove a brick using the following command:
# gluster volume remove-brick VOLNAME BRICK start
For example:
# gluster volume remove-brick MASTER_VOL MASTER_HOST:/rhgs/brick2 start
Remove Brick start successful
Note
If the remove-brick command is run with force or without any option, the data on the brick that you are removing will no longer be accessible at the glusterFS mount point. When using the start option, the data is migrated to other bricks, and on a successful commit the removed brick's information is deleted from the volume configuration. Data can still be accessed directly on the brick.
-
Use geo-replication config checkpoint to ensure that all the data in that brick is synced to the slave.
-
Set a checkpoint to help verify the status of the data synchronization.
# gluster volume geo-replication MASTER_VOL SLAVE_HOST::SLAVE_VOL config checkpoint now
-
Verify the checkpoint completion for the geo-replication session using the following command:
# gluster volume geo-replication MASTER_VOL SLAVE_HOST::SLAVE_VOL status detail
-
You can view the status of the remove brick operation using the following command:
# gluster volume remove-brick VOLNAME BRICK status
For example:
# gluster volume remove-brick MASTER_VOL MASTER_HOST:/rhgs/brick2 status
-
Stop the geo-replication session between the master and the slave:
# gluster volume geo-replication MASTER_VOL SLAVE_HOST::SLAVE_VOL stop
-
When the data migration shown in the previous status command is complete, run the following command to commit the brick removal:
# gluster volume remove-brick VOLNAME BRICK commit
For example,
# gluster volume remove-brick MASTER_VOL MASTER_HOST:/rhgs/brick2 commit
-
After the brick removal, you can check the volume information using the following command:
# gluster volume info
-
Start the geo-replication session between the hosts:
# gluster volume geo-replication MASTER_VOL SLAVE_HOST::SLAVE_VOL start
Shrinking a Tiered Volume
You can shrink a tiered volume while the trusted storage pool is online and available. For example, you may need to remove a brick that has become inaccessible because of a hardware or network failure.
Shrinking a Cold Tier Volume
-
Detach the tier by performing the steps listed in Detaching a Tier from a Volume
-
Remove a brick using the following command:
# gluster volume remove-brick VOLNAME BRICK start
For example:
# gluster volume remove-brick test-volume server2:/rhgs2 start
Remove Brick start successful
Note
If the remove-brick command is run with force or without any option, the data on the brick that you are removing will no longer be accessible at the glusterFS mount point. When using the start option, the data is migrated to other bricks, and on a successful commit the removed brick's information is deleted from the volume configuration. Data can still be accessed directly on the brick.
-
You can view the status of the remove brick operation using the following command:
# gluster volume remove-brick VOLNAME BRICK status
For example:
# gluster volume remove-brick test-volume server2:/rhgs2 status
Node         Rebalanced-files   size       scanned   failures   status
---------    ---------------    --------   -------   --------   -----------
localhost    16                 16777216   52        0          in progress
192.168.1.1  13                 16723211   47        0          in progress
-
When the data migration shown in the previous status command is complete, run the following command to commit the brick removal:
# gluster volume remove-brick VOLNAME BRICK commit
For example,
# gluster volume remove-brick test-volume server2:/rhgs2 commit
-
Rerun the attach-tier command only with the required set of bricks:
# gluster volume tier VOLNAME attach [replica COUNT] BRICK…
For example,
# gluster volume tier test-volume attach replica 2 server1:/rhgs1/tier1 server2:/rhgs2/tier2 server1:/rhgs3/tier3 server2:/rhgs5/tier5
Important
When you attach a tier, an internal process called fix-layout starts to prepare the hot tier for use. This process takes time, and there will be a delay before tiering activities start.
Shrinking a Hot Tier Volume
You must first decide which bricks should remain part of the hot tier volume and which should be removed from it.
-
Detach the tier by performing the steps listed in Detaching a Tier from a Volume
-
Rerun the attach-tier command only with the required set of bricks:
# gluster volume tier VOLNAME attach [replica COUNT] brick…
Important
When you reattach a tier, an internal process called fix-layout starts to prepare the hot tier for use. This process takes time, and there will be a delay before tiering activities start.
Stopping a remove-brick Operation
A remove-brick operation that is in progress can be stopped by using the stop command.
Note
Files that were already migrated during a remove-brick operation will not be migrated back to the same brick when the operation is stopped.
To stop a remove-brick operation, use the following command:
# gluster volume remove-brick VOLNAME BRICK stop
For example:
# gluster volume remove-brick di rhgs1:/brick1/di21 rhgs1:/brick1/di21 stop
Node        Rebalanced-files   size       scanned   failures   skipped   status        run-time in secs
----        ----------------   ------     -------   --------   -------   -----------   ----------------
localhost   23                 376Bytes   34        0          0         stopped       2.00
rhs1        0                  0Bytes     88        0          0         stopped       2.00
rhs2        0                  0Bytes     0         0          0         not started   0.00
'remove-brick' process may be in the middle of a file migration.
The process will be fully stopped once the migration of the file is complete.
Please check remove-brick process for completion before doing any further brick related tasks on the volume.
Migrating Volumes
Data can be redistributed across bricks while the trusted storage pool is online and available. Before replacing bricks on new servers, ensure that the new servers are successfully added to the trusted storage pool.
Note
Before performing a replace-brick operation, review the known issues related to the replace-brick operation in the GlusterFS 3.1 Release Notes.
Replacing a Subvolume on a Distribute or Distribute-replicate Volume
This procedure applies only when at least one brick from the subvolume to be replaced is online. In case of a Distribute volume, the brick to be replaced must be online. In case of a Distribute-replicate volume, at least one brick of the replica set to be replaced must be online.
To replace the entire subvolume with new bricks on a Distribute-replicate volume, follow these steps:
-
Add the new bricks to the volume.
# gluster volume add-brick VOLNAME [replica <COUNT>] NEW-BRICK
# gluster volume add-brick test-volume server5:/rhgs/brick5 Add Brick successful
-
Verify the volume information using the command:
# gluster volume info Volume Name: test-volume Type: Distribute Status: Started Number of Bricks: 5 Bricks: Brick1: server1:/rhgs/brick1 Brick2: server2:/rhgs/brick2 Brick3: server3:/rhgs/brick3 Brick4: server4:/rhgs/brick4 Brick5: server5:/rhgs/brick5
Note
In case of a Distribute-replicate volume, you must specify the replica count in the
add-brick
command and provide the same number of bricks as the replica count to theadd-brick
command. -
Remove the bricks to be replaced from the subvolume.
-
Start the
remove-brick
operation using the command:# gluster volume remove-brick VOLNAME [replica <COUNT>] <BRICK> start
# gluster volume remove-brick test-volume server2:/rhgs/brick2 start Remove Brick start successful
-
View the status of the
remove-brick
operation using the command:# gluster volume remove-brick VOLNAME [replica <COUNT>] BRICK status
# gluster volume remove-brick test-volume server2:/rhgs/brick2 status Node Rebalanced-files size scanned failures status ------------------------------------------------------------------ server2 16 16777216 52 0 in progress
Keep monitoring the
remove-brick
operation status by executing the above command. When the value of the status field is set tocomplete
in the output ofremove-brick
status command, proceed further. -
Commit the
remove-brick
operation using the command:# gluster volume remove-brick VOLNAME [replica <COUNT>] <BRICK> commit
# gluster volume remove-brick test-volume server2:/rhgs/brick2 commit
-
Verify the volume information using the command:
# gluster volume info Volume Name: test-volume Type: Distribute Status: Started Number of Bricks: 4 Bricks: Brick1: server1:/rhgs/brick1 Brick3: server3:/rhgs/brick3 Brick4: server4:/rhgs/brick4 Brick5: server5:/rhgs/brick5
-
Verify the content on the brick after committing the
remove-brick
operation on the volume. If any files are left over, copy them through a FUSE or NFS mount. -
Verify if there are any pending files on the bricks of the subvolume.
Along with the files, all application-specific extended attributes must be copied. glusterFS also uses extended attributes to store its internal data. The extended attributes used by glusterFS are of the form
trusted.glusterfs.
,trusted.afr.
, andtrusted.gfid
. Any extended attributes other than the ones listed above must also be copied. To copy the application-specific extended attributes and achieve an effect similar to the one described above, use the following shell script:
Syntax:
# copy.sh <glusterfs-mount-point> <brick>
If the mount point is
/mnt/glusterfs
and brick path is/rhgs/brick1
, then the script must be run as:# copy.sh /mnt/glusterfs /rhgs/brick1
#!/bin/bash
MOUNT=$1
BRICK=$2
for file in `find $BRICK ! -type d`; do
    rpath=`echo $file | sed -e "s#$BRICK\(.*\)#\1#g"`
    rdir=`dirname $rpath`
    cp -fv $file $MOUNT/$rdir
    for xattr in `getfattr -e hex -m. -d $file 2>/dev/null | sed -e '/^#/d' | grep -v -E "trusted.glusterfs.*" | grep -v -E "trusted.afr.*" | grep -v "trusted.gfid"`; do
        key=`echo $xattr | cut -d"=" -f 1`
        value=`echo $xattr | cut -d"=" -f 2`
        setfattr $MOUNT/$rpath -n $key -v $value
    done
done
-
To identify a list of files that are in a split-brain state, execute the command:
# gluster volume heal test-volume info split-brain
-
If there are any files listed in the output of the above command, compare the files across the bricks in a replica set, delete the bad files from the brick, and retain the correct copy of the file. Manual intervention by the system administrator is required to choose the correct copy of the file.
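Comparing copies across a replica set can start with checksums. The following is a minimal sketch; the two temporary files stand in for the same file's copy on each brick (real paths such as /rhgs/brick1/<path> on each replica server are assumptions):

```shell
#!/bin/sh
# Two stand-in files simulating the same file's copy on each brick;
# on real bricks you would checksum /rhgs/brickN/<path> on each server.
tmpdir=$(mktemp -d)
printf 'good data\n'      > "$tmpdir/brick1_copy"
printf 'divergent data\n' > "$tmpdir/brick2_copy"

# Matching checksums mean the copies agree; differing checksums mean the
# administrator must keep the correct copy and delete the bad one.
sum1=$(md5sum "$tmpdir/brick1_copy" | cut -d' ' -f1)
sum2=$(md5sum "$tmpdir/brick2_copy" | cut -d' ' -f1)
if [ "$sum1" = "$sum2" ]; then
    echo "copies match"
else
    echo "copies differ: manual resolution required"
fi
rm -rf "$tmpdir"
```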
Replacing an Old Brick with a New Brick on a Replicate or
Distribute-replicate Volume
A single brick can be replaced during a hardware failure situation, such as a disk failure or a server failure. The brick to be replaced can be either online or offline. This procedure applies to volumes with replication. For Replicate and Distribute-replicate volume types, after replacing the brick, self-heal is automatically triggered to heal the data on the new brick.
Procedure to replace an old brick with a new brick on a Replicate or Distribute-replicate volume:
-
Ensure that the new brick (
sys5:/rhgs/brick1
) that replaces the old brick (sys0:/rhgs/brick1
) is empty. Ensure that all the bricks are online. The brick that must be replaced can be in an offline state. -
Execute the
replace-brick
command with theforce
option:# gluster volume replace-brick r2 sys0:/rhgs/brick1 sys5:/rhgs/brick1 commit force volume replace-brick: success: replace-brick commit successful
-
Check if the new brick is online.
# gluster volume status Status of volume: r2 Gluster process Port Online Pid
Brick sys5:/rhgs/brick1 49156 Y 5731
Brick sys1:/rhgs/brick1 49153 Y 5354
Brick sys2:/rhgs/brick1 49154 Y 5365
Brick sys3:/rhgs/brick1 49155 Y 5376
-
Data on the newly added brick is automatically healed. This might take time depending on the amount of data to be healed. It is recommended to check the heal information after replacing a brick to make sure all the data has been healed before replacing or removing any other brick.
# gluster volume heal VOL_NAME info
For example:
# gluster volume heal test-volume info Brick server1:/rhgs/brick1 Status: Connected Number of entries: 0 Brick server1:/rhgs/brick2new Status: Connected Number of entries: 0 Brick server2:/rhgs/brick3 Status: Connected Number of entries: 0 Brick server2:/rhgs/brick4 Status: Connected Number of entries: 0 Brick server3:/rhgs/brick5 Status: Connected Number of entries: 0 Brick server3:/rhgs/brick6 Status: Connected Number of entries: 0
The value of the Number of entries field is displayed as zero if the heal is complete.
Replacing an Old Brick with a New Brick on a Distribute Volume
Important
In case of a Distribute volume type, replacing a brick using this procedure will result in data loss.
-
Replace a brick with the commit force option:
# gluster volume replace-brick VOLNAME <BRICK> <NEW-BRICK> commit force
For example:
# gluster volume replace-brick r2 sys0:/rhgs/brick1 sys5:/rhgs/brick1 commit force volume replace-brick: success: replace-brick commit successful
-
Verify that the new brick is online.
# gluster volume status Status of volume: r2 Gluster process Port Online Pid
Brick sys5:/rhgs/brick1 49156 Y 5731 Brick sys1:/rhgs/brick1 49153 Y 5354 Brick sys2:/rhgs/brick1 49154 Y 5365 Brick sys3:/rhgs/brick1 49155 Y 5376
Note
All the
replace-brick
command options except the commitforce
option are deprecated.
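The heal check recommended above can be automated by parsing the heal info output. A minimal sketch follows; the here-doc stands in for live `gluster volume heal VOLNAME info` output:

```shell
#!/bin/sh
# Sample heal info output; substitute live `gluster volume heal VOL info`
# output in practice.
heal_info=$(cat <<'EOF'
Brick server1:/rhgs/brick1
Status: Connected
Number of entries: 0
Brick server2:/rhgs/brick1
Status: Connected
Number of entries: 0
EOF
)
# Sum the per-brick entry counts; a total of zero means the heal is complete.
pending=$(printf '%s\n' "$heal_info" \
    | awk -F': ' '/^Number of entries/ {n += $2} END {print n + 0}')
echo "pending heal entries: $pending"
```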
Replacing an Old Brick with a New Brick on a Dispersed or
Distributed-dispersed Volume
A single brick can be replaced during a hardware failure situation, such as a disk failure or a server failure. The brick that must be replaced could either be online or offline but all other bricks must be online.
Procedure to replace an old brick with a new brick on a Dispersed or Distributed-dispersed volume:
-
Ensure that the new brick that replaces the old brick is empty. The brick that must be replaced can be in an offline state but all other bricks must be online.
-
Execute the replace-brick command with the
force
option:# gluster volume replace-brick VOL_NAME old_brick_path new_brick_path commit force
For example:
# gluster volume replace-brick test-volume server1:/rhgs/brick2 server1:/rhgs/brick2new commit force volume replace-brick: success: replace-brick commit successful
The new brick you are adding could be from the same server or you can add a new server and then a new brick.
-
Check if the new brick is online.
# gluster volume status Status of volume: test-volume Gluster process TCP Port RDMA Port Online Pid
Brick server1:/rhgs/brick1 49187 0 Y 19927 Brick server1:/rhgs/brick2new 49188 0 Y 19946 Brick server2:/rhgs/brick3 49189 0 Y 19965 Brick server2:/rhgs/brick4 49190 0 Y 19984 Brick server3:/rhgs/brick5 49191 0 Y 20003 Brick server3:/rhgs/brick6 49192 0 Y 20022 NFS Server on localhost N/A N/A N N/A Self-heal Daemon on localhost N/A N/A Y 20043
Task Status of Volume test-volume
There are no active volume tasks
-
Data on the newly added brick is automatically healed. This might take time depending on the amount of data to be healed. It is recommended to check the heal information after replacing a brick to make sure all the data has been healed before replacing or removing any other brick.
# gluster volume heal VOL_NAME info
For example:
# gluster volume heal test-volume info Brick server1:/rhgs/brick1 Status: Connected Number of entries: 0 Brick server1:/rhgs/brick2new Status: Connected Number of entries: 0 Brick server2:/rhgs/brick3 Status: Connected Number of entries: 0 Brick server2:/rhgs/brick4 Status: Connected Number of entries: 0 Brick server3:/rhgs/brick5 Status: Connected Number of entries: 0 Brick server3:/rhgs/brick6 Status: Connected Number of entries: 0
The value of the
Number of entries
field is displayed as zero if the heal is complete.
Replacing Hosts
Replacing a Host Machine with a Different Hostname
You can replace a failed host machine with another host that has a different hostname.
Important
Ensure that the new peer has exactly the same disk capacity as the one it is replacing. For example, if the peer in the cluster has two 100GB drives, then the new peer must have the same disk capacity and number of drives.
In the following example the original machine which has had an
irrecoverable failure is sys0.example.com
and the replacement machine
is sys5.example.com
. The brick with an unrecoverable failure is
sys0.example.com:/rhgs/brick1
and the replacement brick is
sys5.example.com:/rhgs/brick1
.
-
Stop the geo-replication session if configured by executing the following command:
# gluster volume geo-replication MASTER_VOL SLAVE_HOST::SLAVE_VOL stop force
-
Probe the new peer from one of the existing peers to bring it into the cluster.
# gluster peer probe sys5.example.com
-
Ensure that the new brick
(sys5.example.com:/rhgs/brick1)
that is replacing the old brick(sys0.example.com:/rhgs/brick1)
is empty. -
If the geo-replication session is configured, perform the following steps:
-
Setup the geo-replication session by generating the ssh keys:
# gluster system:: execute gsec_create
-
Create geo-replication session again with
force
option to distribute the keys from new nodes to Slave nodes.# gluster volume geo-replication MASTER_VOL SLAVE_HOST::SLAVE_VOL create push-pem force
-
After successfully setting up the shared storage volume, when a new node is replaced in the cluster, the shared storage is not mounted automatically on this node. Neither is the /etc/fstab entry added for the shared storage on this node. To make use of shared storage on this node, execute the following commands:
# mount -t glusterfs <local node's ip>:gluster_shared_storage /var/run/gluster/shared_storage # cp /etc/fstab /var/run/gluster/fstab.tmp # echo "<local node's ip>:/gluster_shared_storage /var/run/gluster/shared_storage/ glusterfs defaults 0 0" >> /etc/fstab
For more information on setting up shared storage volume, see Setting up Shared Storage Volume.
-
Configure the meta-volume for geo-replication:
# gluster volume geo-replication MASTER_VOL SLAVE_HOST::SLAVE_VOL config use_meta_volume true
For more information on configuring meta-volume, see Configuring a Meta-Volume.
-
Retrieve the brick paths in
sys0.example.com
using the following command:# gluster volume info <VOLNAME>
Volume Name: vol Type: Replicate Volume ID: 0xde822e25ebd049ea83bfaa3c4be2b440 Status: Started Snap Volume: no Number of Bricks: 1 x 2 = 2 Transport-type: tcp Bricks: Brick1: sys0.example.com:/rhgs/brick1 Brick2: sys1.example.com:/rhgs/brick1 Options Reconfigured: performance.readdir-ahead: on snap-max-hard-limit: 256 snap-max-soft-limit: 90 auto-delete: disable
Brick path in
sys0.example.com
is/rhgs/brick1
. This has to be replaced with the brick in the newly added host,sys5.example.com
. -
Create the required brick path on sys5.example.com. For example, if /rhgs/brick is the XFS mount point on sys5.example.com, create a brick directory in that path.
# mkdir /rhgs/brick1
-
Execute the
replace-brick
command with the force option:# gluster volume replace-brick vol sys0.example.com:/rhgs/brick1 sys5.example.com:/rhgs/brick1 commit force volume replace-brick: success: replace-brick commit successful
-
Verify that the new brick is online.
# gluster volume status Status of volume: vol Gluster process Port Online Pid Brick sys5.example.com:/rhgs/brick1 49156 Y 5731 Brick sys1.example.com:/rhgs/brick1 49153 Y 5354
-
Initiate self-heal on the volume. The status of the heal process can be seen by executing the command:
# gluster volume heal VOLNAME
-
The status of the heal process can be seen by executing the command:
# gluster volume heal VOLNAME info
-
Detach the original machine from the trusted pool.
# gluster peer detach sys0.example.com
-
Ensure that after the self-heal completes, the extended attributes are set to zero on the other bricks in the replica.
# getfattr -d -m. -e hex /rhgs/brick1 getfattr: Removing leading '/' from absolute path names #file: rhgs/brick1 security.selinux=0x756e636f6e66696e65645f753a6f626a6563745f723a66696c655f743a733000 trusted.afr.vol-client-0=0x000000000000000000000000 trusted.afr.vol-client-1=0x000000000000000000000000 trusted.gfid=0x00000000000000000000000000000001 trusted.glusterfs.dht=0x0000000100000000000000007ffffffe trusted.glusterfs.volume-id=0xde822e25ebd049ea83bfaa3c4be2b440
In this example, the extended attributes
trusted.afr.vol-client-0
andtrusted.afr.vol-client-1
have zero values. This means that the data on the two bricks is identical. If these attributes are not zero after self-heal is completed, the data has not been synchronised correctly. -
Start the geo-replication session using
force
option:# gluster volume geo-replication MASTER_VOL SLAVE_HOST::SLAVE_VOL start force
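The extended-attribute verification described above can also be scripted. The following is a minimal sketch that scans saved getfattr output (the here-doc below; capture live `getfattr -d -m. -e hex /rhgs/brick1` output in practice) for nonzero trusted.afr.* values:

```shell
#!/bin/sh
# Sample getfattr output for a fully healed brick.
xattrs=$(cat <<'EOF'
trusted.afr.vol-client-0=0x000000000000000000000000
trusted.afr.vol-client-1=0x000000000000000000000000
trusted.gfid=0x00000000000000000000000000000001
EOF
)
# A trusted.afr.* value containing any nonzero hex digit means pending
# changes, i.e. the data has not been synchronised correctly.
nonzero=$(printf '%s\n' "$xattrs" \
    | grep '^trusted\.afr\.' \
    | grep -c '0x0*[1-9a-f]' || true)
echo "afr attributes with pending changes: $nonzero"
```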
Replacing a Host Machine with the Same Hostname
You can replace a failed host with another node having the same FQDN
(Fully Qualified Domain Name). A host in a GlusterFS
Trusted Storage Pool has its own identity, called the UUID, generated by
the glusterFS Management Daemon. The UUID for the host is available in the
/var/lib/glusterd/glusterd.info
file.
In the following example, the host with the FQDN sys0.example.com was irrecoverable and must be replaced with a host having the same FQDN. The following steps must be performed on the new host.
-
Stop the geo-replication session if configured by executing the following command:
# gluster volume geo-replication MASTER_VOL SLAVE_HOST::SLAVE_VOL stop force
-
Stop the
glusterd
service on the sys0.example.com.# service glusterd stop
-
Retrieve the UUID of the failed host (sys0.example.com) from another peer in the GlusterFS Trusted Storage Pool by executing the following command:
# gluster peer status Number of Peers: 2 Hostname: sys1.example.com Uuid: 1d9677dc-6159-405e-9319-ad85ec030880 State: Peer in Cluster (Connected) Hostname: sys0.example.com Uuid: b5ab2ec3-5411-45fa-a30f-43bd04caf96b State: Peer Rejected (Connected)
Note that the UUID of the failed host is
b5ab2ec3-5411-45fa-a30f-43bd04caf96b
-
Edit the
glusterd.info
file in the new host and include the UUID of the host you retrieved in the previous step.# cat /var/lib/glusterd/glusterd.info UUID=b5ab2ec3-5411-45fa-a30f-43bd04caf96b operating-version=30703
Note
The operating version of this node must be the same as in other nodes of the trusted storage pool.
-
Select any host (say for example, sys1.example.com) in the GlusterFS Trusted Storage Pool and retrieve its UUID from the
glusterd.info
file.# grep -i uuid /var/lib/glusterd/glusterd.info UUID=8cc6377d-0153-4540-b965-a4015494461c
-
Gather the peer information files from the host (sys1.example.com) in the previous step. Execute the following command in that host (sys1.example.com) of the cluster.
# cp -a /var/lib/glusterd/peers /tmp/
-
Remove the peer file corresponding to the failed host (sys0.example.com) from the
/tmp/peers
directory.# rm /tmp/peers/b5ab2ec3-5411-45fa-a30f-43bd04caf96b
Note that the UUID corresponds to the UUID of the failed host (sys0.example.com) retrieved in Step 3.
-
Archive all the files and copy them to the failed host (sys0.example.com).
# cd /tmp; tar -cvf peers.tar peers
-
Copy the above created file to the new peer.
# scp /tmp/peers.tar [email protected]:/tmp
-
Copy the extracted content to the
/var/lib/glusterd/peers
directory. Execute the following command in the newly added host with the same name (sys0.example.com) and IP Address.# tar -xvf /tmp/peers.tar # cp peers/* /var/lib/glusterd/peers/
-
Select any other host in the cluster other than the node (sys1.example.com) selected in step 5. Copy the peer file corresponding to the UUID of the host retrieved in Step 4 to the new host (sys0.example.com) by executing the following command:
# scp /var/lib/glusterd/peers/<UUID-retrieved-from-step4> root@Example1:/var/lib/glusterd/peers/
-
Retrieve the brick directory information, by executing the following command in any host in the cluster.
# gluster volume info Volume Name: vol Type: Replicate Volume ID: 0x8f16258c88a0498fbd53368706af7496 Status: Started Snap Volume: no Number of Bricks: 1 x 2 = 2 Transport-type: tcp Bricks: Brick1: sys0.example.com:/rhgs/brick1 Brick2: sys1.example.com:/rhgs/brick1 Options Reconfigured: performance.readdir-ahead: on snap-max-hard-limit: 256 snap-max-soft-limit: 90 auto-delete: disable
In the above example, the brick path in sys0.example.com is,
/rhgs/brick1
. If the brick path does not exist in sys0.example.com, perform steps a, b, and c. -
Create a brick path in the host, sys0.example.com.
# mkdir /rhgs/brick1
-
Retrieve the volume ID from the existing brick of another host by executing the following command on any host that contains the bricks for the volume.
# getfattr -d -m. -ehex <brick-path>
Copy the volume-id.
# getfattr -d -m. -ehex /rhgs/brick1 getfattr: Removing leading '/' from absolute path names # file: rhgs/brick1 trusted.afr.vol-client-0=0x000000000000000000000000 trusted.afr.vol-client-1=0x000000000000000000000000 trusted.gfid=0x00000000000000000000000000000001 trusted.glusterfs.dht=0x0000000100000000000000007ffffffe trusted.glusterfs.volume-id=0x8f16258c88a0498fbd53368706af7496
In the above example, the volume id is 0x8f16258c88a0498fbd53368706af7496
-
Set this volume ID on the brick created in the newly added host by executing the following command on the newly added host (sys0.example.com).
# setfattr -n trusted.glusterfs.volume-id -v <volume-id> <brick-path>
For Example:
# setfattr -n trusted.glusterfs.volume-id -v 0x8f16258c88a0498fbd53368706af7496 /rhgs/brick1
Data recovery is possible only if the volume type is replicate or distribute-replicate. If the volume type is plain distribute, you can skip steps 12 and 13.
-
Create a FUSE mount point to mount the glusterFS volume.
# mount -t glusterfs <server-name>:/VOLNAME <mount>
-
Perform the following operations to change the Automatic File Replication extended attributes so that the heal process happens from the other brick (sys1.example.com:/rhgs/brick1) in the replica pair to the new brick (sys0.example.com:/rhgs/brick1). Note that /mnt/r2 is the FUSE mount path.
-
Create a new directory on the mount point and ensure that a directory with such a name is not already present.
# mkdir /mnt/r2/<name-of-nonexistent-dir>
-
Delete the directory and set the extended attributes.
# rmdir /mnt/r2/<name-of-nonexistent-dir> # setfattr -n trusted.non-existent-key -v abc /mnt/r2 # setfattr -x trusted.non-existent-key /mnt/r2
-
Ensure that the extended attributes on the other bricks in the replica (in this example,
trusted.afr.vol-client-0
) are not set to zero.# getfattr -d -m. -e hex /rhgs/brick1 # file: rhgs/brick1 security.selinux=0x756e636f6e66696e65645f753a6f626a6563745f723a66696c655f743a733000 trusted.afr.vol-client-0=0x000000000000000300000002 trusted.afr.vol-client-1=0x000000000000000000000000 trusted.gfid=0x00000000000000000000000000000001 trusted.glusterfs.dht=0x0000000100000000000000007ffffffe trusted.glusterfs.volume-id=0x8f16258c88a0498fbd53368706af7496
Note
Ensure that you perform steps 12, 13, and 14 for all the volumes having bricks from
sys0.example.com
. -
Start the
glusterd
service.# service glusterd start
-
Perform the self-heal operation on the restored volume.
# gluster volume heal VOLNAME
-
You can view the gluster volume self-heal status by executing the following command:
# gluster volume heal VOLNAME info
-
If the geo-replication session is configured, perform the following steps:
-
Setup the geo-replication session by generating the ssh keys:
# gluster system:: execute gsec_create
-
Create geo-replication session again with
force
option to distribute the keys from new nodes to Slave nodes.# gluster volume geo-replication MASTER_VOL SLAVE_HOST::SLAVE_VOL create push-pem force
-
After successfully setting up the shared storage volume, when a new node is replaced in the cluster, the shared storage is not mounted automatically on this node. Neither is the `/etc/fstab ` entry added for the shared storage on this node. To make use of shared storage on this node, execute the following commands:
# mount -t glusterfs <local node's ip>:gluster_shared_storage /var/run/gluster/shared_storage # cp /etc/fstab /var/run/gluster/fstab.tmp # echo "<local node's ip>:/gluster_shared_storage /var/run/gluster/shared_storage/ glusterfs defaults 0 0" >> /etc/fstab
For more information on setting up shared storage volume, see Setting up Shared Storage Volume.
-
Configure the meta-volume for geo-replication:
# gluster volume geo-replication MASTER_VOL SLAVE_HOST::SLAVE_VOL config use_meta_volume true
-
Start the geo-replication session using
force
option:# gluster volume geo-replication MASTER_VOL SLAVE_HOST::SLAVE_VOL start force
Replacing a Host with the Same Hostname in a Two-node GlusterFS Trusted Storage Pool
If there are only two hosts in the GlusterFS Trusted Storage Pool where the host sys0.example.com must be replaced, perform the following steps:
-
Stop the geo-replication session if configured by executing the following command:
# gluster volume geo-replication MASTER_VOL SLAVE_HOST::SLAVE_VOL stop force
-
Stop the
glusterd
service on sys0.example.com.# service glusterd stop
-
Retrieve the UUID of the failed host (sys0.example.com) from another peer in the GlusterFS Trusted Storage Pool by executing the following command:
# gluster peer status Number of Peers: 1 Hostname: sys0.example.com Uuid: b5ab2ec3-5411-45fa-a30f-43bd04caf96b State: Peer Rejected (Connected)
Note that the UUID of the failed host is
b5ab2ec3-5411-45fa-a30f-43bd04caf96b
-
Edit the
glusterd.info
file in the new host (sys0.example.com) and include the UUID of the host you retrieved in the previous step.# cat /var/lib/glusterd/glusterd.info UUID=b5ab2ec3-5411-45fa-a30f-43bd04caf96b operating-version=30703
Note
The operating version of this node must be the same as in other nodes of the trusted storage pool.
-
Create the peer file in the newly created host (sys0.example.com) in /var/lib/glusterd/peers/<uuid-of-other-peer> with the name of the UUID of the other host (sys1.example.com).
The UUID of the host can be obtained with the following command:
# gluster system:: uuid get
For example: # gluster system:: uuid get UUID: 1d9677dc-6159-405e-9319-ad85ec030880
In this case the UUID of other peer is
1d9677dc-6159-405e-9319-ad85ec030880
-
Create a file
/var/lib/glusterd/peers/1d9677dc-6159-405e-9319-ad85ec030880
in sys0.example.com, with the following command:# touch /var/lib/glusterd/peers/1d9677dc-6159-405e-9319-ad85ec030880
The file you create must contain the following information:
UUID=<uuid-of-other-node> state=3 hostname=<hostname>
-
Continue to perform steps 12 to 18 as documented in the previous procedure.
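The peer-file creation in the last procedure can be sketched as follows. The UUID and hostname are the example values from above, and this sketch writes to a temporary directory rather than the real /var/lib/glusterd/peers/:

```shell
#!/bin/sh
# Example values from the procedure above; substitute the surviving
# peer's UUID and hostname.
peer_uuid='1d9677dc-6159-405e-9319-ad85ec030880'
peer_host='sys1.example.com'

# Write to a temp dir here; the real target is /var/lib/glusterd/peers/.
peers_dir=$(mktemp -d)
printf 'UUID=%s\nstate=3\nhostname=%s\n' "$peer_uuid" "$peer_host" \
    > "$peers_dir/$peer_uuid"
cat "$peers_dir/$peer_uuid"
```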
Rebalancing Volumes
If a volume has been expanded or shrunk using the add-brick
or
remove-brick
commands, the data on the volume needs to be rebalanced
among the servers.
Note
In a non-replicated volume, all bricks should be online to perform the
rebalance
operation using the start option. In a replicated volume, at least one of the bricks in the replica should be online.
To rebalance a volume, use the following command on any of the servers:
# gluster volume rebalance VOLNAME start
For example:
# gluster volume rebalance test-volume start Starting rebalancing on volume test-volume has been successful
A rebalance operation without the force option attempts to balance the space utilized across nodes. It skips a file when migrating it would leave the target node with less available space than the source node. This leaves link files behind in the system, which may cause performance issues on access when a large number of such link files are present.
If clients older than GlusterFS-2.1 update 5 are connected to the volume, the rebalance command fails with the following error:
volume rebalance: VOLNAME: failed: Volume VOLNAME has one or more connected clients of a version lower than GlusterFS-2.1 update 5. Starting rebalance in this state could lead to data loss. Please disconnect those clients before attempting this command again.
It is strongly recommended to disconnect all older clients before executing the rebalance command to avoid potential data loss.
Warning
The
Rebalance
command can be executed with the force option even when the older clients are connected to the cluster. However, this could lead to a data loss situation.
A rebalance operation with the force option balances the data based on the layout, and hence optimizes or removes the link files, but it may lead to imbalanced storage space usage across bricks. Use this option only when there are a large number of link files in the system.
To rebalance a volume forcefully, use the following command on any of the servers:
# gluster volume rebalance VOLNAME start force
For example:
# gluster volume rebalance test-volume start force Starting rebalancing on volume test-volume has been successful
Rebalance Throttling
The rebalance process is multithreaded so that multiple files can be migrated in parallel for better performance. Because migrating many files at once can severely impact storage system performance, a throttling mechanism is provided to manage the load.
By default, the rebalance throttling is started in the normal
mode.
Configure the throttling mode to adjust the rate at which files are migrated:
# gluster volume set VOLNAME rebal-throttle lazy|normal|aggressive
For example:
# gluster volume set test-volume rebal-throttle lazy
Displaying Status of a Rebalance Operation
To display the status of a volume rebalance operation, use the following command:
# gluster volume rebalance VOLNAME status
For example:
# gluster volume rebalance test-volume status Node Rebalanced-files size scanned failures status --------- ----------- ----------- ----------- ----------- ------------ localhost 112 14567 150 0 in progress 10.16.156.72 140 2134 201 2 in progress
The time taken to complete the rebalance operation depends on the number of files on the volume and their size. Continue to check the rebalancing status, and verify that the number of rebalanced or scanned files keeps increasing.
For example, running the status command again might display a result similar to the following:
# gluster volume rebalance test-volume status Node Rebalanced-files size scanned failures status --------- ----------- ----------- ----------- ----------- ------------ localhost 112 14567 150 0 in progress 10.16.156.72 140 2134 201 2 in progress
The rebalance status is shown as completed
when the rebalance is complete:
# gluster volume rebalance test-volume status Node Rebalanced-files size scanned failures status --------- ----------- ----------- ----------- ----------- ------------ localhost 112 15674 170 0 completed 10.16.156.72 140 3423 321 2 completed
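Whether every node has finished can be checked mechanically from the status output. A minimal sketch; the here-doc holds the per-node rows from the completed example above:

```shell
#!/bin/sh
# Per-node rows from `gluster volume rebalance VOLNAME status`;
# capture the live output in practice.
rebal_rows=$(cat <<'EOF'
localhost 112 15674 170 0 completed
10.16.156.72 140 3423 321 2 completed
EOF
)
# Count rows whose status is not "completed"; zero means the rebalance
# has finished on every node.
remaining=$(printf '%s\n' "$rebal_rows" | grep -vc 'completed$' || true)
echo "nodes still rebalancing: $remaining"
```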
Stopping a Rebalance Operation
To stop a rebalance operation, use the following command:
# gluster volume rebalance VOLNAME stop
For example:
# gluster volume rebalance test-volume stop Node Rebalanced-files size scanned failures status --------- ----------- ----------- ----------- ----------- ------------ localhost 102 12134 130 0 stopped 10.16.156.72 110 2123 121 2 stopped Stopped rebalance process on volume test-volume
Setting up Shared Storage Volume
Features like Snapshot Scheduler, NFS Ganesha and geo-replication
require shared storage to be available across all nodes of the
cluster. A gluster volume named gluster_shared_storage
is made
available for this purpose, and is facilitated by the following volume
set option.
cluster.enable-shared-storage
This option accepts the following two values:
-
enable.
When the volume set option is enabled, a gluster volume named
gluster_shared_storage
is created in the cluster, and is mounted at/var/run/gluster/shared_storage
on all the nodes in the cluster.Note
-
This option cannot be enabled if there is only one node present in the cluster, or if only one node is online in the cluster.
-
The volume created is either a replica 2, or a replica 3 volume. This depends on the number of nodes which are online in the cluster at the time of enabling this option and each of these nodes will have one brick participating in the volume. The brick path participating in the volume is
/var/lib/glusterd/ss_brick.
-
The mount entry is also added to
/etc/fstab
as part ofenable
. -
Before enabling this feature make sure that there is no volume named
gluster_shared_storage
in the cluster. This volume name is reserved for internal use only.
After successfully setting up the shared storage volume, when a new node is added to the cluster, the shared storage is not mounted automatically on this node. Neither is the
/etc/fstab
entry added for the shared storage on this node. To make use of shared storage on this node, execute the following commands:# mount -t glusterfs <local node's ip>:gluster_shared_storage /var/run/gluster/shared_storage # cp /etc/fstab /var/run/gluster/fstab.tmp # echo "<local node's ip>:/gluster_shared_storage /var/run/gluster/shared_storage/ glusterfs defaults 0 0" >> /etc/fstab
-
-
disable.
When the volume set option is disabled, the
gluster_shared_storage
volume is unmounted on all the nodes in the cluster, and then the volume is deleted. The mount entry from/etc/fstab
as part ofdisable
is also removed.
For example:
# gluster volume set all cluster.enable-shared-storage enable volume set: success
Stopping Volumes
To stop a volume, use the following command:
# gluster volume stop VOLNAME
For example, to stop test-volume:
# gluster volume stop test-volume Stopping volume will make its data inaccessible. Do you want to continue? (y/n) y Stopping volume test-volume has been successful
Deleting Volumes
Important
Volumes must be unmounted and stopped before you can delete them. Ensure that you also remove entries relating to this volume from the
/etc/fstab
file after the volume has been deleted.
To delete a volume, use the following command:
# gluster volume delete VOLNAME
For example, to delete test-volume:
# gluster volume delete test-volume Deleting volume will erase all information about the volume. Do you want to continue? (y/n) y Deleting volume test-volume has been successful
Managing Split-brain
Split-brain is a state of data or availability inconsistency that originates from maintaining two separate data sets with overlapping scope, either by design, with servers in separate parts of a network, or as a failure condition in which servers stop communicating and synchronizing their data with each other.
In GlusterFS, split-brain is a term applicable to GlusterFS volumes in a replicate configuration. A file is said to be in split-brain when the copies of the same file on the different bricks that constitute the replica pair have mismatching data and/or metadata contents that conflict with each other, so that automatic healing is not possible. In this scenario, you can decide which is the correct file (source) and which is the one that requires healing (sink) by inspecting the mismatching files from the backend bricks.
The AFR translator in GlusterFS makes use of extended attributes to keep track of the operations on a file. These attributes determine which brick is the source and which brick is the sink for a file that requires healing. If the files are clean, the extended attributes are all zeroes, indicating that no heal is necessary. When a heal is required, they are marked in such a way that there is a distinguishable source and sink and the heal can happen automatically. However, when a split-brain occurs, these extended attributes are marked in such a way that both bricks mark themselves as sources, making automatic healing impossible.
When a split-brain occurs, applications cannot perform certain operations like read and write on the file. Accessing the files results in the application receiving an Input/Output Error.
The three types of split-brains that occur in GlusterFS are:
-
Data split-brain: Contents of the file under split-brain are different in different replica pairs and automatic healing is not possible.
-
Metadata split-brain: The metadata of the file (for example, user-defined extended attributes) differs across the replica pair and automatic healing is not possible.
-
Entry split-brain: This happens when a file has a different gfid on each side of the replica pair.
The only way to resolve split-brain is by manually inspecting the file contents from the backend, deciding which is the true copy (source), and modifying the appropriate extended attributes so that healing can happen automatically.
Preventing Split-brain
To prevent split-brain in the trusted storage pool, you must configure server-side and client-side quorum.
Configuring Server-Side Quorum
The quorum configuration in a trusted storage pool determines the number of server failures that the trusted storage pool can sustain. If an additional failure occurs, the trusted storage pool will become unavailable. If too many server failures occur, or if there is a problem with communication between the trusted storage pool nodes, it is essential that the trusted storage pool be taken offline to prevent data loss.
After configuring the quorum ratio at the trusted storage pool level,
you must enable the quorum on a particular volume by setting
cluster.server-quorum-type
volume option as server
. For more
information on this volume option, see Configuring Volume Options.
Configuration of the quorum is necessary to prevent network partitions in the trusted storage pool. A network partition is a scenario in which a small set of nodes might be able to communicate together across a functioning part of a network, but not be able to communicate with a different set of nodes in another part of the network. This can cause undesirable situations, such as split-brain in a distributed system. To prevent a split-brain situation, all the nodes in at least one of the partitions must stop running to avoid inconsistencies.
This quorum is on the server-side, that is, the glusterd
service.
Whenever the glusterd
service on a machine observes that the quorum is
not met, it brings down the bricks to prevent data split-brain. When the
network connections are brought back up and the quorum is restored, the
bricks in the volume are brought back up. When the quorum is not met for
a volume, commands that update the volume configuration, or that add or
detach peers, are not allowed. Note that the
glusterd
service not running and the network connection between two
machines being down are treated equally.
You can configure the quorum percentage ratio for a trusted storage pool. If the percentage ratio of the quorum is not met due to network outages, the bricks of the volume participating in the quorum in those nodes are taken offline. By default, the quorum is met if the percentage of active nodes is more than 50% of the total storage nodes. However, if the quorum ratio is manually configured, then the quorum is met only if the percentage of active storage nodes of the total storage nodes is greater than or equal to the set value.
To configure the quorum ratio, use the following command:
# gluster volume set all cluster.server-quorum-ratio PERCENTAGE
For example, to set the quorum to 51% of the trusted storage pool:
# gluster volume set all cluster.server-quorum-ratio 51%
In this example, the quorum ratio setting of 51% means that more than half of the nodes in the trusted storage pool must be online and have network connectivity between them at any given time. If a network disconnect happens to the storage pool, then the bricks running on those nodes are stopped to prevent further writes.
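The ratio rule above can be sketched as a small check. This is illustrative Python, not GlusterFS code; the function name and arguments are invented for the example, with the default case following the "more than 50%" rule and a configured ratio using greater-than-or-equal:

```python
def server_quorum_met(active_nodes, total_nodes, ratio_percent=None):
    # Default rule: quorum is met when active nodes are strictly more
    # than 50% of the total. A manually configured ratio uses >= instead.
    percent = active_nodes / total_nodes * 100
    if ratio_percent is None:
        return percent > 50.0
    return percent >= ratio_percent

print(server_quorum_met(2, 4))        # False: 50% is not more than 50%
print(server_quorum_met(1, 2, 51.0))  # False: a lone node in a 2-node pool
print(server_quorum_met(2, 2, 51.0))  # True
```

This is why a 51% ratio is recommended for small pools: with the default rule applied to two nodes, a single surviving node (exactly 50%) can never hold quorum on its own.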
You must ensure to enable the quorum on a particular volume to participate in the server-side quorum by running the following command:
# gluster volume set VOLNAME cluster.server-quorum-type server
Important
For a two-node trusted storage pool, it is important to set the quorum ratio to be greater than 50% so that two nodes separated from each other do not both believe they have a quorum.
For a replicated volume with two nodes and one brick on each machine, if the server-side quorum is enabled and one of the nodes goes offline, the other node will also be taken offline because of the quorum configuration. As a result, the high availability provided by the replication is ineffective. To prevent this situation, a dummy node can be added to the trusted storage pool which does not contain any bricks. This ensures that even if one of the nodes which contains data goes offline, the other node will remain online. Note that if the dummy node and one of the data nodes go offline, the brick on the other node will also be taken offline, and this will result in data unavailability.
Configuring Client-Side Quorum
Replication in GlusterFS Server allows modifications as long as at least one of the bricks in a replica group is online. In a network-partition scenario, different clients connect to different bricks in the replicated environment, and different clients may modify the same file on different bricks. When a client experiences brick disconnections, a file could be modified on different bricks at different times while the other brick in the replica is offline. For example, in a 1 x 2 replicate volume, while modifying the same file, it can happen that client C1 can connect only to brick B1 and client C2 can connect only to brick B2. These situations lead to split-brain; the file becomes unusable and manual intervention is required to fix the issue.
Client-side quorum is implemented to minimize split-brains. Client-side
quorum configuration determines the number of bricks that must be up for
it to allow data modification. If client-side quorum is not met, files
in that replica group become read-only. This client-side quorum
configuration applies to all the replica groups in the volume; if
client-side quorum is not met for m
of n
replica groups, only those m
replica groups become read-only and the rest of the replica groups
continue to allow data modifications.
For example, when the client-side quorum is not met for
replica group A
, only replica group A
becomes read-only. Replica
groups B
and C
continue to allow data modifications.
Important
If
cluster.quorum-type
is fixed
, writes continue as long as the number of bricks up and running in the replica pair is at least the count specified in the cluster.quorum-count
option. This is irrespective of whether it is the first, second, or third brick; all the bricks are equivalent here.
If
cluster.quorum-type
is auto
, then at least ceil(n/2) bricks need to be up to allow writes, where n
is the replica count. For example:
for replica 2, ceil(2/2) = 1 brick
for replica 3, ceil(3/2) = 2 bricks
for replica 4, ceil(4/2) = 2 bricks
for replica 5, ceil(5/2) = 3 bricks
for replica 6, ceil(6/2) = 3 bricks
and so on.
In addition, for auto
, if the number of bricks that are up is exactly ceil(n/2) and n
is an even number, then the first brick of the replica must also be up to allow writes. For replica 6, if more than 3 bricks are up, they can be any of the bricks; but if exactly 3 bricks are up, the first brick must be up and running.
In a three-way replication setup, it is recommended to set
cluster.quorum-type
to auto
to avoid split-brains. If the quorum is not met, the replica pair becomes read-only.
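The auto quorum rules above, including the first-brick tie-breaker for even replica counts, can be sketched as follows (illustrative Python, not GlusterFS code; brick indices are 1-based, with brick 1 as the "first" brick of the replica group):

```python
import math

def writes_allowed(up_bricks, replica_count):
    # up_bricks: set of 1-based indices of bricks that are up.
    need = math.ceil(replica_count / 2)
    if len(up_bricks) > need:
        return True
    if len(up_bricks) < need:
        return False
    # Exactly ceil(n/2) bricks are up: for an even replica count,
    # the first brick must be among them to allow writes.
    if replica_count % 2 == 0:
        return 1 in up_bricks
    return True

# replica 2: brick 2 alone meets ceil(2/2)=1, but the first brick is down
print(writes_allowed({2}, 2))      # False
print(writes_allowed({1}, 2))      # True
# replica 3: any 2 of the 3 bricks allow writes
print(writes_allowed({2, 3}, 3))   # True
```

The first-brick rule breaks the tie for even replica counts: without it, two halves of a partitioned even-sized replica could each believe they hold quorum.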
Configure the client-side quorum using cluster.quorum-type
and
cluster.quorum-count
options. For more information on these options,
see Configuring Volume Options.
Important
When you integrate GlusterFS with Red Hat Enterprise Virtualization or Red Hat OpenStack, the client-side quorum is enabled when you run
gluster volume set VOLNAME group virt
command. On a two-replica setup, if the first brick in the replica pair is offline, virtual machines are paused because quorum is not met and writes are disallowed.
Consistency is achieved at the cost of fault tolerance. If fault tolerance is preferred over consistency, disable client-side quorum with the following command:
# gluster volume reset VOLNAME cluster.quorum-type
Example - Setting up server-side and client-side quorum to avoid split-brain scenario.
This example provides information on how to set server-side and client-side quorum on a Distributed-Replicate volume to avoid a split-brain scenario. The configuration in this example is a 2 x 2 (4 bricks) Distributed-Replicate setup.
# gluster volume info testvol Volume Name: testvol Type: Distributed-Replicate Volume ID: 0df52d58-bded-4e5d-ac37-4c82f7c89cfh Status: Created Number of Bricks: 2 x 2 = 4 Transport-type: tcp Bricks: Brick1: server1:/rhgs/brick1 Brick2: server2:/rhgs/brick2 Brick3: server3:/rhgs/brick3 Brick4: server4:/rhgs/brick4
Setting Server-side Quorum
Enable the quorum on a particular volume to participate in the server-side quorum by running the following command:
# gluster volume set VOLNAME cluster.server-quorum-type server
Set the quorum to 51% of the trusted storage pool:
# gluster volume set all cluster.server-quorum-ratio 51%
In this example, the quorum ratio setting of 51% means that more than half of the nodes in the trusted storage pool must be online and have network connectivity between them at any given time. If a network disconnect happens to the storage pool, then the bricks running on those nodes are stopped to prevent further writes.
Setting Client-side Quorum
Set the cluster.quorum-type
option to auto
to allow writes to the file only
if the percentage of active replicate bricks is more than 50% of the
total number of bricks that constitute that replica.
# gluster volume set VOLNAME cluster.quorum-type auto
In this example, as there are only two bricks in the replica pair, the first brick must be up and running to allow writes.
Important
At least n/2 bricks need to be up for the quorum to be met. If the number of bricks (
n
) in a replica set is an even number, it is mandatory that the n/2
bricks that are up include the primary brick, and it must be up and running. If n
is an odd number, the n/2
count can be met by any of the bricks; that is, the primary brick need not be up and running to allow writes.
Recovering from File Split-brain
You can recover from the data and meta-data split-brain using one of the following methods:
-
See Recovering File Split-brain from the Mount Point for information on how to recover from data and meta-data split-brain from the mount point.
-
See Recovering File Split-brain from the gluster CLI for information on how to recover from data and meta-data split-brain using the CLI.
For information on resolving gfid/entry
split-brain, see
Manually Resolving Split-brains.
Recovering File Split-brain from the Mount Point
-
You can use a set of
getfattr
and setfattr
commands to detect the data and meta-data split-brain status of a file and resolve split-brain from the mount point.
Important
This process for split-brain resolution from the mount point will not work on NFS mounts, as NFS does not provide extended attribute support.
In this example, the
test-volume
volume has bricks brick0
, brick1
, brick2
, and brick3
.
# gluster volume info test-volume Volume Name: test-volume Type: Distributed-Replicate Status: Started Number of Bricks: 2 x 2 = 4 Transport-type: tcp Bricks: Brick1: test-host:/rhgs/brick0 Brick2: test-host:/rhgs/brick1 Brick3: test-host:/rhgs/brick2 Brick4: test-host:/rhgs/brick3
Directory structure of the bricks is as follows:
# tree -R /rhgs/brick? /rhgs/brick0 ├── dir │ └── a └── file100 /rhgs/brick1 ├── dir │ └── a └── file100 /rhgs/brick2 ├── dir ├── file1 ├── file2 └── file99 /rhgs/brick3 ├── dir ├── file1 ├── file2 └── file99
In the following output, some of the files in the volume are in split-brain.
# gluster volume heal test-volume info split-brain Brick test-host:/rhgs/brick0/ /file100 /dir Number of entries in split-brain: 2 Brick test-host:/rhgs/brick1/ /file100 /dir Number of entries in split-brain: 2 Brick test-host:/rhgs/brick2/ /file99 <gfid:5399a8d1-aee9-4653-bb7f-606df02b3696> Number of entries in split-brain: 2 Brick test-host:/rhgs/brick3/ <gfid:05c4b283-af58-48ed-999e-4d706c7b97d5> <gfid:5399a8d1-aee9-4653-bb7f-606df02b3696> Number of entries in split-brain: 2
To know data or meta-data split-brain status of a file:
# getfattr -n replica.split-brain-status <path-to-file>
The above command executed from the mount point indicates whether a file is in data or meta-data split-brain. This command is not applicable to gfid/entry split-brain.
For example,
file100
is in meta-data split-brain. Executing the above mentioned command for file100
gives:
# getfattr -n replica.split-brain-status file100 # file: file100 replica.split-brain-status="data-split-brain:no metadata-split-brain:yes Choices:test-client-0,test-client-1"
-
file1
is in data split-brain.
# getfattr -n replica.split-brain-status file1 # file: file1 replica.split-brain-status="data-split-brain:yes metadata-split-brain:no Choices:test-client-2,test-client-3"
-
file99
is in both data and meta-data split-brain.
# getfattr -n replica.split-brain-status file99 # file: file99 replica.split-brain-status="data-split-brain:yes metadata-split-brain:yes Choices:test-client-2,test-client-3"
-
dir
is in gfid/entry
split-brain but, as mentioned earlier, the above command does not display whether the file is in gfid/entry
split-brain. Hence, the command displaysThe file is not under data or metadata split-brain
. For information on resolving gfid/entry split-brain, see Manually Resolving Split-brains.
# getfattr -n replica.split-brain-status dir # file: dir replica.split-brain-status="The file is not under data or metadata split-brain"
-
file2
is not in any kind of split-brain.
# getfattr -n replica.split-brain-status file2 # file: file2 replica.split-brain-status="The file is not under data or metadata split-brain"
-
-
Analyze the files in data and meta-data split-brain and resolve the issue.
When you perform operations like
cat
,getfattr
, and more from the mount point on files in split-brain, it throws an input/output error. For further analysis of such files, you can use the setfattr
command.
# setfattr -n replica.split-brain-choice -v "choiceX" <path-to-file>
Using this command, a particular brick can be chosen to access the file in split-brain.
For example,
file1
is in data split-brain; when you try to read from the file, it throws an input/output error.
# cat file1 cat: file1: Input/output error
Split-brain choices provided for file1 were
test-client-2
andtest-client-3
.Setting
test-client-2
as the split-brain choice for file1 serves reads from b2
for the file.
# setfattr -n replica.split-brain-choice -v test-client-2 file1
Now, you can perform operations on the file. For example, read operations on the file:
# cat file1 xyz
Similarly, to inspect the file from the other choice,
replica.split-brain-choice
is to be set to test-client-3.
Trying to inspect the file from a wrong choice errors out. To undo the split-brain-choice that has been set, the above mentioned
setfattr
command can be used with none
as the value for the extended attribute.
For example,
# setfattr -n replica.split-brain-choice -v none file1
Now performing the
cat
operation on the file will again result in an input/output error, as before.
# cat file1 cat: file1: Input/output error
After you decide which brick to use as a source for resolving the split-brain, it must be set for the healing to be done.
# setfattr -n replica.split-brain-heal-finalize -v <heal-choice> <path-to-file>
Example
# setfattr -n replica.split-brain-heal-finalize -v test-client-2 file1
The above process can be used to resolve data and/or meta-data split-brain on all the files.
Setting the split-brain-choice on the file
After setting the split-brain-choice on the file, the file can be analyzed only for five minutes. If the duration of analyzing the file needs to be increased, use the following command and set the required time in the timeout-in-minutes argument.
# setfattr -n replica.split-brain-choice-timeout -v <timeout-in-minutes> <mount_point/file>
This is a global timeout and is applicable to all files as long as the mount exists. The timeout need not be set each time a file needs to be inspected, but it has to be set again the first time on a new mount. This option becomes invalid if operations such as add-brick or remove-brick are performed.
Note
If the
fopen-keep-cache
FUSE mount option is disabled, then the inode must be invalidated each time before selecting a new replica.split-brain-choice
to inspect a file, using the following command:
# setfattr -n inode-invalidate -v 0 <path-to-file>
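When scripting the inspection described above, the replica.split-brain-status value returned by getfattr can be parsed programmatically. The following is an illustrative Python sketch (not part of GlusterFS); the value format is taken from the example outputs shown earlier:

```python
def parse_split_brain_status(value):
    # Files not in data/metadata split-brain report a plain sentence
    # such as "The file is not under data or metadata split-brain".
    if "split-brain:" not in value:
        return {"data": False, "metadata": False, "choices": []}
    # Otherwise the value is space-separated "key:value" fields, e.g.
    # "data-split-brain:yes metadata-split-brain:no Choices:c2,c3"
    fields = dict(part.split(":", 1) for part in value.split())
    return {
        "data": fields.get("data-split-brain") == "yes",
        "metadata": fields.get("metadata-split-brain") == "yes",
        "choices": fields.get("Choices", "").split(","),
    }

status = parse_split_brain_status(
    "data-split-brain:yes metadata-split-brain:no "
    "Choices:test-client-2,test-client-3")
print(status["data"], status["choices"])
```

A wrapper like this can drive the setfattr-based choice/finalize steps over many files instead of inspecting each one by hand.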
Recovering File Split-brain from the gluster CLI
You can resolve the split-brain from the gluster CLI in the following ways:
-
Use bigger-file as source
-
Use the file with latest mtime as source
-
Use one replica as source for a particular file
-
Use one replica as source for all files
Note
The
entry/gfid
split-brain resolution is not supported using the CLI. For information on resolving gfid/entry
split-brain, see Manually Resolving Split-brains.
Selecting the bigger-file as source.
This method is useful for per-file healing and where you can decide that the file with the bigger size is to be considered the source.
-
Run the following command to obtain the list of files that are in split-brain:
# gluster volume heal VOLNAME info split-brain
Brick <hostname:brickpath-b1> <gfid:aaca219f-0e25-4576-8689-3bfd93ca70c2> <gfid:39f301ae-4038-48c2-a889-7dac143e82dd> <gfid:c3c94de2-232d-4083-b534-5da17fc476ac> Number of entries in split-brain: 3 Brick <hostname:brickpath-b2> /dir/file1 /dir /file4 Number of entries in split-brain: 3
From the command output, identify the files that are in split-brain.
You can find the differences in the file size and md5 checksums by performing a stat and md5 checksums on the file from the bricks. The following is the stat and md5 checksum output of a file:
On brick b1: # stat b1/dir/file1 File: ‘b1/dir/file1’ Size: 17 Blocks: 16 IO Block: 4096 regular file Device: fd03h/64771d Inode: 919362 Links: 2 Access: (0644/-rw-r--r--) Uid: ( 0/ root) Gid: ( 0/ root) Access: 2015-03-06 13:55:40.149897333 +0530 Modify: 2015-03-06 13:55:37.206880347 +0530 Change: 2015-03-06 13:55:37.206880347 +0530 Birth: - # md5sum b1/dir/file1 040751929ceabf77c3c0b3b662f341a8 b1/dir/file1 On brick b2: # stat b2/dir/file1 File: ‘b2/dir/file1’ Size: 13 Blocks: 16 IO Block: 4096 regular file Device: fd03h/64771d Inode: 919365 Links: 2 Access: (0644/-rw-r--r--) Uid: ( 0/ root) Gid: ( 0/ root) Access: 2015-03-06 13:54:22.974451898 +0530 Modify: 2015-03-06 13:52:22.910758923 +0530 Change: 2015-03-06 13:52:22.910758923 +0530 Birth: - # md5sum b2/dir/file1 cb11635a45d45668a403145059c2a0d5 b2/dir/file1
You can notice the differences in the file size and md5 checksums.
-
Execute the following command along with the full file name as seen from the root of the volume, or the gfid-string representation of the file, which is displayed in the heal info command's output.
# gluster volume heal <VOLNAME> split-brain bigger-file <FILE>
For example,
# gluster volume heal test-volume split-brain bigger-file /dir/file1 Healed /dir/file1.
After the healing is complete, the md5sum and file size on both bricks must be the same. The following is a sample output of the stat and md5 checksum commands after completion of healing the file.
On brick b1: # stat b1/dir/file1 File: ‘b1/dir/file1’ Size: 17 Blocks: 16 IO Block: 4096 regular file Device: fd03h/64771d Inode: 919362 Links: 2 Access: (0644/-rw-r--r--) Uid: ( 0/ root) Gid: ( 0/ root) Access: 2015-03-06 14:17:27.752429505 +0530 Modify: 2015-03-06 13:55:37.206880347 +0530 Change: 2015-03-06 14:17:12.880343950 +0530 Birth: - # md5sum b1/dir/file1 040751929ceabf77c3c0b3b662f341a8 b1/dir/file1 On brick b2: # stat b2/dir/file1 File: ‘b2/dir/file1’ Size: 17 Blocks: 16 IO Block: 4096 regular file Device: fd03h/64771d Inode: 919365 Links: 2 Access: (0644/-rw-r--r--) Uid: ( 0/ root) Gid: ( 0/ root) Access: 2015-03-06 14:17:23.249403600 +0530 Modify: 2015-03-06 13:55:37.206880000 +0530 Change: 2015-03-06 14:17:12.881343955 +0530 Birth: - # md5sum b2/dir/file1 040751929ceabf77c3c0b3b662f341a8 b2/dir/file1
Selecting the file with latest mtime as source.
This method is useful for per-file healing and if you want the file with the latest mtime to be considered the source.
-
Run the following command to obtain the list of files that are in split-brain:
# gluster volume heal VOLNAME info split-brain
Brick <hostname:brickpath-b1> <gfid:aaca219f-0e25-4576-8689-3bfd93ca70c2> <gfid:39f301ae-4038-48c2-a889-7dac143e82dd> <gfid:c3c94de2-232d-4083-b534-5da17fc476ac> Number of entries in split-brain: 3 Brick <hostname:brickpath-b2> /dir/file1 /dir /file4 Number of entries in split-brain: 3
From the command output, identify the files that are in split-brain.
You can find the differences in the file size and md5 checksums by performing a stat and md5 checksums on the file from the bricks. The following is the stat and md5 checksum output of a file:
On brick b1: stat b1/file4 File: ‘b1/file4’ Size: 4 Blocks: 16 IO Block: 4096 regular file Device: fd03h/64771d Inode: 919356 Links: 2 Access: (0644/-rw-r--r--) Uid: ( 0/ root) Gid: ( 0/ root) Access: 2015-03-06 13:53:19.417085062 +0530 Modify: 2015-03-06 13:53:19.426085114 +0530 Change: 2015-03-06 13:53:19.426085114 +0530 Birth: - # md5sum b1/file4 b6273b589df2dfdbd8fe35b1011e3183 b1/file4 On brick b2: # stat b2/file4 File: ‘b2/file4’ Size: 4 Blocks: 16 IO Block: 4096 regular file Device: fd03h/64771d Inode: 919358 Links: 2 Access: (0644/-rw-r--r--) Uid: ( 0/ root) Gid: ( 0/ root) Access: 2015-03-06 13:52:35.761833096 +0530 Modify: 2015-03-06 13:52:35.769833142 +0530 Change: 2015-03-06 13:52:35.769833142 +0530 Birth: - # md5sum b2/file4 0bee89b07a248e27c83fc3d5951213c1 b2/file4
You can notice the differences in the md5 checksums, and the modify time.
-
Execute the following command
# gluster volume heal <VOLNAME> split-brain latest-mtime <FILE>
In this command, FILE can be either the full file name as seen from the root of the volume or the gfid-string representation of the file.
For example,
# gluster volume heal test-volume split-brain latest-mtime /file4 Healed /file4
After the healing is complete, the md5 checksum, file size, and modify time on both bricks must be the same. The following is a sample output of the stat and md5 checksum commands after completion of healing the file. You can notice that the file has been healed using the brick having the latest mtime (brick b1, in this example) as the source.
On brick b1: # stat b1/file4 File: ‘b1/file4’ Size: 4 Blocks: 16 IO Block: 4096 regular file Device: fd03h/64771d Inode: 919356 Links: 2 Access: (0644/-rw-r--r--) Uid: ( 0/ root) Gid: ( 0/ root) Access: 2015-03-06 14:23:38.944609863 +0530 Modify: 2015-03-06 13:53:19.426085114 +0530 Change: 2015-03-06 14:27:15.058927962 +0530 Birth: - # md5sum b1/file4 b6273b589df2dfdbd8fe35b1011e3183 b1/file4 On brick b2: # stat b2/file4 File: ‘b2/file4’ Size: 4 Blocks: 16 IO Block: 4096 regular file Device: fd03h/64771d Inode: 919358 Links: 2 Access: (0644/-rw-r--r--) Uid: ( 0/ root) Gid: ( 0/ root) Access: 2015-03-06 14:23:38.944609000 +0530 Modify: 2015-03-06 13:53:19.426085000 +0530 Change: 2015-03-06 14:27:15.059927968 +0530 Birth: # md5sum b2/file4 b6273b589df2dfdbd8fe35b1011e3183 b2/file4
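The selection that the bigger-file and latest-mtime policies perform can be pictured with a small sketch. The heal command itself makes this decision server-side; the Python below, with hypothetical sizes and mtimes mirroring the example output, only demonstrates the decision rule:

```python
# Hypothetical per-brick stat results for one file in split-brain
# (sizes in bytes, mtimes as Unix epochs; values are illustrative).
replicas = {
    "test-host:b1": {"size": 17, "mtime": 1425630337.206},
    "test-host:b2": {"size": 13, "mtime": 1425630142.910},
}

def pick_source(replicas, policy):
    # bigger-file keeps the largest copy; latest-mtime keeps the
    # most recently modified copy. The rest are healed from it.
    key = {
        "bigger-file": lambda s: s["size"],
        "latest-mtime": lambda s: s["mtime"],
    }[policy]
    return max(replicas, key=lambda brick: key(replicas[brick]))

print(pick_source(replicas, "bigger-file"))   # test-host:b1
print(pick_source(replicas, "latest-mtime"))  # test-host:b1
```

Note that the two policies can disagree: a truncated-then-rewritten copy may be newer yet smaller, which is why the CLI offers both, plus explicit source-brick selection.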
Selecting one replica as source for a particular file.
This method is useful if you know which file is to be considered as source.
-
Run the following command to obtain the list of files that are in split-brain:
# gluster volume heal VOLNAME info split-brain
Brick <hostname:brickpath-b1> <gfid:aaca219f-0e25-4576-8689-3bfd93ca70c2> <gfid:39f301ae-4038-48c2-a889-7dac143e82dd> <gfid:c3c94de2-232d-4083-b534-5da17fc476ac> Number of entries in split-brain: 3 Brick <hostname:brickpath-b2> /dir/file1 /dir /file4 Number of entries in split-brain: 3
From the command output, identify the files that are in split-brain.
You can find the differences in the file size and md5 checksums by performing a stat and md5 checksums on the file from the bricks. The following is the stat and md5 checksum output of a file:
On brick b1: stat b1/file4 File: ‘b1/file4’ Size: 4 Blocks: 16 IO Block: 4096 regular file Device: fd03h/64771d Inode: 919356 Links: 2 Access: (0644/-rw-r--r--) Uid: ( 0/ root) Gid: ( 0/ root) Access: 2015-03-06 13:53:19.417085062 +0530 Modify: 2015-03-06 13:53:19.426085114 +0530 Change: 2015-03-06 13:53:19.426085114 +0530 Birth: - # md5sum b1/file4 b6273b589df2dfdbd8fe35b1011e3183 b1/file4 On brick b2: # stat b2/file4 File: ‘b2/file4’ Size: 4 Blocks: 16 IO Block: 4096 regular file Device: fd03h/64771d Inode: 919358 Links: 2 Access: (0644/-rw-r--r--) Uid: ( 0/ root) Gid: ( 0/ root) Access: 2015-03-06 13:52:35.761833096 +0530 Modify: 2015-03-06 13:52:35.769833142 +0530 Change: 2015-03-06 13:52:35.769833142 +0530 Birth: - # md5sum b2/file4 0bee89b07a248e27c83fc3d5951213c1 b2/file4
You can notice the differences in the file size and md5 checksums.
-
Execute the following command
# gluster volume heal <VOLNAME> split-brain source-brick <HOSTNAME:BRICKNAME> <FILE>
In this command, FILE present in <HOSTNAME:BRICKNAME> is taken as source for healing.
For example,
# gluster volume heal test-volume split-brain source-brick test-host:b1 /file4 Healed /file4
After the healing is complete, the md5 checksum and file size on both bricks must be the same. The following is a sample output of the stat and md5 checksum commands after completion of healing the file.
On brick b1: # stat b1/file4 File: ‘b1/file4’ Size: 4 Blocks: 16 IO Block: 4096 regular file Device: fd03h/64771d Inode: 919356 Links: 2 Access: (0644/-rw-r--r--) Uid: ( 0/ root) Gid: ( 0/ root) Access: 2015-03-06 14:23:38.944609863 +0530 Modify: 2015-03-06 13:53:19.426085114 +0530 Change: 2015-03-06 14:27:15.058927962 +0530 Birth: - # md5sum b1/file4 b6273b589df2dfdbd8fe35b1011e3183 b1/file4 On brick b2: # stat b2/file4 File: ‘b2/file4’ Size: 4 Blocks: 16 IO Block: 4096 regular file Device: fd03h/64771d Inode: 919358 Links: 2 Access: (0644/-rw-r--r--) Uid: ( 0/ root) Gid: ( 0/ root) Access: 2015-03-06 14:23:38.944609000 +0530 Modify: 2015-03-06 13:53:19.426085000 +0530 Change: 2015-03-06 14:27:15.059927968 +0530 Birth: - # md5sum b2/file4 b6273b589df2dfdbd8fe35b1011e3183 b2/file4
Selecting one replica as source for all files.
This method is useful if you want to use a particular brick as the source for the split-brain files in that replica pair.
-
Run the following command to obtain the list of files that are in split-brain:
# gluster volume heal VOLNAME info split-brain
From the command output, identify the files that are in split-brain.
-
Execute the following command
# gluster volume heal <VOLNAME> split-brain source-brick <HOSTNAME:BRICKNAME>
In this command, for all the files that are in split-brain in this replica, <HOSTNAME:BRICKNAME> is taken as source for healing.
For example,
# gluster volume heal test-volume split-brain source-brick test-host:b1
Triggering Self-Healing on Replicated Volumes
For replicated volumes, when a brick goes offline and comes back online, self-healing is required to re-sync all the replicas. There is a self-heal daemon which runs in the background, and automatically initiates self-healing every 10 minutes on any files which require healing.
Multithreaded Self-heal.
The self-heal daemon has the capability to handle multiple heals in parallel
and is supported on Replicate and Distribute-replicate volumes. However,
increasing the number of heals has an impact on I/O performance, so the
following options have been provided. The cluster.shd-max-threads
volume option controls the number of entries that can be self-healed in
parallel on each replica by the self-heal daemon. Using the
cluster.shd-wait-qlength
volume option, you can configure the number
of entries that must be kept in the queue for self-heal daemon threads
to take up as soon as any of the threads are free to heal.
For more information on cluster.shd-max-threads
and
cluster.shd-wait-qlength
volume set options, see Configuring Volume Options.
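The interaction between the two options can be pictured as a pool of worker threads draining a bounded queue. The following toy Python model (not GlusterFS code; names and values are invented for illustration) shows how cluster.shd-max-threads bounds parallelism while cluster.shd-wait-qlength bounds the backlog:

```python
import queue
import threading

def run_heals(entries, max_threads=1, wait_qlength=1024):
    # Bounded queue: at most wait_qlength entries wait to be healed,
    # analogous to cluster.shd-wait-qlength.
    q = queue.Queue(maxsize=wait_qlength)
    healed = []
    lock = threading.Lock()

    def worker():
        # Each worker heals entries until it receives the None sentinel.
        while True:
            entry = q.get()
            if entry is None:
                break
            with lock:
                healed.append(entry)  # stand-in for the actual heal work

    # max_threads workers heal in parallel, analogous to
    # cluster.shd-max-threads.
    threads = [threading.Thread(target=worker) for _ in range(max_threads)]
    for t in threads:
        t.start()
    for e in entries:
        q.put(e)  # blocks once wait_qlength entries are already queued
    for _ in threads:
        q.put(None)  # one sentinel per worker
    for t in threads:
        t.join()
    return healed

print(len(run_heals(range(100), max_threads=4)))  # 100
```

Raising max_threads speeds up draining at the cost of more concurrent I/O; raising wait_qlength lets more pending entries accumulate so idle threads always find work.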
There are various commands that can be used to check the healing status of volumes and files, or to manually initiate healing:
-
To view the list of files that need healing:
# gluster volume heal VOLNAME info
For example, to view the list of files on test-volume that need healing:
# gluster volume heal test-volume info Brick server1:/gfs/test-volume_0 Number of entries: 0 Brick server2:/gfs/test-volume_1 /95.txt /32.txt /66.txt /35.txt /18.txt /26.txt - Possibly undergoing heal /47.txt /55.txt /85.txt - Possibly undergoing heal ... Number of entries: 101
-
To trigger self-healing only on the files which require healing:
# gluster volume heal VOLNAME
For example, to trigger self-healing on files which require healing on test-volume:
# gluster volume heal test-volume Heal operation on volume test-volume has been successful
-
To trigger self-healing on all the files on a volume:
# gluster volume heal VOLNAME full
For example, to trigger self-heal on all the files on test-volume:
# gluster volume heal test-volume full Heal operation on volume test-volume has been successful
-
To view the list of files on a volume that are in a split-brain state:
# gluster volume heal VOLNAME info split-brain
For example, to view the list of files on test-volume that are in a split-brain state:
# gluster volume heal test-volume info split-brain Brick server1:/gfs/test-volume_2 Number of entries: 12 at path on brick ---------------------------------- 2012-06-13 04:02:05 /dir/file.83 2012-06-13 04:02:05 /dir/file.28 2012-06-13 04:02:05 /dir/file.69 Brick server2:/gfs/test-volume_2 Number of entries: 12 at path on brick ---------------------------------- 2012-06-13 04:02:05 /dir/file.83 2012-06-13 04:02:05 /dir/file.28 2012-06-13 04:02:05 /dir/file.69 ...
Non Uniform File Allocation (NUFA)
When a client on a server creates files, the files are allocated to a brick in the volume based on the file name. This allocation may not be ideal, as there is higher latency and unnecessary network traffic for read/write operations to a non-local brick or export directory. NUFA ensures that the files are created in the local export directory of the server, and as a result, reduces latency and conserves bandwidth for that server accessing that file. This can also be useful for applications running on mount points on the storage server.
If the local brick runs out of space or reaches the minimum disk free limit, instead of allocating files to the local brick, NUFA distributes files to other bricks in the same volume if there is space available on those bricks.
NUFA should be enabled before creating any data in the volume. To enable
NUFA, execute the following command:
# gluster volume set VOLNAME cluster.nufa enable
Important
NUFA is supported under the following conditions:
Volumes with only one brick per server.
For use with a FUSE client. NUFA is not supported with NFS or SMB.
A client that is mounting a NUFA-enabled volume must be present within the trusted storage pool.