Manually Recovering File Split-brain

This chapter provides steps to manually recover from split-brain.

  1. Run the following command to obtain the path of the file that is in split-brain:

    # gluster volume heal VOLNAME info split-brain

    From the command output, identify the files for which file operations performed from the client keep failing with Input/Output error.

  2. Close the applications that opened split-brain file from the mount point. If you are using a virtual machine, you must power off the machine.

  3. Obtain and verify the AFR changelog extended attributes of the file using the getfattr command. Then identify the type of split-brain to determine which of the bricks contains the 'good copy' of the file.

    getfattr -d -m . -e hex <file-path-on-brick>

    For example,

    # getfattr -d -e hex -m. brick-a/file.txt
    \#file: brick-a/file.txt
    security.selinux=0x726f6f743a6f626a6563745f723a66696c655f743a733000
    trusted.afr.vol-client-2=0x000000000000000000000000
    trusted.afr.vol-client-3=0x000000000200000000000000
    trusted.gfid=0x307a5c9efddd4e7c96e94fd4bcdcbd1b

    The extended attributes with trusted.afr.VOLNAMEvolname-client-<subvolume-index> are used by AFR to maintain changelog of the file. The values of the trusted.afr.VOLNAMEvolname-client-<subvolume-index> are calculated by the glusterFS client (FUSE or NFS-server) processes. When the glusterFS client modifies a file or directory, the client contacts each brick and updates the changelog extended attribute according to the response of the brick.

    subvolume-index is the brick number - 1 of gluster volume info VOLNAME output.

    For example,

    # gluster volume info vol
    Volume Name: vol
    Type: Distributed-Replicate
    Volume ID: 4f2d7849-fbd6-40a2-b346-d13420978a01
    Status: Created
    Number of Bricks: 4 x 2 = 8
    Transport-type: tcp
    Bricks:
    brick-a: server1:/gfs/brick-a
    brick-b: server1:/gfs/brick-b
    brick-c: server1:/gfs/brick-c
    brick-d: server1:/gfs/brick-d
    brick-e: server1:/gfs/brick-e
    brick-f: server1:/gfs/brick-f
    brick-g: server1:/gfs/brick-g
    brick-h: server1:/gfs/brick-h

    In the example above:

    Brick             |    Replica set        |    Brick subvolume index

    -/gfs/brick-a | 0 | 0 -/gfs/brick-b | 0 | 1 -/gfs/brick-c | 1 | 2 -/gfs/brick-d | 1 | 3 -/gfs/brick-e | 2 | 4 -/gfs/brick-f | 2 | 5 -/gfs/brick-g | 3 | 6 -/gfs/brick-h | 3 | 7

----------------------------------------------------------------------------
+
Each file in a brick maintains the changelog of itself and that of the
files present in all the other bricks in it's replica set as seen by
that brick.
+
In the example volume given above, all files in brick-a will have 2
entries, one for itself and the other for the file present in it's
replica pair. The following is the changelog for brick-b,
* trusted.afr.vol-client-0=0x000000000000000000000000 - is the changelog
for itself (brick-a)
* trusted.afr.vol-client-1=0x000000000000000000000000 - changelog for
brick-b as seen by brick-a
+
Likewise, all files in brick-b will have the following:
* trusted.afr.vol-client-0=0x000000000000000000000000 - changelog for
brick-a as seen by brick-b
* trusted.afr.vol-client-1=0x000000000000000000000000 - changelog for
itself (brick-b)
+
_______________________________________________________________________________________________________________________________________________________________________________________________________________________________________________________________________________________________________________________________________________________________
*Note*

From the release of GlusterFS 3.1, the files will `not`
have an entry for itself, but only the changelog entry for the other
bricks in the replica. For example, `brick-a` will only have
`trusted.afr.vol-client-1` set and `brick-b` will only have
`trusted.afr.vol-client-0` set. Interpreting the changelog remains same
as explained below.
_______________________________________________________________________________________________________________________________________________________________________________________________________________________________________________________________________________________________________________________________________________________________
+
The same can be extended for other replica pairs.
+
*Interpreting changelog (approximate pending operation count) value.*
+
Each extended attribute has a value which is 24 hexa decimal digits.
First 8 digits represent changelog of data. Second 8 digits represent
changelog of metadata. Last 8 digits represent Changelog of directory
entries.
+
Pictorially representing the same is as follows:
+
---------------------------------------------------------
0x 000003d7 00000001 00000000110
        |      |       |
        |      |        \_ changelog of directory entries
        |       \_ changelog of metadata
         \ _ changelog of data
---------------------------------------------------------
+
For directories, metadata and entry changelogs are valid. For regular
files, data and metadata changelogs are valid. For special files like
device files and so on, metadata changelog is valid. When a file
split-brain happens it could be either be data split-brain or meta-data
split-brain or both.
+
The following is an example of both data, metadata split-brain on the
same file:
+
-------------------------------------------------------
# getfattr -d -m . -e hex /gfs/brick-?/a
getfattr: Removing leading '/' from absolute path names
\#file: gfs/brick-a/a
trusted.afr.vol-client-0=0x000000000000000000000000
trusted.afr.vol-client-1=0x000003d70000000100000000
trusted.gfid=0x80acdbd886524f6fbefa21fc356fed57
\#file: gfs/brick-b/a
trusted.afr.vol-client-0=0x000003b00000000100000000
trusted.afr.vol-client-1=0x000000000000000000000000
trusted.gfid=0x80acdbd886524f6fbefa21fc356fed57
-------------------------------------------------------
+
*Scrutinize the changelogs.*
+
The changelog extended attributes on file `/gfs/brick-a/a` are as
follows:
* The first 8 digits of
`trusted.afr.vol-client-0 are all zeros (0x00000000................)`,
+
The first 8 digits of` trusted.afr.vol-client-1` are not all zeros
(0x000003d7................).
+
So the changelog on `/gfs/brick-a/a` implies that some data operations
succeeded on itself but failed on `/gfs/brick-b/a`.
* The second 8 digits of
`trusted.afr.vol-client-0 are all zeros (0x........00000000........)`,
and the second 8 digits of `trusted.afr.vol-client-1` are not all zeros
(0x........00000001........).
+
So the changelog on `/gfs/brick-a/a` implies that some metadata
operations succeeded on itself but failed on `/gfs/brick-b/a`.
+
The changelog extended attributes on file `/gfs/brick-b/a` are as
follows:
* The first 8 digits of `trusted.afr.vol-client-0` are not all zeros
(0x000003b0................).
+
The first 8 digits of `trusted.afr.vol-client-1` are all zeros
(0x00000000................).
+
So the changelog on `/gfs/brick-b/a` implies that some data operations
succeeded on itself but failed on `/gfs/brick-a/a`.
* The second 8 digits of `trusted.afr.vol-client-0` are not all zeros
(0x........00000001........)
+
The second 8 digits of `trusted.afr.vol-client-1` are all zeros
(0x........00000000........).
+
So the changelog on `/gfs/brick-b/a` implies that some metadata
operations succeeded on itself but failed on `/gfs/brick-a/a`.
+
Here, both the copies have data, metadata changes that are not on the
other file. Hence, it is both data and metadata split-brain.
+
*Deciding on the correct copy.*
+
You must inspect `stat` and `getfattr `output of the files to decide
which metadata to retain and contents of the file to decide which data
to retain. To continue with the example above, here, we are retaining
the data of `/gfs/brick-a/a` and metadata of `/gfs/brick-b/a`.
+
*Resetting the relevant changelogs to resolve the split-brain.*
+
*Resolving data split-brain.*
+
You must change the changelog extended attributes on the files as if
some data operations succeeded on `/gfs/brick-a/a` but failed on
/gfs/brick-b/a. But `/gfs/brick-b/a` should `not` have any changelog
showing data operations succeeded on `/gfs/brick-b/a` but failed on
`/gfs/brick-a/a`. You must reset the data part of the changelog on
`trusted.afr.vol-client-0` of `/gfs/brick-b/a`.
+
*Resolving metadata split-brain.*
+
You must change the changelog extended attributes on the files as if
some metadata operations succeeded on `/gfs/brick-b/a` but failed on
`/gfs/brick-a/a`. But `/gfs/brick-a/a` should `not` have any changelog
which says some metadata operations succeeded on `/gfs/brick-a/a` but
failed on `/gfs/brick-b/a`. You must reset metadata part of the
changelog on `trusted.afr.vol-client-1` of `/gfs/brick-a/a`
+
Run the following commands to reset the extended attributes.
1.  On `/gfs/brick-b/a`, for
`trusted.afr.vol-client-0 0x000003b00000000100000000` to
`0x000000000000000100000000`, execute the following command:
+
-----------------------------------------------------------------------------------
# setfattr -n trusted.afr.vol-client-0 -v 0x000000000000000100000000 /gfs/brick-b/a
-----------------------------------------------------------------------------------
2.  On` /gfs/brick-a/a`, for
`trusted.afr.vol-client-1 0x0000000000000000ffffffff` to
`0x000003d70000000000000000`, execute the following command:
+
-----------------------------------------------------------------------------------
# setfattr -n trusted.afr.vol-client-1 -v 0x000003d70000000000000000 /gfs/brick-a/a
-----------------------------------------------------------------------------------
+
After you reset the extended attributes, the changelogs would look
similar to the following:
+
---------------------------------------------------------
# getfattr -d -m . -e hex /gfs/brick-?/a
getfattr: Removing leading '/' from absolute path names
\#file: gfs/brick-a/a
trusted.afr.vol-client-0=0x000000000000000000000000
trusted.afr.vol-client-1=0x000003d70000000000000000
trusted.gfid=0x80acdbd886524f6fbefa21fc356fed57

\#file: gfs/brick-b/a
trusted.afr.vol-client-0=0x000000000000000100000000
trusted.afr.vol-client-1=0x000000000000000000000000
trusted.gfid=0x80acdbd886524f6fbefa21fc356fed57
---------------------------------------------------------
+
*Resolving Directory entry split-brain.*
+
AFR has the ability to conservatively merge different entries in the
directories when there is a split-brain on directory. If on one brick
directory `storage` has entries `1`, `2` and has entries `3`, `4` on the
other brick then AFR will merge all of the entries in the directory to
have `1, 2, 3, 4` entries in the same directory. But this may result in
deleted files to re-appear in case the split-brain happens because of
deletion of files on the directory. Split-brain resolution needs human
intervention when there is at least one entry which has same file name
but different `gfid` in that directory.
+
For example:
+
On `brick-a` the directory has 2 entries `file1` with `gfid_x` and
`file2` . On `brick-b` directory has 2 entries `file1` with `gfid_y` and
`file3`. Here the gfid's of `file1` on the bricks are different. These
kinds of directory split-brain needs human intervention to resolve the
issue. You must remove either `file1` on `brick-a` or the `file1` on
`brick-b` to resolve the split-brain.
+
In addition, the corresponding `gfid-link `file must be removed. The
`gfid-link` files are present in the .`glusterfs `directory in the
top-level directory of the brick. If the gfid of the file is
`0x307a5c9efddd4e7c96e94fd4bcdcbd1b` (the trusted.gfid extended
attribute received from the `getfattr` command earlier), the gfid-link
file can be found at
`/gfs/brick-a/.glusterfs/30/7a/307a5c9efddd4e7c96e94fd4bcdcbd1b`.
+
___________________________________________________________________________________________________________________________________________________________
*Warning*

Before deleting the `gfid-link`, you must ensure that there are no hard
links to the file present on that brick. If hard-links exist, you must
delete them.
___________________________________________________________________________________________________________________________________________________________
4.  Trigger self-heal by running the following command:
+
------------------------------------
# ls -l <file-path-on-gluster-mount>
------------------------------------
+
or
+
-----------------------------
# gluster volume heal VOLNAME
-----------------------------

results matching ""

    No results matching ""