
Resolving a Cinder Snapshot-Based Volume Restore Bug in OpenStack

Symptom

We used OpenStack's ability to restore volume data from a snapshot, but the restore did not succeed.

Root cause

Reading the source code eventually revealed that the Ceph keyring file name must match the expected default exactly, because the relevant module in Cinder's code path hardcodes it.

Our operations engineer had used a non-official keyring file name back when Ceph was deployed. The system then ran fine for a long time; the bug only surfaced the first time the snapshot-based volume restore feature was used.

Resolution process

First, turn on debug mode in /etc/cinder/cinder.conf (the debug option lives in the [DEFAULT] section) to get more detailed logs:

vi /etc/cinder/cinder.conf

debug=true

Then watch the log:

tail -f /var/log/cinder/cinder-volume.log

Problem 1: missing required config parameters

The error log:

2020-04-14 11:32:01.241 1403591 WARNING cinder.context [req-c72b8fca-7563-48c2-b4f7-5e67bd6b2dc4 daa48f0af437437f8abe6aa47102e7e5 f02d23f04448412781122a365621e218 - 154925c4fbe74cc2ae2b98b6bd0aea5f 154925c4fbe74cc2ae2b98b6bd0aea5f] Unable to get internal tenant context: Missing required config parameters.

Locating the code that raised the error

File: /usr/lib/python3/dist-packages/cinder/context.py

Source:

def get_internal_tenant_context():
    """Build and return the Cinder internal tenant context object
    This request context will only work for internal Cinder operations. It will
    not be able to make requests to remote services. To do so it will need to
    use the keystone client to get an auth_token.
    """
    project_id = CONF.cinder_internal_tenant_project_id
    user_id = CONF.cinder_internal_tenant_user_id

    if project_id and user_id:
        return RequestContext(user_id=user_id,
                              project_id=project_id,
                              is_admin=True,
                              overwrite=False)
    else:
        LOG.warning('Unable to get internal tenant context: Missing '
                    'required config parameters.')
        return None
           

Modify cinder.conf to add the project ID and user ID of Cinder's internal tenant.

(As it turned out later, this step is not actually required and has no effect on the snapshot restore feature; it only silences the warning above.)

File: /etc/cinder/cinder.conf

[DEFAULT]
cinder_internal_tenant_project_id = 84e68a2d256c466cbd2796472769f037
cinder_internal_tenant_user_id = 90b78005e1824e91b912c50b045faa5e
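
If you do not know these IDs, they can be looked up in Keystone. A quick way, assuming the openstack CLI is available with admin credentials loaded:

openstack project list

openstack user list

Copy the ID of the project and user you want Cinder to use internally into the two options above.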
           

Problem 2: permission error on connect, indicating that something went wrong while the connection parameters were being assembled

The error log:

2020-04-14 17:44:36.073 64800 ERROR os_brick.initiator.linuxrbd File "/usr/lib/python3/dist-packages/os_brick/initiator/linuxrbd.py", line 70, in connect

2020-04-14 17:44:36.073 64800 ERROR os_brick.initiator.linuxrbd client.connect()

2020-04-14 17:44:36.073 64800 ERROR os_brick.initiator.linuxrbd File "rados.pyx", line 893, in rados.Rados.connect

2020-04-14 17:44:36.073 64800 ERROR os_brick.initiator.linuxrbd rados.PermissionError: [errno 1] error connecting to the cluster

2020-04-14 17:44:36.073 64800 ERROR os_brick.initiator.linuxrbd

Locating the code that raised the error

File: /usr/lib/python3/dist-packages/os_brick/initiator/linuxrbd.py

Source:

class RBDClient(object):

    def __init__(self, user, pool, *args, **kwargs):

        self.rbd_user = user
        self.rbd_pool = pool

        for attr in ['rbd_user', 'rbd_pool']:
            val = getattr(self, attr)
            if val is not None:
                setattr(self, attr, utils.convert_str(val))

        # allow these to be overridden for testing
        self.rados = kwargs.get('rados', rados)
        self.rbd = kwargs.get('rbd', rbd)

        if self.rados is None:
            raise exception.InvalidParameterValue(
                err=_('rados module required'))
        if self.rbd is None:
            raise exception.InvalidParameterValue(
                err=_('rbd module required'))

        self.rbd_conf = kwargs.get('conffile', '/etc/ceph/ceph.conf')
        self.rbd_cluster_name = kwargs.get('rbd_cluster_name', 'ceph')
        self.rados_connect_timeout = kwargs.get('rados_connect_timeout', -1)
        # ---- debug logging we added by hand -- begin
        cat_res = os.popen("cat %s" % self.rbd_conf)
        for ll in cat_res.readlines():
            LOG.debug("=================%s======================" % ll)
        LOG.debug("=+++++++++++++++++++user:{},pool:{},rbd:{},conf:{},cluster:{},timeout:{}+++++++++++++++++++++++++".format(self.rbd_user,self.rbd_pool,self.rbd,self.rbd_conf,self.rbd_cluster_name,self.rados_connect_timeout))
        # ---- debug logging we added by hand -- end
        self.client, self.ioctx = self.connect()

    def __enter__(self):
        return self

    def __exit__(self, type_, value, traceback):
        self.disconnect()

    def connect(self):
        LOG.debug("opening connection to ceph cluster (timeout=%s).",
                  self.rados_connect_timeout)
        client = self.rados.Rados(rados_id=self.rbd_user,
                                  clustername=self.rbd_cluster_name,
                                  conffile=self.rbd_conf)

        try:
            if self.rados_connect_timeout >= 0:
                client.connect(
                    timeout=self.rados_connect_timeout)
            else:
                client.connect()
            ioctx = client.open_ioctx(self.rbd_pool)
            return client, ioctx
        except self.rados.Error:
            msg = _("Error connecting to ceph cluster.")
            LOG.exception(msg)
            # shutdown cannot raise an exception
            client.shutdown()
            raise exception.BrickException(message=msg)	
           

This yields two key pieces of information.

1. Key finding one

From the per-line dump:

cat_res = os.popen("cat %s" % self.rbd_conf)
for ll in cat_res.readlines():
    LOG.debug("=================%s======================" % ll)

we got the following output:

=================mon_host = 172.1.1.11:6789,172.1.1.12:6789,172.1.1.13:6789======================

=================[client.cinder]======================

=================key = AQBJ6FteAhQZIhAA+R9uYw8NmCLSIiVmEYSnqQ======================

Note that key = AQBJ6FteAhQZIhAA+R9uYw8NmCLSIiVmEYSnqQ== is not the correct keyring for the cinder user.
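
To see what the cinder user's key should actually be, one can query the cluster directly (assuming admin access to Ceph):

ceph auth get-key client.cinder

Comparing this output with the key dumped above confirms that the generated config carries the wrong credential.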

2. Key finding two

From the summary line:

LOG.debug("=+++++++++++++++++++user:{},pool:{},rbd:{},conf:{},cluster:{},timeout:{}+++++++++++++++++++++++++".format(self.rbd_user,self.rbd_pool,self.rbd,self.rbd_conf,self.rbd_cluster_name,self.rados_connect_timeout))

we got:

conf:/tmp/brickrbd_zsp449_n

So we searched for brickrbd:

find /usr/lib/python3/dist-packages/ -name "*brickrbd*"

cd /usr/lib/python3/dist-packages/

grep -rl "brickrbd_" *

which turned up the following file:

/usr/lib/python3/dist-packages/os_brick/initiator/connectors/rbd.py

Its source implements the logic for fetching the keyring:

def _check_or_get_keyring_contents(self, keyring, cluster_name, user):
    try:
        if keyring is None:
            if user:
                # the path /etc/ceph/%s.client.%s.keyring is hardcoded here,
                # so the keyring file cannot be renamed
                keyring_path = ("/etc/ceph/%s.client.%s.keyring" %
                                (cluster_name, user))
                with open(keyring_path, 'r') as keyring_file:
                    keyring = keyring_file.read()
            else:
                keyring = ''
        return keyring
    except IOError:
        msg = (_("Keyring path %s is not readable.") % (keyring_path))
        raise exception.BrickException(msg=msg)


def _create_ceph_conf(self, monitor_ips, monitor_ports,
                      cluster_name, user, keyring):
    monitors = ["%s:%s" % (ip, port) for ip, port in
                zip(self._sanitize_mon_hosts(monitor_ips), monitor_ports)]
    mon_hosts = "mon_host = %s" % (','.join(monitors))

    keyring = self._check_or_get_keyring_contents(keyring, cluster_name, user)

    try:
        fd, ceph_conf_path = tempfile.mkstemp(prefix="brickrbd_")
        with os.fdopen(fd, 'w') as conf_file:
            conf_file.writelines([mon_hosts, "\n", keyring, "\n"])
        return ceph_conf_path

    except IOError:
        msg = (_("Failed to write data to %s.") % (ceph_conf_path))
        raise exception.BrickException(msg=msg)
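
Combining the two methods: the temporary file that _create_ceph_conf writes (the /tmp/brickrbd_* path seen in the debug output) contains nothing but the mon_host line followed by the verbatim contents of the hardcoded keyring path, e.g. (values are from our cluster, for illustration only):

mon_host = 172.1.1.11:6789,172.1.1.12:6789,172.1.1.13:6789
[client.cinder]
key = <contents of /etc/ceph/ceph.client.cinder.keyring>

If the wrong file sits at that hardcoded path, the wrong key ends up here, and rados connect() fails with the permission error shown above.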
           

Reading this source makes the problem clear: the path /etc/ceph/%s.client.%s.keyring is hardcoded, so the keyring files must keep their default names.
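
A minimal sketch of the naming rule, assuming the default cluster name ceph and the usual RBD users from cinder.conf (rbd_user = cinder for the volume service, cinder-backup for backups):

# reproduce the hardcoded path construction from _check_or_get_keyring_contents
for user in ("cinder", "cinder-backup"):
    print("/etc/ceph/%s.client.%s.keyring" % ("ceph", user))
# prints:
#   /etc/ceph/ceph.client.cinder.keyring
#   /etc/ceph/ceph.client.cinder-backup.keyring

These are exactly the file names used in the solution below.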

Final solution

Create files in /etc/ceph/ with the following names (one way to generate them is shown after the list):

ceph.client.cinder.keyring

ceph.client.cinder-backup.keyring
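
If the correctly named keyrings do not exist yet, one way to produce them, assuming the client.cinder and client.cinder-backup users already exist in the Ceph cluster, is to export them:

ceph auth get client.cinder -o /etc/ceph/ceph.client.cinder.keyring

ceph auth get client.cinder-backup -o /etc/ceph/ceph.client.cinder-backup.keyring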

Update ceph.conf to reference them:

[client.cinder]

keyring = /etc/ceph/ceph.client.cinder.keyring

[client.cinder-backup]

keyring = /etc/ceph/ceph.client.cinder-backup.keyring

Change the owner of the files so the cinder service can read them:

chown cinder:cinder /etc/ceph/ceph.client.cinder.keyring

chown cinder:cinder /etc/ceph/ceph.client.cinder-backup.keyring
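
Before retrying the restore, it is worth checking that the renamed keyring actually authenticates; a quick sanity check, assuming the ceph CLI is installed on the cinder node:

ceph -n client.cinder --keyring /etc/ceph/ceph.client.cinder.keyring -s

If this prints the cluster status instead of a permission error, restart the cinder-volume service (the service name varies by distribution) and retry the snapshot restore.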
