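OpenStack Cinder: tracking down a bug when restoring a disk from a snapshot
Symptom
In OpenStack, we used the feature that restores disk data from a snapshot, but the restore failed.
Cause
After reading the source code, we eventually found that the Ceph keyring file must be named exactly as the code expects, character for character.
The reason is that the source of the modules Cinder relies on hard-codes the Ceph file name.
Our operations engineers had used a non-standard keyring file name when they deployed Ceph. The system ran for a long time without trouble, and the bug only surfaced once we used the restore-from-snapshot feature.
Resolution process
First, enable debug mode so that the logs carry more detail: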
vi /etc/cinder/cinder.conf
debug=true
Then watch the log:
tail -f /var/log/cinder/cinder-volume.log
Problem 1: missing required config parameters
The error log:
2020-04-14 11:32:01.241 1403591 WARNING cinder.context [req-c72b8fca-7563-48c2-b4f7-5e67bd6b2dc4 daa48f0af437437f8abe6aa47102e7e5 f02d23f04448412781122a365621e218 - 154925c4fbe74cc2ae2b98b6bd0aea5f 154925c4fbe74cc2ae2b98b6bd0aea5f] Unable to get internal tenant context: Missing required config parameters.
Locating the code that emits the warning
File: /usr/lib/python3/dist-packages/cinder/context.py
Source:
def get_internal_tenant_context():
    """Build and return the Cinder internal tenant context object

    This request context will only work for internal Cinder operations. It will
    not be able to make requests to remote services. To do so it will need to
    use the keystone client to get an auth_token.
    """
    project_id = CONF.cinder_internal_tenant_project_id
    user_id = CONF.cinder_internal_tenant_user_id

    if project_id and user_id:
        return RequestContext(user_id=user_id,
                              project_id=project_id,
                              is_admin=True,
                              overwrite=False)
    else:
        LOG.warning('Unable to get internal tenant context: Missing '
                    'required config parameters.')
        return None
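We edited cinder.conf and added the project ID and user ID that Cinder belongs to.
As it turned out in the end, this step is not required: the warning has no effect on restoring disk data from a snapshot.
File: /etc/cinder/cinder.conf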
[DEFAULT]
cinder_internal_tenant_project_id = 84e68a2d256c466cbd2796472769f037
cinder_internal_tenant_user_id = 90b78005e1824e91b912c50b045faa5e
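The check performed by get_internal_tenant_context() can be mirrored outside Cinder to confirm that both options are set. The following is only a minimal sketch: it assumes the two options live in the [DEFAULT] section of an INI-style cinder.conf, and the helper name missing_internal_tenant_options is hypothetical, not part of Cinder.

```python
import configparser

REQUIRED = ("cinder_internal_tenant_project_id", "cinder_internal_tenant_user_id")


def missing_internal_tenant_options(conf_text):
    """Return the required option names that are absent or empty in [DEFAULT]."""
    # interpolation=None keeps '%' characters in values from being interpreted;
    # strict=False tolerates duplicate options, which real files may contain.
    parser = configparser.ConfigParser(interpolation=None, strict=False)
    parser.read_string(conf_text)
    defaults = parser.defaults()
    return [name for name in REQUIRED if not defaults.get(name, "").strip()]


if __name__ == "__main__":
    try:
        with open("/etc/cinder/cinder.conf") as conf_file:
            missing = missing_internal_tenant_options(conf_file.read())
    except (OSError, configparser.Error) as exc:
        print("could not check cinder.conf: %s" % exc)
    else:
        print("missing options: %s" % (missing or "none"))
```

An empty result means the warning from Problem 1 should no longer appear after cinder-volume is restarted.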
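Problem 2: the permission check fails, which means something went wrong while the connection parameters were being obtained
The error log: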
2020-04-14 17:44:36.073 64800 ERROR os_brick.initiator.linuxrbd File "/usr/lib/python3/dist-packages/os_brick/initiator/linuxrbd.py", line 70, in connect
2020-04-14 17:44:36.073 64800 ERROR os_brick.initiator.linuxrbd client.connect()
2020-04-14 17:44:36.073 64800 ERROR os_brick.initiator.linuxrbd File "rados.pyx", line 893, in rados.Rados.connect
2020-04-14 17:44:36.073 64800 ERROR os_brick.initiator.linuxrbd rados.PermissionError: [errno 1] error connecting to the cluster
2020-04-14 17:44:36.073 64800 ERROR os_brick.initiator.linuxrbd
Locating the code that raises the error
File: /usr/lib/python3/dist-packages/os_brick/initiator/linuxrbd.py
Source:
class RBDClient(object):

    def __init__(self, user, pool, *args, **kwargs):

        self.rbd_user = user
        self.rbd_pool = pool

        for attr in ['rbd_user', 'rbd_pool']:
            val = getattr(self, attr)
            if val is not None:
                setattr(self, attr, utils.convert_str(val))

        # allow these to be overridden for testing
        self.rados = kwargs.get('rados', rados)
        self.rbd = kwargs.get('rbd', rbd)

        if self.rados is None:
            raise exception.InvalidParameterValue(
                err=_('rados module required'))
        if self.rbd is None:
            raise exception.InvalidParameterValue(
                err=_('rbd module required'))

        self.rbd_conf = kwargs.get('conffile', '/etc/ceph/ceph.conf')
        self.rbd_cluster_name = kwargs.get('rbd_cluster_name', 'ceph')
        self.rados_connect_timeout = kwargs.get('rados_connect_timeout', -1)

        # ---- debug logging we added by hand: start
        # Dump the conf file that is about to be used, line by line.
        cat_res = os.popen("cat %s" % self.rbd_conf)
        for ll in cat_res.readlines():
            LOG.debug("=================%s======================" % ll)
        # Dump the effective connection parameters.
        LOG.debug("=+++++++++++++++++++user:{},pool:{},rbd:{},conf:{},cluster:{},timeout:{}+++++++++++++++++++++++++".format(self.rbd_user, self.rbd_pool, self.rbd, self.rbd_conf, self.rbd_cluster_name, self.rados_connect_timeout))
        # ---- debug logging we added by hand: end

        self.client, self.ioctx = self.connect()

    def __enter__(self):
        return self

    def __exit__(self, type_, value, traceback):
        self.disconnect()

    def connect(self):
        LOG.debug("opening connection to ceph cluster (timeout=%s).",
                  self.rados_connect_timeout)
        client = self.rados.Rados(rados_id=self.rbd_user,
                                  clustername=self.rbd_cluster_name,
                                  conffile=self.rbd_conf)

        try:
            if self.rados_connect_timeout >= 0:
                client.connect(
                    timeout=self.rados_connect_timeout)
            else:
                client.connect()
            ioctx = client.open_ioctx(self.rbd_pool)
            return client, ioctx
        except self.rados.Error:
            msg = _("Error connecting to ceph cluster.")
            LOG.exception(msg)
            # shutdown cannot raise an exception
            client.shutdown()
            raise exception.BrickException(message=msg)
The added logging gave us two key findings
1. Key finding one
From
cat_res = os.popen("cat %s" % self.rbd_conf)
for ll in cat_res.readlines():
    LOG.debug("=================%s======================" % ll)
we got the following output
=================mon_host = 172.1.1.11:6789,172.1.1.12:6789,172.1.1.13:6789======================
=================[client.cinder]======================
=================key = AQBJ6FteAhQZIhAA+R9uYw8NmCLSIiVmEYSnqQ======================
The value key = AQBJ6FteAhQZIhAA+R9uYw8NmCLSIiVmEYSnqQ== is not the correct key from cinder's keyring.
2. Key finding two
From
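The comparison we did by eye can also be scripted. The following is a minimal sketch, assuming that both the generated conf and the real keyring file use the INI-like layout shown in the log output above (a [client.<user>] section containing a key = line). The function names extract_client_key and keys_match are hypothetical helpers, not part of os-brick or Ceph.

```python
import configparser


def extract_client_key(conf_text, user):
    """Return the 'key' value of the [client.<user>] section, or None."""
    parser = configparser.ConfigParser(interpolation=None, strict=False)
    # The generated file starts with a bare "mon_host = ..." line, which
    # configparser rejects, so a placeholder section header is prepended.
    parser.read_string("[global]\n" + conf_text)
    section = "client.%s" % user
    if not parser.has_section(section):
        return None
    return parser.get(section, "key", fallback=None)


def keys_match(generated_conf_text, keyring_text, user):
    """True only if both texts carry the same non-empty key for the user."""
    generated = extract_client_key(generated_conf_text, user)
    expected = extract_client_key(keyring_text, user)
    return generated is not None and generated == expected
```

Feeding it the contents of the temporary conf and of the keyring file that is known to be correct for cinder gives a yes/no answer instead of a visual comparison of two long base64 strings.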
LOG.debug("=+++++++++++++++++++user:{},pool:{},rbd:{},conf:{},cluster:{},timeout:{}+++++++++++++++++++++++++".format(self.rbd_user,self.rbd_pool,self.rbd,self.rbd_conf,self.rbd_cluster_name,self.rados_connect_timeout))
we got the following output
conf:/tmp/brickrbd_zsp449_n
The conf file is a temporary file with the prefix brickrbd, so we searched for that string:
find /usr/lib/python3/dist-packages/ -name "*brickrbd*"
cd /usr/lib/python3/dist-packages/
grep -rl "brickrbd_" *
The search turned up the following file:
/usr/lib/python3/dist-packages/os_brick/initiator/connectors/rbd.py
File: /usr/lib/python3/dist-packages/os_brick/initiator/connectors/rbd.py
Source (these are the methods that obtain the keyring and write the temporary conf file):
    def _check_or_get_keyring_contents(self, keyring, cluster_name, user):
        try:
            if keyring is None:
                if user:
                    # The path /etc/ceph/%s.client.%s.keyring is hard-coded
                    # here, so the keyring file must not be renamed.
                    keyring_path = ("/etc/ceph/%s.client.%s.keyring" %
                                    (cluster_name, user))
                    with open(keyring_path, 'r') as keyring_file:
                        keyring = keyring_file.read()
                else:
                    keyring = ''
            return keyring
        except IOError:
            msg = (_("Keyring path %s is not readable.") % (keyring_path))
            raise exception.BrickException(msg=msg)

    def _create_ceph_conf(self, monitor_ips, monitor_ports,
                          cluster_name, user, keyring):
        monitors = ["%s:%s" % (ip, port) for ip, port in
                    zip(self._sanitize_mon_hosts(monitor_ips), monitor_ports)]
        mon_hosts = "mon_host = %s" % (','.join(monitors))

        keyring = self._check_or_get_keyring_contents(keyring, cluster_name, user)

        try:
            fd, ceph_conf_path = tempfile.mkstemp(prefix="brickrbd_")
            with os.fdopen(fd, 'w') as conf_file:
                conf_file.writelines([mon_hosts, "\n", keyring, "\n"])
            return ceph_conf_path
        except IOError:
            msg = (_("Failed to write data to %s.") % (ceph_conf_path))
            raise exception.BrickException(msg=msg)
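The source shows that, when no keyring is passed in, the code always reads /etc/ceph/%s.client.%s.keyring (cluster name, then user). The path is hard-coded, so the keyring file cannot be renamed.
Final solution
Create the following files in /etc/ceph/: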
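To see which file names the hard-coded pattern produces for a given deployment, the path construction can be reproduced in a few lines. This is a minimal sketch, assuming the default cluster name ceph and the users cinder and cinder-backup. The helper names expected_keyring_path and check_keyrings are hypothetical, and only the format string is taken from the os-brick code quoted above.

```python
import os


def expected_keyring_path(cluster_name, user):
    """Build the path with the same format string as the os-brick code above."""
    return "/etc/ceph/%s.client.%s.keyring" % (cluster_name, user)


def check_keyrings(users, cluster_name="ceph"):
    """Map each expected keyring path to True if it is a readable regular file."""
    results = {}
    for user in users:
        path = expected_keyring_path(cluster_name, user)
        results[path] = os.path.isfile(path) and os.access(path, os.R_OK)
    return results


if __name__ == "__main__":
    for path, ok in check_keyrings(["cinder", "cinder-backup"]).items():
        print("%-50s %s" % (path, "readable" if ok else "MISSING OR UNREADABLE"))
```

Because os.access() answers for the user running the script, it is most meaningful when run as the account that runs cinder-volume.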
ceph.client.cinder.keyring
ceph.client.cinder-backup.keyring
Edit ceph.conf
[client.cinder]
keyring = /etc/ceph/ceph.client.cinder.keyring
[client.cinder-backup]
keyring = /etc/ceph/ceph.client.cinder-backup.keyring
Change the owner of the files
chown cinder:cinder /etc/ceph/ceph.client.cinder.keyring
chown cinder:cinder /etc/ceph/ceph.client.cinder-backup.keyring
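After the chown, the ownership can be confirmed with a short script. This is a minimal sketch, assuming a Linux host where the service account is named cinder. The helper names owner_name and wrongly_owned are hypothetical.

```python
import os
import pwd


def owner_name(path):
    """Return the user name owning path, or the numeric uid as a string
    if the uid has no passwd entry."""
    uid = os.stat(path).st_uid
    try:
        return pwd.getpwuid(uid).pw_name
    except KeyError:
        return str(uid)


def wrongly_owned(paths, expected_owner="cinder"):
    """Return (path, owner) pairs for files that are missing (owner is None)
    or not owned by expected_owner."""
    problems = []
    for path in paths:
        if not os.path.exists(path):
            problems.append((path, None))  # None marks a missing file
            continue
        owner = owner_name(path)
        if owner != expected_owner:
            problems.append((path, owner))
    return problems


if __name__ == "__main__":
    keyrings = [
        "/etc/ceph/ceph.client.cinder.keyring",
        "/etc/ceph/ceph.client.cinder-backup.keyring",
    ]
    for path, owner in wrongly_owned(keyrings):
        print("%s: %s" % (path, "missing" if owner is None else "owned by " + owner))
```

No output means both keyring files exist and belong to cinder.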