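Storing illegal characters in XML has always been a nuisance. "Illegal" here covers more than non-printable characters: it also includes characters that XML reserves for markup, such as <, & and >.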
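A friendlier way to handle them is to store the offending data in hex or as Base64-encoded text. The two functions below show how Base64 encoding can deal with such characters while keeping the data intact and the output readable. After all, the generated XML is not only read by machines; much of it also has to be easy for people to read. The idea is this: a string that contains illegal characters is Base64-encoded, and a base64=True attribute is added to the generated XML tag; a string without illegal characters is written out as is, and the tag gets no base64 attribute.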
# -*- encoding: utf-8 -*-
"""
Created on 2011-11-08
@summary: helper functions that may be used in XML processing
@author: JerryKwan
"""
# Note: this module targets Python 2 (byte strings, print statement).
try:
    import xml.sax.saxutils
except ImportError:
    raise ImportError("requires the xml.sax.saxutils package, please check that xml.sax.saxutils is installed!")
import base64
import logging

logger = logging.getLogger(__name__)

__all__ = ["escape", "unescape"]
def escape(data):
    """
    @summary:
        Escape '&', '<', and '>' in a string of data.
        If the data is not pure ASCII, encode it in base64 instead.
    @param data: the data to be processed (a byte string)
    @return:
        {"base64": True | False,
         "data": data}
    """
    is_base64 = False
    escaped_data = ""
    try:
        # raises UnicodeDecodeError if any byte is outside the ASCII range
        data.decode("ascii")
        # pure ASCII: only the XML special characters need escaping
        escaped_data = xml.sax.saxutils.escape(data)
    except UnicodeDecodeError:
        logger.debug("%r is not an ASCII-encoded string, so it will be encoded in base64", data)
        escaped_data = base64.b64encode(data)
        is_base64 = True
    return {"base64": is_base64,
            "data": escaped_data}
def unescape(data, is_base64=False):
    """
    @summary:
        Unescape '&', '<', and '>' in a string of data.
        If is_base64 is True, the data is base64-decoded instead.
    @param data: the data to be processed
    @param is_base64: whether the data was encoded in base64
    @return: unescaped data
    """
    unescaped_data = data
    if is_base64:
        try:
            unescaped_data = base64.b64decode(data)
        except Exception as ex:
            # on failure the input is returned unchanged
            logger.debug("an exception occurred while invoking b64decode")
            logger.error(ex)
    else:
        # plain data: undo the XML escaping
        unescaped_data = xml.sax.saxutils.unescape(data)
    return unescaped_data
if __name__ == "__main__":
    def test(data):
        print "original data is: ", data
        t1 = escape(data)
        print "escaped result: ", t1
        print "unescaped result is: ", unescape(t1["data"], t1["base64"])
        print "#" * 50

    test("123456")
    test("測試")
    test("< & >")
    test("`!@#$%^&*:'\"-=")
    print "just a test"
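To make the base64=True attribute idea concrete, here is a minimal Python 3 sketch of how such a tag could be written and read back. It is not the author's code: it swaps in xml.etree.ElementTree (which escapes <, & and > on its own) and decides on Base64 by looking for characters XML 1.0 forbids rather than for non-ASCII bytes. The names make_element, read_element and _ILLEGAL_XML are hypothetical, and the character class is a simplified subset (lone surrogates are not handled).

```python
import base64
import re
import xml.etree.ElementTree as ET

# Simplified, assumed subset of what XML 1.0 forbids in character data:
# C0 control characters other than tab, LF and CR, plus U+FFFE and U+FFFF.
_ILLEGAL_XML = re.compile(r"[\x00-\x08\x0b\x0c\x0e-\x1f\ufffe\uffff]")


def make_element(tag, text):
    """Build <tag> holding text, Base64-encoding it only when necessary."""
    elem = ET.Element(tag)
    if _ILLEGAL_XML.search(text):
        # Mark the element so a reader knows the payload must be decoded.
        elem.set("base64", "True")
        elem.text = base64.b64encode(text.encode("utf-8")).decode("ascii")
    else:
        # ElementTree escapes &, < and > itself when serializing.
        elem.text = text
    return elem


def read_element(elem):
    """Recover the original text from an element built by make_element."""
    if elem.get("base64") == "True":
        return base64.b64decode(elem.text).decode("utf-8")
    return elem.text or ""


if __name__ == "__main__":
    for sample in ["123456", "< & >", "bell\x07"]:
        xml_text = ET.tostring(make_element("value", sample), encoding="unicode")
        # Round trip: parsing the serialized tag gives the sample back.
        assert read_element(ET.fromstring(xml_text)) == sample
        print(xml_text)
```

Only the last sample ends up Base64-encoded, so ordinary values stay readable in the generated file.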
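Note: the approach above is fairly simple. It only handles ASCII characters and <, & and >, and everything non-ASCII is Base64-encoded. For better compatibility, you could use the chardet package to convert every string to UTF-8 before storing it, which would make the approach much more widely applicable.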
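The UTF-8 normalization suggested in the note might look roughly like the Python 3 sketch below. The function name to_utf8 and the latin-1 fallback are assumptions made for illustration; chardet is a third-party package, so it is treated as optional here, and its guess can be wrong on short inputs.

```python
try:
    import chardet  # third-party and optional
except ImportError:
    chardet = None


def to_utf8(raw, fallback="latin-1"):
    """Re-encode a byte string of unknown encoding as UTF-8 (hypothetical helper)."""
    try:
        # Already valid UTF-8 (plain ASCII included): nothing to guess.
        text = raw.decode("utf-8")
    except UnicodeDecodeError:
        # Ask chardet for a guess; its "encoding" entry may be None.
        guess = chardet.detect(raw)["encoding"] if chardet else None
        try:
            text = raw.decode(guess or fallback, errors="replace")
        except LookupError:
            # The guessed codec name is unknown to Python.
            text = raw.decode(fallback, errors="replace")
    return text.encode("utf-8")


if __name__ == "__main__":
    print(to_utf8(b"123456"))
    print(to_utf8("測試".encode("big5")))
```

Once every string is UTF-8, the XML file can declare a single encoding, and Base64 is only needed for data that XML cannot represent at all.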