Using Google Protocol Buffer for configuration file storage: approach and compatibility analysis

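1. Data is stored in units of Message, but the length of a Message is not known before it is read, so Protocol Buffer storage does not support reading only part of a Message;

2. When several Messages are stored back to back, reading just one of them is not supported either (why?). The official documentation has these two passages on the subject: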

Streaming Multiple Messages

If you want to write multiple messages to a single file or stream, it is up to you to keep track of where one message ends and the next begins. The Protocol Buffer wire format is not self-delimiting, so protocol buffer parsers cannot determine where a message ends on their own. The easiest way to solve this problem is to write the size of each message before you write the message itself. When you read the messages back in, you read the size, then read the bytes into a separate buffer, then parse from that buffer. (If you want to avoid copying bytes to a separate buffer, check out the CodedInputStream class (in both C++ and Java) which can be told to limit reads to a certain number of bytes.)

https://developers.google.com/protocol-buffers/docs/techniques?hl=zh-CN#streaming

Large Data Sets

Protocol Buffers are not designed to handle large messages. As a general rule of thumb, if you are dealing in messages larger than a megabyte each, it may be time to consider an alternate strategy.

https://developers.google.com/protocol-buffers/docs/techniques?hl=zh-CN#streaming

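My understanding of this is:

a. A large file should be split into multiple Message structures;

b. The length of each Message is stored along with it (so that the start of the next Message can be located);

c. For data larger than 1 MB, ProtoBuf is best avoided (the official reasoning is that different large-data use cases call for different storage strategies, a database for example);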
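The length-prefix idea in point b can be sketched as follows. This is only an illustration under assumptions: the 4-byte little-endian prefix and the function names are hypothetical choices, not something Protocol Buffers prescribes, and the payloads here are plain bytes standing in for serialized Messages (in Python these would typically come from `SerializeToString()`).

```python
import io
import struct

# Assumed framing: a 4-byte little-endian unsigned length, then the payload.
LEN = struct.Struct("<I")

def _read_exact(stream, size: int) -> bytes:
    """Read exactly `size` bytes or raise EOFError."""
    data = stream.read(size)
    if len(data) != size:
        raise EOFError("stream ended in the middle of a frame")
    return data

def write_framed(stream, payload: bytes) -> None:
    """Write the payload's size first, then the payload itself."""
    stream.write(LEN.pack(len(payload)))
    stream.write(payload)

def read_all(stream):
    """Yield every payload in order; stop cleanly at end of stream."""
    while True:
        prefix = stream.read(LEN.size)
        if not prefix:
            return
        if len(prefix) != LEN.size:
            raise EOFError("truncated length prefix")
        (size,) = LEN.unpack(prefix)
        yield _read_exact(stream, size)

def read_nth(stream, index: int) -> bytes:
    """Return payload number `index` (0-based), seeking past earlier bodies unread."""
    stream.seek(0)
    for _ in range(index):
        (size,) = LEN.unpack(_read_exact(stream, LEN.size))
        stream.seek(size, io.SEEK_CUR)  # skip the body without parsing it
    (size,) = LEN.unpack(_read_exact(stream, LEN.size))
    return _read_exact(stream, size)
```

With the sizes stored explicitly, a reader can reach one Message by walking only the length prefixes, which the bare wire format does not allow.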

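3. A fixed file header therefore has to be added outside of Protobuf, holding the information that never changes and the length of the Message data body;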
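A minimal sketch of such a header, assuming a hypothetical 12-byte layout (magic bytes, a format version, two reserved bytes, and the body length). The magic value `PCFG`, the field order, and the function names are made up for illustration.

```python
import struct

# Assumed layout: 4-byte magic, 2-byte version, 2 reserved bytes, 4-byte body length.
HEADER = struct.Struct("<4sHHI")
MAGIC = b"PCFG"  # hypothetical file signature

def pack_file(body: bytes, version: int = 1) -> bytes:
    """Prepend the fixed header to an already serialized Message body."""
    return HEADER.pack(MAGIC, version, 0, len(body)) + body

def unpack_file(data: bytes):
    """Validate the header and return (version, body)."""
    if len(data) < HEADER.size:
        raise ValueError("file is shorter than the fixed header")
    magic, version, _reserved, body_len = HEADER.unpack_from(data, 0)
    if magic != MAGIC:
        raise ValueError("unexpected magic bytes")
    body = data[HEADER.size:HEADER.size + body_len]
    if len(body) != body_len:
        raise ValueError("body is shorter than the header claims")
    return version, body
```

Because the header has a fixed size, it can be read before anything is known about the Message that follows, and the stored length tells the reader how many bytes to hand to the Protobuf parser.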

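4. As for forward and backward compatibility of the file data, the only requirement on the .proto file is that a new version may only add optional/repeated fields on top of the old version (fields may also be deprecated, but the Tag value of a deprecated field must stay reserved and never be reused). With that rule, both forward and backward compatibility can be achieved. To analyze this further:

a. Forward compatibility (an old program reads new data): since fields can only be added, and the added fields are optional, the old program never uses the values of the new fields and its behavior does not change at all;

b. Backward compatibility (a new program reads old data): protocol buffer supplies a default value for every optional field, you can also specify the default yourself, and changing a default value does not affect compatibility;

Automatically supplied defaults: depending on the type, the empty string, the number 0, or false;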

If the default value is not specified for an optional element, a type-specific default value is used instead: for strings, the default value is the empty string. For bools, the default value is false. For numeric types, the default value is zero. For enums, the default value is the first value listed in the enum's type definition.

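Because protocol buffer does not write a default value to the wire for an optional field that is not set, the default value does not affect compatibility;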

Changing a default value is generally OK, as long as you remember that default values are never sent over the wire. Thus, if a program receives a message in which a particular field isn't set, the program will see the default value as it was defined in that program's version of the protocol. It will NOT see the default value that was defined in the sender's code.

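c. Confirmed by experiment:

When the new version of the program leaves the newly added optional field unset, the binary data it produces is exactly identical to the binary data produced by the old version of the program;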
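The three points above can be reproduced in miniature without the Protobuf library. The sketch below hand-encodes only two wire types (varint and length-delimited) and assumes a hypothetical config message with `name = 1` and `port = 2`, to which a new version adds `optional timeout = 3` with a default of 30. It imitates the wire format for illustration and is not the library's implementation.

```python
def encode_varint(value: int) -> bytes:
    """Base-128 varint, least significant group first (non-negative ints only)."""
    if value < 0:
        raise ValueError("negative values are not handled in this sketch")
    out = bytearray()
    while True:
        byte = value & 0x7F
        value >>= 7
        if value:
            out.append(byte | 0x80)  # more groups follow
        else:
            out.append(byte)
            return bytes(out)

def decode_varint(data: bytes, pos: int):
    result = shift = 0
    while True:
        byte = data[pos]
        pos += 1
        result |= (byte & 0x7F) << shift
        if not byte & 0x80:
            return result, pos
        shift += 7

def encode(fields: dict) -> bytes:
    """Encode {field_number: int or bytes}; unset fields are simply absent."""
    out = bytearray()
    for number in sorted(fields):
        value = fields[number]
        if isinstance(value, int):
            out += encode_varint((number << 3) | 0) + encode_varint(value)
        else:
            out += encode_varint((number << 3) | 2) + encode_varint(len(value)) + value
    return bytes(out)

def decode(data: bytes, known: dict) -> dict:
    """`known` maps field number -> (name, default) for the reader's schema version."""
    result = {name: default for name, default in known.values()}
    pos = 0
    while pos < len(data):
        key, pos = decode_varint(data, pos)
        number, wire_type = key >> 3, key & 7
        if wire_type == 0:
            value, pos = decode_varint(data, pos)
        elif wire_type == 2:
            size, pos = decode_varint(data, pos)
            value, pos = data[pos:pos + size], pos + size
        else:
            raise ValueError("wire type not handled in this sketch")
        if number in known:  # unknown field numbers are skipped
            result[known[number][0]] = value
    return result

# Hypothetical schema versions: v2 adds one optional field with default 30.
V1 = {1: ("name", b""), 2: ("port", 0)}
V2 = {1: ("name", b""), 2: ("port", 0), 3: ("timeout", 30)}

def write_v1(name: bytes, port: int) -> bytes:
    return encode({1: name, 2: port})

def write_v2(name: bytes, port: int, timeout=None) -> bytes:
    fields = {1: name, 2: port}
    if timeout is not None:  # an unset optional field is not written at all
        fields[3] = timeout
    return encode(fields)
```

In this model `write_v1(b"srv", 8080)` and `write_v2(b"srv", 8080)` yield the same bytes, an old reader using `V1` ignores field 3 in new data, and a new reader using `V2` fills in its own default for `timeout` when reading old data.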

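5. For a store of program configuration data, I believe two basic properties are needed (on both of them this approach beats XML):

a. Upgrade compatibility, with default values supplied for newly added data (upgrades are transparent to the application);

b. Separation of format and data, so that format changes can be tracked on their own and compatibility can be tested (XML does not separate format from data; DTD/XML Schema can achieve it, but it is too cumbersome);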