Using Google Protocol Buffers for Configuration File Storage: Design and Compatibility Analysis

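1. Data is stored one Message at a time, but the length of a Message is not known before it is read, so Protocol Buffers storage does not support reading only part of a Message;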

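2. When several Messages are stored back to back, reading just one of them is not supported either (why? because the wire format does not mark where a Message ends), but the official documentation has these two passages on the subject: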

Streaming Multiple Messages

If you want to write multiple messages to a single file or stream, it is up to you to keep track of where one message ends and the next begins. The Protocol Buffer wire format is not self-delimiting, so protocol buffer parsers cannot determine where a message ends on their own. The easiest way to solve this problem is to write the size of each message before you write the message itself. When you read the messages back in, you read the size, then read the bytes into a separate buffer, then parse from that buffer. (If you want to avoid copying bytes to a separate buffer, check out the CodedInputStream class (in both C++ and Java) which can be told to limit reads to a certain number of bytes.)

https://developers.google.com/protocol-buffers/docs/techniques?hl=zh-CN#streaming

Large Data Sets

Protocol Buffers are not designed to handle large messages. As a general rule of thumb, if you are dealing in messages larger than a megabyte each, it may be time to consider an alternate strategy.

https://developers.google.com/protocol-buffers/docs/techniques?hl=zh-CN#streaming

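My reading of this is:

a. Large files should be split into multiple Messages

b. Each Message should be stored together with its own length (so that the start of the next Message can be found)

c. For Messages larger than about 1 MB each, it is better not to use ProtoBuf (the official reasoning is that different large-data use cases call for different storage strategies, such as a database);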
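Point b can be sketched in a few lines of Python. This is only an illustration under assumptions: the 4-byte little-endian length prefix and the helper names `write_framed`, `read_framed` and `read_nth` are my own choices, not part of Protocol Buffers, and the payloads are plain bytes standing in for whatever `SerializeToString()` would return.

```python
import io
import struct

def write_framed(stream, payload: bytes) -> None:
    """Write one record: a 4-byte little-endian length, then the payload."""
    stream.write(struct.pack("<I", len(payload)))
    stream.write(payload)

def read_framed(stream):
    """Yield each payload in order until the stream is exhausted."""
    while True:
        header = stream.read(4)
        if not header:
            return  # clean end of stream
        if len(header) < 4:
            raise ValueError("truncated length prefix")
        (size,) = struct.unpack("<I", header)
        payload = stream.read(size)
        if len(payload) < size:
            raise ValueError("truncated message body")
        yield payload

def read_nth(stream, index: int) -> bytes:
    """Return the payload at `index`, seeking past earlier bodies unread."""
    for _ in range(index):
        header = stream.read(4)
        if len(header) < 4:
            raise IndexError("fewer messages than requested")
        (size,) = struct.unpack("<I", header)
        stream.seek(size, io.SEEK_CUR)
    for payload in read_framed(stream):
        return payload
    raise IndexError("fewer messages than requested")

# Usage: three stand-in "serialized messages" in one in-memory file.
buf = io.BytesIO()
for blob in (b"first", b"second message", b"third"):
    write_framed(buf, blob)
buf.seek(0)
second = read_nth(buf, 1)
```

With the length stored up front, a reader can jump over Messages it does not need, which is exactly what the bare wire format cannot offer.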

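3. So a fixed file header is still needed outside of Protobuf, holding the information that never changes together with the length of the Message body;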
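One possible shape for such a header, as a rough sketch only: the magic value `PCFG`, the field layout (magic, format version, flags, body length) and the function names are all hypothetical, chosen just to make the idea concrete.

```python
import io
import struct

MAGIC = b"PCFG"                   # hypothetical 4-byte file signature
HEADER = struct.Struct("<4sHHI")  # magic, format version, flags, body length

def write_config(stream, body: bytes, version: int = 1) -> None:
    """Write the fixed header followed by the serialized Message body."""
    stream.write(HEADER.pack(MAGIC, version, 0, len(body)))
    stream.write(body)

def read_config(stream):
    """Validate the header and return (version, body bytes)."""
    raw = stream.read(HEADER.size)
    if len(raw) < HEADER.size:
        raise ValueError("file too short for header")
    magic, version, _flags, length = HEADER.unpack(raw)
    if magic != MAGIC:
        raise ValueError("not a config file")
    body = stream.read(length)
    if len(body) != length:
        raise ValueError("truncated body")
    return version, body

# Usage: the body would normally be the serialized Message.
f = io.BytesIO()
write_config(f, b"serialized-message-bytes", version=2)
f.seek(0)
version, body = read_config(f)
```

The header layout itself never changes between releases; everything that may evolve lives in the Protobuf body behind it.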

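4. As for forward and backward compatibility of the file data, the only requirement on the .proto file is that a new version may only add optional/repeated fields on top of the old version (fields can be retired, but a retired field's tag number must stay reserved and never be reused). That is enough to get both forward and backward compatibility. In more detail: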

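a. Forward compatibility (old program reads new data): since fields can only be added and the added fields are optional, the old program skips the fields it does not know and never uses their values, so its behavior does not change at all;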

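b. Backward compatibility (new program reads old data): Protocol Buffers supplies a default value for any optional field that is missing. You can also specify the default yourself, and changing a default does not affect compatibility;

The automatically supplied defaults depend on the type: the empty string, numeric zero, and false (and, for enums, the first value listed);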

If the default value is not specified for an optional element, a type-specific default value is used instead: for strings, the default value is the empty string. For bools, the default value is false. For numeric types, the default value is zero. For enums, the default value is the first value listed in the enum's type definition.

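Because Protocol Buffers never writes an optional field's default value to the wire, changing a default does not affect compatibility (each program simply sees the default defined in its own version of the .proto);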

Changing a default value is generally OK, as long as you remember that default values are never sent over the wire. Thus, if a program receives a message in which a particular field isn't set, the program will see the default value as it was defined in that program's version of the protocol. It will NOT see the default value that was defined in the sender's code.

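c. Confirmed by experiment:

When the new version of the program leaves the newly added optional fields unset, the binary data it produces is byte-for-byte identical to what the old version produces;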
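The three points above can be reproduced with a toy encoder, without the real library. This is a deliberately simplified sketch: it handles only varint (wire type 0) and length-delimited (wire type 2) fields, drops unknown fields instead of preserving them, and the schemas, field numbers and the default of 30 are invented for illustration.

```python
def encode_varint(n: int) -> bytes:
    """Base-128 varint, least significant group first (non-negative ints)."""
    out = bytearray()
    while True:
        byte = n & 0x7F
        n >>= 7
        if n:
            out.append(byte | 0x80)  # more groups follow
        else:
            out.append(byte)
            return bytes(out)

def decode_varint(data: bytes, pos: int):
    """Return (value, next position)."""
    result = shift = 0
    while True:
        byte = data[pos]
        pos += 1
        result |= (byte & 0x7F) << shift
        if not byte & 0x80:
            return result, pos
        shift += 7

def encode(fields: dict) -> bytes:
    """Serialize {field_number: value}; a value of None means 'unset'."""
    out = bytearray()
    for number in sorted(fields):
        value = fields[number]
        if value is None:
            continue  # unset optional fields are not written at all
        if isinstance(value, int):
            out += encode_varint(number << 3 | 0) + encode_varint(value)
        else:
            out += encode_varint(number << 3 | 2) + encode_varint(len(value)) + value
    return bytes(out)

def decode(data: bytes, schema: dict) -> dict:
    """schema maps known field numbers to this reader's own defaults."""
    result = dict(schema)
    pos = 0
    while pos < len(data):
        key, pos = decode_varint(data, pos)
        number, wire_type = key >> 3, key & 7
        if wire_type == 0:
            value, pos = decode_varint(data, pos)
        elif wire_type == 2:
            size, pos = decode_varint(data, pos)
            value = data[pos:pos + size]
            pos += size
        else:
            raise ValueError("wire type not handled in this sketch")
        if number in schema:  # unknown fields are skipped
            result[number] = value
    return result

OLD_SCHEMA = {1: b"", 2: 0}         # hypothetical: name, port
NEW_SCHEMA = {1: b"", 2: 0, 3: 30}  # adds optional timeout, default 30

old_bytes = encode({1: b"svc", 2: 8080})
new_unset = encode({1: b"svc", 2: 8080, 3: None})
new_set = encode({1: b"svc", 2: 8080, 3: 5})

assert new_unset == old_bytes                                          # point c
assert decode(new_set, OLD_SCHEMA) == {1: b"svc", 2: 8080}             # point a
assert decode(old_bytes, NEW_SCHEMA) == {1: b"svc", 2: 8080, 3: 30}    # point b
```

Note that the default of 30 comes from the reader's schema, not from the bytes, which is why changing it does not alter the stored data.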

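5. For a store of program configuration data, I think two basic properties are needed (Protocol Buffers beats XML on both):

a. Upgrade compatibility, with defaults supplied for newly added data (so upgrades are transparent to the application);

b. Separation of format from data, so that format changes can be tracked and compatibility tested on their own (XML does not separate format from data; DTD/XML Schema can achieve it, but that is too cumbersome);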
