20. DataStream API: State & Fault Tolerance (The Broadcast State Pattern)

Flink 1.9

The Broadcast State Pattern

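Working with State describes operator state which, upon restore, is either evenly distributed among the parallel tasks of an operator, or unioned, with the whole state being used to initialize the restored parallel tasks.

A third type of operator state supported by Flink is Broadcast State. It was introduced to support use cases where some data coming from one stream must be broadcast to all downstream tasks, where it is stored locally and used to process all incoming elements on the other stream. A natural fit is a low-throughput stream containing a set of rules that we want to evaluate against all elements coming from another stream. With this kind of use case in mind, broadcast state differs from the other operator states in that:

It has a map format.
It is only available to specific operators that take a broadcasted stream and a non-broadcasted stream as inputs.
Such an operator can have multiple broadcast states with different names.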
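
To make these three properties concrete, here is a minimal plain-Java sketch. It does not use any Flink classes; `BroadcastStateSketch`, `broadcast`, and `lookup` are hypothetical names invented for illustration. Every simulated parallel task holds its own local copy of each named, map-shaped state, and a broadcast element is applied to all copies.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical plain-Java model of broadcast state; this is NOT the Flink API.
class BroadcastStateSketch {

    // One entry per simulated parallel task: state name -> (key -> value).
    private final List<Map<String, Map<String, String>>> tasks = new ArrayList<>();

    BroadcastStateSketch(int parallelism) {
        for (int i = 0; i < parallelism; i++) {
            tasks.add(new HashMap<>());
        }
    }

    // A broadcast element reaches every task, and each task stores it locally.
    void broadcast(String stateName, String key, String value) {
        for (Map<String, Map<String, String>> task : tasks) {
            task.computeIfAbsent(stateName, n -> new HashMap<>()).put(key, value);
        }
    }

    // A non-broadcast element is handled by one task and reads only that task's copy.
    String lookup(int taskIndex, String stateName, String key) {
        Map<String, String> state = tasks.get(taskIndex).get(stateName);
        return state == null ? null : state.get(key);
    }

    public static void main(String[] args) {
        BroadcastStateSketch op = new BroadcastStateSketch(3);
        op.broadcast("rules", "r1", "RECTANGLE->TRIANGLE");
        op.broadcast("thresholds", "max", "10"); // a second state with a different name
        System.out.println(op.lookup(0, "rules", "r1"));
        System.out.println(op.lookup(2, "thresholds", "max"));
    }
}
```

In this model, any task can answer a lookup because the broadcast was applied to every local copy, which is the property the rest of this section relies on.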

Provided APIs

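To show the provided APIs, we start with an example before presenting their full functionality. As our running example, we use the case where we have a stream of objects of different colors and shapes, and we want to find pairs of objects of the same color that follow a certain pattern, for example a rectangle followed by a triangle. We assume that the set of interesting patterns evolves over time.

In this example, the first stream contains elements of type Item with a Color and a Shape property. The other stream contains the Rules.

Starting with the stream of Items: because we want pairs of the same color, we only need to key it by Color. This ensures that elements of the same color end up in the same parallel task, and therefore on the same physical machine.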

// key the items by color
KeyedStream<Item, Color> colorPartitionedStream = shapeStream
                        .keyBy(new KeySelector<Item, Color>(){...});
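
Why keying by color brings same-colored items together can be illustrated without Flink. The sketch below assigns each key to a task index with a simple hash-modulo rule. That rule is a simplification chosen for illustration and is not Flink's actual key-group assignment; `KeyPartitionSketch`, `Color`, `taskFor`, and `route` are hypothetical names.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical model of a keyed exchange; not Flink's real partitioner.
class KeyPartitionSketch {

    enum Color { RED, GREEN, BLUE }

    // Simplified assignment: the same key always maps to the same task index.
    // The enum name is hashed so the result is stable across JVM runs.
    static int taskFor(Color key, int parallelism) {
        return Math.floorMod(key.name().hashCode(), parallelism);
    }

    // Routes each item (represented here only by its color) to a per-task bucket.
    static List<List<Color>> route(List<Color> items, int parallelism) {
        List<List<Color>> buckets = new ArrayList<>();
        for (int i = 0; i < parallelism; i++) {
            buckets.add(new ArrayList<>());
        }
        for (Color c : items) {
            buckets.get(taskFor(c, parallelism)).add(c);
        }
        return buckets;
    }

    public static void main(String[] args) {
        List<Color> items = Arrays.asList(Color.RED, Color.BLUE, Color.RED, Color.GREEN);
        System.out.println(route(items, 2));
    }
}
```

Whatever the parallelism, both RED items land in the same bucket, so a task that stores a partial match for RED is also the task that sees the next RED item.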

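Moving on to the Rules: the stream containing them should be broadcast to all downstream tasks, and these tasks should store the rules locally so that they can evaluate them against all incoming Items. The snippet below will i) broadcast the stream of rules and ii) use the provided MapStateDescriptor to create the broadcast state in which the rules are stored.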

// a map descriptor to store the name of the rule (string) and the rule itself.
MapStateDescriptor<String, Rule> ruleStateDescriptor = new MapStateDescriptor<>(
			"RulesBroadcastState",
			BasicTypeInfo.STRING_TYPE_INFO,
			TypeInformation.of(new TypeHint<Rule>() {}));
		
// broadcast the rules and create the broadcast state
BroadcastStream<Rule> ruleBroadcastStream = ruleStream
                        .broadcast(ruleStateDescriptor);

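Finally, in order to evaluate the Rules against the incoming elements from the Item stream, we need to:

Connect the two streams.
Specify our match detection logic.

To connect a stream (keyed or non-keyed) with a BroadcastStream, call connect() on the non-broadcasted stream with the BroadcastStream as an argument. This returns a BroadcastConnectedStream, on which we can call process() with a special type of CoProcessFunction. That function contains our matching logic. Its exact type depends on the type of the non-broadcasted stream:

If the stream is keyed, the function is a KeyedBroadcastProcessFunction.
If the stream is non-keyed, the function is a BroadcastProcessFunction.

Note: connect() should be called on the non-broadcasted stream, with the BroadcastStream as an argument.

Given that our non-broadcasted stream is keyed, the following snippet includes the above calls: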

DataStream<String> output = colorPartitionedStream
                 .connect(ruleBroadcastStream)
                 .process(
                     
                     // type arguments in our KeyedBroadcastProcessFunction represent: 
                     //   1. the key of the keyed stream
                     //   2. the type of elements in the non-broadcast side
                     //   3. the type of elements in the broadcast side
                     //   4. the type of the result, here a string
                     
                     new KeyedBroadcastProcessFunction<Color, Item, Rule, String>() {
                         // my matching logic
                     }
                 );

BroadcastProcessFunction and KeyedBroadcastProcessFunction

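As with a CoProcessFunction, these functions have two process methods to implement: processBroadcastElement(), which handles incoming elements of the broadcasted stream, and processElement(), which handles incoming elements of the non-broadcasted stream. The full signatures of the methods are: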

public abstract class BroadcastProcessFunction<IN1, IN2, OUT> extends BaseBroadcastProcessFunction {

    public abstract void processElement(IN1 value, ReadOnlyContext ctx, Collector<OUT> out) throws Exception;

    public abstract void processBroadcastElement(IN2 value, Context ctx, Collector<OUT> out) throws Exception;
}

public abstract class KeyedBroadcastProcessFunction<KS, IN1, IN2, OUT> {

    public abstract void processElement(IN1 value, ReadOnlyContext ctx, Collector<OUT> out) throws Exception;

    public abstract void processBroadcastElement(IN2 value, Context ctx, Collector<OUT> out) throws Exception;

    public void onTimer(long timestamp, OnTimerContext ctx, Collector<OUT> out) throws Exception;
}

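The first thing to notice is that both functions require implementing processBroadcastElement() to process elements of the broadcast side and processElement() to process elements of the non-broadcast side.

The two methods differ in the context they are given. The non-broadcast side receives a ReadOnlyContext, while the broadcast side receives a Context.

Both contexts (ctx in the list below):

give access to the broadcast state: ctx.getBroadcastState(MapStateDescriptor<K, V> stateDescriptor);
allow querying the timestamp of the element: ctx.timestamp();
return the current watermark: ctx.currentWatermark();
return the current processing time: ctx.currentProcessingTime();
emit elements to side outputs: ctx.output(OutputTag<X> outputTag, X value).

The stateDescriptor passed to getBroadcastState() should be identical to the one passed to .broadcast(ruleStateDescriptor) above.

The difference lies in the type of access each context gives to the broadcast state. The broadcast side has read-write access, while the non-broadcast side has read-only access (hence the names Context and ReadOnlyContext). The reason is that Flink has no cross-task communication. To guarantee that the contents of the broadcast state are the same across all parallel instances of the operator, read-write access is given only to the broadcast side, which sees the same elements in all tasks, and the computation performed on each incoming element on that side must be identical across all tasks. Ignoring this rule breaks the consistency guarantees of the state and leads to inconsistent results that are often hard to debug.

Note: The logic implemented in processBroadcastElement() must have the same deterministic behavior across all parallel instances!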
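
The read-write versus read-only split can be mimicked in plain Java. In this sketch (hypothetical names, no Flink classes), the broadcast side is handed the mutable map, while the non-broadcast side is handed an unmodifiable view of the same map, so a write attempt from that side fails immediately. It is only an analogy for the two context types, not a description of how Flink implements them.

```java
import java.util.Collections;
import java.util.HashMap;
import java.util.Map;

// Hypothetical analogy for Context vs. ReadOnlyContext; not Flink code.
class AccessSketch {

    private final Map<String, String> rules = new HashMap<>();

    // Broadcast side: read-write access, analogous to Context.
    Map<String, String> broadcastSideView() {
        return rules;
    }

    // Non-broadcast side: read-only access, analogous to ReadOnlyContext.
    Map<String, String> nonBroadcastSideView() {
        return Collections.unmodifiableMap(rules);
    }

    // Returns true if a write through the given view is rejected.
    static boolean writeIsRejected(Map<String, String> view) {
        try {
            view.put("r2", "attempted-write");
            return false;
        } catch (UnsupportedOperationException e) {
            return true;
        }
    }

    public static void main(String[] args) {
        AccessSketch state = new AccessSketch();
        state.broadcastSideView().put("r1", "RECTANGLE->TRIANGLE");
        // The read-only view sees updates made through the broadcast side.
        System.out.println(state.nonBroadcastSideView().get("r1"));
        System.out.println(writeIsRejected(state.nonBroadcastSideView()));
    }
}
```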

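Finally, because the KeyedBroadcastProcessFunction operates on a keyed stream, it exposes some functionality that is not available to the BroadcastProcessFunction. That is:

The ReadOnlyContext in the processElement() method gives access to Flink's underlying timer service, which allows registering event-time and/or processing-time timers. When a timer fires, onTimer() (shown above) is invoked with an OnTimerContext, which exposes the same functionality as the ReadOnlyContext plus the ability to ask whether the timer that fired was an event-time or a processing-time one, and to query the key associated with the timer.

The Context in the processBroadcastElement() method contains the method applyToKeyedState(StateDescriptor<S, VS> stateDescriptor, KeyedStateFunction<KS, S> function). It allows registering a KeyedStateFunction to be applied to all states of all keys associated with the provided stateDescriptor.

Note: Registering timers is only possible in processElement() of the KeyedBroadcastProcessFunction. It is not possible in processBroadcastElement(), because no key is associated with broadcasted elements.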

Coming back to our original example, our KeyedBroadcastProcessFunction could look like the following:

new KeyedBroadcastProcessFunction<Color, Item, Rule, String>() {

    // store partial matches, i.e. first elements of the pair waiting for their second element
    // we keep a list as we may have many first elements waiting
    private final MapStateDescriptor<String, List<Item>> mapStateDesc =
        new MapStateDescriptor<>(
            "items",
            BasicTypeInfo.STRING_TYPE_INFO,
            new ListTypeInfo<>(Item.class));

    // identical to our ruleStateDescriptor above
    private final MapStateDescriptor<String, Rule> ruleStateDescriptor = 
        new MapStateDescriptor<>(
            "RulesBroadcastState",
            BasicTypeInfo.STRING_TYPE_INFO,
            TypeInformation.of(new TypeHint<Rule>() {}));

    @Override
    public void processBroadcastElement(Rule value,
                                        Context ctx,
                                        Collector<String> out) throws Exception {
        ctx.getBroadcastState(ruleStateDescriptor).put(value.name, value);
    }

    @Override
    public void processElement(Item value,
                               ReadOnlyContext ctx,
                               Collector<String> out) throws Exception {

        final MapState<String, List<Item>> state = getRuntimeContext().getMapState(mapStateDesc);
        final Shape shape = value.getShape();
    
        for (Map.Entry<String, Rule> entry :
                ctx.getBroadcastState(ruleStateDescriptor).immutableEntries()) {
            final String ruleName = entry.getKey();
            final Rule rule = entry.getValue();
    
            List<Item> stored = state.get(ruleName);
            if (stored == null) {
                stored = new ArrayList<>();
            }
    
            if (shape.equals(rule.second) && !stored.isEmpty()) {
                for (Item i : stored) {
                    out.collect("MATCH: " + i + " - " + value);
                }
                stored.clear();
            }
    
            // there is no else{} to cover if rule.first == rule.second
            if (shape.equals(rule.first)) {
                stored.add(value);
            }
    
            if (stored.isEmpty()) {
                state.remove(ruleName);
            } else {
                state.put(ruleName, stored);
            }
        }
    }
}
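
The function above relies on `Item`, `Rule`, and `Shape` types that are not shown here. The following sketch fills them in with hypothetical definitions and replays the same matching logic against plain `HashMap`s for a single color key, so it can be run without a Flink cluster. The class `MatchSketch`, the enum values, and the `Rule` fields are all assumptions, and items are reduced to their shape for brevity.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical stand-in for one parallel task handling ONE color key; not Flink code.
class MatchSketch {

    enum Shape { RECTANGLE, TRIANGLE, CIRCLE }

    // Assumed rule shape: a `first` shape followed by a `second` shape.
    static final class Rule {
        final String name;
        final Shape first;
        final Shape second;

        Rule(String name, Shape first, Shape second) {
            this.name = name;
            this.first = first;
            this.second = second;
        }
    }

    // Stand-in for the broadcast state: rule name -> rule.
    final Map<String, Rule> rules = new HashMap<>();
    // Stand-in for the keyed MapState: rule name -> first elements awaiting a partner.
    final Map<String, List<Shape>> partial = new HashMap<>();

    void processBroadcastElement(Rule rule) {
        rules.put(rule.name, rule);
    }

    List<String> processElement(Shape shape) {
        List<String> out = new ArrayList<>();
        for (Rule rule : rules.values()) {
            List<Shape> stored = partial.getOrDefault(rule.name, new ArrayList<>());
            if (shape == rule.second && !stored.isEmpty()) {
                for (Shape s : stored) {
                    out.add("MATCH: " + s + " - " + shape);
                }
                stored.clear();
            }
            // No else branch, so rules with first == second are also covered.
            if (shape == rule.first) {
                stored.add(shape);
            }
            if (stored.isEmpty()) {
                partial.remove(rule.name);
            } else {
                partial.put(rule.name, stored);
            }
        }
        return out;
    }

    public static void main(String[] args) {
        MatchSketch task = new MatchSketch();
        task.processBroadcastElement(new Rule("rect-then-tri", Shape.RECTANGLE, Shape.TRIANGLE));
        System.out.println(task.processElement(Shape.RECTANGLE)); // prints []
        System.out.println(task.processElement(Shape.TRIANGLE));  // prints [MATCH: RECTANGLE - TRIANGLE]
    }
}
```

The rectangle is stored as a partial match, and the triangle that follows both emits the match and clears the stored entry, mirroring the state.remove(ruleName) branch in the function above.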

Important Considerations

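Having described the provided APIs, this section focuses on the important things to keep in mind when using broadcast state. These are:

There is no cross-task communication: As stated earlier, this is why only the broadcast side of a (Keyed)BroadcastProcessFunction can modify the contents of the broadcast state. In addition, the user must ensure that all tasks modify the contents of the broadcast state in the same way for each incoming element. Otherwise, different tasks may hold different contents, leading to inconsistent results.
Order of events in broadcast state may differ across tasks: Although broadcasting guarantees that all elements (eventually) reach all downstream tasks, elements may arrive at each task in a different order. State updates for each incoming element therefore must not depend on the order of incoming events.
All tasks checkpoint their broadcast state: Although all tasks hold the same elements in their broadcast state when a checkpoint takes place (checkpoint barriers do not overtake elements), every task checkpoints its broadcast state, not just one of them. This is a design decision that avoids having all tasks read from the same file during a restore (and thus avoids hotspots), at the cost of increasing the size of the checkpointed state by a factor of p (= parallelism). Flink guarantees that upon restoring or rescaling there are no duplicates and no missing data. When recovering with the same or a smaller parallelism, each task reads its own checkpointed state. When scaling up, each task reads its own state, and the remaining tasks (p_new - p_old) read the checkpoints of previous tasks in a round-robin manner.
No RocksDB state backend: Broadcast state is kept in memory at runtime, and memory should be provisioned accordingly. This holds for all operator states.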
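
Two of these points lend themselves to a small plain-Java sketch with hypothetical names. The first part contrasts an update that writes each element under its own name (order-independent as long as names are distinct) with an update where the last element wins (order-dependent). The second part shows one way to read the round-robin restore described above; the modulo mapping is an assumption made for illustration and is not taken from Flink's source.

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical illustration of the considerations above; not Flink code.
class ConsiderationsSketch {

    // Each element is {name, value}. Writing under the element's own name
    // gives the same final state in any arrival order if names are distinct.
    static Map<String, String> putByName(List<String[]> elements) {
        Map<String, String> state = new HashMap<>();
        for (String[] e : elements) {
            state.put(e[0], e[1]);
        }
        return state;
    }

    // Every element overwrites the same slot, so the final state depends on
    // which element arrived last.
    static Map<String, String> keepLatest(List<String[]> elements) {
        Map<String, String> state = new HashMap<>();
        for (String[] e : elements) {
            state.put("latest", e[1]);
        }
        return state;
    }

    // Assumed round-robin mapping when scaling up: index of the old task
    // whose checkpoint the given new task reads.
    static int checkpointToRead(int newTaskIndex, int oldParallelism) {
        return newTaskIndex % oldParallelism;
    }

    public static void main(String[] args) {
        String[] a = {"r1", "A"};
        String[] b = {"r2", "B"};
        List<String[]> taskOneOrder = Arrays.asList(a, b);
        List<String[]> taskTwoOrder = Arrays.asList(b, a);
        System.out.println(putByName(taskOneOrder).equals(putByName(taskTwoOrder)));   // prints true
        System.out.println(keepLatest(taskOneOrder).equals(keepLatest(taskTwoOrder))); // prints false
        System.out.println(checkpointToRead(3, 2)); // prints 1
    }
}
```

Under this mapping, scaling from 2 to 4 tasks lets tasks 0 and 1 read their own checkpoints while tasks 2 and 3 read those of tasks 0 and 1, which spreads the reads instead of concentrating them on a single file.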

https://ci.apache.org/projects/flink/flink-docs-release-1.9/dev/stream/state/broadcast_state.html

https://www.jianshu.com/p/520376ae837e

https://www.cnblogs.com/Springmoon-venn/p/11362397.html
