Hive谓词解析过程分析

where col1 = 100 and abs(col2) > 0在Hive中的处理过程

where过滤条件称为谓词predicate。

以上where过滤条件在经过Hive的语法解析后，生成如下的语法树：

TOK_WHERE

AND

TOK_TABLE_OR_COL

100

TOK_FUNCTION

ABS

TOK_TABLE_OR_COL

有了语法树之后，最终的目的是生成predicate每个节点对应的ExprNodeDesc，即描述对应的节点：

public Map<ASTNode, ExprNodeDesc> genAllExprNodeDesc(ASTNode expr, RowResolver input,

TypeCheckCtx tcCtx) throws SemanticException {

…

Map<ASTNode, ExprNodeDesc> nodeOutputs =

TypeCheckProcFactory.genExprNode(expr, tcCtx);

…

生成的过程是对上述语法树的一个深度优先遍历的过程，Hive中大量对树的遍历的代码，在遍历过程中根据指定的规则或对语法树进行修改，或输出相应的结果。

Hive中有一个默认的深度优先遍历的实现DefaultGraphWalker。

这个遍历器的实现部分代码如下：

public void walk(Node nd) throws SemanticException {

// Push the node in the stack

opStack.push(nd);

// While there are still nodes to dispatch...
while (!opStack.empty()) {
  Node node = opStack.peek();

  if (node.getChildren() == null ||
          getDispatchedList().containsAll(node.getChildren())) {
    // Dispatch current node
    if (!getDispatchedList().contains(node)) {
      dispatch(node, opStack);
      opQueue.add(node);
    }
    opStack.pop();
    continue;
  }

  // Add a single child and restart the loop
  for (Node childNode : node.getChildren()) {
    if (!getDispatchedList().contains(childNode)) {
      opStack.push(childNode);
      break;
    }
  }
} // end while

}

先将当前节点放到待处理的栈opStack中，然后从opStack取节点出来，如果取出来的节点没有Children，或者Children已经全部处理完毕，才对当前节点进行处理（dispatch），如果当前节点有Children且还没有处理完，则将当前节点的Children放到栈顶，然后重新从栈中取节点进行处理。这是很基础的深度优先遍历的实现。

那在遍历的过程中，如何针对不同的节点进行不同的处理呢？

在遍历之前，先预置一些针对不同的节点不同规则的处理器，然后在遍历过程中，通过分发器Dispatcher选择最合适的处理器进行处理。

生成ExprNodeDesc的遍历中一共先预置了8个规则Rule，每个规则对应一个处理器NodeProcessor：

Map<Rule, NodeProcessor> opRules = new LinkedHashMap<Rule, NodeProcessor>();

opRules.put(new RuleRegExp("R1", HiveParser.TOK_NULL + "%"),
    tf.getNullExprProcessor());
opRules.put(new RuleRegExp("R2", HiveParser.Number + "%|" +
    HiveParser.TinyintLiteral + "%|" +
    HiveParser.SmallintLiteral + "%|" +
    HiveParser.BigintLiteral + "%|" +
    HiveParser.DecimalLiteral + "%"),
    tf.getNumExprProcessor());
opRules
    .put(new RuleRegExp("R3", HiveParser.Identifier + "%|"
    + HiveParser.StringLiteral + "%|" + HiveParser.TOK_CHARSETLITERAL + "%|"
    + HiveParser.TOK_STRINGLITERALSEQUENCE + "%|"
    + "%|" + HiveParser.KW_IF + "%|" + HiveParser.KW_CASE + "%|"
    + HiveParser.KW_WHEN + "%|" + HiveParser.KW_IN + "%|"
    + HiveParser.KW_ARRAY + "%|" + HiveParser.KW_MAP + "%|"
    + HiveParser.KW_STRUCT + "%|" + HiveParser.KW_EXISTS + "%|"
    + HiveParser.KW_GROUPING + "%|"
    + HiveParser.TOK_SUBQUERY_OP_NOTIN + "%"),
    tf.getStrExprProcessor());
opRules.put(new RuleRegExp("R4", HiveParser.KW_TRUE + "%|"
    + HiveParser.KW_FALSE + "%"), tf.getBoolExprProcessor());
opRules.put(new RuleRegExp("R5", HiveParser.TOK_DATELITERAL + "%|"
    + HiveParser.TOK_TIMESTAMPLITERAL + "%"), tf.getDateTimeExprProcessor());
opRules.put(new RuleRegExp("R6",
    HiveParser.TOK_INTERVAL_YEAR_MONTH_LITERAL + "%|"
    + HiveParser.TOK_INTERVAL_DAY_TIME_LITERAL + "%|"
    + HiveParser.TOK_INTERVAL_YEAR_LITERAL + "%|"
    + HiveParser.TOK_INTERVAL_MONTH_LITERAL + "%|"
    + HiveParser.TOK_INTERVAL_DAY_LITERAL + "%|"
    + HiveParser.TOK_INTERVAL_HOUR_LITERAL + "%|"
    + HiveParser.TOK_INTERVAL_MINUTE_LITERAL + "%|"
    + HiveParser.TOK_INTERVAL_SECOND_LITERAL + "%"), tf.getIntervalExprProcessor());
opRules.put(new RuleRegExp("R7", HiveParser.TOK_TABLE_OR_COL + "%"),
    tf.getColumnExprProcessor());
opRules.put(new RuleRegExp("R8", HiveParser.TOK_SUBQUERY_OP + "%"),
    tf.getSubQueryExprProcessor());

这里使用的分发器Dispatcher是DefaultRuleDispatcher，DefaultRuleDispatcher选择处理器的逻辑如下：

// find the firing rule
// find the rule from the stack specified
Rule rule = null;
int minCost = Integer.MAX_VALUE;
for (Rule r : procRules.keySet()) {
  int cost = r.cost(ndStack);
  if ((cost >= 0) && (cost <= minCost)) {
    minCost = cost;
    rule = r;
  }
}

NodeProcessor proc;

if (rule == null) {
  proc = defaultProc;
} else {
  proc = procRules.get(rule);
}

// Do nothing in case proc is null
if (proc != null) {
  // Call the process function
  return proc.process(nd, ndStack, procCtx, nodeOutputs);
} else {
  return null;
}

遍历所有的规则Rule，调用每个规则的cost方法计算cost，找其中cost最小的规则对应的处理器，如果没有找到，则使用默认处理器，如果没有设置默认处理器，则不做任何事情。

那么每个规则的cost是如何计算的？

– 没太看懂==|| (后续再理理)

WHERE条件语法树每个节点对应的处理器如下：

TOK_WHERE

AND --> TypeCheckProcFactory.DefaultExprProcessor

= --> TypeCheckProcFactory.DefaultExprProcessor

TOK_TABLE_OR_COL --> TypeCheckProcFactory.ColumnExprProcessor

c1 --> TypeCheckProcFactory.StrExprProcessor

100 --> TypeCheckProcFactory.NumExprProcessor

> --> TypeCheckProcFactory.DefaultExprProcessor

TOK_FUNCTION --> TypeCheckProcFactory.DefaultExprProcessor

ABS --> TypeCheckProcFactory.StrExprProcessor

TOK_TABLE_OR_COL --> TypeCheckProcFactory.ColumnExprProcessor

c2 --> TypeCheckProcFactory.StrExprProcessor

0 --> TypeCheckProcFactory.NumExprProcessor

TypeCheckProcFactory.StrExprProcessor 生成ExprNodeConstantDesc

TypeCheckProcFactory.ColumnExprProcessor 处理column，生成ExprNodeColumnDesc

TypeCheckProcFactory.NumExprProcessor生成ExprNodeConstantDesc

TypeCheckProcFactory.DefaultExprProcessor生成ExprNodeGenericFuncDesc

在深度优先遍历完WHERE语法树后，每个节点都会生成一个ExprNodeDesc，但是其实除了最顶层的AND节点生成的ExprNodeDesc有用，其他的节点生成的都是中间结果，最终都会包含在AND节点生成的ExprNodeDesc中。所以在遍历WHERE树后，通过AND节点生成的ExprNodeDesc构造FilterDesc：

new FilterDesc(genExprNodeDesc(condn, inputRR, useCaching), false)

有了FilterDesc后，就能够构造出FilterOperator了，然后再将生成的FilterOperator加入到Operator树中：

Operator ret = get((Class) conf.getClass());

ret.setConf(conf);

至此，where过滤条件对应的FilterOperator构造完毕。

接下来仔细看下AND生成的ExprNodeDesc，它其实是一个ExprNodeGenericFuncDesc：

// genericUDF是GenericUDFOPAnd，就是对应AND操作符

private GenericUDF genericUDF;

// AND是一个二元操作符，children里存的是对应的操作符

// 根据WHERE语法树，可以知道children[0]肯定又是一个ExprNodeGenericFuncDesc，而且是一个=函 // 数，而children[1]也是一个肯定又是一个ExprNodeGenericFuncDesc，而且是一个>函数，以此类 // 推，每个ExprNodeGenericFuncDesc都有对应的children

private List chidren;

// UDF的名字，这里是and

private transient String funcText;

/**

This class uses a writableObjectInspector rather than a TypeInfo to store
the canonical type information for this NodeDesc.

*/

private transient ObjectInspector writableObjectInspector;

每个ExprNodeDesc都对应有一个ExprNodeEvaluator，来对每个ExprNodeDesc进行实际的计算。看下ExprNodeEvaluator类的基本方法：

public abstract class ExprNodeEvaluator {

// 对应的ExprNodeDesc

protected final T expr;

// 在经过这个Evaluator计算后，它的输出值该如何解析的ObjectInspector

protected ObjectInspector outputOI;

…

/**

Initialize should be called once and only once. Return the ObjectInspector
for the return value, given the rowInspector.
初始化方法，传入一个ObjectInspector，即传入的数据应该如何解析的ObjectInspector
而需要返回经过这个Evaluator计算后的输出值的解析ObjectInspector

*/

public abstract ObjectInspector initialize(ObjectInspector rowInspector) throws HiveException;

// evaluate方法，调用来对row数据进行解析

public Object evaluate(Object row) throws HiveException {

return evaluate(row, -1);

}

/**

Evaluate the expression given the row. This method should use the
rowInspector passed in from initialize to inspect the row object. The
return value will be inspected by the return value of initialize.
If this evaluator is referenced by others, store it for them

*/

protected Object evaluate(Object row, int version) throws HiveException {

if (version < 0 || version != this.version) {

this.version = version;

return evaluation = _evaluate(row, version);

}

return evaluation;

}

// 由各个子类实现的方法的_evaluate方法，结合上面的evaluate方法，这里实际使用了设计模式的模板 // 方法模式

protected abstract Object _evaluate(Object row, int version) throws HiveException;

…

}

通过ExprNodeEvaluatorFactory获取到每个ExprNodeDesc对应的ExprNodeEvaluator：

public static ExprNodeEvaluator get(ExprNodeDesc desc) throws HiveException {

// Constant node

if (desc instanceof ExprNodeConstantDesc) {

return new ExprNodeConstantEvaluator((ExprNodeConstantDesc) desc);

}

// Column-reference node, e.g. a column in the input row
if (desc instanceof ExprNodeColumnDesc) {
  return new ExprNodeColumnEvaluator((ExprNodeColumnDesc) desc);
}
// Generic Function node, e.g. CASE, an operator or a UDF node
if (desc instanceof ExprNodeGenericFuncDesc) {
  return new ExprNodeGenericFuncEvaluator((ExprNodeGenericFuncDesc) desc);
}
// Field node, e.g. get a.myfield1 from a
if (desc instanceof ExprNodeFieldDesc) {
  return new ExprNodeFieldEvaluator((ExprNodeFieldDesc) desc);
}
throw new RuntimeException(
    "Cannot find ExprNodeEvaluator for the exprNodeDesc = " + desc);

}

看下FilterOperator中如何使用ExprNodeEvaluator对数据进行过滤的。

首先在FilterOperator的initializeOp方法中，获取到ExprNodeEvaluator：

conditionEvaluator = ExprNodeEvaluatorFactory.get(conf.getPredicate());

然后在process方法中，调用initialize方法后，调用eveluate方法获取到整个where过滤的结果：

conditionInspector = (PrimitiveObjectInspector) conditionEvaluator

.initialize(rowInspector);

…

Object condition = conditionEvaluator.evaluate(row);

…

Boolean ret = (Boolean) conditionInspector

.getPrimitiveJavaObject(condition);

// 如果结果是true，则forward到下一个operator继续处理

if (Boolean.TRUE.equals(ret)) {

forward(row, rowInspector);

}

再来看下GenericUDFOPAnd的evaluate方法实现：

Object a1 = arguments[1].get();
if (a1 != null) {
  bool_a1 = boi1.get(a1);
  if (bool_a1 == false) {
    result.set(false);
    return result;
  }
}

if ((a0 != null && bool_a0 == true) && (a1 != null && bool_a1 == true)) {
  result.set(true);
  return result;
}

return null;

Hive谓词解析过程分析

继续阅读

nginx location中斜线的位置的重要性

27 Best Free Eclipse Plug-ins for Java Developer to be ProductiveCode Quality PluginsText Editor PluginsDependency ManagementVersion Control Integration PluginsFramework Development Continuous Integration Related PluginsOther Utility Plugins

Java String.format方法的简单使用

neo4j之cypher使用文档

Ambari介绍和架构原理

GitHub连夜封杀！这份阿里 10W 字内部 Java 字面试手册到底有多强？

spark/scala关于【资源文件】加载方法概述外部文件加载方案测试资源文件打包入jar包中小结

NOSQL安全攻击

mybatis_入门程序Mybatis入门

AOP编程_Android优雅权限框架(1)概念基础，2021金三银四前言正文大纲正文

Effective Java 8:通用程序设计

OOM三种类型

工厂模式-三种类型

【递归】高效率求2的n次幂

win10本地scala和spark安装安装scala安装spark

scala (3) Function 和 Method