Databend divides functions by implementation into two kinds: scalar functions and aggregate functions.

Scalar functions: return a single value based on the input values. Common scalar functions include now and round.

Aggregate functions: operate on the values of a column and return a single value. Common aggregate functions include sum, count, and avg.

github.com/datafuselab…

This series has two parts. This article describes how a scalar function works in Databend, from registration to execution.

Function registration

Function registration is handled by FunctionRegistry.

#[derive(Default)]
pub struct FunctionRegistry {
    pub funcs: HashMap<&'static str, Vec<Arc<Function>>>,
    #[allow(clippy::type_complexity)]
    pub factories: HashMap<
        &'static str,
        Vec<Box<dyn Fn(&[usize], &[DataType]) -> Option<Arc<Function>> + 'static>>,
    >,
    pub aliases: HashMap<&'static str, &'static str>,
}

All three fields are HashMaps.

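Both funcs and factories store registered functions. The difference is that funcs holds functions with a fixed number of arguments (currently from 0 to 5), registered through register_0_arg, register_1_arg, and so on, while factories holds functions with a variable number of arguments (such as concat), registered through register_function_factory.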

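Because a function may have several aliases (for example, minus has the aliases subtract and neg), the registry also has aliases. Each key is an alias, and each value is the name of an existing function. Aliases are registered through register_aliases.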
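To make the roles of the three maps concrete, here is a minimal sketch of a registry with the same shape. It is not Databend's code: `SimpleFunction`, `SimpleRegistry`, `build_registry`, the `sum_all` function, and the lookup method `get` are hypothetical names, values are restricted to `i64`, and the factory receives only the argument count, whereas the real factory signature shown above takes `&[usize]` and `&[DataType]`.

```rust
use std::collections::HashMap;
use std::sync::Arc;

/// Hypothetical, much-reduced stand-in for Databend's `Function`.
pub struct SimpleFunction {
    pub name: &'static str,
    pub arity: usize,
    pub eval: Box<dyn Fn(&[i64]) -> i64>,
}

/// A factory builds a function for the requested argument count, or declines.
type Factory = Box<dyn Fn(usize) -> Option<Arc<SimpleFunction>>>;

#[derive(Default)]
pub struct SimpleRegistry {
    funcs: HashMap<&'static str, Vec<Arc<SimpleFunction>>>,
    factories: HashMap<&'static str, Vec<Factory>>,
    aliases: HashMap<&'static str, &'static str>,
}

impl SimpleRegistry {
    /// Fixed arity: the function object is built once, at registration time.
    pub fn register_2_arg(&mut self, name: &'static str, f: fn(i64, i64) -> i64) {
        let func = SimpleFunction {
            name,
            arity: 2,
            eval: Box::new(move |args: &[i64]| f(args[0], args[1])),
        };
        self.funcs.entry(name).or_default().push(Arc::new(func));
    }

    /// Variable arity: the function object is built on lookup.
    pub fn register_function_factory(&mut self, name: &'static str, factory: Factory) {
        self.factories.entry(name).or_default().push(factory);
    }

    /// Each alias maps to the name of an already registered function.
    pub fn register_aliases(&mut self, name: &'static str, aliases: &[&'static str]) {
        for alias in aliases {
            self.aliases.insert(*alias, name);
        }
    }

    /// Resolve aliases first, then try fixed-arity functions, then factories.
    pub fn get(&self, name: &str, arity: usize) -> Option<Arc<SimpleFunction>> {
        let name = self.aliases.get(name).copied().unwrap_or(name);
        let fixed = self
            .funcs
            .get(name)
            .and_then(|candidates| candidates.iter().find(|f| f.arity == arity));
        if let Some(found) = fixed {
            return Some(Arc::clone(found));
        }
        self.factories.get(name)?.iter().find_map(|factory| factory(arity))
    }
}

/// Registers `minus` (alias `subtract`) and a variadic `sum_all`.
pub fn build_registry() -> SimpleRegistry {
    let mut registry = SimpleRegistry::default();
    registry.register_2_arg("minus", |a, b| a - b);
    registry.register_aliases("minus", &["subtract"]);
    registry.register_function_factory(
        "sum_all",
        Box::new(|arity: usize| {
            if arity == 0 {
                return None;
            }
            Some(Arc::new(SimpleFunction {
                name: "sum_all",
                arity,
                eval: Box::new(|args: &[i64]| args.iter().sum::<i64>()),
            }))
        }),
    );
    registry
}
```

In this sketch, looking up `subtract` with two arguments resolves to the function registered as `minus`, while `sum_all` is built on demand for whatever argument count the caller asks for.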

In addition, we provide register APIs at different levels to suit different needs.

Function structure

The values in funcs are the functions themselves, so let's look at how a Function is built in Databend.

pub struct Function {
    pub signature: FunctionSignature,
    #[allow(clippy::type_complexity)]
    pub calc_domain: Box<dyn Fn(&[Domain]) -> Option<Domain>>,
    #[allow(clippy::type_complexity)]
    pub eval: Box<dyn Fn(&[ValueRef<AnyType>], FunctionContext) -> Result<Value<AnyType>, String>>,
}


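Here, signature contains the function name, the argument types, the return type, and the function property (no function uses properties yet; the field is reserved). Note in particular that the function name must be lowercase when registering. Some tokens are converted by src/query/ast/src/parser/token.rs.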

#[allow(non_camel_case_types)]
#[derive(Logos, Clone, Copy, Debug, PartialEq, Eq, Hash)]
pub enum TokenKind {
    // ... (other tokens elided)
    #[token("+")]
    Plus,
    // ... (other tokens elided)
}


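Take the addition function behind `select 1+2` as an example: `+` is converted to Plus, and because function names must be lowercase, we register it under the name `plus`.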

with_number_mapped_type!(|NUM_TYPE| match left {
    NumberDataType::NUM_TYPE => {
        registry.register_1_arg::<NumberType<NUM_TYPE>, NumberType<NUM_TYPE>, _, _>(
            "plus",
            FunctionProperty::default(),
            |lhs| Some(lhs.clone()),
            |a, _| a,
        );
    }
});
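The naming rule can be illustrated with a small sketch. It assumes, purely for illustration, that the function name is obtained by lowercasing the token variant's name; the article only states that `+` becomes Plus and that the registered name is `plus`. `operator_token` and `function_name` are hypothetical helpers, not Databend APIs, and the two-variant `TokenKind` here is a reduced copy of the real enum.

```rust
/// Reduced, hypothetical version of the parser's token enum.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum TokenKind {
    Plus,
    Minus,
}

/// Map an operator's source text to its token, if it is one we know.
pub fn operator_token(op: &str) -> Option<TokenKind> {
    match op {
        "+" => Some(TokenKind::Plus),
        "-" => Some(TokenKind::Minus),
        _ => None,
    }
}

/// Assumption: the registry key is the lowercased variant name.
pub fn function_name(token: TokenKind) -> String {
    format!("{:?}", token).to_lowercase()
}
```

Under this assumption, a function registered as `Plus` or `PLUS` would never be found, because the lookup key for `+` is always `plus`.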


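calc_domain computes the domain of the output, that is, the set of values the output can take, from the domains of the inputs. In mathematical terms, for `y = f(x)` the domain is the set of x values that can be passed to f to produce y. This lets us easily filter out values that fall outside the domain when indexing data, which can greatly improve response time.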
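As a rough illustration of how a domain can be used for filtering, the sketch below models an integer domain as a closed range and derives the output domain of an addition from its input domains. `IntDomain`, `plus_domain`, and `may_contain` are hypothetical names chosen for this example, and returning `None` is assumed to mean "no domain information", in line with the `Option<Domain>` return type of calc_domain.

```rust
/// Hypothetical integer domain: every value lies in the closed range [min, max].
#[derive(Debug, Clone, Copy, PartialEq)]
pub struct IntDomain {
    pub min: i64,
    pub max: i64,
}

/// Output domain of `lhs + rhs`. Returns None (unknown) if a bound overflows.
pub fn plus_domain(lhs: &IntDomain, rhs: &IntDomain) -> Option<IntDomain> {
    Some(IntDomain {
        min: lhs.min.checked_add(rhs.min)?,
        max: lhs.max.checked_add(rhs.max)?,
    })
}

/// Can a block whose expression has this domain hold a row equal to `target`?
pub fn may_contain(domain: Option<IntDomain>, target: i64) -> bool {
    match domain {
        Some(d) => d.min <= target && target <= d.max,
        // Without domain information the block cannot be skipped.
        None => true,
    }
}
```

For instance, if one input lies in [1, 10] and the other in [100, 200], the sum lies in [101, 210], so a filter such as `a + b = 50` can skip the block without reading any rows.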

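eval is the concrete implementation of the function. In essence, it takes some strings or numbers, parses them into an expression, and converts them into another set of values.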

Example

The functions currently implemented in function-v2 fall into these categories: arithmetic, array, boolean, control, comparison, datetime, math, string, string_mult_args, variant

Take the implementation of length as an example:

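length takes one String value as its argument and returns a Number. Its name is length, and its domain is left unrestricted (every string has a length). The last argument is a closure that serves as the eval implementation of length.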

registry.register_1_arg::<StringType, NumberType<u64>, _, _>(
    "length",
    FunctionProperty::default(),
    |_| None,
    |val, _| val.len() as u64,
);


Looking at the implementation of register_1_arg, we see that it calls register_passthrough_nullable_1_arg, whose name contains "nullable", and that eval is wrapped by vectorize_1_arg.

Note: do not manually edit [src/query/expression/src/register.rs](github.com/datafuselab…), the file that contains register_1_arg, because it is generated by [src/query/codegen/src/writes/register.rs](github.com/datafuselab…).

pub fn register_1_arg<I1: ArgType, O: ArgType, F, G>(
    &mut self,
    name: &'static str,
    property: FunctionProperty,
    calc_domain: F,
    func: G,
) where
    F: Fn(&I1::Domain) -> Option<O::Domain> + 'static + Clone + Copy,
    G: Fn(I1::ScalarRef<'_>, FunctionContext) -> O::Scalar + 'static + Clone + Copy,
{
    self.register_passthrough_nullable_1_arg::<I1, O, _, _>(
        name,
        property,
        calc_domain,
        vectorize_1_arg(func),
    )
}


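This is because, in real workloads, eval receives not only strings or numbers but possibly null or values of various other types, and null is clearly the most special case. The argument we receive may also be either a column or a single value. For example: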

select length(null);
+--------------+
| length(null) |
+--------------+
|         NULL |
+--------------+
select length(id) from t;
+------------+
| length(id) |
+------------+
|          2 |
|          3 |
+------------+


Based on this, if a function needs no special handling for null values, use register_x_arg directly. If it does need special handling for null, refer to [try_to_timestamp](github.com/datafuselab…).
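The two ideas at work here, lifting a per-row function to columns and passing NULL through untouched, can be sketched as follows. This is a simplified model rather than Databend's implementation: the `Value` enum is a hypothetical stand-in that stores a column as a `Vec`, nullability is modeled with `Option`, and `passthrough_nullable` is a hypothetical helper that only mirrors the intent of register_passthrough_nullable_1_arg.

```rust
/// Hypothetical simplified argument: one scalar or a whole column.
#[derive(Debug, Clone, PartialEq)]
pub enum Value<T> {
    Scalar(T),
    Column(Vec<T>),
}

/// Lift a per-row function so that it accepts a scalar or a column.
pub fn vectorize_1_arg<I, O>(func: impl Fn(&I) -> O) -> impl Fn(&Value<I>) -> Value<O> {
    move |arg: &Value<I>| match arg {
        Value::Scalar(v) => Value::Scalar(func(v)),
        Value::Column(col) => Value::Column(col.iter().map(|v| func(v)).collect()),
    }
}

/// NULL in, NULL out: the per-row function only ever sees non-null inputs.
pub fn passthrough_nullable<I, O>(
    func: impl Fn(&I) -> O,
) -> impl Fn(&Value<Option<I>>) -> Value<Option<O>> {
    vectorize_1_arg(move |v: &Option<I>| v.as_ref().map(|x| func(x)))
}

/// `length` written without any null handling of its own.
pub fn length(arg: &Value<Option<String>>) -> Value<Option<u64>> {
    passthrough_nullable(|s: &String| s.len() as u64)(arg)
}
```

With this model, `length` applied to a null scalar yields null, and applied to a column of strings it yields one length per row, which matches the two queries shown above.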

Functions that need specialized vectorization should instead call register_passthrough_nullable_x_arg and apply vectorization optimizations specific to the function being implemented.

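Take the comparison function regexp as an example: it takes two String values and returns a Bool. During vectorized execution, a HashMap is introduced to avoid parsing the same regular expression repeatedly, which is why `vectorize_regexp` is implemented separately.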

registry.register_passthrough_nullable_2_arg::<StringType, StringType, BooleanType, _, _>(
    "regexp",
    FunctionProperty::default(),
    |_, _| None,
    vectorize_regexp(|str, pat, map, _| {
        let pattern = if let Some(pattern) = map.get(pat) {
            pattern
        } else {
            let re = regexp::build_regexp_from_pattern("regexp", pat, None)?;
            map.insert(pat.to_vec(), re);
            map.get(pat).unwrap()
        };
        Ok(pattern.is_match(str))
    }),
);
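The caching idea in the closure above can be shown in isolation. The sketch below is an assumption-laden simplification: `Compiled` is a hypothetical stand-in for a compiled regular expression that only does substring matching (so the example needs no external crate), and `vectorize_match` is a hypothetical name. It returns the number of compilations alongside the results so the effect of the cache is visible.

```rust
use std::collections::HashMap;

/// Hypothetical stand-in for a compiled pattern; it only does substring search.
pub struct Compiled {
    needle: String,
}

impl Compiled {
    pub fn is_match(&self, text: &str) -> bool {
        text.contains(&self.needle)
    }
}

/// Match each text against the pattern in the same row, compiling every
/// distinct pattern once. Returns the per-row results and the number of
/// compilations that were performed.
pub fn vectorize_match(texts: &[&str], patterns: &[&str]) -> (Vec<bool>, usize) {
    let mut cache: HashMap<String, Compiled> = HashMap::new();
    let mut compilations = 0;
    let mut results = Vec::with_capacity(texts.len());
    for (text, pat) in texts.iter().zip(patterns) {
        // Compile only on a cache miss, as in the get-or-insert logic above.
        if !cache.contains_key(*pat) {
            compilations += 1;
            cache.insert(pat.to_string(), Compiled { needle: pat.to_string() });
        }
        results.push(cache[*pat].is_match(text));
    }
    (results, compilations)
}
```

When the pattern column contains many repeated values, which is the common case for a constant pattern, the number of compilations depends on the number of distinct patterns rather than on the number of rows.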



Function testing

Unit Test

Unit tests for functions are in the [scalars](github.com/datafuselab…) directory.

Logic Test

Logic tests for functions are in the [02_function](github.com/datafuselab…) directory.

About Databend

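Databend is an open-source, elastic, low-cost modern data warehouse that is built on object storage and also supports real-time analytics. We look forward to your interest as we explore cloud-native data warehouse solutions together and build a new generation of open-source Data Cloud.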

Adding Scalar Functions to Databend | Function Development Series, Part 1
