如何在 Apache Flink 1.10 中使用 Python UDF?

鏡像下載下傳、域名解析、時間同步請點選

一、安裝 PyFlink

我們需要先安裝 PyFlink，可以通過 PyPI 獲得，并且可以使用 pip install 進行便捷安裝。

注意: 安裝和運作 PyFlink 需要 Python 3.5 或更高版本。

$ python -m pip install apache-Apache Flink

二、定義一個 UDF

除了擴充基類 ScalarFunction 之外，定義 Python UDF 的方法有很多。下面的示例顯示了定義 Python UDF 的不同方法，該函數以 BIGINT 類型的兩列作為輸入參數，并傳回它們的和作為結果。

Option 1: extending the base class ScalarFunction

class Add(ScalarFunction):
  def eval(self, i, j):
    return i + j
add = udf(Add(), [DataTypes.BIGINT(), DataTypes.BIGINT()], DataTypes.BIGINT())

Option 2: Python function

@udf(input_types=[DataTypes.BIGINT(), DataTypes.BIGINT()], result_type=DataTypes.BIGINT())
def add(i, j):
  return i + j

Option 3: lambda function

add = udf(lambda i, j: i + j, [DataTypes.BIGINT(), DataTypes.BIGINT()], DataTypes.BIGINT())

Option 4: callable function

class CallableAdd(object):
  def __call__(self, i, j):
    return i + j
add = udf(CallableAdd(), [DataTypes.BIGINT(), DataTypes.BIGINT()], DataTypes.BIGINT())

Option 5: partial function

return i + j + k
add = udf(functools.partial(partial_add, k=1), [DataTypes.BIGINT(), DataTypes.BIGINT()],
          DataTypes.BIGINT())

三、注冊一個UDF

table_env.register_function("add", add)

Invoke a Python UDF

my_table.select(```js
"add(a, b)")

Example Code

下面是一個使用 Python UDF 的完整示例。

from PyFlink.table import StreamTableEnvironment, DataTypes
from PyFlink.table.descriptors import Schema, OldCsv, FileSystem
from PyFlink.table.udf import udf
env = StreamExecutionEnvironment.get_execution_environment()
env.set_parallelism(1)
t_env = StreamTableEnvironment.create(env)
t_env.register_function("add", udf(lambda i, j: i + j, [DataTypes.BIGINT(), DataTypes.BIGINT()], DataTypes.BIGINT()))
t_env.connect(FileSystem().path('/tmp/input')) \
    .with_format(OldCsv()
                 .field('a', DataTypes.BIGINT())
                 .field('b', DataTypes.BIGINT())) \
    .with_schema(Schema()
                 .field('a', DataTypes.BIGINT())
                 .field('b', DataTypes.BIGINT())) \
    .create_temporary_table('mySource')
t_env.connect(FileSystem().path('/tmp/output')) \
    .with_format(OldCsv()
                 .field('sum', DataTypes.BIGINT())) \
    .with_schema(Schema()
                 .field('sum', DataTypes.BIGINT())) \
    .create_temporary_table('mySink')
t_env.from_path('mySource')\
    .select("add(a, b)") \
    .insert_into('mySink')
t_env.execute("tutorial_job")

送出作業

首先，您需要在“ / tmp / input”檔案中準備輸入資料。例如，

$ echo "1,2" > /tmp/input

接下來，您可以在指令行上運作此示例：

$ python python_udf_sum.py

通過該指令可在本地小叢集中建構并運作 Python Table API 程式。您還可以使用不同的指令行将 Python Table API 程式送出到遠端叢集。

最後，您可以在指令行上檢視執行結果：

$ cat /tmp/output
3

四、Python UDF 的依賴管理

在許多情況下，您可能希望在 Python UDF 中導入第三方依賴。下面的示例将指導您如何管理依賴項。

假設您想使用 mpmath 來執行上述示例中兩數的和。Python UDF 邏輯可能如下：

@udf(input_types=[DataTypes.BIGINT(), DataTypes.BIGINT()], result_type=DataTypes.BIGINT())
def add(i, j):
    from mpmath import fadd # add third-party dependency
    return int(fadd(1, 2))

要使其在不包含依賴項的工作節點上運作，可以使用以下 API 指定依賴項：

# echo mpmath==1.1.0 > requirements.txt
# pip download -d cached_dir -r requirements.txt --no-binary :all:
t_env.set_python_requirements("/path/of/requirements.txt", "/path/of/cached_dir")

使用者需要提供一個 requirements.txt 檔案，并且在裡面申明使用的第三方依賴。如果無法在群集中安裝依賴項（網絡問題），則可以使用參數“requirements_cached_dir”，指定包含這些依賴項的安裝包的目錄，如上面的示例所示。依賴項将上傳到群集并脫機安裝。

下面是一個使用依賴管理的完整示例：

from PyFlink.datastream import StreamExecutionEnvironment
from PyFlink.table import StreamTableEnvironment, DataTypes
from PyFlink.table.descriptors import Schema, OldCsv, FileSystem
from PyFlink.table.udf import udf
env = StreamExecutionEnvironment.get_execution_environment()
env.set_parallelism(1)
t_env = StreamTableEnvironment.create(env)
@udf(input_types=[DataTypes.BIGINT(), DataTypes.BIGINT()], result_type=DataTypes.BIGINT())
def add(i, j):
    from mpmath import fadd
    return int(fadd(1, 2))
t_env.set_python_requirements("/tmp/requirements.txt", "/tmp/cached_dir")
t_env.register_function("add", add)
t_env.connect(FileSystem().path('/tmp/input')) \
    .with_format(OldCsv()
                 .field('a', DataTypes.BIGINT())
                 .field('b', DataTypes.BIGINT())) \
    .with_schema(Schema()
                 .field('a', DataTypes.BIGINT())
                 .field('b', DataTypes.BIGINT())) \
    .create_temporary_table('mySource')
t_env.connect(FileSystem().path('/tmp/output')) \
    .with_format(OldCsv()
                 .field('sum', DataTypes.BIGINT())) \
    .with_schema(Schema()
                 .field('sum', DataTypes.BIGINT())) \
    .create_temporary_table('mySink')
t_env.from_path('mySource')\
    .select("add(a, b)") \
    .insert_into('mySink')
t_env.execute("tutorial_job")

首先，您需要在“/ tmp / input”檔案中準備輸入資料。例如，

echo "1,2" > /tmp/input
1
2

其次，您可以準備依賴項需求檔案和緩存目錄：

$ echo "mpmath==1.1.0" > /tmp/requirements.txt
$ pip download -d /tmp/cached_dir -r /tmp/requirements.txt --no-binary :all:

$ python python_udf_sum.py

$ cat /tmp/output
3

五、快速上手

PyFlink 為大家提供了一種非常友善的開發體驗方式 - PyFlink Shell。當成功執行 python -m pip install apache-flink 之後，你可以直接以 pyflink-shell.sh local 來啟動一個 PyFlink Shell 進行開發體驗，如下所示：

六、更多場景

不僅僅是簡單的 ETL 場景支援，PyFlink 可以完成很多複雜場的業務場景需求，比如我們最熟悉的雙 11 大屏的場景，如下：

說道系列-如何在PyFlink-1-10中自定義Python-UDF/

“ 提供全面，高效和穩定的鏡像下載下傳服務。釘釘搜尋 ' 21746399 ‘ 加入鏡像站官方使用者交流群。”

如何在 Apache Flink 1.10 中使用 Python UDF?

一、安裝 PyFlink

二、定義一個 UDF

三、注冊一個UDF

四、Python UDF 的依賴管理

五、快速上手

六、更多場景

繼續閱讀

Apache2.4.x 配置檔案詳解Apache配置需要了解如下：開始講解：

配置apache支援PHP（win7）

tab滑鼠經過菜單切換

ACS基本配置-權限等級管理

vue （vue2.0）使用總結(從大體結構總結)

vue搭建過程及出現問題

/\B(?=(?:\d{3})+$)/g 一條令人費解的正規表達式

适用于JavaScript的ECMAScript 2020規範向前發展

Cloud Studio初體驗

使用 ctypes 進行 Python 和 C 的混合程式設計

【python】【資料處理】畫多元資料分布圖

JS生成uuid的四種方法

【python】netconf協定對接管理裝置

「Python 網絡自動化」NETCONF —— Python 使用 NETCONF 管理配置 H3C 網絡裝置

layui多任務上傳添加進度條

在python中建立excel并寫入