1.需求說明
這是一個關于POI的應用,資料從水經微圖下載下傳而來,需要處理的是
街道
層級的資料,但是最終的POI資訊要有
省、市、縣
資料,所有需要用到行政區劃表來補全資料。
2.程式設計執行個體
2.1 實作樹結構
首先看一下具有樹結構的資料:
通過
WITH RECURSIVE table_name AS
實作遞歸查詢樹結構資料【這裡要特别注意一下 t0 和 t1 表】:
WITH RECURSIVE t1 AS (
SELECT "level", parent_code, area_code, "name"
FROM data_divisions
WHERE "name" = '楓楊街道'
UNION ALL
SELECT t0."level", t0.parent_code, t0.area_code, t0."name"
FROM data_divisions t0, t1
WHERE t0.area_code = t1.parent_code
)
SELECT "level", "name" FROM t1
結果驗證:
2.2 自定義函數
使用
STRING_AGG
把省市縣資料拼接成一個字段【函數等價于
GROUP_CONCAT
】:
SELECT
STRING_AGG ( "name", ',' ORDER BY "level" ) AS "divisions" FROM
( WITH RECURSIVE t1 AS (
SELECT "level", parent_code, area_code, "name"
FROM data_divisions
WHERE "name" = '楓楊街道'
UNION ALL
SELECT t0."level", t0.parent_code, t0.area_code, t0."name"
FROM data_divisions t0, t1
WHERE t0.area_code = t1.parent_code
)
SELECT "level", "name" FROM t1 ) t2
建立自定義函數:
CREATE OR REPLACE FUNCTION getdivisionsbyname ( TEXT ) RETURNS TEXT AS $BODY$
SELECT
STRING_AGG ( "name", ',' ORDER BY "level" ) AS "divisions" FROM
( WITH RECURSIVE t1 AS (
SELECT "level", parent_code, area_code, "name"
FROM data_divisions
WHERE "name" = '楓楊街道'
UNION ALL
SELECT t0."level", t0.parent_code, t0.area_code, t0."name"
FROM data_divisions t0, t1
WHERE t0.area_code = t1.parent_code
)
SELECT "level", "name" FROM t1 ) t2;
$BODY$ LANGUAGE SQL IMMUTABLE STRICT COST 100;
函數調用測試:
2.3 函數使用
data_address_point 表的記錄數是
261
條,執行耗時
119.451s
,這效率明顯是由于多次調用自定義函數導緻的 😢
SELECT
getdivisionsbyname(zone_name) || NAME AS "poi",
SPLIT_PART( coordinates, ',', 1 ) AS "longitude",
SPLIT_PART( coordinates, ',', 2 ) AS "latitude",
NAME AS "address",
SPLIT_PART( getdivisionsbyname(zone_name), ',', 1 ) AS "prov",
SPLIT_PART( getdivisionsbyname(zone_name), ',', 2 ) AS "city",
SPLIT_PART( getdivisionsbyname(zone_name), ',', 3 ) AS "district",
SPLIT_PART( getdivisionsbyname(zone_name), ',', 4 ) AS "town"
FROM data_address_point;
避免多次調用相同的自定義函數,優化後耗時
23.634s
,是之前的5分之1:
WITH t1 AS ( SELECT getdivisionsbyname ( zone_name ) AS "divisions", coordinates, "name", poi_type FROM data_address_point )
SELECT
ROW_NUMBER ( ) OVER ( ORDER BY "name" ) AS "id",
REPLACE ( divisions, ',', '' ) || "name" AS "poi",
poi_type,
SPLIT_PART( coordinates, ',', 1 ) AS "longitude",
SPLIT_PART( coordinates, ',', 2 ) AS "latitude",
NAME AS "address",
SPLIT_PART( divisions, ',', 1 ) AS "prov",
SPLIT_PART( divisions, ',', 2 ) AS "city",
SPLIT_PART( divisions, ',', 3 ) AS "district",
SPLIT_PART( divisions, ',', 4 ) AS "town"
FROM
t1
3.報錯問題
實際上,上邊的函數使用并是非順利的,第一次進行查詢時報錯
function cannot execute on a QE slice because it accesses relation
WITH t1 AS ( SELECT getdivisionsbyname ( zone_name ) AS "divisions", coordinates, "name", poi_type FROM data_address_point )
SELECT
ROW_NUMBER ( ) OVER ( ORDER BY "name" ) AS "id",
REPLACE ( divisions, ',', '' ) || "name" AS "poi",
poi_type,
SPLIT_PART( coordinates, ',', 1 ) AS "longitude",
SPLIT_PART( coordinates, ',', 2 ) AS "latitude",
NAME AS "address",
SPLIT_PART( divisions, ',', 1 ) AS "prov",
SPLIT_PART( divisions, ',', 2 ) AS "city",
SPLIT_PART( divisions, ',', 3 ) AS "district",
SPLIT_PART( divisions, ',', 4 ) AS "town"
FROM
t1
> ERROR: function cannot execute on a QE slice because it accesses relation "public.data_divisions" (seg0 slice1 192.168.0.123:6000 pid=168995)
CONTEXT: SQL function "getdivisionsbyname" during startup
UDF(User Defined Function)使用者自定義函數在 segment 上不能通路任何表。由于 MPP 的特性,任何 segment 僅僅包含部分資料,因而在 segment 執行的 UDF 不能通路任何表,否則資料計算錯誤。Greenplum 支援另一種分布政策:複制表,即整張表在每個節點上都有一個完整的拷貝。可使用以下指令進行設定:
資料量大的表不适合使用複制表模式,一些不經常變動的資料量比較小的比如碼表可以使用
DISTRIBUTED REPLICATED
模式,查詢性能也會有明顯的提升。