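Compared with queries against a single-instance database, querying distributed data raises a number of technical challenges.

This article records implementation approaches for Join queries under MySQL sharding (split databases and split tables) and in Elasticsearch, as a way to understand how data processing is designed in distributed scenarios.

It starts with Join under sharding in MySQL, the commonly used relational database, then moves on to Join implementation strategies in the non-relational Elasticsearch, going progressively deeper into how Join is implemented.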

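① MySQL sharding Join query scenario

In a sharded setup, the questions are how query statements are dispatched and how the data is assembled. Compared with NoSQL databases, MySQL adapts to distributed scenarios relatively easily, within the bounds of the SQL standard.

The overall design is examined here through a solution based on the sharding-jdbc middleware.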

sharding-jdbc

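sharding-jdbc proxies the original datasource and implements the JDBC specification to handle dispatch and assembly across shards, transparently to the application layer.
Execution flow: SQL parsing => executor optimization => SQL routing => SQL rewriting => SQL execution => result merging (see io.shardingsphere.core.executor.ExecutorEngine#execute).
Parsing the Join statement determines which instance nodes the SQL is dispatched to. This corresponds to SQL routing.
SQL rewriting replaces the original (logical) table names with the actual shard table names.
In the most complex case, the maximum number of executions dispatched for a Join query = number of database instances × number of shards of table A × number of shards of table B.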
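The upper bound above can be illustrated with a small sketch. It is not sharding-jdbc code: the class `CartesianRoutingSketch`, its methods, and the `<table>_<n>` shard naming are assumptions made for illustration. It enumerates every data source and shard combination for a Join whose shards cannot be pinned down, and rewrites the logical table names for each combination.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Pattern;

class CartesianRoutingSketch {

    /** Replaces one logical table name with its actual shard name, e.g. order -> order_1. */
    static String rewriteTable(String sql, String logicTable, int shard) {
        // Word boundaries keep "order" from matching inside "order_item" or "order_id".
        return sql.replaceAll("\\b" + Pattern.quote(logicTable) + "\\b", logicTable + "_" + shard);
    }

    /** Enumerates data sources x shards of table A x shards of table B. */
    static List<String> route(String logicSql, List<String> dataSources,
                              String tableA, int shardsA, String tableB, int shardsB) {
        List<String> units = new ArrayList<>();
        for (String dataSource : dataSources) {
            for (int a = 0; a < shardsA; a++) {
                for (int b = 0; b < shardsB; b++) {
                    String actualSql = rewriteTable(rewriteTable(logicSql, tableA, a), tableB, b);
                    units.add(dataSource + " ::: " + actualSql);
                }
            }
        }
        return units;
    }

    public static void main(String[] args) {
        String sql = "select * from order o inner join order_item oi on o.order_id = oi.order_id";
        List<String> units = route(sql, List.of("db0", "db1"), "order", 2, "order_item", 2);
        units.forEach(System.out::println);
        System.out.println("units: " + units.size());
    }
}
```

With two data sources and two shards per table, `route` yields 2 × 2 × 2 = 8 statements.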

Code Insight

Sample project: [email protected]:cluoHeadon/sharding-jdbc-demo.git

/**
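 * Entry point for executing a query SQL; the full execution flow can be debugged from here.
 * @see ShardingPreparedStatement#execute()
 * @see ParsingSQLRouter#route(String, List, SQLStatement) The routing rules are matched here to determine which actual tables a Join query involves.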
 */
public boolean execute() throws SQLException {
    try {
        // Match the relevant actual tables based on the parameters (which determine the shards) and the SQL itself.
        Collection<PreparedStatementUnit> preparedStatementUnits = route();
        // Dispatch execution and merge the results using a thread pool.
        return new PreparedStatementExecutor(getConnection().getShardingContext().getExecutorEngine(), routeResult.getSqlStatement().getType(), preparedStatementUnits).execute();
    } finally {
        JDBCShardingRefreshHandler.build(routeResult, connection).execute();
        clearBatch();
    }
}
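As a rough illustration of the "dispatch, then merge" step in the code above, the following sketch runs each routed unit on a thread pool and concatenates the partial results. `ParallelExecuteSketch` and `executeAndMerge` are hypothetical names; the real executor also deals with connections and with ordered or grouped merging, none of which is modeled here.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

class ParallelExecuteSketch {

    /** Runs every unit on the pool and concatenates the partial results in submission order. */
    static <T> List<T> executeAndMerge(List<Callable<List<T>>> units, int threads) {
        ExecutorService pool = Executors.newFixedThreadPool(threads);
        try {
            List<T> merged = new ArrayList<>();
            // invokeAll blocks until all units finish and returns futures in input order.
            for (Future<List<T>> future : pool.invokeAll(units)) {
                merged.addAll(future.get());
            }
            return merged;
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException(e);
        } catch (ExecutionException e) {
            throw new IllegalStateException(e.getCause());
        } finally {
            pool.shutdown();
        }
    }

    public static void main(String[] args) {
        // Each callable stands in for one actual SQL executed against one shard.
        List<Callable<List<String>>> units = List.of(
                () -> List.of("db0.order_0: row1"),
                () -> List.of("db0.order_1: row2", "db0.order_1: row3"),
                () -> List.of("db1.order_0: row4"));
        System.out.println(executeAndMerge(units, 2));
    }
}
```

Plain concatenation is only sufficient for queries without ordering, grouping, or pagination; those need a merge step that understands the original SQL.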

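SQL routing strategy

Enable SQL logging to see directly the SQL that is actually dispatched and executed.

# The logging happens right after route() above produces the ExecutionUnits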
sharding.jdbc.config.sharding.props.sql.show=true

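sharding-jdbc applies different routing strategies to different SQL statements. For the Join queries we care about, the following two strategies are the relevant ones.

StandardRoutingEngine: binding-tables mode.
ComplexRoutingEngine: the most complex case, a Cartesian combination of the joined tables.

-- No parameters are given, so the shard cannot be located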
select * from order o inner join order_item oi on o.order_id = oi.order_id 

-- Routing result
-- Actual SQL: db1 ::: select * from order_1 o inner join order_item_1 oi on o.order_id = oi.order_id 
-- Actual SQL: db1 ::: select * from order_1 o inner join order_item_0 oi on o.order_id = oi.order_id 
-- Actual SQL: db1 ::: select * from order_0 o inner join order_item_1 oi on o.order_id = oi.order_id 
-- Actual SQL: db1 ::: select * from order_0 o inner join order_item_0 oi on o.order_id = oi.order_id 
-- Actual SQL: db0 ::: select * from order_1 o inner join order_item_1 oi on o.order_id = oi.order_id 
-- Actual SQL: db0 ::: select * from order_1 o inner join order_item_0 oi on o.order_id = oi.order_id 
-- Actual SQL: db0 ::: select * from order_0 o inner join order_item_1 oi on o.order_id = oi.order_id 
-- Actual SQL: db0 ::: select * from order_0 o inner join order_item_0 oi on o.order_id = oi.order_id
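For comparison, here is a sketch of what binding-table routing amounts to, assuming `order` and `order_item` are sharded by the same key with the same shard count, so that matching rows always share a shard suffix. Under that assumption only same-suffix pairs need to be joined. The class and method names are hypothetical, not sharding-jdbc APIs.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Pattern;

class BindingRoutingSketch {

    /** Replaces one logical table name with its actual shard name, e.g. order -> order_1. */
    static String rewriteTable(String sql, String logicTable, int shard) {
        return sql.replaceAll("\\b" + Pattern.quote(logicTable) + "\\b", logicTable + "_" + shard);
    }

    /** Joins only shards with the same suffix: data sources x shards, no Cartesian product. */
    static List<String> route(String logicSql, List<String> dataSources,
                              String tableA, String tableB, int shards) {
        List<String> units = new ArrayList<>();
        for (String dataSource : dataSources) {
            for (int shard = 0; shard < shards; shard++) {
                String actualSql = rewriteTable(rewriteTable(logicSql, tableA, shard), tableB, shard);
                units.add(dataSource + " ::: " + actualSql);
            }
        }
        return units;
    }

    public static void main(String[] args) {
        String sql = "select * from order o inner join order_item oi on o.order_id = oi.order_id";
        route(sql, List.of("db0", "db1"), "order", "order_item", 2).forEach(System.out::println);
    }
}
```

For the same two data sources and two shards per table, this produces 4 statements instead of the 8 listed above.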

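② Elasticsearch Join query scenario

First, when a Join query is required of a NoSQL database, it is worth asking whether the use case or the usage itself is flawed.

That said, some scenarios unavoidably need it. In that case the Join implementation sits closer to a SQL engine.

The general implementation approach is examined here through a solution based on the elasticsearch-sql component.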

elasticsearch-sql

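This is an Elasticsearch plugin that provides SQL-like queries through an HTTP service; newer versions of Elasticsearch already ship SQL-like querying themselves.
Because Elasticsearch has no Join query feature of its own, implementing SQL Join requires lower-level functionality, which involves Join algorithms.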

Code Insight

Source code: [email protected]:NLPchina/elasticsearch-sql.git

/**
 * Execute the ActionRequest and returns the REST response using the channel.
 * @see ElasticDefaultRestExecutor#execute
 * @see ESJoinQueryActionFactory#createJoinAction Join algorithm selection
 */
@Override
public void execute(Client client, Map<String, String> params, QueryAction queryAction, RestChannel channel) throws Exception{
    // sql parse
    SqlElasticRequestBuilder requestBuilder = queryAction.explain();

    // join query
    if(requestBuilder instanceof JoinRequestBuilder){
        // Join algorithm selection; the options are HashJoinElasticExecutor and NestedLoopsElasticExecutor.
        // If the join condition is an equality (Condition.OPEAR.EQ), HashJoinElasticExecutor is used.
        ElasticJoinExecutor executor = ElasticJoinExecutor.createJoinExecutor(client,requestBuilder);
        executor.run();
        executor.sendResponse(channel);
    }
    // other query types ...
}
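The selection rule in the comments above can be sketched in isolation. The enums and the `choose` method below are hypothetical stand-ins, not the plugin's classes. The sketch assumes that hash join is chosen only when every join condition is an equality, and that a join without any condition falls back to nested loops.

```java
import java.util.List;

class JoinAlgorithmChoiceSketch {

    // Hypothetical comparison operators; EQ plays the role of Condition.OPEAR.EQ.
    enum Op { EQ, GT, LT, LIKE }

    enum Algorithm { HASH_JOIN, NESTED_LOOPS }

    /** Picks hash join only when there is at least one condition and all are equalities. */
    static Algorithm choose(List<Op> joinConditionOps) {
        boolean allEquality = !joinConditionOps.isEmpty()
                && joinConditionOps.stream().allMatch(op -> op == Op.EQ);
        return allEquality ? Algorithm.HASH_JOIN : Algorithm.NESTED_LOOPS;
    }

    public static void main(String[] args) {
        System.out.println(choose(List.of(Op.EQ)));
        System.out.println(choose(List.of(Op.EQ, Op.GT)));
    }
}
```

The reasoning behind such a rule is that a hash table can only answer "same key" lookups, so range or pattern conditions have to be evaluated pair by pair.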

③More Than Join

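Join algorithms

Three Join algorithms are in common use: Nested Loop Join, Hash Join, and Merge Join.
MySQL supported only NLJ and its variants until version 8.0.18, which added Hash Join.
NLJ amounts to two nested loops: the first table drives the outer loop and the second table the inner loop. Each record of the outer loop is compared against the records of the inner loop, and the pairs that satisfy the condition are output.
Hash Join has two phases: a build phase and a probe phase. The build phase loads one input into a hash table keyed on the join column; the probe phase scans the other input and looks up its matches in that table.
EXPLAIN shows which Join algorithm MySQL uses. The required syntax keywords are FORMAT=JSON or FORMAT=TREE.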

EXPLAIN FORMAT=JSON  
SELECT * FROM
    sale_line_info u
    JOIN sale_line_manager o ON u.sale_line_code = o.sale_line_code;

{
    "query_block": {
        "select_id": 1,
        // Join algorithm in use: nested_loop
        "nested_loop": [
            // Tables involved in the join and their keys; the other fields are similar to regular EXPLAIN output
            {
                "table": {
                    "table_name": "o",
                    "access_type": "ALL"
                }
            },
            {
                "table": {
                    "table_name": "u",
                    "access_type": "ref"
                }
            }
        ]
    }
}
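To make the nested loop and hash join algorithms described above concrete, here is a minimal in-memory sketch of both on an equality key. Rows are modeled as `String[]{key, payload}` and all names are made up for illustration; real engines add indexes, block buffering, and spilling to disk.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

class JoinAlgorithmsSketch {

    /** Nested loop join: compare every outer row with every inner row. */
    static List<String> nestedLoopJoin(List<String[]> outer, List<String[]> inner) {
        List<String> out = new ArrayList<>();
        for (String[] o : outer) {
            for (String[] i : inner) {
                if (o[0].equals(i[0])) {
                    out.add(o[0] + ":" + o[1] + "|" + i[1]);
                }
            }
        }
        return out;
    }

    /** Hash join: build a hash table on one input, then probe it with the other. */
    static List<String> hashJoin(List<String[]> build, List<String[]> probe) {
        // Build phase: group the build-side rows by join key.
        Map<String, List<String[]>> table = new HashMap<>();
        for (String[] b : build) {
            table.computeIfAbsent(b[0], k -> new ArrayList<>()).add(b);
        }
        // Probe phase: one hash lookup per probe-side row.
        List<String> out = new ArrayList<>();
        for (String[] p : probe) {
            for (String[] b : table.getOrDefault(p[0], List.of())) {
                out.add(b[0] + ":" + b[1] + "|" + p[1]);
            }
        }
        return out;
    }

    public static void main(String[] args) {
        List<String[]> lines = List.of(new String[]{"L1", "north"}, new String[]{"L2", "south"});
        List<String[]> managers = List.of(new String[]{"L1", "alice"}, new String[]{"L2", "bob"},
                new String[]{"L1", "carol"}, new String[]{"L3", "dave"});
        System.out.println(nestedLoopJoin(lines, managers));
        System.out.println(hashJoin(lines, managers));
    }
}
```

Both methods return the same set of joined rows, though possibly in a different order. The nested loop does a comparison for every pair of rows, while the hash join does one pass over each input plus a lookup per probe row.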

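Elasticsearch Nested type

Depending on the business data in Elasticsearch and how it is used, another option is to store documents that already contain the related information. Elasticsearch then serves queries and retrieval on complete documents, which avoids Join-related techniques altogether.

This raises several questions: whether the related data is owned by the parent document or shared across documents, how large the related data is, and how often it is updated. These are all factors to weigh when using the Nested type.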
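One reason the Nested type matters for related data is matching semantics: per the Elasticsearch documentation, a plain array of objects is flattened into per-field value lists, while nested objects are indexed separately and matched one object at a time. The sketch below imitates that difference in plain Java. It does not call Elasticsearch, and the field names `sku` and `qty` are invented for the example.

```java
import java.util.List;
import java.util.Map;

class NestedMatchSketch {

    /** Nested semantics: both conditions must hold on the same item object. */
    static boolean nestedMatch(List<Map<String, Object>> items, String sku, int qty) {
        return items.stream().anyMatch(item ->
                sku.equals(item.get("sku")) && Integer.valueOf(qty).equals(item.get("qty")));
    }

    /** Flattened semantics: field values are pooled, so conditions may match across items. */
    static boolean flattenedMatch(List<Map<String, Object>> items, String sku, int qty) {
        boolean skuSeen = items.stream().anyMatch(item -> sku.equals(item.get("sku")));
        boolean qtySeen = items.stream().anyMatch(item -> Integer.valueOf(qty).equals(item.get("qty")));
        return skuSeen && qtySeen;
    }

    public static void main(String[] args) {
        // One order document with two embedded items.
        List<Map<String, Object>> items = List.of(
                Map.of("sku", "A", "qty", 1),
                Map.of("sku", "B", "qty", 5));
        System.out.println("nested:    " + nestedMatch(items, "A", 5));
        System.out.println("flattened: " + flattenedMatch(items, "A", 5));
    }
}
```

For the query "sku A with qty 5", no single item satisfies both conditions, so only the flattened variant reports a match.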

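More usage details can be found online and in the official documentation, so they are not repeated here.

We currently have a business feature that uses the Nested type; it resolved a very difficult problem during querying and optimization.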

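Summary

Analyzing how these tools work gives a clear and deeper understanding of their execution flow.

It makes middleware tuning and technology selection more purposeful, and encourages more careful use.

Explicit filter conditions, a narrower filter range, and limiting the number of rows fetched with limit all reduce computation cost and improve performance.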


Author: Yang Pan, JD Logistics

Source: JD Cloud Developer Community
