天天看點

PostgreSQL vs Greenplum Hash outer join (hash表的選擇)

postgresql , greenplum , hash outer join , hash table

資料分析、大表join、多表join時,哈希join是比較好的提速手段。

hash join會首先掃描其中的一張表(包括需要輸出的字段),根據join列生成哈希表。然後掃描另一張表。

<a href="https://www.postgresql.org/docs/10/static/planner-optimizer.html">https://www.postgresql.org/docs/10/static/planner-optimizer.html</a>

the right relation is first scanned and loaded into a hash table, using its join attributes as hash keys.

next the left relation is scanned and the appropriate values of every row found are used as hash keys to locate the matching rows in the table.

理論上應該選擇小表作為哈希表。但是2011年以前的版本,對hash表的選擇是有講究的,并不是自由選擇,隻支outer join時傳回可以為空的表生成哈希表。

postgresql在1997年的時候已經支援hashjoin,greenplum基于postgresql 8.2開發,是以也是天然支援hashjoin的。

在2011年時,postgresql對hashjoin做出了兩個改進,支援full outer join,同時支援outer join任意表生成哈希表(原來的版本隻支outer join時傳回可以為空的表生成哈希表):

<a href="https://www.postgresql.org/docs/current/static/release-9-1.html">https://www.postgresql.org/docs/current/static/release-9-1.html</a>

allow full outer join to be implemented as a hash join, and allow either side of a left outer join or right outer join to be hashed (tom lane)

previously full outer join could only be implemented as a merge join, and left outer join and right outer join could hash only the nullable side of the join.

these changes provide additional query optimization possibilities.

對應patch

<a href="https://git.postgresql.org/gitweb/?p=postgresql.git;a=commit;h=f4e4b3274317d9ce30de7e7e5b04dece7c4e1791">https://git.postgresql.org/gitweb/?p=postgresql.git;a=commit;h=f4e4b3274317d9ce30de7e7e5b04dece7c4e1791</a>

這個改進非常有意義,特别是可以為空的表非常龐大時,作為哈希表是不合适的。後面就來對比一下。

for each row r1 of t1, the joined table has a row for each row in t2 that satisfies the join condition with r1.

first, an inner join is performed.

then, for each row in t1 that does not satisfy the join condition with any row in t2,

a joined row is added with null values in columns of t2.

thus, the joined table always has at least one row for each row in t1.

then, for each row in t2 that does not satisfy the join condition with any row in t1,

a joined row is added with null values in columns of t1.

this is the converse of a left join: the result table will always have a row for each row in t2.

also, for each row of t2 that does not satisfy the join condition with any row in t1,

a joined row with null values in the columns of t1 is added.

postgresql自動選擇了小表作為哈希表。

greenplum隻能選擇nullable端的表作為哈希表。即t2.

postgresql 9.1+ 支援full outer join使用hash join.

greenplum 8.2版本,不支援full outer join使用hash join.

使用了merge join.

繼續閱讀