概述
我們知道,關系資料庫的表更适合扁平的清單,而不是像 XML 那樣可以直管的儲存具有父子關系的層次結構資料。
首先定義一下我們讨論的層次結構,是這樣的一組資料,每個條目隻能有一個父條目,可以有零個或多個子條目(唯一的例外是根條目,它沒有父條目)。許多依賴資料庫的應用都會遇到層次結構的資料,例如論壇或郵件清單的線索、企業的組織結構圖、内容管理系統或商城的分類目錄等等。我們如下資料作為示例:
資料來源于維基百科的這個頁面,為什麼挑了這幾個條目,以及是否準确合理在這裡就不深究了。
Mike Hillyer 考慮了兩種不同的模型——鄰接表(Adjacency List)和嵌套集(Nested Set)來實作這個層次結構。
鄰接表(Adjacency List)模型
我們可以很直覺的使用下面的方式來儲存如圖所示的結構。
建立名為
distributions
的表:
CREATE TABLE distributions (
id INT NOT NULL AUTO_INCREMENT,
name VARCHAR(32) NOT NULL,
parent INT NULL DEFAULT NULL,
PRIMARY KEY (id)
)
ENGINE = InnoDB
DEFAULT CHARACTER SET = utf8;
插入資料:
INSERT INTO distributions VALUES
(1, 'Linux', NULL),
(2, 'Debian', 1),
(3, 'Knoppix', 2),
(4, 'Ubuntu', 2),
(5, 'Gentoo', 1),
(6, 'Red Hat', 1),
(7, 'Fedora Core', 6),
(8, 'RHEL', 6),
(9, 'CentOS', 8),
(10, 'Oracle Linux', 8);
執行:
SELECT * FROM distributions;
可以看到表中的資料形如:
+----+--------------+--------+
| id | name | parent |
+----+--------------+--------+
| 1 | Linux | NULL |
| 2 | Debian | 1 |
| 3 | Knoppix | 2 |
| 4 | Ubuntu | 2 |
| 5 | Gentoo | 1 |
| 6 | Red Hat | 1 |
| 7 | Fedora Core | 6 |
| 8 | RHEL | 6 |
| 9 | CentOS | 8 |
| 10 | Oracle Linux | 8 |
+----+--------------+--------+
使用連結表模型,表中的每一條記錄都包含一個指向其上層記錄的指針。頂層記錄(這個例子中是
Linux
)的這個字段的值為
NULL
。鄰接表的優勢是相當直覺和簡單,我們一眼就能看出
CentOS
衍生自
RHEL
,後者又是從
Red Hat
發展而來的。雖然用戶端程式可能處理起來相當簡單,但是使用純 SQL 處理鄰接表則會遇到一些麻煩。
擷取整棵樹以及單個節點的完整路徑
第一個處理層次結構常見的任務是顯示整個層次結構,通常包含一定的縮進。使用純 SQL 處理時通常需要借助所謂的 self-join 技巧:
SELECT
t1.name AS level1,
t2.name as level2,
t3.name as level3,
t4.name as level4
FROM
distributions AS t1
LEFT JOIN distributions AS t2
ON t2.parent = t1.id
LEFT JOIN distributions AS t3
ON t3.parent = t2.id
LEFT JOIN distributions AS t4
ON t4.parent = t3.id
WHERE t1.name = 'Linux';
結果如下:
+--------+---------+-------------+--------------+
| level1 | level2 | level3 | level4 |
+--------+---------+-------------+--------------+
| Linux | Red Hat | RHEL | CentOS |
| Linux | Red Hat | RHEL | Oracle Linux |
| Linux | Debian | Knoppix | NULL |
| Linux | Debian | Ubuntu | NULL |
| Linux | Red Hat | Fedora Core | NULL |
| Linux | Gentoo | NULL | NULL |
+--------+---------+-------------+--------------+
可以看到,實際上用戶端代碼拿到這個結果也不容易處理。對比原文,我們發現傳回結果的順序也是不确定的。在實踐中沒有什麼參考意義。不過可以通過增加一個
WHERE
條件,擷取一個節點的完整路徑:
SELECT
t1.name AS level1,
t2.name as level2,
t3.name as level3,
t4.name as level4
FROM
distributions AS t1
LEFT JOIN distributions AS t2
ON t2.parent = t1.id
LEFT JOIN distributions AS t3
ON t3.parent = t2.id
LEFT JOIN distributions AS t4
ON t4.parent = t3.id
WHERE
t1.name = 'Linux'
AND t4.name = 'CentOS';
+--------+---------+--------+--------+
| level1 | level2 | level3 | level4 |
+--------+---------+--------+--------+
| Linux | Red Hat | RHEL | CentOS |
+--------+---------+--------+--------+
找出所有的葉節點
使用
LEFT JOIN
我們可以找出所有的葉節點:
SELECT
distributions.id, distributions.name
FROM
distributions
LEFT JOIN distributions as child
ON distributions.id = child.parent
WHERE child.id IS NULL;
+----+--------------+
| id | name |
+----+--------------+
| 3 | Knoppix |
| 4 | Ubuntu |
| 5 | Gentoo |
| 7 | Fedora Core |
| 9 | CentOS |
| 10 | Oracle Linux |
+----+--------------+
鄰接表模型的限制
使用純 SQL 處理鄰接表模型即便在最好的情況下也是困難的。要獲得一個分類的完整路徑之前我們需要知道它的層次有多深。除此之外,當我們删除一個節點時我們需要格外的謹慎,因為這可能潛在的在處理過程中整個子樹成為孤兒(例如删除『便攜式小家電』則所有其子分類都成為孤兒了)。其中一些限制可以在用戶端代碼或存儲過程中定位并處理。例如在存儲過程中我們可以自下而上的周遊這個結構以便傳回整棵樹或一個路徑。我們也可以使用存儲過程來删除節點,通過提升其一個子節點的層次并重新設定所有其它子節點的父節點為這個節點,來避免整棵子樹成為孤兒。
嵌套集(Nested Set)模型
由于使用純 SQL 處理鄰接表模型存在種種不便,是以 Mike Hillyer 鄭重的介紹了嵌套集(Nested Set)模型。當使用這種模型時,我們把層次結構的節點和路徑從腦海中抹去,把它們想象為一個個容器:
可以看到層次關系沒有改變,大的容器包含子容器。我們使用容器的左值和右值來建立資料表:
CREATE TABLE nested (
id INT NOT NULL AUTO_INCREMENT,
name VARCHAR(32) NOT NULL,
`left` INT NOT NULL,
`right` INT NOT NULL,
PRIMARY KEY (id)
)
ENGINE = InnoDB
DEFAULT CHARACTER SET = utf8;
需要注意
left
和
right
是 MySQL 的保留字,是以使用辨別分隔符來标記它們。
INSERT INTO nested VALUES
(1, 'Linux', 1, 20),
(2, 'Debian', 2, 7),
(3, 'Knoppix', 3, 4),
(4, 'Ubuntu', 5, 6),
(5, 'Gentoo', 8, 9),
(6, 'Red Hat', 10, 19),
(7, 'Fedora Core', 11, 12),
(8, 'RHEL', 13, 18),
(9, 'CentOS', 14, 15),
(10, 'Oracle Linux', 16, 17);
檢視内容:
SELECT * FROM nested ORDER BY id;
可以看到:
+----+--------------+------+-------+
| id | name | left | right |
+----+--------------+------+-------+
| 1 | Linux | 1 | 20 |
| 2 | Debian | 2 | 7 |
| 3 | Knoppix | 3 | 4 |
| 4 | Ubuntu | 5 | 6 |
| 5 | Gentoo | 8 | 9 |
| 6 | Red Hat | 10 | 19 |
| 7 | Fedora Core | 11 | 12 |
| 8 | RHEL | 13 | 18 |
| 9 | CentOS | 14 | 15 |
| 10 | Oracle Linux | 16 | 17 |
+----+--------------+------+-------+
我們是如何确定左編号和右編号的呢,通過下圖我們可以直覺的發現隻要會數數即可完成:
回到樹形模型該怎麼處理,通過下圖,對資料結構稍有概念的人都會知道,稍加改動的先序周遊算法即可完成這項編号的工作:
擷取整棵樹
一個節點的左編号總是介于其父節點的左右編号之間,利用這個特性使用 self-join 連結到父節點,可以擷取整棵樹:
SELECT node.name
FROM
nested AS node,
nested AS parent
WHERE
node.`left` BETWEEN parent.`left` AND parent.`right`
AND parent.name = 'Linux'
ORDER BY node.`left`;
+--------------+
| name |
+--------------+
| Linux |
| Debian |
| Knoppix |
| Ubuntu |
| Gentoo |
| Red Hat |
| Fedora Core |
| RHEL |
| CentOS |
| Oracle Linux |
+--------------+
但是這樣我們丢失了層次的資訊。怎麼辦呢?使用
COUNT()
函數和
GROUP BY
子句,可以實作這個目的:
SELECT
node.name, (COUNT(parent.name) - 1) AS depth
FROM
nested AS node,
nested AS parent
WHERE
node.`left` BETWEEN parent.`left` AND parent.`right`
GROUP BY node.name
ORDER BY ANY_VALUE(node.`left`);
需要注意 MySQL 5.7.5 開始預設啟用了
only_full_group_by
模式,讓
GROUP BY
的行為與 SQL92 标準一緻,是以直接使用
ORDER BY node.`left`
會産生錯誤:
ERROR 1055 (42000): Expression #1 of ORDER BY clause is not in GROUP BY clause and contains nonaggregated column'test.node.left' which is not functionally dependent on columns in GROUP BY clause; this is incompatible withsql_mode=only_full_group_by
ANY_VALUE()
是避免這個問題的一種的途徑。
+--------------+-------+
| name | depth |
+--------------+-------+
| Linux | 0 |
| Debian | 1 |
| Knoppix | 2 |
| Ubuntu | 2 |
| Gentoo | 1 |
| Red Hat | 1 |
| Fedora Core | 2 |
| RHEL | 2 |
| CentOS | 3 |
| Oracle Linux | 3 |
+--------------+-------+
稍作調整就可以直接顯示層次:
SELECT
CONCAT(REPEAT(' ', COUNT(parent.name) - 1), node.name) AS name
FROM
nested AS node,
nested AS parent
WHERE
node.`left` BETWEEN parent.`left` AND parent.`right`
GROUP BY node.name
ORDER BY ANY_VALUE(node.`left`);
結果相當漂亮:
+-----------------+
| name |
+-----------------+
| Linux |
| Debian |
| Knoppix |
| Ubuntu |
| Gentoo |
| Red Hat |
| Fedora Core |
| RHEL |
| CentOS |
| Oracle Linux |
+-----------------+
當然用戶端代碼可能會更傾向于使用
depth
值,對傳回的結果集進行循環,Web 開發人員可以根據其增大或減小使用
<li>
/
</li>
或
<ul>
</ul>
等。
擷取節點在子樹中的深度
要擷取節點在子樹中的深度,我們需要第三個 self-join 以及子查詢來将結果限制在特定的子樹中以及進行必要的計算:
SELECT
node.name, (COUNT(parent.name) - ANY_VALUE(sub_tree.depth) - 1) AS depth
FROM
nested AS node,
nested AS parent,
nested AS sub_parent,
(
SELECT
node.name, (COUNT(parent.name) - 1) AS depth
FROM
nested AS node,
nested AS parent
WHERE
node.`left` BETWEEN parent.`left` AND parent.`right`
AND node.name = 'Red Hat'
GROUP BY node.name, node.`left`
ORDER BY node.`left`
) AS sub_tree
WHERE
node.`left` BETWEEN parent.`left` AND parent.`right`
AND node.`left` BETWEEN sub_parent.`left` AND sub_parent.`right`
AND sub_parent.name = sub_tree.name
GROUP BY node.name
ORDER BY ANY_VALUE(node.`left`);
結果是:
+--------------+-------+
| name | depth |
+--------------+-------+
| Red Hat | 0 |
| Fedora Core | 1 |
| RHEL | 1 |
| CentOS | 2 |
| Oracle Linux | 2 |
+--------------+-------+
尋找一個節點的直接子節點
使用鄰接表模型時這相當簡單。使用嵌套集時,我們可以在上面擷取子樹各節點深度的基礎上增加一個
HAVING
子句來實作:
SELECT
node.name, (COUNT(parent.name) - ANY_VALUE(sub_tree.depth) - 1) AS depth
FROM
nested AS node,
nested AS parent,
nested AS sub_parent,
(
SELECT
node.name, (COUNT(parent.name) - 1) AS depth
FROM
nested AS node,
nested AS parent
WHERE
node.`left` BETWEEN parent.`left` AND parent.`right`
AND node.name = 'Red Hat'
GROUP BY node.name, node.`left`
ORDER BY node.`left`
) AS sub_tree
WHERE
node.`left` BETWEEN parent.`left` AND parent.`right`
AND node.`left` BETWEEN sub_parent.`left` AND sub_parent.`right`
AND sub_parent.name = sub_tree.name
GROUP BY node.name
HAVING depth = 1
ORDER BY ANY_VALUE(node.`left`);
結果:
+-------------+-------+
| name | depth |
+-------------+-------+
| Fedora Core | 1 |
| RHEL | 1 |
+-------------+-------+
擷取所有葉節點
觀察帶編号的嵌套模型,葉節點的判斷相當簡單,右編号恰好比左編号多 1 的節點就是葉節點:
SELECT id, name FROM nested WHERE `right` = `left` + 1;
+----+--------------+
| id | name |
+----+--------------+
| 3 | Knoppix |
| 4 | Ubuntu |
| 5 | Gentoo |
| 7 | Fedora Core |
| 9 | CentOS |
| 10 | Oracle Linux |
+----+--------------+
擷取單個節點的完整路徑
仍然是使用 self-join 技巧,不過現在無需顧慮節點的深度了:
SELECT parent.name
FROM
nested AS node,
nested AS parent
WHERE
node.`left` BETWEEN parent.`left` AND parent.`right`
AND node.name = 'CentOS'
ORDER BY parent.`left`;
+---------+
| name |
+---------+
| Linux |
| Red Hat |
| RHEL |
| CentOS |
+---------+
聚集操作
我們添加一張
releases
表,來展示一下在嵌套集模型下的聚集(aggregate)操作:
CREATE TABLE releases (
id INT NOT NULL AUTO_INCREMENT,
distribution_id INT NULL,
name VARCHAR(32) NOT NULL,
PRIMARY KEY (id),
INDEX distribution_id_idx (distribution_id ASC),
CONSTRAINT distribution_id
FOREIGN KEY (distribution_id)
REFERENCES nested (id)
ON DELETE CASCADE
ON UPDATE CASCADE
)
ENGINE = InnoDB
DEFAULT CHARACTER SET = utf8;
加入一些資料,假設這些資料是指某個軟體支援的發行版:
INSERT INTO releases (distribution_id, name) VALUES
(2, '7'), (2, '8'),
(4, '14.04 LTS'), (4, '15.10'),
(7, '22'), (7, '23'),
(9, '5'), (9, '6'), (9, '7');
那麼,下面的查詢可以知道每個節點下涉及的釋出版數量,如果這是一個軟體支援的釋出版清單,或許測試人員想要知道他們得準備多少種虛拟機吧:
SELECT
parent.name, COUNT(releases.name)
FROM
nested AS node ,
nested AS parent,
releases
WHERE
node.`left` BETWEEN parent.`left` AND parent.`right`
AND node.id = releases.distribution_id
GROUP BY parent.name
ORDER BY ANY_VALUE(parent.`left`);
+-------------+----------------------+
| name | COUNT(releases.name) |
+-------------+----------------------+
| Linux | 9 |
| Debian | 4 |
| Ubuntu | 2 |
| Red Hat | 5 |
| Fedora Core | 2 |
| CentOS | 3 |
+-------------+----------------------+
如果層次結構是一個分類目錄,這個技巧可以用于查詢各個類别下有多少關聯的商品。
添加節點
再次回顧這張圖:
如果我們要在
Gentoo
之後增加一個
Slackware
,這個新節點的左右編号分别是 10 和 11,而原來從 10 開始的所有編号都需要加 2。我們可以:
LOCK TABLE nested WRITE;
SELECT @baseIndex := `right` FROM nested WHERE name = 'Gentoo';
UPDATE nested SET `right` = `right` + 2 WHERE `right` > @baseIndex;
UPDATE nested SET `left` = `left` + 2 WHERE `left` > @baseIndex;
INSERT INTO nested (name, `left`, `right`) VALUES
('Slackware', @baseIndex + 1, @baseIndex + 2);
UNLOCK TABLES;
使用之前掌握的技巧看一下現在的情況:
SELECT
CONCAT(REPEAT(' ', COUNT(parent.name) - 1), node.name) AS name
FROM
nested AS node,
nested AS parent
WHERE
node.`left` BETWEEN parent.`left` AND parent.`right`
GROUP BY node.name
ORDER BY ANY_VALUE(node.`left`);
結果為:
+-----------------+
| name |
+-----------------+
| Linux |
| Debian |
| Knoppix |
| Ubuntu |
| Gentoo |
| Slackware |
| Red Hat |
| Fedora Core |
| RHEL |
| CentOS |
| Oracle Linux |
+-----------------+
如果新增的節點的父節點原來是葉節點,我們需要稍微調整一下之前的代碼。例如,我們要新增
Slax
作為
Slackware
的子節點:
LOCK TABLE nested WRITE;
SELECT @baseIndex := `left` FROM nested WHERE name = 'Slackware';
UPDATE nested SET `right` = `right` + 2 WHERE `right` > @baseIndex;
UPDATE nested SET `left` = `left` + 2 WHERE `left` > @baseIndex;
INSERT INTO nested(name, `left`, `right`) VALUES ('Slax', @baseIndex + 1, @baseIndex + 2);
UNLOCK TABLES;
現在,資料形如:
+-----------------+
| name |
+-----------------+
| Linux |
| Debian |
| Knoppix |
| Ubuntu |
| Gentoo |
| Slackware |
| Slax |
| Red Hat |
| Fedora Core |
| RHEL |
| CentOS |
| Oracle Linux |
+-----------------+
删除節點
删除節點的操作與添加操作相對,當要删除一個葉節點時,移除該節點并将所有比該節點右編碼大的編碼減 2。這個思路可以擴充到删除一個節點及其所有子節點的情況,删除左編碼介于節點左右編号之間的所有節點,将右側的節點編号全部左移該節點原編号寬度即可:
LOCK TABLE nested WRITE;
SELECT
@nodeLeft := `left`,
@nodeRight := `right`,
@nodeWidth := `right` - `left` + 1
FROM nested
WHERE name = 'Slackware';
DELETE FROM nested WHERE `left` BETWEEN @nodeLeft AND @nodeRight;
UPDATE nested SET `right` = `right` - @nodeWidth WHERE `right` > @nodeRight;
UPDATE nested SET `left` = `left` - @nodeWidth WHERE `left` > @nodeRight;
UNLOCK TABLES;
可以看到
Slackware
子樹被删除了:
+-----------------+
| name |
+-----------------+
| Linux |
| Debian |
| Knoppix |
| Ubuntu |
| Gentoo |
| Red Hat |
| Fedora Core |
| RHEL |
| CentOS |
| Oracle Linux |
+-----------------+
稍加調整,如果對介于要删除節點左右編号直接的節點對應編号左移 1,右側節點對應編号左移 2,則可以實作删除一個節點,其子節點提升一層的效果,例如我們嘗試删除
RHEL
但保留它的子節點:
LOCK TABLE nested WRITE;
SELECT
@nodeLeft := `left`,
@nodeRight := `right`
FROM nested
WHERE name = 'RHEL';
DELETE FROM nested WHERE `left` = @nodeLeft;
UPDATE nested SET `right` = `right` - 1, `left` = `left` - 1 WHERE `left` BETWEEN @nodeLeft AND @nodeRight;
UPDATE nested SET `right` = `right` - 2 WHERE `right` > @nodeRight;
UPDATE nested SET `left` = `left` - 2 WHERE `left` > @nodeRight;
UNLOCK TABLES;
SELECT
CONCAT(REPEAT(' ', COUNT(parent.name) - 1), node.name) AS name
FROM
nested AS node,
nested AS parent
WHERE
node.`left` BETWEEN parent.`left` AND parent.`right`
GROUP BY node.name
ORDER BY ANY_VALUE(node.`left`);
+----------------+
| name |
+----------------+
| Linux |
| Debian |
| Knoppix |
| Ubuntu |
| Gentoo |
| Red Hat |
| Fedora Core |
| CentOS |
| Oracle Linux |
+----------------+