PostgreSQL 相似文字檢索與去重 - (銀屑病怎麼治？銀屑病怎麼治療？銀屑病怎麼治療好？銀屑病怎麼能治療好？)

标簽

PostgreSQL , 相似字元串 , 全文檢索 , 去重 , 相似問題 , 醫療 , plr , plpython , madlib , 文本處理

https://github.com/digoal/blog/blob/master/201803/20180329_01.md#%E8%83%8C%E6%99%AF 背景

在雲栖社群的問答區，有一位網友提到有一個問題：

表裡相似資料太多，想删除相似度高的資料，有什麼辦法能實作嗎？       例如：       銀屑病怎麼治？       銀屑病怎麼治療？       銀屑病怎麼治療好？       銀屑病怎麼能治療好？       等等

https://github.com/digoal/blog/blob/master/201803/20180329_01.md#%E8%A7%A3%E8%BF%99%E4%B8%AA%E9%97%AE%E9%A2%98%E7%9A%84%E6%80%9D%E8%B7%AF 解這個問題的思路

1. 首先如何判斷内容的相似度，PostgreSQL中提供了中文分詞，pg_trgm(将字元串切成多個不重複的token,計算兩個字元串的相似度) .

對于本題，我建議采取中文分詞的方式，首先将内容拆分成詞組。

2. 在拆分成詞組後，首先分組聚合，去除完全重複的資料。

3. 然後自關聯生成笛卡爾(矩陣)，計算出每條記錄和其他記錄的相似度。相似度的算法很簡單，重疊的token數量除以集合的token去重後的數量。

4. 根據相似度，去除不需要的資料。

這裡如果資料量非常龐大，使用專業的分析程式設計語言會更好例如 PL/R。

https://github.com/digoal/blog/blob/master/201803/20180329_01.md#%E5%AE%9E%E6%93%8D%E7%9A%84%E4%BE%8B%E5%AD%90 實操的例子

首先要安裝PostgreSQL 中文分詞插件

(阿裡雲AliCloudDB PostgreSQL已包含這個插件，用法參考官方手冊)

git clone https://github.com/jaiminpan/pg_jieba.git       mv pg_jieba $PGSRC/contrib/       export PATH=/home/digoal/pgsql9.5/bin:$PATH       cd $PGSRC/contrib/pg_jieba       make clean;make;make install       git clone https://github.com/jaiminpan/pg_scws.git       mv pg_jieba $PGSRC/contrib/       export PATH=/home/digoal/pgsql9.5/bin:$PATH       cd $PGSRC/contrib/pg_scws       make clean;make;make install

建立插件

psql       # create extension pg_jieba;       # create extension pg_scws;

建立測試CASE

create table tdup1 (id int primary key, info text);       create extension pg_trgm;       insert into tdup1 values (1, '銀屑病怎麼治？');       insert into tdup1 values (2, '銀屑病怎麼治療？');       insert into tdup1 values (3, '銀屑病怎麼治療好？');       insert into tdup1 values (4, '銀屑病怎麼能治療好？');

這兩種分詞插件，可以任選一種。

postgres=# select to_tsvector('jiebacfg', info),* from tdup1 ;            to_tsvector     | id |         info                ---------------------+----+----------------------        '治':3 '銀屑病':1   |  1 | 銀屑病怎麼治？        '治療':3 '銀屑病':1 |  2 | 銀屑病怎麼治療？        '治療':3 '銀屑病':1 |  3 | 銀屑病怎麼治療好？        '治療':4 '銀屑病':1 |  4 | 銀屑病怎麼能治療好？       (4 rows)       postgres=# select to_tsvector('scwscfg', info),* from tdup1 ;                   to_tsvector            | id |         info                -----------------------------------+----+----------------------        '治':2 '銀屑病':1                 |  1 | 銀屑病怎麼治？        '治療':2 '銀屑病':1               |  2 | 銀屑病怎麼治療？        '好':3 '治療':2 '銀屑病':1        |  3 | 銀屑病怎麼治療好？        '好':4 '治療':3 '能':2 '銀屑病':1 |  4 | 銀屑病怎麼能治療好？       (4 rows)

建立三個函數，

計算2個數組的集合（去重後的集合）

postgres=# create or replace function array_union(text[], text[]) returns text[] as $$         select array_agg(c1) from (select c1 from unnest($1||$2) t(c1) group by c1) t;       $$ language sql strict;       CREATE FUNCTION

數組去重

postgres=# create or replace function array_dist(text[]) returns text[] as $$                  select array_agg(c1) from (select c1 from unnest($1) t(c1) group by c1) t;           $$ language sql strict;       CREATE FUNCTION

計算兩個數組的重疊部分（去重後的重疊部分）

postgres=# create or replace function array_share(text[], text[]) returns text[] as $$         select array_agg(unnest) from (select unnest($1) intersect select unnest($2) group by 1) t;       $$ language sql strict;       CREATE FUNCTION

笛卡爾結果是這樣的：

regexp_split_to_array((regexp_replace(to_tsvector('jiebacfg',info)::text,'(:\d+)', '', 'g')),' ')

用于将info轉換成數組。

postgres=# with t(c1,c2,c3) as        (select id,info,array_dist(regexp_split_to_array((regexp_replace(to_tsvector('jiebacfg',info)::text,'(:\d+)', '', 'g')),' ')) from tdup1)        select * from (select t1.c1 t1c1,t2.c1 t2c1,t1.c2 t1c2,t2.c2 t2c2,t1.c3 t1c3,t2.c3 t2c3,round(array_length(array_share(t1.c3,t2.c3),1)::numeric/array_length(array_union(t1.c3,t2.c3),1),2)        simulate from t t1,t t2) t;        t1c1 | t2c1 |         t1c2         |         t2c2         |       t1c3        |       t2c3        | simulate        ------+------+----------------------+----------------------+-------------------+-------------------+----------           1 |    1 | 銀屑病怎麼治？       | 銀屑病怎麼治？       | {'銀屑病','治'}   | {'銀屑病','治'}   |     1.00           1 |    2 | 銀屑病怎麼治？       | 銀屑病怎麼治療？     | {'銀屑病','治'}   | {'銀屑病','治療'} |     0.33           1 |    3 | 銀屑病怎麼治？       | 銀屑病怎麼治療好？   | {'銀屑病','治'}   | {'銀屑病','治療'} |     0.33           1 |    4 | 銀屑病怎麼治？       | 銀屑病怎麼能治療好？ | {'銀屑病','治'}   | {'銀屑病','治療'} |     0.33           2 |    1 | 銀屑病怎麼治療？     | 銀屑病怎麼治？       | {'銀屑病','治療'} | {'銀屑病','治'}   |     0.33           2 |    2 | 銀屑病怎麼治療？     | 銀屑病怎麼治療？     | {'銀屑病','治療'} | {'銀屑病','治療'} |     1.00           2 |    3 | 銀屑病怎麼治療？     | 銀屑病怎麼治療好？   | {'銀屑病','治療'} | {'銀屑病','治療'} |     1.00           2 |    4 | 銀屑病怎麼治療？     | 銀屑病怎麼能治療好？ | {'銀屑病','治療'} | {'銀屑病','治療'} |     1.00           3 |    1 | 銀屑病怎麼治療好？   | 銀屑病怎麼治？       | {'銀屑病','治療'} | {'銀屑病','治'}   |     0.33           3 |    2 | 銀屑病怎麼治療好？   | 銀屑病怎麼治療？     | {'銀屑病','治療'} | {'銀屑病','治療'} |     1.00           3 |    3 | 銀屑病怎麼治療好？   | 銀屑病怎麼治療好？   | {'銀屑病','治療'} | {'銀屑病','治療'} |     1.00           3 |    4 | 銀屑病怎麼治療好？   | 銀屑病怎麼能治療好？ | {'銀屑病','治療'} | {'銀屑病','治療'} |     1.00           4 |    1 | 銀屑病怎麼能治療好？ | 銀屑病怎麼治？       | {'銀屑病','治療'} | {'銀屑病','治'}   |     0.33           4 |    2 | 銀屑病怎麼能治療好？ | 銀屑病怎麼治療？     | {'銀屑病','治療'} | {'銀屑病','治療'} |     1.00           4 |    3 | 銀屑病怎麼能治療好？ | 銀屑病怎麼治療好？   | {'銀屑病','治療'} | {'銀屑病','治療'} |     1.00           4 |    4 | 銀屑病怎麼能治療好？ | 銀屑病怎麼能治療好？ | {'銀屑病','治療'} | {'銀屑病','治療'} |     1.00       (16 rows)

以上生成的實際上是一個矩陣,simulate就是矩陣中我們需要計算的相似度：

https://github.com/digoal/blog/blob/master/201803/20180329_01_pic_001.png

我們在去重計算時不需要所有的笛卡爾積，隻需要這個矩陣對角線的上部分或下部分資料即可。

是以加個條件就能完成。

postgres=# with t(c1,c2,c3) as        (select id,info,array_dist(regexp_split_to_array((regexp_replace(to_tsvector('jiebacfg',info)::text,'(:\d+)', '', 'g')),' ')) from tdup1)        select * from (select t1.c1 t1c1,t2.c1 t2c1,t1.c2 t1c2,t2.c2 t2c2,t1.c3 t1c3,t2.c3 t2c3,round(array_length(array_share(t1.c3,t2.c3),1)::numeric/array_length(array_union(t1.c3,t2.c3),1),2)        simulate from t t1,t t2 where t1.c1<>t2.c1 and t1.c1<t2.c1) t;        t1c1 | t2c1 |        t1c2        |         t2c2         |       t1c3        |       t2c3        | simulate        ------+------+--------------------+----------------------+-------------------+-------------------+----------           1 |    2 | 銀屑病怎麼治？     | 銀屑病怎麼治療？     | {'銀屑病','治'}   | {'銀屑病','治療'} |     0.33           1 |    3 | 銀屑病怎麼治？     | 銀屑病怎麼治療好？   | {'銀屑病','治'}   | {'銀屑病','治療'} |     0.33           1 |    4 | 銀屑病怎麼治？     | 銀屑病怎麼能治療好？ | {'銀屑病','治'}   | {'銀屑病','治療'} |     0.33           2 |    3 | 銀屑病怎麼治療？   | 銀屑病怎麼治療好？   | {'銀屑病','治療'} | {'銀屑病','治療'} |     1.00           2 |    4 | 銀屑病怎麼治療？   | 銀屑病怎麼能治療好？ | {'銀屑病','治療'} | {'銀屑病','治療'} |     1.00           3 |    4 | 銀屑病怎麼治療好？ | 銀屑病怎麼能治療好？ | {'銀屑病','治療'} | {'銀屑病','治療'} |     1.00       (6 rows)

開始對這些資料去重，去重的第一步，明确simulate, 例如相似度大于0.5的，需要去重。

postgres=# with t(c1,c2,c3) as        (select id,info,array_dist(regexp_split_to_array((regexp_replace(to_tsvector('jiebacfg',info)::text,'(:\d+)', '', 'g')),' ')) from tdup1)        select * from (select t1.c1 t1c1,t2.c1 t2c1,t1.c2 t1c2,t2.c2 t2c2,t1.c3 t1c3,t2.c3 t2c3,round(array_length(array_share(t1.c3,t2.c3),1)::numeric/array_length(array_union(t1.c3,t2.c3),1),2)        simulate from t t1,t t2 where t1.c1<>t2.c1 and t1.c1<t2.c1) t where simulate>0.5;        t1c1 | t2c1 |        t1c2        |         t2c2         |       t1c3        |       t2c3        | simulate        ------+------+--------------------+----------------------+-------------------+-------------------+----------           2 |    3 | 銀屑病怎麼治療？   | 銀屑病怎麼治療好？   | {'銀屑病','治療'} | {'銀屑病','治療'} |     1.00           2 |    4 | 銀屑病怎麼治療？   | 銀屑病怎麼能治療好？ | {'銀屑病','治療'} | {'銀屑病','治療'} |     1.00           3 |    4 | 銀屑病怎麼治療好？ | 銀屑病怎麼能治療好？ | {'銀屑病','治療'} | {'銀屑病','治療'} |     1.00       (3 rows)

去重第二步，将t2c1列的ID對應的記錄删掉即可。

delete from tdup1 where id in (with t(c1,c2,c3) as        (select id,info,array_dist(regexp_split_to_array((regexp_replace(to_tsvector('jiebacfg',info)::text,'(:\d+)', '', 'g')),' ')) from tdup1)        select t2c1 from (select t1.c1 t1c1,t2.c1 t2c1,t1.c2 t1c2,t2.c2 t2c2,t1.c3 t1c3,t2.c3 t2c3,round(array_length(array_share(t1.c3,t2.c3),1)::numeric/array_length(array_union(t1.c3,t2.c3),1),2)        simulate from t t1,t t2 where t1.c1<>t2.c1 and t1.c1<t2.c1) t where simulate>0.5);       例如 :        postgres=# insert into tdup1 values (11, '白血病怎麼治？');       INSERT 0 1       postgres=# insert into tdup1 values (22, '白血病怎麼治療？');       INSERT 0 1       postgres=# insert into tdup1 values (13, '白血病怎麼治療好？');       INSERT 0 1       postgres=# insert into tdup1 values (24, '白血病怎麼能治療好？');       INSERT 0 1       postgres=#        postgres=# with t(c1,c2,c3) as                                    (select id,info,array_dist(regexp_split_to_array((regexp_replace(to_tsvector('jiebacfg',info)::text,'(:\d+)', '', 'g')),' ')) from tdup1)        select * from (select t1.c1 t1c1,t2.c1 t2c1,t1.c2 t1c2,t2.c2 t2c2,t1.c3 t1c3,t2.c3 t2c3,round(array_length(array_share(t1.c3,t2.c3),1)::numeric/array_length(array_union(t1.c3,t2.c3),1),2)        simulate from t t1,t t2 where t1.c1<>t2.c1 and t1.c1<t2.c1) t where simulate>0.5;        t1c1 | t2c1 |        t1c2        |         t2c2         |       t1c3        |       t2c3        | simulate        ------+------+--------------------+----------------------+-------------------+-------------------+----------           2 |    3 | 銀屑病怎麼治療？   | 銀屑病怎麼治療好？   | {'銀屑病','治療'} | {'銀屑病','治療'} |     1.00           2 |    4 | 銀屑病怎麼治療？   | 銀屑病怎麼能治療好？ | {'銀屑病','治療'} | {'銀屑病','治療'} |     1.00           3 |    4 | 銀屑病怎麼治療好？ | 銀屑病怎麼能治療好？ | {'銀屑病','治療'} | {'銀屑病','治療'} |     1.00          22 |   24 | 白血病怎麼治療？   | 白血病怎麼能治療好？ | {'治療','白血病'} | {'治療','白血病'} |     1.00          13 |   22 | 白血病怎麼治療好？ | 白血病怎麼治療？     | {'治療','白血病'} | {'治療','白血病'} |     1.00          13 |   24 | 白血病怎麼治療好？ | 白血病怎麼能治療好？ | {'治療','白血病'} | {'治療','白血病'} |     1.00       (6 rows)       postgres=# begin;       BEGIN       postgres=# delete from tdup1 where id in (with t(c1,c2,c3) as        postgres(# (select id,info,array_dist(regexp_split_to_array((regexp_replace(to_tsvector('jiebacfg',info)::text,'(:\d+)', '', 'g')),' ')) from tdup1)        postgres(# select t2c1 from (select t1.c1 t1c1,t2.c1 t2c1,t1.c2 t1c2,t2.c2 t2c2,t1.c3 t1c3,t2.c3 t2c3,round(array_length(array_share(t1.c3,t2.c3),1)::numeric/array_length(array_union(t1.c3,t2.c3),1),2)        postgres(# simulate from t t1,t t2 where t1.c1<>t2.c1 and t1.c1<t2.c1) t where simulate>0.5);       DELETE 4       postgres=# select * from tdup1 ;        id |        info               ----+--------------------         1 | 銀屑病怎麼治？         2 | 銀屑病怎麼治療？        11 | 白血病怎麼治？        13 | 白血病怎麼治療好？       (4 rows)

用資料庫解會遇到的問題, 因為我們的JOIN filter是<>和<，用不上hashjoin。

資料量比較大的情況下，耗時會非常的長。

postgres=# explain delete from tdup1 where id in (with t(c1,c2,c3) as        (select id,info,array_dist(regexp_split_to_array((regexp_replace(to_tsvector('jiebacfg',info)::text,'(:\d+)', '', 'g')),' ')) from tdup1)        select t2c1 from (select t1.c1 t1c1,t2.c1 t2c1,t1.c2 t1c2,t2.c2 t2c2,t1.c3 t1c3,t2.c3 t2c3,round(array_length(array_share(t1.c3,t2.c3),1)::numeric/array_length(array_union(t1.c3,t2.c3),1),2)        simulate from t t1,t t2 where t1.c1<>t2.c1 and t1.c1<t2.c1) t where simulate>0.5);                                                             QUERY PLAN                                                             ----------------------------------------------------------------------------------------------------------------------        Delete on tdup1  (cost=10005260133.58..10005260215.84 rows=2555 width=34)          ->  Hash Join  (cost=10005260133.58..10005260215.84 rows=2555 width=34)                Hash Cond: (tdup1.id = "ANY_subquery".t2c1)                ->  Seq Scan on tdup1  (cost=0.00..61.10 rows=5110 width=10)                ->  Hash  (cost=10005260131.08..10005260131.08 rows=200 width=32)                      ->  HashAggregate  (cost=10005260129.08..10005260131.08 rows=200 width=32)                            Group Key: "ANY_subquery".t2c1                            ->  Subquery Scan on "ANY_subquery"  (cost=10000002667.20..10005252911.99 rows=2886838 width=32)                                  ->  Subquery Scan on t  (cost=10000002667.20..10005224043.61 rows=2886838 width=4)                                        Filter: (t.simulate > 0.5)                                        CTE t                                          ->  Seq Scan on tdup1 tdup1_1  (cost=0.00..2667.20 rows=5110 width=36)                                        ->  Nested Loop  (cost=10000000000.00..10005113119.99 rows=8660513 width=68)                                              Join Filter: ((t1.c1 <> t2.c1) AND (t1.c1 < t2.c1))                                              ->  CTE Scan on t t1  (cost=0.00..102.20 rows=5110 width=36)                                              ->  CTE Scan on t t2  (cost=0.00..102.20 rows=5110 width=36)       (16 rows)

其他更優雅的方法，使用PLR或者R進行矩陣運算，得出結果後再進行篩選。

PLR

或者使用MPP資料庫例如Greenplum加上R和madlib可以對非常龐大的資料進行處理。

MADLIB

MPP

https://github.com/digoal/blog/blob/master/201803/20180329_01.md#%E5%B0%8F%E7%BB%93 小結

這裡用到了PG的什麼特性？

1. 中文分詞

2. 視窗查詢功能

(本例中沒有用到，但是如果你的資料沒有主鍵時，則需要用ctid和row_number來定位到一條唯一記錄)

https://github.com/digoal/blog/blob/master/201803/20180329_01.md#%E5%8F%82%E8%80%83 參考

《[未完待續] PostgreSQL 全文檢索大結果集優化 - fuzzy match》《PostgreSQL 全文檢索 - 詞頻統計》《[未完待續] PostgreSQL 流式fft傅裡葉變換 (plpython + numpy + 資料庫流式計算)》《PostgreSQL UDF實作tsvector(全文檢索), array(數組)多值字段與scalar(單值字段)類型的整合索引(類分區索引) - 單值與多值類型複合查詢性能提速100倍+ 案例 (含，單值+多值列合成)》《PostgreSQL 全文檢索之 - 位置比對過濾文法(例如 '速度 <1> 激情')》《多流實時聚合 - 記錄級實時快照 - JSON聚合與json全文檢索的功能應用》《PostgreSQL - 全文檢索内置及自定義ranking算法介紹與案例》《用PostgreSQL 做實時高效搜尋引擎 - 全文檢索、模糊查詢、正則查詢、相似查詢、ADHOC查詢》《HTAP資料庫 PostgreSQL 場景與性能測試之 14 - (OLTP) 字元串搜尋 - 全文檢索》《HTAP資料庫 PostgreSQL 場景與性能測試之 7 - (OLTP) 全文檢索 - 含索引實時寫入》《[未完待續] 流式機器學習(online machine learning) - pipelineDB with plR and plPython》《PostgreSQL 中英文混合分詞特殊規則(中文單字、英文單詞) - 中英分明》《在PostgreSQL中使用 plpythonu 調用系統指令》《多國語言字元串的加密、全文檢索、模糊查詢的支援》《全文檢索不包含優化 - 阿裡雲RDS PostgreSQL最佳實踐》《PostgreSQL 10.0 preview 功能增強 - JSON 内容全文檢索》《PostgreSQL 中如何找出記錄中是否包含編碼範圍内的字元，例如是否包含中文》《PostgreSQL Python tutorial》《如何解決資料庫分詞的拼寫糾正問題 - PostgreSQL Hunspell 字典複數形容詞動詞等變異還原》《聊一聊雙十一背後的技術 - 毫秒分詞算啥, 試試正則和相似度》《聊一聊雙十一背後的技術 - 分詞和搜尋》《PostgreSQL 全文檢索加速快到沒有朋友 - RUM索引接口(潘多拉魔盒)》《PostgreSQL 如何高效解決按任意字段分詞檢索的問題 - case 1》《如何加快PostgreSQL結巴分詞加載速度》《中文模糊查詢性能優化 by PostgreSQL trgm》《PostgreSQL 行級全文檢索》《使用阿裡雲PostgreSQL zhparser中文分詞時不可不知的幾個參數》《一張圖看懂MADlib能幹什麼》《PostgreSQL Greenplum 結巴分詞(by plpython)》《NLPIR 分詞準确率接近98.23%》《PostgreSQL chinese full text search 中文全文檢索》《PostgreSQL 多元線性回歸 - 1 MADLib Installed in PostgreSQL 9.2》《PostgreSQL USE plpythonu get Linux FileSystem usage》《PostgreSQL 使用 nlpbamboo chinesecfg 中文分詞》 https://github.com/jaiminpan/pg_jieba https://github.com/jaiminpan/pg_scws http://joeconway.com/plr/ https://www.postgresql.org/docs/devel/static/plpython.html http://madlib.apache.org/

PostgreSQL 相似文字檢索與去重 - (銀屑病怎麼治？銀屑病怎麼治療？銀屑病怎麼治療好？銀屑病怎麼能治療好？)

标簽

https://github.com/digoal/blog/blob/master/201803/20180329_01.md#%E8%83%8C%E6%99%AF 背景

https://github.com/digoal/blog/blob/master/201803/20180329_01.md#%E8%A7%A3%E8%BF%99%E4%B8%AA%E9%97%AE%E9%A2%98%E7%9A%84%E6%80%9D%E8%B7%AF 解這個問題的思路

https://github.com/digoal/blog/blob/master/201803/20180329_01.md#%E5%AE%9E%E6%93%8D%E7%9A%84%E4%BE%8B%E5%AD%90 實操的例子

https://github.com/digoal/blog/blob/master/201803/20180329_01.md#%E5%B0%8F%E7%BB%93 小結

https://github.com/digoal/blog/blob/master/201803/20180329_01.md#%E5%8F%82%E8%80%83 參考

繼續閱讀

set define off關閉替代變量功能

解碼器用于語義分割：資料依賴的解碼可以實作靈活的特征聚合

報錯：'mysql' 不是内部或外部指令，也不是可運作的程式或批處理檔案。

Linxu常用指令技巧彙總

ERROR 1 (HY000): Can't create/write to file '/tmp/#sql_4188_1.MYI' (Errcode: 28)

艱難安裝LDAP,SSL認證

《Linux指令行與Shell腳本程式設計大全第2版.布盧姆》pdf

MySQL的4種隔離級别？出現問題

XX系統實施過程問題總結

無元件上傳圖檔到資料庫中，最完整解決方案

【MySQL資料庫】資料庫索引事務1.索引2.事務

neo4j之cypher使用文檔

NOSQL安全攻擊

mybatis_入門程式Mybatis入門

登入plsql 報錯 the account is locked --使用者被鎖

SequoiaDB巨杉資料庫C++驅動概述