SQL 對郵箱資料的處理（分類統計）

需求

在許多業務場景中，需要處理使用者的郵箱資訊，為了探索不同郵箱的不同特定，便于對使用者進行分類處理，需要對郵箱進行處理。

前、字尾分離

郵箱的格式大緻是一緻的

字首@字尾

的樣式。需要将郵箱進行分類。

email
[email protected]
[email protected]

明顯看到他們前字尾通過

複合連接配接，将他們分别分離成兩列即可。

select  email_list[0] as pre_email , email_list[1] as pro_email
from (
    select split(email,\'@\') email_list
    from table a
) a

結果是：

pre_email	pro_email
123456	qq.com
123456	163.com

此時可以檢視，究竟有那些郵箱的字尾，計數是多少，便于進行分類。

select pro_email, count(*) cnt
from table
group by pro_email

字首的處理

對于字首，主要可以為以下幾種類型

手機号： 13579246810
純數字： 3123443242（ 這種可能會與手機号重複，是以需要先篩選手機号）
純字母： qwert
字母與數字： xiaoming1982
其他： xiaoming_19

分類代碼如下：

case when pre_email rlike "^(\\+?86)?((13[0-9])|(14[5,7])|(15[0-3,5-9])|(17[0,3,5-8])|(18[0-9])|161|166|198|199|(147))\\d{8}$" then \'手機号\'
    when pre_email rlike \'^\\d+$\' then \'純數字\'
    when pre_email rlike \'^[a-zA-Z]*$\' then \'純字母\'
    when pre_email rlike \'^[a-z0-9A-Z]+$\' then \'字母與數字\'
    else \'其他\' end as pre_flag

字尾的處理

字尾主要是公司，可以根據使用的場景，檢視那些郵箱用的比較多，定義對應的字尾漢語名稱。

騰訊郵箱： qq.com , qq.vip.com , foxmail.com
網易郵箱： 163.com , 126.com , yeah.cn
電信郵箱： 139.com , 189.com

分類代碼如下：

case when pro_email rlike \'.*(163|126|yeah).*\' then \'網易郵箱\'
    when pro_email rlike \'.*(sina).*\' then \'新浪郵箱\'
    when pro_email rlike \'.*(qq|foxmail).*\' then \'騰訊郵箱\'
    when pro_email rlike \'.*(sohu).*\' then \'搜狐郵箱\'
    when pro_email rlike \'.*(139|189).*\' then \'電信郵箱\'
    else \'others\' end as pro_flag

統計

這樣就可以處理成不同的組合。進行單獨、交叉統計。

-- 交叉統計
select pre_flag, pro_flag, count(*) cnt
from table
group by pre_flag, pro_flag