Using Python Dictionaries for Large-Volume Reconciliation

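My project has a reconciliation system. At first I wrote a script to verify the reconciliation results, when the two tables being reconciled held roughly 100 rows and 1,000,000 rows. For this test round I generated several million rows by hand, so both tables were on the order of 1,000,000 rows, and my script could no longer produce a result. That really stumped me. I made some optimizations and am sharing them here.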

Personal blog: https://blog.csdn.net/zyooooxie

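Scenario

This system reconciles C against G (six source tables in total). In production, C produces more than 6 million rows a month. Because reconciliation was taking too long, the technical approach was changed and the new version was handed to me for testing. I generated 1 to 2 million rows in each of three tables; the other three tables were filled by a crawler and hold between 200,000 and 2 million rows.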

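In fact, to check whether the reconciliation results are correct, you could simply sample N records, trace each one from the reconciliation result table back to the source tables, and compare it against the expected result. That would complete the verification.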
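A sampling check of that kind could look roughly like the sketch below. It is not code from the project: `expected_code`, `spot_check`, `c_by_id`, `g_by_id`, and the `(id, code)` shape of the result rows are all hypothetical names and shapes chosen for illustration, and rows are assumed to be `(id, status, amount, date)` tuples.

```python
import random


def expected_code(c_row, g_row):
    """Hypothetical rule set; rows are assumed to be (id, status, amount, date)."""
    if g_row is None:
        return 1004          # bill exists in C only
    if c_row is None:
        return 1005          # bill exists in G only
    if c_row[1] != g_row[1]:
        return 1001          # status differs
    if c_row[2] != g_row[2]:
        return 1002          # amount differs
    if c_row[3] != g_row[3]:
        return 1003          # date differs
    return 1000              # all four fields identical


def spot_check(result_rows, c_by_id, g_by_id, n, seed=0):
    """Sample up to n (id, code) result rows and return the ones that disagree."""
    rng = random.Random(seed)
    sample = rng.sample(result_rows, min(n, len(result_rows)))
    mismatches = []
    for bill_id, code in sample:
        want = expected_code(c_by_id.get(bill_id), g_by_id.get(bill_id))
        if want != code:
            mismatches.append((bill_id, code, want))
    return mismatches
```

An empty list from `spot_check` means every sampled row carried the code the rule set expects.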

But I had started out with a script, so I did not do it that way and insisted on making my script do the job.

I was asking for trouble.

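The earliest version

At first the two tables held 100 rows and 1,000,000 rows respectively, and my script ran smoothly with no problems.

The approach was:

list_a is the list of data from system C; list_b is the list of data from system G.

Compare the four fields of every element in the two lists and put each element that meets a condition into the corresponding reconciliation result list (six results in total, shown in a figure in the original post).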

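# Each element holds four fields: (id, status, amount, date).
# Log is the author's logging helper (not shown here).

# 1000: all four fields identical
new_list = [a for a in list_a if a in list_b]
Log.info('Consistent-1000: {}'.format(new_list))

# 1002: same id and status, different amount
new_list1 = [[a, b] for a in list_a for b in list_b if a[0] == b[0] and a[2] != b[2] and a[1] == b[1]]
Log.info('Amounts Difference-1002: {}'.format(new_list1))

# 1003: same id, status and amount, different date
new_list2 = [[a, b] for a in list_a for b in list_b if a[0] == b[0] and a[3] != b[3] and a[2] == b[2] and a[1] == b[1]]
Log.info('Date difference-1003: {}'.format(new_list2))

# 1001: same id, different status
new_list3 = [[a, b] for a in list_a for b in list_b if a[0] == b[0] and a[1] != b[1]]
Log.info('Status Difference-1001: {}'.format(new_list3))

# 1004: id present in C but missing in G
new_list4 = list()
for a in list_a:
    for b in list_b:
        if a[0] == b[0]:
            break
    else:
        # The inner loop ended without a break: no G row has this id.
        # (for/else also handles an empty list_b, unlike a check against list_b[-1].)
        new_list4.append(a)
Log.info('Miss G Bill-1004: {}'.format(new_list4))

# 1005: id present in G but missing in C
new_list5 = list()
for b in list_b:
    for a in list_a:
        if b[0] == a[0]:
            break
    else:
        # The inner loop ended without a break: no C row has this id.
        new_list5.append(b)
Log.info('Miss C Bill-1005: {}'.format(new_list5))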

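Version 1

This optimization came about because the earliest version, although its list comprehensions looked concise, actually ran at least six nested loops. With little data that does not matter, but when a list is 1,000,000 long, a single nested loop may take 1,000,000 × 1,000,000 comparisons. So Version 1 runs only two nested loops and produces all six results.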
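To put a rough number on that, here is a back-of-the-envelope estimate. The throughput figure `assumed_rate` is purely an assumption for illustration (real speed depends on the machine and on what each comparison does); only the row counts come from the scenario above.

```python
rows_c = 1_000_000
rows_g = 1_000_000

# Worst case for a single nested loop: every C row is compared with every G row.
comparisons = rows_c * rows_g

# Assumed throughput, not a measurement: 10 million comparisons per second.
assumed_rate = 10_000_000

seconds = comparisons / assumed_rate
hours = seconds / 3600
print(f"{comparisons:.0e} comparisons, about {hours:.1f} hours per nested loop")
```

Even under that optimistic assumption, one nested loop takes on the order of a day, so reducing six nested loops to two cannot bring the run time down to something practical.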

# Result lists for the first five categories
consistent_1000, status_1001, amount_1002, date_1003, g_miss_1004 = [], [], [], [], []

for a in list_a:
    for b in list_b:
        if a[0] != b[0]:
            continue
        # Same id: classify the pair, then stop scanning list_b.
        if a == b:      # all four fields identical
            consistent_1000.append([a, b])
        elif a[1] == b[1] and a[2] == b[2] and a[3] != b[3]:
            date_1003.append([a, b])
        elif a[1] == b[1] and a[2] != b[2]:
            amount_1002.append([a, b])
        elif a[1] != b[1]:
            status_1001.append([a, b])
        break
    else:
        # No G row shares this id.
        g_miss_1004.append(a)

new_list5 = list()
for b in list_b:
    for a in list_a:
        if b[0] == a[0]:
            break
    else:
        # No C row shares this id.
        new_list5.append(b)

Even so, there was still no hope of getting a result...

[Figure: log of one test case that had still not finished running]

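Version 2

This round of optimization was genuinely hard. I was so used to lists that I had no real idea where to go next.

After talking it over with colleagues and reading up on the subject, I learned that choosing a suitable data structure can deliver the optimization, and that is what led me to dictionaries. The result amazed me: four reconciliations of 1,000,000 × 1,000,000 rows finished in about three minutes.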

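How is it used?

For example: turn the elements of list_a and list_b into the keys of two dictionaries, then look up each key of the first dictionary in the second. If it is found, the key exists on both sides, which means the element exists on both sides, so it belongs to the reconciliation result in which all four field values are identical.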

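# Elements must be hashable (for example tuples) to be used as dict keys.
dict_a_new = dict.fromkeys(list_a, True)
dict_b_new = dict.fromkeys(list_b, True)
# List of elements whose fields are all identical on both sides
consistent_1000 = [i for i in dict_a_new if i in dict_b_new]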
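The reason this is so much faster is that `x in some_list` scans the list element by element, while `x in some_dict` is a hash lookup that takes roughly constant time on average. The sketch below compares the two on made-up data (the row values and the size `n` are assumptions, kept small so it runs quickly) and also shows that a set intersection yields the same matches as the dictionary version.

```python
import timeit

n = 20_000
# Made-up rows shaped like (id, status, amount, date).
list_b = [(i, "ok", 100, "2020-01-01") for i in range(n)]
list_a = [(i, "ok", 100 + i % 3, "2020-01-01") for i in range(0, n, 2)]

dict_a = dict.fromkeys(list_a, True)
dict_b = dict.fromkeys(list_b, True)

probe = list_b[-1]  # last element: the worst case for a list scan

t_list = timeit.timeit(lambda: probe in list_b, number=200)
t_dict = timeit.timeit(lambda: probe in dict_b, number=200)
print(f"list membership: {t_list:.4f}s, dict membership: {t_dict:.6f}s")

# Same result two ways: dict lookups, or a set intersection.
consistent = [row for row in dict_a if row in dict_b]
common = set(list_a) & set(list_b)
print(set(consistent) == common)
```

The set form is shorter when only membership matters; a dict is the better fit when a value needs to be stored alongside each key.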

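What about elements where only some of the field values match?

My approach: compare the elements that match on all four fields (fewer) with the elements that match on three fields (more). If an element is among the three-field matches but not among the four-field matches, its fourth field differs. The same reasoning extends to the other cases.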

dict_a_new = dict.fromkeys(list_a, True)
dict_b_new = dict.fromkeys(list_b, True)
# List of elements whose fields are all identical on both sides
consistent_1000 = [i for i in dict_a_new if i in dict_b_new]

# First three fields of the fully identical elements, as a list
new_consistent_1000_list = [(i[0], i[1], i[2]) for i in consistent_1000]
# The same three-field tuples as dict keys
new_consistent_1000_dict = dict.fromkeys(new_consistent_1000_list, True)

# First three fields (id, status, amount) of every element, as lists
id_status_amount_list_a = [(i[0], i[1], i[2]) for i in list_a]
id_status_amount_list_b = [(i[0], i[1], i[2]) for i in list_b]
# The same three-field tuples as dict keys
id_status_amount_dict_a = dict.fromkeys(id_status_amount_list_a, True)
id_status_amount_dict_b = dict.fromkeys(id_status_amount_list_b, True)

# Three-field tuples that appear on both sides, as a list
id_status_amount = [i for i in id_status_amount_dict_a if i in id_status_amount_dict_b]
# The same tuples as dict keys
id_status_amount_dict = dict.fromkeys(id_status_amount, True)

# Matches on the first three fields but not on all four -> the date differs.
# Note: date_1003 holds (id, status, amount) tuples, not the full rows.
date_1003 = [i for i in id_status_amount_dict if i not in new_consistent_1000_dict]
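The post stops at the date case. One possible way to get all six results with dictionary lookups is to index the G rows by id and classify each C row in a single pass. This is a sketch, not the author's script: the function name `reconcile` is hypothetical, rows are assumed to be `(id, status, amount, date)` tuples, and ids are assumed to be unique within each list.

```python
def reconcile(list_a, list_b):
    """Classify rows into the six result codes using dict/set lookups by id."""
    g_by_id = {row[0]: row for row in list_b}   # assumes ids are unique in G
    c_ids = {row[0] for row in list_a}
    results = {code: [] for code in (1000, 1001, 1002, 1003, 1004, 1005)}

    for a in list_a:
        b = g_by_id.get(a[0])
        if b is None:
            results[1004].append(a)             # in C, missing in G
        elif a[1] != b[1]:
            results[1001].append((a, b))        # status differs
        elif a[2] != b[2]:
            results[1002].append((a, b))        # amount differs
        elif a[3] != b[3]:
            results[1003].append((a, b))        # date differs
        else:
            results[1000].append((a, b))        # all four fields identical

    # In G, missing in C
    results[1005] = [b for b in list_b if b[0] not in c_ids]
    return results


# Small made-up example covering every category.
sample_c = [(1, "paid", 10, "d1"), (2, "paid", 10, "d1"), (3, "paid", 10, "d1"),
            (4, "paid", 10, "d1"), (5, "paid", 10, "d1")]
sample_g = [(1, "paid", 10, "d1"), (2, "void", 10, "d1"), (3, "paid", 11, "d1"),
            (4, "paid", 10, "d2"), (6, "paid", 10, "d1")]
for code, rows in reconcile(sample_c, sample_g).items():
    print(code, rows)
```

Each row is touched a constant number of times, so the work grows with the sum of the two list lengths rather than their product.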

Finally, here is how this optimization performed in an actual run:

[Figure: run of four test cases]

For technical discussion, feel free to add me on QQ/WeChat: 153132336 zy
