
Incrementally Computing the Mean, Variance, and Standard Deviation of Massive Data

This article is reposted from the blog:

http://www.calmkart.com/?p=369

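Preface:

I recently needed to compute the mean, variance, and standard deviation of a massive data set.

Loading all of it into memory and applying the textbook formulas directly is clearly not going to work, so I looked into whether these statistics could be computed incrementally.

In the end I wrote a general-purpose Python class that incrementally computes the mean, variance, and standard deviation of massive data.

It is based on the following formula derivation: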

[Figure: formula derivation for the incremental mean, variance, and standard deviation]
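
Read off from the code in this post, the update it performs can be written as below. This is a reconstruction from the code, not a transcription of the original derivation, and the symbols are my own notation: $n, \mu_h, \sigma_h$ are the historical count, mean, and standard deviation; $m, \mu_b, \sigma_b$ are those of the new batch; $\mu, \sigma$ are the merged results. The identity is exact when $\sigma_h$ and $\sigma_b$ are population standard deviations (dividing by the count, not the count minus one).

```latex
\begin{aligned}
\mu      &= \frac{n\,\mu_h + m\,\mu_b}{n + m}, \\
\sigma^2 &= \frac{n\left[\sigma_h^2 + (\mu - \mu_h)^2\right]
                + m\left[\sigma_b^2 + (\mu - \mu_b)^2\right]}{n + m}.
\end{aligned}
```

For a single new value $x$, set $m = 1$, $\mu_b = x$, and $\sigma_b = 0$.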

The code is as follows:

# -*- coding: utf-8 -*-
from __future__ import division
import numpy


class incre_std_avg():
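    '''
    Incrementally compute the mean, standard deviation, and variance of massive data.
    1. Data
    obj.avg is the mean
    obj.std is the standard deviation (the variance is obj.std ** 2)
    obj.n is the number of data points
    When initializing the object, pass the historical mean, historical standard
    deviation, and historical count (these can be omitted if the initial data set is empty).
    2. Methods
    obj.incre_in_list() takes a list of new data and updates avg, std, and n
    incrementally (for massive data, call this method in a loop, one batch at a time).
    obj.incre_in_value() takes a single new value and updates avg, std, and n
    incrementally (for massive data, call this method in a loop, one value at a time).
    '''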

    def __init__(self, h_avg=0, h_std=0, n=0):
        self.avg = h_avg
        self.std = h_std
        self.n = n

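    def incre_in_list(self, new_list):
        m = len(new_list)
        if m == 0:
            return  # nothing to merge; avoids numpy.mean([]) returning nan
        avg_new = numpy.mean(new_list)
        incre_avg = (self.n*self.avg + m*avg_new) / (self.n + m)
        # ddof=0 (population std): the merge formula below combines
        # population variances, and this matches what incre_in_value computes.
        # ddof=1 would also return nan for a one-element list.
        std_new = numpy.std(new_list, ddof=0)
        incre_std = numpy.sqrt((self.n*(self.std**2 + (incre_avg-self.avg)**2)
                                + m*(std_new**2 + (incre_avg-avg_new)**2)) / (self.n + m))
        self.avg = incre_avg
        self.std = incre_std
        self.n += m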

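    def incre_in_value(self, value):
        # Same merge as incre_in_list for a batch of one: the batch mean is
        # the value itself and the batch's own spread is zero.
        incre_avg = (self.n*self.avg+value)/(self.n+1)
        incre_std = numpy.sqrt((self.n*(self.std**2+(incre_avg-self.avg)
                                        ** 2)+(incre_avg-value)**2)/(self.n+1))
        self.avg = incre_avg
        self.std = incre_std
        self.n += 1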


if __name__ == "__main__":
    c = incre_std_avg()
    c.incre_in_value(0.05)
    # print() with a single argument works on both Python 2 and Python 3
    print(c.avg)
    print(c.std)
    print(c.n)
    c.incre_in_value(0.02)
    c.incre_in_list([0.5, 0.2, 0.3])
    print(c.avg)
    print(c.std)
    print(c.n)
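
One way to gain confidence in this kind of incremental update is to feed data through in chunks and compare the result with NumPy computed over the full array. The sketch below is a standalone illustration, not the class above: `merge_stats` and `chunks` are hypothetical helper names, the synthetic data stands in for a real large source, and population standard deviation (`ddof=0`) is assumed throughout.

```python
import numpy as np


def merge_stats(n, avg, std, batch):
    """Fold one batch into running (count, mean, population std)."""
    batch = np.asarray(batch, dtype=float)
    m = batch.size
    if m == 0:
        return n, avg, std  # empty batch: nothing changes
    avg_b = batch.mean()
    std_b = batch.std(ddof=0)  # population std, to match the merge formula
    total = n + m
    new_avg = (n * avg + m * avg_b) / total
    new_var = (n * (std ** 2 + (new_avg - avg) ** 2)
               + m * (std_b ** 2 + (new_avg - avg_b) ** 2)) / total
    return total, new_avg, np.sqrt(new_var)


def chunks(values, size):
    """Yield successive slices, standing in for reads from a large source."""
    for start in range(0, len(values), size):
        yield values[start:start + size]


# Synthetic data small enough to check against a full in-memory computation.
rng = np.random.default_rng(0)
data = rng.normal(loc=5.0, scale=2.0, size=10_000)

n, avg, std = 0, 0.0, 0.0
for chunk in chunks(data, 1_000):
    n, avg, std = merge_stats(n, avg, std, chunk)

# Compare the streamed statistics with the direct computation.
print(n, np.isclose(avg, data.mean()), np.isclose(std, data.std(ddof=0)))
```

Both comparisons should hold up to floating-point rounding. Only one chunk is held in memory at a time, which is the point of the incremental approach.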
           

Other references:

  1. https://blog.csdn.net/zdy0_2004/article/details/46822685
  2. https://www.cnblogs.com/June2005/p/11498392.html
  3. On using np.std, take care with the ddof parameter; see:
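
To make the `ddof` point concrete, here is a small sketch of what the parameter changes. `numpy.std` divides by `n - ddof`, and its default is `ddof=0`. The three-element array is just the sample list from the demo code. The merge formula used in this post combines population variances, so the choice of `ddof` for each batch affects the merged result.

```python
import numpy as np

data = np.array([0.5, 0.2, 0.3])
n = data.size

pop_std = np.std(data, ddof=0)     # divides by n (NumPy's default)
sample_std = np.std(data, ddof=1)  # divides by n - 1

# The two differ only by the factor sqrt(n / (n - 1)).
print(pop_std, sample_std,
      np.isclose(sample_std, pop_std * np.sqrt(n / (n - 1))))
```

The gap between the two shrinks as `n` grows, but for small batches it is large enough to skew a merged standard deviation.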
