天天看點

NumPy 之 案例(随機漫步)

import numpy as np      

The numpy.random module supplements(補充) the built-in Python random with functions for efficiently generating whole arrays of sample values from many kinds of probalility distributions. For example, you can get a 4x4 array of samples from the standard normal distribution using normal:

samples = np.random.normal(size=(4,4))
samples      
array([[-0.49777854,  1.01894039,  0.3542692 ,  1.0187122 ],
       [-0.07139068, -0.44245259, -2.05535526,  0.49974435],
       [ 0.80183078, -0.11299759,  1.22759314,  1.37571884],
       [ 0.32086762, -0.79930024, -0.31965109,  0.23004107]])      

Python's built-in random module, by contrast(對比), only samples one value at a time. As you can see from this benchmark, numpy.random is well over an order of magnitude faster for generating very large samples:

from random import normalvariate

n = 100000

'python 運作時間:'
%timeit samples = [normalvariate(0,1) for _ in range(n)]      
'python 運作時間:'      
127 ms ± 7.72 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)      
'np.random 運作時間:'
%timeit np.random.normal(size=n)      
'np.random 運作時間:'      
4.2 ms ± 277 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)      
%%time 

# 我後來還是喜歡 %%time 這樣的計時方式

from random import normalvariate 

n = 1000000

'Python run time:'
samples = [normalvariate(0,1) for _ in range(n)]      
Wall time: 1.08 s      

We say that these are pseudorandom numbers(僞随機數) because they are generated by an algorithim with deterministic behavior(确定行為的算法生成的) You can change NumPy's random number generation seed number generation seed using np.random.seed:

"cj還是不了解seed(), 是讓算法不改變嗎? 每次都是同一個?"

np.random.seed(1234)
"現在了解了, seed(n),叫做随機種子, n表示在接下來的n次(抽樣)的結果是一樣的"      
'cj還是不了解seed(), 是讓算法不改變嗎? 每次都是同一個?'      
'現在了解了, seed(n),叫做随機種子, n表示在接下來的n次(抽樣)的結果是一樣的'      

The data generation functions in numpy.random use a global random seed. To avoid global state, you can use numpy.random.RandomState to create a random number generator islolated(單獨的) from others.

rng = np.random.RandomState(1234)

rng.randn(10)      
array([ 0.47143516, -1.19097569,  1.43270697, -0.3126519 , -0.72058873,
        0.88716294,  0.85958841, -0.6365235 ,  0.01569637, -2.24268495])      

See Table 4-8 for a partial list of functions available in numpy.random. I wil give some examples of leveraging(利用) these function's ablility to generate large arrays of samples all at onece in the next section.

  • seed Seed the random number generator (不明白這個函數還是)
  • permutation Return a random pemutation(排列) of a sequence, or return a permuted range
  • shuffle Randomly permute(轉換) a sequence in-place (随機洗牌)
  • rand Draw samples from a uniform distribution U~[0, 1], rand(shape)(均勻分布, 每個值出現的可能性是一樣的)
  • uniform U~[0, 1], uniform(low=0, high=1, size)
  • randint Draw random integers from a given low-to-high range. (a,b) 邊界都是可以取到的
  • randn 标準正态分布\(N(\mu=0, \sigma=1)\), randn(shape)
  • normal 正态分布\(N(\mu, \sigma, size)\), normal(low=0, high=1, size)
  • binomial 二項分布
  • chisquare 卡方分布
  • beta
  • gamma
"permutaion(x), 産生0-x範圍内x個随機自然數的一個排列"
np.random.permutation(6)

"shuffle(seq) 将一個序列随機打亂, in-place 哦 "
arr1 = np.arange(6)
np.random.shuffle(arr1)
arr1      
'permutaion(x), 産生0-x範圍内x個随機自然數的一個排列'      
array([1, 2, 5, 0, 3, 4])      
'shuffle(seq) 将一個序列随機打亂, in-place 哦 '      
array([5, 2, 3, 0, 4, 1])      
"rand(shape) 0-1的均勻分布哦"
"shape=(2x2x3)-> 2裡的每個1是3,每個1裡面是1 "
"[[], []]"
"[  [  [6],[6],[6] ],   [ [6],[6],[6]  ]  ]"

np.random.rand(2,1,1)

"uniform()"
np.random.uniform(3,4,5)      
'rand(shape) 0-1的均勻分布哦'      
'shape=(2x2x3)-> 2裡的每個1是3,每個1裡面是1 '      
'[[], []]'      
'[  [  [6],[6],[6] ],   [ [6],[6],[6]  ]  ]'      
array([[[0.06152103]],

       [[0.8725525 ]]])      
'uniform()'      
array([3.35219682, 3.62783057, 3.51758469, 3.37434633, 3.64026243])      
"randn(shape), 标準正态分布"
np.random.randn(1,2,3)

"滿足normal(loc=0, scale=1, size=None) 均值80, 标準差10"
np.random.normal(80, 20, 6)      
'randn(shape), 标準正态分布'
      
array([[[0.49112636, 0.90638754, 0.05000051],
        [1.21431522, 0.67847748, 1.3797269 ]]])
      
'滿足normal(loc=0, scale=1, size=None) 均值80, 标準差10'
      
array([55.07130243, 56.34397557, 68.95608996, 31.40875572, 89.80741058,
       37.38567435])
      
"binorma(n, p, size), n次試驗, p次成功機率, 5個樣本量"
np.random.binomial(100, 0.4, 5)

"chisquare(10,2) 服從于自由度為10的卡方分布下的2個樣本"
np.random.chisquare(10, 2)      
'binorma(n, p, size), n次試驗, p次成功機率, 樣本量'
      
array([47, 42, 41, 34, 44])
      
array([20.07301382, 14.54581473])
      

Example:Random Walks

The simulation of random walks(随機漫步) provides an illustrative(執行個體) of utilizing(使用) array operations. Let's first consider a simple random walk starting at 0 with steps of 1 and -1 occuring(出現) with equal probalility. (1,-1 等機率出現)

Here is a pure Python way to implement a single random walk with 1000 steps using the built-in random module.

import random
import matplotlib.pyplot as plt


def random_walk(position=0, steps=1000, walk=[]):
    """
    position=0  # 初始位置
    walk = [] # 結果清單
    steps = 1000
    """
    for i in range(steps):
        state = 1 if random.randint(0,1) else -1
        position += state
        # 将每次位置存入結果清單中
        walk.append(position)
        
    return walk

# test
walk_result = random_walk()

"plot the first 100 values on one of these random walks:"
"預設折線圖"

plt.plot(walk_result[:100])      
'plot the first 100 values on one of these random walks:'
      
'預設折線圖'
      
[<matplotlib.lines.Line2D at 0x19329c95208>]
      

You might make the observation that walk is simply the

cumulative(累積的) sum of the random steps and could be evaluate as an array expression, Thus, I use the np.random module to draw 1000 coin flips at once, set these to 1 and -1, and compute the cumulative sum:

nsteps = 1000

"用size代替for循環, 面向數組程式設計哦"
draws = np.random.randint(0, 2, size=nsteps)

"np.where 三元很厲害, 代替if-else"
steps = np.where(draws > 0, 1, -1)

'cumsum 累積求和'
walk = steps.cumsum()

plt.plot(walk)      
'用size代替for循環, 面向數組程式設計哦'
      
'np.where 三元很厲害, 代替if-else'
      
'cumsum 累積求和'
      
[<matplotlib.lines.Line2D at 0x1932a032278>]
      

From this we can begin to extract(提取) statistics like the minmun and maxmun value along the walks trajectory(軌迹)

"walk的極值"
walk.min()
walk.max()

type(walk)      
'walk的極值'
      
-17
      
15
      
numpy.ndarray
      
# cj test cumsum()
cj_arr = np.array([[1,2,3],[4,5,6]])
cj_arr

"cumsum()累積求和, 傳回的是narray 可看到累積的過程哦"
np.cumsum(cj_arr, axis=0) # 往下, 顯示沒列的和

np.cumsum(cj_arr, axis=1) # 往右, 顯示每行的和      
array([[1, 2, 3],
       [4, 5, 6]])
      
'cumsum()累積求和, 傳回的是narray 可看到累積的過程哦'
      
array([[1, 2, 3],
       [5, 7, 9]], dtype=int32)
      
array([[ 1,  3,  6],
       [ 4,  9, 15]], dtype=int32)
      

A more complicated(更複雜的) statistic is the first crossing time, the step at which the random walk reaches a particular value. Here we might want to know how long it took the random walk to get at least 10 steps away from the orgin 0 in either direction. (距離原點為10, 需要多少次) np.abs(walk) >= 10 gives us a boolean array indicating(指明) where(是否) the walk has reached or exceeded 10, but we want the index of the first 10 or -10, Turn out(結果是), we can compute this using argmax, which returns the first index of maximum value in the boolean array(True is the maximum): -> argmax()傳回數組中, 最大值的第一個索引, 配合maximum使用

"查詢累積數組中, 絕對距離大于10的,的第一個值的索引 "
(np.abs(walk) >= 10).argmax()      
'查詢累積數組中, 絕對距離大于10的,的第一個值的索引 '
      
49
      

Note that(注意) using argmax here is not always efficient because it always makes a full scan of the array. -> 用argmax()來找最大值索引, 效率是不高的,因為需要周遊整個數組. In this special case, once a True is observed we know it to be the maximum value.

# cj test

np.max([1,3,2,5,7])

"maximum傳回一個數組, 逐個比較傳入數組的值,和第二個參數比較, quit其大,替換"
np.maximum([10,1,3,2,5,7],6)      
7
      
'maximum傳回一個數組, 逐個比較傳入數組的值,和第二個參數比較, quit其大,替換'
      
array([10,  6,  6,  6,  6,  7])
      
"待補充中..."
np.maximum(np.abs(walk), 10).argmax()      
530
      
# cj_arr1 = np.array([10,6,6,6,6,7]).argmax()
# cj_arr1      

Simulating Many Random Walk at Once

If your goal to simulate many random walks, say 5000 of them, you can generate all of the random walks with minor modiffcations(微小的修改) to the preceding code(之前的代碼). If passed a 2-tuple, the numpy.random functions will generate a two-dimensional array of draws, and we can compute the cumulative sum across the rows to compute all 5000 random walks in one shot:

# cj test
np.random.randint(0,2, size=(3,4))      
array([[0, 0, 0, 1],
       [0, 1, 0, 1],
       [0, 0, 1, 1]])
      
"一次生成500次行走記錄, 每次走1000不的大資料集5000 x 1000"
nwalks = 5000
nsteps = 1000

# 0 or 1
%time draws = np.random.randint(0, 2, size=(nwalks, nsteps))

"5000000萬次, 這耗時也太短了,厲害了, 1秒=10^3ms"

" >0, 1, <0, -1 "
steps = np.where(draws > 0, 1, -1)

"軸1, 列方向, 右邊, 按照每行累積"
walks = steps.cumsum(axis=1)

walks      
'一次生成500次行走記錄, 每次走1000不的大資料集5000 x 1000'
      
Wall time: 58 ms
      
'5000000萬次, 這耗時也太短了,厲害了'
      
' >0, 1, <0, -1 '
      
'軸1, 列方向, 右邊, 按照每行累積'
      
array([[ -1,   0,  -1, ...,  22,  21,  22],
       [ -1,  -2,  -3, ...,  52,  53,  54],
       [  1,   0,  -1, ..., -20, -19, -20],
       ...,
       [ -1,  -2,  -3, ..., -48, -47, -48],
       [  1,   2,   1, ...,  -8,  -7,  -8],
       [ -1,  -2,  -1, ..., -10, -11, -10]], dtype=int32)
      

Now, we can compute the maximun and minimum values obtained(獲得) over all of the walks: ->擷取最大值, 最小值

"整個數組的"

'最大值', walks.max()
'最小值', walks.min()      
'整個數組的'
      
('最大值', 112)
      
('最小值', -109)
      

Out of these walks, let's compute the minimum crossing time to 30 or -30. This is slightly tricky(稍微有些棘手) because not all 5000 of them reach 30. We can check this using the any method:

hist30 = (np.abs(walks) >= 30).any(axis=1)  # 按照行

hist30

'統計 number that hit 30 or -30, 有多少行'
hist30.sum()      
array([ True,  True,  True, ...,  True,  True,  True])
      
'統計 number that hit 30 or -30, 有多少行'
      
3441
      

We can use this boolean array to select out the rows of walks that actually cross the absolute 30 level an d call argmax across axis=1 to get the crossing times:

'cj待了解'
crossing_times = (np.abs(walks[hist30]) >= 30).argmax(1)
crossing_times.mean()      
'cj待了解'
      
497.102877070619
      

Feel free to experiment(積極地嘗試) with other distributions for the steps other than equal-sized coin flips(硬币試驗). You need only use a different random number generation function, like normal to generate normally distribute steps with some mean and standard deviation(标準差)

import sys

steps = np.random.normal(loc=0, scale=0.25, size=(5000, 1000))

"檢視這個對象占多少記憶體"

"{}的{}占用{}位元組".format(steps.shape, steps.dtype, sys.getsizeof(steps))      
'檢視這個對象占多少記憶體'
      
'(5000, 1000)的float64占用40000112位元組'
      

Conclusion

While(當然) much of the rest of the book will focus on building data wrangling(資料整理) skills with pandas, we will continue to work in a similar array-based style. In Appendix A, we will dig deeper(深入挖掘) into NumPy features to help you further develop your array computing skills.