天天看點

人分九等,數有階梯 - PostgreSQL 階品(顆粒)分析函數width_bucket, kmean應用

postgresql , width_bucket , 資料分布 , 包裹篩選 , 顆粒度篩選 , oracle , 階級分布 , kmean

從古至今,現實社會中,到處充滿了等級劃分,例如東漢史學家、文學家班固《漢書》《古今人表》把人分九等。九品量表”之中,分為上(上智).中(中人).下(下愚)三等。在每個等級中又分為:上上.上中.上下,中上.中中.中下,以及下上.下中和下下三等。

現代的一些見解:

第一等:聖人。已參破紅塵卻仍然悲天憫人,已近神界卻不孤芳自賞。

第二等:英雄。胸懷遠大,智慧超群,忍辱負重,力挽狂瀾,解群眾于水火。

第三等:才俊。獨立性強,才智過人,在科學、藝術領域對人類貢獻極大。

第四等:志士。徒有一腔熱血,滿腹才華,無奈命運多舛,加上自身的局限性,雖也做了一些事情,終是昙花一現。代表人物:屈原、鄭和、譚嗣同、魯迅、遇洛克、原《南方周末》有良知、有骨氣的記者及網上有正義感的遊俠。

第五等:仁人。善良、純淨,急民之所需,痛民之所痛。 願意放棄優越生活,去支援貧苦百姓的人。

第六等:大衆。善良純樸,見淺識陋,奔波勞碌,人窮志短。雖有美好意願,卻無正義精神,可為“希望工程”捐款,而遇到邪惡立刻明哲保身。既可推動文明,亦能充當犧牲品或殺人工具。

第七等:賤民。生活在底層,偷雞摸狗,渾渾噩噩,苟延殘喘,着實龌龊。代表人物:妓女、乞丐、騙子。如遇良好教育尚有希望。

第八等:暴君。上帝的玩偶,派來警示人類的。

第九等:小人。輕則小有才華,無病呻吟,蔑視苦難,以醜為美,誤導大衆,嘩衆取寵,玷污智慧,成就庸俗,如王家衛、金庸、轉型後的張藝謀、明星、娛記、《vogue》、纨绔子弟太子黨,重則奴顔婢膝,脅肩谄笑,口是心非,生性陰暗見不得陽光,如呂後、郭沫若,以及多數中國當代官僚。

人分九等,數有階梯 - PostgreSQL 階品(顆粒)分析函數width_bucket, kmean應用

除了人分九等,實際上我們還能看到各種的等級分布。例如:

1、對全國聯考成績進行分布透視,你的聯考成績拿出來,會落在第幾等呢?

2、氣溫分布,你所在的城市,全年的平均氣溫會在全國排第幾等呢?

3、雨量分布,你所在的城市,全年的降雨量會在全國排在第幾等呢?

4、包裹分揀,按重量、按體積進行分揀,友善物流的運輸。

5、商品顆粒篩選,例如大米、枸杞、羅漢果、水果、大閘蟹等商品,按顆粒度的大小,分為不同的等級。

人分九等,數有階梯 - PostgreSQL 階品(顆粒)分析函數width_bucket, kmean應用

6、收入等級,你的薪資水準落在第幾等呢?是不是戳中小心髒了,是不是又拖全國平均工資後腿了呢?

7、用水用電等級,現在水電都是階梯價,不同的階梯,定價都不一樣。

8、交稅也按收入分等級。

9、按每年接待的遊客數分幾個檔,評選出不同級别的景區。

人分九等,數有階梯 - PostgreSQL 階品(顆粒)分析函數width_bucket, kmean應用

10、對玩王者榮耀的時間進行統計,按遊戲時長,對人群進行歸類。

11、對淘寶店鋪的點選率、銷量資料進行統計,劃分店鋪等級。

還有好多分類或分級的例子。分級是為了更好的對資料進行歸類,友善資料的透視。

在資料庫中,存儲的通常是明細資料,如何進行等級劃分或者分揀呢?

接下來隆重推出postgresql的兩大歸類分析利器。

1、指定預設邊界和等級個數,傳回value所處等級。

适合求均勻分布的等級劃分,例如逾時的商品,按單價進行均勻分布的劃分,看看每種商品落在哪個消費區間。

2、指定預設邊界數組,傳回value所處等級。

适合求非均勻分布的等級劃分,例如求收入水準、學習成績的非均勻分布資料。60分以下為不及格,60-80為中,80-90為良,90-97為優,97以上為拔尖。

一個聚類算法。

人分九等,數有階梯 - PostgreSQL 階品(顆粒)分析函數width_bucket, kmean應用

postgresql kmeans插件:

<a href="https://github.com/digoal/blog/blob/master/201508/20150817_01.md">《k-means 資料聚集算法》</a>

淘寶店鋪每天都有點選率,銷量資料。對淘寶店鋪在100 ~ 5000次浏覽量均勻劃分為10個等級,低于100為0等,高于5000為11等。

1、設計表結構

2、生成正态分布的銷量資料

寫入測試資料

3、等級統計透視

在100 ~ 5000次浏覽量均勻劃分為10個等級,低于100為0等,高于5000為11等。

4、等級占比透視

還是以上資料,但是不按等值分布,而是用數組表示自定義區間分布。分布如下:

還是使用其那面的測試資料。對店鋪銷量劃分為10類,第一次劃分不使用種子,很多點都被歸為噪點,是以分級集中在正态資料分布的部分。

第二次劃分,使用聚類種子(中心點)。

透視

除了使用kmeans插件進行分類,還可以使用madlib插件,madlib插件是postgresql的開源機器學習庫。

<a href="https://github.com/apache/incubator-madlib">https://github.com/apache/incubator-madlib</a>

<a href="http://madlib.incubator.apache.org/">http://madlib.incubator.apache.org/</a>

when the desired output is categorical in nature, we use classification methods to build a model that predicts which of the various categories a new result would fall into. the goal of classification is to be able to correctly label incoming records with the correct class for the record.

example: if we had data that described various demographic data and other features of individuals applying for loans, and we had historical data that included what past loans had defaulted, then we could build a model that described the likelihood that a new set of demographic data would result in a loan default. in this case, the categories are “will default” or “won’t default” which are two discrete classes of output.

when the desired output is continuous in nature, we use regression methods to build a model that predicts the output value.

example: if we had data that described properties of real estate listings, then we could build a model to predict the sale value for homes based on the known characteristics of the houses. this is a regression problem because the output response is continuous in nature, rather than categorical.

here we are trying to identify groups of data such that the items within one cluster are more similar to each other than they are to the items in any other cluster.

example: in customer segmentation analysis, the goal is to identify specific groups of customers that behave in a similar fashion, so that various marketing campaigns can be designed to reach these markets. when the customer segments are known in advance this would be a supervised classification task. when we let the data itself identify the segments, this becomes a clustering task.

topic modeling is similar to clustering in that it attempts to identify clusters of documents that are similar to each other, but it is more specific to the text domain where it is also trying to identify the main themes of those documents.

also called market basket analysis or frequent itemset mining, this is attempting to identify which items tend to occur together more frequently than random chance would indicate, suggesting an underlying relationship between the items.

example: in an online web store, association rule mining can be used to identify what products tend to be purchased together. this can then be used as input into a product recommendation engine to suggest items that may be of interest to the customer and provide upsell opportunities.

descriptive statistics don’t provide a model and thus are not considered a learning method. however, they can be helpful in providing information to an analyst to understand the underlying data, and can provide valuable insights into the data that may influence choice of data model.

example: calculating the distribution of data within each variable of a dataset can help an analyst understand which variables should be treated as categorical variables, and which should be treated as continuous variables, including the sort of distribution the values fall in.

using a model without understanding the accuracy of that model can lead to a poor outcome. for that reason, it is important to understand the error of a model and to evaluate the model for accuracy on test data. frequently in data analysis, a separation is made between training data and test data solely for the purpose of providing statistically valid analysis of the validity of the model, and assessment that the model is not over-fitting the training data. n-fold cross validation is also frequently utilized.

1、oracle width_bucket不支援數組,隻支援均勻分布透視,不支援非均勻分布的資料透視。

例如聯考成績分布,分數從0分到750分都有,如果隻能均勻透視,沒法真正區分有意義的等級區間。

而使用postgresql width_bucket數組解決這個問題。例如 array[300, 400, 520, 580, 630, 690] 這樣可以根據實際情況進行透視,出來的透視結果是比較有意義的。

2、postgresql支援衆多機器學習算法。

3、postgresql 支援生成正态分布,随機分布,泊松分布的測試資料,便于測試。

人分九等,數有階梯 - PostgreSQL 階品(顆粒)分析函數width_bucket, kmean應用

<a href="https://www.postgresql.org/docs/10/static/pgbench.html">https://www.postgresql.org/docs/10/static/pgbench.html</a>

<a href="https://www.postgresql.org/docs/10/static/functions-math.html">https://www.postgresql.org/docs/10/static/functions-math.html</a>

<a href="https://docs.oracle.com/cd/b19306_01/server.102/b14200/functions214.htm">https://docs.oracle.com/cd/b19306_01/server.102/b14200/functions214.htm</a>

壓測資料,正泰分布

<a href="https://www.postgresql.org/docs/9.6/static/pgbench.html">https://www.postgresql.org/docs/9.6/static/pgbench.html</a>

<a href="https://github.com/digoal/blog/blob/master/201606/20160614_04.md">《在postgresql中如何生成測試kmean算法的資料》</a>

<a href="https://github.com/digoal/blog/blob/master/201506/20150618_01.md">《生成泊松、高斯、指數、随機分布資料 - postgresql 9.5 new feature - pg_bench improve, gaussian (standard normal) &amp; exponential distribution》</a>

<a href="https://github.com/digoal/blog/blob/master/201506/20150616_01.md">《postgresql 9.5 new feature - width_bucket return the bucket number》</a>

<a href="https://github.com/digoal/blog/blob/master/201303/20130319_01.md">《postgresql fdw mongo_fdw usage》</a>

繼續閱讀