Credit Scorecard (part 3 of 7)

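As Ant Group prepared to go public, Jack Ma gave a speech at the Bund in Shanghai. His core argument came down to one point: in the era of the global digital economy, there is one and only one financial advantage, and that is pure credit built on consumer big data.

We may as well call it data credit. It is more reliable than collateral, safer than a guarantee, and smarter than regulation. It is a forward-looking property right and the core collateral asset behind digital currency, and it determines the direction, speed, and scale of credit creation in the digital currency era. In short, whoever masters data credit controls the right to issue digital currency.

Judging data credit relies on financial risk-control models. More precisely, whoever masters risk-control modeling holds the right to issue digital currency. The credit scorecard is the most common risk-control model. It is a binary classifier built on a linear algorithm and the sigmoid function. It can automatically predict the probability that a customer is bad and quantify the effect of each variable, which supports decision-making by senior management.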

Students are welcome to take the video tutorial series on Python credit scorecard modeling (code included).

Link: https://edu.csdn.net/course/detail/30611

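Next, I cover part 3 of the credit scorecard series: variable selection. Variable selection is the most important groundwork before modeling. Many models that claim to rely on big data are built on thousands or tens of thousands of variables, which is unscientific. The dimensionality of a truly elegant model is deliberately designed, with neither too many variables nor too few. If the dimensionality is too high, the model carries high systemic and operational risk, runs slowly, and is inefficient. If it is too low, the model lacks predictive power, and KS and AUC may stay low. ----By Toby, a licensed consumer finance modeling expert

I hope this lesson helps beginners.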

Variables Selection in Predictive Analytics

Predictive Analytics: Variables Selection – by Roopam

The following story goes back to when I had just started my transition from physics to business. I met this investment banker* in his mid-thirties at a Friday night party. After gulping down a few pints of beer, his mood became a bit somber and he told me how much he hated his job. However, his plan was to work his ass off until he retired at 45 and then do everything that made him happy. I was thoroughly confused: how could someone debar himself from an emotion – happiness – for so many years and rediscover it later? I was wondering about the recipe for happiness – raindrops on roses and whiskers on kittens. An individual’s happiness is a tricky thing; however, I shall attempt to tackle this issue in a later article on logistic regression. For now, let us explore how states measure the collective well-being of their people. I shall use this topic of population well-being to explore an interesting topic in analytical scorecard development: variables selection.

Variables Selection – Lessons from GDP & GNH

The most popular measure for national prosperity, unanimously projected by economists and TV channels, is Gross Domestic Product (GDP). The equation for measuring GDP as taught in macroeconomics 101 is:

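GDP = C + I + G + (X - M)

where C is private consumption, I is gross investment, G is government spending, X is exports, and M is imports.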

Clearly, five factors/variables govern GDP according to this equation. At first look, GDP seemed incomplete to me as a measure of national well-being. All of its variables come from commerce. They are important, but they cannot be the only factors in a country’s well-being, more so in a highly diverse and complicated country like India.

Gross National Happiness Index – The Story of Bhutan Naresh

OK, so what else do we have? A lesser-known index is Gross National Happiness (GNH). GNH originated in Bhutan, which measures its progress through it. The term was coined and implemented by Jigme Singye Wangchuck. This name immediately takes me back to the early-nineties live telecast of the SAARC summit by India’s national broadcaster Doordarshan (DD). The old-timer Hindi commentators were referring to a modest man in bathrobe-like attire as ‘Bhutan Naresh’ – King of Bhutan. At first glance, he did not fit well with the power horses of the south Asian region. Nevertheless, he seems to have devised a more holistic metric to measure his country’s well-being. GNH is a combination of the following broad categories:

1. Living standard & income

2. Health coverage

3. Psychological well-being

4. Time spent at work and relaxing

5. Good governance

6. Schooling & education

7. Cultural diversity

8. Community vitality

9. Environmentalism and conservation

GNH has 72 variables in total, each measured on a scale of 0 to 1, such as daily hours of sleep and trust in media; hmmm, not a bad start! You could do your own research on GNH and let me know what you feel about it. Actually, we can work out our own formula for a GNH-like metric. The idea is to select the right variables to build your model!

Variables Selection in Credit Scoring

In data mining and statistical model building exercises such as credit scoring, the variables selection process is driven by statistical significance, a reasonably automated process in advanced software. However, the variables are still created and measured by humans. High-impact analyses in businesses are still driven by hunches. Human intelligence is not obsolete yet.

In one of the projects I did with a financial organization, the results of credit risk analysis and scoring led to a redesign of the application form. Application forms are a major source of data about the borrower. However, nobody wants to fill out a lengthy form, so keeping the form to an optimal size helps ensure that the borrower provides accurate information. The idea is to select the right variables and ensure accurate measurement.
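
As a rough illustration of significance-based selection, the sketch below screens two candidate variables with a Pearson chi-square test on a 2x2 table of variable level against good/bad outcome. The variable names, the counts, and the 0.05 cutoff are all made up for the example; the article does not say which test or threshold its software uses.

```python
import math


def chi_square_2x2(table):
    """Pearson chi-square statistic and p-value for a 2x2 table.

    table = [[a, b], [c, d]]: rows are the two levels of a candidate
    variable, columns are counts of (good, bad) borrowers.
    Assumes every row and column total is non-zero.
    """
    (a, b), (c, d) = table
    n = a + b + c + d
    row_totals = (a + b, c + d)
    col_totals = (a + c, b + d)
    chi2 = 0.0
    for i, row in enumerate(table):
        for j, observed in enumerate(row):
            expected = row_totals[i] * col_totals[j] / n
            chi2 += (observed - expected) ** 2 / expected
    # With 1 degree of freedom the chi-square tail has a closed form.
    p_value = math.erfc(math.sqrt(chi2 / 2))
    return chi2, p_value


# Hypothetical (good, bad) counts for two candidate variables.
candidates = {
    "owns_home": [[450, 50], [400, 100]],
    "has_landline": [[425, 75], [425, 75]],
}

ALPHA = 0.05  # assumed significance cutoff
selected = []
for name, table in candidates.items():
    chi2, p_value = chi_square_2x2(table)
    if p_value < ALPHA:
        selected.append(name)

print(selected)
```

In this toy data the bad rate differs between the two levels of `owns_home` but not between the two levels of `has_landline`, so only the first variable passes the screen. A statistics package would normally do this step, and a human still has to decide which variables are worth measuring in the first place.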

There are several aspects regarding variables but I will mention just one of them here (coarse classing).

Coarse Classing in Credit Scoring

One of my favorite activities as a kid was going to a shoe store and getting my feet measured every summer before school started. The shoe shops had a strange, miniature, slide-like device to measure foot size. It was fun to see my feet grow from one size to another every year or two. The growth was quantized, i.e., you are size 2 or 3, never 2.5 or 2.7. Converting measurements such as 2.5 and 2.7 to 3 is called grouping, bucketing or classing. It is an integral part of creating scorecards, and you will find it in all the books I listed in the first part of this blog series.

I have been part of several heated discussions on the relevance of coarse classing in scorecard development throughout my career. In most, if not all, academic articles you will rarely see coarse classing used as a technique during model development. Quite a few academicians and practitioners believe, for good reason, that coarse classing results in loss of information. However, in my opinion, coarse classing has the following advantages over using the raw measurement of a variable.

1. It reduces random noise that exists in raw variables – similar to averaging and yes, you lose some information here.

2. It handles extreme events – on two extremes of a variable – much better where you have thin data.

3. It handles the non-linear relationship between dependent and independent variable without a lot of effort of variable transformation from the analyst.
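
The three points above can be sketched in a few lines. The cut points, class labels, and sample records below are hypothetical, chosen only to show how a raw variable is mapped to coarse classes and how the bad rate per class can expose a non-linear pattern; in practice the cut points come from the data and from business judgment.

```python
from bisect import bisect_right

# Hypothetical cut points for an "age" variable.
CUTS = [25, 35, 50]
LABELS = ["<25", "25-34", "35-49", "50+"]


def coarse_class(value, cuts=CUTS, labels=LABELS):
    """Map a raw measurement to its coarse class label."""
    return labels[bisect_right(cuts, value)]


def bad_rate_by_class(records, cuts=CUTS, labels=LABELS):
    """records: iterable of (raw_value, is_bad) pairs.

    Returns {label: bad rate} for every class that has at least one record.
    """
    totals = {label: 0 for label in labels}
    bads = {label: 0 for label in labels}
    for value, is_bad in records:
        label = coarse_class(value, cuts, labels)
        totals[label] += 1
        bads[label] += int(is_bad)
    return {lab: bads[lab] / totals[lab] for lab in labels if totals[lab]}


# Made-up sample of (age, defaulted?) pairs.
sample = [(19, 1), (22, 0), (24, 1), (28, 0), (31, 0), (33, 1),
          (38, 0), (42, 0), (47, 0), (55, 0), (63, 1), (71, 0)]

rates = bad_rate_by_class(sample)
for label, rate in rates.items():
    print(f"{label}: {rate:.2f}")
```

Each class pools several raw values, which averages out noise, and open-ended classes such as `<25` and `50+` absorb the thin data at the extremes. Because every class gets its own bad rate, a relationship that falls and then rises again with age needs no explicit transformation of the raw variable.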

Sign-off Note

We are halfway through this series on ‘Analytical Scorecard Development’ and I am thoroughly enjoying writing it. I hope as a reader you are on the same page. Scorecard building is highly technical, and I have tried to discuss some aspects with easy-to-understand examples. However, to manage the length of the article, I am not able to get into the details. I must say that I love the details!

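Students are welcome to take the course series on Python financial risk-control scorecard models and data analysis, which covers logistic regression, scorecards, tree models (xgboost, lightgbm, catboost), neural network algorithms, credit user data analysis, user profiling, and more.

Link: https://edu.csdn.net/combo/detail/1927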
