
Chapter 1. Classification -- 01. Introduction to Classification

01.Introduction to Classification

Classification is a core problem of machine learning. Now machine learning is a field that grew out of artificial intelligence within computer science, and the goal is to teach computers by example. Now if you want to teach the computer to recognize images of chairs,

then we give the computer a whole bunch of chairs and tell it which ones are chairs and which ones are not, and then it’s supposed to learn to recognize chairs, even ones it hasn’t seen before. It’s not like we tell the computer how to recognize a chair.

We don’t tell it a chair has four legs and a back and a flat surface to sit on and so on, we just give it a lot of examples. Now machine learning has close ties to statistics;

in fact it’s hard to say how predictive statistics differs from machine learning.

These fields are very closely linked right now. Now the problem I just told you about

is a classification problem where we’re trying to identify chairs, and the way we set the problem up is that we have a training set of observations – in other words, like labeled images here – and we have a test set that we use only for evaluation.

We use the training set to learn a model of what a chair is, and the test set consists of images that are not in the training set; we want to be able to predict, for each of those images, whether or not it is a chair. And it could be that some of the labels in the training set are noisy. In fact, one of these labels is noisy, right here. And that’s okay, because as long as there isn’t too much noise,

we should still be able to learn a model for a chair; it just won’t be able to classify perfectly and that happens. Some prediction problems are just much harder than others, but that’s okay because we just do the best we can from the training data,

and in terms of the size of the training data: the more the merrier.

We want as much data as we can to train these models. Now how do we represent an image of a chair or a flower or whatever in the training set? Now I just zoomed in on a little piece of a flower here,

and you can see the pixels in the image, and we can represent each pixel in the image according to its RGB values, its red-green-blue values. So we get three numbers representing each pixel in the image.

So you can represent the whole image as a collection of RGB values.

So the image becomes this very large vector of numbers, and in general when doing machine learning, we need to represent each observation in the training and test sets as a vector of numbers, and the label is also represented by a number.
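As a sketch of this representation (assuming NumPy; the pixel values here are made up for illustration), a tiny image can be flattened into one long feature vector:

```python
import numpy as np

# A tiny 2x2 "image": each pixel is three numbers, its red, green, and
# blue values (0-255). The values are invented for illustration.
image = np.array([
    [[255, 0, 0], [0, 255, 0]],
    [[0, 0, 255], [255, 255, 255]],
])

# Flatten into a single feature vector, one observation for the training set.
x = image.reshape(-1)
print(x.shape)  # (12,): 2 pixels * 2 pixels * 3 color values
```

A real photo would produce a vector with millions of entries, but the idea is the same: every observation becomes one long row of numbers.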

Here the label is minus 1, because the image is not a chair, and the images of chairs would all get label plus 1.

Here’s another example: this is a problem that comes from New York City’s power company, where they want to predict which manholes are going to have a fire.

So we would represent each manhole as a vector, and here are the components of the vector.

Right, the first component might be the number of serious events that the manhole had last year, like a fire or a smoking manhole (that’s very serious), or an explosion or something like that.

And then maybe we would actually have a category for the number of serious events last year, so only three of these five events were very serious.

The number of electrical cables in the manhole, the number of electrical cables that were installed before 1930, and so on and so forth. And you can make – you know in general, the first step is to figure out how to represent your data like this as a vector, and you can make this vector very large.
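The manhole representation described above might be sketched like this (the feature names follow the lecture, but the numbers and the ordering are my own invention):

```python
# One manhole as a record of raw counts (made-up numbers).
manhole = {
    "serious_events_last_year": 3,   # fires, smoking manholes, explosions
    "num_cables": 12,
    "num_cables_pre_1930": 5,
}

# A fixed feature order turns each record into a plain numeric vector,
# which is the form machine learning algorithms expect.
feature_order = ["serious_events_last_year", "num_cables", "num_cables_pre_1930"]
x = [manhole[name] for name in feature_order]
print(x)  # [3, 12, 5]
```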

You could include lots of factors if you like, that’s totally fine. Computationally, things are easier if you use fewer features, but then you risk leaving out information.

So there’s a trade-off right there that you’re going to have to worry about, and we’ll talk more about that later. But in any case, you can’t do machine learning if you don’t have your data represented in the right way, so that’s the first step.

Now you’d think that manholes with more cables, more recent serious events, and so on would be more prone to explosions and fires in the near future, but what combination of them would give you the best predictor?

How do you combine them together? You could add them all up, but that might not be the best thing. You could give them all weights and add them all up, but how do you know the weights?

And that’s what machine learning does for you. It tells you what combinations to use to get the best predictors. But for the manhole problem, we want to use the data from the past to predict the future.
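The “give them weights and add them up” idea just described is a linear score; the learning algorithm’s job is to pick the weights. A minimal sketch, with invented weights standing in for learned ones:

```python
# Weighted sum of the manhole features. These weights are made up;
# a learning algorithm would choose them from the training data.
weights = [2.0, 0.1, 0.5]
features = [3, 12, 5]   # serious events, cables, pre-1930 cables

score = sum(w * f for w, f in zip(weights, features))
prediction = 1 if score > 0 else -1   # positive score -> predict an event
print(score, prediction)  # 9.7 1
```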

So for instance, the feature data might be from 2014 and before, and the label would be 1 if the manhole had an event in 2015.

So that’s our training set, and then for the test set, the feature data would be from 2015 and before and then we would try to predict what would happen in 2016.
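That temporal setup can be sketched as follows; the yearly event counts and the single cumulative feature are my own illustration, not the real NYC feature set:

```python
# Serious events per year for one manhole (made-up counts).
events_by_year = {2013: 1, 2014: 0, 2015: 2, 2016: 0}

def make_example(cutoff_year):
    # Feature: total serious events up to and including the cutoff year.
    features = [sum(n for year, n in events_by_year.items() if year <= cutoff_year)]
    # Label: +1 if the manhole had an event in the following year, else -1.
    label = 1 if events_by_year.get(cutoff_year + 1, 0) > 0 else -1
    return features, label

train_x, train_y = make_example(2014)  # features through 2014, label from 2015
test_x, test_y = make_example(2015)    # features through 2015, label from 2016
print(train_x, train_y, test_x, test_y)  # [1] 1 [3] -1
```

The key point is that training and test examples are built the same way, just shifted forward in time, so the model never peeks at the future it is asked to predict.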

So just to be formal about it, we have each observation being represented by our set of features,

and the features are also called predictors or covariates or explanatory variables or independent variables; whatever they are, you can choose whatever terminology you like.

And then we have the labels, which are called y. Even more formally, we’re given a training set of feature-label pairs (xi, yi), and there are n of them, and we want to create a classification model f that can predict a label y for a new x.
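Written out, that setup is:

```latex
% Training set of n feature-label pairs
\{(x_i, y_i)\}_{i=1}^{n}, \qquad x_i \in \mathbb{R}^p, \quad y_i \in \{-1, +1\}
% Goal: learn f : \mathbb{R}^p \to \mathbb{R} and, for a new x, predict
\hat{y} = \operatorname{sign}\big(f(x)\big)
```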

Let’s take a simple version of the manhole example, where we have only two features: the year the oldest cable was installed and the number of events that happened last year.

So each observation can be represented as a point on a two-dimensional graph, which means I can plot the whole dataset.

So something like this, where each point here is a manhole and I’ve labelled it with whether or not it had a serious event in the training set.

So these are the manholes that didn’t have events, and these are the ones that did.

And then I’m going to try to create a function here that’s going to divide the space into two pieces, where on one side of the decision boundary, I’m going to predict that there’s going to be an event,

and on the other side of the decision boundary, I predict there will be no event. So this decision boundary is actually just the equation where the function is 0,

and then where the function is positive we’ll predict positive and where the function is negative we’ll predict negative. And so this is going to be a function of these two variables, the oldest cable and then the number of events last year.
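A minimal sketch of such a decision function for the two features; the coefficients are invented for illustration, whereas in practice the learning algorithm would fit them from data:

```python
# Toy decision function f of the two manhole features. The decision
# boundary is where f = 0: positive side -> predict an event,
# negative side -> predict no event. Coefficients are made up.
def f(oldest_cable_year, events_last_year):
    return -0.05 * (oldest_cable_year - 1950) + 1.0 * events_last_year - 1.0

def predict(oldest_cable_year, events_last_year):
    return 1 if f(oldest_cable_year, events_last_year) > 0 else -1

print(predict(1910, 3))  # very old cables, several events -> 1
print(predict(2000, 0))  # newer cables, no events -> -1
```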

And the same idea holds for the computer vision problem that we discussed earlier.

We’re trying to create this decision boundary that’s going to chop the space into two pieces, where on one side of the decision boundary we would predict positive, and then on the other side we’d predict negative.

And the trick is, how do we create this decision boundary? How do we create this function f? Okay, so given our training data, we want to create our classification model f that can make predictions.

The machine learning algorithm is going to create the function f for you, and no matter how complicated that function f is, the way to use it is not very complicated at all.

The way to use it is just this: the predicted value of y for a new x that you haven’t seen before is just the sign of that function f. Classification is for yes or no questions.

You can do a lot if you can answer yes or no questions: handwriting recognition, speech recognition, biometrics, document classification, spam detection, predicting credit default risk, credit card fraud detection, detecting customer churn, predicting medical outcomes. So for instance, think about handwriting recognition. For each letter on a page, we’re going to evaluate whether it’s the letter A, yes or no.

And if you’re doing like spam detection, right, the spam detector on your computer has a machine learning algorithm in it. Each email that comes in has to be evaluated as to whether or not it’s spam.

Credit defaults: right, whether or not you get a loan depends on whether the bank predicts that you’re going to default on your loan, yes or no. And in my lab, we do a lot of work on predicting medical outcomes. We want to know whether something will happen to a patient within a particular period of time. Here’s a list of common classification algorithms. Most likely, unless you’re interested in developing your own algorithms,

you never need to program these yourself; they’re already programmed in by someone else. If you’re just going to be a consumer of these, you can use the code that’s already written.
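As a consumer of these methods, using one can look like the sketch below. The library choice (scikit-learn) and the toy data are my own assumptions; the lecture doesn’t name a specific implementation.

```python
from sklearn.linear_model import LogisticRegression

# Toy training data: [oldest cable year, serious events last year] per
# manhole, with labels +1 (had an event) / -1 (did not). All made up.
X_train = [[1910, 4], [1925, 3], [1995, 0], [2005, 1]]
y_train = [1, 1, -1, -1]

# The algorithm is already implemented; we just fit it and predict.
clf = LogisticRegression()
clf.fit(X_train, y_train)
predictions = clf.predict([[1920, 2], [2000, 0]])
print(predictions)  # expect positive for the old manhole, negative for the new one
```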

And all these are, you know – we’re going to cover a good chunk of these methods, and in order to use them effectively you’ve really got to know what you’re doing, otherwise you could really run into some issues. 

But if you can figure out how to use these, you’ve got a really powerful tool on your hands.

