天天看點

皮爾遜相關系數 相似系數_皮爾遜相關系數 散點圖 (Scatterplots) 散點圖矩陣 (Scatter Plot Matrix) 皮爾遜相關系數 (Pearson Correlation Coefficient) 摘要 (Summary)

皮爾遜相關系數 相似系數

資料科學和機器學習統計 (STATISTICS FOR DATA SCIENCE AND MACHINE LEARNING)

In the last post, we analyzed the relationship between categorical variables and categorical and continuous variables. In this case, we will analyze the relation between two ratio level or continuous variables.

在上一篇文章中,我們分析了類别變量與類别變量和連續變量之間的關系。 在這種情況下,我們将分析兩個比率級别或連續變量之間的關系。

Peason’s Correlation, sometimes just called correlation, is the most used metric for this purpose, it searches the data for a linear relationship between two variables.

Peason的相關性 (有時也稱為相關性 )是為此目的最常用的度量标準,它在資料中搜尋兩個變量之間的線性關系。

Analyzing the correlations is one of the first steps to take in any statistics, data analysis, or machine learning process, it allows data scientists to early detect patterns and possible outcomes of the machine learning algorithms, so it guides us to choose better models.

分析相關性是進行任何統計,資料分析或機器學習過程的第一步之一,它使資料科學家能夠及早發現機器學習算法的模式和可能的結果,進而指導我們選擇更好的模型。

Correlation is a measure of relation between variables, but cannot prove causality between them.

相關性是變量之間關系的度量,但不能證明變量之間的因果關系。

Some examples of random correlations that exist in the world are found un this website.

在此網站上可以找到世界上存在的随機相關性的一些示例。

皮爾遜相關系數 相似系數_皮爾遜相關系數 散點圖 (Scatterplots) 散點圖矩陣 (Scatter Plot Matrix) 皮爾遜相關系數 (Pearson Correlation Coefficient) 摘要 (Summary)

This example is taken from https://tylervigen.com/spurious-correlations. 此示例取自https://tylervigen.com/spurious-correlations。

In the case of the last graph, it’s clearly not true that one of these variables implies the other one, even having a correlation of 99.79%

在最後一張圖的情況下,顯然這些變量中的一個隐含了另一個變量,即使相關性為99.79%

散點圖 (Scatterplots)

To take the first look to our dataset, a good way to start is to plot pairs of continuous variables, one in each coordinate. Each point on the graph corresponds to a row of the dataset.

首先看一下我們的資料集,一個好的開始方法是繪制成對的連續變量,每個坐标中一個。 圖上的每個點都對應于資料集的一行。

Scatterplots give us a sense of the overall relationship between two variables:

散點圖使我們大緻了解兩個變量之間的整體關系:

  • Direction: positive or negative relation, when one variable increases the second one increases or decreases?

    方向:正向或負向關系,當一個變量增加時,第二個變量增加或減少?

  • Strength: how much a variable increases when the second one increases.

    強度:第二個變量增加時變量增加多少。

  • Shape: The relation is linear, quadratic, exponential…?

    形狀:該關系是線性,二次方,指數...?

Using scatterplots is a fast technique for detecting outliers if a value is widely separated from the rest, checking the values for this individual will be useful.

如果值與其他值之間的距離較遠,則使用散點圖是檢測異常值的快速技術,檢查該個人的值将非常有用。

We will go with the most used data frame when studying machine learning, Iris, a dataset that contains information about iris plant flowers, and the objective of this one is to classify the flowers into three groups: (setosa, versicolor, virginica).

在研究機器學習時,我們将使用最常用的資料架構Iris,該資料集包含有關鸢尾花的資訊,而該資料集的目的是将花分為三類:(setosa,versicolor,virginica)。

皮爾遜相關系數 相似系數_皮爾遜相關系數 散點圖 (Scatterplots) 散點圖矩陣 (Scatter Plot Matrix) 皮爾遜相關系數 (Pearson Correlation Coefficient) 摘要 (Summary)

Scatter plot of two iris dataset variables, self-generated. 自生成的兩個虹膜資料集變量的散點圖。

The objective of the iris dataset is to classify the distinct types of iris with the data that we have, to deliver the best approach to this problem, we want to analyze all the variables that we have available and their relations.

虹膜資料集的目的是用我們擁有的資料對虹膜的不同類型進行分類,以提供解決此問題的最佳方法,我們要分析所有可用變量及其關系。

In the last plot we have the petal length and width variables, and separate the distinct classes of iris in colors, what we can extract from this plot is:

在最後一個繪圖中,我們具有花瓣的長度和寬度變量,并用顔色分隔了虹膜的不同類别,我們可以從該繪圖中提取出以下内容:

  • There’s a positive linear relationship between both variables.

    這兩個變量之間存在正線性關系。

  • Petal length increases approximately 3 times faster than the petal width.

    花瓣長度的增加速度大約是花瓣寬度的3倍。

  • Using these 2 variables the groups are visually differentiable.

    使用這兩個變量,這些組在視覺上是可區分的。

散點圖矩陣 (Scatter Plot Matrix)

To plot all relations at the same time and on the same graph, the best approach is to deliver a pair plot, it’s just a matrix of all variables containing all the possible scatterplots.

要同時在同一張圖上繪制所有關系,最好的方法是繪制一對圖,它隻是包含所有可能的散點圖的所有變量的矩陣。

As you can see, the plot of the last section is in the last row and third column of this matrix.

如您所見,最後一部分的圖形位于此矩陣的最後一行和第三列中。

皮爾遜相關系數 相似系數_皮爾遜相關系數 散點圖 (Scatterplots) 散點圖矩陣 (Scatter Plot Matrix) 皮爾遜相關系數 (Pearson Correlation Coefficient) 摘要 (Summary)

Pair plot of two iris dataset variables, self-generated. 自生成兩個虹膜資料集變量的配對圖。

In this matrix, the diagonal can show distinct plots, in this case, we used the distributions of each one of the iris classes.

在此矩陣中,對角線可以顯示不同的圖,在這種情況下,我們使用了每個虹膜類别的分布。

Being a matrix, we have two plots for each combination of variables, there’s always a plot combining the same variables inverse of the (column, row), the other side of the diagonal.

作為一個矩陣,對于每種變量組合,我們都有兩個圖,總有一個圖将(列,行)的反變量(對角線的另一側)的相同變量組合在一起。

Using this matrix we can obtain all the information about all the continuous variables in the dataset easily.

使用此矩陣,我們可以輕松擷取有關資料集中所有連續變量的所有資訊。

皮爾遜相關系數 (Pearson Correlation Coefficient)

Scatter plots are an important tool for analyzing relations, but we need to check if the relation between variables is significant, to check the lineal correlation between variables we can use the Person’s r, or Pearson correlation coefficient.

散點圖是分析關系的重要工具,但是我們需要檢查變量之間的關系是否顯着,要檢查變量之間的線性相關性,可以使用Person的r或Pearson相關系數。

The range of the possible results of this coefficient is (-1,1), where:

該系數可能的結果範圍是(-1,1) ,其中:

  • 0 indicates no correlation.

    0表示沒有相關性。

  • 1 indicates a perfect positive correlation.

    1表示完全正相關。

  • -1 indicates a perfect negative correlation.

    -1表示完美的負相關。

To calculate this statistic we use the following formula:

要計算此統計資訊,我們使用以下公式:

皮爾遜相關系數 相似系數_皮爾遜相關系數 散點圖 (Scatterplots) 散點圖矩陣 (Scatter Plot Matrix) 皮爾遜相關系數 (Pearson Correlation Coefficient) 摘要 (Summary)

Peason’s correlation formula, self-generated. Peason的相關公式,自生成。

相關系數的檢驗顯着性 (Test significance of correlation coefficient)

We need to check if the correlation is significant for our data, as we already talked about hypothesis testing, in this case:

我們已經讨論過假設檢驗,在這種情況下,我們需要檢查相關性對我們的資料是否有意義:

  • H0 = The variables are unrelated, r = 0

    H0 =變量無關,r = 0

  • Ha = The variables are related, r ≠ 0

    Ha =變量相關,r≠0

This statistic has a t-student distribution with (n-2) degrees of significance, being n the number of values.

此統計資訊的t學生分布的有意義度為(n-2)個,值為n個值。

The formula for the t value is the following, and we need to compare the result with the t-student table.

t值的公式如下,我們需要将結果與t學生表進行比較。

皮爾遜相關系數 相似系數_皮爾遜相關系數 散點圖 (Scatterplots) 散點圖矩陣 (Scatter Plot Matrix) 皮爾遜相關系數 (Pearson Correlation Coefficient) 摘要 (Summary)

Peason’s correlation t-student formula, self-generated. Peason的相關t型學生公式,自生成的。

If our result is bigger than the table value we reject the null hypothesis and say that the variables are related.

如果我們的結果大于表值,則我們拒絕原假設,并說變量是相關的。

确定系數 (Coefficient of determination)

To calculate how much the variation of a variable can affect the variation of the other one, we can use the coefficient of determination, calculated as the r². This measure will be very important in regression models.

為了計算一個變量的變化能對另一個變量的變化産生多大的影響,我們可以使用确定系數 ,計算為r² 。 該度量在回歸模型中将非常重要。

摘要 (Summary)

In the last post, we talked about correlation for categorical data and mentioned that the correlation for continuous variables is easier, in this case, we explained how to perform this correlation analysis and how to check if it’s statistically significant.

在上一篇文章中,我們讨論了分類資料的相關性,并提到了連續變量的相關性更容易,在這種情況下,我們說明了如何執行此相關性分析以及如何檢查其是否具有統計意義。

Adding to the typical analysis of the statistical significance will give a better understanding about how to use each variable.

除了對統計意義進行典型分析之外,還将對如何使用每個變量有更好的了解。

This is the eleventh post of my particular #100daysofML, I will be publishing the advances of this challenge at GitHub, Twitter, and Medium (Adrià Serra).

這是我特别#100daysofML第十一屆文章中,我将出版在GitHub上,Twitter和中型企業(這一挑戰的進步阿德裡亞塞拉 )。

https://twitter.com/CrunchyML

https://twitter.com/CrunchyML

https://github.com/CrunchyPistacho/100DaysOfML

https://github.com/CrunchyPistacho/100DaysOfML

翻譯自: https://medium.com/ai-in-plain-english/pearson-correlation-coefficient-14c55d32c1bb

皮爾遜相關系數 相似系數