Buying a Soccer Team: A Machine Learning Approach

An approach that beats randomly guessing or hand-picking players from a pool of 18,000 professionals.

As sports become a vital part of our lives, they have also become a hot market for investors seeking better returns, audience engagement, and visibility. The surge in sports viewership has led to more tournaments, and capitalizing on them is a difficult task for an investor. We took up the challenge of helping major investors pick the best players from among 18,000 soccer players to build a dream team that can compete with and outperform other clubs in the major leagues. We leveraged machine learning algorithms to classify potential team members for our club and to estimate the budget an investor needs to optimize their market gains. The result is a strategy for building the best possible team within the investor's budget limit of 1 billion Euros.

INTRODUCTION

We have a FIFA dataset containing a few columns named rating, release clause, and wages. We assume these variables will not be available in upcoming out-of-time datasets. They can be used in various ways, such as rating a player as a strong performer, a moderate one, or one who is not up to the mark; they can also determine which players should be invited to club gatherings, events, and so on. We built two models. The first uses supervised learning on the rating variable, which we turned into a classification problem by splitting it into two classes: a rating greater than or equal to 70 (a potential club member) and a rating below 70. We chose 70 as our threshold because most major clubs field only players rated above 70; to compete with them, we restrict ourselves to players above that threshold. The second model predicts the annual cost to investors of offering a club membership: it uses the predicted rating class from our best classifier instead of the actual rating, and takes a combination of release_clause and annual wages (the cost to investors) as the dependent variable.

DATASET

We are using the FIFA 2019 and 2020 data from the Kaggle FIFA complete player dataset, which contains 18k+ unique players and 100+ attributes extracted from the latest editions of FIFA. It contains:

  • Files present in CSV format.

  • FIFA 2020: 18,278 unique players and 104 attributes for each player. (Test dataset)

  • FIFA 2019: 17,770 unique players and 104 attributes for each player. (Train dataset)

  • Player positions, with the role in the club and in the national team.

  • Player attributes with statistics such as Attacking, Skills, Defense, Mentality, GK Skills, etc.

  • Player personal data like Nationality, Club, DateOfBirth, Wage, Salary, etc.

DATA CLEANING

  • In some places, the two datasets store the same features with different data types. After reading the data dictionary, we brought them into sync.

  • Some variables have in-built formulas, so we corrected their formatting.

  • We removed ‘sofifa_id’, ‘player_url’, ‘short_name’, ‘long_name’, ‘real_face’, ‘dob’, ‘gk_diving’, ‘gk_handling’, ‘gk_kicking’, ‘gk_reflexes’, ‘gk_speed’, ‘gk_positioning’ and ‘body_type’ based on dictionary definitions or repeated columns as they add no useful impact in our analysis.

  • We converted the overall rating into 2 binary classes, with rating ≥ 70 as class 1 (the threshold many big clubs use to recruit their players); this serves as our dependent variable.

EXPLORATORY DATA ANALYSIS

We first considered various interesting statistics for performing our exploratory data analysis.

  • Univariate statistics: the percentage of missing values across the whole dataset (used to treat missing values), and, for the continuous variables, count, mean, std, min, max, skewness, kurtosis, unique values, missing values, and IQR, together with their distributions.

  • Bivariate statistics: correlation among the features and a t-test for continuous variables; a chi-square test and Cramer's V for categorical variables.

Univariate

We performed univariate analysis on the continuous variables to get the sense of the distribution of different fields in our dataset. According to our observation (mean, std, skewness, kurtosis, etc.), we observed many key features that follow a normal distribution. Moreover, the Interquartile range (IQR) was used to detect outliers using Tukey’s method.

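Tukey's fences can be sketched as follows (NumPy; k = 1.5 is the conventional multiplier, and the sample values are illustrative):

```python
import numpy as np

def tukey_outliers(x, k=1.5):
    """Flag values outside Tukey's fences [Q1 - k*IQR, Q3 + k*IQR]."""
    q1, q3 = np.percentile(x, [25, 75])
    iqr = q3 - q1
    return (x < q1 - k * iqr) | (x > q3 + k * iqr)

ratings = np.array([60, 62, 63, 64, 65, 66, 67, 68, 99])
mask = tukey_outliers(ratings)  # only the 99 lies outside the fences
```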
For the categorical variables, the univariate analysis consists of their count, unique values, categories with maximum counts (i.e., top), their frequency, and the number of missing values they have. From the categorical table, we can see that player_tags, loaned_from, nation_position, player_traits have more than 54% of missing values. It would not be easy to impute these with any promising values.

Bivariate

For continuous variables

We built a correlation matrix to gauge the extent of the linear relationship between rating and the other explanatory variables, and to identify variables that could be excluded at later stages. We used the seaborn package in Python to create a heat map of the matrix.

T-test

We also performed a t-test to check whether the mean of each variable when rating = 1 differs significantly from its mean when rating = 0. After this stage, we removed variables that were either not significant or had no correlation at all with the dependent variable.
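A minimal sketch of this per-feature test (SciPy; synthetic values stand in for a real feature, and Welch's variant is used so equal variances are not assumed):

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
# Hypothetical feature values, split by the binary rating class.
feature_class1 = rng.normal(loc=70.0, scale=5.0, size=200)  # rating = 1
feature_class0 = rng.normal(loc=55.0, scale=5.0, size=200)  # rating = 0

# Welch's t-test: is the feature mean significantly different between classes?
t_stat, p_value = stats.ttest_ind(feature_class1, feature_class0, equal_var=False)
keep_feature = p_value < 0.05  # keep the variable only if significant
```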

For categorical variables

We performed a chi-square test to check the significance of each categorical variable against the dependent variable, rating. The table below contains the corresponding p-values; we found that preferred_foot is not significant in our analysis.

To find the correlation between categorical variables with the dependent variable, we applied Cramer’s V rule.


To find the correlation between the categorical variables and the dependent variable, we applied Cramer's V.
V equals the square root of chi-square divided by the sample size n times m, where m is the smaller of (rows − 1) and (columns − 1): V = SQRT(χ² / (n·m)).

  • Interpretation: V may be viewed as the association between two variables as a percentage of their maximum possible variation. V² is the mean square canonical correlation between the variables. For 2-by-2 tables, V equals the chi-square-based phi measure of association.

  • Symmetry: V is a symmetrical measure; it does not matter which variable is the independent one.

  • Data level: V may be used with nominal data or higher.

  • Values: Ranges from 0 to 1.

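A small helper implementing the formula above (SciPy's chi2_contingency computes the chi-square statistic; the example tables are illustrative):

```python
import numpy as np
from scipy.stats import chi2_contingency

def cramers_v(table):
    """Cramer's V = sqrt(chi2 / (n * m)), with m = min(rows - 1, cols - 1)."""
    chi2, _, _, _ = chi2_contingency(table, correction=False)
    n = table.sum()
    m = min(table.shape[0] - 1, table.shape[1] - 1)
    return float(np.sqrt(chi2 / (n * m)))

perfect = np.array([[50, 0], [0, 50]])        # perfect association -> V = 1
independent = np.array([[25, 25], [25, 25]])  # no association -> V = 0
```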
In this scenario, we kept the columns that showed a decent correlation with the dependent variable: 'club_new', 'Pos', 'attack_rate', and 'nation'.

FEATURE ENGINEERING:

1. Re-categorizing/Imputing Variables

  • Since team_jersey_number and nation_jersey_number are not actually continuous variables, we decided to treat them as categorical variables.

  • We further imputed the missing values of team_position with 'not played' and re-categorized the players into defender, attacker, goalkeeper, resting, mid-fielder, substitute, and not played, reducing 29 unique values to 7 levels.

  • We conjecture that a goalkeeper will have the minimum values for 'pace', 'shooting', 'passing', 'dribbling', 'defending', and 'physic', so we imputed those fields accordingly.

  • Moreover, two variables, nationality and club, have very high cardinality. Based on their volume and event rate, we re-categorized them into low-cardinality variables.

2. Creating Variables:

  • From the data, we observed that 'player_positions' captures a player's multiple playing positions, so we assigned each player the total count of on-field positions they can play, stored in 'playing_positions'.

  • A player’s work_rate is given by his attack and defense rate; thus, we have separated them into variables.

  • We also calculated the length of time each player has been associated with their club, to better capture their loyalty to the club.

  • We also used one-hot encoding to convert the categorical variables into a form that can be fed to ML algorithms, helping them make better predictions.
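The one-hot step can be sketched with pandas (the column names 'Pos' and 'attack_rate' come from the variables kept earlier; the values are illustrative):

```python
import pandas as pd

# Hypothetical slice of the engineered categorical features.
players = pd.DataFrame({
    "Pos": ["attacker", "goalkeeper", "defender", "attacker"],
    "attack_rate": ["High", "Low", "Medium", "High"],
})

# One-hot encode: each category becomes its own 0/1 indicator column.
encoded = pd.get_dummies(players, columns=["Pos", "attack_rate"])
```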

3. MODEL 1

Here, Y = rating, with a population event rate of 31.23% (class 1).

3.1. Logistic Regression:

For the logistic regression model, we first performed the classification without regularization, followed by ridge (L2) and lasso (L1) regularization. L1-regularized logistic regression requires solving a convex optimization problem; however, standard algorithms for solving convex optimization problems do not scale well enough to handle the large datasets encountered in many practical settings.

The objective of logistic regression with a penalty applied is to minimize the loss function:

minimize over w, b:  Σ_i log(1 + exp(−y_i (w⋅x_i + b))) + λ||w||_1   (L1 penalty)
minimize over w, b:  Σ_i log(1 + exp(−y_i (w⋅x_i + b))) + (λ/2)||w||_2²   (L2 penalty)

The best results from running the logistic regression models before and after regularization (L1 and L2) can be summarized below:

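The comparison can be sketched as follows (scikit-learn; synthetic data stands in for the engineered FIFA features, and C = 1.0 is an assumed penalty strength):

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the engineered features (event rate ~31%).
X, y = make_classification(n_samples=2000, n_features=20, n_informative=8,
                           weights=[0.69, 0.31], random_state=0)
X_tr, X_val, y_tr, y_val = train_test_split(X, y, test_size=0.3, random_state=0)

models = {
    "l2": LogisticRegression(penalty="l2", C=1.0, max_iter=1000),       # ridge
    "l1": LogisticRegression(penalty="l1", C=1.0, solver="liblinear"),  # lasso
}
val_auc = {name: roc_auc_score(y_val, m.fit(X_tr, y_tr).predict_proba(X_val)[:, 1])
           for name, m in models.items()}
```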
3.2. KNN:

kNN is a case-based learning method that keeps all the training data for classification. One evaluation standard for any algorithm is its performance. kNN is a simple yet convincingly effective classification method, which motivates us to build a kNN model that improves efficiency while preserving classification accuracy.

Looking at Figure 1, a training dataset of 11 data points with two classes {square, triangle} is distributed in a 2-dimensional data space. Using Euclidean distance as the similarity measure, data points with the same class label tend to lie close to each other within a local region.

For instance, if we take the region k = 3, represented by the solid circle, and apply majority voting among the classes, our data point {circle} is classified as a triangle. However, if we increase k to 5, represented by the dotted circle, the data point is classified as a square. This motivates us to optimize our k-Nearest Neighbors algorithm to find the k with the minimal classification error.

Experiment:

We initially trained our k-NN model with k = 1, splitting our data 70%-30% into training and validation sets. From Table 2, we observe that the training accuracy is 1, which implies the model fits the training data perfectly; moreover, the accuracy and AUC on the test data are higher than on the validation data, which is indicative of overfitting. We therefore proceeded with parameter tuning.

Optimization:

We used the elbow method to find the lowest error on the training data. Searching over k, we observed the lowest error rate at k = 7. Although the optimized results performed better on the train and validation sets, our test AUC decreased.
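The elbow search can be sketched as follows (scikit-learn; on synthetic data, so the best k found here need not be 7):

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

X, y = make_classification(n_samples=1000, n_features=10, n_informative=6,
                           random_state=1)
X_tr, X_val, y_tr, y_val = train_test_split(X, y, test_size=0.3, random_state=1)

# Validation error rate for each candidate k (odd values avoid voting ties).
errors = {}
for k in range(1, 22, 2):
    knn = KNeighborsClassifier(n_neighbors=k).fit(X_tr, y_tr)
    errors[k] = 1.0 - knn.score(X_val, y_val)

best_k = min(errors, key=errors.get)  # k with the lowest validation error
```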

Even though the test accuracy is reduced, the precision-recall for the test set has increased, indicating that our model classifies class 1 (players rated above 70), our target class, better.

3.3. DECISION TREE:

The decision tree method is a powerful statistical tool for classification, prediction, interpretation, and data manipulation that has several potential applications in many fields.

Using decision tree models has the following advantages:

  • Simplifies complex relationships between input variables and target variables by dividing original input variables into significant subgroups.

    通過将原始輸入變量分成重要的子組,簡化了輸入變量和目标變量之間的複雜關系。

  • A non-parametric approach with no distributional assumptions, so it is easy to understand and interpret.

The main disadvantage is that it can be subject to overfitting and underfitting, particularly when using a small data set.

Experiment:

We trained a Decision Tree classifier from the sklearn library without passing any parameters. From the table, we observed that the model overfits the data, so we tuned its parameters to get optimized results.

Optimization:

We worked with the following parameters:

  • criterion: string, optional (default="gini"):

The function to measure the quality of a split: "gini" for the Gini impurity and "entropy" for the information gain.
  • max_depth: int or None, optional (default=None):

The maximum depth of the tree. If None, then nodes are expanded until all leaves are pure or until all leaves contain less than min_samples_split samples.

  • min_samples_split: int, float, optional (default=2):

The minimum number of samples required to split an internal node. If a float, it is interpreted as a fraction of the total number of samples.

  • min_weight_fraction_leaf: float, optional (default=0.0):

The minimum weighted fraction of the sum total of weights (of all the input samples) required to be at a leaf node. Samples have equal weight when sample_weight is not provided.

From the experiments above, we see that Gini outperforms entropy across all variants of the experimental parameters, so our criterion is Gini. Similarly, for the other parameters, max_depth = 10, min_samples_split = 17.5, and min_weight_fraction_leaf = 0 give higher accuracy. Training our model with these parameters, we observe no overfitting, and we capture more true positives in the class 1 category.
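A sketch of this search (scikit-learn on synthetic data; the grids are assumptions, and since sklearn's min_samples_split takes an int count or a float fraction, the 17.5 above presumably means 17.5% of samples, i.e. 0.175):

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=1000, n_features=10, n_informative=6,
                           random_state=2)

param_grid = {
    "criterion": ["gini", "entropy"],
    "max_depth": [5, 10, None],
    "min_samples_split": [2, 0.175],        # float = fraction of samples
    "min_weight_fraction_leaf": [0.0, 0.05],
}
search = GridSearchCV(DecisionTreeClassifier(random_state=0), param_grid,
                      cv=3, scoring="roc_auc")
search.fit(X, y)
```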

3.4. SUPPORT VECTOR MACHINES:

The folklore view of SVMs is that they find an "optimal" hyperplane as the solution to the learning problem. The simplest formulation of SVM is the linear one, where the hyperplane lies in the space of the input data x.

In this case, the hypothesis space is a subset of all hyperplanes of the form:

f(x) = w⋅x + b.

Hard Margin Case:

The objective of the maximum-margin separating hyperplane is to find:

minimize (1/2)||w||²  subject to  y_i (w⋅x_i + b) ≥ 1 for all i

Soft Margin Case:

Slack variables are part of the objective function too:

minimize (1/2)||w||² + C Σ_i ξ_i  subject to  y_i (w⋅x_i + b) ≥ 1 − ξ_i and ξ_i ≥ 0 for all i

The cost coefficient C>0 is a hyperparameter that specifies the misclassification penalty and is tuned by the user based on the classification task and dataset characteristics.

RBF SVMs

In general, the RBF kernel is a reasonable first choice. This kernel nonlinearly maps samples into a higher-dimensional space, so it, unlike the linear kernel, can handle the case when the relation between class labels and attributes is nonlinear. Furthermore, the linear kernel is a special case of RBF since the linear kernel with a penalty parameter Ĉ has the same performance as the RBF kernel with some parameters (C, γ). The second reason is the number of hyperparameters which influences the complexity of model selection.

Experiments:

We fitted a linear SVM classifier to our training data without tuning it for soft margins. The observed results do look promising.

The reason for the good score was that the data was almost linearly separable most of the time with very few misclassifications.

Optimization:

To train our model efficiently, we ran a grid search over linear and radial basis function kernels with varying C and γ. From the grid search, we obtained the best linear estimator as:

SVC(C=1, cache_size=200, class_weight=None, coef0=0.0, decision_function_shape='ovr', degree=3, gamma='auto_deprecated', kernel='linear', max_iter=-1, probability=True, random_state=None, shrinking=True, tol=0.001, verbose=False)

And for the radial basis function kernel, we got the best estimator as:

SVC(C=100, cache_size=200, class_weight=None, coef0=0.0, decision_function_shape='ovr', degree=3, gamma=0.001, kernel='rbf', max_iter=-1, probability=True, random_state=None, shrinking=True, tol=0.001, verbose=False)
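The search itself can be sketched as follows (scikit-learn on synthetic data; the C and γ grids are assumptions):

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVC

X, y = make_classification(n_samples=400, n_features=10, n_informative=6,
                           random_state=3)

# One sub-grid per kernel, mirroring the search described above.
param_grid = [
    {"kernel": ["linear"], "C": [0.1, 1, 10]},
    {"kernel": ["rbf"], "C": [1, 10, 100], "gamma": [0.001, 0.01, 0.1]},
]
search = GridSearchCV(SVC(probability=True), param_grid, cv=3, scoring="roc_auc")
search.fit(X, y)
best = search.best_estimator_
```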

Since the generalization error (expected loss) is used to approximate the population error, and the RBF kernel model has the smallest validation error among the models, it is our best model: it fits the data better than the rest.

The RBF kernel maps the data into a higher (potentially infinite) dimensional space, which helped our model stand out. The Precision-Recall curve shows how well the positive class is predicted, with an AUC of 0.961.

4. MODEL 2:

Here, X is the same as in Model 1 (including the predicted rating from Model 1), and Y = release clause + 52 × wage, the cost to investors (the wage is weekly, hence the factor of 52).

After selecting significant variables from the univariate and bivariate analysis, as before, we plotted scatter plots of the independent variables against the dependent variable.

It is clearly visible that they follow a relationship, but it does not seem linear. We confirmed this by fitting a linear model.

4.1. Linear Model:

Results:

R square (train): 0.54

R square (validation): 0.55

R square (test): 0.54

R square measures how close we are to perfect prediction. Here, R square is not good.

Checking linearity from the residuals: the residuals should be randomly scattered, but we found that they are not. This means a linear model would never be a good fit for this data.

4.2. Decision Trees: a better choice than linear models in this scenario.

Results (Baseline):

Train data: R square 0.99, RMSE 0.05

Validation data: R square 0.54, RMSE 8.05

Test data: R square 0.59, RMSE 7.35

There was a clear indication of overfitting; the model was not performing as expected. Therefore, we tried a grid search over min_split, tree_depth, min_weight_fraction_leaf, and the learning criterion.

As shown above, entropy performed better, with min_split = 3 and max_depth = 15.

Results after Grid Search: (Main Model)

Train data: R square 0.85, RMSE 4.40

Validation data: R square 0.69, RMSE 6.59

Test data: R square 0.70, RMSE 6.26

The R-squared values look far better now, the RMSE is low, and the overfitting problem is solved.

Hence, Decision Trees performed better at predicting the cost to investors.

Final Strategy:

The final step was to devise a strategy for picking players for our team, keeping in mind:

  • Rating should be greater than 70 (i.e., class 1).

  • A budget of 1 billion Euros and a squad of around 30 players.

First, we selected only the players whose rating exceeded the threshold of 70, leaving 5,276 players.

Second, we performed a decile-style analysis of the cost to investors: from the remaining pool, we formed buckets of approximately 30 players each and sorted the buckets by cost to investors in descending order.
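The bucketing can be sketched with pandas (synthetic lognormal costs, in millions of Euros, stand in for the predicted cost to investors):

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(42)
# Hypothetical pool: the 5,276 players rated above 70, with a predicted cost.
pool = pd.DataFrame({"cost": rng.lognormal(mean=2.5, sigma=1.0, size=5276)})

# Sort by cost (descending) and cut into buckets of ~30 players each.
pool = pool.sort_values("cost", ascending=False).reset_index(drop=True)
pool["bucket"] = pool.index // 30

# Cost of fielding a whole 30-man squad drawn from each bucket.
bucket_cost = pool.groupby("bucket")["cost"].sum()

# First (most expensive) bucket whose full squad fits a 1,000 M EUR budget.
first_affordable = bucket_cost[bucket_cost <= 1000].index.min()
```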

Here, we can observe that the amount needed to pick the whole team from the first bucket is 3.45 billion Euros, which is out of budget; that means we cannot simply pick the top 30 players. The amount needed to pick the team from the 11th bucket is 0.945 billion Euros, which is within our budget. However, picking all the players from this bucket alone would be the wrong strategy, as we would leave behind almost 300 higher-valued players in the buckets above it. So, the best solution is to pick 8-10 core players from the top buckets and the rest of the players from the medium- and low-valued buckets.

This decision can be made easily from the above analysis; it is up to the investors and team managers to decide what kind of players they want on their team.

5. CONCLUSION:

In this work, we constructed two models that leverage machine learning algorithms to benefit investors: the first meaningfully classifies players as good performers, and the second regresses their cost against the investor's budget. The resulting classification-and-regression pipeline is a new selection model for building a squad of players that can outperform other teams. Ultimately, we have narrowed down the process of selecting players for a club, which is considerably better than selecting at random.

Future scope: we could also apply time series techniques, since both of our dependent variables, rating and cost, depend on previous years' data. For example, if a player is rated 85 in Dec '19, his rating in Jan '20 would likely be around 85 ± 3, so time series techniques might be useful for this data.

  1. Guo, Gongde & Wang, Hui & Bell, David & Bi, Yaxin. (2004). KNN Model-Based Approach in Classification.

  2. Song, Yan-yan & Lu, Ying. (2015). Decision tree methods: applications for classification and prediction.

  3. https://towardsdatascience.com/how-to-tune-a-decision-tree-f03721801680

  4. Apostolidis-Afentoulis, Vasileios. (2015). SVM Classification with Linear and RBF kernels. doi:10.13140/RG.2.1.3351.4083.

Translated from: https://towardsdatascience.com/buying-a-soccer-team-a-machine-learning-approach-283f51d52511