Weka Series (Repost): Attribute Selection

2023-04-23 06:07:15

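In this section we look at attribute selection. In data mining research we often need a distance measure to tell how far apart two samples are, and that distance is computed from the samples' attribute values. Different attributes carry different weight in the sample space, that is, they are not equally correlated with the class, so it is worth filtering out some attributes or assigning each attribute a weight. Attribute selection methods arose to meet this need.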

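InfoGain and GainRatio are among the most common attribute selection methods, and also the easiest to understand. They work much like the split criterion used to build a decision tree: the more information an attribute carries about the class, the higher the weight it receives. Another approach selects attributes by correlation (Correlation-based Feature Subset Selection for Machine Learning). For the details of how it works, see the paper, which can be found online.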
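To make the two measures concrete, here is a minimal plain-Java sketch that computes information gain and gain ratio from a table of counts. It is not Weka's implementation; the class name `InfoGainSketch`, its method names, and the toy counts (taken from the classic 14-instance weather data, used here only for illustration) are all assumptions of this example.

```java
final class InfoGainSketch {

    /** Shannon entropy, in bits, of a distribution given as raw counts. */
    static double entropy(int[] counts) {
        int total = 0;
        for (int c : counts) total += c;
        if (total == 0) return 0.0;
        double h = 0.0;
        for (int c : counts) {
            if (c > 0) {
                double p = (double) c / total;
                h -= p * (Math.log(p) / Math.log(2.0));
            }
        }
        return h;
    }

    /** Row sums: how many samples take each attribute value. */
    static int[] valueTotals(int[][] counts) {
        int[] totals = new int[counts.length];
        for (int v = 0; v < counts.length; v++)
            for (int c : counts[v]) totals[v] += c;
        return totals;
    }

    /** Column sums: how many samples belong to each class. */
    static int[] classTotals(int[][] counts) {
        int[] totals = new int[counts[0].length];
        for (int[] row : counts)
            for (int c = 0; c < row.length; c++) totals[c] += row[c];
        return totals;
    }

    /** H(class) - H(class | attribute), where counts[v][c] is the number
     *  of samples with attribute value v and class c. */
    static double infoGain(int[][] counts) {
        int[] perValue = valueTotals(counts);
        int total = 0;
        for (int n : perValue) total += n;
        if (total == 0) return 0.0;
        double conditional = 0.0;
        for (int v = 0; v < counts.length; v++)
            conditional += ((double) perValue[v] / total) * entropy(counts[v]);
        return entropy(classTotals(counts)) - conditional;
    }

    /** Information gain divided by the entropy of the attribute itself,
     *  which penalizes attributes with many distinct values. */
    static double gainRatio(int[][] counts) {
        double splitInfo = entropy(valueTotals(counts));
        return splitInfo == 0.0 ? 0.0 : infoGain(counts) / splitInfo;
    }

    public static void main(String[] args) {
        // Rows are attribute values, columns are {yes, no} class counts.
        int[][] outlook = {{2, 3}, {4, 0}, {3, 2}}; // sunny, overcast, rainy
        int[][] windy = {{6, 2}, {3, 3}};           // false, true
        System.out.println("outlook gain  = " + infoGain(outlook));
        System.out.println("outlook ratio = " + gainRatio(outlook));
        System.out.println("windy gain    = " + infoGain(windy));
    }
}
```

On these counts the gain for outlook comes out at roughly 0.247 bits and for windy at roughly 0.048 bits, so a ranking by information gain would place outlook first.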

Here is a simple attribute selection example:

package com.csdn;

import java.io.File;

import weka.attributeSelection.InfoGainAttributeEval;
import weka.attributeSelection.Ranker;
import weka.core.Instances;
import weka.core.converters.ArffLoader;

public class SimpleAttributeSelection {

    public static void main(String[] args) {
        try {
            // Load the training data from an ARFF file.
            File file = new File("C:\\Program Files\\Weka-3-6\\data\\segment-challenge.arff");
            ArffLoader loader = new ArffLoader();
            loader.setFile(file);
            Instances trainIns = loader.getDataSet();

            // Always set the class index before using the instances;
            // otherwise an exception is thrown when they are used.
            trainIns.setClassIndex(trainIns.numAttributes() - 1);

            // Score every attribute by its information gain.
            InfoGainAttributeEval eval = new InfoGainAttributeEval();
            eval.buildEvaluator(trainIns);

            // Ranker returns the attribute indices ordered by score.
            Ranker rank = new Ranker();
            int[] attrIndex = rank.search(eval, trainIns);

            StringBuilder attrIndexInfo = new StringBuilder("Selected attributes:");
            StringBuilder attrInfoGainInfo = new StringBuilder("Ranked attributes:\n");
            for (int i = 0; i < attrIndex.length; i++) {
                attrIndexInfo.append(attrIndex[i]).append(",");
                attrInfoGainInfo.append(eval.evaluateAttribute(attrIndex[i]))
                        .append("\t")
                        .append(trainIns.attribute(attrIndex[i]).name())
                        .append("\n");
            }

            System.out.println(attrIndexInfo);
            System.out.println(attrInfoGainInfo);
        } catch (Exception e) {
            e.printStackTrace();
        }
    }
}

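In this example I used the InfoGain evaluator for feature selection. InfoGainAttributeEval computes the information gain of each attribute. Weka pairs every attribute evaluator with a search method; here we use the simplest one, the Ranker class, which simply sorts the attributes. The search method has further options, such as the set of attributes to search over and a threshold, and you can configure them in detail if you need to.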

Finally, the program prints the results: the ranked attribute indices and the InfoGain value of each attribute.
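The ranking step and the kind of options mentioned above (a threshold, a limit on how many attributes to keep) can be sketched in plain Java as follows. This is an illustration of the idea only, not Weka's Ranker: the class `RankerSketch`, the method `rank`, the "at or above the threshold" cutoff rule, and the sample scores are all assumptions of this example. In Weka itself the corresponding options are set on the Ranker object, for instance through `setThreshold` and `setNumToSelect`.

```java
import java.util.ArrayList;
import java.util.List;

final class RankerSketch {

    /**
     * Returns attribute indices ordered by descending score.
     * Attributes scoring below the threshold are dropped, and at most
     * numToSelect indices are returned (a negative value keeps all).
     */
    static int[] rank(double[] scores, double threshold, int numToSelect) {
        List<Integer> kept = new ArrayList<>();
        for (int i = 0; i < scores.length; i++) {
            if (scores[i] >= threshold) kept.add(i);
        }
        // Stable sort: attributes with equal scores keep their index order.
        kept.sort((a, b) -> Double.compare(scores[b], scores[a]));

        int n = numToSelect < 0 ? kept.size() : Math.min(numToSelect, kept.size());
        int[] result = new int[n];
        for (int i = 0; i < n; i++) result[i] = kept.get(i);
        return result;
    }

    public static void main(String[] args) {
        // Hypothetical per-attribute scores, e.g. information gain values.
        double[] scores = {0.247, 0.029, 0.152, 0.048};
        for (int index : rank(scores, 0.04, -1)) {
            System.out.println(index + "\t" + scores[index]);
        }
    }
}
```

With the sample scores and a threshold of 0.04, attribute 1 is discarded and the remaining attributes come back in the order 0, 2, 3.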

This article is from a CSDN blog. Please credit the source when reposting: http://blog.csdn.net/anqiang1984/archive/2009/04/04/4048177.aspx
