Data Mining Notes, Chapter 1: Introduction

Textbook: Data Mining: Concepts and Techniques (2nd edition), by Jiawei Han and Micheline Kamber, China Machine Press (2007)

Lecture 1: Introduction

1) Why data mining?

Necessity is the mother of invention.

2) What is data mining?

Data mining (knowledge discovery from data): extracting or mining knowledge from large amounts of data

Extraction of interesting (non-trivial, implicit, previously unknown and potentially useful) patterns or knowledge from huge amounts of data

Alternative names: Knowledge discovery (mining) in databases (KDD)

Steps of a KDD Process

Learning the application domain: relevant prior knowledge and goals of application

Creating a target data set: data selection

Data cleaning and preprocessing (may take 60% of the effort!)

Data reduction and transformation: find useful features, dimensionality/variable reduction, invariant representation

Choosing functions of data mining: summarization, classification, regression, association, clustering

Choosing the mining algorithm(s)

Data mining: search for patterns of interest

Pattern evaluation and knowledge presentation: visualization, transformation, removing redundant patterns, etc.

Use of discovered knowledge
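
The steps above can be sketched as a small pipeline. This is only an illustrative sketch, not something taken from the textbook: the records, the field names, the age-group feature and the `min_mean` threshold are all made-up assumptions, and each function stands in for a much larger step in a real KDD process.

```python
from statistics import mean

# Hypothetical raw records; None marks a missing value.
raw_records = [
    {"customer": "a", "age": 34, "spend": 120.0},
    {"customer": "b", "age": None, "spend": 80.0},
    {"customer": "c", "age": 51, "spend": 300.0},
    {"customer": "d", "age": 29, "spend": None},
    {"customer": "e", "age": 45, "spend": 260.0},
]

def select(records, fields):
    """Creating a target data set: keep only the fields of interest."""
    return [{f: r[f] for f in fields} for r in records]

def clean(records):
    """Data cleaning: drop records that contain a missing value."""
    return [r for r in records if all(v is not None for v in r.values())]

def transform(records):
    """Data transformation: derive an age-group feature from the raw age."""
    return [
        {"age_group": "under_40" if r["age"] < 40 else "40_plus",
         "spend": r["spend"]}
        for r in records
    ]

def mine(records):
    """Data mining (summarization): mean spend per age group."""
    groups = {}
    for r in records:
        groups.setdefault(r["age_group"], []).append(r["spend"])
    return {group: mean(values) for group, values in groups.items()}

def evaluate(patterns, min_mean):
    """Pattern evaluation: keep only groups whose mean spend is high enough."""
    return {g: m for g, m in patterns.items() if m >= min_mean}

target = select(raw_records, ["age", "spend"])
cleaned = clean(target)
features = transform(cleaned)
patterns = mine(features)
interesting = evaluate(patterns, min_mean=200.0)
print(interesting)  # {'40_plus': 280.0}
```

Dropping incomplete records is just one possible cleaning policy; filling in missing values would be an equally valid choice and would change the result.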

Architecture: Typical Data Mining System

3) On what kind of data?

Traditional databases and applications

    Relational database, data warehouse, transactional database

Advanced databases and advanced applications

    Object-relational databases

    Temporal database, sequence data (incl. biosequences), time-series data

    Spatial database and spatiotemporal database

    Text databases and multimedia databases

    Heterogeneous databases and legacy databases

    Data streams and sensor data

    Structured data, graphs, social networks and link databases

    The World-Wide Web

4) Data Mining Functionalities

    Class/concept description: characterization and discrimination

    Frequent patterns, association, correlation and causality

    Classification and prediction

    Cluster analysis

    Outlier analysis

    Trend and evolution analysis
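
As a concrete illustration of one functionality in this list, frequent-pattern mining, the sketch below counts item pairs that occur together in a transactional data set. It is a brute-force count rather than any particular algorithm from the textbook, and the transactions, the function name `frequent_pairs` and the `min_support` threshold are assumptions made for the example.

```python
from collections import Counter
from itertools import combinations

# Hypothetical market-basket transactions.
transactions = [
    {"bread", "milk"},
    {"bread", "diapers", "beer"},
    {"milk", "diapers", "beer"},
    {"bread", "milk", "diapers"},
    {"bread", "milk", "beer"},
]

def frequent_pairs(transactions, min_support):
    """Return item pairs whose support is at least min_support.

    Support is the fraction of transactions that contain both items.
    """
    counts = Counter()
    for items in transactions:
        # Sort so that each pair is always counted under the same key.
        for pair in combinations(sorted(items), 2):
            counts[pair] += 1
    n = len(transactions)
    return {pair: c / n for pair, c in counts.items() if c / n >= min_support}

pairs = frequent_pairs(transactions, min_support=0.6)
print(pairs)  # {('bread', 'milk'): 0.6}
```

Enumerating every pair is fine for a handful of transactions but grows quickly with the number of items, so it is meant only to show what a frequent pattern is.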

5) Are all the patterns interesting?

6) Classification of data mining systems