Text-to-Keyword Similarity (Word Segmentation, Cosine Similarity): A Java Implementation

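Problem:

Text classification: suppose articles fall into several categories, and each category has its own set of keywords.

How do we assign a new text to a category?

How do we correct the article information held for each category?

Approach:

1. Word segmentation (IKAnalyzer, open source):

Use an open-source tokenizer to split the text into words. (Note: if the project also uses Elasticsearch (ES), exclude the Lucene artifacts from this dependency; otherwise the Lucene jars conflict.)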

<dependency>
    <groupId>com.janeluo</groupId>
    <artifactId>ikanalyzer</artifactId>
    <version>${ikanalyzer.version}</version>
    <exclusions>
        <exclusion>
            <groupId>org.apache.lucene</groupId>
            <artifactId>lucene-analyzers-common</artifactId>
        </exclusion>
        <exclusion>
            <groupId>org.apache.lucene</groupId>
            <artifactId>lucene-core</artifactId>
        </exclusion>
        <exclusion>
            <groupId>org.apache.lucene</groupId>
            <artifactId>lucene-queries</artifactId>
        </exclusion>
    </exclusions>
</dependency>      
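A minimal sketch of the segmentation step, assuming the `ikanalyzer` dependency above is on the classpath and that it ships the `org.wltea.analyzer.core.IKSegmenter` and `Lexeme` classes. The class name `SegmentSketch` and the method `segment` are hypothetical helpers for illustration. The result is a `Vector<String>` only so that it matches the argument type of the similarity method later in this post.

```java
import java.io.IOException;
import java.io.StringReader;
import java.util.Vector;

import org.wltea.analyzer.core.IKSegmenter;
import org.wltea.analyzer.core.Lexeme;

public class SegmentSketch {

    /**
     * Splits text into tokens with IK Analyzer.
     * useSmart = true requests coarse-grained ("smart") segmentation,
     * false requests the finest-grained split.
     */
    public static Vector<String> segment(String text, boolean useSmart) throws IOException {
        Vector<String> tokens = new Vector<>();
        IKSegmenter segmenter = new IKSegmenter(new StringReader(text), useSmart);
        Lexeme lexeme;
        // next() returns null once the input is exhausted
        while ((lexeme = segmenter.next()) != null) {
            tokens.add(lexeme.getLexemeText());
        }
        return tokens;
    }

    public static void main(String[] args) throws IOException {
        // Prints the token list; the exact split depends on the dictionary in use
        System.out.println(segment("文本和關鍵詞相似度計算", true));
    }
}
```

Calling the segmenter directly, rather than through a Lucene `Analyzer`, keeps this sketch free of Lucene classes.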

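2. Similarity calculation (cosine similarity):

Cosine similarity scores how alike two vectors are by the cosine of the angle between them. The underlying theory is easy to look up and is not covered in detail here.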
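For reference, the standard definition can be written out as follows. The symbols $A$, $B$ and $n$ are notation introduced here, not taken from the original code: $A$ and $B$ are the two term-weight vectors and $n$ is the number of distinct terms across both texts.

```latex
\cos\theta
  = \frac{A \cdot B}{\lVert A \rVert \, \lVert B \rVert}
  = \frac{\sum_{i=1}^{n} A_i B_i}
         {\sqrt{\sum_{i=1}^{n} A_i^{2}} \; \sqrt{\sum_{i=1}^{n} B_i^{2}}}

% Small example with 0/1 weights over the terms (a, b, c):
% text 1 contains {a, b}, text 2 contains {b, c}
A = (1, 1, 0), \quad B = (0, 1, 1)
\quad\Longrightarrow\quad
\cos\theta = \frac{0 + 1 + 0}{\sqrt{2}\,\sqrt{2}} = \frac{1}{2}
```

With non-negative weights the value lies between 0 and 1, and it equals 1 only when the two vectors point in the same direction.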

Implementation:

package com.spider.search.service.util;

import java.util.HashMap;
import java.util.Map;
import java.util.Vector;

public class SimilarUtils {

    // Threshold ("yuzhi"): the weight given to a term that is missing from one of the two lists
    public static double YUZHI = 0.001;

    /**
     * Returns the cosine similarity of two token lists as a value in (0, 1].
     * Each distinct term is weighted 1 where it occurs and YUZHI where it does not.
     */
    public static double getSimilarity(Vector<String> T1, Vector<String> T2) throws Exception {
        if (T1 == null || T1.isEmpty() || T2 == null || T2.isEmpty()) {
            throw new Exception("Invalid arguments passed to the similarity utility!");
        }

        // T maps every term in the union of T1 and T2 to {weight in T1, weight in T2}
        Map<String, double[]> T = new HashMap<String, double[]>();

        for (String term : T1) {
            if (term != null) {
                // Present in T1; assumed absent from T2 until the second pass says otherwise
                T.put(term, new double[] { 1, YUZHI });
            }
        }

        for (String term : T2) {
            if (term != null) {
                double[] c = T.get(term);
                if (c != null) {
                    c[1] = 1; // also present in T2
                } else {
                    // Present only in T2
                    T.put(term, new double[] { YUZHI, 1 });
                }
            }
        }

        // Both lists contained only nulls: the result would otherwise be 0/0 (NaN)
        if (T.isEmpty()) {
            throw new Exception("Invalid arguments passed to the similarity utility!");
        }

        // Accumulate the dot product and the two squared norms
        double s1 = 0, s2 = 0, Ssum = 0;
        for (double[] c : T.values()) {
            Ssum += c[0] * c[1];
            s1 += c[0] * c[0];
            s2 += c[1] * c[1];
        }
        return Ssum / Math.sqrt(s1 * s2);
    }
}
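A minimal sketch of how such a score could answer the classification question posed at the start: compare the tokens of a new text with each category's keywords and keep the best match. Everything named here is hypothetical (the class `KeywordClassifierSketch`, its methods, and the sample categories and keywords). It also uses plain 0/1 weights instead of the 0.001 threshold above, so lists with no shared term score exactly 0.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.Set;

public class KeywordClassifierSketch {

    /** Cosine similarity of two token lists treated as 0/1 (absent/present) vectors. */
    public static double cosine(List<String> a, List<String> b) {
        Set<String> sa = new HashSet<>(a);
        Set<String> sb = new HashSet<>(b);
        if (sa.isEmpty() || sb.isEmpty()) {
            return 0.0; // nothing to compare
        }
        Set<String> shared = new HashSet<>(sa);
        shared.retainAll(sb);
        // For 0/1 vectors the dot product is the number of shared terms
        // and each norm is the square root of the number of distinct terms.
        return shared.size() / Math.sqrt((double) sa.size() * sb.size());
    }

    /** Returns the best-matching category name, or null if no category shares a term. */
    public static String classify(List<String> tokens, Map<String, List<String>> categoryKeywords) {
        String best = null;
        double bestScore = 0.0;
        for (Map.Entry<String, List<String>> entry : categoryKeywords.entrySet()) {
            double score = cosine(tokens, entry.getValue());
            if (score > bestScore) {
                bestScore = score;
                best = entry.getKey();
            }
        }
        return best;
    }

    public static void main(String[] args) {
        // Hypothetical categories and keywords, for illustration only
        Map<String, List<String>> categories = new LinkedHashMap<>();
        categories.put("sports", Arrays.asList("football", "match", "goal", "team"));
        categories.put("finance", Arrays.asList("stock", "market", "fund", "bank"));

        List<String> tokens = Arrays.asList("team", "scores", "late", "goal", "match");
        System.out.println(classify(tokens, categories)); // prints "sports"
    }
}
```

Returning null when nothing overlaps leaves the caller free to route such texts to a default category or to manual review.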

Java source code:

https://github.com/sijunx/mySpider/blob/feature_word_dic_20191001001/spider-scrawl/spider-scrawl/spider-scrawl-service-impl/src/main/java/com/spider/search/service/util/SimilarUtils.java
https://github.com/sijunx/mySpider/blob/feature_word_dic_20191001001/spider-scrawl/spider-scrawl/spider-scrawl-service-impl/src/main/java/com/spider/search/service/util/SpiderKeyWordExtractUtil.java