Text-to-Keyword Similarity (Word Segmentation, Cosine Similarity): A Java Implementation

2020-01-09 23:50:00

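Problem description:

Text classification: suppose articles are divided into several categories, and each category has its own keyword information.

How do we assign a new text to a category?

How do we revise each category's article information?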

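Approach:

1. Word segmentation (IKAnalyzer, open source):

Use the open-source IKAnalyzer segmenter to split the text into words. (Note: if the project also uses Elasticsearch, exclude the Lucene artifacts that IKAnalyzer pulls in; otherwise the Lucene jars conflict.)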

<dependency>
    <groupId>com.janeluo</groupId>
    <artifactId>ikanalyzer</artifactId>
    <version>${ikanalyzer.version}</version>
    <exclusions>
        <exclusion>
            <groupId>org.apache.lucene</groupId>
            <artifactId>lucene-analyzers-common</artifactId>
        </exclusion>
        <exclusion>
            <groupId>org.apache.lucene</groupId>
            <artifactId>lucene-core</artifactId>
        </exclusion>
        <exclusion>
            <groupId>org.apache.lucene</groupId>
            <artifactId>lucene-queries</artifactId>
        </exclusion>
    </exclusions>
</dependency>
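
Once the text has been segmented, the tokens still need to be turned into the term list that the similarity step consumes. The following is a minimal sketch of that hand-off, not code from the original project: `TermVectorSketch`, `toTerms`, and the `WHITESPACE` stand-in tokenizer are hypothetical names, and whitespace splitting is used only so the example runs without IKAnalyzer on the classpath. The commented loop shows roughly how IKAnalyzer's `IKSegmenter` and `Lexeme` classes would be used in its place.

```java
import java.util.Arrays;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Set;
import java.util.Vector;
import java.util.function.Function;

// Hypothetical helper, not part of the original project.
class TermVectorSketch {

    // Stand-in tokenizer (splits on whitespace) so the sketch runs without IKAnalyzer.
    // With IKAnalyzer on the classpath, the segmentation loop would look roughly like:
    //   IKSegmenter ik = new IKSegmenter(new StringReader(text), true); // true = smart mode
    //   for (Lexeme lex = ik.next(); lex != null; lex = ik.next()) {
    //       tokens.add(lex.getLexemeText());
    //   }
    static final Function<String, List<String>> WHITESPACE =
            text -> Arrays.asList(text.trim().split("\\s+"));

    /** Segments the text and returns its distinct, non-empty terms in first-seen order. */
    static Vector<String> toTerms(String text, Function<String, List<String>> tokenizer) {
        Set<String> distinct = new LinkedHashSet<>();
        for (String token : tokenizer.apply(text)) {
            if (token != null && !token.trim().isEmpty()) {
                distinct.add(token.trim());
            }
        }
        return new Vector<>(distinct);
    }
}
```

Calling `TermVectorSketch.toTerms("java spark java", TermVectorSketch.WHITESPACE)` yields the two terms `java` and `spark`. Deduplicating here is a design choice: the similarity code in this article only records whether a term is present, so repeated tokens add nothing.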

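2. Similarity calculation (cosine similarity):

Cosine similarity measures how alike two vectors are by computing the cosine of the angle between them (the underlying theory is easy to find online and is not covered in detail here).

Implementation: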
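
For convenience, here is the standard definition written out, together with a small worked example. The symbols A, B, and ε are introduced here for illustration; ε stands for the small weight (0.001, the `YUZHI` constant) that the `SimilarUtils` class in this article gives to a term missing from one side.

```latex
\cos\theta
  = \frac{A \cdot B}{\lVert A \rVert \, \lVert B \rVert}
  = \frac{\sum_i A_i B_i}{\sqrt{\sum_i A_i^2}\,\sqrt{\sum_i B_i^2}}

% Example: T1 = {a, b}, T2 = {b, c}, union = {a, b, c}
% Plain presence weights: A = (1, 1, 0), B = (0, 1, 1)
\cos\theta = \frac{1}{\sqrt{2}\,\sqrt{2}} = 0.5

% With \varepsilon = 0.001 in place of 0: A = (1, 1, \varepsilon), B = (\varepsilon, 1, 1)
\cos\theta = \frac{1 + 2\varepsilon}{2 + \varepsilon^2}
           = \frac{1.002}{2.000001} \approx 0.501
```

The small weight barely changes the result, but it keeps the score strictly positive even when the two term lists share nothing.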

package com.spider.search.service.util;

import java.util.HashMap;
import java.util.Map;
import java.util.Vector;

public class SimilarUtils {

    // "Threshold" in the original; in practice a small smoothing weight given to a
    // term that is missing from one of the two lists
    public static double YUZHI = 0.001;

    /**
     * Returns the cosine similarity of the two term lists: a value in (0, 1],
     * or NaN if every term is null. Multiply by 100 for a percentage.
     */
    public static double getSimilarity(Vector<String> T1, Vector<String> T2) throws Exception {
        if (T1 == null || T1.isEmpty() || T2 == null || T2.isEmpty()) {
            throw new Exception("Invalid arguments: both term lists must be non-null and non-empty.");
        }

        // T is the union of T1 and T2; each term maps to {weight in T1, weight in T2}
        Map<String, double[]> T = new HashMap<String, double[]>();

        for (String term : T1) {
            if (term != null) {
                // Present in T1; assumed absent from T2 until the second pass says otherwise
                T.put(term, new double[] {1, YUZHI});
            }
        }

        for (String term : T2) {
            if (term != null) {
                double[] c = T.get(term);
                if (c != null) {
                    c[1] = 1; // present in both lists
                } else {
                    T.put(term, new double[] {YUZHI, 1}); // present in T2 only
                }
            }
        }

        // Accumulate the dot product and the squared norms of the two weight vectors
        double s1 = 0, s2 = 0, Ssum = 0;
        for (double[] c : T.values()) {
            Ssum += c[0] * c[1];
            s1 += c[0] * c[0];
            s2 += c[1] * c[1];
        }
        return Ssum / Math.sqrt(s1 * s2);
    }
}
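
To connect the similarity score back to the original question of classifying a new text, one possible approach is to score the text's terms against every category's keywords and pick the highest score. The sketch below is an assumption about how that could look, not the author's code: `CategoryClassifierSketch`, `classify`, and `EPSILON` are hypothetical names, and the class carries its own compact copy of the presence-weight cosine so that it runs standalone. It assumes the term lists are non-null.

```java
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

// Hypothetical helper, not part of the original project.
class CategoryClassifierSketch {

    // Weight for a term missing from one side, mirroring YUZHI in SimilarUtils.
    static final double EPSILON = 0.001;

    /** Cosine similarity of two term lists using presence weights (1 or EPSILON). */
    static double similarity(List<String> a, List<String> b) {
        Set<String> setA = new HashSet<>(a);
        Set<String> setB = new HashSet<>(b);
        Set<String> union = new HashSet<>(setA);
        union.addAll(setB);

        double dot = 0, normA = 0, normB = 0;
        for (String term : union) {
            double wa = setA.contains(term) ? 1 : EPSILON;
            double wb = setB.contains(term) ? 1 : EPSILON;
            dot += wa * wb;
            normA += wa * wa;
            normB += wb * wb;
        }
        return dot / Math.sqrt(normA * normB);
    }

    /** Returns the category whose keywords score highest against the text, or null if there are no categories. */
    static String classify(List<String> textTerms, Map<String, List<String>> categoryKeywords) {
        String best = null;
        double bestScore = -1;
        for (Map.Entry<String, List<String>> entry : categoryKeywords.entrySet()) {
            double score = similarity(textTerms, entry.getValue());
            if (score > bestScore) {
                bestScore = score;
                best = entry.getKey();
            }
        }
        return best;
    }
}
```

In practice a minimum score would probably be needed as well, so that a text matching no category well is left unclassified instead of being forced into the least bad one; that cut-off is left out here because the article does not specify one.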

Java source code:

https://github.com/sijunx/mySpider/blob/feature_word_dic_20191001001/spider-scrawl/spider-scrawl/spider-scrawl-service-impl/src/main/java/com/spider/search/service/util/SimilarUtils.java
https://github.com/sijunx/mySpider/blob/feature_word_dic_20191001001/spider-scrawl/spider-scrawl/spider-scrawl-service-impl/src/main/java/com/spider/search/service/util/SpiderKeyWordExtractUtil.java
