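Cosine similarity algorithm:
Cosine similarity measures how similar two vectors are by taking the cosine of the angle between them. The cosine of a 0° angle is 1, the cosine of any other angle is less than 1, and its minimum value is -1. The cosine of the angle therefore tells us whether two vectors point in roughly the same direction. When the two vectors point in the same direction, the cosine similarity is 1; when the angle between them is 90°, it is 0; when they point in exactly opposite directions, it is -1. The result does not depend on the lengths of the vectors, only on their directions. In general the value lies between -1 and 1; cosine similarity is usually applied in positive space (all components non-negative), where the value lies between 0 and 1.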
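The properties above can be checked with a few hand-picked vectors. The following is only a minimal sketch: the class name `CosineDemo` and the method `cosine` are illustrative names that do not appear in the original code, and the method assumes both arrays have the same length.

```java
class CosineDemo {
    // Cosine of the angle between two equal-length vectors:
    // dot product divided by the product of the Euclidean norms.
    public static double cosine(double[] a, double[] b) {
        double dot = 0, normA = 0, normB = 0;
        for (int i = 0; i < a.length; i++) {
            dot += a[i] * b[i];
            normA += a[i] * a[i];
            normB += b[i] * b[i];
        }
        return dot / (Math.sqrt(normA) * Math.sqrt(normB));
    }

    public static void main(String[] args) {
        // Same direction, different lengths: the length does not matter.
        System.out.println(cosine(new double[]{3, 4}, new double[]{6, 8}));   // 1.0
        // Perpendicular vectors.
        System.out.println(cosine(new double[]{1, 0}, new double[]{0, 5}));   // 0.0
        // Exactly opposite directions.
        System.out.println(cosine(new double[]{3, 4}, new double[]{-3, -4})); // -1.0
    }
}
```

The vectors were chosen so that every norm is a whole number, which keeps the floating-point results exact.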
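Coordinate-system representation: draw the two vectors a and b from the origin; the angle θ between them is the angle whose cosine is measured.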

Formula:
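In standard notation, for two n-dimensional vectors A = (A1, A2, ..., An) and B = (B1, B2, ..., Bn) with angle θ between them, the cosine similarity is the dot product divided by the product of the Euclidean norms. The LaTeX below is a reconstruction of that standard definition, written to match what the Java code computes (`num` is the numerator, `den` the denominator):

```latex
\cos\theta
  = \frac{A \cdot B}{\lVert A \rVert \, \lVert B \rVert}
  = \frac{\sum_{i=1}^{n} A_i B_i}
         {\sqrt{\sum_{i=1}^{n} A_i^2}\,\sqrt{\sum_{i=1}^{n} B_i^2}}
```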
Implementation (Java):
import java.util.List;

public class SimilarityUtil {

    /**
     * Cosine similarity of two vectors. If the lengths differ, the shorter
     * vector is treated as if it were padded with zeros; the input lists
     * themselves are not modified.
     */
    public static double similarity(List<? extends Number> va, List<? extends Number> vb) {
        int size = Math.max(va.size(), vb.size());
        double num = 0;      // dot product of va and vb
        double powaSum = 0;  // sum of squares of va
        double powbSum = 0;  // sum of squares of vb
        for (int i = 0; i < size; i++) {
            // A missing component counts as 0.
            double a = i < va.size() ? va.get(i).doubleValue() : 0;
            double b = i < vb.size() ? vb.get(i).doubleValue() : 0;
            num += a * b;
            powaSum += a * a;
            powbSum += b * b;
        }
        // Product of the two Euclidean norms. If either vector is all zeros,
        // this is 0 and the result is NaN, because the cosine is undefined there.
        double den = Math.sqrt(powaSum) * Math.sqrt(powbSum);
        return num / den;
    }
}
Case study:
Name/Interest | Eating apples | Shopping | Watching TV dramas | Playing badminton | Eating oranges |
Xiaohong | 4.5 | 5 | 5 | 5 | |
xxx | 3.5 | 5 | 5 | 5 | |
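Take the interest ratings of Xiaohong and xxx as the vectors va and vb, counting a blank rating as 0, and apply the n-dimensional cosine similarity formula. For vectors A = (A1, A2, ..., An) and B = (B1, B2, ..., Bn):
cos(θ) = (A1*B1 + A2*B2 + ... + An*Bn) / (sqrt(A1^2 + A2^2 + ... + An^2) * sqrt(B1^2 + B2^2 + ... + Bn^2))
The computed similarity is: 0.9954774432988771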
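As a check on that figure, the arithmetic can be worked by hand. The sketch below uses the vectors hard-coded in the program's main method, A = (4.5, 5, 5, 5, 0) for Xiaohong and B = (3.5, 5, 5, 5, 0) for xxx; the intermediate values are rounded to four decimal places:

```latex
\begin{aligned}
A \cdot B &= 4.5 \times 3.5 + 5 \times 5 + 5 \times 5 + 5 \times 5 + 0 \times 0 = 90.75 \\
\lVert A \rVert &= \sqrt{4.5^2 + 3 \times 5^2 + 0^2} = \sqrt{95.25} \approx 9.7596 \\
\lVert B \rVert &= \sqrt{3.5^2 + 3 \times 5^2 + 0^2} = \sqrt{87.25} \approx 9.3408 \\
\cos\theta &= \frac{90.75}{\sqrt{95.25}\,\sqrt{87.25}} \approx \frac{90.75}{91.1623} \approx 0.9955
\end{aligned}
```

This agrees with the program output once it is rounded to four decimal places.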
Implementation code:
import java.util.Arrays;
import java.util.List;

public class SimilarityMain {

    /**
     * Cosine similarity of two vectors. If the lengths differ, the shorter
     * vector is treated as if it were padded with zeros.
     */
    public static double similarity(List<? extends Number> va, List<? extends Number> vb) {
        int size = Math.max(va.size(), vb.size());
        double num = 0;      // dot product of va and vb
        double powaSum = 0;  // sum of squares of va
        double powbSum = 0;  // sum of squares of vb
        for (int i = 0; i < size; i++) {
            // A missing component counts as 0.
            double a = i < va.size() ? va.get(i).doubleValue() : 0;
            double b = i < vb.size() ? vb.get(i).doubleValue() : 0;
            num += a * b;
            powaSum += a * a;
            powbSum += b * b;
        }
        // Product of the two Euclidean norms; NaN results if either vector is all zeros.
        double den = Math.sqrt(powaSum) * Math.sqrt(powbSum);
        return num / den;
    }

    public static void main(String[] args) {
        List<String> items = Arrays.asList(
                "Eating apples", "Shopping", "Watching TV dramas", "Playing badminton", "Eating oranges");
        // One rating per interest; the blank "Eating oranges" rating is encoded as 0.
        List<Double> va = Arrays.asList(4.5, 5.0, 5.0, 5.0, 0.0); // Xiaohong
        List<Double> vb = Arrays.asList(3.5, 5.0, 5.0, 5.0, 0.0); // xxx

        System.out.println("Interests" + items);
        System.out.println("Xiaohong" + va);
        System.out.println("xxx" + vb);

        // similarity is static, so no SimilarityMain instance is needed.
        double simVal = similarity(va, vb);
        System.out.println("The sim value is:" + simVal);
    }
}
Source code on gitee:
https://gitee.com/jockhome/