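- Contents
1. Text similarity: problems and applications
2. Overview of text similarity models
3. Hands-on: edit distance in Python
4. Hands-on: near-duplicate text detection with simhash
5. Hands-on: word vectors and Word AVG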
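1. Text similarity: problems and applications
- The text similarity problem

Text similarity covers comparisons between units of the same granularity (word to word, sentence to sentence, paragraph to paragraph, document to document) and between units of different granularity (word to sentence, sentence to paragraph, paragraph to document). "Similar" here means semantically similar. The problems get harder as the units get longer.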
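- Applications of text similarity
Search systems:
1) Use the query to retrieve the most relevant texts or web pages.
2) Match against each page's title, content, and other information.
Question answering systems:
The user's question is compared against the questions in a corpus, and the answer to the most similar question is returned as the reply.
Chatbots (retrieval-based models):
Consider a question-answering chatbot built on text similarity.
Taken one turn at a time, the replies look reasonable. Across several turns, however, the conversation feels mechanical and falls short of what we expect.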
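The question-answering flow described above can be sketched in a few lines. This is only an illustration under stated assumptions: the `FAQ` pairs are made up, and `word_overlap` is a placeholder similarity (shared words divided by all distinct words) that any of the measures from section 2 could replace.

```python
def word_overlap(q1, q2):
    """Placeholder similarity: shared words divided by all distinct words."""
    a, b = set(q1.lower().split()), set(q2.lower().split())
    if not a and not b:
        return 0.0
    return len(a & b) / len(a | b)


# Hypothetical question/answer pairs standing in for a real corpus.
FAQ = [
    ("how do i reset my password", "Use the 'Forgot password' link on the login page."),
    ("how do i change my email address", "Open Settings and edit your profile."),
    ("what payment methods do you accept", "Credit card and bank transfer."),
]


def answer(user_question, faq=FAQ, similarity=word_overlap):
    """Return the answer whose stored question is most similar to the user's."""
    best_question, best_answer = max(
        faq, key=lambda qa: similarity(user_question, qa[0])
    )
    return best_answer


print(answer("how can i reset the password"))
```

Each turn is answered independently of the previous ones, which is one reason a multi-turn conversation with such a system can feel mechanical.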
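2. Overview of text similarity models
- Hamming distance
For two strings of the same length, the number of positions at which the tokens differ. For example, d(cap, cat) = 1.
The smaller the distance, the more similar the two strings; the larger, the more different. The method is clearly too simple: some words are close in meaning yet entirely different as strings, such as difficult and hard. It can still be useful in certain specific settings.
Text similarity is about similarity of word meaning and semantics, not of surface form.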
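A minimal sketch of Hamming distance, assuming the inputs are equal-length sequences (strings or token lists). The function name `hamming_distance` is chosen here for illustration.

```python
def hamming_distance(a, b):
    """Number of positions at which two equal-length sequences differ."""
    if len(a) != len(b):
        raise ValueError("Hamming distance needs sequences of equal length")
    return sum(x != y for x, y in zip(a, b))


print(hamming_distance("cap", "cat"))  # 1
# It also works on token lists, so sentences can be compared word by word.
print(hamming_distance("i like cats".split(), "i like dogs".split()))  # 1
```

Synonyms get no credit: any two different tokens count as a full mismatch, which is the weakness noted above.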
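- Edit distance
Given two texts or two sentences, the edit distance is the minimum number of operations needed to turn one into the other. The allowed operations are insertion, deletion, and substitution, each counted as one step (the three cases handled in the implementation in section 3).
Solving edit distance with dynamic programming:
Suppose we compare the words kitten and sitting. Here the basic unit of an operation is the character; for sentence or paragraph similarity the basic unit would be the word.
With dynamic programming, the edit distance between two strings is computed from the edit distances between their prefixes, plus the minimum number of operations needed to extend those prefixes to the full strings. The recurrence is applied step by step until the full strings are reached.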
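The recurrence is as follows, where a = kitten, b = sitting, and d(i, j) is the edit distance between the first i characters of a and the first j characters of b:

$$
d(i, j) =
\begin{cases}
\max(i, j) & \text{if } \min(i, j) = 0 \\
\min \{\, d(i-1, j) + 1,\ d(i, j-1) + 1,\ d(i-1, j-1) + [a_i \neq b_j] \,\} & \text{otherwise}
\end{cases}
$$

Here $[a_i \neq b_j]$ is 1 when the two characters differ and 0 when they are equal.
This produces the following edit distance matrix (columns are prefixes of kitten, rows are prefixes of sitting):

         k  i  t  t  e  n
      0  1  2  3  4  5  6
   s  1  1  2  3  4  5  6
   i  2  2  1  2  3  4  5
   t  3  3  2  1  2  3  4
   t  4  4  3  2  1  2  3
   i  5  5  4  3  2  2  3
   n  6  6  5  4  3  3  2
   g  7  7  6  5  4  4  3

1) Both kitten and sitting are indexed from position 1, so $a_1 = k$ and $b_1 = s$.
2) If i or j is 0, i.e. row 0 or column 0 (an empty prefix), then $d(i, 0) = i$ and $d(0, j) = j$.
3) When i and j are both nonzero, $d(i, j) = \min \{\, d(i-1, j) + 1,\ d(i, j-1) + 1,\ d(i-1, j-1) + [a_i \neq b_j] \,\}$.
4) The 3 in the bottom-right corner is the edit distance between kitten and sitting, which is what we use as the similarity measure (smaller means more similar). The 3 diagonally above it is the edit distance between the prefixes kitte and sittin.
5) Once the first row and first column are initialized, every other cell is computed from its upper-left, upper, and left neighbours using the formula in 3). Take the cell in the fourth row and sixth column of the matrix above (counting the header row and the label column): it is the edit distance between kitt and si, which is 3. First look at the upper-left neighbour: the edit distance between kit and s is 3, and since $a_4 \neq b_2$ (t and i differ), the formula gives 3+1=4. Next look at the upper neighbour: the edit distance between kitt and s is 4, and kitt/si is reached from it with one insertion, so the formula gives 4+1=5. Then look at the left neighbour: the edit distance between kit and si is 2, and kitt/si is reached from it with one deletion, so the formula gives 2+1=3. Taking the minimum of the three, the edit distance between kitt and si is 3.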
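The walk-through above can be checked by building the whole matrix. The sketch below is a plain implementation of the standard recurrence (unit cost for insertion, deletion, and substitution). The function name `edit_distance_matrix` is an assumption made for this example; rows index prefixes of the first argument and columns index prefixes of the second.

```python
def edit_distance_matrix(a, b):
    """Return d where d[i][j] is the edit distance between a[:i] and b[:j]."""
    d = [[0] * (len(b) + 1) for _ in range(len(a) + 1)]
    for i in range(len(a) + 1):
        d[i][0] = i  # delete all i characters of a[:i]
    for j in range(len(b) + 1):
        d[0][j] = j  # insert all j characters of b[:j]
    for i in range(1, len(a) + 1):
        for j in range(1, len(b) + 1):
            cost = 0 if a[i - 1] == b[j - 1] else 1
            d[i][j] = min(
                d[i - 1][j] + 1,         # upper neighbour
                d[i][j - 1] + 1,         # left neighbour
                d[i - 1][j - 1] + cost,  # upper-left neighbour
            )
    return d


# Rows are prefixes of "sitting", columns are prefixes of "kitten".
d = edit_distance_matrix("sitting", "kitten")
for row in d:
    print(" ".join(str(x) for x in row))

print(d[2][4])  # "si" vs "kitt" -> 3
print(d[7][6])  # "sitting" vs "kitten" -> 3
```

The three neighbours of the cell for si and kitt are `d[1][3]` (3), `d[1][4]` (4), and `d[2][3]` (2), so the minimum of 3+1, 4+1, and 2+1 is 3.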
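- Jaccard Similarity
Given two texts or two sentences, take the intersection and the union of the words that appear in them. The size of the intersection divided by the size of the union is the Jaccard similarity.
For example:
s1 = "Natural language processing is a promising research area" # 8 words
s2 = "More and more researchers are working on natural language processing nowadays" # 11 words
Intersection: language, processing (2 words). The comparison is case-sensitive, so Natural and natural do not match, and More and more count as two different words. Jaccard Similarity = 2/(8+11-2) = 2/17 ≈ 0.1176
Drawbacks: it only looks at whether a word occurs. It ignores the meaning of each word, the order of the words, and how many times each word occurs.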
- SimHash
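SimHash is widely used in search engines. A keyword search returns a list of related pages, but many pages on the web are near-duplicates of one another. A high-quality result list should be diverse: we do not want the top ten results to be the same page. Comparing the simhash values of the candidate pages lets the engine return distinct content.
1) Choose a hash size, e.g. 32.
2) Initialize a vector V = [0]*32.
3) Turn the text into features. For example, lowercased and with whitespace and punctuation removed, the text "How are you? I am fine. Thanks." (also used in the code in section 4) yields:
how, owa, war, are, rey, eyo, you, oui, uia, iam, amf, mfi, fin, ine, net, eth, tha, han, ank, nks
Whitespace in the original text may be stripped or kept. The features above are simply every run of three consecutive characters in the text (3-grams).
4) Hash each feature (3-gram) to 32 bits (the choice of hash algorithm is not discussed here), i.e. a vector of size 32 whose entries are 0 or 1.
5) For each bit position i of each feature's hash: if the bit is 1, add 1 to V[i]; if it is 0, subtract 1 from V[i].
6) Finally, go through V: set V[i]=1 if V[i]>0, otherwise set V[i]=0. The resulting vector V is the simhash of the text.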
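The six steps above can be sketched compactly. This is an illustration under stated assumptions, not the library code used in section 4: the names `char_ngrams`, `hash32`, `simhash32`, and `bit_distance` are made up here, and truncating an md5 digest to 32 bits is just one convenient choice of hash.

```python
import hashlib
import re

HASH_SIZE = 32  # step 1


def char_ngrams(text, width=3):
    """Step 3: lowercase, drop non-word characters, take overlapping n-grams."""
    s = re.sub(r"[^\w]+", "", text.lower())
    return [s[i:i + width] for i in range(max(len(s) - width + 1, 1))]


def hash32(feature):
    """Step 4: map a feature to a 32-bit integer (md5 truncated; an assumption)."""
    digest = hashlib.md5(feature.encode("utf-8")).hexdigest()
    return int(digest, 16) & ((1 << HASH_SIZE) - 1)


def simhash32(text):
    v = [0] * HASH_SIZE  # step 2
    for feature in char_ngrams(text):
        h = hash32(feature)
        for i in range(HASH_SIZE):  # step 5: +1 for a 1 bit, -1 for a 0 bit
            v[i] += 1 if (h >> i) & 1 else -1
    # Step 6: keep the sign of each counter as one bit of the fingerprint.
    return sum(1 << i for i in range(HASH_SIZE) if v[i] > 0)


def bit_distance(x, y):
    """Number of bit positions in which two fingerprints differ."""
    return bin(x ^ y).count("1")


a = simhash32("How are you? I am fine. Thanks.")
b = simhash32("How are you? I am fine. Thank you.")
print(hex(a), hex(b), bit_distance(a, b))
```

Two texts are then treated as near-duplicates when the bit distance between their fingerprints is below a chosen threshold.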
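- Similarity based on text feature vectors
1) Convert each text into a feature vector.
One option is bag of words: the vector has one dimension per word in the vocabulary, and each entry is the number of times the corresponding word occurs in the text (0 if it does not occur).
Another option is TF-IDF: the vector again has one dimension per word in the vocabulary, and each entry is the TF-IDF value of the corresponding word computed for the text (0 if it does not occur).
2) Use the feature vectors to compute the similarity between texts.
Cosine similarity can be used to compute the similarity from the feature vectors $\mathbf{u}$ and $\mathbf{v}$ of the two texts:

$$
\cos(\mathbf{u}, \mathbf{v}) = \frac{\mathbf{u} \cdot \mathbf{v}}{\lVert \mathbf{u} \rVert \, \lVert \mathbf{v} \rVert} = \frac{\sum_i u_i v_i}{\sqrt{\sum_i u_i^2} \, \sqrt{\sum_i v_i^2}}
$$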
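As a worked instance of the two steps, the sketch below builds bag-of-words count vectors with the standard library and computes their cosine similarity by hand. The helper names `bow_vector` and `cosine` are assumptions for this example, and tokenization is simply lowercasing plus whitespace splitting.

```python
import math
from collections import Counter


def bow_vector(text):
    """Bag of words as a sparse vector: word -> count (missing words count as 0)."""
    return Counter(text.lower().split())


def cosine(u, v):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(u[w] * v[w] for w in u if w in v)
    norm_u = math.sqrt(sum(c * c for c in u.values()))
    norm_v = math.sqrt(sum(c * c for c in v.values()))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)


s1 = "Natural language processing is a promising research area"
s2 = "More and more researchers are working on natural language processing nowadays"
# Three shared words, norms sqrt(8) and sqrt(13): 3 / sqrt(104), about 0.294
print(cosine(bow_vector(s1), bow_vector(s2)))
```

Unlike Jaccard similarity, the count vectors keep track of repeated words ("more" appears twice in s2), though word order and meaning are still ignored.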
- word2Vec
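Word vectors can be used to measure the similarity between words: words with the same meaning should have similar vectors. Reducing the dimensionality of the word vectors and plotting them shows a clustering effect, with semantically close words grouped together.
For text or sentence similarity, one option is to take the plain average of the word vectors of the words in a sentence, use the result as the vector for the whole sentence, and compare sentences with cosine similarity. Another option is a weighted average of the word vectors, where the weight of each word can be its TF-IDF value.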
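The averaging idea can be sketched with a toy vocabulary. The 3-dimensional vectors in `TOY_VECTORS` are made up for illustration only (real word2vec vectors are learned from a corpus and have far more dimensions), and the helper names are assumptions for this example.

```python
import math

# Hypothetical word vectors, invented purely for illustration.
TOY_VECTORS = {
    "cat": [0.9, 0.1, 0.0],
    "dog": [0.8, 0.3, 0.0],
    "sleeps": [0.1, 0.9, 0.1],
    "rests": [0.2, 0.8, 0.1],
    "stock": [0.0, 0.1, 0.9],
    "falls": [0.1, 0.2, 0.8],
}


def sentence_vector(tokens, vectors, weights=None):
    """Weighted average of the vectors of known tokens (weight 1 by default)."""
    dim = len(next(iter(vectors.values())))
    total = [0.0] * dim
    weight_sum = 0.0
    for tok in tokens:
        if tok not in vectors:
            continue  # skip out-of-vocabulary words
        w = 1.0 if weights is None else weights.get(tok, 1.0)
        for k in range(dim):
            total[k] += w * vectors[tok][k]
        weight_sum += w
    if weight_sum == 0:
        return total
    return [x / weight_sum for x in total]


def cosine(u, v):
    dot = sum(x * y for x, y in zip(u, v))
    norm = math.sqrt(sum(x * x for x in u)) * math.sqrt(sum(y * y for y in v))
    return dot / norm if norm else 0.0


a = sentence_vector("cat sleeps".split(), TOY_VECTORS)
b = sentence_vector("dog rests".split(), TOY_VECTORS)
c = sentence_vector("stock falls".split(), TOY_VECTORS)
print(cosine(a, b), cosine(a, c))
```

Passing a `weights` dictionary (for example TF-IDF values per word) turns the plain average into the weighted variant. With these toy vectors, the first pair of sentences scores higher than the second even though they share no words.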
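3. Hands-on: edit distance in Python
def editDistDP(s1, s2):
    m = len(s1)
    n = len(s2)
    # Table that stores the answers to all subproblems
    dp = [[0 for _ in range(n + 1)] for _ in range(m + 1)]
    # Fill the DP table from top to bottom
    for i in range(m + 1):
        for j in range(n + 1):
            # If the first string is empty, the only option is to insert
            # every token of the second string
            if i == 0:
                dp[i][j] = j    # Min. operations = j
            # If the second string is empty, the only option is to delete
            # every token of the first string
            elif j == 0:
                dp[i][j] = i    # Min. operations = i
            # If the last tokens are equal, ignore them and reuse the
            # answer for the remaining prefixes
            elif s1[i-1] == s2[j-1]:
                dp[i][j] = dp[i-1][j-1]
            # If the last tokens differ, consider all three operations
            # and keep the smallest edit distance
            else:
                dp[i][j] = 1 + min(dp[i][j-1],      # insert
                                   dp[i-1][j],      # delete
                                   dp[i-1][j-1])    # substitute
    return dp[m][n]

s1 = "natural language processing is a promising research area"
s2 = "more researchers are working on natural language processing nowadays"
# The inputs are two sentences compared word by word; .split() tokenizes on whitespace.
# Chinese text needs a word segmenter such as jieba.
print(editDistDP(s1.split(), s2.split()))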
ww2 = """
World War II (often abbreviated to WWII or WW2), also known as the Second World War, was a global war that lasted from 1939 to 1945. The vast majority of the world's countries—including all the great powers—eventually formed two opposing military alliances: the Allies and the Axis. A state of total war emerged, directly involving more than 100 million people from over 30 countries. The major participants threw their entire economic, industrial, and scientific capabilities behind the war effort, blurring the distinction between civilian and military resources. World War II was the deadliest conflict in human history, marked by 50 to 85 million fatalities, most of whom were civilians in the Soviet Union and China. It included massacres, the genocide of the Holocaust, strategic bombing, premeditated death from starvation and disease, and the only use of nuclear weapons in war.[1][2][3][4]
Japan, which aimed to dominate Asia and the Pacific, was at war with China by 1937,[5][b] though neither side had declared war on the other. World War II is generally said to have begun on 1 September 1939,[6] with the invasion of Poland by Germany and subsequent declarations on Germany by France and the United Kingdom. From late 1939 to early 1941, in a series of campaigns and treaties, Germany conquered or controlled much of continental Europe, and formed the Axis alliance with Italy and Japan. Under the Molotov–Ribbentrop Pact of August 1939, Germany and the Soviet Union partitioned and annexed territories of their European neighbours, Poland, Finland, Romania and the Baltic states. Following the onset of campaigns in North Africa and East Africa, and the fall of France in mid 1940, the war continued primarily between the European Axis powers and the British Empire. War in the Balkans, the aerial Battle of Britain, the Blitz, and the long Battle of the Atlantic followed. On 22 June 1941, the European Axis powers launched an invasion of the Soviet Union, opening the largest land theatre of war in history. This Eastern Front trapped the Axis, most crucially the German Wehrmacht, into a war of attrition. In December 1941, Japan launched a surprise attack on the United States and European colonies in the Pacific. Following an immediate U.S. declaration of war against Japan, supported by one from Great Britain, the European Axis powers quickly declared war on the U.S. in solidarity with their Japanese ally. Rapid Japanese conquests over much of the Western Pacific ensued, perceived by many in Asia as liberation from Western dominance and resulting in the support of several armies from defeated territories.
The Axis advance in the Pacific halted in 1942 when Japan lost the critical Battle of Midway; later, Germany and Italy were defeated in North Africa and then, decisively, at Stalingrad in the Soviet Union. Key setbacks in 1943, which included a series of German defeats on the Eastern Front, the Allied invasions of Sicily and Italy, and Allied victories in the Pacific, cost the Axis its initiative and forced it into strategic retreat on all fronts. In 1944, the Western Allies invaded German-occupied France, while the Soviet Union regained its territorial losses and turned toward Germany and its allies. During 1944 and 1945 the Japanese suffered major reverses in mainland Asia in Central China, South China and Burma, while the Allies crippled the Japanese Navy and captured key Western Pacific islands.
The war in Europe concluded with an invasion of Germany by the Western Allies and the Soviet Union, culminating in the capture of Berlin by Soviet troops, the suicide of Adolf Hitler and the German unconditional surrender on 8 May 1945. Following the Potsdam Declaration by the Allies on 26 July 1945 and the refusal of Japan to surrender under its terms, the United States dropped atomic bombs on the Japanese cities of Hiroshima and Nagasaki on 6 and 9 August respectively. With an invasion of the Japanese archipelago imminent, the possibility of additional atomic bombings, the Soviet entry into the war against Japan and its invasion of Manchuria, Japan announced its intention to surrender on 15 August 1945, cementing total victory in Asia for the Allies. Tribunals were set up by fiat by the Allies and war crimes trials were conducted in the wake of the war both against the Germans and the Japanese.
World War II changed the political alignment and social structure of the globe. The United Nations (UN) was established to foster international co-operation and prevent future conflicts; the victorious great powers—China, France, the Soviet Union, the United Kingdom, and the United States—became the permanent members of its Security Council.[7] The Soviet Union and United States emerged as rival superpowers, setting the stage for the nearly half-century long Cold War. In the wake of European devastation, the influence of its great powers waned, triggering the decolonisation of Africa and Asia. Most countries whose industries had been damaged moved towards economic recovery and expansion. Political integration, especially in Europe, emerged as an effort to end pre-war enmities and create a common identity.[8]"""
ww1 = """World War I (often abbreviated as WWI or WW1), also known as the First World War or the Great War, was a global war originating in Europe that lasted from 28 July 1914 to 11 November 1918. Contemporaneously described as "the war to end all wars",[7] it led to the mobilisation of more than 70 million military personnel, including 60 million Europeans, making it one of the largest wars in history.[8][9] It is also one of the deadliest conflicts in history,[10] with an estimated nine million combatants and seven million civilian deaths as a direct result of the war, while resulting genocides and the 1918 influenza pandemic caused another 50 to 100 million deaths worldwide.[11]
On 28 June 1914, Gavrilo Princip, a Bosnian Serb Yugoslav nationalist, assassinated the Austro-Hungarian heir Archduke Franz Ferdinand in Sarajevo, leading to the July Crisis.[12][13] In response, on 23 July Austria-Hungary issued an ultimatum to Serbia. Serbia's reply failed to satisfy the Austrians, and the two moved to a war footing.
A network of interlocking alliances enlarged the crisis from a bilateral issue in the Balkans to one involving most of Europe. By July 1914, the great powers of Europe were divided into two coalitions: the Triple Entente—consisting of France, Russia and Britain—and the Triple Alliance of Germany, Austria-Hungary and Italy (the Triple Alliance was primarily defensive in nature, allowing Italy to stay out of the war in 1914).[14] Russia felt it necessary to back Serbia and, after Austria-Hungary shelled the Serbian capital of Belgrade on the 28th, partial mobilisation was approved.[15] General Russian mobilisation was announced on the evening of 30 July; on the 31st, Austria-Hungary and Germany did the same, while Germany demanded Russia demobilise within 12 hours.[16] When Russia failed to comply, Germany declared war on 1 August in support of Austria-Hungary, with Austria-Hungary following suit on 6th; France ordered full mobilisation in support of Russia on 2 August.[17]
German strategy for a war on two fronts against France and Russia was to rapidly concentrate the bulk of its army in the West to defeat France within four weeks, then shift forces to the East before Russia could fully mobilise; this was later known as the Schlieffen Plan.[18] On 2 August, Germany demanded free passage through Belgium, an essential element in achieving a quick victory over France.[19] When this was refused, German forces invaded Belgium on 3 August and declared war on France the same day; the Belgian government invoked the 1839 Treaty of London and in compliance with its obligations under this, Britain declared war on Germany on 4 August.[20][21] On 12 August, Britain and France also declared war on Austria-Hungary; on the 23rd, Japan sided with the Entente, seizing German possessions in China and the Pacific. In November 1914, the Ottoman Empire entered the war on the side of the Alliance, opening fronts in the Caucasus, Mesopotamia and the Sinai Peninsula. The war was fought in and drew upon each powers' colonial empires as well, spreading the conflict to Africa and across the globe. The Entente and its allies would eventually become known as the Allied Powers, while the grouping of Austria-Hungary, Germany and their allies would become known as the Central Powers.
The German advance into France was halted at the Battle of the Marne and by the end of 1914, the Western Front settled into a battle of attrition, marked by a long series of trench lines that changed little until 1917 (the Eastern Front, by contrast, was marked by much greater exchanges of territory). In 1915, Italy joined the Allied Powers and opened a front in the Alps. The Kingdom of Bulgaria joined the Central Powers in 1915 and the Kingdom of Greece joined the Allies in 1917, expanding the war in the Balkans. The United States initially remained neutral, although by doing nothing to prevent the Allies from procuring American supplies whilst the Allied blockade effectively prevented the Germans from doing the same the U.S. became an important supplier of war material to the Allies. Eventually, after the sinking of American merchant ships by German submarines, and the revelation that the Germans were trying to incite Mexico to make war on the United States, the U.S. declared war on Germany on 6 April 1917. Trained American forces would not begin arriving at the front in large numbers until mid-1918, but ultimately the American Expeditionary Force would reach some two million troops.[22]
Though Serbia was defeated in 1915, and Romania joined the Allied Powers in 1916 only to be defeated in 1917, none of the great powers were knocked out of the war until 1918. The 1917 February Revolution in Russia replaced the Tsarist autocracy with the Provisional Government, but continuing discontent at the cost of the war led to the October Revolution, the creation of the Soviet Socialist Republic, and the signing of the Treaty of Brest-Litovsk by the new government in March 1918, ending Russia's involvement in the war. This allowed the transfer of large numbers of German troops from the East to the Western Front, resulting in the German March 1918 Offensive. This offensive was initially successful, but the Allies rallied and drove the Germans back in their Hundred Days Offensive.[23] Bulgaria was the first Central Power to sign an armistice—the Armistice of Salonica on 29 September 1918. On 30 October, the Ottoman Empire capitulated, signing the Armistice of Mudros.[24] On 4 November, the Austro-Hungarian empire agreed to the Armistice of Villa Giusti. With its allies defeated, revolution at home, and the military no longer willing to fight, Kaiser Wilhelm abdicated on 9 November and Germany signed an armistice on 11 November 1918.
World War I was a significant turning point in the political, cultural, economic, and social climate of the world. The war and its immediate aftermath sparked numerous revolutions and uprisings. The Big Four (Britain, France, the United States, and Italy) imposed their terms on the defeated powers in a series of treaties agreed at the 1919 Paris Peace Conference, the most well known being the German peace treaty—the Treaty of Versailles.[25] Ultimately, as a result of the war the Austro-Hungarian, German, Ottoman, and Russian Empires ceased to exist, with numerous new states created from their remains. However, despite the conclusive Allied victory (and the creation of the League of Nations during the Peace Conference, intended to prevent future wars), a Second World War would follow just over twenty years later."""
netease = """NetEase, Inc. (simplified Chinese: 網易; traditional Chinese: 網易; pinyin: WǎngYì) is a Chinese Internet technology company providing online services centered on content, community, communications and commerce. The company was founded in 1997 by Lebunto. NetEase develops and operates online PC and mobile games, advertising services, email services and e-commerce platforms in China. It is one of the largest Internet and video game companies in the world.[7]
Some of NetEase's games include the Westward Journey series (Fantasy Westward Journey, Westward Journey Online II, Fantasy Westward Journey II, and New Westward Journey Online II), as well as other games, such as Tianxia III, Heroes of Tang Dynasty Zero and Ghost II. NetEase also partners with Blizzard Entertainment to operate local versions of Warcraft III, World of Warcraft, Hearthstone, StarCraft II, Diablo III: Reaper of Souls and Overwatch in China. They are also developing their very first self-developed VR multiplayer online game with an open world setting, which is called Nostos.[8]"""
print(editDistDP(ww1.split(), ww2.split()))  # the more similar pair: the smaller the distance, the more similar
print(editDistDP(ww1.split(), netease.split()))
4. Hands-on: near-duplicate text detection with simhash
- Jaccard Similarity
def jaccard_sim(s1, s2):
    a = set(s1.split())  # tokenize, then convert to a set to drop duplicates
    print(len(a))
    b = set(s2.split())
    print(len(b))
    c = a.intersection(b)  # intersection
    print(len(c))
    print(c)
    return float(len(c)) / (len(a) + len(b) - len(c))

s1 = "Natural language processing is a promising research area"
s2 = "More and more researchers are working on natural language processing nowadays"
print(jaccard_sim(s1, s2))
- SimHash
See the SimHash description in section 2 for how it works.
# Created by 1e0n in 2013
from __future__ import division, unicode_literals
import re
import sys
import hashlib
import logging
import numbers
import collections
from itertools import groupby

try:
    from collections.abc import Iterable  # collections.Iterable was removed in Python 3.10
except ImportError:
    from collections import Iterable  # Python 2

if sys.version_info[0] >= 3:
    basestring = str
    unicode = str
    long = int
else:
    range = xrange


def _hashfunc(x):  # default hash function
    return int(hashlib.md5(x).hexdigest(), 16)


class Simhash(object):
    def __init__(
        self, value, f=64, reg=r'[\w\u4e00-\u9fcc]+', hashfunc=None, log=None
    ):
        """
        `f` is the dimensions of fingerprints
        `reg` is meaningful only when `value` is basestring and describes
        what is considered to be a letter inside parsed string. Regexp
        object can also be specified (some attempt to handle any letters
        is to specify reg=re.compile(r'\w', re.UNICODE))
        `hashfunc` accepts a utf-8 encoded string and returns a unsigned
        integer in at least `f` bits.
        """
        self.f = f
        self.reg = reg
        self.value = None
        if hashfunc is None:
            self.hashfunc = _hashfunc
        else:
            self.hashfunc = hashfunc
        if log is None:
            self.log = logging.getLogger("simhash")
        else:
            self.log = log
        if isinstance(value, Simhash):
            self.value = value.value
        elif isinstance(value, basestring):
            # print("build by text")
            self.build_by_text(unicode(value))
        elif isinstance(value, Iterable):
            self.build_by_features(value)
        elif isinstance(value, numbers.Integral):
            self.value = value
        else:
            raise Exception('Bad parameter with type {}'.format(type(value)))

    def __eq__(self, other):
        """
        Compare two simhashes by their value.
        :param Simhash other: The Simhash object to compare to
        """
        return self.value == other.value

    def _slide(self, content, width=4):
        return [content[i:i + width] for i in range(max(len(content) - width + 1, 1))]

    def _tokenize(self, content):
        content = content.lower()
        content = ''.join(re.findall(self.reg, content))
        ans = self._slide(content)
        return ans

    def build_by_text(self, content):
        features = self._tokenize(content)
        features = {k: sum(1 for _ in g) for k, g in groupby(sorted(features))}
        return self.build_by_features(features)

    def build_by_features(self, features):
        """
        `features` might be a list of unweighted tokens (a weight of 1
        will be assumed), a list of (token, weight) tuples or
        a token -> weight dict.
        """
        v = [0] * self.f  # initialize the counters: [0, 0, 0, ...]
        masks = [1 << i for i in range(self.f)]  # one mask per bit: 1, 10, 100, 1000, ... in binary
        if isinstance(features, dict):
            features = features.items()
        for f in features:
            if isinstance(f, basestring):
                # hash the feature (md5 yields 128 bits; only the low self.f bits are used)
                h = self.hashfunc(f.encode('utf-8'))
                w = 1
            else:
                assert isinstance(f, Iterable)
                h = self.hashfunc(f[0].encode('utf-8'))
                w = f[1]
            for i in range(self.f):
                v[i] += w if h & masks[i] else -w  # +w if the bit is 1, otherwise -w
        ans = 0
        for i in range(self.f):  # build the fingerprint
            # A positive counter sets the bit to 1. Unlike step 6 in section 2,
            # counters <= 0 need no handling because ans starts at 0.
            if v[i] > 0:
                ans |= masks[i]
        self.value = ans

    def distance(self, another):
        assert self.f == another.f
        x = (self.value ^ another.value) & ((1 << self.f) - 1)  # XOR: equal bits give 0, different bits give 1
        ans = 0
        while x:
            ans += 1  # count the 1 bits in the binary representation
            x &= x - 1
        return ans  # the distance (number of 1 bits)
def get_features(s):  # generate features
    width = 3  # 3-grams
    s = s.lower()
    s = re.sub(r'[^\w]+', '', s)  # replace runs of non-word characters with ''
    return [s[i:i+width] for i in range(max(len(s) - width + 1, 1))]

print(get_features("How are you? I am fine. Thanks. "))
print(Simhash(get_features("How are you? I am fine. Thanks. ")))
print(Simhash(get_features("How are you? I am fine. Thanks. ")).value)
print(hex(Simhash(get_features("How are you? I am fine. Thanks. ")).value))
print(Simhash('aa').distance(Simhash('bb')))
print(Simhash('aa').distance(Simhash('aa')))
print(Simhash(get_features("How are you? I am fine. Thanks. ")).distance(Simhash(get_features("How are you? I am fine. Thanks. "))))
print(Simhash(get_features("How are you? I am fine. Thank you. ")).distance(Simhash(get_features("How are you? I am fine. Thanks. "))))
- SimhashIndex
class SimhashIndex(object):
    def __init__(self, objs, f=64, k=2, log=None):
        """
        `objs` is a list of (obj_id, simhash)
        obj_id is a string, simhash is an instance of Simhash
        `f` is the same with the one for Simhash
        `k` is the tolerance
        """
        self.k = k
        self.f = f
        count = len(objs)
        if log is None:
            self.log = logging.getLogger("simhash")
        else:
            self.log = log
        self.log.info('Initializing %s data.', count)
        self.bucket = collections.defaultdict(set)
        for i, q in enumerate(objs):
            if i % 10000 == 0 or i == count - 1:
                self.log.info('%s/%s', i + 1, count)
            self.add(*q)

    def get_near_dups(self, simhash):
        """
        `simhash` is an instance of Simhash
        return a list of obj_id, which is in type of str
        """
        assert simhash.f == self.f
        ans = set()
        for key in self.get_keys(simhash):
            dups = self.bucket[key]
            self.log.debug('key:%s', key)
            if len(dups) > 200:
                self.log.warning('Big bucket found. key:%s, len:%s', key, len(dups))
            for dup in dups:
                sim2, obj_id = dup.split(',', 1)
                sim2 = Simhash(int(sim2, 16), self.f)  # parse the stored hex fingerprint
                d = simhash.distance(sim2)
                if d <= self.k:
                    ans.add(obj_id)
        return list(ans)

    def add(self, obj_id, simhash):
        """
        `obj_id` is a string
        `simhash` is an instance of Simhash
        """
        assert simhash.f == self.f
        for key in self.get_keys(simhash):
            v = '%x,%s' % (simhash.value, obj_id)
            self.bucket[key].add(v)

    def delete(self, obj_id, simhash):
        """
        `obj_id` is a string
        `simhash` is an instance of Simhash
        """
        assert simhash.f == self.f
        for key in self.get_keys(simhash):
            v = '%x,%s' % (simhash.value, obj_id)
            if v in self.bucket[key]:
                self.bucket[key].remove(v)

    @property
    def offsets(self):
        """
        You may optimize this method according to <http://www.wwwconference.org/www2007/papers/paper215.pdf>
        """
        return [self.f // (self.k + 1) * i for i in range(self.k + 1)]

    def get_keys(self, simhash):
        # Split the fingerprint into k + 1 blocks; each block becomes one bucket key.
        for i, offset in enumerate(self.offsets):
            if i == (len(self.offsets) - 1):
                m = 2 ** (self.f - offset) - 1
            else:
                m = 2 ** (self.offsets[i + 1] - offset) - 1
            c = simhash.value >> offset & m
            yield '%x:%x' % (c, i)

    def bucket_size(self):
        return len(self.bucket)

data = {
    1: u'How are you? I am fine. blar blar blar blar blar Thanks.',
    2: u'How are you i am fine. blar blar blar blar blar Thanks.',
    3: u'This is a simhash test',
}
objs = [(str(k), Simhash(get_features(v))) for k, v in data.items()]
index = SimhashIndex(objs, k=3)
print(index.bucket_size())
s1 = Simhash(get_features(u'This is a simhash test'))
print(index.get_near_dups(s1))  # ids of the indexed texts within distance k of s1
index.add('4', s1)
print(index.get_near_dups(s1))
5. Hands-on: word vectors and Word AVG
- bag of words
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.metrics.pairwise import cosine_similarity

def bow_cosine(s1, s2):
    vectorizer = CountVectorizer()
    vectorizer.fit([s1, s2])
    X = vectorizer.transform([s1, s2])  # bag-of-words vectors for s1 and s2
    print(X.toarray())
    print(cosine_similarity(X[0], X[1]))

s1 = "Natural language processing is a promising research area "
s2 = "More and more researchers are working on natural language processing nowadays"
bow_cosine(s1, s2)
ww2 = """
World War II (often abbreviated to WWII or WW2), also known as the Second World War, was a global war that lasted from 1939 to 1945. The vast majority of the world's countries—including all the great powers—eventually formed two opposing military alliances: the Allies and the Axis. A state of total war emerged, directly involving more than 100 million people from over 30 countries. The major participants threw their entire economic, industrial, and scientific capabilities behind the war effort, blurring the distinction between civilian and military resources. World War II was the deadliest conflict in human history, marked by 50 to 85 million fatalities, most of whom were civilians in the Soviet Union and China. It included massacres, the genocide of the Holocaust, strategic bombing, premeditated death from starvation and disease, and the only use of nuclear weapons in war.[1][2][3][4]
Japan, which aimed to dominate Asia and the Pacific, was at war with China by 1937,[5][b] though neither side had declared war on the other. World War II is generally said to have begun on 1 September 1939,[6] with the invasion of Poland by Germany and subsequent declarations on Germany by France and the United Kingdom. From late 1939 to early 1941, in a series of campaigns and treaties, Germany conquered or controlled much of continental Europe, and formed the Axis alliance with Italy and Japan. Under the Molotov–Ribbentrop Pact of August 1939, Germany and the Soviet Union partitioned and annexed territories of their European neighbours, Poland, Finland, Romania and the Baltic states. Following the onset of campaigns in North Africa and East Africa, and the fall of France in mid 1940, the war continued primarily between the European Axis powers and the British Empire. War in the Balkans, the aerial Battle of Britain, the Blitz, and the long Battle of the Atlantic followed. On 22 June 1941, the European Axis powers launched an invasion of the Soviet Union, opening the largest land theatre of war in history. This Eastern Front trapped the Axis, most crucially the German Wehrmacht, into a war of attrition. In December 1941, Japan launched a surprise attack on the United States and European colonies in the Pacific. Following an immediate U.S. declaration of war against Japan, supported by one from Great Britain, the European Axis powers quickly declared war on the U.S. in solidarity with their Japanese ally. Rapid Japanese conquests over much of the Western Pacific ensued, perceived by many in Asia as liberation from Western dominance and resulting in the support of several armies from defeated territories.
The Axis advance in the Pacific halted in 1942 when Japan lost the critical Battle of Midway; later, Germany and Italy were defeated in North Africa and then, decisively, at Stalingrad in the Soviet Union. Key setbacks in 1943, which included a series of German defeats on the Eastern Front, the Allied invasions of Sicily and Italy, and Allied victories in the Pacific, cost the Axis its initiative and forced it into strategic retreat on all fronts. In 1944, the Western Allies invaded German-occupied France, while the Soviet Union regained its territorial losses and turned toward Germany and its allies. During 1944 and 1945 the Japanese suffered major reverses in mainland Asia in Central China, South China and Burma, while the Allies crippled the Japanese Navy and captured key Western Pacific islands.
The war in Europe concluded with an invasion of Germany by the Western Allies and the Soviet Union, culminating in the capture of Berlin by Soviet troops, the suicide of Adolf Hitler and the German unconditional surrender on 8 May 1945. Following the Potsdam Declaration by the Allies on 26 July 1945 and the refusal of Japan to surrender under its terms, the United States dropped atomic bombs on the Japanese cities of Hiroshima and Nagasaki on 6 and 9 August respectively. With an invasion of the Japanese archipelago imminent, the possibility of additional atomic bombings, the Soviet entry into the war against Japan and its invasion of Manchuria, Japan announced its intention to surrender on 15 August 1945, cementing total victory in Asia for the Allies. Tribunals were set up by fiat by the Allies and war crimes trials were conducted in the wake of the war both against the Germans and the Japanese.
World War II changed the political alignment and social structure of the globe. The United Nations (UN) was established to foster international co-operation and prevent future conflicts; the victorious great powers—China, France, the Soviet Union, the United Kingdom, and the United States—became the permanent members of its Security Council.[7] The Soviet Union and United States emerged as rival superpowers, setting the stage for the nearly half-century long Cold War. In the wake of European devastation, the influence of its great powers waned, triggering the decolonisation of Africa and Asia. Most countries whose industries had been damaged moved towards economic recovery and expansion. Political integration, especially in Europe, emerged as an effort to end pre-war enmities and create a common identity.[8]"""
ww1 = """World War I (often abbreviated as WWI or WW1), also known as the First World War or the Great War, was a global war originating in Europe that lasted from 28 July 1914 to 11 November 1918. Contemporaneously described as "the war to end all wars",[7] it led to the mobilisation of more than 70 million military personnel, including 60 million Europeans, making it one of the largest wars in history.[8][9] It is also one of the deadliest conflicts in history,[10] with an estimated nine million combatants and seven million civilian deaths as a direct result of the war, while resulting genocides and the 1918 influenza pandemic caused another 50 to 100 million deaths worldwide.[11]
On 28 June 1914, Gavrilo Princip, a Bosnian Serb Yugoslav nationalist, assassinated the Austro-Hungarian heir Archduke Franz Ferdinand in Sarajevo, leading to the July Crisis.[12][13] In response, on 23 July Austria-Hungary issued an ultimatum to Serbia. Serbia's reply failed to satisfy the Austrians, and the two moved to a war footing.
A network of interlocking alliances enlarged the crisis from a bilateral issue in the Balkans to one involving most of Europe. By July 1914, the great powers of Europe were divided into two coalitions: the Triple Entente—consisting of France, Russia and Britain—and the Triple Alliance of Germany, Austria-Hungary and Italy (the Triple Alliance was primarily defensive in nature, allowing Italy to stay out of the war in 1914).[14] Russia felt it necessary to back Serbia and, after Austria-Hungary shelled the Serbian capital of Belgrade on the 28th, partial mobilisation was approved.[15] General Russian mobilisation was announced on the evening of 30 July; on the 31st, Austria-Hungary and Germany did the same, while Germany demanded Russia demobilise within 12 hours.[16] When Russia failed to comply, Germany declared war on 1 August in support of Austria-Hungary, with Austria-Hungary following suit on 6th; France ordered full mobilisation in support of Russia on 2 August.[17]
German strategy for a war on two fronts against France and Russia was to rapidly concentrate the bulk of its army in the West to defeat France within four weeks, then shift forces to the East before Russia could fully mobilise; this was later known as the Schlieffen Plan.[18] On 2 August, Germany demanded free passage through Belgium, an essential element in achieving a quick victory over France.[19] When this was refused, German forces invaded Belgium on 3 August and declared war on France the same day; the Belgian government invoked the 1839 Treaty of London and in compliance with its obligations under this, Britain declared war on Germany on 4 August.[20][21] On 12 August, Britain and France also declared war on Austria-Hungary; on the 23rd, Japan sided with the Entente, seizing German possessions in China and the Pacific. In November 1914, the Ottoman Empire entered the war on the side of the Alliance, opening fronts in the Caucasus, Mesopotamia and the Sinai Peninsula. The war was fought in and drew upon each powers' colonial empires as well, spreading the conflict to Africa and across the globe. The Entente and its allies would eventually become known as the Allied Powers, while the grouping of Austria-Hungary, Germany and their allies would become known as the Central Powers.
The German advance into France was halted at the Battle of the Marne and by the end of 1914, the Western Front settled into a battle of attrition, marked by a long series of trench lines that changed little until 1917 (the Eastern Front, by contrast, was marked by much greater exchanges of territory). In 1915, Italy joined the Allied Powers and opened a front in the Alps. The Kingdom of Bulgaria joined the Central Powers in 1915 and the Kingdom of Greece joined the Allies in 1917, expanding the war in the Balkans. The United States initially remained neutral, although by doing nothing to prevent the Allies from procuring American supplies whilst the Allied blockade effectively prevented the Germans from doing the same the U.S. became an important supplier of war material to the Allies. Eventually, after the sinking of American merchant ships by German submarines, and the revelation that the Germans were trying to incite Mexico to make war on the United States, the U.S. declared war on Germany on 6 April 1917. Trained American forces would not begin arriving at the front in large numbers until mid-1918, but ultimately the American Expeditionary Force would reach some two million troops.[22]
Though Serbia was defeated in 1915, and Romania joined the Allied Powers in 1916 only to be defeated in 1917, none of the great powers were knocked out of the war until 1918. The 1917 February Revolution in Russia replaced the Tsarist autocracy with the Provisional Government, but continuing discontent at the cost of the war led to the October Revolution, the creation of the Soviet Socialist Republic, and the signing of the Treaty of Brest-Litovsk by the new government in March 1918, ending Russia's involvement in the war. This allowed the transfer of large numbers of German troops from the East to the Western Front, resulting in the German March 1918 Offensive. This offensive was initially successful, but the Allies rallied and drove the Germans back in their Hundred Days Offensive.[23] Bulgaria was the first Central Power to sign an armistice—the Armistice of Salonica on 29 September 1918. On 30 October, the Ottoman Empire capitulated, signing the Armistice of Mudros.[24] On 4 November, the Austro-Hungarian empire agreed to the Armistice of Villa Giusti. With its allies defeated, revolution at home, and the military no longer willing to fight, Kaiser Wilhelm abdicated on 9 November and Germany signed an armistice on 11 November 1918.
World War I was a significant turning point in the political, cultural, economic, and social climate of the world. The war and its immediate aftermath sparked numerous revolutions and uprisings. The Big Four (Britain, France, the United States, and Italy) imposed their terms on the defeated powers in a series of treaties agreed at the 1919 Paris Peace Conference, the most well known being the German peace treaty—the Treaty of Versailles.[25] Ultimately, as a result of the war the Austro-Hungarian, German, Ottoman, and Russian Empires ceased to exist, with numerous new states created from their remains. However, despite the conclusive Allied victory (and the creation of the League of Nations during the Peace Conference, intended to prevent future wars), a Second World War would follow just over twenty years later."""
netease = """NetEase, Inc. (simplified Chinese: 網易; traditional Chinese: 網易; pinyin: WǎngYì) is a Chinese Internet technology company providing online services centered on content, community, communications and commerce. The company was founded in 1997 by Lebunto. NetEase develops and operates online PC and mobile games, advertising services, email services and e-commerce platforms in China. It is one of the largest Internet and video game companies in the world.[7]
Some of NetEase's games include the Westward Journey series (Fantasy Westward Journey, Westward Journey Online II, Fantasy Westward Journey II, and New Westward Journey Online II), as well as other games, such as Tianxia III, Heroes of Tang Dynasty Zero and Ghost II. NetEase also partners with Blizzard Entertainment to operate local versions of Warcraft III, World of Warcraft, Hearthstone, StarCraft II, Diablo III: Reaper of Souls and Overwatch in China. They are also developing their very first self-developed VR multiplayer online game with an open world setting, which is called Nostos.[8]"""
bow_cosine(ww1, ww2)
bow_cosine(ww1, netease)
- TF-IDF
from sklearn.feature_extraction.text import TfidfVectorizer
def tfidf_cosine(s1, s2):
    vectorizer = TfidfVectorizer()
    vectorizer.fit([s1, s2])
    X = vectorizer.transform([s1, s2])  # TF-IDF vectors for s1 and s2 (sparse matrix, one row per text)
    print(X.toarray())
    print(cosine_similarity(X[0], X[1]))
tfidf_cosine(s1, s2)
tfidf_cosine(ww1, ww2)
tfidf_cosine(ww1, netease)
- Word2Vec
import gensim
import gensim.downloader as api
import numpy as np
from sklearn.metrics.pairwise import cosine_similarity
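model = api.load("glove-twitter-25")  # load GloVe word vectors pretrained on Twitter data, dimension 25
# 300-dimensional vectors are typical in practice; 25 is used here to keep the demo light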
print(model.get_vector("dog"))
print(model.get_vector("dog").shape)
print(model.most_similar("cat"))
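def wordavg(model, words):  # average the word vectors of a sentence to get its sentence vector
    # skip out-of-vocabulary words, for which get_vector raises KeyError
    return np.mean([model.get_vector(word) for word in words if word in model], axis=0)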
s1 = "Natural language processing is a promising research area "
s2 = "More and more researchers are working on natural language processing nowadays"
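v1 = wordavg(model, s1.lower().split())  # Chinese text would need word segmentation first
v2 = wordavg(model, s2.lower().split())
print(cosine_similarity(v1.reshape(1, -1), v2.reshape(1, -1)))  # cosine_similarity expects 2-D arrays, so reshape each vector to a single row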
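Because plain averaging is such a simple scheme, sentence vectors built this way do not perform very well. That said, the absolute cosine similarity value should not be judged in isolation. A better approach is to prepare many text pairs, have humans score the similarity of each pair, and rank the pairs by those scores; then compute the similarity of each pair with the method above and rank the pairs again. Comparing the two rankings (for example with Spearman's rank correlation coefficient) gives the final assessment of quality.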
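The ranking comparison described above can be sketched roughly as follows. This is a minimal pure-Python illustration, not part of the original tutorial: the helper names `ranks` and `spearman` and both score lists are hypothetical, made up only to show the mechanics. Spearman's coefficient is computed here as the Pearson correlation of the two rank sequences.

```python
def ranks(values):
    """Return 1-based ranks; tied values share the average of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # extend j over the run of values equal to values[order[i]]
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            result[order[k]] = avg_rank
        i = j + 1
    return result


def spearman(xs, ys):
    """Spearman rank correlation: Pearson correlation of the ranks of xs and ys."""
    rx, ry = ranks(xs), ranks(ys)
    n = len(rx)
    mx, my = sum(rx) / n, sum(ry) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    sx = sum((a - mx) ** 2 for a in rx) ** 0.5
    sy = sum((b - my) ** 2 for b in ry) ** 0.5
    return cov / (sx * sy)


# Hypothetical data: one entry per text pair.
human_scores = [4.5, 1.0, 3.0, 2.0]      # similarity judged by annotators (made-up values)
model_scores = [0.93, 0.52, 0.88, 0.41]  # cosine similarity from the model (made-up values)

print(spearman(human_scores, model_scores))
```

A coefficient close to 1 means the model ranks the pairs in nearly the same order as the human annotators; a value near 0 means the two rankings are essentially unrelated. If SciPy is available, `scipy.stats.spearmanr` computes the same statistic.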