7.7 KiB

Raw Permalink Blame History

gensim word 2 vec–完整指南

原文：https://www.askpython.com/python-modules/gensim-word2vec

Word2Vec 是一种算法，它将单词转换为向量，从而将相似的单词分组到向量空间中。它广泛应用于许多领域，如文档检索、机器翻译系统、自动补全和预测等。在本教程中，我们将学习如何使用 Gensim 库训练 Word2Vec 模型，以及加载将单词转换为向量的预训练。

Word2Vec

Word2Vec 是谷歌设计的一种算法，它使用神经网络来创建单词嵌入，使得具有相似词义的嵌入往往指向相似的方向。例如，在向量空间中，像爱、关心等词的嵌入与像打架、战斗等词的嵌入相比，将指向相似的方向。这种模型还可以检测给定单词的同义词，并为部分句子建议一些附加单词。

Gensim Word2Vec

Gensim 是一个开源的 Python 库，可用于主题建模、文档索引以及消除与大型语料库的相似性。Gensim 的算法相对于语料库的大小是独立于内存的。它也被设计用来扩展其他向量空间算法。

Gensim 在Word2Vec类中提供了 Word2Vec 算法的实现以及其他一些自然语言处理的功能。让我们看看如何使用 Gensim 创建 Word2Vec 模型。

使用 Gensim 开发一个 Word2Vec 模型

Gensim Word2Vec 类采用的一些有用参数:

句子:它是模型被训练来创建单词嵌入的数据。在大型语料库的情况下，它可以是标记/单词列表的列表，或者来自网络/磁盘的数据流。在我们的例子中，我们将使用出现在中的棕色语料库。
size:它表示对于词汇表中的每个单词，您希望向量的维数是多长。其默认值为 100。
窗口:当前单词与其相邻单词之间的最大距离。如果您的相邻单词大于宽度，那么，一些相邻单词将不会被认为与当前单词相关。其默认值为 5。
min_count:它表示单词在词汇表中出现的最小频率值。其默认值为 5。
iter:它表示数据集上迭代/历元的数量。其默认值为 5。

在 Python 中使用 Word2Vec 的示例

import string
import nltk
from nltk.corpus import brown
from gensim.models import Word2Vec
from sklearn.decomposition import PCA
from matplotlib import pyplot

nltk.download("brown")

# Preprocessing data to lowercase all words and remove single punctuation words
document = brown.sents()
data = []
for sent in document:
  new_sent = []
  for word in sent:
    new_word = word.lower()
    if new_word[0] not in string.punctuation:
      new_sent.append(new_word)
  if len(new_sent) > 0:
    data.append(new_sent)

# Creating Word2Vec
model = Word2Vec(
    sentences = data,
    size = 50,
    window = 10,
    iter = 20,
)

# Vector for word love
print("Vector for love:")
print(model.wv["love"])
print()

# Finding most similar words
print("3 words similar to car")
words = model.most_similar("car", topn=3)
for word in words:
  print(word)
print()

#Visualizing data
words = ["france", "germany", "india", "truck", "boat", "road", "teacher", "student"]

X = model.wv[words]
pca = PCA(n_components=2)
result = pca.fit_transform(X)

pyplot.scatter(result[:, 0], result[:, 1])
for i, word in enumerate(words):
	pyplot.annotate(word, xy=(result[i, 0], result[i, 1]))
pyplot.show()

输出:

Some Output[nltk_data] Downloading package brown to /root/nltk_data...
[nltk_data]   Unzipping corpora/brown.zip.
Vector for love:
[ 2.576164   -0.2537464  -2.5507743   3.1892483  -1.8316503   2.6448352
 -0.06407754  0.5304831   0.04439827  0.45178193 -0.4788834  -1.2661372
  1.0238386   0.3144989  -2.3910248   2.303471   -2.861455   -1.988338
 -0.36665946 -0.32186085  0.17170368 -2.0292065  -0.9724318  -0.5792801
 -2.809848    2.4033384  -1.0886359   1.1814215  -0.9120702  -1.1175308
  1.1127514  -2.287549   -1.6190344   0.28058434 -3.0212548   1.9233572
  0.13773602  1.5269752  -1.8643662  -1.5568101  -0.33570558  1.4902842
  0.24851061 -1.6321756   0.02789219 -2.1180007  -1.5782264  -0.9047415
  1.7374605   2.1492126 ]

3 words similar to car
('boat', 0.7544293403625488)
('truck', 0.7183066606521606)
('block', 0.6936473250389099)

Word2Vec Visualization**

在上面的可视化中，我们可以看到单词 student 和 teacher 指向一个方向，像印度、德国和法国这样的国家指向另一个方向，而像 road、boat 和 truck 这样的单词指向另一个方向。这表明我们的 Word2Vec 模型已经学习了可以根据词义区分单词的嵌入。

使用 Gensimd 加载预训练模型

正如我们在下面看到的，Gensim 还带有几个已经预先训练好的模型。

import gensim
import gensim.downloader

for model_name in list(gensim.downloader.info()['models'].keys()):
  print(model_name)

fasttext-wiki-news-subwords-300
conceptnet-numberbatch-17-06-300
word2vec-ruscorpora-300
word2vec-google-news-300
glove-wiki-gigaword-50
glove-wiki-gigaword-100
glove-wiki-gigaword-200
glove-wiki-gigaword-300
glove-twitter-25
glove-twitter-50
glove-twitter-100
glove-twitter-200
__testing_word2vec-matrix-synopsis

让我们加载word2vec-google-news-300模型并执行不同的任务，例如查找首都和国家之间的关系、获取相似的单词以及计算余弦相似度。

import gensim
import gensim.downloader

google_news_vectors = gensim.downloader.load('word2vec-google-news-300')

# Finding Capital of Britain given Capital of France: (Paris - France) + Britain = 
print("Finding Capital of Britain: (Paris - France) + Britain")
capital = google_news_vectors.most_similar(["Paris", "Britain"], ["France"], topn=1)
print(capital)
print()

# Finding Capital of India given Capital of Germany: (Berlin - Germany) + India = 
print("Finding Capital of India: (Berlin - Germany) + India")
capital = google_news_vectors.most_similar(["Berlin", "India"], ["Germany"], topn=1)
print(capital)
print()

# Finding words similar to BMW
print("5 similar words to BMW:")
words = google_news_vectors.most_similar("BMW", topn=5)
for word in words:
  print(word)
print()

# Finding words similar to Beautiful
print("3 similar words to beautiful:")
words = google_news_vectors.most_similar("beautiful", topn=3)
for word in words:
  print(word)
print()

# Finding cosine similarity between fight and battle
cosine = google_news_vectors.similarity("fight", "battle")
print("Cosine similarity between fight and battle:", cosine)
print()

# Finding cosine similarity between fight and love
cosine = google_news_vectors.similarity("fight", "love")
print("Cosine similarity between fight and love:", cosine)