```python
import numpy as np

def cosine_similarity(u, v):
    """Cosine similarity between word vectors u and v."""
    # Compute the dot product between u and v
    dot = np.dot(u, v)
    # Compute the L2 norm of u
    norm_u = np.sqrt(np.sum(np.power(u, 2)))
    # Compute the L2 norm of v
    norm_v = np.sqrt(np.sum(np.power(v, 2)))
    # Compute the cosine similarity defined by formula (1)
    cosine_similarity = np.divide(dot, norm_u * norm_v)
    return cosine_similarity
```
Learning Word Embeddings
In practice, word embeddings can be trained on your own dataset (i.e., the embedding matrix described in the table above is learned), or, if your dataset is not large enough, a pretrained embedding model can be downloaded. Word embeddings are usually learned by building a language model rather than by training the embeddings in isolation: for example, a neural network that predicts the next word of a sequence, as in "I want a glass of orange __". A common choice is a fixed history window; with a window size of 4 (a hyperparameter), only the previous 4 words are used to predict the next word. The embedding matrix is itself a parameter and is learned during training. If the sentences in the training set are more complex, the context on both sides can also be used, e.g., predicting the middle word from the 4 words before and the 4 words after it. So when a pretrained embedding matrix is used, it can either be fine-tuned at this step or kept fixed (treated as a hyperparameter) and used as-is. A minimal sketch of such a fixed-window model is shown below.
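The following is a minimal sketch of the forward pass of a fixed-window neural language model, not the course's actual implementation; all sizes and parameter names (`vocab_size`, `emb_dim`, `W1`, etc.) are illustrative.

```python
import numpy as np

# Illustrative sizes: vocabulary, embedding dimension, hidden layer, history window
vocab_size, emb_dim, hidden_dim, window = 10000, 50, 128, 4

rng = np.random.default_rng(0)
E  = rng.normal(scale=0.01, size=(vocab_size, emb_dim))          # embedding matrix (learned)
W1 = rng.normal(scale=0.01, size=(window * emb_dim, hidden_dim)) # hidden-layer weights
b1 = np.zeros(hidden_dim)
W2 = rng.normal(scale=0.01, size=(hidden_dim, vocab_size))       # output-layer weights
b2 = np.zeros(vocab_size)

def predict_next_word(context_ids):
    """Look up the 4 context embeddings, concatenate them, run one hidden layer,
    and return a softmax distribution over the vocabulary for the next word."""
    x = E[context_ids].reshape(-1)          # (window * emb_dim,)
    h = np.tanh(x @ W1 + b1)                # hidden layer
    logits = h @ W2 + b2
    probs = np.exp(logits - logits.max())   # numerically stable softmax
    return probs / probs.sum()

# "I want a glass of orange ___" -> use the last 4 word indices as the context
probs = predict_next_word([10, 42, 7, 99])  # dummy word indices
print(probs.argmax())                        # index of the predicted next word
```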
Skip-grams is essentially a supervised model that learns a mapping from a context word $c$ to a target word $t$, but it is computationally expensive because of the softmax over the entire vocabulary. Negative sampling instead constructs a new supervised learning problem: given a pair of words, e.g., orange and juice, predict whether they form a context–target pair. For example, take the sentence: I want a glass of orange juice to go along with my cereal.
First, sample a context word (orange) and a target word (juice) from the sentence and label the pair 1; then pick k words at random from the dictionary (here k = 4) and label those pairs 0 (even if, like "of", they also appear in the sentence):
| Context | Word  | Target? |
| ------- | ----- | ------- |
| orange  | juice | 1       |
| orange  | king  | 0       |
| orange  | book  | 0       |
| orange  | the   | 0       |
| orange  | of    | 0       |
Given an input context word $c$ and a candidate target word $t$, define a logistic regression model for the output:

$$P(y = 1 \mid c, t) = \sigma\left(\theta_t^{\top} e_c\right)$$

where $\sigma$ is the sigmoid function. That is, each positive example is paired with $k$ negative examples to train a logistic regression model, so each iteration is relatively cheap; see the original paper [3] for details. When drawing the negative samples, uniform sampling does not reflect the word distribution, while sampling by raw word frequency gives too much weight to frequent words such as prepositions, so in practice a distribution between the two is used:

$$P(w_i) = \frac{f(w_i)^{3/4}}{\sum_{j} f(w_j)^{3/4}}$$

where $f(w_i)$ is the frequency of word $w_i$ in the corpus. A small sketch of this sampling scheme and the per-pair logistic model follows.
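The sketch below illustrates the two pieces under stated assumptions: `vocab`, `word_freq`, `E`, and `theta` are made-up toy values, not data from the course.

```python
import numpy as np

rng = np.random.default_rng(0)
vocab = ["juice", "king", "book", "the", "of", "orange", "glass"]
word_freq = np.array([20, 5, 8, 500, 300, 15, 10], dtype=float)  # raw counts f(w_i)

# Sampling distribution between uniform and raw frequency: P(w_i) ∝ f(w_i)^(3/4)
p = word_freq ** 0.75
p /= p.sum()

def sample_negatives(k):
    """Draw k negative target words according to the smoothed frequency distribution."""
    return rng.choice(len(vocab), size=k, p=p)

# Per-pair logistic regression: P(y=1 | c, t) = sigmoid(theta_t . e_c)
emb_dim = 50
E = rng.normal(scale=0.01, size=(len(vocab), emb_dim))      # context embeddings e_c
theta = rng.normal(scale=0.01, size=(len(vocab), emb_dim))  # target-word parameters theta_t

def prob_is_pair(c_idx, t_idx):
    return 1.0 / (1.0 + np.exp(-np.dot(theta[t_idx], E[c_idx])))

c = vocab.index("orange")
print("positive:", vocab[0], prob_is_pair(c, 0))
for t in sample_negatives(k=4):                              # k negatives per positive
    print("negative:", vocab[t], prob_is_pair(c, t))
```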
GloVe Word Vectors
GloVe stands for Global Vectors for Word Representation. Let $X_{ij}$ be the number of times word $i$ appears in the context of word $j$ (i.e., how often the two words appear in the same window). If the context is defined as the 10 words to the left and right of the target word, then by definition $X_{ij} = X_{ji}$; the matrix $X$ is also called the co-occurrence matrix of the corpus. GloVe minimizes

$$J = \sum_{i}\sum_{j} f(X_{ij}) \left(\theta_i^{\top} e_j + b_i + b'_j - \log X_{ij}\right)^2$$

where $b_i$ and $b'_j$ are the bias terms of the two word vectors, and the weighting function $f(X_{ij})$ is a truncated function (with $f(0) = 0$, so terms with $X_{ij} = 0$ drop out):

$$f(x) = \begin{cases} (x / x_{\max})^{\alpha} & \text{if } x < x_{\max} \\ 1 & \text{otherwise} \end{cases}$$

In the original paper, $\alpha = 0.75$ and $x_{\max} = 100$; the detailed derivation of the loss function can be found in the original paper [4]. A sketch of this objective is given below.
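A minimal sketch of the GloVe objective for a given co-occurrence matrix, assuming a tiny random `X` and parameter names (`theta`, `e`, `b`, `b_prime`) chosen for illustration; no training loop is shown.

```python
import numpy as np

def glove_weight(x, x_max=100.0, alpha=0.75):
    """Truncated weighting function f(X_ij)."""
    return np.where(x < x_max, (x / x_max) ** alpha, 1.0)

def glove_loss(X, theta, e, b, b_prime):
    """Sum over i, j of f(X_ij) * (theta_i . e_j + b_i + b'_j - log X_ij)^2,
    restricted to entries with X_ij > 0 (log is undefined at 0, and f(0) = 0)."""
    i_idx, j_idx = np.nonzero(X)
    diff = (np.sum(theta[i_idx] * e[j_idx], axis=1)
            + b[i_idx] + b_prime[j_idx] - np.log(X[i_idx, j_idx]))
    return np.sum(glove_weight(X[i_idx, j_idx]) * diff ** 2)

rng = np.random.default_rng(0)
V, d = 5, 10                                        # tiny vocabulary and embedding size
X = rng.integers(0, 50, size=(V, V)).astype(float)  # toy co-occurrence counts
theta, e = rng.normal(size=(V, d)), rng.normal(size=(V, d))
b, b_prime = np.zeros(V), np.zeros(V)
print(glove_loss(X, theta, e, b, b_prime))
```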
Word Analogy Task
Man is to woman as king is to queen: given words a (man), b (woman) and c (king), we need to find a word d such that $e_b - e_a \approx e_d - e_c$. Cosine similarity (implemented by `cosine_similarity` above) is used as the measure here.
```python
def complete_analogy(word_a, word_b, word_c, word_to_vec_map):
    """Find the word d such that e_b - e_a is most similar to e_d - e_c."""
    # convert words to lower case
    word_a, word_b, word_c = word_a.lower(), word_b.lower(), word_c.lower()
    # Get the word embeddings e_a, e_b and e_c
    e_a, e_b, e_c = word_to_vec_map[word_a], word_to_vec_map[word_b], word_to_vec_map[word_c]

    words = word_to_vec_map.keys()
    max_cosine_sim = -100   # Initialize max_cosine_sim to a large negative number
    best_word = None        # Initialize best_word with None; it will keep track of the word to output

    # loop over the whole word vector set
    for w in words:
        # to avoid best_word being one of the input words, skip them
        if w in [word_a, word_b, word_c]:
            continue
        # Compute cosine similarity between the vector (e_b - e_a) and the vector (w's vector representation - e_c)
        cosine_sim = cosine_similarity(e_b - e_a, word_to_vec_map[w] - e_c)
        # If the cosine_sim is greater than the max_cosine_sim seen so far,
        # record the new max_cosine_sim and the best_word
        if cosine_sim > max_cosine_sim:
            max_cosine_sim = cosine_sim
            best_word = w

    return best_word
```
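The embeddings also encode biases such as gender. Below, each name's cosine similarity with a gender direction g is printed. The setup code here is an assumption made for the sketch: `word_to_vec_map` is taken to hold pretrained embeddings, g is constructed as $e_{woman} - e_{man}$ (consistent with the neutralize/equalize steps that follow), and `name_list` is taken from the output shown below.

```python
# Assumed setup for the bias example (not shown in the original notes)
g = word_to_vec_map["woman"] - word_to_vec_map["man"]
name_list = ["john", "marie", "sophie", "ronaldo", "priya",
             "rahul", "danielle", "reza", "katy", "yasmin"]
print("List of names and their similarities with constructed vector:")
```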
```python
for w in name_list:
    print(w, cosine_similarity(word_to_vec_map[w], g))
```
```
List of names and their similarities with constructed vector:
john [-0.23163356]
marie [0.31559794]
sophie [0.3186879]
ronaldo [-0.31244797]
priya [0.17632042]
rahul [-0.16915471]
danielle [0.24393299]
reza [-0.0793043]
katy [0.28310687]
yasmin [0.23313858]
```
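Neutralization removes the bias-direction component from a word that should be gender-neutral (for example, an occupation word): the word vector $e$ is projected onto $g$ and that projection is subtracted. Reconstructed from the code itself, the formulas that `neutralize` implements are:

$$e^{\text{bias\_component}} = \frac{e \cdot g}{\lVert g \rVert_2^{2}}\, g, \qquad e^{\text{debiased}} = e - e^{\text{bias\_component}}$$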
```python
def neutralize(word, g, word_to_vec_map):
    """Remove the bias-direction component from the embedding of `word`."""
    # Select word vector representation of "word". Use word_to_vec_map.
    e = word_to_vec_map[word]
    # Compute e_biascomponent using the formula given above.
    e_biascomponent = np.divide(np.dot(e, g), np.linalg.norm(g) ** 2) * g
    # Neutralize e by subtracting e_biascomponent from it;
    # e_debiased should be equal to its orthogonal projection.
    e_debiased = e - e_biascomponent
    return e_debiased
```
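For reference, the equalization formulas that the `equalize` code below implements, reconstructed from the code itself (the numbering (7)–(10) follows the comments; $g$ is the `bias_axis` argument):

$$\mu = \frac{e_{w1} + e_{w2}}{2}, \qquad \mu_B = \frac{\mu \cdot g}{\lVert g \rVert_2^{2}}\, g, \qquad \mu_{\perp} = \mu - \mu_B$$

$$e_{w1B} = \frac{e_{w1} \cdot g}{\lVert g \rVert_2^{2}}\, g \quad (7), \qquad e_{w2B} = \frac{e_{w2} \cdot g}{\lVert g \rVert_2^{2}}\, g \quad (8)$$

$$e_{w1B}^{\text{corrected}} = \sqrt{\left|1 - \lVert \mu_{\perp} \rVert_2^{2}\right|}\; \frac{e_{w1B} - \mu_B}{\lVert (e_{w1} - \mu_{\perp}) - \mu_B \rVert_2} \quad (9), \qquad e_{w2B}^{\text{corrected}} = \sqrt{\left|1 - \lVert \mu_{\perp} \rVert_2^{2}\right|}\; \frac{e_{w2B} - \mu_B}{\lVert (e_{w2} - \mu_{\perp}) - \mu_B \rVert_2} \quad (10)$$

$$e_1 = e_{w1B}^{\text{corrected}} + \mu_{\perp}, \qquad e_2 = e_{w2B}^{\text{corrected}} + \mu_{\perp}$$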
```python
def equalize(pair, bias_axis, word_to_vec_map):
    """Make the two words of `pair` differ only along the bias axis, with equal magnitude."""
    # Step 1: Select the word vector representations of the pair. Use word_to_vec_map.
    w1, w2 = pair
    e_w1, e_w2 = word_to_vec_map[w1], word_to_vec_map[w2]

    # Step 2: Compute the mean of e_w1 and e_w2
    mu = (e_w1 + e_w2) / 2.0

    # Step 3: Compute the projections of mu onto the bias axis and the orthogonal axis
    mu_B = np.divide(np.dot(mu, bias_axis), np.linalg.norm(bias_axis) ** 2) * bias_axis
    mu_orth = mu - mu_B

    # Step 4: Use equations (7) and (8) to compute e_w1B and e_w2B
    e_w1B = np.divide(np.dot(e_w1, bias_axis), np.linalg.norm(bias_axis) ** 2) * bias_axis
    e_w2B = np.divide(np.dot(e_w2, bias_axis), np.linalg.norm(bias_axis) ** 2) * bias_axis

    # Step 5: Adjust the bias part of e_w1B and e_w2B using equations (9) and (10)
    # (the denominator is an L2 norm, not an element-wise absolute value)
    corrected_e_w1B = np.sqrt(np.abs(1 - np.sum(mu_orth ** 2))) * np.divide(
        e_w1B - mu_B, np.linalg.norm(e_w1 - mu_orth - mu_B))
    corrected_e_w2B = np.sqrt(np.abs(1 - np.sum(mu_orth ** 2))) * np.divide(
        e_w2B - mu_B, np.linalg.norm(e_w2 - mu_orth - mu_B))

    # Step 6: Debias by equalizing e1 and e2 to the sum of their corrected projections
    e1 = corrected_e_w1B + mu_orth
    e2 = corrected_e_w2B + mu_orth

    return e1, e2
```
After the equalization algorithm, two words that differ only in gender should have cosine similarities with the gender direction that are roughly equal in magnitude and opposite in sign. A small usage sketch to check this follows.
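A minimal sketch, assuming `word_to_vec_map` and the gender direction `g` are defined as above:

```python
# Assuming word_to_vec_map and g = word_to_vec_map["woman"] - word_to_vec_map["man"]
print("before equalizing:")
print("  cos(man, g)   =", cosine_similarity(word_to_vec_map["man"], g))
print("  cos(woman, g) =", cosine_similarity(word_to_vec_map["woman"], g))

e1, e2 = equalize(("man", "woman"), g, word_to_vec_map)
print("after equalizing:")
print("  cos(e1, g) =", cosine_similarity(e1, g))   # should be roughly -cos(e2, g)
print("  cos(e2, g) =", cosine_similarity(e2, g))
```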
References
1. Andrew Ng. DeepLearning.ai.
2. Mikolov T, Chen K, Corrado G, et al. Efficient Estimation of Word Representations in Vector Space[J]. Computer Science, 2013.
3. Mikolov T, Sutskever I, Chen K, et al. Distributed Representations of Words and Phrases and their Compositionality[C]// Advances in Neural Information Processing Systems. 2013, 26: 3111-3119.
4. Pennington J, Socher R, Manning C. GloVe: Global Vectors for Word Representation[C]// Conference on Empirical Methods in Natural Language Processing. 2014: 1532-1543.