Word Vector Representations

Preface

Both the exercise and the previous post used an Embedding layer, so this post is a good opportunity to deepen the understanding of word vectors and the embedding matrix.

Cosine Similarity

To measure how similar two words are, we need a way to compare their embedding vectors. Given two vectors $u$ and $v$, their cosine similarity is defined as:
$$
\text{CosineSimilarity}(u, v) = \frac{u \cdot v}{||u||_2 \, ||v||_2} = \cos(\theta)
$$
The numerator is the dot product of the two vectors, the denominator is the product of their L2 norms, and $\theta$ is the angle between them. The more similar the two vectors are, the closer the cosine similarity is to 1; for dissimilar vectors the value is much smaller.

import numpy as np

def cosine_similarity(u, v):
    # Compute the dot product between u and v
    dot = np.dot(u, v)
    # Compute the L2 norm of u
    norm_u = np.sqrt(np.sum(np.power(u, 2)))
    # Compute the L2 norm of v
    norm_v = np.sqrt(np.sum(np.power(v, 2)))
    # Compute the cosine similarity defined by the formula above
    cosine_similarity = np.divide(dot, norm_u * norm_v)

    return cosine_similarity
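A quick sanity check of the function (a sketch: it assumes a dictionary word_to_vec_map mapping each word to its embedding vector, e.g. loaded from pre-trained GloVe vectors, which is also what the rest of this post uses):

```python
father = word_to_vec_map["father"]
mother = word_to_vec_map["mother"]
ball = word_to_vec_map["ball"]
crocodile = word_to_vec_map["crocodile"]

# Related word pairs should score noticeably higher than unrelated ones
print("cosine_similarity(father, mother) =", cosine_similarity(father, mother))
print("cosine_similarity(ball, crocodile) =", cosine_similarity(ball, crocodile))
```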

Word Analogy Task

man is to woman as king is to queen: given the words a (man), b (woman), and c (king), we want to find the word d that satisfies $e_b - e_a \approx e_d - e_c$. Cosine similarity is used to measure how close $e_b - e_a$ is to $e_d - e_c$.

def complete_analogy(word_a, word_b, word_c, word_to_vec_map):
    # convert words to lower case
    word_a, word_b, word_c = word_a.lower(), word_b.lower(), word_c.lower()

    # Get the word embeddings e_a, e_b and e_c
    e_a, e_b, e_c = word_to_vec_map[word_a], word_to_vec_map[word_b], word_to_vec_map[word_c]

    words = word_to_vec_map.keys()
    max_cosine_sim = -100   # Initialize max_cosine_sim to a large negative number
    best_word = None        # Initialize best_word with None; it keeps track of the word to output

    # loop over the whole word vector set
    for w in words:
        # to avoid best_word being one of the input words, skip them
        if w in [word_a, word_b, word_c]:
            continue

        # Compute cosine similarity between the vector (e_b - e_a) and the vector (e_w - e_c)
        cosine_sim = cosine_similarity(e_b - e_a, word_to_vec_map[w] - e_c)

        # If this similarity is the best seen so far, remember it and the word
        if cosine_sim > max_cosine_sim:
            max_cosine_sim = cosine_sim
            best_word = w

    return best_word
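Usage is straightforward (again assuming word_to_vec_map has been loaded; the triples below are just illustrative):

```python
triads = [("italy", "italian", "spain"),
          ("man", "woman", "king"),
          ("small", "smaller", "large")]

for a, b, c in triads:
    # Print the word d that best completes "a is to b as c is to d"
    d = complete_analogy(a, b, c, word_to_vec_map)
    print(f"{a} -> {b} :: {c} -> {d}")
```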

Debiasing Word Vectors

First, compute a vector $g = e_{woman} - e_{man}$, which can roughly be interpreted as the "gender" direction. We can also compute, at the same time:

  • $g_1 = e_{mother}-e_{father}$

  • $g_2 = e_{girl}-e_{boy}$

Taking the mean of these three vectors as the gender direction is then more precise.
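A minimal sketch of this averaging (the variable name g_avg is mine; everywhere else this post simply uses $g = e_{woman} - e_{man}$):

```python
# The three rough "gender" directions defined above
g  = word_to_vec_map["woman"]  - word_to_vec_map["man"]
g1 = word_to_vec_map["mother"] - word_to_vec_map["father"]
g2 = word_to_vec_map["girl"]   - word_to_vec_map["boy"]

# Averaging them gives a smoother estimate of the gender direction
g_avg = (g + g1 + g2) / 3.0
```

We can sanity-check the idea by looking at how a few first names align with $g$: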

name_list = ['john', 'marie', 'sophie', 'ronaldo', 'priya', 'rahul', 'danielle', 'reza', 'katy', 'yasmin']

for w in name_list:
    print(w, cosine_similarity(word_to_vec_map[w], g))
List of names and their similarities with constructed vector:
john [-0.23163356]
marie [0.31559794]
sophie [0.3186879]
ronaldo [-0.31244797]
priya [0.17632042]
rahul [-0.16915471]
danielle [0.24393299]
reza [-0.0793043]
katy [0.28310687]
yasmin [0.23313858]

As the output shows, the more feminine names have a similarity with $g$ greater than 0, while the more masculine names have a similarity less than 0.

Neutralizing Bias for Non-Gendered Words

Below are the similarities between a few words and the gender direction. Even though most engineers are indeed male, these associations amount to gender stereotypes, and words like these should not carry any gender at all.
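These numbers can be reproduced with a loop just like the one used for the names above (a sketch, reusing the same word_to_vec_map and g):

```python
word_list = ["receptionist", "technology", "teacher", "engineer"]

for w in word_list:
    # Positive values lean toward the "woman" end of g, negative toward the "man" end
    print(w, cosine_similarity(word_to_vec_map[w], g))
```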

receptionist [0.33077942]
technology [-0.13193732]
teacher [0.17920923]
engineer [-0.0803928]

If the word embeddings are 50-dimensional, each embedding can be split into two parts: the component along the bias direction $g$ and the remaining 49 dimensions, denoted $g_{\perp}$. Those 49 dimensions are unrelated to gender and are orthogonal to $g$. The task is therefore to zero out the component of $e_{receptionist}$ along $g$, obtaining $e_{receptionist}^{debiased}$:


$$
e^{bias\_component} = \frac{e \cdot g}{||g||_2^2} * g
$$

$$
e^{debiased} = e - e^{bias\_component}
$$

$e^{bias\_component}$ is simply the projection of $e$ onto the direction $g$.

def neutralize(word, g, word_to_vec_map):
    # Select the word vector representation of "word"
    e = word_to_vec_map[word]

    # Compute e_biascomponent using the formula given above
    e_biascomponent = np.divide(np.dot(e, g), np.linalg.norm(g)**2) * g

    # Neutralize e by subtracting e_biascomponent from it;
    # e_debiased is the projection of e onto the subspace orthogonal to g
    e_debiased = e - e_biascomponent

    return e_debiased
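A quick before/after check (a sketch, reusing the g and word_to_vec_map from above):

```python
w = "receptionist"
print("cosine similarity between receptionist and g, before neutralizing:",
      cosine_similarity(word_to_vec_map[w], g))

e_debiased = neutralize(w, g, word_to_vec_map)
print("cosine similarity between receptionist and g, after neutralizing:",
      cosine_similarity(e_debiased, g))
```

After neutralization the second value should be essentially 0 (up to floating-point error), since the component of the embedding along $g$ has been removed.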

Equalization Algorithm for Gender-Specific Words

The equalization algorithm applies to pairs of words that differ only by gender, such as actor and actress. For example, actress may end up closer to babysit than actor is; neutralizing babysit reduces its association with gender, but it still does not guarantee that the two words of the pair are equally similar to other words. The equalization algorithm handles exactly this problem.

The idea is to make the two words equidistant from the 49-dimensional $g_\perp$; the formulas follow Bolukbasi et al., 2016:
$$
\mu = \frac{e_{w1} + e_{w2}}{2}
$$

$$
\mu_{B} = \frac {\mu \cdot \text{bias_axis}}{||\text{bias_axis}||_2^2} *\text{bias_axis}
$$

$$
\mu_{\perp} = \mu - \mu_{B}
$$

$$
e_{w1B} = \frac {e_{w1} \cdot \text{bias_axis}}{||\text{bias_axis}||_2^2} *\text{bias_axis}
$$

$$
e_{w2B} = \frac {e_{w2} \cdot \text{bias_axis}}{||\text{bias_axis}||_2^2} *\text{bias_axis}
$$

$$
e_{w1B}^{corrected} = \sqrt{|1 - ||\mu_{\perp}||_2^2|} * \frac{e_{w1B} - \mu_B}{||(e_{w1} - \mu_{\perp}) - \mu_B||_2}
$$

$$
e_{w2B}^{corrected} = \sqrt{|1 - ||\mu_{\perp}||_2^2|} * \frac{e_{w2B} - \mu_B}{||(e_{w2} - \mu_{\perp}) - \mu_B||_2}
$$

$$
e_1 = e_{w1B}^{corrected} + \mu_{\perp}
$$

$$
e_2 = e_{w2B}^{corrected} + \mu_{\perp}
$$

def equalize(pair, bias_axis, word_to_vec_map):
    # Step 1: Select the word vector representations of the pair
    w1, w2 = pair
    e_w1, e_w2 = word_to_vec_map[w1], word_to_vec_map[w2]

    # Step 2: Compute the mean of e_w1 and e_w2
    mu = (e_w1 + e_w2) / 2.0

    # Step 3: Compute the projections of mu onto the bias axis and the orthogonal axis
    mu_B = np.divide(np.dot(mu, bias_axis), np.linalg.norm(bias_axis)**2) * bias_axis
    mu_orth = mu - mu_B

    # Step 4: Compute e_w1B and e_w2B (projections onto the bias axis)
    e_w1B = np.divide(np.dot(e_w1, bias_axis), np.linalg.norm(bias_axis)**2) * bias_axis
    e_w2B = np.divide(np.dot(e_w2, bias_axis), np.linalg.norm(bias_axis)**2) * bias_axis

    # Step 5: Adjust the bias part of e_w1B and e_w2B using the corrected-projection formulas above
    # (the denominator is a vector norm, matching the formulas)
    corrected_e_w1B = np.sqrt(np.abs(1 - np.sum(mu_orth**2))) * np.divide(e_w1B - mu_B, np.linalg.norm(e_w1 - mu_orth - mu_B))
    corrected_e_w2B = np.sqrt(np.abs(1 - np.sum(mu_orth**2))) * np.divide(e_w2B - mu_B, np.linalg.norm(e_w2 - mu_orth - mu_B))

    # Step 6: Debias by adding the orthogonal component back to the corrected projections
    e1 = corrected_e_w1B + mu_orth
    e2 = corrected_e_w2B + mu_orth

    return e1, e2
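A corresponding check for equalize (a sketch, using the pair ("man", "woman") and the same g):

```python
print("Similarities before equalizing:")
print("  man   :", cosine_similarity(word_to_vec_map["man"], g))
print("  woman :", cosine_similarity(word_to_vec_map["woman"], g))

e1, e2 = equalize(("man", "woman"), g, word_to_vec_map)
print("Similarities after equalizing:")
print("  man   :", cosine_similarity(e1, g))
print("  woman :", cosine_similarity(e2, g))
```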

After equalization, the similarities of two such gender-paired words with the gender direction should be roughly equal in magnitude and opposite in sign.

References

  1. Andrew Ng. DeepLearning.

  2. Bolukbasi et al. Man is to Computer Programmer as Woman is to Homemaker? Debiasing Word Embeddings. NeurIPS 2016.