How do you fix NaN values returned by the dot_score function in the sentence-transformers library?

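Symptoms and reproduction

When computing sentence-embedding similarity with dot_score from sentence-transformers, developers sometimes get NaN (Not a Number) in the result. A typical report looks like the following. Note that the RuntimeWarning wording is what NumPy prints; PyTorch tensors generally produce NaN without any warning: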

RuntimeWarning: invalid value encountered in matmul
similarity = torch.matmul(embeddings1, embeddings2.T)

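Root cause analysis

Based on the source code and user reports, NaN results trace back to three main causes:

Numeric overflow: float32 can represent magnitudes only up to about 3.4e38 (float16 only up to 65504). When a product or partial sum in the dot product exceeds that range it becomes inf, and a later inf - inf or 0 * inf yields NaN.
Unnormalized input: embeddings that are not L2-normalized can have large norms, so their dot products grow large and are more likely to overflow, especially in reduced precision.
Degenerate or corrupted vectors: an all-zero vector turns into NaN when it is divided by its own zero norm, and any inf or NaN already present in an embedding propagates into the result.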
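
To make these causes concrete, the following sketch reproduces them with plain PyTorch tensors. The values are invented for illustration and do not come from a real model, and element-wise operations are used instead of dot_score so that the intermediate inf values stay visible.

```python
import torch

# Overflow: float32 tops out near 3.4e38, so these products become
# +inf and -inf, and adding them together gives NaN.
a = torch.tensor([3e38, -3e38])
b = torch.tensor([10.0, 10.0])
products = a * b
overflow_result = products.sum()

# Zero vector: dividing by its own (zero) norm is 0/0, which is NaN.
zero = torch.zeros(3)
zero_normalized = zero / zero.norm()

# Corrupted input: a NaN already present in an embedding propagates.
corrupted = torch.tensor([1.0, float("nan")])
propagated = (corrupted * torch.tensor([1.0, 1.0])).sum()

print(products, overflow_result, zero_normalized, propagated)
```

Unnormalized input does not appear separately here because it is not a direct source of NaN: it only makes the overflow case easier to reach.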

Five solutions

1. Force normalization

from sentence_transformers.util import normalize_embeddings
embeddings = normalize_embeddings(embeddings)

2. Value clamping

import torch
embeddings = torch.clamp(embeddings, min=-1e4, max=1e4)
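
Clamping bounds finite values and infinities, but torch.clamp leaves a NaN entry as NaN. A minimal sketch of one way to handle both is shown here; the helper name sanitize_embeddings is hypothetical, and replacing NaN with 0.0 is an assumption that may hide an upstream bug, so it is best paired with logging or a check.

```python
import torch

def sanitize_embeddings(emb: torch.Tensor, limit: float = 1e4) -> torch.Tensor:
    """Hypothetical helper: replace NaN/inf, then bound entries to [-limit, limit]."""
    # nan_to_num maps NaN -> 0.0, +inf -> limit, -inf -> -limit
    emb = torch.nan_to_num(emb, nan=0.0, posinf=limit, neginf=-limit)
    # clamp then bounds the remaining large finite values
    return torch.clamp(emb, min=-limit, max=limit)

raw = torch.tensor([[float("nan"), float("inf"), -5e6, 0.5]])
clean = sanitize_embeddings(raw)
print(clean)
```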

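3. Invalid-value detection

import torch

def check_nan(emb):
    # isfinite is False for NaN as well as for +inf and -inf
    if not torch.isfinite(emb).all():
        raise ValueError("embeddings contain NaN or inf values")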
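
Raising an error says that something is wrong but not which sentence caused it. The sketch below returns the row indices of problematic embeddings so the offending inputs can be inspected; find_bad_rows is a hypothetical helper, and treating zero-norm rows as bad is an assumption that fits the root causes listed earlier.

```python
import torch

def find_bad_rows(emb: torch.Tensor) -> torch.Tensor:
    """Hypothetical helper: indices of rows with NaN/inf entries or zero norm."""
    non_finite = ~torch.isfinite(emb).all(dim=1)
    zero_norm = emb.norm(dim=1) == 0
    return torch.nonzero(non_finite | zero_norm).flatten()

emb = torch.tensor([
    [0.1, 0.2],
    [float("nan"), 0.3],
    [0.0, 0.0],
    [float("inf"), 1.0],
])
bad = find_bad_rows(emb)
print(bad.tolist())
```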

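4. Anomaly detection mode

torch.autograd.detect_anomaly() is a debugging aid: it reports the operation that produced NaN during the backward pass, so it helps only when gradients are computed, and it slows execution.

import torch
from sentence_transformers.util import dot_score
with torch.autograd.detect_anomaly():
    similarity = dot_score(emb1, emb2)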

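5. Alternative similarity measure

from scipy.spatial.distance import cosine
# cosine() takes two 1-D arrays, so pass single vectors (e.g. emb1.cpu().numpy())
similarity = 1 - cosine(emb1, emb2)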
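
For batches of embeddings that are already PyTorch tensors, a comparable cosine score can be computed without leaving PyTorch. This is a sketch that swaps scipy for torch.nn.functional.cosine_similarity; the helper name pairwise_cosine is hypothetical, and the eps argument guards the division when a row has zero norm.

```python
import torch
import torch.nn.functional as F

def pairwise_cosine(emb1: torch.Tensor, emb2: torch.Tensor, eps: float = 1e-8) -> torch.Tensor:
    """Hypothetical helper: cosine similarity of every row of emb1 with every row of emb2."""
    # Broadcast (n, 1, d) against (1, m, d), then reduce over the feature axis.
    return F.cosine_similarity(emb1.unsqueeze(1), emb2.unsqueeze(0), dim=2, eps=eps)

a = torch.tensor([[1.0, 0.0], [0.0, 0.0]])  # second row is a zero vector
b = torch.tensor([[2.0, 0.0], [0.0, 3.0]])
sim = pairwise_cosine(a, b)
print(sim)
```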

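Performance recommendations

Run torch.nn.functional.normalize once on the embeddings before batching, rather than inside every similarity call.
Use torch.use_deterministic_algorithms(True) while debugging so that runs are reproducible.
In FP16, pass a larger guard value such as eps=1e-4, because the default eps=1e-12 of torch.nn.functional.normalize is below the smallest positive FP16 value (about 6e-8).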
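
Another way to sidestep the FP16 range limit is to upcast before doing any arithmetic. Note that this sketch swaps in upcasting rather than tuning eps: it assumes the extra memory for float32 copies is acceptable, and dot_score_fp32 is a hypothetical helper name.

```python
import torch
import torch.nn.functional as F

def dot_score_fp32(emb1: torch.Tensor, emb2: torch.Tensor) -> torch.Tensor:
    """Hypothetical helper: upcast to float32, L2-normalize, then take dot products."""
    emb1 = F.normalize(emb1.float(), p=2, dim=1)
    emb2 = F.normalize(emb2.float(), p=2, dim=1)
    return emb1 @ emb2.T

# 300 * 300 = 90000 exceeds the float16 maximum of 65504,
# so squaring these entries directly in float16 would overflow.
half = torch.tensor([[300.0, 300.0], [0.0, 0.0]], dtype=torch.float16)
scores = dot_score_fp32(half, half)
print(scores)
```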

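Best-practice example

import torch
import torch.nn.functional as F

def safe_dot_score(emb1, emb2):
    # L2-normalize each row so every dot product is a cosine similarity
    emb1 = F.normalize(emb1, p=2, dim=1)
    emb2 = F.normalize(emb2, p=2, dim=1)
    return torch.matmul(emb1, emb2.T).clamp(-1.0, 1.0)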