Explaining CLIP Through Simulation: How Do Gradients Raise the Similarity of Positive Samples?

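For background on CLIP, see the author's other blog posts: "Core CLIP Training Code and an Explanation of the Contrastive Loss (Chinese-English bilingual)" and "Contrastive Loss and Large Models (Chinese-English bilingual)".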

How Cross-Entropy Loss Works in CLIP

Similarity matrix (logits):
- logits_per_image is a matrix of shape $\text{batch\_size} \times \text{batch\_size}$.
- For example, with a batch size of 4:
  $\text{logits\_per\_image} = \begin{bmatrix} 1.2 & 0.3 & -0.8 & 0.5 \\ 0.4 & 1.5 & 0.1 & -0.2 \\ 0.0 & -0.3 & 2.0 & 0.6 \\ -0.5 & 0.7 & 0.8 & 1.3 \end{bmatrix}$
- The diagonal entries are the similarities of the positive (matched image-text) pairs; all off-diagonal entries are similarities of negative pairs.
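Computing the cross-entropy loss:
- For each row $i$, the loss pushes the entry in column $i$ (the positive sample) up and the entries in the other columns (the negative samples) down.
- The formula is:
  $\text{CrossEntropyLoss} = -\frac{1}{N} \sum_{i=1}^N \log \frac{\exp(\text{logits}[i, i])}{\sum_{j=1}^N \exp(\text{logits}[i, j])}$
- The denominator (the normalization term) couples the positive and negative similarities, so they compete for probability mass.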
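The formula can be checked numerically on the 4 x 4 example matrix. The following is a minimal NumPy sketch, not CLIP's actual training code; the helper name `cross_entropy_diagonal` is hypothetical and only covers the image-to-text direction discussed here.

```python
import numpy as np

# The 4x4 example matrix from the text; rows are images, columns are texts.
logits_per_image = np.array([
    [ 1.2,  0.3, -0.8,  0.5],
    [ 0.4,  1.5,  0.1, -0.2],
    [ 0.0, -0.3,  2.0,  0.6],
    [-0.5,  0.7,  0.8,  1.3],
])

def cross_entropy_diagonal(logits):
    """Mean of -log softmax(logits)[i, i] over all rows (hypothetical helper)."""
    # Subtracting the row maximum keeps exp() stable and leaves softmax unchanged.
    shifted = logits - logits.max(axis=1, keepdims=True)
    log_probs = shifted - np.log(np.exp(shifted).sum(axis=1, keepdims=True))
    # The positive sample of row i sits in column i, i.e. on the diagonal.
    return -np.mean(np.diag(log_probs))

loss = cross_entropy_diagonal(logits_per_image)
print(f"loss = {loss:.4f}")
```

Because every diagonal entry is the largest value in its row, each positive probability exceeds 1/4, so the printed loss should be below $\log 4$.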
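Adjusting the distance between positive and negative samples:
- Pulling positives closer: maximize the softmax probability of the positive sample (the diagonal entry).
- Pushing negatives away: suppress the other (negative) entries so that their softmax probabilities approach 0.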

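In the example below, we use gradient descent to optimize the positive-sample scores in logits_per_image (the diagonal entries, such as 2.0). The steps are worked out in detail: computing the gradient, applying the update, and checking how the loss changes in the next round. For simplicity the logits themselves are treated as the quantities being updated; in actual training the gradient flows back through the logits to the encoder weights.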

Assumptions

Initial similarity matrix (logits_per_image):
$\text{logits\_per\_image} = \begin{bmatrix} 2.0 & 0.5 & -1.0 \\ 0.3 & 1.8 & 0.2 \\ -0.5 & 0.4 & 1.5 \end{bmatrix}$
Learning rate: $\eta = 0.1$
Batch size: $\text{batch\_size} = 3$
Goal: with one round of gradient descent, raise the similarity of the positive samples while lowering that of the negative samples.

Step 1: Compute the Gradient

Softmax probabilities

Take the first row, $\text{logits}[0, :]$, as an example:
$\text{logits}[0, :] = [2.0, 0.5, -1.0]$
The softmax probability of entry $j$ in row $i$ is:
$P[i, j] = \frac{\exp(\text{logits}[i, j])}{\sum_{k} \exp(\text{logits}[i, k])}$
Denominator:
$\text{denominator} = \exp(2.0) + \exp(0.5) + \exp(-1.0) \approx 7.389 + 1.649 + 0.368 = 9.406$
Positive sample, $P(\text{positive}) = P[0, 0]$:
$P(\text{positive}) = \frac{\exp(2.0)}{\text{denominator}} = \frac{7.389}{9.406} \approx 0.785$
Negative sample, $P(\text{negative}, j=1) = P[0, 1]$:
$P(\text{negative}, j=1) = \frac{\exp(0.5)}{\text{denominator}} = \frac{1.649}{9.406} \approx 0.175$
Negative sample, $P(\text{negative}, j=2) = P[0, 2]$:
$P(\text{negative}, j=2) = \frac{\exp(-1.0)}{\text{denominator}} = \frac{0.368}{9.406} \approx 0.039$
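The hand calculation for the first row can be reproduced in a few lines. This is a small NumPy sketch; the variable names are chosen here for illustration.

```python
import numpy as np

# logits[0, :] from the 3x3 example; index 0 is the positive sample.
row = np.array([2.0, 0.5, -1.0])

exp_row = np.exp(row)            # exp of each logit
denominator = exp_row.sum()      # softmax normalization term
probs = exp_row / denominator    # softmax probabilities for this row

print("denominator:", round(float(denominator), 3))
print("probabilities:", np.round(probs, 3))
```

The printed values should agree with the hand calculation up to rounding in the last digit.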

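Gradient of the cross-entropy loss with respect to the logits

Cross-entropy loss for row $i$:
$\text{Loss}_i = -\log(P(\text{positive}))$
Its gradient with respect to the logits in row $i$ is:
$\frac{\partial \text{Loss}_i}{\partial \text{logits}[i, j]} = P[i, j] - \delta_{i, j}$
where $\delta_{i, j}$ is the Kronecker delta: it is 1 at the positive-sample (diagonal) position $j = i$ and 0 everywhere else.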
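
The gradient formula follows directly once the loss for row $i$ is written out in terms of the logits. A short derivation, with $P[i, j]$ denoting the softmax probability of entry $j$ in row $i$:
$\text{Loss}_i = -\log \frac{\exp(\text{logits}[i, i])}{\sum_{k} \exp(\text{logits}[i, k])} = -\text{logits}[i, i] + \log \sum_{k} \exp(\text{logits}[i, k])$
$\frac{\partial \text{Loss}_i}{\partial \text{logits}[i, j]} = -\delta_{i, j} + \frac{\exp(\text{logits}[i, j])}{\sum_{k} \exp(\text{logits}[i, k])} = P[i, j] - \delta_{i, j}$
The first term contributes $-1$ only at the positive position, and the log-sum-exp term contributes the softmax probability at every position.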

For the first row ($i = 0$):

Positive sample ($j = 0$):
$\frac{\partial \text{Loss}_0}{\partial \text{logits}[0, 0]} = P[0, 0] - 1 = 0.785 - 1 = -0.215$
Negative sample ($j = 1$):
$\frac{\partial \text{Loss}_0}{\partial \text{logits}[0, 1]} = P[0, 1] = 0.175$
Negative sample ($j = 2$):
$\frac{\partial \text{Loss}_0}{\partial \text{logits}[0, 2]} = P[0, 2] = 0.039$
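
One way to gain confidence in these gradient values is to compare the softmax-minus-one-hot formula against central finite differences. The sketch below does that for the first row; `softmax` and `row_loss` are hypothetical helper names, and the step size `eps` is an arbitrary choice.

```python
import numpy as np

def softmax(x):
    # Shift by the maximum for numerical stability.
    e = np.exp(x - x.max())
    return e / e.sum()

def row_loss(row, positive_index):
    # Cross-entropy loss of one row: -log of the positive-sample probability.
    return -np.log(softmax(row)[positive_index])

row = np.array([2.0, 0.5, -1.0])   # logits[0, :]
positive_index = 0

# Analytic gradient: softmax probabilities minus a one-hot vector.
one_hot = np.zeros_like(row)
one_hot[positive_index] = 1.0
analytic = softmax(row) - one_hot

# Numerical gradient by central finite differences.
eps = 1e-6
numeric = np.zeros_like(row)
for j in range(len(row)):
    step = np.zeros_like(row)
    step[j] = eps
    loss_plus = row_loss(row + step, positive_index)
    loss_minus = row_loss(row - step, positive_index)
    numeric[j] = (loss_plus - loss_minus) / (2 * eps)

print("analytic:", np.round(analytic, 3))
print("numeric: ", np.round(numeric, 3))
```

The gradient is negative at the positive position and positive at the negative positions, and its entries sum to zero, so a descent step moves the positive logit up and the negative logits down.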

Step 2: Update the Logits

Apply the gradient descent update:
$\text{logits}[i, j] = \text{logits}[i, j] - \eta \cdot \frac{\partial \text{Loss}_i}{\partial \text{logits}[i, j]}$

For the first row:

Positive sample ($j = 0$):
$\text{logits}[0, 0] = 2.0 - 0.1 \cdot (-0.215) = 2.0 + 0.0215 = 2.0215$
Negative sample ($j = 1$):
$\text{logits}[0, 1] = 0.5 - 0.1 \cdot 0.175 = 0.5 - 0.0175 = 0.4825$
Negative sample ($j = 2$):
$\text{logits}[0, 2] = -1.0 - 0.1 \cdot 0.039 = -1.0 - 0.0039 = -1.0039$

First row of the logits after the update:
$\text{logits}[0, :] = [2.0215, 0.4825, -1.0039]$
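
The update is a single vectorized subtraction. The sketch below plugs in the rounded gradient values from Step 1, so it reproduces the hand-computed numbers rather than full-precision ones.

```python
import numpy as np

learning_rate = 0.1                         # eta from the assumptions
row = np.array([2.0, 0.5, -1.0])            # logits[0, :] before the update
grad = np.array([-0.215, 0.175, 0.039])     # rounded gradient from Step 1

# Gradient descent step: move against the gradient.
updated = row - learning_rate * grad
print(updated)
```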

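Step 3: Loss in the Next Round

Recompute the softmax probabilities and the loss with the updated logits:
$\text{logits}[0, :] = [2.0215, 0.4825, -1.0039]$

Denominator:
$\text{denominator} = \exp(2.0215) + \exp(0.4825) + \exp(-1.0039) \approx 7.550 + 1.620 + 0.366 = 9.536$
Positive-sample probability:
$P(\text{positive}) = \frac{\exp(2.0215)}{\text{denominator}} = \frac{7.550}{9.536} \approx 0.792$
Loss:
$\text{Loss} = -\log(P(\text{positive})) = -\log(0.792) \approx 0.233$
Before the update the loss for this row was $-\log(0.785) \approx 0.242$, so a single step has raised the positive-sample probability from 0.785 to 0.792 and lowered the loss.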
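
Repeating the three steps on the whole 3 x 3 matrix shows the same trend over several rounds. This is a simulation sketch under the assumptions of this post: the logits are updated directly, each row uses its own gradient without the 1/N factor, and the number of rounds (5) is an arbitrary choice. The helper names are hypothetical.

```python
import numpy as np

def softmax_rows(logits):
    # Row-wise softmax with the usual max shift for stability.
    shifted = logits - logits.max(axis=1, keepdims=True)
    e = np.exp(shifted)
    return e / e.sum(axis=1, keepdims=True)

def mean_loss(logits):
    # Mean over rows of -log P[i, i], the cross-entropy with diagonal targets.
    return -np.mean(np.log(np.diag(softmax_rows(logits))))

logits = np.array([
    [ 2.0, 0.5, -1.0],
    [ 0.3, 1.8,  0.2],
    [-0.5, 0.4,  1.5],
])
initial_diagonal = np.diag(logits).copy()
learning_rate = 0.1

losses = [mean_loss(logits)]
for _ in range(5):
    # Step 1: per-row gradient P - I (Kronecker delta as an identity matrix).
    grad = softmax_rows(logits) - np.eye(len(logits))
    # Step 2: gradient descent update applied directly to the logits.
    logits = logits - learning_rate * grad
    # Step 3: loss in the next round.
    losses.append(mean_loss(logits))

print("losses:", [round(float(l), 4) for l in losses])
print("diagonal before:", initial_diagonal)
print("diagonal after: ", np.round(np.diag(logits), 4))
```

The loss should shrink a little every round while the diagonal entries grow, which is the "pull positives closer, push negatives away" behavior described above.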