A Detailed Walkthrough of the Official CLIP GitHub Code: README

Table of contents

I. Usage
- 1. conda install --yes -c pytorch pytorch=1.7.1 torchvision cudatoolkit=11.0
- 2. Example code
II. API
- 1. clip.available_models()
- 2. clip.load(name, device=..., jit=False)
- 3. Methods supported by the model returned by clip.load()
III. More examples
- 1. Zero-shot prediction
- 2. Linear-probe evaluation
IV. See also

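I. Usage

1. conda install --yes -c pytorch pytorch=1.7.1 torchvision cudatoolkit=11.0

conda: a package manager used to manage Python environments and install packages.
install: the subcommand that installs packages.
--yes: automatically confirms every prompt during installation, so no manual confirmation is needed.
-c pytorch: installs packages from the channel (repository) named pytorch. Conda can fetch packages from different channels.
pytorch=1.7.1: the package to install and its version, here PyTorch 1.7.1.
torchvision: another package to install, usually used together with PyTorch; it provides computer-vision utilities and datasets.
cudatoolkit=11.0: the version of the CUDA toolkit to install, here 11.0. CUDA is the toolkit for GPU-accelerated computing.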
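
Before running the examples, it can help to confirm that the packages actually landed in the active environment. The following is a minimal sketch using only the standard library. The helper name missing_packages is hypothetical, and the list of import names is an assumption for this walkthrough: it covers the conda command above plus ftfy, regex, and tqdm, which the upstream README installs with pip, and the clip package itself.

```python
import importlib.util


def missing_packages(names):
    """Return the import names from `names` that cannot be found in this environment."""
    # find_spec only locates a top-level module; it does not import it,
    # so the check stays cheap even for large packages such as torch.
    return [name for name in names if importlib.util.find_spec(name) is None]


if __name__ == "__main__":
    # Import names assumed for this walkthrough (not an official requirements list).
    required = ["torch", "torchvision", "clip", "PIL", "ftfy", "regex", "tqdm"]
    missing = missing_packages(required)
    if missing:
        print("Missing packages:", ", ".join(missing))
    else:
        print("All packages found.")
```

Any name reported as missing would need to be installed before the example code in the next subsection can run.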

2. Example code

import torch  # PyTorch, for deep-learning operations
import clip  # CLIP, the image-text model
from PIL import Image  # Image module from PIL, for image handling

# Use CUDA if a GPU is available, otherwise fall back to the CPU
device = "cuda" if torch.cuda.is_available() else "cpu"

# Load the CLIP model (ViT-B/32 architecture) and its preprocessing function onto the device
model, preprocess = clip.load("ViT-B/32", device=device)

# Open the image file, preprocess it, add a batch dimension, and move it to the device
image = preprocess(Image.open("CLIP.png")).unsqueeze(0).to(device)

# Tokenize the texts into the format the model expects and move them to the device
text = clip.tokenize(["a diagram", "a dog", "a cat"]).to(device)

# Run inference without tracking gradients, which saves memory and compute
with torch.no_grad():
    # Encode the image into image features
    image_features = model.encode_image(image)
    # Encode the texts into text features
    text_features = model.encode_text(text)

    # Compute the contrastive logits between the image and the texts
    logits_per_image, logits_per_text = model(image, text)
    # Softmax over the logits gives a probability distribution over the text labels
    probs = logits_per_image.softmax(dim=-1).cpu().numpy()

# Print the predicted probability for each label
print("Label probs:", probs)  # prints: [[0.9927937  0.00421068 0.00299572]]

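II. API

The CLIP module clip provides the following methods:

1. clip.available_models()

Returns the names of the available CLIP models.

2. clip.load(name, device=..., jit=False)

Returns the model and the TorchVision transform the model requires, specified by a model name returned by clip.available_models(). It downloads the model as needed. The name argument can also be a path to a local checkpoint. The device to run the model on can optionally be specified; the default is the first CUDA device if one is available, otherwise the CPU. When jit is False, a non-JIT version of the model is loaded.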
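The JIT version of CLIP is the model compiled with PyTorch's just-in-time (JIT) compiler. CLIP (Contrastive Language-Image Pre-training) is a model that combines text and images: it embeds both in a shared feature space so that an image can be matched against text descriptions. It scores image-text pairs rather than generating text itself.
The JIT version of CLIP usually implies:
Performance optimization: with JIT compilation, the model can execute faster at inference time and with lower latency.
Graph-level optimization: PyTorch's JIT captures the model as a TorchScript graph, which it can then optimize for more efficient execution.
Better resource utilization: the compiled model can make more effective use of hardware such as GPUs.
Overall, the JIT version of CLIP aims to improve the model's speed and efficiency, especially when processing large amounts of data.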

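3. The model returned by clip.load() supports the following methods:

model.encode_image(image: Tensor)

Given a batch of images, returns the image features encoded by the vision portion of the CLIP model.

model.encode_text(text: Tensor)

Given a batch of text tokens, returns the text features encoded by the language portion of the CLIP model.

model(image: Tensor, text: Tensor)

Given a batch of images and a batch of text tokens, returns two tensors containing the logit scores for each image and each text input. The values are the cosine similarities between the corresponding image and text features, multiplied by 100.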
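
To make the "cosine similarity multiplied by 100" description concrete, here is a small pure-Python sketch of how one row of such logits could be computed. It is an illustration of the arithmetic only, not the model's actual implementation; the function names and the toy 3-dimensional vectors are assumptions made for this example.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length, non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def scaled_logits(image_feature, text_features, scale=100.0):
    """One row of logits: scaled cosine similarity of one image to each text."""
    return [scale * cosine_similarity(image_feature, t) for t in text_features]


# Toy 3-dimensional "features"; real CLIP features have far more dimensions.
image_feature = [1.0, 0.0, 0.0]
text_features = [
    [2.0, 0.0, 0.0],   # same direction as the image -> cosine 1
    [0.0, 3.0, 0.0],   # orthogonal to the image -> cosine 0
    [-1.0, 0.0, 0.0],  # opposite direction -> cosine -1
]
print(scaled_logits(image_feature, text_features))  # prints [100.0, 0.0, -100.0]
```

Because cosine similarity ignores vector length, the logits depend only on the direction of the features, which is why the scores are bounded between -100 and 100 in this sketch.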

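III. More examples

1. Zero-shot prediction

The code below performs zero-shot prediction with CLIP, as described in Appendix B of the CLIP paper. The example takes an image from the CIFAR-100 dataset and predicts the most likely label among the dataset's 100 text labels. Note that it uses the encode_image() and encode_text() methods, which return the encoded features of the given inputs.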

import os
import clip
import torch
from torchvision.datasets import CIFAR100

# Load the model
device = "cuda" if torch.cuda.is_available() else "cpu"
model, preprocess = clip.load('ViT-B/32', device)

# Download the dataset
cifar100 = CIFAR100(root=os.path.expanduser("~/.cache"), download=True, train=False)

# Prepare the inputs
image, class_id = cifar100[3637]
image_input = preprocess(image).unsqueeze(0).to(device)
text_inputs = torch.cat([clip.tokenize(f"a photo of a {c}") for c in cifar100.classes]).to(device)

# Calculate features
with torch.no_grad():
    image_features = model.encode_image(image_input)
    text_features = model.encode_text(text_inputs)

# Pick the top 5 most similar labels for the image
image_features /= image_features.norm(dim=-1, keepdim=True)
text_features /= text_features.norm(dim=-1, keepdim=True)
similarity = (100.0 * image_features @ text_features.T).softmax(dim=-1)
values, indices = similarity[0].topk(5)

# Print the result
print("\nTop predictions:\n")
for value, index in zip(values, indices):
    print(f"{cifar100.classes[index]:>16s}: {100 * value.item():.2f}%")

The output will look like the following (the exact numbers may differ slightly depending on the compute device):

Top predictions:

           snake: 65.31%
          turtle: 12.29%
    sweet_pepper: 3.83%
          lizard: 1.88%
       crocodile: 1.75%
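
The last two steps of the example, a softmax over the scaled similarities followed by picking the highest-probability labels, can be sketched without PyTorch as follows. This is a simplified illustration rather than the library's code; the helper names, labels, and scores are all hypothetical.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def top_k(labels, scores, k):
    """Return the k (label, probability) pairs with the highest probability."""
    probs = softmax(scores)
    ranked = sorted(zip(labels, probs), key=lambda pair: pair[1], reverse=True)
    return ranked[:k]


# Hypothetical labels and scaled similarities, chosen only for illustration.
labels = ["snake", "turtle", "lizard", "apple"]
scores = [30.0, 28.0, 27.0, 20.0]

for label, prob in top_k(labels, scores, k=3):
    # Same right-aligned formatting as the example's print statement.
    print(f"{label:>16s}: {100 * prob:.2f}%")
```

Since the softmax exponentiates the scores, the factor of 100 applied to the cosine similarities sharpens the distribution: small gaps in similarity become large gaps in probability.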

2. Linear-probe evaluation

The example below uses scikit-learn to perform logistic regression on image features.

import os
import clip
import torch

import numpy as np
from sklearn.linear_model import LogisticRegression
from torch.utils.data import DataLoader
from torchvision.datasets import CIFAR100
from tqdm import tqdm

# Load the model
device = "cuda" if torch.cuda.is_available() else "cpu"
model, preprocess = clip.load('ViT-B/32', device)

# Load the dataset
root = os.path.expanduser("~/.cache")
train = CIFAR100(root, download=True, train=True, transform=preprocess)
test = CIFAR100(root, download=True, train=False, transform=preprocess)


def get_features(dataset):
    all_features = []
    all_labels = []

    with torch.no_grad():
        for images, labels in tqdm(DataLoader(dataset, batch_size=100)):
            features = model.encode_image(images.to(device))

            all_features.append(features)
            all_labels.append(labels)

    return torch.cat(all_features).cpu().numpy(), torch.cat(all_labels).cpu().numpy()

# Calculate the image features
train_features, train_labels = get_features(train)
test_features, test_labels = get_features(test)

# Perform logistic regression
classifier = LogisticRegression(random_state=0, C=0.316, max_iter=1000, verbose=1)
classifier.fit(train_features, train_labels)

# Evaluate using the logistic regression classifier
predictions = classifier.predict(test_features)
accuracy = np.mean((test_labels == predictions).astype(float)) * 100.
print(f"Accuracy = {accuracy:.3f}")

Note that the value of C should be determined through a hyperparameter sweep using a validation split.
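
One possible shape for such a sweep is sketched below. It shows only the selection loop: sweep_c and toy_validation_score are hypothetical names, the candidate grid is an arbitrary illustration, and the scoring function is a stand-in for what would in practice be a fit of LogisticRegression(C=c) on a training split followed by an accuracy measurement on a held-out validation split.

```python
def sweep_c(candidates, validation_score):
    """Return (best_c, best_score) over the candidate C values."""
    best_c, best_score = None, float("-inf")
    for c in candidates:
        score = validation_score(c)
        # Strict '>' keeps the first candidate when scores tie.
        if score > best_score:
            best_c, best_score = c, score
    return best_c, best_score


# Log-spaced candidates from 1e-3 to 1e3 (an illustrative grid, not a recommendation).
candidates = [10.0 ** exponent for exponent in range(-3, 4)]


def toy_validation_score(c):
    # Hypothetical stand-in: a real version would fit LogisticRegression(C=c)
    # on the training split and return accuracy on a held-out validation split.
    return 1.0 / (1.0 + abs(c - 1.0))


print(sweep_c(candidates, toy_validation_score))  # prints (1.0, 1.0)
```

The test split should stay out of the sweep entirely, so that the final accuracy is measured on data that played no part in choosing C.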

IV. See also

OpenCLIP: includes larger and independently trained CLIP models, up to ViT-G/14
Hugging Face implementation of CLIP: easier to integrate with the HF ecosystem