HuggingFace: https://huggingface.co/BAAI/bge-small-zh-v1.5
Paper: https://arxiv.org/abs/2309.07597
The model is plain BERT with absolute position embeddings; at inference time the last-layer [CLS] token is taken as the sentence embedding. MLM pre-training follows the method proposed in RetroMAE (https://arxiv.org/abs/2205.12035); see the sketch after the printout below. The module structure:
BertModel(
  (embeddings): BertEmbeddings(
    (word_embeddings): Embedding(21128, 512, padding_idx=0)
    (position_embeddings): Embedding(512, 512)
    (token_type_embeddings): Embedding(2, 512)
    (LayerNorm): LayerNorm((512,), eps=1e-12, elementwise_affine=True)
    (dropout): Dropout(p=0.1, inplace=False)
  )
  (encoder): BertEncoder(
    (layer): ModuleList(
      (0-3): 4 x BertLayer(
        (attention): BertAttention(
          (self): BertSdpaSelfAttention(
            (query): Linear(in_features=512, out_features=512, bias=True)
            (key): Linear(in_features=512, out_features=512, bias=True)
            (value): Linear(in_features=512, out_features=512, bias=True)
            (dropout): Dropout(p=0.1, inplace=False)
          )
          (output): BertSelfOutput(
            (dense): Linear(in_features=512, out_features=512, bias=True)
            (LayerNorm): LayerNorm((512,), eps=1e-12, elementwise_affine=True)
            (dropout): Dropout(p=0.1, inplace=False)
          )
        )
        (intermediate): BertIntermediate(
          (dense): Linear(in_features=512, out_features=2048, bias=True)
          (intermediate_act_fn): GELUActivation()
        )
        (output): BertOutput(
          (dense): Linear(in_features=2048, out_features=512, bias=True)
          (LayerNorm): LayerNorm((512,), eps=1e-12, elementwise_affine=True)
          (dropout): Dropout(p=0.1, inplace=False)
        )
      )
    )
  )
  (pooler): BertPooler(
    (dense): Linear(in_features=512, out_features=512, bias=True)
    (activation): Tanh()
  )
)
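To make the RetroMAE reference concrete, below is a minimal, illustrative sketch of its asymmetric masked-autoencoding idea: the encoder sees a lightly masked input, and its [CLS] vector must carry enough information for a deliberately shallow decoder to reconstruct a heavily masked copy of the same sentence. The class name RetroMAESketch, the mask ratios, the layer counts, and the simplified (non-"enhanced") decoding are assumptions for illustration only, not the paper's exact implementation.

import torch
import torch.nn as nn

class RetroMAESketch(nn.Module):
    """Illustrative only: asymmetric masked autoencoding in the spirit of RetroMAE."""
    def __init__(self, vocab_size=21128, hidden=512, max_len=512):
        super().__init__()
        self.tok_emb = nn.Embedding(vocab_size, hidden)
        self.pos_emb = nn.Embedding(max_len, hidden)
        layer = nn.TransformerEncoderLayer(hidden, nhead=8, dim_feedforward=2048, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=4)  # stands in for the BERT encoder
        self.decoder = nn.TransformerEncoder(layer, num_layers=1)  # deliberately shallow decoder
        self.lm_head = nn.Linear(hidden, vocab_size)

    def embed(self, ids):
        pos = torch.arange(ids.size(1), device=ids.device)
        return self.tok_emb(ids) + self.pos_emb(pos)

    def forward(self, enc_ids, dec_ids, dec_labels):
        # enc_ids: sentence with a light mask (e.g. ~30%); dec_ids: same sentence with a heavy mask (e.g. ~50-70%)
        hidden_states = self.encoder(self.embed(enc_ids))
        cls_vec = hidden_states[:, :1]  # sentence embedding taken from the [CLS] position
        # Inject the sentence embedding into the decoder input so reconstruction must rely on it
        dec_in = torch.cat([cls_vec, self.embed(dec_ids)[:, 1:]], dim=1)
        logits = self.lm_head(self.decoder(dec_in))
        # Cross-entropy only on masked positions (labels at unmasked positions set to -100)
        return nn.functional.cross_entropy(logits.view(-1, logits.size(-1)), dec_labels.view(-1), ignore_index=-100)

After pre-training, the decoder is discarded and only the encoder is kept as the embedding model.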
Usage example (CLS pooling of the last hidden state, followed by L2 normalization):

from transformers import AutoTokenizer, AutoModel
import torch

# Sentences we want sentence embeddings for
sentences = ["样例数据-1", "样例数据-2"]

# Load model from HuggingFace Hub
tokenizer = AutoTokenizer.from_pretrained('BAAI/bge-small-zh-v1.5')
model = AutoModel.from_pretrained('BAAI/bge-small-zh-v1.5')
model.eval()

# Tokenize sentences
encoded_input = tokenizer(sentences, padding=True, truncation=True, return_tensors='pt')
# For s2p (short query to long passage) retrieval tasks, prepend an instruction to each query
# (do not add the instruction to passages):
# encoded_input = tokenizer([instruction + q for q in queries], padding=True, truncation=True, return_tensors='pt')

# Compute token embeddings
with torch.no_grad():
    model_output = model(**encoded_input)
    # Perform pooling. In this case, CLS pooling: take the first token of the last hidden state.
    sentence_embeddings = model_output[0][:, 0]

# Normalize embeddings to unit length
sentence_embeddings = torch.nn.functional.normalize(sentence_embeddings, p=2, dim=1)
print("Sentence embeddings:", sentence_embeddings)