
Chinese general-purpose embeddings: BGE

2025/5/6 Source: https://blog.csdn.net/ylzf2008/article/details/147424616

HuggingFace: https://huggingface.co/BAAI/bge-small-zh-v1.5
Paper: https://arxiv.org/abs/2309.07597
The model architecture is plain BERT with absolute position embeddings; at inference time, the last layer's [CLS] token is taken as the sentence embedding. MLM pre-training follows the scheme proposed by RetroMAE (https://arxiv.org/abs/2205.12035); a simplified sketch of that objective appears after the model printout below.
The printed module structure of bge-small-zh-v1.5 is:

BertModel(
  (embeddings): BertEmbeddings(
    (word_embeddings): Embedding(21128, 512, padding_idx=0)
    (position_embeddings): Embedding(512, 512)
    (token_type_embeddings): Embedding(2, 512)
    (LayerNorm): LayerNorm((512,), eps=1e-12, elementwise_affine=True)
    (dropout): Dropout(p=0.1, inplace=False)
  )
  (encoder): BertEncoder(
    (layer): ModuleList(
      (0-3): 4 x BertLayer(
        (attention): BertAttention(
          (self): BertSdpaSelfAttention(
            (query): Linear(in_features=512, out_features=512, bias=True)
            (key): Linear(in_features=512, out_features=512, bias=True)
            (value): Linear(in_features=512, out_features=512, bias=True)
            (dropout): Dropout(p=0.1, inplace=False)
          )
          (output): BertSelfOutput(
            (dense): Linear(in_features=512, out_features=512, bias=True)
            (LayerNorm): LayerNorm((512,), eps=1e-12, elementwise_affine=True)
            (dropout): Dropout(p=0.1, inplace=False)
          )
        )
        (intermediate): BertIntermediate(
          (dense): Linear(in_features=512, out_features=2048, bias=True)
          (intermediate_act_fn): GELUActivation()
        )
        (output): BertOutput(
          (dense): Linear(in_features=2048, out_features=512, bias=True)
          (LayerNorm): LayerNorm((512,), eps=1e-12, elementwise_affine=True)
          (dropout): Dropout(p=0.1, inplace=False)
        )
      )
    )
  )
  (pooler): BertPooler(
    (dense): Linear(in_features=512, out_features=512, bias=True)
    (activation): Tanh()
  )
)
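The idea behind RetroMAE, roughly: the encoder sees a lightly masked copy of the text, and its [CLS] embedding must carry enough information for a very shallow decoder to reconstruct an aggressively masked copy of the same text. The sketch below is a simplified illustration, not the official implementation: it assumes a single vanilla transformer layer as the decoder, 30%/70% mask ratios, and a loss on the decoder's masked positions only; the paper additionally keeps a standard MLM loss on the encoder side and uses an enhanced two-stream decoding scheme, neither of which is shown here.

import torch
import torch.nn as nn
from transformers import AutoTokenizer, AutoModel

tokenizer = AutoTokenizer.from_pretrained('BAAI/bge-small-zh-v1.5')
encoder = AutoModel.from_pretrained('BAAI/bge-small-zh-v1.5')
hidden, vocab = encoder.config.hidden_size, encoder.config.vocab_size

# Hypothetical shallow decoder + LM head (the real decoder uses enhanced two-stream decoding)
decoder = nn.TransformerEncoderLayer(d_model=hidden, nhead=8, batch_first=True)
lm_head = nn.Linear(hidden, vocab)

def random_mask(input_ids, ratio, mask_id):
    # Replace a random subset of non-padding tokens with [MASK]; special tokens are ignored here for brevity
    ids = input_ids.clone()
    masked = (torch.rand(ids.shape) < ratio) & (ids != tokenizer.pad_token_id)
    ids[masked] = mask_id
    return ids

batch = tokenizer(["样例数据-1"], padding=True, return_tensors='pt')
mask_id = tokenizer.mask_token_id

# Encoder side: light masking; the [CLS] vector is the sentence embedding
enc_ids = random_mask(batch['input_ids'], 0.3, mask_id)
enc_out = encoder(input_ids=enc_ids, attention_mask=batch['attention_mask'])
sentence_embedding = enc_out.last_hidden_state[:, 0]

# Decoder side: aggressive masking, conditioned on the sentence embedding injected at position 0
dec_ids = random_mask(batch['input_ids'], 0.7, mask_id)
dec_inputs = encoder.embeddings(input_ids=dec_ids)
dec_inputs[:, 0] = sentence_embedding
logits = lm_head(decoder(dec_inputs))

# Reconstruction loss on the decoder's masked positions, against the original tokens
masked_positions = dec_ids.eq(mask_id)
loss = nn.functional.cross_entropy(logits[masked_positions], batch['input_ids'][masked_positions])
print(loss)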
from transformers import AutoTokenizer, AutoModel
import torch

# Sentences we want sentence embeddings for
sentences = ["样例数据-1", "样例数据-2"]

# Load model from HuggingFace Hub
tokenizer = AutoTokenizer.from_pretrained('BAAI/bge-large-zh-v1.5')
model = AutoModel.from_pretrained('BAAI/bge-large-zh-v1.5')
model.eval()

# Tokenize sentences
encoded_input = tokenizer(sentences, padding=True, truncation=True, return_tensors='pt')
# For s2p (short query to long passage) retrieval tasks, add an instruction to each query
# (do not add the instruction to passages):
# encoded_input = tokenizer([instruction + q for q in queries], padding=True, truncation=True, return_tensors='pt')

# Compute token embeddings
with torch.no_grad():
    model_output = model(**encoded_input)
    # Perform pooling. In this case, CLS pooling.
    sentence_embeddings = model_output[0][:, 0]

# Normalize embeddings
sentence_embeddings = torch.nn.functional.normalize(sentence_embeddings, p=2, dim=1)
print("Sentence embeddings:", sentence_embeddings)
