GraphRAG: Study Notes

20250106
Project Git repository: https://github.com/microsoft/graphrag.git
Version: 1.2.0

### This config file contains required core defaults that must be set, along with a handful of common optional settings.
### For a full list of available settings, see https://microsoft.github.io/graphrag/config/yaml/

### LLM settings ###
## There are a number of settings to tune the threading and token limits for LLM calls - check the docs.

encoding_model: cl100k_base # this needs to be matched to your model!

llm:
  api_key: <your_api_key> # set this in the generated .env file
  type: openai_chat # or azure_openai_chat
  model: deepseek-chat
  model_supports_json: true # recommended if this is available for your model.
  # audience: "https://cognitiveservices.azure.com/.default"
  api_base: https://api.deepseek.com # https://<instance>.openai.azure.com
  api_version: V3
  # organization: <organization_id>
  deployment_name: maweijun

parallelization:
  stagger: 0.3
  # num_threads: 50

async_mode: threaded # or asyncio

embeddings:
  async_mode: threaded # or asyncio
  vector_store:
    type: lancedb
    db_uri: 'output/lancedb'
    container_name: default
    overwrite: true
  llm:
    api_key: <your_api_key>
    type: openai_embedding # or azure_openai_embedding
    model: embedding-2
    api_base: https://open.bigmodel.cn/api/paas/v4
    # api_base: https://<instance>.openai.azure.com
    # api_version: 2024-02-15-preview
    # audience: "https://cognitiveservices.azure.com/.default"
    # organization: <organization_id>
    # deployment_name: <azure_model_deployment_name>

### Input settings ###

input:
  type: file # or blob
  file_type: text # or csv
  base_dir: "input"
  file_encoding: utf-8
  file_pattern: ".*\\.txt$"

chunks:
  size: 1200
  overlap: 100
  group_by_columns: [id]

### Storage settings ###
## If blob storage is specified in the following four sections,
## connection_string and container_name must be provided

cache:
  type: file # one of [blob, cosmosdb, file]
  base_dir: "cache"

reporting:
  type: file # or console, blob
  base_dir: "output/${timestamp}/logs"

storage:
  type: file # one of [blob, cosmosdb, file]
  base_dir: "output/${timestamp}/artifacts"

## only turn this on if running `graphrag index` with custom settings
## we normally use `graphrag update` with the defaults
update_index_storage:
  # type: file # or blob
  # base_dir: "update_output"

### Workflow settings ###

skip_workflows: []

entity_extraction:
  prompt: "prompts/entity_extraction.txt"
  entity_types: [organization,person,geo,event]
  max_gleanings: 1

summarize_descriptions:
  prompt: "prompts/summarize_descriptions.txt"
  max_length: 500

claim_extraction:
  enabled: false
  prompt: "prompts/claim_extraction.txt"
  description: "Any claims or facts that could be relevant to information discovery."
  max_gleanings: 1

community_reports:
  prompt: "prompts/community_report.txt"
  max_length: 2000
  max_input_length: 8000

cluster_graph:
  max_cluster_size: 10

embed_graph:
  enabled: false # if true, will generate node2vec embeddings for nodes

umap:
  enabled: false # if true, will generate UMAP embeddings for nodes (embed_graph must also be enabled)

snapshots:
  graphml: true
  embeddings: false
  transient: false

### Query settings ###
## The prompt locations are required here, but each search method has a number of optional knobs that can be tuned.
## See the config docs: https://microsoft.github.io/graphrag/config/yaml/#query

local_search:
  prompt: "prompts/local_search_system_prompt.txt"

global_search:
  map_prompt: "prompts/global_search_map_system_prompt.txt"
  reduce_prompt: "prompts/global_search_reduce_system_prompt.txt"
  knowledge_prompt: "prompts/global_search_knowledge_system_prompt.txt"

drift_search:
  prompt: "prompts/drift_search_system_prompt.txt"
  reduce_prompt: "prompts/drift_search_reduce_prompt.txt"

basic_search:
  prompt: "prompts/basic_search_system_prompt.txt"

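The listing above is the GraphRAG configuration file (settings.yaml), which controls how the framework behaves. GraphRAG is Microsoft's graph-based retrieval-augmented generation (RAG) framework: it uses an LLM to build a knowledge graph from input text and then answers queries against that graph. Each part of the configuration is explained below: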

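1. LLM Settings

This section covers the large language model (LLM): how its API is called and how those calls are scheduled.

encoding_model: cl100k_base
The token encoding, which must match the LLM in use. cl100k_base is an encoding commonly used by OpenAI models.
llm
Parameters for LLM API calls:
- api_key: the API key for the LLM, usually stored in the .env file.
- type: the LLM type, for example openai_chat or azure_openai_chat.
- model: the name of the model to use, for example deepseek-chat.
- model_supports_json: whether the model supports JSON-formatted output.
- api_base: the base URL of the LLM API.
- api_version: the API version.
- deployment_name: the deployment name (applies to Azure OpenAI).
parallelization
Parallelization parameters:
- stagger: the delay, in seconds, between the start of successive API calls, used to avoid hitting rate limits.
- num_threads: the number of parallel threads (commented out in this configuration).
async_mode
The asynchronous execution mode, either threaded (multithreading) or asyncio (asynchronous I/O).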
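
To make the effect of stagger concrete, here is a minimal sketch of staggered thread starts. It is not GraphRAG's actual scheduler; the function name run_staggered and its return values are hypothetical and exist only for illustration.

```python
import threading
import time


def run_staggered(tasks, stagger=0.3):
    """Start one thread per task, sleeping `stagger` seconds between starts.

    Hypothetical helper: it only illustrates the idea behind
    `parallelization.stagger`, not how GraphRAG implements it.
    """
    results = [None] * len(tasks)
    start_offsets = [0.0] * len(tasks)
    t0 = time.monotonic()

    def worker(index, task):
        # Record when this task actually started, relative to t0.
        start_offsets[index] = time.monotonic() - t0
        results[index] = task()

    threads = []
    for index, task in enumerate(tasks):
        if index > 0:
            time.sleep(stagger)  # spread out the request start times
        thread = threading.Thread(target=worker, args=(index, task))
        thread.start()
        threads.append(thread)

    for thread in threads:
        thread.join()
    return results, start_offsets
```

With stagger set to 0.3, the third task starts roughly 0.6 seconds after the first, so a burst of requests is spread over time instead of hitting the API all at once.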

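2. Embeddings Settings

This section covers the embedding model, which produces vector representations of text or nodes.

async_mode
The asynchronous execution mode for embedding calls.
vector_store
Vector store configuration:
- type: the vector store type, for example lancedb.
- db_uri: the URI of the database.
- container_name: the container name.
- overwrite: whether to overwrite existing data.
llm
Parameters for embedding model API calls:
- api_key: the API key for the embedding model.
- type: the embedding model type, for example openai_embedding or azure_openai_embedding.
- model: the embedding model name, for example embedding-2.
- api_base: the base URL of the embedding model API.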

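3. Input Settings

This section covers how input data is read and split.

input
Source and format of the input data:
- type: the input type, for example file (local files) or blob (Blob storage).
- file_type: the file type, for example text or csv.
- base_dir: the root directory of the input files.
- file_encoding: the file encoding, for example utf-8.
- file_pattern: a regular expression that file names must match.
chunks
Text chunking parameters:
- size: the size of each chunk, measured in tokens (not characters).
- overlap: the number of tokens shared by adjacent chunks.
- group_by_columns: the columns to group by before chunking (relevant for structured data).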
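
The two settings above that are easiest to misread are file_pattern and the chunk size/overlap pair. The sketch below illustrates both under stated assumptions: matching_files and chunk_tokens are hypothetical helpers, the token list is a stand-in for the output of a real tokenizer, and GraphRAG's own chunker may window the text differently.

```python
import re

# The YAML string ".*\\.txt$" unescapes to this regular expression.
FILE_PATTERN = r".*\.txt$"


def matching_files(names, pattern=FILE_PATTERN):
    """Keep only the file names that match the regular expression."""
    regex = re.compile(pattern)
    return [name for name in names if regex.match(name)]


def chunk_tokens(tokens, size=1200, overlap=100):
    """Split a token list into windows of `size` tokens.

    Adjacent windows share `overlap` tokens, so each new window
    starts `size - overlap` tokens after the previous one.
    """
    if not 0 <= overlap < size:
        raise ValueError("overlap must be >= 0 and smaller than size")
    step = size - overlap
    chunks = []
    for start in range(0, len(tokens), step):
        chunks.append(tokens[start:start + size])
        if start + size >= len(tokens):
            break  # the last window already reaches the end
    return chunks
```

With the values in this configuration (size 1200, overlap 100), consecutive chunks in this sketch start 1100 tokens apart.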

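4. Storage Settings

This section covers the cache, reporting (logs), and output storage.

cache
Cache storage:
- type: the cache type, for example file (local files) or blob (Blob storage).
- base_dir: the root directory of the cache files.
reporting
Report (log) output:
- type: the reporting type, for example file (local files) or console.
- base_dir: the root directory of the report files.
storage
Output storage:
- type: the storage type, for example file (local files) or blob (Blob storage).
- base_dir: the root directory of the stored files.
update_index_storage
Storage for index updates (usually does not need to be enabled manually).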

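5. Workflow Settings

This section covers the indexing workflows.

skip_workflows
The list of workflows to skip.
entity_extraction
Entity extraction:
- prompt: path to the prompt template file for entity extraction.
- entity_types: the entity types to extract, for example organization and person.
- max_gleanings: the maximum number of additional extraction passes (gleanings) used to pick up entities missed by the first pass. It is not a cap on the number of entities.
summarize_descriptions
Summarization of entity and relationship descriptions:
- prompt: path to the prompt template file for summarization.
- max_length: the maximum length of each summary, in tokens.
claim_extraction
Claim extraction (disabled by default):
- prompt: path to the prompt template file for claim extraction.
- description: a description of the kinds of claims to extract.
- max_gleanings: the maximum number of additional extraction passes.
community_reports
Community report generation:
- prompt: path to the prompt template file for report generation.
- max_length: the maximum length of each report, in tokens.
- max_input_length: the maximum length of the input used to generate a report, in tokens.
cluster_graph
Graph clustering:
- max_cluster_size: the maximum size of a cluster.
embed_graph
Graph embedding (disabled by default). When enabled, node2vec embeddings are generated for the nodes.
umap
UMAP dimensionality reduction (disabled by default). It requires embed_graph to be enabled as well.
snapshots
Snapshot output:
- graphml: whether to export a snapshot of the graph in GraphML format.
- embeddings: whether to export snapshots of the embeddings.
- transient: whether to export snapshots of transient (intermediate) data.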
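
Since max_gleanings is easy to mistake for a limit on the number of results, the following sketch shows the idea of gleaning as a loop of follow-up passes. It is only an illustration: in the real pipeline the passes are LLM prompts, while here extract and continue_extract are hypothetical callables supplied by the caller.

```python
def extract_with_gleanings(text, extract, continue_extract, max_gleanings=1):
    """Run one extraction pass plus at most `max_gleanings` follow-up passes.

    `extract(text)` returns the entities found by the first pass.
    `continue_extract(text, found)` returns entities that were missed so far.
    Both are hypothetical stand-ins for LLM calls.
    """
    found = list(extract(text))
    for _ in range(max_gleanings):
        extra = [e for e in continue_extract(text, found) if e not in found]
        if not extra:
            break  # nothing new was found, so further passes are pointless
        found.extend(extra)
    return found
```

With max_gleanings set to 0 only the first pass runs; with the value 1 used in this configuration, at most one follow-up pass is made.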

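6. Query Settings

This section covers the query (search) methods.

local_search
Local search:
- prompt: path to the prompt template file for local search.
global_search
Global search:
- map_prompt: the prompt template for the map stage of global search.
- reduce_prompt: the prompt template for the reduce stage of global search.
- knowledge_prompt: the general-knowledge prompt template for global search.
drift_search
DRIFT search:
- prompt: path to the prompt template file for DRIFT search.
- reduce_prompt: the prompt template for the reduce stage of DRIFT search.
basic_search
Basic search:
- prompt: path to the prompt template file for basic search.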
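
Each of the four sections above corresponds to a search method that can be selected when querying. The sketch below only assembles a `graphrag query` command line; it assumes the CLI accepts the method names local, global, drift, and basic in this version, and build_query_command is a hypothetical helper, not part of GraphRAG.

```python
import shlex

# Assumed to mirror the four *_search sections of the configuration.
SEARCH_METHODS = ("local", "global", "drift", "basic")


def build_query_command(question, method="local", root="."):
    """Build the argument list for a `graphrag query` invocation."""
    if method not in SEARCH_METHODS:
        raise ValueError(f"unknown search method: {method!r}")
    return [
        "graphrag", "query",
        "--root", root,        # project directory that holds the config file
        "--method", method,    # which search method (and prompt set) to use
        "--query", question,
    ]


if __name__ == "__main__":
    command = build_query_command("What are the main themes?", method="global")
    # Print a shell-safe version of the command instead of running it.
    print(shlex.join(command))
```

Passing the list to subprocess.run would execute the query; printing it first makes it easy to check which method and root directory are being used.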