# Chroma: Operations Guide

accttodo 12/31/2025 Large Models, Vector Databases, Chroma

Reference

Chinese-language introduction to the Chroma database

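# I. Creating a Client

import chromadb

# In-memory mode (not persisted; data is lost when the process exits)
client = chromadb.Client()

# Persistent mode (data is stored on the local disk)
client = chromadb.PersistentClient(path="/path/to/storage")

# Client/server mode (remote access; requires a running Chroma server)
client = chromadb.HttpClient(host='localhost', port=8000)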

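In persistent mode, Chroma writes its files under the given path. The exact layout depends on the version; in recent releases it typically contains:
- chroma.sqlite3 (collection definitions, metadata, and document text)
- one UUID-named subdirectory per collection, holding the HNSW vector index files
Older releases used a different layout (for example, Parquet files such as chroma-embeddings.parquet).
With local persistence you choose the storage path yourself. A relative path is resolved against the current working directory, while an absolute path is used as is: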

Type	Code example	Resolved path (assuming the working directory is `C:\project`)
Relative path	`path="chroma/data"`	`C:\project\chroma\data`
Absolute path	`path="C:/chroma/data"`	`C:\chroma\data`
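
The resolution rule in the table can be checked without touching the file system. The sketch below uses `pathlib`'s pure path classes so that it behaves the same on any OS; the helper name `resolve` is made up for this illustration and is not part of Chroma.

```python
from pathlib import PurePosixPath, PureWindowsPath

def resolve(working_dir, path):
    """Join `path` onto `working_dir`; an absolute `path` replaces it entirely."""
    return working_dir / path

# Windows-style paths, matching the table above
win_cwd = PureWindowsPath("C:/project")
relative = resolve(win_cwd, "chroma/data")
absolute = resolve(win_cwd, "C:/chroma/data")
print(relative)
print(absolute)

# The same rule with POSIX-style paths
posix_relative = resolve(PurePosixPath("/home/user/project"), "chroma/data")
print(posix_relative)
```

In a real script, `os.path.abspath(path)` applies the same rule against the actual working directory.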

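Building the storage path dynamically

# The standard-library os module provides cross-platform file operations,
# path handling, and access to environment variables
import os

import chromadb

# Current working directory (the directory the script is run from)
base_dir = os.getcwd()

# Build the storage path portably:
#   os.path.join() uses the separator of the host OS (\ on Windows, / on Linux)
#   Resulting path, for example: Windows → "C:\project\chroma\myCollection"
#                                Linux   → "/home/user/project/chroma/myCollection"
save_path = os.path.join(base_dir, "chroma", "myCollection")
print(f"Persistent storage path: {save_path}")

# Create the directory explicitly if it is missing (PersistentClient can create
# it too, but doing it here surfaces permission errors early)
os.makedirs(save_path, exist_ok=True)

# Create the persistent client
#   - PersistentClient stores the data on disk
#   - the database files (chroma.sqlite3, etc.) are created under `path`
#   - the process needs write permission for that directory
client = chromadb.PersistentClient(path=save_path)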

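Notes:

Chroma has no dedicated API for removing a persisted database from disk; discarding the whole store means deleting its local directory. (client.reset() empties the database, but it has to be enabled explicitly through the allow_reset setting.)
Data deleted this way cannot be recovered, so proceed with care.
To keep part of the data, delete specific collections through the API (client.delete_collection) instead of removing the entire directory.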
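
Removing the directory can be sketched with the standard library alone. The example below works on a throwaway temporary directory that stands in for a real store; `wipe_store` is a hypothetical helper name, and in real use no client should still have the directory open when it is deleted.

```python
import os
import shutil
import tempfile

def wipe_store(path):
    """Delete a persisted store directory; return True if something was removed."""
    if not os.path.isdir(path):
        return False
    shutil.rmtree(path)
    return True

# Stand-in for a real store: a temp directory with a dummy database file
store = tempfile.mkdtemp(prefix="chroma_demo_")
with open(os.path.join(store, "chroma.sqlite3"), "w") as fh:
    fh.write("placeholder")

removed = wipe_store(store)        # the directory existed and is now gone
removed_again = wipe_store(store)  # nothing left to remove
```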

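# II. Collection Operations

A Collection is Chroma's core unit of data organization, comparable to a table in a relational database. It groups embeddings, the original documents, and their associated metadata.

A Collection contains the following:

Component	Description
Embeddings	High-dimensional vectors produced by an embedding model (such as OpenAI or SentenceTransformers).
Documents	The original text or data chunks (paragraphs, sentences) that the embeddings are computed from.
Metadata	Key-value pairs describing a document (such as `{"author": "Alice", "category": "science"}`); used for filtering and retrieval.
IDs	Unique identifiers for documents (such as `id1`, `id2`); used to update or delete data.

Core functions of each component:

Embedding storage: holds the high-dimensional vectors produced by an embedding model (all-MiniLM-L6-v2, OpenAI, and so on) that represent the semantics of unstructured data such as text and images.
Original document association: keeps the text fragment each vector was computed from, so a search can return readable content.
Metadata management: each document can carry structured key-value metadata, for example {"author": "Li Bai", "year": 2023}, for fine-grained filtering.
Unique identifiers (IDs): every document has a unique ID (such as id1, id2), which allows targeted updates and deletes.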

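# 1. Creating a Collection

Ways to create a Collection

# Create a new Collection through the client (raises an error if the name already exists)
collection = client.create_collection(
    name="my_collection",
    metadata={"hnsw:space": "cosine"},  # optional: index parameters
    embedding_function=embedding_fn     # optional: custom embedding function (embedding_fn is assumed to be defined elsewhere)
)

# Get the Collection, or create it if it is missing (recommended; avoids duplicate-creation errors)
collection = client.get_or_create_collection(
    name="my_collection",
    metadata={"description": "User knowledge base"},  # custom metadata
    embedding_function=sentence_transformer_fn         # embedding function (assumed to be defined elsewhere)
)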

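Key parameters

Parameter	Purpose	Example value
`name` (required)	Unique name of the Collection	`"ai_docs"`
`metadata` (optional)	Collection properties: • distance function (`hnsw:space`) • custom descriptive fields	`{"hnsw:space": "l2"}` `{"category": "research"}`
`embedding_function` (optional)	Custom text embedding function (default model: `all-MiniLM-L6-v2`)	`OpenAIEmbeddingFunction(api_key="YOUR_KEY")`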

Choosing the distance function

The similarity measure is configured through metadata:
```
collection = client.create_collection(
    name="my_collection",
    metadata={"hnsw:space": "ip"}  # one of: l2 (default), cosine, ip
)
```
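- cosine: cosine distance (a common choice for text embeddings)
- l2: squared Euclidean distance (Chroma's default; often used for image features)
- ip: inner product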
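
To make the three options concrete, here is a minimal pure-Python sketch of the distance formulas as Chroma's documentation describes them (squared L2, one minus the inner product, one minus the cosine similarity). The function names are illustrative only; Chroma computes these inside its HNSW index, not with this code.

```python
import math

def l2_squared(a, b):
    """Squared Euclidean distance: sum of squared component differences."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

def inner_product_distance(a, b):
    """One minus the dot product; only meaningful for comparable vector norms."""
    return 1.0 - sum(x * y for x, y in zip(a, b))

def cosine_distance(a, b):
    """One minus the cosine of the angle between a and b (range 0 to 2)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / norm

a, b = [1.0, 0.0], [0.0, 1.0]  # two orthogonal unit vectors
print(l2_squared(a, b), inner_product_distance(a, b), cosine_distance(a, b))
```

Cosine distance ignores vector length, which is why `[1.0, 0.0]` and `[2.0, 0.0]` are at distance 0 under cosine but not under l2.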

Custom embedding models

# Sentence Transformers model (from Hugging Face)
from chromadb.utils.embedding_functions import SentenceTransformerEmbeddingFunction
hf_ef = SentenceTransformerEmbeddingFunction(model_name="paraphrase-multilingual-MiniLM-L12-v2")

# OpenAI model
from chromadb.utils.embedding_functions import OpenAIEmbeddingFunction
openai_ef = OpenAIEmbeddingFunction(api_key="YOUR_KEY", model_name="text-embedding-3-large")

collection = client.get_or_create_collection(
    name="multilingual",
    embedding_function=hf_ef  # use the custom model
)

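# 2. Getting a Collection

Ways to get a Collection

get_collection
- Purpose: get an existing Collection (exact name match)
- Behavior: raises an exception if it does not exist (ValueError in older releases; newer releases raise a dedicated not-found error)
```
collection = client.get_collection(name="my_collection")
```
get_or_create_collection (recommended)
- Purpose: get the Collection, creating it if needed
- Behavior: returns the Collection if it exists, otherwise creates it
```
collection = client.get_or_create_collection(
    name="my_collection",
    embedding_function=SentenceTransformerEmbeddingFunction(model_name="all-MiniLM-L6-v2"),
    metadata={"hnsw:space": "cosine"}  # use cosine distance
)
```
  Parameters:
  - name: Collection name (required)
  - embedding_function: embedding function (optional; defaults to all-MiniLM-L6-v2)
  - metadata: configuration parameters (such as the distance function)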

list_collections

Purpose: list all Collections

collections = client.list_collections()
for col in collections:
    # Most releases return Collection objects; some return only the names
    print(col.name, col.metadata)
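
Because the element type returned by `list_collections()` has differed between releases (Collection objects in most, plain name strings in some), code that only needs the names can normalize both shapes. The helper below is a hedged sketch; `collection_names` is a made-up name, and `SimpleNamespace` objects stand in for real Collection objects.

```python
from types import SimpleNamespace

def collection_names(items):
    """Return names whether `items` holds strings or objects with a .name attribute."""
    return [item if isinstance(item, str) else item.name for item in items]

# Stand-ins for the two return shapes
as_objects = [
    SimpleNamespace(name="docs", metadata=None),
    SimpleNamespace(name="poems", metadata={"hnsw:space": "cosine"}),
]
as_strings = ["docs", "poems"]

print(collection_names(as_objects))
print(collection_names(as_strings))
```

With a real client the call would be `collection_names(client.list_collections())`.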

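Key features

Metadata configuration: index behavior is set through metadata and takes effect when the Collection is created:

metadata = {
    "hnsw:space": "cosine",  # distance function (l2 / ip / cosine)
    "hnsw:M": 32,            # max neighbor connections per HNSW node (default 16)
    "optimizers:memory_threshold": 0.8  # memory threshold; not one of Chroma's documented hnsw:* keys, verify before relying on it
}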

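Embedding model binding: a Collection is expected to use one embedding model for its whole lifetime:
- The model (such as text-embedding-3-large) can be chosen at creation time
- Switching models on an existing Collection is not supported, because stored vectors and new query vectors would no longer be comparable

Best practices

In production, prefer get_or_create_collection so that a missing Collection does not interrupt the service.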

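Centralized embedding model management

# Keep model names in one place
EMBEDDING_MODELS = {
    "text": "text-embedding-3-large",
    "image": "clip-ViT-B-32"
}

# CustomEmbedding is a placeholder for your own embedding function class
collection = client.get_or_create_collection(
    name="multimodal_data",
    embedding_function=CustomEmbedding(EMBEDDING_MODELS["text"])
)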

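Collection versioning

# Illustrative only: versioning_policy is not a parameter of Chroma's
# documented create_collection API; verify against your version before use
client.create_collection(
    name="versioned_data",
    versioning_policy={"max_versions": 5, "retention_days": 30}
)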

# 3. Deleting a Collection

How to delete a Collection

# Delete an entire Collection through the client (irreversible)
client.delete_collection(name="my_collection")

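Best practices

Deleting a Collection permanently removes it together with all of its data (vectors, metadata, documents).
Deleted data cannot be recovered, so proceed with care.
Use it when the data needs to be cleaned out completely.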

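# 4. Adding Data to a Collection

Methods for adding data

add(): basic insert. Records whose ID already exists are not overwritten; depending on the version, the call either raises an error or silently skips them.

# Add data; embeddings are generated automatically
collection.add(
    ids=["poem_1", "poem_2"],
    documents=["海内存知己，天涯若比邻", "举头望明月，低头思故乡"],
    metadatas=[
        {"author": "Wang Bo", "dynasty": "Tang"},
        {"author": "Li Bai", "dynasty": "Tang"}
    ]
)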

upsert(): updates the record when the ID exists and inserts it otherwise (recommended in production).

# Update or insert (upsert)
collection.upsert(
    ids=["poem_3"],
    documents=["春眠不觉晓，处处闻啼鸟"],
    metadatas=[{"author": "Meng Haoran", "dynasty": "Tang"}]
)

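Key parameters

Parameter	Required	Description
`ids`	✓	Unique record identifiers (list of strings, one per record)
`embeddings`	✗	Precomputed vectors (list of lists of floats). If omitted, Chroma generates them from `documents` with the Collection's embedding function
`metadatas`	✗	List of key-value metadata dicts (such as `{"author": "Li Bai", "category": "poetry"}`), used for later filtering
`documents`	✗	List of original texts (same length as `ids`). For text data, at least one of `documents` or `embeddings` must be given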

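Key features

Automatic embedding
- If embeddings is not provided, Chroma generates vectors with the Collection's embedding function (default model: all-MiniLM-L6-v2).
- Custom models (OpenAI, Hugging Face, and others) are supported through the embedding_function parameter, set when the Collection is created.
Metadata filtering: metadatas added here can be used in later query filters (such as where={"dynasty": "Tang"}).

Batch writes: add() accepts lists, so large datasets can be written in fixed-size slices. Chroma does not provide a collection.batch() context manager; a plain loop does the job (ids and documents are assumed to be parallel lists).

batch_size = 100
for i in range(0, len(ids), batch_size):
    collection.add(ids=ids[i:i + batch_size], documents=documents[i:i + batch_size])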

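Combined writes: vectors, text, and metadata can be written together in one call and are stored as one record per ID.

Best practices

ID uniqueness: add() never overwrites an existing ID; upsert() replaces the old data.
Performance
- Precomputing vectors reduces write latency (no embedding at write time).
- Writing in batches reduces per-call overhead.
Data types: metadata values may be str, int, float, or bool.
Persistence: use PersistentClient so that data is written to disk; in-memory mode loses it on restart.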
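
Checking metadata before a write gives a clearer error than a failed `add()` halfway through a large batch. The sketch below encodes only the type rule stated above; `invalid_metadata_keys` is a hypothetical helper, and newer Chroma releases may accept additional value types.

```python
ALLOWED_TYPES = (str, int, float, bool)

def invalid_metadata_keys(metadata):
    """Return the keys whose values are not str, int, float, or bool."""
    return [key for key, value in metadata.items()
            if not isinstance(value, ALLOWED_TYPES)]

good = {"author": "Li Bai", "year": 701, "rating": 4.8, "classic": True}
bad = {"author": "Li Bai", "tags": ["poetry", "Tang"], "extra": None}

print(invalid_metadata_keys(good))
print(invalid_metadata_keys(bad))
```

A list-valued field such as `tags` can be stored as a delimited string or split into several boolean fields instead.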

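# 5. Updating Data in a Collection

Methods for updating data

update(): updates the document data (vector, metadata, original text) of the given IDs.

Parameters:

ids (required): ID or list of IDs of the records to update.
embeddings: new vectors (optional; if omitted while documents is given, they are recomputed from the new text).
metadatas: new metadata (a dict or a list of dicts).
documents: new original text (a string or a list of strings).

Example:

# Update a single record
collection.update(
    ids="doc3",
    documents="Ball Lightning: An Unsolved Mystery of Quantum Mechanics and Military Technology (Revised Edition)",
    metadatas={"author": "Liu Cixin", "year": 2004, "category": "Sci-Fi/Mystery"},
    # embeddings=[[0.1, 0.2, ...]]  # optionally supply a new vector
)

# Update several records at once
collection.update(
    ids=["id1", "id2"],
    documents=["Updated document 1", "Updated document 2"],
    metadatas=[{"source": "Nature (updated)"}, {"source": "Science (updated)"}]
)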

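upsert(): combines insert and update. If the ID exists the record is updated, otherwise it is inserted.

Use it when:

You need an efficient write without knowing whether the data already exists.
You want to avoid problems with duplicate IDs.

Example:

# Update if present, insert otherwise
collection.upsert(
    ids=["id1", "id4"],
    documents=["New content for document 1", "Newly added document 4"],
    metadatas=[{"tag": "AI"}, {"tag": "Cloud"}]
)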

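Best practices

The ID must exist:
- update() only changes existing records. For an unknown ID the update does not take effect (depending on the version, the call raises an error or logs one and skips the record), so confirm existence first with get(ids=...).
- upsert() avoids the problem.
Partial updates:
- Fields that are not passed (for example, embeddings) keep their previous values.
- To remove a metadata field, set it to None explicitly:
```
collection.update(ids="doc1", metadatas={"author": None})
```
Vector updates:
- If documents is updated without new vectors, Chroma recomputes the vectors automatically (this requires an embedding function).
- Custom vectors must match the Collection's dimensionality, otherwise an error is raised.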
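
The partial-update rules above (untouched fields keep their values, a `None` metadata value removes the field) can be modelled on a plain dict. This is a toy model of the behavior described in this section, not Chroma's implementation; `apply_update` and the record layout are assumptions made for the illustration.

```python
def apply_update(record, document=None, metadata=None):
    """Return a copy of `record` with a partial update applied."""
    updated = {"document": record["document"], "metadata": dict(record["metadata"])}
    if document is not None:
        updated["document"] = document
    for key, value in (metadata or {}).items():
        if value is None:
            updated["metadata"].pop(key, None)  # None removes the field
        else:
            updated["metadata"][key] = value    # other values overwrite or add
    return updated

original = {"document": "old text", "metadata": {"author": "Liu Cixin", "year": 2004}}
result = apply_update(original, metadata={"author": None, "category": "sci-fi"})
print(result)
```

The document is untouched because it was not passed, `author` is gone, `year` is preserved, and `category` is new.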

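Batch performance:

Chroma has no batch() context manager; reduce the number of requests by passing lists of IDs and documents in fixed-size slices (ids and documents are assumed to be parallel lists).

batch_size = 100
for i in range(0, len(ids), batch_size):
    collection.update(ids=ids[i:i + batch_size], documents=documents[i:i + batch_size])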

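# 6. Querying a Collection

Methods for querying data

query(): the basic query method. It runs a similarity search and returns the documents most similar to the query vectors or query texts.

results = collection.query(
    query_texts=["search keywords"],                # text query (Chroma embeds it automatically)
    # query_embeddings=[[0.1, 0.2, 0.3]],           # or pass vectors directly; use only one of the two
    n_results=5,                                    # number of results per query
    where={"category": "science"},                  # metadata filter
    where_document={"$contains": "AI"},             # document content filter
    include=["documents", "metadatas", "distances"] # fields to return
)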

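Return value

{
    'ids': [['id1', 'id2']],              # IDs of the matching documents (one inner list per query)
    'distances': [[0.32, 0.85]],          # distances (smaller means more similar)
    'metadatas': [[{"category":"science"}, ...]], # metadata
    'documents': [["Document text 1", ...]]       # original text
}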
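
The result is column-oriented: each field holds one inner list per query. For display or further processing it is often handier to have one dict per hit. The helper below is a sketch that works on any dict shaped like the one above; `flatten_hits` is a made-up name, and the sample data replaces a real `collection.query()` call.

```python
def flatten_hits(results, query_index=0):
    """Turn the column-oriented result of one query into a list of row dicts."""
    hits = []
    for pos, doc_id in enumerate(results["ids"][query_index]):
        hit = {"id": doc_id}
        for field in ("distances", "documents", "metadatas"):
            column = results.get(field)
            if column is not None:  # fields not requested via include may be absent or None
                hit[field[:-1]] = column[query_index][pos]  # "distances" -> "distance", etc.
        hits.append(hit)
    return hits

sample = {
    "ids": [["id1", "id2"]],
    "distances": [[0.32, 0.85]],
    "metadatas": [[{"category": "science"}, {"category": "science"}]],
    "documents": [["Document text 1", "Document text 2"]],
    "embeddings": None,
}
hits = flatten_hits(sample)
for hit in hits:
    print(hit["id"], hit["distance"], hit["document"])
```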

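Filtering in detail

Metadata filter (where): selects records by metadata conditions. Conditions on several fields must be combined with $and, because a where dict accepts only one top-level key:

where={
    "$and": [
        {"year": {"$gte": 2023}},             # numeric range: year ≥ 2023
        {"category": {"$in": ["AI", "ML"]}},  # match any of several values
        {"author": {"$ne": "John"}}           # exclude a value
    ]
}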

Supported operators

Operator	Example	Meaning
`$eq`	`{"year": {"$eq": 2023}}`	equal to
`$ne`	`{"category": {"$ne": "AI"}}`	not equal to
`$gt`/`$gte`	`{"price": {"$gt": 100}}`	greater than / greater than or equal to
`$lt`/`$lte`	`{"rating": {"$lte": 4.5}}`	less than / less than or equal to
`$in`	`{"tags": {"$in": ["sci-fi", "mystery"]}}`	value is in the list
`$nin`	`{"source": {"$nin": ["A", "B"]}}`	value is not in the list
`$and`/`$or`	`{"$and": [{"year": 2023}, {"category": "tech"}]}`	logical AND / OR
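
To see how these operators combine, here is a small pure-Python evaluator for the same filter syntax. It is a teaching sketch, not Chroma's implementation: `matches` is a hypothetical function, and the choice that a missing field never matches is an assumption of this toy version.

```python
import operator

COMPARATORS = {
    "$eq": operator.eq, "$ne": operator.ne,
    "$gt": operator.gt, "$gte": operator.ge,
    "$lt": operator.lt, "$lte": operator.le,
    "$in": lambda value, options: value in options,
    "$nin": lambda value, options: value not in options,
}

def matches(metadata, where):
    """Return True if `metadata` satisfies the `where` filter."""
    for key, condition in where.items():
        if key == "$and":
            ok = all(matches(metadata, sub) for sub in condition)
        elif key == "$or":
            ok = any(matches(metadata, sub) for sub in condition)
        elif key not in metadata:
            ok = False  # toy choice: a missing field never matches
        elif isinstance(condition, dict):
            ok = all(COMPARATORS[op](metadata[key], arg) for op, arg in condition.items())
        else:
            ok = metadata[key] == condition  # bare value is shorthand for $eq
        if not ok:
            return False
    return True

doc = {"year": 2024, "category": "AI", "author": "Alice"}
print(matches(doc, {"$and": [{"year": {"$gte": 2023}}, {"category": {"$in": ["AI", "ML"]}}]}))
print(matches(doc, {"author": {"$ne": "Alice"}}))
```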

Document content filter (where_document): filters on the text itself:

where_document={
    "$contains": "neural network"  # the document contains this substring
}

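get(): exact lookup by ID. Fetches documents directly by ID, optionally combined with a metadata filter; no similarity search is involved.

docs = collection.get(
    ids=["id1", "id2"],
    where={"status": "published"},  # optional metadata filter
    include=["documents", "metadatas"]
)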

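Hybrid queries combine semantic similarity with several filter conditions.

# Find AI documents published after 2023 that contain "deep learning"
results = collection.query(
    query_texts=["latest AI technology"],
    n_results=10,
    where={
        "$and": [
            {"year": {"$gt": 2023}},
            {"category": {"$eq": "AI"}}  # metadata uses the operators listed above; $contains belongs to where_document
        ]
    },
    where_document={"$contains": "deep learning"}
)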

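Key features

Automatic embedding: with query_texts, Chroma calls the Collection's embedding function (such as all-MiniLM-L6-v2) to produce the query vectors.

Distance metrics: the distance function is chosen when the Collection is created:

collection = client.create_collection(
    name="tech_docs",
    metadata={"hnsw:space": "cosine"}  # one of: l2 (default) / cosine / ip
)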

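Result shaping: include controls which fields are returned, which cuts data transfer:
- include=["documents"]: return only the text
- include=["embeddings"]: return the stored vectors
Performance
- Batch queries: pass several query_texts in one call
- Index tuning: adjust HNSW parameters (such as hnsw:construction_ef and hnsw:search_ef) to trade accuracy against speed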

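Typical use cases

Use case	Query example
Semantic search	`query_texts=["natural language processing techniques"], n_results=5`
Recommendation	`query_embeddings=[user_embedding], where={"category": "music"}`
Anomaly detection	`query(..., where={"status": "normal"})`, then compare the distances of suspect records
Multimodal retrieval	`query_embeddings=[image_embedding], n_results=3`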

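Best practices

Empty results: when nothing matches the filters, query() returns an empty inner list per query ('ids': [[]]), while get() returns 'ids': [].
Distances:
- Cosine → distance in [0, 2] (0 = same direction)
- L2 → the larger the value, the less similar
Text encoding: for non-English documents, make sure the embedding model is multilingual (such as paraphrase-multilingual-MiniLM-L12-v2).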
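
Two of the points above, empty results and the meaning of a cosine distance, tend to show up together in calling code. The sketch below assumes a Collection created with the cosine space and results ordered nearest first; `best_hit` and `cosine_similarity_from_distance` are illustrative names, and the dicts stand in for real query results.

```python
def cosine_similarity_from_distance(distance):
    """Cosine distance is 1 - cosine similarity, so invert that relation."""
    return 1.0 - distance

def best_hit(results):
    """Return (id, similarity) for the closest hit of the first query, or None."""
    ids = results["ids"][0] if results["ids"] else []
    if not ids:
        return None  # nothing matched the filters
    return ids[0], cosine_similarity_from_distance(results["distances"][0][0])

empty = {"ids": [[]], "distances": [[]]}
found = {"ids": [["id1", "id2"]], "distances": [[0.25, 0.5]]}

print(best_hit(empty))
print(best_hit(found))
```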

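# 7. Deleting Data from a Collection

Methods for deleting data

In Chroma, data is deleted with Collection.delete(), which supports three modes:

Delete by ID: removes the documents with the given IDs together with their vectors and metadata.
```
collection.delete(ids=["id1", "id2", "id3"])
```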
Delete by metadata condition: the where parameter selects the documents to delete.

# Delete documents whose category metadata is "test"
collection.delete(where={"category": {"$eq": "test"}})

Delete by ID and metadata together: pass both ids and where; only records that satisfy both are deleted (the intersection).
```
collection.delete(
    ids=["id1", "id2"],
    where={"author": "Liu Cixin"}
)
```
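Best practices

Deletion is irreversible: deleted records cannot be recovered.
Index maintenance: Chroma keeps the HNSW index in sync after a delete; no manual rebuild is required.
Performance: deleting a large number of records at once may briefly affect query performance, so delete in batches.
No soft delete: Chroma has no built-in soft delete (mark as deleted). To keep the data, back it up first or filter on a status field in metadata (for example, an is_deleted field).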
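
The metadata-based workaround can be sketched as a small filter builder. It assumes every record is written with `is_deleted` set to `False` and that a "soft delete" is an `update()` that sets the field to `True`; the field name comes from the text above, while `exclude_deleted` is a hypothetical helper.

```python
def exclude_deleted(where=None):
    """Combine a user filter with an is_deleted == False condition."""
    alive = {"is_deleted": False}
    if not where:
        return alive
    return {"$and": [where, alive]}

print(exclude_deleted())
print(exclude_deleted({"category": "test"}))
```

The returned dict is passed as the `where` argument of `query()` or `get()`, so soft-deleted records never appear in results while remaining recoverable.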

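# III. Best Practices

Data preprocessing pipeline

# clean_text, split_text, and embedding_model are placeholders for your own implementations
def process_data(docs):
    cleaned = [clean_text(doc) for doc in docs]
    chunks = split_text(cleaned)  # split the text into chunks
    embeddings = embedding_model(chunks)
    return embeddings, chunks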
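
The pipeline leaves the chunking step undefined. One simple possibility, shown here purely as an assumption, is a fixed-size character window with overlap; unlike the `split_text` call above, this version takes a single string, and real pipelines often split on sentences or tokens instead.

```python
def split_text(text, chunk_size=200, overlap=50):
    """Split `text` into windows of `chunk_size` characters sharing `overlap` characters."""
    if chunk_size <= 0 or not 0 <= overlap < chunk_size:
        raise ValueError("need chunk_size > 0 and 0 <= overlap < chunk_size")
    step = chunk_size - overlap
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # the last window already reaches the end of the text
    return chunks

chunks = split_text("abcdefghij", chunk_size=4, overlap=1)
print(chunks)
```

The overlap keeps a sentence that straddles a boundary retrievable from at least one chunk, at the cost of storing some text twice.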

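LangChain integration

# Newer LangChain releases move these classes to langchain_community
# (or the langchain_chroma package for the vector store)
from langchain.vectorstores import Chroma
from langchain.document_loaders import TextLoader

# Load the document
loader = TextLoader("data.txt")
documents = loader.load_and_split()

# Build the vector store (embedding_model is a LangChain Embeddings instance defined elsewhere)
vectorstore = Chroma.from_documents(
    documents=documents,
    embedding=embedding_model,
    persist_directory="./chroma_db"
)

# Semantic search
docs = vectorstore.similarity_search("query text", k=3)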

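Typical use cases

RAG pipeline

# model (a sentence-embedding model) and llm are assumed to be defined elsewhere
def retrieve_context(question):
    embedding = model.encode(question).tolist()  # convert the NumPy vector to a plain list
    return collection.query(
        query_embeddings=[embedding],
        n_results=3
    )['documents'][0]

def generate_answer(question):
    context = "\n".join(retrieve_context(question))  # join the retrieved passages
    prompt = f"{context}\n\nQuestion: {question}"
    return llm.generate(prompt)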
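
Prompt assembly is the part of this flow that can be tried without a model or a database. The sketch below numbers the retrieved passages and caps the context size; `build_prompt`, the `max_chars` budget, and the prompt wording are assumptions for illustration, not a prescribed format.

```python
def build_prompt(passages, question, max_chars=2000):
    """Number the retrieved passages and stop once the character budget is used up."""
    lines, used = [], 0
    for i, passage in enumerate(passages, start=1):
        if used + len(passage) > max_chars:
            break
        lines.append(f"[{i}] {passage}")
        used += len(passage)
    context = "\n".join(lines)
    return f"Context:\n{context}\n\nQuestion: {question}"

prompt = build_prompt(
    ["Chroma stores embeddings.", "Collections group records."],
    "What does Chroma store?",
)
print(prompt)
```

Numbering the passages lets the model cite its sources, and the budget guards against overflowing the context window when chunks are long.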

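Cross-modal retrieval

# Retrieve by image features (vision_model is assumed to be an image embedding
# model whose vectors match the Collection's dimensionality)
img_embedding = vision_model.encode(image)
results = collection.query(
    query_embeddings=[img_embedding],
    n_results=5
)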

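Notes

Chinese text

Use an embedding model built for Chinese:

# ModelScopeEmbeddings comes from LangChain (langchain_community.embeddings), not from chromadb
embeddings = ModelScopeEmbeddings(model_id="damo/nlp_corom_sentence-embedding_chinese-base")

Monitoring

collection.peek()  # look at the first 10 records
collection.count()  # number of records in the Collection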
