
LlamaIndex Retrieval Tuning in Practice: Seven Technical Details You Can Actually Apply

2025-12-04 20:17:06  Source: deephub



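Getting a RAG system built is really only the start of the work. Once it is running, you will find that answer quality is uneven: sometimes startlingly precise, sometimes wildly off. The cause is usually not the model itself but the "small details" of the retrieval step.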

This article collects seven retrieval optimizations that have proven effective in LlamaIndex, each with code you can put to use directly.

1. Semantic chunking + sentence windows

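Splitting documents into fixed-length chunks is the least effort, but the downside is obvious. It often cuts a sentence in half, so the context breaks and the retriever is left trying to match incomplete fragments.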

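LlamaIndex therefore offers two smarter parsers. SemanticSplitterNodeParser splits at semantic boundaries instead of mechanically counting characters. SentenceWindowNodeParser attaches the sentences before and after each node as a context window. The two can also be combined, which works well: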

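# pip install llama-index
from llama_index.core import Settings, SimpleDirectoryReader, VectorStoreIndex
from llama_index.core.node_parser import (
    SemanticSplitterNodeParser, SentenceWindowNodeParser
)
from llama_index.core.postprocessor import MetadataReplacementPostProcessor

docs = SimpleDirectoryReader("./knowledge_base").load_data()
# Step 1: semantically aware base chunks; the splitter needs an embedding model
# to compare neighbouring sentences (Settings.embed_model is the global default)
semantic_parser = SemanticSplitterNodeParser(
    buffer_size=1, breakpoint_percentile_threshold=95, embed_model=Settings.embed_model
)
semantic_nodes = semantic_parser.get_nodes_from_documents(docs)
# Step 2: re-split each chunk into single-sentence nodes, each storing its
# +/-2 neighbouring sentences in node.metadata["window"]
window_parser = SentenceWindowNodeParser(window_size=2, window_metadata_key="window")
nodes = window_parser.get_nodes_from_documents(semantic_nodes)
index = VectorStoreIndex(nodes)
# At query time, replace each matched sentence with its stored window
query_engine = index.as_query_engine(
    node_postprocessors=[MetadataReplacementPostProcessor(target_metadata_key="window")]
)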

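The retriever scores each node on its own. When every node holds one complete semantic unit plus a little surrounding context, the hit rate tends to go up.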
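To make the ideas behind the two parsers concrete without an embedding API, here is a minimal pure-Python sketch. It is not LlamaIndex's implementation, and the function names are hypothetical. A toy bag-of-words vector stands in for a real embedding model. `semantic_chunks` starts a new chunk wherever the distance between adjacent sentences exceeds the 95th percentile of all adjacent distances, loosely mirroring `breakpoint_percentile_threshold`. `window_nodes` attaches ±`window_size` neighbouring sentences to each sentence, the way a sentence window does.

```python
import math
import re


def embed(sentence: str) -> dict[str, float]:
    """Toy bag-of-words vector; a stand-in (assumption) for a real embedding model."""
    vec: dict[str, float] = {}
    for word in re.findall(r"[a-z0-9]+", sentence.lower()):
        vec[word] = vec.get(word, 0.0) + 1.0
    return vec


def cosine(a: dict[str, float], b: dict[str, float]) -> float:
    dot = sum(v * b.get(k, 0.0) for k, v in a.items())
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0


def percentile(values: list[float], pct: float) -> float:
    """Linear-interpolation percentile (the same convention as numpy's default)."""
    ordered = sorted(values)
    pos = (len(ordered) - 1) * pct / 100
    lo, hi = math.floor(pos), math.ceil(pos)
    return ordered[lo] + (ordered[hi] - ordered[lo]) * (pos - lo)


def semantic_chunks(sentences: list[str], threshold_pct: float = 95) -> list[list[str]]:
    """Start a new chunk where adjacent sentences are unusually dissimilar."""
    if len(sentences) < 2:
        return [list(sentences)]
    vecs = [embed(s) for s in sentences]
    dists = [1 - cosine(vecs[i], vecs[i + 1]) for i in range(len(vecs) - 1)]
    cutoff = percentile(dists, threshold_pct)
    chunks, current = [], [sentences[0]]
    for i, dist in enumerate(dists):
        if dist > cutoff:  # topic jump: close the current chunk
            chunks.append(current)
            current = []
        current.append(sentences[i + 1])
    chunks.append(current)
    return chunks


def window_nodes(sentences: list[str], window_size: int = 2) -> list[dict[str, str]]:
    """One node per sentence, with the surrounding sentences stored as its 'window'."""
    nodes = []
    for i, sentence in enumerate(sentences):
        lo, hi = max(0, i - window_size), min(len(sentences), i + window_size + 1)
        nodes.append({"text": sentence, "window": " ".join(sentences[lo:hi])})
    return nodes


if __name__ == "__main__":
    doc = [
        "Cats purr when happy.",
        "Cats purr softly.",
        "Cats sleep a lot.",
        "Rust uses ownership rules.",
        "Rust ownership prevents data races.",
    ]
    for chunk in semantic_chunks(doc):
        for node in window_nodes(chunk, window_size=1):
            print(node["text"], "->", node["window"])
```

On this toy input, the only split lands between the cat sentences and the Rust sentences, because that pair shares no words at all. Each window then stays inside its own semantic chunk.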

2. Hybrid retrieval: BM25 + vectors

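Vector embeddings are good at capturing semantic similarity but tend to fail on exact-match cases such as technical abbreviations and product model numbers. The classic BM25 algorithm fills exactly this gap. It is sensitive to exact terms and recalls long-tail terminology well.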
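A minimal pure-Python sketch of Okapi BM25 scoring shows why an exact identifier such as a model number gets a strong match. A document only scores on terms it literally contains, and rare terms receive a high idf. Several details are assumptions and are not taken from LlamaIndex's BM25 retriever: the constants `k1=1.5` and `b=0.75` are common defaults, the idf uses the non-negative `log(1 + ...)` variant, and the tokenizer is chosen for this example.

```python
import math
import re
from collections import Counter


def tokenize(text: str) -> list[str]:
    # Keep hyphenated identifiers such as "xr-200" together (tokenizer is an assumption).
    return re.findall(r"[a-z0-9]+(?:-[a-z0-9]+)*", text.lower())


def bm25_scores(query: str, docs: list[str], k1: float = 1.5, b: float = 0.75) -> list[float]:
    tokenized = [tokenize(d) for d in docs]
    n_docs = len(tokenized)
    avgdl = sum(len(t) for t in tokenized) / n_docs
    df = Counter(term for toks in tokenized for term in set(toks))  # document frequency
    scores = []
    for toks in tokenized:
        tf = Counter(toks)
        score = 0.0
        for term in tokenize(query):
            if term not in tf:
                continue  # no partial or semantic credit: only exact terms count
            idf = math.log(1 + (n_docs - df[term] + 0.5) / (df[term] + 0.5))
            freq = tf[term]
            length_norm = 1 - b + b * len(toks) / avgdl
            score += idf * freq * (k1 + 1) / (freq + k1 * length_norm)
        scores.append(score)
    return scores


if __name__ == "__main__":
    corpus = [
        "Reset the XR-200 router from the admin page.",
        "How to reboot a wireless router safely.",
        "Quarterly revenue report.",
    ]
    print(bm25_scores("XR-200 reset", corpus))
```

The second document is semantically close to the query ("reboot a router") yet scores zero, which is exactly the gap vector search is meant to cover.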

LlamaIndex's QueryFusionRetriever can fuse the two retrieval methods directly:

# pip install llama-index-retrievers-bm25
from llama_index.core.retrievers import QueryFusionRetriever
from llama_index.retrievers.bm25 import BM25Retriever
# Build both retrievers over the same nodes
vector_index = index  # from section 1
bm25_retriever = BM25Retriever.from_defaults(nodes=nodes, similarity_top_k=5)
retriever = QueryFusionRetriever(
    retrievers=[
        vector_index.as_retriever(similarity_top_k=5),
        bm25_retriever,
    ],
    similarity_top_k=5,
    num_queries=1,  # no query rewriting: the original query goes to both retrievers
    mode="reciprocal_rerank",  # reciprocal rank fusion (RRF)
)

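BM25 catches exact matches and vectors catch semantic relations. After reciprocal rank fusion (RRF), the top-k is usually noticeably better than either method alone, and it takes very little extra code.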
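Reciprocal rank fusion looks only at each document's rank in each list, not at the raw scores. That is why BM25 scores and cosine similarities can be merged without normalizing them first. Below is a minimal sketch. It uses the constant `k=60`, which is customary for RRF; that value is an assumption here and does not come from the article.

```python
from typing import Optional


def reciprocal_rank_fusion(
    ranked_lists: list[list[str]], k: float = 60.0, top_n: Optional[int] = None
) -> list[str]:
    """Fuse ranked lists of doc ids: score(d) = sum over lists of 1 / (k + rank of d)."""
    scores: dict[str, float] = {}
    for ranking in ranked_lists:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    # Highest fused score first; ties broken by id so the output is deterministic.
    fused = sorted(scores, key=lambda d: (-scores[d], d))
    return fused if top_n is None else fused[:top_n]


if __name__ == "__main__":
    bm25_ranking = ["a", "b", "c"]
    vector_ranking = ["b", "d", "a"]
    print(reciprocal_rank_fusion([bm25_ranking, vector_ranking]))
```

This prints `['b', 'a', 'd', 'c']`. "b" wins because both retrievers rank it near the top, even though BM25 alone put "a" first.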

3. Multi-query expansion

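Users phrase questions in all kinds of ways, and the same intent can be worded many different ways. Retrieving with a single query can therefore miss documents that are relevant but worded differently.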

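The idea behind multi-query expansion is to generate several variants of the query automatically, retrieve with each one, and then fuse the results.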

from llama_index.core.retrievers import QueryFusionRetriever
multi_query_retriever = QueryFusionRetriever(
    retrievers=[vector_index.as_retriever(similarity_top_k=4)],
    similarity_top_k=4,
    num_queries=4,  # the original query plus 3 LLM-generated rewrites
    mode="reciprocal_rerank",  # reciprocal rank fusion across the query variants
)

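If your use case involves structured comparison questions (for example, "What is the difference between A and B?"), also consider query decomposition. Split the question into sub-questions, retrieve for each one, and combine the results at the end.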
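For the single comparison pattern mentioned above, query decomposition can be sketched without an LLM. `decompose` is a hypothetical rule-based stand-in for what would normally be an LLM call; LlamaIndex's own take on the idea is its SubQuestionQueryEngine. `retrieve` is any function that maps a query to a ranked list of document ids.

```python
import re
from typing import Callable

COMPARE = re.compile(r"what(?:'s| is) the difference between (.+?) and (.+?)\??$", re.IGNORECASE)


def decompose(question: str) -> list[str]:
    """Hypothetical rule: 'difference between A and B' becomes two sub-questions."""
    match = COMPARE.match(question.strip())
    if match is None:
        return [question]  # not a comparison: retrieve with the question as-is
    a, b = match.groups()
    return [f"What is {a}?", f"What is {b}?"]


def retrieve_decomposed(
    question: str, retrieve: Callable[[str], list[str]], top_k: int = 3
) -> list[str]:
    """Retrieve for every sub-question and merge the results, keeping first occurrences."""
    merged, seen = [], set()
    for sub_question in decompose(question):
        for doc_id in retrieve(sub_question)[:top_k]:
            if doc_id not in seen:
                seen.add(doc_id)
                merged.append(doc_id)
    return merged


if __name__ == "__main__":
    toy_index = {  # hypothetical retrieval results
        "What is BM25?": ["bm25-intro", "ranking-basics"],
        "What is dense retrieval?": ["embeddings-101", "ranking-basics"],
    }
    q = "What is the difference between BM25 and dense retrieval?"
    print(retrieve_decomposed(q, lambda sub: toy_index.get(sub, [])))
```

Each side of the comparison gets its own retrieval pass. This prevents one side's documents from crowding out the other's in a single top-k.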

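Different phrasings land on different neighbours in the embedding space. Fusion keeps that diversity while pushing results that several query variants agree on toward the top.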

4. Reranker

The top-k returned by first-stage retrieval is usually only "decent." If you want to raise quality another notch, a reranker is a good choice.

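Unlike a bi-encoder, a cross-encoder passes the query and the passage through the model together, which gives a finer-grained relevance judgment. The catch is speed. If it runs only on the candidate set, though, the latency is usually tolerable: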

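# pip install llama-index-postprocessor-cohere-rerank
import os
from llama_index.postprocessor.cohere_rerank import CohereRerank
reranker = CohereRerank(api_key=os.environ["COHERE_API_KEY"], top_n=4)  # keep the best 4
# Local alternative: a Hugging Face cross-encoder via sentence-transformers
# from llama_index.core.postprocessor import SentenceTransformerRerank
# reranker = SentenceTransformerRerank(model="cross-encoder/ms-marco-MiniLM-L-6-v2", top_n=4)
query_engine = vector_index.as_query_engine(
    similarity_top_k=12,  # wide, cheap first-stage recall
    node_postprocessors=[reranker],
)
response = query_engine.query("How does feature X affect Y?")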

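Use vector retrieval to pull a candidate set quickly (say, the top 12), then let the cross-encoder rerank it down to the top 4. This strikes a good balance between speed and precision.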
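The retrieve-then-rerank pattern does not depend on any particular model. In this minimal sketch, `cheap_score` stands in for vector similarity and `precise_score` for a cross-encoder; both are hypothetical callables. The point is that the expensive scorer runs only `first_k` times, however large the corpus is.

```python
from typing import Callable, Sequence

Scorer = Callable[[str, str], float]


def two_stage_search(
    query: str,
    corpus: Sequence[str],
    cheap_score: Scorer,
    precise_score: Scorer,
    first_k: int = 12,
    final_n: int = 4,
) -> list[str]:
    # Stage 1: score everything cheaply and keep the top first_k candidates.
    candidates = sorted(corpus, key=lambda doc: cheap_score(query, doc), reverse=True)[:first_k]
    # Stage 2: run the expensive scorer on the candidates only and keep the best final_n.
    reranked = sorted(candidates, key=lambda doc: precise_score(query, doc), reverse=True)
    return reranked[:final_n]


if __name__ == "__main__":
    calls: list[str] = []

    def cheap(query: str, doc: str) -> float:
        return -abs(int(doc.split()[1]) - 50.3)  # toy: ids near 50 look relevant

    def precise(query: str, doc: str) -> float:
        calls.append(doc)  # count how often the "cross-encoder" runs
        return float(int(doc.split()[1]))  # toy: prefers higher ids

    corpus = [f"doc {i}" for i in range(100)]
    print(two_stage_search("q", corpus, cheap, precise), len(calls))
```

With 100 toy documents, the precise scorer is called 12 times, not 100. This is the same reason the pitfalls list below warns against reranking the full result set.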

5. Metadata filtering and deduplication

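Not every retrieved passage deserves trust. Documents range from new to old, and some are official releases while others are only drafts. If your corpus mixes content from different versions and product lines, skipping filtering is setting a trap for yourself.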

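Metadata filtering restricts retrieval to documents that meet given conditions. Deduplication stops repeated passages from taking up the context window more than once. Time weighting can give newer documents a higher weight, although the code below covers only filtering, the similarity cutoff, and deduplication: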

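from llama_index.core.postprocessor import SimilarityPostprocessor
from llama_index.core.retrievers import VectorIndexRetriever
from llama_index.core.vector_stores import MetadataFilter, MetadataFilters
retriever = VectorIndexRetriever(
    index=vector_index,
    similarity_top_k=12,
    # only nodes whose metadata has product == "alpha"
    filters=MetadataFilters(filters=[MetadataFilter(key="product", value="alpha")]),
)
nodes = retriever.retrieve("Latest install steps for alpha build?")
# Drop weak matches
nodes = SimilarityPostprocessor(similarity_cutoff=0.78).postprocess_nodes(nodes)
# Drop duplicate passages (same text after whitespace normalization)
seen, deduped = set(), []
for n in nodes:
    key = " ".join(n.node.get_content().split())
    if key not in seen:
        seen.add(key)
        deduped.append(n)
nodes = deduped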

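The filter keeps out irrelevant documents, the similarity threshold removes weak matches, and deduplication preserves diversity. Together, these steps raise the floor on retrieval quality.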
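The same steps, plus the time weighting mentioned earlier, fit in a short framework-free sketch. Everything here is an assumption for illustration and is not taken from LlamaIndex: the `Hit` type, the `updated` metadata field, and the half-life decay `score * 0.5 ** (age_days / half_life_days)`, which is one simple way to favour newer documents.

```python
from dataclasses import dataclass, field
from datetime import date
from typing import Any, Optional


@dataclass
class Hit:
    text: str
    score: float  # raw similarity from the retriever
    metadata: dict[str, Any] = field(default_factory=dict)


def postprocess(
    hits: list[Hit],
    required: dict[str, Any],
    cutoff: float = 0.78,
    half_life_days: float = 180.0,
    today: Optional[date] = None,
) -> list[Hit]:
    today = today or date.today()
    # 1. Metadata filter and similarity cutoff on the raw score.
    kept = [
        h for h in hits
        if all(h.metadata.get(k) == v for k, v in required.items()) and h.score >= cutoff
    ]
    # 2. Deduplicate on case/whitespace-normalized text, keeping the best-scoring copy.
    seen: set[str] = set()
    unique: list[Hit] = []
    for h in sorted(kept, key=lambda h: h.score, reverse=True):
        key = " ".join(h.text.lower().split())
        if key not in seen:
            seen.add(key)
            unique.append(h)

    # 3. Recency weighting: the score halves for every half_life_days of document age.
    def weighted(h: Hit) -> float:
        updated = h.metadata.get("updated")
        if updated is None:
            return h.score
        return h.score * 0.5 ** ((today - updated).days / half_life_days)

    return sorted(unique, key=weighted, reverse=True)
```

Suppose an outdated "v1" install page has a slightly higher raw similarity than the current "v2" page. With a 180-day half-life, the recency weighting is enough to put the two-year-old page behind the current one. The cutoff still applies to the raw similarity, so old but strongly matching pages are demoted, not dropped.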

6. Choosing a response synthesis mode

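Retrieval is only a means; the real goal is a reliable answer. If the synthesis stage is not kept under control, the model can easily drift away from the retrieved content and improvise, and that is where hallucinations come from.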

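LlamaIndex's "compact" response mode packs the retrieved nodes into as few prompts as the context window allows. This keeps the model working closely from them and lowers the chance of it drifting off topic: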

# Balanced option that stays close to the retrieved text
qe = vector_index.as_query_engine(
    similarity_top_k=8,
    response_mode="compact",  # pack as many retrieved chunks per LLM call as fit
    use_async=False,
)
ans = qe.query("Summarize the security model, cite sources.")
print(ans)  # prints the answer text only
for src in ans.source_nodes:  # the retrieved nodes the answer was built from
    print(src.node.node_id, src.score)

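Strictly speaking, this is not a retrieval optimization, but it closes a feedback loop. If answers keep drifting, you may need to go back and adjust top-k or the similarity threshold.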
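LlamaIndex's documentation describes "compact" as combining the retrieved chunks into as few, larger prompts as the context window allows, refining the answer if more than one prompt is needed. The packing step alone looks roughly like the minimal sketch below. It counts words instead of real tokens (an assumption) and leaves an oversized chunk on its own instead of splitting it.

```python
def compact_chunks(chunks: list[str], budget_words: int, sep: str = "\n\n") -> list[str]:
    """Greedily pack consecutive chunks into groups of at most budget_words words."""
    packed: list[str] = []
    current: list[str] = []
    used = 0
    for chunk in chunks:
        n_words = len(chunk.split())
        if current and used + n_words > budget_words:
            packed.append(sep.join(current))  # this prompt is full: start a new one
            current, used = [], 0
        current.append(chunk)
        used += n_words
    if current:
        packed.append(sep.join(current))
    return packed


if __name__ == "__main__":
    retrieved = ["a b c", "d e", "f g h i", "j"]
    for prompt_context in compact_chunks(retrieved, budget_words=5):
        print(repr(prompt_context))
```

With a 5-word budget, the four chunks become two prompts, so the model is called twice instead of once per chunk.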

7. Continuous evaluation

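Without quantitative metrics, optimization is groping in the dark. Prepare a small evaluation set of 10 to 50 questions that covers your core business scenarios. Run it after every parameter change and watch how faithfulness and correctness move.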

from llama_index.core.evaluation import CorrectnessEvaluator, FaithfulnessEvaluator
from llama_index.core.query_engine import RetrieverQueryEngine
faith = FaithfulnessEvaluator()  # is the answer grounded in the retrieved context?
corr = CorrectnessEvaluator()  # how close is the answer to the reference answer?
eval_prompts = [
    {"q": "What ports do we open for service Z?", "gold": "Ports 443 and 8443."},
    # add 10-50 more spanning your taxonomy
]
# Wrap the fusion retriever from section 3 in a query engine
qe = RetrieverQueryEngine.from_args(multi_query_retriever, response_mode="compact")
scores = []
for item in eval_prompts:
    res = qe.query(item["q"])
    scores.append({
        "q": item["q"],
        "faithful": faith.evaluate_response(query=item["q"], response=res).score,
        "correct": corr.evaluate_response(
            query=item["q"], response=res, reference=item["gold"]
        ).score,
    })
# Now look at averages, find weak spots, iterate.

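Suppose the system keeps failing on a certain class of question, such as missing specific numbers or mixing up policy names. You can then adjust for that problem: raise the weight of BM25? Raise the similarity threshold? Switch to a stronger reranker?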
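Finding those weak spots is mostly bookkeeping. The minimal sketch below assumes that each row of `scores` from the loop above also carries a hypothetical `category` label that you assign to each evaluation question. It averages one metric per category and lists the categories below a threshold, worst first. Pick the threshold to match the scale of the metric you choose.

```python
from collections import defaultdict
from statistics import mean
from typing import Any


def weak_spots(rows: list[dict[str, Any]], metric: str, threshold: float) -> list[tuple[str, float]]:
    """Average `metric` per category and return categories below `threshold`, worst first."""
    by_category: dict[str, list[float]] = defaultdict(list)
    for row in rows:
        by_category[row["category"]].append(float(row[metric]))
    averages = {cat: mean(vals) for cat, vals in by_category.items()}
    return sorted(
        ((cat, avg) for cat, avg in averages.items() if avg < threshold),
        key=lambda pair: pair[1],
    )


if __name__ == "__main__":
    rows = [  # made-up scores, for illustration only
        {"category": "numbers", "faithful": 1.0, "correct": 2.0},
        {"category": "numbers", "faithful": 0.0, "correct": 3.0},
        {"category": "policy-names", "faithful": 1.0, "correct": 5.0},
    ]
    print(weak_spots(rows, metric="correct", threshold=4.0))
```

Rerunning this after every parameter change shows whether a tweak actually fixed the weak category or merely moved the problem to another one.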

A few common pitfalls

Chunks that are too long hurt recall. Keep nodes focused and let the sentence window supply the extra context.

Do not rerank the full result set; rerank only the first-stage candidates.

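If your corpus mixes several product versions, add metadata fields such as version, env, and product when you build the index. Otherwise retrieval may return outdated content.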

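Finally, do not judge quality by gut feeling. Keep an evaluation spreadsheet that records the scores after every parameter change. Over time you will see which parameters matter most for which kinds of question.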

Summary

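There is no single silver bullet for RAG answer quality; it comes from stacking a series of sensible configurations. Start with hybrid retrieval and sentence windows and observe the effect. Then gradually add multi-query expansion and a reranker.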

Measure, adjust, measure again, and repeat.

https://avoid.overfit.cn/post/507a074851c5480a818e67374aecddd6

Author: Modexa


Notice: The content above (including the pictures and videos if any) is uploaded and posted by a user of NetEase Hao, which is a social media platform and only provides information storage services.