Build Your Own AI Coding Assistant: A Context-Aware Implementation Based on RAG

2026-01-12 21:37:12  Source: deephub

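Many people assume that building an AI assistant just means calling the OpenAI API, but that only gets you a general-purpose chatbot.

A code assistant needs a context-aware RAG (Retrieval-Augmented Generation) pipeline designed specifically for code. Code is not like ordinary text: it is strictly structured, so it cannot be split at arbitrary character offsets.

A typical code assistant has four parts: code parsing, which turns source files into an AST (abstract syntax tree); vector storage, which indexes code snippets by meaning rather than by keyword match; a repository map, which gives the LLM a global view of the file structure and where classes are defined; and an inference layer, which combines the user's question, the relevant code, and the repository structure into one complete prompt for the model.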

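Code parsing: don't use a text splitter

The most common pitfall when building your own code assistant is reaching straight for a plain text splitter.

For example, cutting a Python file every 1,000 characters will very likely slice a function in half. An AI that receives only the second half has no function signature, so it cannot know the parameters or other specifics.

The right approach is to chunk based on the AST. tree-sitter is the de facto standard tool for this and is used by editors such as Atom and Neovim. It lets you split code along logical boundaries such as classes or functions.

The dependencies are tree_sitter and tree_sitter_languages, in addition to the LangChain packages imported below: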

from langchain_community.document_loaders.generic import GenericLoader
from langchain_community.document_loaders.parsers import LanguageParser
from langchain_text_splitters import Language, RecursiveCharacterTextSplitter

# 1. Load the repository
# Point the loader at the local repo; `suffixes` limits it to Python files.
# LanguageParser puts each top-level function and class into its own document;
# files shorter than parser_threshold lines are kept whole.
loader = GenericLoader.from_filesystem(
    "./my_legacy_project",
    glob="**/*",
    suffixes=[".py"],
    parser=LanguageParser(language=Language.PYTHON, parser_threshold=500),
)
documents = loader.load()

# 2. Split oversized documents along Python syntax boundaries
# from_language() tries class and function separators before falling back to
# plain line breaks, so cuts land on definitions wherever possible.
python_splitter = RecursiveCharacterTextSplitter.from_language(
    language=Language.PYTHON,
    chunk_size=2000,
    chunk_overlap=200,
)
texts = python_splitter.split_documents(documents)
print(f"Processed {len(texts)} semantic code chunks.")
# Example output: Processed 452 semantic code chunks.

Keeping functions intact is critical. Each chunk the retriever receives is a complete logical unit rather than a code fragment.
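To see the difference between character-based and boundary-aware chunking in isolation, here is a minimal sketch. It uses Python's built-in ast module rather than tree-sitter or LangChain, and the SOURCE string and both helper functions are made up for illustration.

```python
import ast

# Hypothetical sample file, used only for this illustration.
SOURCE = '''\
import os

def load_config(path, default=None):
    """Read a config file, falling back to a default."""
    if not os.path.exists(path):
        return default
    with open(path) as handle:
        return handle.read()

class UserStore:
    def __init__(self):
        self.users = {}

    def add(self, name):
        self.users[name] = True
'''


def naive_chunks(text, size):
    """Fixed-size character windows that ignore code structure."""
    return [text[i:i + size] for i in range(0, len(text), size)]


def ast_chunks(text):
    """One chunk per top-level function or class, cut at AST node boundaries."""
    tree = ast.parse(text)
    chunks = []
    for node in tree.body:
        if isinstance(node, (ast.FunctionDef, ast.AsyncFunctionDef, ast.ClassDef)):
            chunks.append(ast.get_source_segment(text, node))
    return chunks


naive = naive_chunks(SOURCE, 120)
logical = ast_chunks(SOURCE)

# The second naive chunk starts in the middle of load_config, without its signature.
print(repr(naive[1][:40]))

# Every AST chunk starts at a definition and is valid Python on its own.
for chunk in logical:
    print(chunk.splitlines()[0])
```

A real pipeline would also need to handle decorators, module-level code, and classes too large for one chunk, which this sketch leaves out.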

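Vector storage

Once chunking is done the chunks need to be stored, and a vector database is the standard choice.

For the embedding model, OpenAI's text-embedding-3-large or one of Voyage AI's code-specific models is a good choice. Models like these do better at understanding code semantics and can recognize that def get_users(): and "get the user list" mean the same thing.

ChromaDB is used here as the example: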

from langchain_chroma import Chroma
from langchain_openai import OpenAIEmbeddings

# Initialize the vector DB
# Ideally, persist this to disk so you don't re-index every run
db = Chroma.from_documents(
    texts,
    OpenAIEmbeddings(model="text-embedding-3-large"),
    persist_directory="./chroma_db",
)
retriever = db.as_retriever(
    search_type="mmr",  # Maximal Marginal Relevance for diversity
    search_kwargs={"k": 8},  # Return 8 relevant but diverse snippets
)

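One detail worth explaining: search_type is set to "mmr" because plain similarity search tends to return five nearly identical chunks. MMR (Maximal Marginal Relevance) favors results that are relevant yet different from one another, which gives the model a wider view of the codebase.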
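The selection rule behind MMR can be sketched in a few lines of plain Python. This is an illustration of the general idea, not the code LangChain or Chroma actually runs; the two-dimensional vectors are toy values invented for the example, and trade_off is a hypothetical name for the weight between relevance and diversity.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm


def top_k_similar(query, candidates, k):
    """Plain similarity search: the k candidates closest to the query."""
    order = sorted(range(len(candidates)),
                   key=lambda i: cosine(query, candidates[i]),
                   reverse=True)
    return order[:k]


def mmr(query, candidates, k, trade_off=0.5):
    """Greedy MMR: reward similarity to the query, penalize similarity
    to anything that has already been selected."""
    selected = []
    remaining = list(range(len(candidates)))
    while remaining and len(selected) < k:
        def score(i):
            relevance = cosine(query, candidates[i])
            redundancy = max(
                (cosine(candidates[i], candidates[j]) for j in selected),
                default=0.0,
            )
            return trade_off * relevance - (1 - trade_off) * redundancy
        best = max(remaining, key=score)
        selected.append(best)
        remaining.remove(best)
    return selected


# Toy embeddings (made up): 0, 1 and 2 are near-duplicates, 3 is distinct.
query = [1.0, 0.0]
candidates = [[0.9, 0.1], [0.9, 0.12], [0.88, 0.11], [0.6, -0.8]]

plain = top_k_similar(query, candidates, k=2)
picked = mmr(query, candidates, k=2)
print(plain)   # [0, 2]: two near-duplicates
print(picked)  # [0, 3]: the best match plus a different snippet
```

With trade_off closer to 1 the result approaches plain similarity search; closer to 0 it favors diversity over relevance.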

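Context construction

Handing code snippets to GPT is not enough on its own. It may see the definition of the User class without knowing how main.py instantiates it. What is missing is the global view.

The fix is to design a system prompt that has the model approach the code as a senior engineer: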

from langchain.chains import create_retrieval_chain
from langchain.chains.combine_documents import create_stuff_documents_chain
from langchain_core.prompts import ChatPromptTemplate
from langchain_openai import ChatOpenAI
llm = ChatOpenAI(model="gpt-4-turbo-preview", temperature=0)
# The "Stuff" chain puts all retrieved docs into the context window
prompt = ChatPromptTemplate.from_template("""
You are a Senior Software Engineer assisting with a Python legacy codebase.
Use the following pieces of retrieved context to answer the question.
If the context doesn't contain the answer, say "I don't have enough context."
CONTEXT FROM REPOSITORY:
{context}
USER QUESTION:
{input}
Answer specifically using the class names and variable names found in the context.
""")
combine_docs_chain = create_stuff_documents_chain(llm, prompt)
rag_chain = create_retrieval_chain(retriever, combine_docs_chain)
# Let's test it on that tricky legacy function
response = rag_chain.invoke({"input": "How do I refactor the PaymentProcessor to use the new AsyncAPI?"})
print(response["answer"])

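With this in place the AI is far less likely to make up imports that do not exist, because it can now see the AsyncAPI class definition and the PaymentProcessor class retrieved from the vector store. It will tell you something like: "To refactor PaymentProcessor you need to modify the _make_request method. According to the context, initializing AsyncAPI requires the await keyword..."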

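Repo map: handling large codebases

The approach above is enough for small and medium projects, but once a codebase grows past about 100,000 lines it falls well short.

Tools such as Aider and Cursor use a more advanced technique called a Repo Map: the whole codebase is compressed into a tree structure and placed in the context window: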

src/
  auth/
    login.py:
      - class AuthManager
      - def login(user, pass)
  db/
    models.py:
      - class User

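Our approach is to generate a lightweight tree of file names and class definitions before sending the query, and append it to the system prompt. The model can then say: "The map shows an auth_utils.py, but the retrieved results don't include its contents. Should we look at that file?"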
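One way such a map could be generated is sketched below, using Python's built-in ast module and pathlib. This is an assumption about how to build it, not how Aider or Cursor do it: summarize_file and build_repo_map are hypothetical helpers, the output is a flat list of relative paths rather than a nested tree, and the demo files are created in a temporary directory purely for illustration.

```python
import ast
import tempfile
from pathlib import Path


def summarize_file(path):
    """List top-level classes and functions, plus class methods, in one file."""
    try:
        tree = ast.parse(path.read_text(encoding="utf-8"))
    except (SyntaxError, UnicodeDecodeError):
        return []  # skip files that cannot be parsed

    def describe(func):
        # Positional parameter names only; defaults and annotations are omitted.
        params = ", ".join(arg.arg for arg in func.args.args)
        return f"- def {func.name}({params})"

    lines = []
    for node in tree.body:
        if isinstance(node, ast.ClassDef):
            lines.append(f"- class {node.name}")
            for child in node.body:
                if isinstance(child, (ast.FunctionDef, ast.AsyncFunctionDef)):
                    lines.append("  " + describe(child))
        elif isinstance(node, (ast.FunctionDef, ast.AsyncFunctionDef)):
            lines.append(describe(node))
    return lines


def build_repo_map(root):
    """Return a compact text outline of every .py file under root."""
    root = Path(root)
    out = []
    for path in sorted(root.rglob("*.py")):
        out.append(f"{path.relative_to(root).as_posix()}:")
        out.extend(f"  {line}" for line in summarize_file(path))
    return "\n".join(out)


# Demo on a throwaway project (file contents are made up).
with tempfile.TemporaryDirectory() as tmp:
    demo = Path(tmp)
    (demo / "auth").mkdir()
    (demo / "auth" / "login.py").write_text(
        "class AuthManager:\n    def login(self, user, password):\n        pass\n",
        encoding="utf-8",
    )
    (demo / "db").mkdir()
    (demo / "db" / "models.py").write_text("class User:\n    pass\n", encoding="utf-8")
    repo_map = build_repo_map(demo)

print(repo_map)
```

The resulting string could be placed in the system prompt ahead of the retrieved context. For a very large repository the map itself may need trimming, for example by keeping only the files nearest to the retrieved chunks.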

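Summary

The goal of building your own code assistant is not to compete with Copilot on completion speed but to improve understanding. Internal documentation, coding standards, and the legacy modules that only long-time employees know about can all be fed in. The AI goes from guessing to actually understanding your codebase.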

https://avoid.overfit.cn/post/e04b69f27ca841b59679a916781b28c6

Author: Rahul Kaklotar
