A project that I built for the embedded security team during my internship at Bosch. I really loved this project, but honestly, I don't love security that much 🥹. This was also the first time I coded extensively, writing about 1,700 LoC (lines of code) in Python for the backend.
WARNING: None of the code below is the real code from the project; it's just a sketch of the concepts.
1. Source code parsing

from clang import cindex

index = cindex.Index.create()
tu = index.parse(
    path,  # the path to the file I need to parse
    # I need this option so the parser doesn't ignore macros (such as #define)
    options=cindex.TranslationUnit.PARSE_DETAILED_PROCESSING_RECORD,
)
# parse function for C source code
def parse(self):
    for cursor in self.tu.cursor.walk_preorder():
        match cursor.kind:
            # if you want to parse typedefs
            case CursorKind.TYPEDEF_DECL:
                ...
            # if you want to parse macros
            case CursorKind.MACRO_DEFINITION:
                ...
            # if you want to parse functions
            case CursorKind.FUNCTION_DECL:
                ...
# we can even get the body of a function
# just get the start line and the end line
def get_full_code(self, cursor, path):
    extent = cursor.extent
    start = extent.start
    end = extent.end

    # read the source code
    with open(path, "r") as f:
        content = f.readlines()
    return "".join(content[start.line - 1 : end.line]).strip()
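To show how these pieces fit together, here is a small self-contained sketch. The CParser wrapper class, the functions() helper, and the filtering on cursor.location.file are my own illustration on top of libclang, not the real project code.

from clang.cindex import CursorKind, Index, TranslationUnit

class CParser:
    def __init__(self, path):
        self.path = path
        self.tu = Index.create().parse(
            path,
            options=TranslationUnit.PARSE_DETAILED_PROCESSING_RECORD,
        )

    def get_full_code(self, cursor, path):
        # same idea as above: slice the file between the cursor's start and end lines
        extent = cursor.extent
        with open(path, "r") as f:
            content = f.readlines()
        return "".join(content[extent.start.line - 1 : extent.end.line]).strip()

    def functions(self):
        # collect every function defined in this file (skip anything pulled in via #include)
        result = []
        for cursor in self.tu.cursor.walk_preorder():
            if cursor.kind != CursorKind.FUNCTION_DECL:
                continue
            if cursor.location.file is None or cursor.location.file.name != self.path:
                continue
            result.append((cursor.spelling, self.get_full_code(cursor, self.path)))
        return result

# usage
parser = CParser("example.c")
for name, body in parser.functions():
    print(name, body, sep="\n")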
2. "LLM" architecture
a. The backend
To support multiple backends, I just use a simple switch-case and inheritance:
class Backend:
    # the model is a pydantic object with settings related to the LLM
    def __init__(self, model):
        # some settings here
        ...

        # because I need to support APIs,
        # I use a new feature in LangChain called InMemoryRateLimiter to limit API usage
        # https://api.python.langchain.com/en/latest/rate_limiters/langchain_core.rate_limiters.InMemoryRateLimiter.html

    @property
    def model(self):
        # this is the part that differs for each backend
        raise NotImplementedError
For example, I can add OpenAI and Ollama backends as shown below. I cache the model in _model to avoid recreating it every time model is accessed.
class OpenAIBackend(Backend):
    @property
    def model(self):
        if not hasattr(self, "_model"):
            self._model = ChatOpenAI(
                # put all the settings you want here
                ...
            )
        return self._model
class OllamaBackend(Backend):
    @property
    def model(self):
        if not hasattr(self, "_model"):
            self._model = ChatOllama(
                # put all the settings you want here
                ...
            )
        return self._model
And now I use a simple factory design to create the backend (just another switch-case):
def create_backend(backend_type):
    match backend_type:
        case "openai":
            ...  # create the openai backend
        # add more cases if you want to support more backends
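As a concrete, purely illustrative version of that factory, assuming the two backend classes above and the same pydantic settings object that Backend.__init__ takes:

def create_backend(backend_type, model):
    # model is the pydantic settings object passed to Backend.__init__
    match backend_type:
        case "openai":
            return OpenAIBackend(model)
        case "ollama":
            return OllamaBackend(model)
        case _:
            # add more cases here to support more backends
            raise ValueError(f"Unsupported backend: {backend_type}")

# usage (`model` here is your pydantic settings instance);
# the chat model is only built on first access and cached in _model
backend = create_backend("ollama", model)
llm = backend.model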
b. Detailed implementation

The architecture is inspired heavily by this LangGraph tutorial. However, after a while, I couldn't figure out how to implement memory into one of the nodes (probably due to skill issues), so I decided to build my own. It's quite simple: just create a model for each node (though this isn't great for performance) and then use conditions to decide which node to go to next. The crucial part is the prompt. Since I use a local LLM model (through llama.cpp), if the prompt is poor, the performance suffers significantly, even causing the LLM to fail to generate answers in the Route node.
In the Route LLM, there are two options: use structured_output, a new way to generate structured answers supported by LangChain, or prompt the model to generate structured output like JSON. I chose the latter because the first option has many bugs: if the LLM can't generate "structured output", LangChain won't return an answer. With the JSON approach, the model will always generate something, and if the answer isn't in JSON format, I simply route it to the database.
# just copy from https://langchain-ai.github.io/langgraph/tutorials/rag/langgraph_adaptive_rag_local/#components
ROUTE_SYSTEM_PROMPT = """You are an expert at routing a user question to a vectorstore or web search.
The vectorstore contains documents related to code in project {project}.
Use the vectorstore for questions on these topics. For all else, and especially for current events, use web-search.
Return JSON with single key, datasource, that is 'websearch' or 'vectorstore' depending on the question."""
# and I add a user prompt to be more strict
ROUTE_USER_PROMPT = """Below is the question, read it carefully and tell me what route I should use
{question}"""
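Wired together, the routing fallback I described could look roughly like this. RouteLLM is a hypothetical class; only the prompts above and the standard LangChain ChatPromptTemplate are taken as given.

import json

from langchain_core.prompts import ChatPromptTemplate

class RouteLLM:
    def __init__(self, backend, project):
        prompt = ChatPromptTemplate.from_messages([
            ("system", ROUTE_SYSTEM_PROMPT),
            ("human", ROUTE_USER_PROMPT),
        ])
        self.chain = prompt | backend.model
        self.project = project

    def run(self, question):
        answer = self.chain.invoke({"project": self.project, "question": question})
        try:
            # the prompt asks for JSON with a single "datasource" key
            return json.loads(answer.content)["datasource"]
        except (json.JSONDecodeError, KeyError):
            # if the model fails to produce valid JSON, fall back to the vectorstore
            return "vectorstore"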
In the RAG LLM, to improve the accuracy, I use long context reorder. For the memory LLM, I just copied 100% from this tutorial but adjusted it a little bit (so 99%, and I'm in love with LangGraph right now, fantastic framework). But why do I need RAG and another LLM with memory just to generate code? At the early stage, I wanted a model that could retrieve conditionally (but you know, I couldn't, due to skill issues again), so I had to use RAG: retrieve one time only, put all the context into the memory LLM, and finally use a syntax checker (in C, I just use clang -fsyntax-only) to check the code, keep generating, and just loop the process. Below is the RAG implementation:
from langchain.chains import create_retrieval_chain
from langchain.chains.combine_documents import create_stuff_documents_chain
from langchain.retrievers import ContextualCompressionRetriever
from langchain.retrievers.document_compressors import DocumentCompressorPipeline
from langchain_community.document_transformers import LongContextReorder

class RAG:
    def __init__(self, backend, database):
        # some settings here
        ...

        # create the retriever
        self.retriever = ContextualCompressionRetriever(
            base_compressor=DocumentCompressorPipeline(transformers=[LongContextReorder()]),
            base_retriever=database.as_retriever(
                # some settings in the retriever
                search_type="mmr",
                search_kwargs={"k": 7},
            ),
        )

        # and then create the RAG chain
        # get_qa_prompt() is the system prompt that I use to provide context for generating code
        self.qa_chain = create_stuff_documents_chain(backend.model, get_qa_prompt())
        self.rag = create_retrieval_chain(self.retriever, self.qa_chain)

    # run rag
    def run(self, question):
        return self.rag.invoke({"input": question})
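For reference, get_qa_prompt() could be as simple as the following; the wording is mine, but the {context} and {input} placeholders are what create_retrieval_chain and create_stuff_documents_chain expect:

from langchain_core.prompts import ChatPromptTemplate

def get_qa_prompt():
    return ChatPromptTemplate.from_messages([
        ("system",
         "You are a C programming assistant. "
         "Use the following retrieved source code as context when you generate code:\n\n{context}"),
        ("human", "{input}"),
    ])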
Here's the implementation of the syntax checking system:
import subprocess
from enum import Enum, auto

class ProgramError(Enum):
    # syntax error
    SYNTAX = auto()
    # we can actually extend this to more errors
    # such as compile, linking, ...
    # LINKING = auto()
    COMPILE = auto()
class CheckError(Exception):
    def __init__(self, message, code):
        super().__init__(message)
        self.code = code
class Codechecker:
    ...
def check_syntax(self, path: str):
    # we use clang to check syntax,
    # so we need to write the code from the LLM to a file before checking
    try:
        subprocess.run(["command to check syntax"], check=True, capture_output=True, text=True)
    except subprocess.CalledProcessError as exc:
        # we need the error message to feed back to the LLM
        raise CheckError(exc.stderr, ProgramError.SYNTAX)
# we can extend this to check linking and compilation too, or more, I don't know
def check_compile(self, path: str):
    try:
        subprocess.run(["command to check compile"], check=True, capture_output=True, text=True)
    except subprocess.CalledProcessError as exc:
        # we need the error message to feed back to the LLM
        raise CheckError(exc.stderr, ProgramError.COMPILE)
# now we do a sequential check
def check_program(self, path: str):
    self.check_syntax(path)
    self.check_compile(path)
# now we do the error-checking loop
def run(self, rag, question):
    # generate_code is a function that generates code from the RAG,
    # writes this code to a file,
    # and returns the code and the path of the file
    code, path = self.generate_code(rag, question)

    # in RAG, I implement a function to get the retrieved context,
    # and MemoryLLM will extract this context from the RAG
    llm = MemoryLLM(rag)

    # now run the checking loop
    while True:
        try:
            self.check_program(path)
            # check ok
            return code
        except CheckError as e:
            match e.code:
                case ProgramError.SYNTAX:
                    new_question = "Generate prompt based on syntax error"
                    ...
                case ProgramError.COMPILE:
                    new_question = "Generate prompt based on compile error"
                    ...
            code, path = self.generate_code(llm, new_question)
            # NOTE: you can actually stop earlier,
            # but the above is the main implementation;
            # I just drop all the details
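As the note says, you can stop earlier. A bounded version of the same loop might look like this; max_retries and the exact feedback prompt are my additions, not the real project code:

# a bounded variant of Codechecker.run
def run(self, rag, question, max_retries=3):
    code, path = self.generate_code(rag, question)
    llm = MemoryLLM(rag)

    for _ in range(max_retries):
        try:
            self.check_program(path)
            return code
        except CheckError as e:
            # feed the clang error message back to the LLM
            new_question = f"Fix the following {e.code.name} error in the generated code:\n{e}"
            code, path = self.generate_code(llm, new_question)

    # give up after max_retries attempts and return the last candidate
    return code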
Cool, now we just combine everything to create a big loop, it's really easy:
class Pipeline:
    # I assume you have initialized all the needed LLMs
    ...

    def run(self, question):
        route = self.route_llm.run(question)

        if route == "vectorstore":
            # self.code_checker is an instance of Codechecker
            return self.code_checker.run(self.rag, question)
        elif route == "websearch":
            # self.web_llm is an instance of WebLLM
            return self.web_llm.run(question)
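Using it end to end then looks something like this (the Pipeline constructor arguments are an assumption, since I dropped its initialization above):

# assuming the constructor wires up the route LLM, RAG, code checker and web LLM
pipeline = Pipeline(backend=backend, database=database)
answer = pipeline.run("Write a function that parses the sensor frame defined in frame.h")
print(answer)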
3. The API
Now we move to the API part: how I implemented an API to create a good backend (not really good, but I want to use this buzzword LOL). The first thing I considered is how to manage all the created objects, because users can call the API to create objects whenever they want, so I cannot create empty objects beforehand. In Python, creating an object at runtime is really easy:
# this is the main class of the application,
# but in real projects, I implement it as a singleton class
# to make sure the whole application cannot be changed at runtime
class App:
    def register_attr(self, name, value):
        """
        name: Attribute name for access via self.name
        value: Value to assign to the "name" attribute
        """
        if name == "backend":
            # do something to create the llm (or backend, because I want to call it that way)
            # it can be:
            self._create_attr("backend", OpenAIBackend(value))
            # you can create more than that
            self._create_attr("rag", RAG(self.backend))
            # and more ...
        # put the creation logic you want here as a new branch

    def _create_attr(self, name, value):
        # we can use hasattr to check if an attribute has already been created
        if hasattr(self, name):
            # delete the old attribute
            delattr(self, name)
        setattr(self, name, value)
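Since I mentioned the singleton: one common way to get that behaviour in Python is to override __new__, roughly like this (again just a sketch, not the project code):

class App:
    _instance = None

    def __new__(cls, *args, **kwargs):
        # every App() call returns the same object,
        # so the whole application shares one state at runtime
        if cls._instance is None:
            cls._instance = super().__new__(cls)
        return cls._instance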
Now we go to the API. I use FastAPI and Pydantic, so it is really easy. From the example below, I hope you can create more APIs:
from fastapi import FastAPI, HTTPException

api = FastAPI()  # named api so it doesn't clash with app, the App instance

@api.post("/backend")
def create_backend(model: BackendModel):
    """BackendModel is a Pydantic data class"""
    # app is an instance of App
    try:
        app.register_attr("backend", model)
    except SomeErrorFromApp as e:
        # you can add error handling to register_attr in the App class
        raise HTTPException(status_code=404, detail=str(e))

    return {"message": "Create backend successfully"}
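And as one more made-up example of extending the API, an endpoint that forwards a question to the pipeline could look like this; QuestionModel and the "pipeline" attribute are assumptions:

from pydantic import BaseModel

class QuestionModel(BaseModel):
    question: str

@api.post("/question")
def ask_question(body: QuestionModel):
    # the pipeline is assumed to be registered on `app` the same way as the backend
    if not hasattr(app, "pipeline"):
        raise HTTPException(status_code=404, detail="Pipeline has not been created yet")
    return {"answer": app.pipeline.run(body.question)}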
Ok cool, so we just walked through a brief discussion of my implementation of the API backend. Now we go to the hardest part, the frontend. Because I use a VS Code extension as the frontend, yeah, having no knowledge of JavaScript is killing me.
Thank you so much for reading. Thanks to my team at Bosch for helping me; I really love the Bosch culture 🥹.