New AI technique makes LLMs write code more like real programmers
A team from Tel Aviv University has just published a paper titled “Execution Guided Line-by-Line Code Generation” on the arXiv platform, introducing a clever new approach to how AI models write code. Instead of generating all the code from start to finish and only testing it at the end, this method, called EG‑CFG, adds continuous checks while the code is being written, like a real-time spellcheck for programming. It works much more like how real programmers do it.
Note: arXiv is an open-access repository for electronic papers. Submissions are moderated but not peer reviewed.
The problem they spotted
LLMs sometimes write code that looks right but doesn’t actually work. Even top models like Claude and GPT‑4 can generate code that’s syntactically correct and convincing, but fails when you run it. The code might throw errors, use the wrong logic, miss edge cases, or produce unexpected side effects.
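To make that concrete, here is a hypothetical illustration (not from the paper) of code that is syntactically valid and looks plausible, but quietly gets an edge case wrong:

```python
# Hypothetical illustration: syntactically valid, plausible-looking, but subtly wrong.
def median(values):
    """Return the median of a list of numbers."""
    ordered = sorted(values)
    return ordered[len(ordered) // 2]  # wrong for even-length lists (should average the middle two)

print(median([3, 1, 2]))     # 2  -- correct
print(median([1, 2, 3, 4]))  # 3  -- should be 2.5
```

Nothing here throws an error; the bug only shows up when the code is actually run against the right inputs.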
LLMs make these mistakes because they generate code based on what seems likely to work, without knowing the outcome. In a way, they are guessing. It gets frustrating when the code doesn’t work, especially for teams relying on automation, because they end up spending more time reviewing and fixing what the model generates.
Programmers, on the other hand, usually spot when something’s wrong while they’re writing the code, before even running it.
The solution proposed
They propose to give the model feedback while it’s writing code. This new technique, called EG‑CFG, was designed to produce higher-quality code. Instead of waiting until the end to check if it works, the model writes small chunks, tests them straight away, and uses the results to guide its next steps. It’s very similar to how a human programmer thinks.
How does a programmer’s brain work?
The brain follows a logical and creative reasoning process. Here’s how it works, step by step:
- It asks: What do I need to build? What is it supposed to do?
- It breaks down the big task into smaller parts.
- It thinks through how to solve each part, either mentally or on paper.
- It recalls the programming language, or libraries that might be useful.
- It imagines how the code will work before writing.
- It writes a block of code.
- It executes the code and decides what to do based on the result.
- It tweaks and improves the code until it works as expected.
How does the EG-CFG method work?
- The model generates a small chunk of code, usually one or two lines.
- These lines are immediately executed against test cases to catch errors early.
- The model uses that feedback (what passed and what failed) to figure out its next steps and keep the code clean and executable.
- Instead of choosing one path, it tries multiple possible next lines in parallel and tests each one. The most promising results move into the next line or block. This is where “parallel coder agents” come in, like an orchestrated team of AI programmers proposing ideas at once.
- The tokens that move forward through decoding are the ones most likely to lead to working code, based on both grammar and test results (a minimal sketch of this loop follows below).
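Here is a minimal sketch of that loop, assuming a hypothetical `propose_candidates` callback that stands in for the LLM; it is illustrative only, not the authors’ implementation:

```python
# Minimal sketch of an execution-guided generation loop (illustrative, not the paper's code).

def run_tests(code: str, tests: list[str]) -> float:
    """Execute the candidate code plus each assert-style test; return the pass rate."""
    passed = 0
    for test in tests:
        try:
            exec(code + "\n" + test, {})  # use a real sandbox in practice
            passed += 1
        except Exception:
            pass
    return passed / len(tests) if tests else 0.0

def generate_with_execution_feedback(propose_candidates, tests, max_steps=50, beam=4):
    """propose_candidates(program, n) is a hypothetical call to the model that
    returns n possible next lines given the code written so far."""
    program = ""
    for _ in range(max_steps):
        # Try several possible next lines in parallel ("parallel coder agents").
        candidates = propose_candidates(program, n=beam)
        # Score each continuation by actually running it against the test cases.
        scored = [(run_tests(program + line + "\n", tests), program + line + "\n")
                  for line in candidates]
        best_score, program = max(scored, key=lambda s: s[0])
        if best_score == 1.0:  # every test passes: stop early
            return program
    return program
```

The paper integrates this feedback directly into the decoding step rather than through an external wrapper, but the shape of the loop is the same: propose, run, score, continue.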
What this paper does is use a grammar-based decoder (a context-free grammar) to keep each part of the program syntactically valid as it’s written. So even though it writes in small parts, it makes sure every piece it tries to run is actually executable (a rough illustration follows the examples below). For example:
- If it’s writing a function, it might stop after the first few lines once it has a `return` statement.
- If it’s mid-way through an `if` block, it waits until the block is closed and only then runs the code.
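One rough way to picture that rule is a completeness check before execution. This sketch uses Python’s own parser rather than the paper’s grammar machinery, purely for illustration:

```python
import ast

def is_runnable_fragment(code: str) -> bool:
    """Return True only when the partial program parses as complete Python,
    i.e. every block that was opened has been closed."""
    try:
        ast.parse(code)
        return True
    except SyntaxError:
        return False

print(is_runnable_fragment("if x > 0:"))              # False: the if block isn't closed yet
print(is_runnable_fragment("if x > 0:\n    y = 1"))   # True: a complete, executable statement
```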
Why it’s a big deal
EG‑CFG makes AI coding feel more like how human programmers think: write a bit of code, test it, fix it, move on. According to the researchers, this method beats previous AI techniques in standard coding tests, reaching state‑of‑the‑art performance across popular benchmarks like MBPP, HumanEval, and CodeContests, sometimes even outperforming models like GPT‑4.
Most importantly, it works well even with smaller, open‑source models. For example, using a 1.3 billion‑parameter model, it scored over 83% on MBPP, close to larger competitors. MBPP (Mostly Basic Python Programming) is a benchmark that consists of around 1,000 Python programming problems, designed to be solvable by entry-level programmers, covering programming fundamentals, standard library functionality, and so on.
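For a sense of what those problems look like, here is a made-up MBPP-style task (not an actual benchmark item): a short prompt plus assert-based tests that the generated code must pass.

```python
# Hypothetical MBPP-style task, for illustration only.
# Prompt: "Write a function to count the vowels in a string."

def count_vowels(text: str) -> int:
    return sum(1 for ch in text.lower() if ch in "aeiou")

# Benchmark-style checks the generated solution is run against:
assert count_vowels("hello") == 2
assert count_vowels("xyz") == 0
assert count_vowels("AEIOU") == 5
```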
How it compares to existing LLMs
How doable is this solution?
While the results are impressive, this method is more compute‑intensive. It needs to explore multiple possibilities and run tests at each step, so it’s slower than straightforward code generation. It also depends heavily on having good test cases; if those aren’t there, it can’t check itself.
Also, this new method changes how LLMs generate code, by making execution part of the writing loop. Most current models, like GPT‑4 and Claude, follow a “generate first, check later” pattern. So this is what GPT or Claude would need to add:
- A sandbox that can run partial code in real time during token generation. GPT‑4 with tools could pull this off, but it’s not how inference normally works today. LLM inference is the phase when the model generates answers based on a prompt, and it’s usually done in one go.
- A context-free grammar to make sure the code is syntactically valid as it’s being written. That’s not standard for LLMs like GPT or Claude, which usually just generate based on probabilities, not grammar rules.
- Built-in execution simulations at each step, which add cost, latency, and compute.
- A wrapper or agent to handle the feedback loop, for example: “Try A, didn’t work, now try B instead” (a rough sketch of such a wrapper follows below).
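As a rough sketch of that wrapper approach, assuming a hypothetical `ask_llm` function that calls whatever model API you use, the “generate first, check later” loop looks something like this:

```python
# Rough sketch of an external "generate, run, retry" wrapper around a standard LLM.
# `ask_llm(prompt)` is a hypothetical call to any chat/completions API.

def generate_with_retries(ask_llm, task: str, tests: list[str], max_attempts: int = 3) -> str:
    feedback = ""
    code = ""
    for _ in range(max_attempts):
        code = ask_llm(f"{task}\n{feedback}\nReturn only Python code.")
        try:
            namespace = {}
            exec(code, namespace)        # run the candidate ("Try A")
            for test in tests:
                exec(test, namespace)    # then run each assert-style test
            return code                  # everything passed
        except Exception as err:
            # Didn't work: tell the model what failed and try again ("now try B instead").
            feedback = f"The previous attempt failed with: {err!r}. Please fix it."
    return code                          # last attempt, even if it still fails
```

A loop like this lives entirely outside the model and only kicks in after a full program has been generated.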
So the key innovation in this paper is integrating execution and grammar checks into the decoding loop itself. That’s what gives it its name: Execution-Guided CFG decoding. Instead of just relying on probability or grammar, it brings actual runtime feedback into the loop, like a human programmer constantly compiling and testing while writing.
Conclusion
Current architectures like transformers and diffusion models are brilliant at pattern recognition, language, and even planning. But they’re static. They can’t adapt as they think, they don’t update memories like we do, they don’t feel emotions, and they don’t truly care about finishing something.
The concept of “parallel coder agents” sounds like a short-term solution; in the long run, the field will need a new paradigm or architecture that allows AI agents to reason instead of running thousands of tests every time a line or block of code is written. Researchers should keep studying how the brain works and aim for something closer to continual learning, adaptive feedback, and self-reflective reasoning. We’re not there yet, but that’s where things are headed.
For now, it looks like LLMs will keep shifting part of the QA burden onto programmers, who write prompts with tests or constraints to catch errors early. My advice: next time you use an LLM to generate code, include the following (an example prompt follows the list):
- Examples of expected input and output.
- Specific conditions the code should meet.
- Edge cases or tricky scenarios.
- Hints about what the code must not do.
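Putting those four points together, a prompt might look like this (a hypothetical example; `flatten` is a made-up function name):

```text
Write a Python function flatten(nested) that flattens a list of lists into a single list.

Expected input and output:
- flatten([[1, 2], [3]]) == [1, 2, 3]
- flatten([]) == []              # edge case: empty input
- flatten([[], [1]]) == [1]      # edge case: empty inner list

Conditions:
- Handle exactly one level of nesting.
- Do not use external libraries.

It must not:
- Mutate the input list.
```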
Happy coding! Or should I say, happy prompting!
Related:
- This AI Model Never Stops Learning
Researchers at MIT have now devised a way for LLMs to keep improving by tweaking their own parameters in response to useful new information.
Sources:
- “Execution Guided Line-by-Line Code Generation” by Boaz Lavon, Shahar Katz, and Lior Wolf.
- “The State of LLM Reasoning and Inference Scaling” by Sebastian Raschka.
- “Understanding Reasoning LLMs” by Sebastian Raschka.