Grading Code Like a Pro: How AI Gets Smarter with Custom Rubrics
Large Language Models (LLMs) like ChatGPT are getting pretty good at writing code—but how good are they at grading it?
That’s the big question behind a fascinating new study by researchers who’ve taken a closer look at how we can use AI to evaluate student code more accurately. Turns out, the secret isn’t just in the model—it’s in the rubric.
In their paper, titled Rubric Is All You Need, the researchers propose a better way to make AI-powered grading more reliable: by giving LLMs specific, customized instructions (aka question-specific rubrics) for each coding assignment.
If you’re an educator, developer, or just someone curious about how AI learns to “grade like a teacher,” this blog post breaks it all down for you.
Why Grading Code With AI Is Hard
Let’s say you’re an instructor for a programming course with hundreds of students. You know how time-consuming grading assignments can be. Traditionally, we’ve used automated test cases to help speed things up—if the code runs and gives the right output, it gets a pass.
But here’s the catch: a program might technically be correct, but how it gets to the answer matters too. Does the student use best coding practices? Do they understand the underlying logic? Is the algorithm efficient?
This is where Large Language Models (LLMs) enter the scene. Trained on massive datasets from the internet, LLMs can understand and generate code—so why not use them to grade it?
Here’s the problem: most current AI systems use general rubrics (think: “check if code is correct” or “verify syntax”). But just like a teacher wouldn’t grade an essay about Shakespeare using a rubric made for a science report, using a one-size-fits-all rubric for code evaluation doesn’t cut it.
So the researchers asked: What if we tailor the grading instructions for each coding question?
Introducing Question-Specific Rubrics: Grading With Context
The team behind this study dove deep into using Question-Specific Rubrics (QS Rubrics) to guide AI grading. Unlike general rubrics (Question-Agnostic or QA Rubrics), QS Rubrics break down each coding task into unique steps and award points based on how well students solve each part.
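To make that concrete, here's a rough sketch of what a question-specific rubric could look like once it's written down in code. The problem, criteria, and point values below are invented for illustration; the actual rubrics in the paper are authored per question by instructors.

```python
# Hypothetical question-specific rubric for one coding task.
# The criteria and point values are illustrative, not taken from the paper.
qs_rubric = {
    "question": "Reverse a singly linked list in place.",
    "max_score": 10,
    "criteria": [
        {"id": 1, "points": 3, "description": "Walks the list with prev/current pointers"},
        {"id": 2, "points": 4, "description": "Re-links each node without losing the rest of the list"},
        {"id": 3, "points": 2, "description": "Handles the empty-list and single-node edge cases"},
        {"id": 4, "points": 1, "description": "Runs in O(n) time with O(1) extra space"},
    ],
}
```

Compare that to a question-agnostic rubric ("is the code correct? is the style okay?"): the QS version tells the grader exactly what partial credit looks like for this specific problem.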
They built two new datasets to test this idea:
- Object-Oriented Programming (OOP) Dataset – Based on real student submissions from a college Java exam.
- Data Structures and Algorithms (DSA) Dataset – Pulled from GeeksforGeeks, a popular programming practice website.
Each dataset includes the problem description, student submissions, solutions, grading rubrics, and human-written feedback. Basically, it’s a goldmine for testing AI graders.
Grading Techniques That Mimic Human Teachers
To see how question-specific rubrics compare to traditional methods, the researchers created three AI grading techniques using GPT-based models:
1. Complete Rubric Evaluation (CRE)
- Think of CRE like a wise professor.
- It reviews the entire rubric and student code in one go.
- It’s prompted to focus on deep understanding—what the student was trying to do, not just how they did it.
- It even ignores syntax errors unless they really matter.
2. Pointwise Rubric Evaluation (PRE)
- PRE is more like a strict TA.
- It grades each point in the rubric one at a time.
- This method is super detailed but can be a bit too harsh—awarding zero for anything that doesn’t meet the point exactly.
3. Ensemble Method Evaluation (EME)
- EME calls in multiple graders (LLMs, actually) and lets them vote on the final grade.
- It balances leniency and accuracy, and even chooses the most helpful feedback from among all the AI messages.
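If you want to tinker with the ensemble idea yourself, here's a minimal sketch of EME-style aggregation. It assumes you already have a `grade_once()` function that sends the full rubric and the student's code to an LLM and returns a numeric score plus feedback; the voting rule shown (most common score, falling back to the median) and the feedback selection are simplifications for illustration, not necessarily the paper's exact procedure.

```python
from collections import Counter
from statistics import median

def ensemble_grade(grade_once, rubric, student_code, n_runs=5):
    """Run an LLM grader several times and aggregate the results.

    `grade_once(rubric, student_code)` is assumed to return (score, feedback);
    how you implement it (model, prompt, temperature) is up to you.
    """
    results = [grade_once(rubric, student_code) for _ in range(n_runs)]
    scores = [score for score, _ in results]

    # Majority vote on the score; fall back to the median if no score repeats.
    counts = Counter(scores)
    top_score, top_count = counts.most_common(1)[0]
    final_score = top_score if top_count > 1 else median(scores)

    # Return the feedback from the run whose score sits closest to the final one
    # (a simple stand-in for picking the "most helpful" feedback).
    _, feedback = min(results, key=lambda r: abs(r[0] - final_score))
    return final_score, feedback
```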
They also created a cool new measure called Leniency, which checks how forgiving (or strict) the AI grader is compared to a human.
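The paper gives Leniency a precise definition; as a rough intuition, think of it as the average signed gap between the AI's score and the human's. Here's a toy version of that idea (my simplification, not the paper's formula): positive values mean the AI is more generous than the human grader, negative values mean it's stricter.

```python
def leniency(ai_scores, human_scores, max_score):
    """Toy leniency measure: mean signed difference between AI and human marks,
    normalized by the maximum score. Positive = AI more forgiving, negative =
    AI stricter. An illustrative simplification, not the paper's exact metric."""
    diffs = [(a - h) / max_score for a, h in zip(ai_scores, human_scores)]
    return sum(diffs) / len(diffs)

# Example: on average the AI awards half a point more (out of 10) than the human.
print(leniency([8, 7, 9, 6], [7, 6, 9, 6], max_score=10))  # ≈ 0.05
```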
So… Do Question-Specific Rubrics Really Help?
The short answer? Absolutely.
When compared to common approaches like:
- Test-case-only autograders
- Code similarity checkers like CodeBERTScore
- Generic LLM scoring like Five Point Marking
…the QS-rubric-based graders consistently performed better across all metrics:
✅ Better alignment with human scores
✅ More helpful and targeted feedback
✅ Balanced strictness (not too forgiving, not too picky)
Especially with the more complex DSA problems, where logic really matters, the question-specific approach made a major difference.
Real-World Impact: Why You Should Care
This isn’t just academic navel-gazing—this research has real-world power.
🚀 For Educators
AI graders can save tons of time. By feeding them the same rubric you’d use in class, you can automate grading at scale while still giving meaningful feedback. That means less burnout and more time mentoring students.
💻 For Students
You get fairer assessments based on what you were supposed to do—not just whether your code runs. Plus, tailored feedback helps you learn and improve your skills faster.
🤖 For AI Developers and Researchers
Want to fine-tune GPT for grading or feedback generation? Designing better prompts and including context-rich rubrics can significantly improve output accuracy.
Smart Tip for Prompt Engineers & AI Tweakers
Here’s a gem from the study: LLMs are way more reliable when they’re given the full rubric at once, not one piece at a time.
Why?
Because when LLMs are given just one point to grade, they lean toward being overly strict—like giving a zero for a small misunderstanding. But with the full picture, they understand the student’s intent and grade in a more human-like manner.
So if you’re building prompts for AI grading tools (or even reviewing submissions yourself), include the full rubric as context. It makes a huge difference.
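As a starting point, here's a rough CRE-style prompt builder that packs the whole rubric into a single request. The instruction wording is my own paraphrase of the idea, not the paper's actual prompt, and `rubric` is expected to look like the hypothetical `qs_rubric` structure sketched earlier.

```python
def build_grading_prompt(rubric, student_code):
    """Assemble one prompt containing the full rubric plus the student's code,
    so the model grades with complete context (CRE-style) rather than one
    criterion at a time. The instruction wording is illustrative."""
    criteria = "\n".join(
        f"{c['id']}. ({c['points']} pts) {c['description']}" for c in rubric["criteria"]
    )
    return (
        f"You are grading a student's solution to this problem:\n"
        f"{rubric['question']}\n\n"
        f"Rubric (max {rubric['max_score']} points):\n{criteria}\n\n"
        f"Student code:\n{student_code}\n\n"
        "Focus on what the student was trying to do. Ignore minor syntax issues "
        "unless they change the logic. Return a score for each criterion, a total, "
        "and one short piece of feedback."
    )
```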
Limitations and What’s Next
No study is perfect. Here are a few things to keep in mind:
- They mainly used GPT-4o. Future work could explore other models (and whether open-source LLMs can keep up).
- All the assignments were in Java. Other languages—like Python or C++—still need testing.
- The data focused on single-file assignments from intermediate courses. Advanced, multi-file projects are still uncharted territory.
- Rubric complexity matters, too. Future research might test whether super detailed rubrics are always better—or if there’s a sweet spot.
Still, this study sets a new bar for AI-driven code assessment.
Key Takeaways
- Generic rubrics aren’t enough. Just like humans, AIs do better with specific instructions.
- Question-Specific Rubrics give AI graders much-needed context, improving grading quality and feedback relevance.
- CRE and EME work best for balanced, accurate evaluation. PRE is stricter—good for cases needing higher rigor.
- Rubric presentation matters a lot. Feed the entire rubric for more human-like grading.
- AI-based grading isn’t just about saving time—it can actually improve learning outcomes when done right.
In the battle to make AI grading useful, rubrics are your best weapon. If you’re designing educational tools or simply dreaming of a day when grading code isn’t a chore, this research offers a powerful blueprint.
Want to dive into the datasets or implement your own AI grader? The paper’s authors have made some of their work open-source; check it out on GitHub.
Let’s stop fearing the robot TA—turns out, it grades a lot like us… when it’s told exactly what to look for.
🎓 Have you used AI to grade student work or debug code? Curious to try prompt engineering with rubrics? Share your experience in the comments.
If you are looking to improve your prompting skills and haven’t already, check out our free Advanced Prompt Engineering course.
This blog post is based on the research article “Rubric Is All You Need: Enhancing LLM-based Code Evaluation With Question-Specific Rubrics” by Aditya Pathak, Rachit Gandhi, Vaibhav Uttam, Devansh, Yashwanth Nakka, Aaryan Raj Jindal, Pratyush Ghosh, Arnav Ramamoorthy, Shreyash Verma, Aditya Mittal, Aashna Ased, Chirag Khatri, Jagat Sesh Challa, and Dhruv Kumar. You can find the original article here.