Ministry of AI

Blog

23 Apr

Unpacking LLMs: How Well Can AI Code Assist Students in Computer Science?

  • By Stephen Smith
  • In Blog

Have you ever wondered how Artificial Intelligence (AI), specifically Large Language Models (LLMs) like GitHub Copilot and ChatGPT, can make coding a piece of cake for students? With the rise of these digital assistants, programming students are leveraging these tools not just for coding help but as companions for tackling difficult assignments. But do they truly deliver the goods, especially when it comes to advanced programming problems? A recent research piece has taken a deep dive into this question, and we have some insightful findings to explore!

The Buzz around AI and Code Generation

In recent years, the world of technology has seen remarkable advancements, particularly in machine learning. This has paved the way for LLMs that are not only able to converse with humans but also to create images and even generate code. But how does this work in the coding department? Simply put, these AI models are trained on vast amounts of data—including heaps of code—enabling them to respond to prompts, like a well-equipped assistant ready to lend a helping hand.

The significance of using LLMs in programming education cannot be overstated. Recent surveys indicate an uptick in students harnessing these tools to assist with their assignments, prompting educators to reconsider how they structure their curricula.

The Study: A Closer Look

The research in question focused specifically on how well these LLMs can handle advanced programming assignments—those typically encountered in second and third-year university courses. The researchers selected nine programming assignments across different languages—Java, Python, and C—to see how effectively LLMs could interpret natural language prompts and generate the required code.

Key Research Questions

To dive into this study, the authors set out to answer a few burning questions:

  1. How effective are LLM tools at solving advanced programming assignments correctly?
  2. Can these tools identify the algorithmic problems they need to tackle?
  3. Does the choice of LLM or programming language influence the quality of code generated?

These questions formed the backbone of their investigation and set the stage for uncovering the abilities and limitations of AI in coding tasks.

Findings: The Good, the Bad, and the Ugly

Now, let’s talk about what the researchers found regarding the performance of these LLMs.

Overall Performance: Advanced vs. Introductory Tasks

The results demonstrated a clear disparity in performance between basic coding assignments and their more advanced counterparts. While the LLMs performed reasonably well on introductory programming tasks, they stumbled significantly when faced with complex challenges.

For instance, in simpler tasks, the models managed to provide correct or nearly correct solutions about six out of twelve times. However, for more complicated problems, only a handful of correct solutions appeared, with GitHub Copilot being the standout performer.

Recognizing the Problem

One noteworthy aspect of the research was the ability of the LLMs to recognize the type of problem described in natural language prompts. More often than not, these models understood what “algorithm” they were meant to apply, even if they didn’t always execute it correctly.

Even when the generated code wasn’t optimal, the output could aid students by providing a framework or a starting point upon which they could build.

Language Matters: Influence of Coding Language on Performance

Interestingly, the programming language used also played a role in the success of the generated solutions. Despite some successes, the models exhibited a noticeable performance drop with C, often resulting in non-functional code. In contrast, both Python and Java yielded more workable solutions.

Breaking Down the Algorithms: A Real-World Problem Example

To illustrate just how AI approaches coding tasks, let’s take a look at one complicated case examined in the study—the bin-packing problem. This is a well-known NP-complete problem where the aim is to pack a set of weighted items into the least number of bins possible without exceeding capacity.

When the researchers provided relevant prompts to GitHub Copilot, it returned a solution that used a greedy heuristic approach, which sometimes leads to optimal results but doesn’t guarantee them. The code was functional enough for some scenarios but failed to address all instances accurately.
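To make the heuristic concrete, here is a minimal first-fit greedy packer in Python, in the spirit of the kind of solution the study describes. The exact code Copilot produced is not reproduced in this post, so treat this as an illustrative sketch rather than the study's output:

```python
def first_fit(weights, capacity):
    """Greedy first-fit bin packing: place each item into the first
    bin that still has room, opening a new bin when none fits."""
    bins = []  # each bin is a list of item weights
    for w in weights:
        for b in bins:
            if sum(b) + w <= capacity:
                b.append(w)
                break
        else:
            bins.append([w])  # no existing bin fits: open a new one
    return bins

# Greedy packing can miss the optimum: these items fit into 3 bins of
# capacity 8 (three bins of [3, 5]), but first-fit uses 4.
print(len(first_fit([3, 3, 3, 5, 5, 5], 8)))  # prints 4
```

This is exactly the trade-off the researchers observed: the heuristic is simple, fast, and often good enough, but it does not guarantee the minimal number of bins that the NP-complete problem formally asks for.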

The researchers ran 1,000 test cases on the output generated by the LLM, revealing that it provided correct answers only about 75% of the time. While not perfect, it delivered a foundation that students could refine and correct, showcasing how AI can serve as both a tutor and a teammate in code development.

Key Performance Metrics

The researchers quantified performance with a metric called accuracy: the proportion of test cases a generated solution passes out of the total number run. This offered a tangible measure of how well the LLMs were doing in their attempts to solve advanced programming challenges.
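As a sketch, this accuracy measure is simply the fraction of test cases where a generated program's output matches the expected answer. The function name and example values below are illustrative, not taken from the paper:

```python
def accuracy(outputs, expected):
    """Fraction of test cases where the generated program's output
    matches the expected answer."""
    correct = sum(1 for got, want in zip(outputs, expected) if got == want)
    return correct / len(expected)

# 3 matching answers out of 4 test cases -> 0.75, the same rate the
# study reported for the bin-packing solution over 1,000 test cases.
print(accuracy([1, 2, 3, 4], [1, 2, 0, 4]))  # prints 0.75
```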

Lessons for Educators: What This Means for the Future

For educators teaching computer science, the findings from this research offer critical insights. Here are some key takeaways:

  1. Adjust Assignments: Knowing that LLMs can readily solve well-known, textbook-style problems from simple prompts can inform how educators design assignments. By incorporating slight variances or constraints into known algorithms, educators can ensure that students engage in deeper thinking and problem-solving rather than relying solely on AI assistance.

  2. Emphasize Understanding: While students may find initial support in LLMs, it’s essential to stress the importance of understanding the underlying concepts and algorithms. Students should learn to use AI as an aid rather than a crutch.

  3. Incorporate Prompt Engineering: The study mentions that a lack of advanced prompt engineering might have influenced results. Encouraging students to refine their prompts when using LLMs could enhance the accuracy and relevance of the outputs, moving towards more complex explorations with AI.

Key Takeaways

  • LLMs are Effective for Introductory Tasks: AI tools like GitHub Copilot perform well in simple coding challenges, acting as reliable helpers for students.

  • Advanced Problems Are Tricky: The success rate declines significantly on more complex assignments, where only a few generated solutions proved correct.

  • Problem Recognition is a Strength: LLM tools can identify the type of algorithm involved, but they often default to heuristics that may not yield optimal results.

  • Practice Makes Perfect: Students should leverage LLM-generated code as a learning aid, correcting and improving it rather than taking it at face value.

  • Language Matters: The programming language can influence the degree of success in code generation, with Python and Java showing better overall results than C.

In conclusion, while LLMs have their limitations—especially for advanced coding tasks—they are stepping stones toward a future where AI and human learners collaborate to tackle design complexities in programming education. As students and educators, make the most of these tools while honing your own problem-solving skills—after all, the craft of coding is as much about the journey as it is about the destination!

If you are looking to improve your prompting skills and haven’t already, check out our free Advanced Prompt Engineering course.

This blog post is based on the research article “Evaluating Code Generation of LLMs in Advanced Computer Science Problems” by Authors: Emir Catir, Robin Claesson, Rodothea Myrsini Tsoupidi. You can find the original article here.

Stephen Smith
Stephen is an AI fanatic, entrepreneur, and educator, with a diverse background spanning recruitment, financial services, data analysis, and holistic digital marketing. His fervent interest in artificial intelligence fuels his ability to transform complex data into actionable insights, positioning him at the forefront of AI-driven innovation. Stephen’s recent journey has been marked by a relentless pursuit of knowledge in the ever-evolving field of AI. This dedication allows him to stay ahead of industry trends and technological advancements, creating a unique blend of analytical acumen and innovative thinking which is embedded within all of his meticulously designed AI courses. He is the creator of The Prompt Index and a highly successful newsletter with a 10,000-strong subscriber base, including staff from major tech firms like Google and Facebook. Stephen’s contributions continue to make a significant impact on the AI community.
