OpenAI's ChatGPT chatbot fixes software bugs about as well as other deep-learning tools, but its key advantage over other methods and AI models is its ability to hold a dialogue with humans, which lets it improve the correctness of an answer.
Researchers from Johannes Gutenberg University Mainz and University College London pitted OpenAI's ChatGPT against "standard automated program repair techniques" and two deep-learning approaches to program repair: CoCoNut, from researchers at the University of Waterloo, Canada; and Codex, OpenAI's GPT-3-based model that underpins GitHub's Copilot pair-programming code-completion service.
"We find that ChatGPT's bug fixing performance is competitive to the common deep learning approaches CoCoNut and Codex and notably better than the results reported for the standard program repair approaches," the researchers write in a new arXiv paper, first spotted by New Scientist.
That ChatGPT can be used to solve coding problems isn't new, but the researchers highlight that its unique capacity for dialogue with humans gives it a potential edge over other approaches and models.
The researchers tested ChatGPT's performance using the QuixBugs bug-fixing benchmark. The automated program repair (APR) systems appear to be at a disadvantage as they were developed prior to 2018.
ChatGPT is based on the transformer architecture, which, as Meta's AI chief Yann LeCun pointed out this week, was developed by Google. Codex, CodeBERT from Microsoft Research, and its predecessor BERT from Google are all based on Google's transformer method.
OpenAI highlights ChatGPT's dialogue capability in examples for debugging code, where it can ask for clarifications and receive hints from a person to arrive at a better answer. OpenAI trained the large language models behind ChatGPT (GPT-3 and GPT-3.5) using Reinforcement Learning from Human Feedback (RLHF).
While ChatGPT's capacity for discussion can help it arrive at a more correct answer, the quality of its suggestions remains unclear, the researchers note. That's why they wanted to evaluate ChatGPT's bug-fixing performance.
The researchers tested ChatGPT against QuixBugs' 40 Python-only problems, and then manually checked whether each suggested solution was correct. They repeated each query four times because there is some randomness in ChatGPT's answers, as a Wharton professor found out after putting the chatbot through an MBA-like exam.
ChatGPT solved 19 of the 40 Python bugs, putting it on par with CoCoNut (19) and Codex (21). Standard APR methods, by contrast, solved only seven of the issues.
With follow-up interactions in the dialogue, the researchers found, ChatGPT's success rate rose to 77.5%, or 31 of the 40 bugs.
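To put those counts on a common scale, they can be converted into success rates over the 40 QuixBugs problems. The short Python sketch below uses only the figures reported in this article; the variable and function names are illustrative, and the count of bugs fixed with follow-up dialogue is derived here from the 77.5% figure rather than quoted from the paper.

```python
# Convert the bug-fix counts reported in the article into success rates.
# All names here are illustrative; only the numbers come from the article.

TOTAL_BUGS = 40  # QuixBugs Python problems used in the comparison

# Bugs solved by each approach, as reported in the article.
solved = {
    "ChatGPT": 19,
    "CoCoNut": 19,
    "Codex": 21,
    "Standard APR": 7,
}


def success_rate(fixed: int, total: int = TOTAL_BUGS) -> float:
    """Return the share of bugs fixed, as a percentage."""
    return 100 * fixed / total


rates = {name: success_rate(count) for name, count in solved.items()}

# Assumption: derive the bug count behind the 77.5% follow-up figure.
dialogue_rate = 77.5
dialogue_fixed = round(dialogue_rate / 100 * TOTAL_BUGS)

for name, rate in rates.items():
    print(f"{name}: {solved[name]}/{TOTAL_BUGS} = {rate:.1f}%")
print(f"ChatGPT with follow-up: {dialogue_fixed}/{TOTAL_BUGS} = {dialogue_rate:.1f}%")
```

On these numbers, a single round of querying leaves ChatGPT just under 50%, while the follow-up dialogue accounts for the remaining gain.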
The implications for developers in terms of effort and productivity are ambiguous, though. Stack Overflow recently banned ChatGPT-generated answers because they were low-quality but plausible-sounding. The Wharton professor found that ChatGPT could be a great companion to MBA students, as it can play a "smart consultant" -- one who produces elegant but oftentimes wrong answers -- and foster critical thinking.
"This shows that human input can be of much help to an automated APR system, with ChatGPT providing the means to do so," the researchers write.
"Despite its great performance, the question arises whether the mental cost required to verify ChatGPT answers outweighs the advantages that ChatGPT brings."