AlphaCodium outperforms direct prompting of OpenAI's o1 on coding problems

87 points by benocodes 9 months ago

I find it interesting that the folks at Qodo are not clear on why more groups try to push the needle on SWE-Bench rather than on Codeforces.

I think most software developers would find that performance on SWE-Bench (wherein an AI system has to resolve a real-world Github issue) is much more relevant to their day-to-day than the raw algorithmic problem-solving capabilities of an AI system.

Competitive programming questions are challenging, obviously, but they are tightly defined with prescribed inputs and outputs and the solutions are usually compact, though hard to arrive at.

In contrast, real-world programming is much fuzzier and more ambiguous. The solutions that one must implement are usually much larger in scope and require much greater project-specific context to build. And of course real-world tasks are generally not things that demand lots of formal data structures and algorithms knowledge, like one would need to do well at Codeforces.

What's interesting is that you can have systems like o1 or AlphaCodium that are much, much better than the median software developer at solving tricky algorithmic puzzles, but that also can't do very well at SWE-Bench, which largely comprises Github issues that have been estimated to take <1h for a human developer to do. Even though o1 would absolutely dominate Claude 3.5 Sonnet at competitive programming questions, it seems to basically be a wash on real-world tasks.

So I suppose my question to Qodo would be: Why the emphasis on Codeforces as a benchmark? It's quite clear -- and has been for some time, since e.g. AlphaCode in 2022 -- that AI systems can be really powerful when it comes to solving Codeforces-style questions. What seems much harder (for now) is making significant progress on real-world development tasks.

akrlt 9 months ago

I think LLMs aren't better than the median software developer at LeetCode. They simply have a compressed database of the (stolen) answers. If any software developer had access to Google in an interview, he could "solve" all of the answers instantly.
- rmbyrro 9 months ago
  
  > he could "solve" all of the answers instantly
  Set an experiment selecting 100 random software developers around the world and test this hypothesis. You're up to be surprised.
  Nevertheless, most developers who wouldn't be able to "solve" a LeetCode challenge in a couple of hours even with access to Google, I bet would perform much better than o1 on real-world Github issues in their technical domain.
  This highlights the essence behind OP's question about why focusing on Codeforces. And it shows me that "intelligence" involves a dimension that isn't logical and we don't understand yet.
- henry2023 9 months ago
  
  Exactly what I’ve found. Try giving an LLM any novel easy problem (IMO 2024 Problem 5 is a good example). And absolutely every llm out there fails miserably.
  
  Cyph0n 9 months ago
  
  https://deepmind.google/discover/blog/ai-solves-imo-problems...
  
  fallingsquirrel 9 months ago
  
  If I'm reading this right, these are narrowly-applicable purpose-built systems, not general-purpose LLMs.
  > AlphaProof, a new reinforcement-learning based system for formal math reasoning
  > AlphaGeometry 2, an improved version of our geometry-solving system
  
  TacticalCoder 9 months ago
  
  The problem is if a problem is formulated like this: "There are 7 pirates on a ship and 61 cannon balls, ..." (doesn't matter what the problem is: say the solution involves some dynamic programming algorithm) and that the LLM then finds a problem starting with:
  "There are 7 cats in a playground and 61 plushies, ..." (insert basically the same problem, requiring the same solution)
  Well... Then the LLM shall be able to solve it.
  And many people will consider it's a novel problem and hence a resounding success.
  I mean: it is a success, but not anywhere near as impressive as most think.
  
  magicalhippo 9 months ago
  
  I had the same with the strawberries test. For the ones that supposedly gave the right answer, I tried again with "strawerberry" and they promptly failed miserably.
  Now, given the token encoding I think naive letter counting is not something we should expect from LLMs, but still serves as a nice reminder to actually ensure the test/validation data is not part of the training data.
gus_massa 9 months ago

In competitive programing, it's clear what is a good and a bad splution, and they usualy have a set of test so people can't claim they cheated.
With real world questions, everyone can have a different opinion. Let's say you must clasify photos as dogs, cats or other. What about an apple tree? A hyena? An AI generated version of a dog with cat fur? A lion? Hello Kitty?
- ianbutler 9 months ago
  
  SWE bench has gold code patches and the passing test suite patch for after the github issue was completed. While you may argue over the style of the code produced by the model there is a known good passing state for the model to achieve. For now that's the closest representation of a real world problem solve in a controlled repeatable benchmark we have.

CSMastermind 9 months ago

I find myself using 4o more than o1. I haven't noticed any meaningful improvement from o1 and it has more limitations.

golol 9 months ago

o1 is much much much better at mathematics. I am am a phd student and sometimes give it small problems to play around with, just to get some inspiration. That's the thing: It actually plays around and tries stuff instead of just spitting out a remembered answer like GPT-4o.
jtmarmon 9 months ago

I pretty much only use o1 for more rigorous tasks (coding/math). Eg the other day I gave it the tax rates for my area and used it to explore different tax scenarios
rmbyrro 9 months ago

In my experience, o1 is much better at assisting sotware architectual planning and also bug finding. For other tasks, it's unfortunately too slow for the benefits it provides. 4o and sonnet are net better in terms of speed/value balance.
stuckinhell 9 months ago

I'm finding o1 to be far better at code. I wonder why people are having such differences
- loxias 9 months ago
  
  Interesting. I've tried to use it for a few things, it works where I'd expect -- make a scaffolding for a simple Java project, edit this English -- but whenever I then try to use it for Real Work, things that require some thought and don't have lots of examples in the training set, it fails. It didn't help me as much implementing some ideas from an academic paper about compilers, or on using AF_XDP sockets.
  It's interesting that it has the property of always returning _something_, so you have to be careful how you phrase. And the something returned will be optimized for looking right, but might only be so by accident.
rm_-rf_slash 9 months ago

I’ve found o1 can be more hit-or-miss than the relatively consistent 4o, but o1 tends to be better at solving really hairy problems.
In an apples to apples comparison, 4o can overlook important things while focusing on the kernel of the problem, while o1 is often more comprehensive.
isoprophlex 9 months ago

o1 annoys the hell out of me. it ignores my custom instructions, and after using a highly customized gpt-4 for a while I can't bear the blandness of the default models anymore.
it doesn't swear or yell at me, it's verbose, apologetic, overly correct in its language, gives me endless bulleted lists that convey little information.
unbearable.

submeta 9 months ago

What I miss in ChatGPT something I love in Claude: Projects. In Claude I create a project for any larger task, upload files and text, and work on it for days or weeks. Then I can search for projects and follow up on them. In chatgpt I don’t even know how to search for a chat I had days ago. Also, I cannot structure my input / „uploads“.

Btw: How are people working on multiple code files in Claude?

throwaway77385 9 months ago

Ah yes, working on multiple files was an issue until very recently.
Firstly, at claude.ai you can upload multiple files, so Claude will take those into account and even suggest changes to multiple files. You are then, however, still copy/pasting from a web interface.
Enter cursor (https://www.cursor.com/), you can either use a Claude API key (but it will warn you that all the features they want you to pay for then don't work), or just use the free version, like I am currently. It gets me enough prompts per day to improve my life.
Or you could pay for it, but I have a feeling that this is a sort of WinRAR situation...
- submeta 9 months ago
  
  Wow, looks great. How is this different from VS Code‘s copilot?
  
  mbeex 9 months ago
  
  There are several others (free and certainly on par with cursor). Personally I'm using
  https://aider.chat/docs/usage.html (not a vscode plugin, but with other advantages even when using this editor)
  https://docs.continue.dev/getting-started/overview
  https://github.com/cline/cline
  was pretty good too, but it had taken on too much in its latest version 2.0.0. In other words, it is too unstable at the moment (but probably worth looking at again later).
  
  vipshek 9 months ago
  
  I've completely switched over to Cursor from Copilot. Main benefits:
  1. You can configure which LLMs you want to use, whereas Copilot just supports OpenAI models. I just use Claude 3.5 for everything.
  2. Chatting with the LLM can produce file edits that you can directly apply to your files. Cursor's experimental "Composer" UI lets you prompt to make changes to multiple files, and then you can apply all the changes with one click. This is way more powerful than just tab-complete or a chat interface. For example, I can prompt something like "Factor out the selected code into a new file" and it does everything properly.
  3. Cursor lets you tune what's in LLM context much more precisely. You can @-mention specific files or folders, attach images, etc.
  Note I have no affiliation whatsoever with Cursor, I've just really enjoyed using it. If you're interested, I wrote a blog post about my switch to Cursor here: https://www.vipshek.com/blog/cursor. My specific setup tips are at the bottom of that post.
  
  viraptor 9 months ago
  
  It's not even the same class. Copilot does autocomplete essentially. Cursor does that but with the scope of a file/project. That means after you change one thing it will offer to skip to another part of the file and do the other part of the change there. Refractors often end up being a single change then (tab), (tab), (tab), ... And it's correct a shocking amount of time. The UI of proposed changes is also better.
  I'd recommend just trying it, because it's hard to summarise how much it's not copilot.
  
  PoignardAzur 9 months ago
  
  The "tab to next change" feature is amazing and I would love it if it worked reliably.
  As it is though, I suspect whatever model Cursor Tab is using under the hood has a fairly small context window, so the range of that "tab to move" feature ends up being pretty limited.
  Overall my takeaway after months of using Cursor is that it has some really promising features but their LLM engine is too underpowered to deliver.
  Hopefully that will change over the next few years. The potential is definitely there.
- throwaway314155 9 months ago
  
  Do they impose arbitrary time based restrictions (that wouldn't otherwise exist) when you use a Claude API key instead of paying? I went back to VS Code after something like that (seemed to have) happened.
  Honestly, once you learn the copilot specific hot keys you can do all of what cursor does and more. in fact there were times that i felt the team at VS code clearly could have added features that Cursor has but chose not to because they led to more unwanted code slipping through.
  I did like the edit tab completions from Cursor but not worth 20$/month and guaranteed enshittification
- submeta 9 months ago
  
  Holy Christ, this is unbelievable. Been using Claude, ChatGPT and so many other tools to code, but this type of tight integration is absolutely increadible. Thankyou very much for the tip! :)
jcheng 9 months ago

Have you tried ChatGPT’s Playgrounds feature? I don’t use it but it sounds similar.
- submeta 9 months ago
  
  Will check it. Thanks

butz 9 months ago

Are there any specialized small LLMs for specific language/toolkit combo, e.g. GTK4 + Adwaita + GJS or Python? I'd prefer to run something that is a tad slower due to my PC, but does not break when there's no internet connection.

vivzkestrel 9 months ago

Are we at that point yet where law of diminishing returns starts to take effect?

itamarcode 9 months ago

While Sonnet-3.5 excels in accuracy-token, o1 excels in self-reflection in small isolated tasks. With AlphaCodium, o1 tasks are broken to small isolated tasks, while the flow introduced in AlphaCodium actually guides the steps and overall decision making framework.

We will see more of these frameworks for different use cases

BugsJustFindMe 9 months ago

It seems that 4o would/could run the code it wrote on sample inputs for test+update cycles, but o1 only pretends to and then reports feelgood results that are actually impossible? Is that a known tradeoff that I just missed somewhere in the marketing? Is 4o's appearance of running code also a charade?

richardw 9 months ago

So much of my workflow has moved to Claude. I haven’t quite got into Cursor, and find CoPilot fine for simple autocomplete but I don’t want bigger changes until I’ve checked them. In fact I usually heavily slow down efforts by insisting on design discussions first, and either continue, abandon/restart chat and/or write code myself and come back when the I think the issue is LLM-friendly. But Claude works for me and edits multiple files at once, which I check. I use ChatGPT (o1 or 4o) when I don’t want to impact my Claude quota or when I think a single file change is tricky. It’s personality is now too annoying. So often it doesn’t even look at the code in multiple files I’ve uploaded, it just starts shooting off code and I have to say “look at the code”.

What improvements am I missing? Honest question.

sha16 9 months ago

I think Claude trades quality for speed. From what I've seen it starts generating almost immediately even with a large token window. For smaller changes it is usually good enough, but larger changes are where I bump into issues as well. I'll stick to using change [somefunction] rather than change entire file.
- richardw 9 months ago
  
  I tend to iterate and limit the output by eg saying “make the smallest change possible, don’t rewrite the file, tell me what you want to do first” etc. It seems to respond well to change requests, with much apology. ChatGPT berates me about its own code and keeps saving shit I don’t want in memory, so I have to go back and clean up stuff like "Is testing Lambda functions directly in the AWS web console" and "Is working with testing and integration test fixtures in their software projects" when those are 2 of 100 things I'm doing. I'm using SAM for lambda, I might have run one in the console to bypass the API and now it's saved it as gospel. Half the benefit with LLM's is that they forget context when you start a new chat, so you can control what it focuses on.
dghlsakjg 9 months ago

What are you using to get your code into Claude. Continue, or something similar?
- richardw 9 months ago
  
  No I just drag files in, at least when I'm working with IntelliJ. Sluggish but works. With VSCode I've used Copilot but still break multi-file tasks out to Claude. I’m sure the eg Cursor workflow is better but it hasn’t bitten for me yet.
  I also tried Replit and it felt amazing, but quickly got into a place it couldn't escape and it felt like it took too much effort to get it to change direction once it had committed to a plan. This was early, so it's almost definitely better by now.
- smuser 9 months ago
  
  I really like using aider (https://aider.chat) and heavily leverage the /ask command to have a quick chat on intent and context before prompting it for the code I need.
  
  rmbyrro 9 months ago
  
  Aider feels like magic
algo_trader 9 months ago

is copilot improving behind the scenes ?
I assume you get the latest model, plus some goodies that GH is working on
baudpunk 9 months ago

Cursor is so good.

bearjaws 9 months ago

o1 seems to be better suited at many changes executed at once or larger project planning.

I don't use o1 simply because I work on one small problem at a time, and LLMs tend to go off the rails when giving multi-step tasks. o1 is not a silver bullet for this either.

Recently it doesn't seem to spend much time thinking, and honestly my results from o1 have been disappointing. I've been sticking with 4o and Claude 3.5 sonnet still.

cloudking 9 months ago

Anyone tried their extension? Is it better than Cursor?

cbhl 9 months ago

I found it tricky to set up and use, personally.
My current ranking would be Cursor > Continue > Codium (haven't yet tried Copilot).
Codium seems to specialize in enterprise right now (where someone might be told to not use Cursor).
- rmbyrro 9 months ago
  
  For me, Copilot degraded over time. It started like magic. But they were losing money pretty fast and my impression is they had to dumb it down. I wish I could pay more for the old Copilot.
  Aider is the best of them all, but I spend too much money... Like, I can easily spend $10, $20 in a day. Which is still a great deal for the added productivity it gives me, but it's $200-$400/mo, which is salty.
  Cursor didn't impress me.
- esafak 9 months ago
  
  There's Cody too.
moffkalast 9 months ago

They must be pretty good if they think they can charge $20 for an API wrapper. That's how much it used to cost to get GPT-4 back when everyone else was showing off half coherent models.
- cloudking 9 months ago
  
  It's much more than an API wrapper. They built a bunch of custom smaller models like tab auto completion, applying changes from foundation models, indexing all your code and performing RAG etc. All of this is abstracted away from the user in an easy to use UX.