I'm one of the creators of OpenHands (fka OpenDevin). I agree with most of what's been said here, wrt to software agents in general.
We are not even close to the point where AI can "replace" a software engineer. Their code still needs to be reviewed and tested, at least as much as you'd scrutinize the code of a brand new engineer just out of boot camp. I've talked to companies who went all in on AI engineers, only to realize two months later that their codebase was rotting because no one was reviewing the changes.
But once you develop some intuition for how to use them, software agents can be a _massive_ boost to productivity. ~20% of the commits to the OpenHands codebase are now authored or co-authored by OpenHands itself. I especially love asking it to do simple, tedious things like fixing merge conflicts or failing linters. It's great at getting an existing PR over the line.
It's also important to keep in mind that these agents are literally improving on a _weekly_ basis. A few weeks ago we were at the top of the SWE-bench leaderboard; now there are half a dozen agents that have pulled ahead of us. And we're one launch away from leapfrogging back to the top. Exciting times!
> code still needs to be reviewed and tested, at least as much as you'd scrutinize the code of a brand new engineer just out of boot camp
> ..._massive_ boost to productivity. ~20% of the commits to the OpenHands codebase are now authored or co-authored by OpenHands itself.
I'm having trouble reconciling these statements. Where does the productivity boost come from since that reviewing burden seems much greater than you'd have if you knew commits were coming from a competent human?
There's often a lot of small fixes that not time efficient to do, but a solution is not much code and is quick to verify.
If the cost is small to setting a coding agent (e.g. aider) on a task, seeing if it reaches a quick solution, and just aborting if it spins out, you can solve a subset of these types of issues very quickly, instead of leaving them in issue tracking to grow stale. That lets you up the polish on your work.
That's still quite a different story to having it do the core, most important part of your work. That feels a little further away. One of the challenges is the scout rule, the refactoring alongside change that makes the codebase nicer. I feel like today it's easier to get a correct change that slightly degrades codebase quality, than one that maintains it.
Thanks - this all makes sense - I still don't feel like this would constitute a massive productivity boost in most cases, since it's not fixing time consuming major issues. But I can see how it's nice to have.
The bigger win comes not from saving keystrokes, but from saving you from a context switch.
Merge conflicts are probably the biggest one for me. I put up a PR and move onto a new task. Someone approves, but now there are conflicts. I could switch off my task, spend 5-10 min remembering the intent of this PR and fixing the issues. Or I could just say "@openhands fix the merge conflicts" and move back to my new task.
The issue is that you still need to review the fixed PR (or someone else does) which means you just deferred the context switch, you didn't eliminate it. And if the fix is in a new commit, that's possible (whereas if it rebases you have to remember your old SHA).
I haven't started doing this with agents, but with autocomplete models I know exactly what OP is talking about: you stop trying to use models for things that models are bad at. A lot of people complain that Copilot is more harm than good, but after a couple of months of using it I figured out when to bother and when not to bother and it's been a huge help since then.
I imagine the same thing applies to agents. You can waste a lot of time by giving them tasks that are beyond them and then having to review complicated work that is more likely to be wrong than right. But once you develop an intuition for what they can and cannot do you can act appropriately.
Is that true? I'd like to think my commits are less burdensome to review than a fresh out of boot camp junior dev especially if all that's being done is fixing linter issues. Perhaps there's a small benefit, but doesn't seem like a major productivity boost.
Agreed! The comparison is great for estimating the scope of the tasks they're capable of--they do very well with bite-sized tasks that can be individually verified. But their world knowledge is that of a principal engineer!
I think this is why people struggle so much with agents--they see the agent perform magic, then assume it can be trusted with a larger task, where it completely falls down.
My biggest issue is just how often these agents make subtle, hard to spot mistakes.
It can seem great for certain tasks at first. Yesterday I had to add license headers to the top of a lot of source code files. The reason why I let the AI try is because the repository contained lots of different programming languages.
It was able to do this but I then realized that it just removed the last sentence of the text it was supposed to add.
We've seen exponential improvements in LLM's coding abilities. Went from almost useless to somewhat useful in like two years.
Claude 3.5 is not bad really. I wanted to do a side project that has been on my mind for a few years, and Claude coded it in like 30 seconds.
So to say "we are not even close" seems strange. If LLMs continue to improve, they will be comparable to mid level developers in 2-3 years, senior developers in 4-5 years.
> So to say "we are not even close" seems strange. If LLMs continue to improve, they will be comparable to mid level developers in 2-3 years, senior developers in 4-5 years.
These sorts of things can’t be extrapolated. It could be 6-months, it could be a local maxima / dead end that’ll take another breakthrough in 10 years like transformers were. See self-driving cars.
I think the most you could say is we’ve had improvements - from gpt 4 to whatever the current model is has definitely not been exponential improvements.
My experience is acctually they’ve become dramatically less helpful over the past two years (past year in particular). Claude seems not to have backslid much but it’s progression also has not been very fast at all (I’ve noticed no difference since the 3.5 launch despite several updates).
Everything grows sinusoidally and I feel we’re well past the tipping point into diminishing rate of improvement
What does the cost look like for running OpenHands yourself? From your docs, it looks like you recommend Sonnet @ $3 / million tokens. But I could imagine this can add up quickly if you are sending large portions of the repository at a time as context.
I think one of the big problems with Devin (and AI agents in general) is that they're only ever as good as they are. Sometimes their intelligence feels magical and they accomplish things within minutes that even mid level or senior software engineers would take a few hours to do. Other times, they make simple mistakes and no matter how much help you give, they run around in circles.
A big quality that I value in junior engineers is coachability. If an AI agent can't be coached (and it doesn't look like it right now), then there's no way I'll ever enjoy using one.
My first job I spent so much time reading Python docs, and the ancient art of Stack Overflow spelunking. But I could intuitively explain a solution in seconds because of my CS background. I used to encounter a certain kind of programmer often, who did not understand algorithms well but had many years of experience with a language like Ruby, and thus was faster in completing tasks because they didn't need to do the reference work that I had to do. Now I think these kinds of programmers will slowly disappear and only the ones with the fast CS intuition will remain.
I disagree. If anything, CS degrees have proven time and time again they aren't translatable into software development (which is why there's an entire degree field called Software Engineering emerging).
If anything, my gut says that the CS concepts are very easy for LLMs to recall and will be the first things replaced (if ever) by AI. Software engineer{ing,s} (project construction, integrations, scaling, organizational/external factors, etc) will stick around for along time.
There's also the meme in the industry that self-taught, non-CS degree engineers are potentially of the most capable group. Though this is anecdotal.
> If anything, CS degrees have proven time and time again they aren't translatable into software development (which is why there's an entire degree field called Software Engineering emerging).
Emerging? I graduated in 2006 with a BEng in Software Engineering.
The difference between it and the BSc CompSci degree I started in, was that optional modules became mandatory — including an industrial placement year (paid internship).
> Software engineer{ing,s} (project construction, integrations, scaling, organizational/external factors, etc) will stick around for along time.
My gut disagrees, because LLMs are at about the same level in those things as they are in low level coding: not yet replacing humans in project level tasks any more than they do in coding tasks, but also being OK assistants for both coding and project domains. I have no reason to think either has less or more opportunity for self-training, so I expect progress to track for the foreseeable future.
(That said, the foreseeable future in this case is 1-2 years).
> the CS concepts are very easy for LLMs to recall
They're easy to recall, but you have to know what to recall in the first place. Or even know enough of the territory to realise there's something to recall. Without enough background, you'll get a whole set of amazing tools that you have no idea what to do with.
For example, you may be able to write a long description of your problem with some ideas how to steer the AI to give you possible solutions. And the AI may figure out what the problem is and that the hyperloglog is something that could be useful to you. And you may have the awesome programming skills to implement that. But that's a lot of maybes. It would be much faster/easier if you knew about hyperloglog ahead of time and just asked for the implementation or library recommendation.
Or even if you don't know about the actual solution, you'd have enough of CS vocabulary to ask: "how do I get a fast, approximate distinct count from a multiset". It would take a long imprecise description to get the same thing for a coder with no theory background.
To this point, I use AI programming assistants pretty heavily and find very frequently that they will write extremely inefficient or oddly baroque implementations of what I’m asking for in their first pass, that appear as if they don’t have the “knowledge” or ability to do it better, but then they can be prodded to re-do it very easily. Frequently I look at some generated code and write back the most cursory feedback like “looks o(n^2) can you make more efficient” or “use pointers instead of nested loops” or “how about using X approach” and it will often produce something dramatically better than the initial effort. For now at least I think these tools are still most powerful in the hands of experts. (I am a self-taught programmer but have a fair bit of experience)
I'm not convinced an LLM is really "recalling" any CS concepts when they try to solve a problem. IMHO, we're lucky if it matches the pattern of the request against the pattern of a solution and the two are actually related. I'm no expert but I don't think there's any reason to think that an LLM is taking a CS concept and applying it to something novel in order to get a solution. If they were, I believe their success rate would be much higher.
In many places where someone might reach for something they remember from their CS coursework, there's often an open-source library or tool doing much the same thing. Understanding how these libraries and tools function is certainly valuable but, much of the time, people can get by with only a vague hunch; indeed, this is why they exist! IMHO, I would be happier with the LLM assistant if it picked reliable library code rather than writing up a sketchy solution of its own.
I'm also familiar with this idea people who have managed to be successful in the field without a CS degree are more capable. In my opinion, this is hogwash. I think if we take a step back, we'll see that people graduating from established, top-tier CS programs are looking for higher pay than those who have come from a less expensive and (very often) business focused program. To be fair, people from each of these backgrounds has their strengths; in many organizations a developer who has done two semesters of accounting is a real benefit, in others the ability to turn a research paper into the prototype of a new product is going to be important.
Years of experience often washes out much of these differences. People who have started from business oriented education programs may end up taking a handful of CS courses as they navigate their career, likewise many people with a CS background end up accruing business centered skills.
In my opinion, people start out their education at a place that they can afford, a place that is familiar to them, often a place that they feel comfortable. Someone's economic background (and of their family) plays a big role on what kind of educational program they choose when they are on the cusp of adulthood. Smart and talented people can always learn what they need to learn, regardless of their starting point.
I think honestly the meme that non-CS degree engineers are most capable is selection bias.
If they had taken a CS degree they would likely be just as, of not more capable.
To self-learn the topics you need to make good software takes an immense amount of effort and although the data and material is out there, takes a lot of work to figure out.
I'm only recently starting to pick up on "magic" patterns that are actually extremely simple to understand given the right base knowledge... I can gain tons of insights from talks givem in the early 2010s but if I watched them without the correct practical experience and foundational knowledge it is the same as the title to a HN post this week[1], gibberish.
With the correct time playing with the foundational patterns and learning some of the backing knowledge it unlocks amazing patterns in my mind and makes the magic seem simple. A great example, CSP[2]. I've known about and used the actor model before, which I first discovered when I found Erlang, but now with CSP I could ask the question "Why should actors be heavy?", you can put an actor into a light-weight task and spawn tons of them and build a tree of connections. Stuff like oneTBB flow graph[3]now makes sense and looks like a beautiful pattern with some really interesting ideas that can be implemented in more general computing than the high performance computing it was designed for. It seems niche but golang is built on those foundations, and the true power of concurrency in golang comes from embracing that. It fundamentally changes the way I want to structure and layout code and I feel like a good CS course can get you there quicker...
Unfortunately a good CS course probably wouldn't accelerate the average CS grads understanding of that but can get someone dedicated and hungry there much much quicker. Someone fresh out of a JS bootcamp is maybe a decade away from that if they ever even want to search for that knowledge.
I completely agree with you. More precisely, I feel they are useful when you have specific tasks with limited scope.
For instance, just yesterday I was battling with a complex SQL query and I got halfway there. I gave our bot the query and an half assed description of what I wanted/what was missing and it got it right on the first try.
And when working with people it's fairly easy to intervene and improve when needed. I think the current working model with LLMs is definitely suboptimal when we cannot confine their solution space AND where they should apply a solution precisely, and timely.
It’s also often possible to know what a human will be bad at before they start. This allows you to delegate tasks better or vary the level of pre-work you do before getting started. This is pretty unpredictable with LLMs still.
As someone who uses AI coding tools daily and has done a fair amount of experimentation with different approaches (though not Devin), I feel like this tracks pretty well. The problem is that Devin and other "agentic" approaches take on more than they can handle. The best AI coders are positioned as tools for developers, rather than replacements for them.
Github Copilot is "a better tab complete". Sure, it's a neat demo that it can produce a fast inverse square root, but the real utility is that it completes repetitive code. It's like having a dynamic snippet library always available that I never have to configure.
Aider is the next step up the abstraction ladder. It can edit in more locations than just the current cursor position, so it can perform some more high-level edit operations. And although it also uses a smarter model than Copilot, it still isn't very "smart" at the end of the day, and will hallucinate functions and make pointless changes when you give it a problem to solve.
When I tried Copilot the "better tab complete" felt quite annoying, in that the constantly changing suggested completion kept dragging my focus away from what I was writing. That clearly doesn't happen for you. Was that something you got used to over time, or did that just not happen for you? There were elements of it I found useful, but I just couldn't get over the flickering of my attention from what I was doing to the suggested completions.
Edit: I also really want something that takes the existing codebase in the form of a VSCode project / GitHub repo and uses that as a basis for suggestions. Does Copilot do that now?
I tried to get used to the tab completion tools a few times but always found it distracting like you describe. often I’d have a complete thought, start writing the code, get a suggested completion, start reading it, realize it was wrong, but then I’d have lost my initial thought, or at least have to pause and bring myself back to it.
I have, however, fully adopted chat-to-patch style workflows like Aider, I find it much less intrusive and distracting than the tab completions, since I can give it my entire thought rather than some code to try to complete.
I do think there’s promise in more autonomous tools, but they still very much fall into the compounding-error traps that agents often do at the present.
It's just vscode. I greatly prefer vim but the difference between vim + ai tools and cursor is just a no brainer in terms of productivity. Cursor isn't without problems but it's leagues ahead of the competition in my opinion.
pricing model, downtime, model support, pricing model, trying to take over the experience rather than assist within my experience. This last one is big, because Cursor wants to "reimagine" how developers work. The problem is the AIs are so far from being competent, they need to be kept on the sidelines and sub'd in occasionally, not be the quarterback. Oh, did I mention pricing model?
Personally, I just prefer the chat interface directly with no Cursor UI.
For me, the best way is to write my prompt in a txt file, away from anything to do with LLMs. The bottleneck is not the update of the files like Cursor is good at.
The bottleneck is the clarity of my thoughts.
I looked at your website.
How to get past Barry Schwartz ideas is the main problem that we face in 2025.
The Godel, Escher, Bach stuff to me is just nonsense. As a huge Bach fan boy it is from when Bach was massively overrated in cultural importance.
Hierarchy Theory? How about O-information?
Doesn't seem the O-information wiki entry exists, yet.
Because of the complaints? If so, yeah I get it. I'm there amongst them. It's kind of like Tesla FSD. There are often setbacks in releases and they definitely need to work on their communication with the community. That said, for the current price it's still worth any misgivings.
I would try cursor. It’s pretty good at copy pasting the relevant parts of the codebase in and out of the chat window. I have the tab autocomplete disabled.
i’ve been very impressed with the gemini autocomplete suggestions in google colab, and it doesn’t feel more/less distracting than any IDEs built in tab suggestions
I think a lot of people who are enabling copilot in vs code (like I did a few days ago), are experiencing "suggested autocomplete as I type" for the first time where before there was no grey text below what I am writing personally.
It is a huge distraction, especially if it changes as I write more. I turned it off almost immediately.
I deeply regret turning on copilot in vscode. It (M$) immediately weaseled into so many places and settings. I'm still trying to scaled it back. Super annoying and distracting. I'd prefer a much more opt in for each feature than what they did.
> The best AI coders are positioned as tools for developers, rather than replacements for them.
I agree with this. However, we must not delude ourselves and understand that corporate is pushing for replacement. So there will be a big push to improve on tools like Devin. This is not a conspiracy theory, in many companies (my wife's, for example) they are openly stating this: we are going to reduce (aka "lay off") the engineering staff and use as much AI solutions as possible.
I wonder how many of us, here, understand that many jobs are going away if/when this works out for the companies. And the usual coping mechanism, "it will only be for low hanging fruit", "it will never happen to me because my $SKILL is not replaceable", will eventually not save you. Sure, if you are a unique expert on a unique field, but many of us don't have that luxury. Not everyone can be a top of the cream specialist. And it'll be used to drive down salaries, too.
I remember when I was first getting started in the industry the big fear of the time was that offshoring was going to take all of our jobs and drive down the salaries of those that remained. In fact the opposite happened: it was in the next 10 years that salaries ballooned and tech had a hiring bubble.
Companies always want to reduce staff and bad companies always try to do so before the solution has really proven itself. That's what we're seeing now. But having deep experience with these tools over many years, I'm very confident that this will backfire on companies in the medium term and create even more work for human developers who will need to come in and clean up what was left behind.
(Incidentally, this also happened with offshoring— many companies ended up with large convoluted code bases that they didn't understand and that almost did what they wanted but were wrong in important ways. These companies needed local engineers to untangle the mess and get things back on track.)
But having deep experience with these tools over many years, I'm very confident...
No one has had deep experience with these tools for any amount of time, let alone many years. They're literally just now hitting the market and are rapidly expanding their capabilities. We're at a fundamentally different place than we were just twelve months ago, and there's no reason to think 2025 will be any different.
I was building things with GPT-2 in 2019. I have as much experience engineering with them as anyone who wasn't an AI researcher before then.
And no, we're not at a fundamentally different place than we were just 12 months ago. The last 12 months had much slower growth than the 12 months before that, which had slower growth than the 12 months before that. And in the end these tools have the same weaknesses that I saw in GPT-2, just to a lesser degree.
The only aspect in which we are in a fundamentally different place is that the hype has gone through the roof. The tools themselves are better, but not fundamentally different.
It’s genuinely difficult to take seriously a claim that coding using Sonnet has “the same weaknesses” as GPT-2, which was effectively useless for the task. It’s like suggesting that a flamethrower has the same weaknesses as a matchstick because they both can be put out by water.
We’ll have to agree to disagree about whether the last 12 months has had as much innovation as the preceding 12 months. We started 2024 with no models better than GPT-4, and we ended the year with multiple open source models that beat GPT-4 and can run on your laptop, not to mention a bunch of models that trounce it. Plus tons of other innovations, dramatically cheaper training and inference costs, reasoning models, expanded multi-modal capabilities, etc, etc.
I'm paying for o1-pro (just for one month) and have been using LLMs since GPT-2 (via AI Dungeon). Progress is absolutely flattering when you're looking at practical applications versus benchmarks.
o1 is actually surprisingly "meh" and I just don't see how they can justify the price when sonnet 3.5 latest is almost as good, 10x as fast and doesn't even have "reasoning".
I'm spending half my day every day for the past few years using LLMs in one way or another. They still confidently (and unpredictability) hallucinate, even o1. They have no memory, can't build up experience, performance rapidly degrades with long conversations, etc.
I'm not saying progress isn't being made, but the rate of progress is definitely slowing.
Unlike with offshoring, this is a technological solution, which understandably is received more enthusiastically on HN. I get it. It's interesting as tech! And it's achieved remarkable things. But unlike with offshoring (which is a people thing) or magical NOCODE/CASE/etc "solutions", it seems the consensus is that AI coding assistants will eventually get there. At least a portion of even HN seems to think so. And some are cheering!
The coping mechanism seems to be "it won't happen to me" or "my knowledge is too specialized" but I think this will become increasingly false. And even if your knoweldge is too specialized to be replaced by AI, most engineers aren't like that. "Well, become more specialized" is unrealistic advice, and in any case, the employment pool will shrink.
PS: I am offhsoring (in a way). I'm not based in the US but I work remotely for a US company.
> But unlike with offshoring (which is a people thing) or magical NOCODE/CASE/etc "solutions", it seems the consensus is that AI coding assistants will eventually get there.
There's no consensus to that point. There are a few loud hype artists, most of whom are employed in AI and have so have conflicts of interest and also are pre-filtered to the true believers. Their logic is basically "See this trend? Trends continue, so this is inevitable!"
That's bad logic. Trends do not always continue, they often slow or reverse, and this one is showing all signs of doing so already. OpenAI has come straight out and said that they don't expect to see another jump like GPT-3 to 4, and have resorted to throwing more tokens at the problems, which works with diminishing returns. I do not expect to see a return to the rapid growth we had for a year or two there.
> PS: I am offhsoring (in a way). I'm not based in the US but I work remotely for a US company.
Yes, and this is a good example: there's a place for offshoring, but it didn't replace US devs. The same thing will happen here.
Trends do not always continue, they often slow or reverse, and this one is showing all signs of doing so already. OpenAI has come straight out and said that they don't expect to see another jump like GPT-3 to 4, and have resorted to throwing more tokens at the problems, which works with diminishing returns. I do not expect to see a return to the rapid growth we had for a year or two there.
This feels like the declaration of someone who has spent almost no time playing with these models or keeping up with AI over the last two years. Go look at the benchmarks and leaderboards for the last 18 months and tell me we're not progressing far beyond GPT4. Meanwhile models are also getting faster, cheaper, getting multi-modal capabilities, cheaper to train for a given capability, etc.
And of course there are diminishing returns, the latest public models are in the 90s on many of their benchmarks!
> I wonder how many of us, here, understand that many jobs are going away if/when this works out for the companies. And the usual coping mechanism, "it will only be for low hanging fruit", "it will never happen to me because my $SKILL is not replaceable", will eventually not save you. Sure, if you are a unique expert on a unique field, but many of us don't have that luxury. And it'll be used to drive down salaries, too.
Yeah it's maddening.
The cope is bizarre too: "writing code is the least important part of the job"
Ok then why does nearly every company make people write code for interviews or do take home programming projects?
Why do people list programming languages on their resumes if it's "least important"?
Also bizarre to see people cheering on their replacements as they use all this stuff.
> Ok then why does nearly every company make people write code for interviews or do take home programming projects?
For the same reason they put leetcode problems to "test" an applicants skill. Or have them write mergesort on a chalkboard by hand. It gives them a warm fuzzy feeling in the tummy because now they can say "we did something to check they are competent". Why, you ask? Well it's mostly impossible to come up with a test to verify a competency you don't have yourself. Imagine you can't distinguish red and green, are not aware of it, but want to hire people who can. That's their situation, but they cannot admit it - because it would be clear evidence that they are no good fit for their current role. Use this information responsibly ;)
> Why do people list programming languages on their resumes if it's "least important"?
You put the programming languages in there alongside the HR-soothing stuff because you hope that an actual software person gets to see your resume and gives you an extra vote for being a good match. Notice that most guides recommend a relatively small amount of technical content vs. lots of "using my awesomeness i managed to blafoo the dingleberries in a more efficient manner to earn the company a higher bottom line"
If you don't want to be a software developer that's fine. But your questions point me towards the conclusion that you don't know a lot of things about software development in the first place which doesn't speak for your ability to estimate how easy it will be to automate it using LLMs.
Arguing about programming is not the point, in my opinion.
When AI becomes able to do most non-programming tasks too, say design or solving open-ended problems (yeah, except in trivial cases it cannot -- for now) we can have this conversation again...
I think saying "well, programming is not important, what matters is $THING" is a coping mechanism. Eventually AI will do $THING acceptably enough for the bean counters to push for more layoffs.
When AI can do the software engineering tasks that require expertise outside of coding like system design, scoping problems, cross-team/domain work, etc then it will be AGI, at which point the fact that SWE jobs are automated would be the least of everyones worries.
The main problem I perceive with AI being able to do that kind of work is that it requires an unprecedented level of agency and context-gathering. Right now agents are very much like juniors in that they work in an insular, not collaborative, way.
Another big problem is that these higher level problems often require piecing together a lot of fragmented context. If the AI already had access to the information, sure, it would probably be able to achieve the task. But the hard bit is finding the information. Some logs here, some code there, a conversation with someone on a different team, etc. It's often a highly intuitive and tacit process, not easily explicitly defined. There's a reason that defining what a "Senior" is tends to be very difficult.
> When AI can do the software engineering tasks that require expertise outside of coding like system design, scoping problems, cross-team/domain work, etc then it will be AGI
I think you're talking about the really general case, but in my opinion that's not as important. All that matters is that AI solutions manage (in the near future) to cover the average case -- where most engineers actually work -- in a mediocre but cost effective manner, for this to have huge repercussions on the job market and salaries.
> But the hard bit is finding the information. Some logs here, some code there, a conversation with someone on a different team, etc.
I've no problem believing they will become more and more successful at this. This is information retrieval which can be done faster by machines, and making sense of it all together is where advances in AI will need to happen. I think there's a high chance they'll happen eventually, at least in a way that's enough to cobble together projects that will make the leadership happy (maybe after some review/adjustment by a few human experts they retain?). They do not even have to be particularly successful -- how many human-populated engineering projects succeed, anyway?
Also, because the economy is no longer based on competition, but is controlled by a bunch of industry specific oligopolies, even if the bean counters are wrong it won’t matter, because every other company will be similarly inefficient. Everybody loses, but the people in charge are too dumb to know. Our free market is currently broken.
Is spending 4 years of your life on education that will likely only be 10-20% applicable to your job any less bizarre? It's just another hoop employers want to see you capable of jumping.
If you ignore the syntax programming is just writing detailed instructions. Just because AI is able to translate English to code doesn't mean the 100s of decisions that need to be made go away. Someone still needs to write very detailed instructions even if they are in English and it sure isn't going to be the people sitting in meetings all day.
And let's pretend that I can now be 10x more productive with AI. Great, now I can ship 10x more features in the same timeframe and nothing changes - the development backlog is literally infinite. There are always more features or bugs to work on.
> Just because AI is able to translate English to code doesn't mean the 100s of decisions that need to be made go away. Someone still needs to write very detailed instructions even if they are in English and it sure isn't going to be the people sitting in meetings all day.
What makes you think it will be you? The machines seem increasingly capable of converting English into different English, and if we take it as a given that they can convert English into code.. what are you there for? The people sitting in meetings might as well talk to the machine, to the extent they're willing to talk to you.
To be clear, the professional "meeting participants" are as much on the chopping block as we are, although that's not commonly pointed out.
One thing that surprised me a little is that there doesn't seem to be an "ask for help" escape hatch in it - it would work away for literally days on a task where any human would admit they were stuck?
One of the more important features of agents is supposedly that they can stop and ask for human input when necessary. It seems it does do this for "hard stops" - like when it needed a human to setup API keys in their cloud console - but for "soft stops" it wouldn't.
By contrast, a human dev would probably throw in the towel after a couple of hours and ask a senior dev for guidance. The chat interface definitely supports that with this system but apparently the agent will churn away in a sort of "infinite thinking loop". (This matches my limited experience with other agentic systems too.)
LLMs can create infinite worlds in the error message it’s receiving. It probably needs some outside signal to stop and re-assess. I don’t think LLMs have any ability to reason if they’re lost in their own world on their own. They’ll just keep creating new less and less coherent context for themselves
If you correct an LLM based agent coder, you are always right. Often, if you give it advice, it pretends like it understands you, then goes on to do something different from what it said it was going to do. Likewise, it will outright lie to you telling you it did things it didn't do. (In my experience)
For sure - but if I'm paying for a tool like Devin then I'd expect the infrastructure around it to do things like stop it if it looks like that has happened.
What you often see with agentic systems is that there's an agent whose role is to "orchestrate", and that's the kind of thing the orchestrator would do: every 10 minutes or so, check the output and elapsed time and decide if the "developer" agent needs a reality check.
> One thing that surprised me a little is that there doesn't seem to be an "ask for help" escape hatch in it - it would work away for literally days on a task where any human would admit they were stuck?
You are over-estimating the sophistication of their platform and infrastructure. Everyone was talking about Cursor (or maybe was it astroturfing?) but once I checked it out, it was not far from avante on neovim.
They are pushing in this direction with the Composer Agent mode which can carry out a sequence of multi-file changes without you having to specify the files. It's pretty decent. If you're feeling brave there is also a beta "YOLO" mode that will auto approve these changes and run console commands.
Devin does ask for help when it can't do something. I think I have it asked me how to use a testing suite it had trouble running.
The problem is it really really hate asking for help if it had a skill issue, it would prefer running in circles than admitting it just can't do something.
Maybe. It's pretty weird and I'm still thinking about it.
You can't throw junior engineers working on an issue under the bus when they clearly can't do that. Or at least it takes some effort. In return you may coach them and hope they eventually improves.
Devin does look like junior engineers, but I've learned to just click "Terminate Session" immediately after I spotted that it was doing something hopeless. I've managed to get some real work done out of it, without much effort on my side (just check what it's doing every 10~15 minutes and type a few lines or restart session).
If they had built that from the beginning people would have said "every other tasks it asks me for help, how is it a developer then if I have to assist it all the time?"
But now since you are okay with that, I think it's the right time to add that feature.
There should be an energy coefficient to problems. You only get a set amount of energy to solve per issue. When the energy runs out. A human must help.
I'm sure a lot of folks in these comments predicted these sorts of results with surprising accuracy.
Stuff like this is why I scoff when I hear about CEOs freezing engineering hiring or saying they just don't need mid-level engineers anymore because they have AI.
I'll start believing that when I see it happening, and see actual engineers saying that AI can replace a human.
I am long AI, but I think the winning formula is small, repetitive tasks with a little too much variation to make it worth it (or possible) to automate procedurally. Pulling data from Notion into Google sheets, like these folks did initially, is probably fine. Having it manage your infrastructure and app deployments, likely not.
This feels a bit like AI image generation in 2022. The fact that it works at all is pretty mindblowing, and sometimes it produces something really good, but most of the time there are obvious mistakes, errors, etc. Of course, it only took a couple more years to get photorealistic image outputs.
A lot of commenters here seem very quick to write off Devin / similar ideas permanently. But I'd guess in a few years the progress will be remarkable.
One stubborn problem – when I prompt Midjourney, what I get back is often very high-quality, but somehow different than what I expected. In other words, I wouldn't have been able to describe what I wanted, but once I see the output I know it's not quite right. I suspect tools like this will run into similar issues. Maybe there will be features that can help users 'iterate' quickly.
> Of course, it only took a couple more years to get photorealistic image outputs.
"Photorealistic" is a pretty subjective judgement, whereas "does this code produce the correct outputs" is an objective judgement. A blurry background character with three arms might not impact one's view of a "photorealistic" image, but a minor utility function returning the wrong thing will break a whole program.
If were comparing Devin to image generation, then Devin would be a version of Midjourney where you have no prompting skills, you only get one image and if you want something different you can only use the remix feature to make changes, oh and with each change the image resolution goes up and you get more jpeg artifacts.
Those “how I feel about Devin after using it” comments at the bottom are damning, when you compare them to the user testimonials of people using cursor.
Seems to me that agents just aren’t the answer people want them to be, just a hype wave obscuring real progress in other areas (eg. MCST) because they’re easy to implement.
…but really, if things are easy to implement, at this point, you have to ask why they haven’t been done yet.
Probably, it seems, because it’s harder to implement in a way that’s useful than it superficially appears…
Ie. If the smart folk working on Devin can only do something of this level, anyone working on agentic systems should be worried, because it’s unlikely you can do better, without better underlying models.
I recently used cursor and it has felt very capable in implementing tasks across files. I get that cursor is an IDE but it's ai functionality feels very agentic.. where do you draw the line?
I had to look up MCST: it means Model-Centric Software Tools, as opposed to autonomous agents.
Devin is closer to a long-running process that you can interact with as it is processing tasks, whereas Cursor is closer to a function call: once you've made the call, the only think you can do is wait for the result.
Agents are really new and would solve plenty of annoying things.
When I code with Claude, I have to copy paste files around.
But everything we do in AI is new and outdated a few weeks ago.
Claude is really good but blocks you in 1-3h for a bit due to context length.
That type of issues will be solved.
And local coding models are super fast on a 4090 already. Imagine a small project digits on your desktop were you allow these models also more thinking. But the thinking style models again are super new.
Things probably are not done yet because we humans are the bottleneck right now. Getting enough chips, energy, standards, training time, doing experiments with tech a while tech b starts to emerge from another corner of ai.
5090 just was announced and depending on benchmarks it might be 1.x-3 times faster. if it's faster above 1.5 that would again be huge.
Disclosure: Working on a company in the space and have recently been compared to Devin in at least one public talk.
Devin has tried to do too much. There is value in producing a solid code artifact that can be handed off for review to other developers in limited capacities like P2s and minor bugs which pile up in business backlogs.
Focusing on specific elements of the development loop such as fix bugs, add small feature, run tests, produce pull request is enough.
Businesses like Factory AI or my own are taking that approach and we're seeing real interest in our products.
Not to take away from your opinion, but I guess time will tell? As models get better, it's possible that wide tools like Devin will work better and swallow tools that do one thing. I think companies much rather have a AI solution that works like what they already know (developers), than one that works in the IDE, another that watches to Github issues, another that reviews PRs, and one that hangs on Slack and makes small fixes.
> Businesses like Factory AI or my own are taking that approach and we're seeing real interest in our products.
Interest isn't what tools like Devin are lacking, (un)fortunately.
To be clear, I do share a lot of scepticism regarding all the businesses working around AI code generation. However, that isn't because I think they'll never be able to figure it out, but because I think they are all likely to figure it out at the end, at the same time, when better models come out. And none of them will have a real advantage over the other.
I've recently had several enterprise level conversations with different companies and what we're being asked for is specifically the simpler approach. I think that is the level of risk they're willing to tolerate and it will still ameliorate a real issue for them.
The key here is my product is no worse positioned to do more things if and when the time comes, but building a solid foundation and trust, and not having the quiet part be (which I heard as early as several months ago) that your product doesn't work means we'll hopefully still have the customer base to roll that out to.
I've talked to Devin's CEO once at Swyx's conference last June, they're very thoughtful and very kind so this must be very rough but between when they showed their demo then and what I'm hearing now the product has not evolved in a way where they are providing value commensurate with their marketing or hype.
I'm a fan of Guillermo Rauch's (Vercel CEO) take on these things. You earn the right to take on bigger challenges and no one in this space has earned the right yet including us.
Devin's investment was fueled by hyperspeculation early on when no one knew what the shape of the game was. In many ways we still don't, but if you burn your reputation before we get there you may not be able to capitalize on it.
To be completely fair to them, taking the long view and the bank account to go with it they may still be entirely fine.
> You earn the right to take on bigger challenges and no one in this space has earned the right yet including us.
Not entirely. We're in interesting times where products with better models can suddenly leapfrog and displace even current upstarts. Cursor won over Copilot from leveraging Claude Sonnet 3.5. They didn't "earn the right".
Improvements with models will help those with the existing infrastructure that can benefit from it. I'm not saying Devin will win when that time comes, but a similar product might find their space quickly.
You can get a much higher hit rate with more constrained agents, but unfortunately if it's too constrained it just doesn't excite people as much.
Ex. the Grit agent (my company) is designed to handle larger maintenance tasks. It has a much higher success rate, with <5% rejected tasks and 96% merged PRs (including some pretty huge repos).
It's also way less exciting. People want the flashy tool that can solve "everything."
Also trialed Devin, it's quite impressive when it understands the code formatting and local test setup, producing well formatted and test case passing code, but it seems to always add extraneous changes beyond the task that can break other things. And it can't seem to undo those changes if you ask. So everything requires more cleanup. Devin opened my eyes to the power of agentic workflows with closed loop feedback, and the coolness of a slack interface, but I am gonna recommend cancelling it because it's not actually saving time and it's quite expensive.
I’ve used Cursor a lot and the conclusion doesn’t surprise me. I feel like I’m the one *forcing* the system in a certain direction and sometimes an LLM gives a small snippet of useful code. Sometimes it goes in the wrong direction and I have to abort the suggestion and force it into another direction. For me, the main benefit is having a typing assistant which can save me from typing one line here and there. Especially refactorings is where Cursor shines. Things like moving argument order around or adding/removing a parameter at function callsites is great. Saved me a ton of typing and time already. I’m way more comfortable just quickly doing a refactoring when I see one.
Weird. I have such a different experience with Cursor.
Most changes occur with a quick back and forth about top level choices in chat.
Followed with me grabbing appropriate interfaces and files for context so Sonnet doesn't hallucinate API, and then code that I'll glance over and around half the time suggest one or more further changes.
It's been successful enough I'm currently thinking of how to adjust best practices to make things even smoother for that workflow, like better aggregating package interfaces into a single file for context, as well as some notes around encouraging more verbose commenting in a file I can provide as context as well on each generation.
Human-centric best practices aren't always the best fit, and it's finally good enough to start rethinking those for myself.
This! I've been using Cursor regularly since late 2023. It's all about building up effective resources to tactfully inject into prompts as needed. I'll even give it sample API responses in addition to API docs. Sometimes I'll have it first distill API docs down into a more tangible implementation guide and then save that as a file in the codebase.
I think I'm just a naturally verbose person by default, and I'm starting to think that has been very helpful in me getting a lot out of my use of LLMs and various LLM tools over the past 2+ years.
I treat them like the improv actors they are and always do the up front work to create (with their assistance) the necessary broader context and grounding necessary for them to do their "improv" as accurately as possible.
I honestly don't use them with the immediate assumption I'll save time (although that happens almost all the time), I use them because they help me tame my thoughts and focus my efforts. And that in and of itself saves me time.
This is what’s needed to get the most out of these tools. You understand deeply how the tool works and so you’re able to optimize its inputs in order to get good results.
This puts you in the top echelon of developers using AI assisted coding. Most developers don’t have this deep of an understanding and so they don’t get results as good as yours.
So there’s a big question here for AI tool vendors. Is AI assisted coding a power tool for experts, or is it a tool for the “Everyman” developer that’s easy to use?
Usage data shows that the most adopted AI coding tool is still ChatGPT, followed by Copilot (even if you’d think it’s Cursor from reading HN :-))
I'll add few things at which Cursor with Claude is better than us (at least in time/effort):
- explaining code. Enter some legacy part of your code nobody understands, LLMs aren't limited to keeping few things in memory like us. Even if the code is very obfuscated and poorly written it can understand what it does and the purpose and suggest refactors to make it understandable
- explaining and fixing bugs. Just the other day Antirez posted a bug of him debugging a Redis segfault on some C code providing context and stack trace. This might be a hit or miss at times, but more often than not it saves you hours
- writing tests. It often comes up with many more examples and edge cases than I thought of. If it doesn't, you can always ask it to.
In any case I want to stress that LLMs are only as good as your data and prompts. They lack the nuance of understanding lots of context, yet I see people talking to them like humans that understand the business, best practices and others.
That first one has always felt super crazy to me, I've figured out what lots of "arcane magic, don't touch" type of functions genuinely do since LLMs have become a thing.
Even if it's slightly wrong it's usually at least in the right ballpark so it gives you a very good starting point to work from. Almost everything is explainable now.
"These next tests require cooperation. Consequently, they have never been solved by a human. That's where you come in. You don't know pride, you don't know fear, you don't know anything. You'll be perfect."
It takes someone with no ego, no preconceptions, and infinite patience to delve in and come back alive.
Agreed, AI has been a godsend for trying to understand snippets of perl code in our codebase that were basically unreadable before unless you were an expert.
I think the .cursorrules and .cursorignore files might be useful here.
Especially the .cursorrules file, as you can include a brief overview of the project and ground rules for suggestions, which are applied to your chat sessions and Cmd/Ctrl K inline edits.
For moving argument order and removing parameter is already doable by a mature IDE and they are more predictable than AI (Jetbrains IDEs support them well. In VSCode it may need extensions.)
But adding parameter is not well supported by IDE as it requires knowing which value to pass. This is where Cursor can shine.
So for anyone who doubted SWE-BENCH's relevance's to typical tasks, it seems that its stated 13.86% almost exactly matches this 3 successes out of 20 pilot outcome.
We're not quite there yet, but all of these issues seem to be solvable with current tech by applying additional training and/or "agentic" direction. I would now expect pretty much the textbook disruptive innovation process over the next decade or so, until the typical human dev role is pushed to something more akin to the responsibilities of current day architects and product managers. QA engineering though will likely see a big resurgence.
>> but all of these issues seem to be solvable with current tech by applying additional training and/or "agentic" direction.
Can you explain why you think this. From what I gather from other comments it seems like if we continue on current trajectory at best you'd still need a dev who understands the projects context to work in tandem w/ the agent so the code doesn't devolve into slop.
As I see it, this is pretty much a given across all codebases, with a natural tendency of all projects to become balls of mud if the developer(s) don't actively strive to address technical debt and continuously refactor to address the new needs. But having said that, my experience is that for a given task in an unfamiliar codebase, an AI agent is already better at maintaining consistency than a typical junior developer, or even a mid-level developer who recently joined the team. And when explicitly given the task of refactoring the codebase while keeping the tests passing, the AI agents are already very capable.
The biggest issue, which is what you may be alluding to, is that AI agents are currently very bad at recognizing the limits of their capabilities and continue trying an approach when a human dev would have long since given up and went to their lead to ask for help or for the task specification to be redefined. That's definitely an issue, but I don't see any fundamental technological limitation here, but rather something addressable via an engineering effort.
In general, I've seen so many benchmarks fall to AI in the recent decade (including SWE-BENCH), that now I'm quite confident that if a task being performed by humans can be defined with clear numerical goals, then it's achievable by AI.
And another way I'm looking at it is that for any specific knowledge work competency, it seems to already be much easier and time effective to train an AI to do well on it than to create a curriculum for humans to learn it and then to have every single human to go through it.
This only reinforces my bias against AI agents. At this point, they are mostly just hype. I believe that for AI to replace a junior, we would need to achieve at least near-AGI, and we are far from that.
If by hype you mean that there isn't extreme real world value right here and right now, then I very much disagree.
Closing in on 20 years since I left school and for me AI is absolutely useful, right here and right now. It is really a bicycle for the mind:
It allows me to get much faster to where I want. (And like bicycles you will get a few crashes early on and possibly later as well, depending on how fast you move and how careful you are.)
I might be in some sweet spot where I am both old enough to know what is going on without using an AI but also young enough to pick up the use of AI relatively effortlessly.
If however by hype you mean that people still have overhyped expactations about the near future, then yes, I agree more and more.
I feel AI can also do simple monotonous coding tasks, but I don't think programming is something it's currently very good at. Samples, yes, trivial programs, sure, but anything non-trivial and it's rarely useful.
Where it really shines today is getting humans up to speed with new technologies, things that are well understood in general but maybe not well understood by you.
Want to say build a window manager in X11, despite never having worked with X11 before? Sure, Claude will point you in the right direction and give you a simple template to work with in 30 seconds. Enormous time saver compared to figuring out how to do that from scratch.
Never touched node in your life but want to build a simple electron app? Sure, here's how you get started. Few hours and several follow up questions later, you're comfortable and productive in the environment.
Getting off the ground with new technologies is so much easier with AI it's kind of ridiculous. The revolutionary part of AI coding is how it makes it much easier to be a true generalist, capable of working in any environment with any technology, whatever is appropriate.
Exactly. LLMs are gullible. They will believe anything you tell them, including incorrect things they have told themselves. This amplifies errors greatly, because they don't have the capacity to step back and try a different approach, or introspect why they failed. They need actual guidance from somebody with much common sense; if let loose in the world, they mostly just spin around in circles because they don't have this executive intelligence.
A regular single-pass LLM indeed cannot step back, but newer ones like o1/o3/Marco-o1/QwQ can, and a larger agentic system composed of multiple LLMs definitely can. There is no "fundamental" limitation here. And once we start training these larger systems from the ground up via full reinforcement learning (rather than composing existing models), the sky's the limit. I'd be very bullish about Deepmind, once they fully enter this race.
> And once we start training these larger systems from the ground up via full reinforcement learning (rather than composing existing models),
Agree with this totally.
I wouldn't call what the CoT models are doing exactly being able to step back - their "stepping back" still dumps tokens into the output, so it is still burdened with seeing all of these failed attempts as it searches for the right one. But my intuition on this can be wrong, and it's a much more advanced reasoning process than what "last-gen" (non-CoT) does, so I can see your point.
For an agentic system composed of multiple LLMs, I would strongly disagree if the LLMs are last-gen. In my experience, it is very hard to prompt a non-CoT LLM into rejecting an upstream assumption without making it paranoid and rejecting valid assumptions as well. This makes it hard to effectively create a robust agentic system that can self-correct.
I think that's different if the agents are o1-level, but I think it's hard to appreciate just how costly and slow doing this would be. Agents consume tokens like candy with all the back-and-forth, so a surprising number of tasks become economically infeasible.
(It seems everyone is waiting for an inference perf breakthrough that may or may not come.)
Yea this is what I was wondering as well. I have o1 not o1 pro but I am gathering from reddit/youtube o1 pro if used correctly is superior for coding tasks.
Sounds exactly like my experience with the “agents” about a year ago. Autogpt or whatever it was called. Works great 1% of the time and the rest it gets stuck in the wrong places completely unable to back out.
I’m now using o1 or Claude Sonnet 3.5 and usually one of them gets it right.
The current frontier models are all neocortex. They have no midbrain or crocodile brain to reconcile any physical, legal or moral feedback. The current state of the art is to preprocess all LLM responses with a physical/legal/moral classifier and respond with a generic "I'm sorry Dave, I'm afraid I can't do that."
We are fooled into thinking these golems have a shred of humanity, but their method of processing information is completely backward. Humans begin with a fight/flight classifier, then a social consensus regression, and only after this do we start generating tokens ... and we do this every moment of every day of our lives, uncountably often, the only prerequisite being the calories in an occasional slice of bread and butter.
The whole idea of Devin is pointless and doomed to fail in my humble opinion, big tech will be quite capable on delivering A.I agents / assistants - very soon. I don't think wrappers over other people's LLMs like Devin make a lot of sense.
Can someone help me understand what's the value proposition / moat of this company?
I recommend you look at tools like Aider or Codebuff... sure they need to call some LLM at some point (could be your own, could be external), but the key thing that they are doing complex modifications of source code using things like treesitter -> i.e. you don't rely directly on the LLM modifying code, but the LLM using trees to modify the code. See in Aider's sourcecode: https://github.com/Aider-AI/aider/tree/main/aider/queries
Simple copy-pasting of "here's my prompt, give me code" was always doomed from the start to be perfect every time, and DEFINITELY won't work for an agent. We need to start thinking about how to use these LLMs in smarter ways (like the above mentioned tools)
Can Aider sit inside VS Code, understand what files I have open, and use them as context? Their docs lead me to say no, that they are an inline chat/completion experience
"Even more telling was that we couldn’t discern any pattern to predict which tasks would work."
I think this cuts to the core of the problem for having a human in the loop. If we cannot learn how to best use the tool from repeated use and discern some kind of patterns of best and worst practices then it isn't really a tool.
The assumption with low-code tooling was that AI is so good at writing actual code in a way that it will make low-code tools redundant. Spending time with Windsurf, Cursor, and a bunch of VSCode extensions, while it was so impressive to see new projects being created autonomously, asking for new requirements or fixing bugs after >10 iterations was more complex.
I had to audit the code and give specific directions on how to restructure the code to avoid getting stuck when the project gets more complex. That makes me think autonomous agents will do much better on low-code tools, as their restrictions ensure the agent is on track. The problem with low-code tools is that they also get more complicated to scale after maybe like >200 iterations. (for a medium-sized project, on average 6 months)
The thing with AI agents I tend to find is they reveal how much heavy lifting the dev is actually doing.
A personal example, my best use out of AI so far has been cases where documentation was poor to nonexistent, and Claude was able to give me a solution. But the thing is, it wasn't a working solution, nowhere close, but it was enough for me to extrapolate and do my own research based on the structure, classes and functions it used. Basically, it gave me somewhere to start from. Whether that's worth the social, economic and environmental problems is another story.
I'm working on AI assistant in Python notebook. It aims to help with data science tasks. I'm not using it to do a full analysis. It will fail. What I ask is to create a code snippet for my next step in the analysis. Many times I need to manually change the code, but it is fine because LLM speed-up my coding a lot. And it is really fantastic in writing matplotlib code for visualization. I don't remember all matplotlib syntax to change axis labels, add annotations or change style, and LLM really can handle it good, in impressive speed.
I remain sceptical about the "Planet Tracker"-task. The task was to debunk claims about historical positions of Jupiter and Saturn.
If the task was to find those planets were NOT in a certain (claimed) position an erroneous program would still appear to "debunk" the claims. Did they check if Devin's code's calculated positions were actually correct? Did they check in some NASA-database?
If Devin gave arbitrary positions for the planets it's much more likely that they're different than any claim and appear to debunk it.
I was able to read the code it wrote, and check that (as hoped) it was using a good existing library to do the heavy lifting. And I had it make plots that I could visually use to check that the values were 'reasonable'. The value in that case was simply that I didn't have to leave the couch and write the code myself (although if the result was actually needed for anything more important than a smug 'i thought so' confirmation I would still have taken over and validated it kore carefully).
At some point people are going to realize that using these LLM AIs is a communications problem, and by that I mean the reason various attempts to use them fail is because they are not being effectively told what to do, vague and implied requests are not enough for a inhuman statistical construct to grasp what you're asking without clearer more details and more specific instructions.
One possible reason is that I'm using popular tech stacks (Next.js, HTML/JS for demo website and SDK). No niche frameworks or tools like nbdev (I've never heard of that).
Also I've been prompting ChatGPT and Claude for over a year, that might help with communicating with Devin.
Your statement is factually wrong, Claude 3.5v2 asks clarifying questions when needed "natively", and you can add similar instructions in your prompt for any model.
The default system prompts are tuned for the naive case. LLMs being all purpose text handling tools, can be reprogrammed for any behavior you wish. This is the crux of skilled use of LLMs.
The better the LLMs get, the worse the average prompt quality.
Do you have good references about using AI coding assistants?
Techniques of prompt engineering help a lot, but I really think there will be created a body of knowledge about how to use, what's the good contexts of use, and good heuristics. They are a valuable tool, but I feel it's possible to extract more value.
Most the problems you mentioned will likely be solved with the next iterations of Devin or similar product.
I can say that because I work daily with Claude as an agent over mcp, and the problems you mentioned feel very familiar.
Based on the type of the issues you mentioned, Devin isn't likely using o1 yet. A workflow like o1 for planning, Claude for Coding, o1 for review, etc., would work better.
The problems you mentioned: ssh-key issue unrelated to script, code not following existing patterns or themes, instructions not being followed, extra abstractions, etc., fall into that category.
Some of the issues are likely due to context length problem. For example, LLM doesn't work well with jupyter notebook because of extra junk in ipynb, which will likely remain a problem.
No matter what happens with Devin specifically, I think this is a really important topic and I enjoy reading updates on this kind of review every time.
Honestly, i have been bitten so many times by LLM hallucinations when I work in parallel with the LLM, I wouldn't trust it autonomously running anything at all. If you have tried to use imaginary APIs, imaginary configuration and imaginary cli arguments, you know what I mean
> If you have tried to use imaginary APIs, imaginary configuration and imaginary cli arguments, you know what I mean
I see this comment a lot but I can't help but feel it's 4 weeks out of date. The version of o1 released on 2024-12-17 so rarely hallucinates when asked code questions of basic to medium difficulty and provided with good context and a well written prompt, in my experience. If the context window is sub-10k tokens, I have very high confidence that the output will be correct. GPT-4o and o1-mini, on the other hand, hallucinates a lot and I have learned to put low trust in the output.
How are you using LLMs? With o1 I've switched to spelling out in lots of details what I want, then asking it it to one shot the full file, so with this approach the wait time has been acceptable.
I'm using it to orient me when tackling something new. For instance, the other day i was making a web driver client in shell and i asked it something apong the lines of "is there an http endpoint to the webdriver to get the class name of an element?"
These are the sort of questions i mostly do. "What is the best practice to read output from a device file in C", "Is there a cli tool to find dead typescript interface fields?"
I have been feeling LLM burnout and favoring code it all my self after a year of LLM assistance. When it gets things wrong it is too annoying. Like, I would get mad and start to curse it, shouting loud and in the chat.
Exactly this. At first started verbally abusing it untill it conformed, but i quickly realised that after the context gets very long it simply discards former instructions and abusing. So i get frustrated, toxic AND don't get my job done
I saw few people around testing and it is quite disappointing. Sometimes a task might take forever and deliver a bad result or fail completely.
It seems it is targeting few specific problems and whatever else is just too hard.
I also think that, thought it is expensive, it is cheap for the technology behind it and it won't be able to keep that price for long
Now is the time for us to hold seemingly contradictory propositions: A child born today will live to see 99% of all computer code written by artificial intelligence, but the current AI boom is massively overcapitalized.
I'd argue that software is being written (either by humans or AI) in an order that it progressively adds less marginal value (if we define value in the capitalistic sense).
Most of the value that software will ever create has already been created.
The only truly valuable missing things are stuff whose value is not easy to translate to capitalists, or need some visionary work.
I'm one of the creators of OpenHands (fka OpenDevin). I agree with most of what's been said here, wrt to software agents in general.
We are not even close to the point where AI can "replace" a software engineer. Their code still needs to be reviewed and tested, at least as much as you'd scrutinize the code of a brand new engineer just out of boot camp. I've talked to companies who went all in on AI engineers, only to realize two months later that their codebase was rotting because no one was reviewing the changes.
But once you develop some intuition for how to use them, software agents can be a _massive_ boost to productivity. ~20% of the commits to the OpenHands codebase are now authored or co-authored by OpenHands itself. I especially love asking it to do simple, tedious things like fixing merge conflicts or failing linters. It's great at getting an existing PR over the line.
It's also important to keep in mind that these agents are literally improving on a _weekly_ basis. A few weeks ago we were at the top of the SWE-bench leaderboard; now there are half a dozen agents that have pulled ahead of us. And we're one launch away from leapfrogging back to the top. Exciting times!
https://github.com/All-Hands-AI/OpenHands
> code still needs to be reviewed and tested, at least as much as you'd scrutinize the code of a brand new engineer just out of boot camp
> ..._massive_ boost to productivity. ~20% of the commits to the OpenHands codebase are now authored or co-authored by OpenHands itself.
I'm having trouble reconciling these statements. Where does the productivity boost come from since that reviewing burden seems much greater than you'd have if you knew commits were coming from a competent human?
There's often a lot of small fixes that not time efficient to do, but a solution is not much code and is quick to verify.
If the cost is small to setting a coding agent (e.g. aider) on a task, seeing if it reaches a quick solution, and just aborting if it spins out, you can solve a subset of these types of issues very quickly, instead of leaving them in issue tracking to grow stale. That lets you up the polish on your work.
That's still quite a different story to having it do the core, most important part of your work. That feels a little further away. One of the challenges is the scout rule, the refactoring alongside change that makes the codebase nicer. I feel like today it's easier to get a correct change that slightly degrades codebase quality, than one that maintains it.
Thanks - this all makes sense - I still don't feel like this would constitute a massive productivity boost in most cases, since it's not fixing time consuming major issues. But I can see how it's nice to have.
The bigger win comes not from saving keystrokes, but from saving you from a context switch.
Merge conflicts are probably the biggest one for me. I put up a PR and move onto a new task. Someone approves, but now there are conflicts. I could switch off my task, spend 5-10 min remembering the intent of this PR and fixing the issues. Or I could just say "@openhands fix the merge conflicts" and move back to my new task.
The issue is that you still need to review the fixed PR (or someone else does) which means you just deferred the context switch, you didn't eliminate it. And if the fix is in a new commit, that's possible (whereas if it rebases you have to remember your old SHA).
Playing the other side, pipelining is real.
I haven't started doing this with agents, but with autocomplete models I know exactly what OP is talking about: you stop trying to use models for things that models are bad at. A lot of people complain that Copilot is more harm than good, but after a couple of months of using it I figured out when to bother and when not to bother and it's been a huge help since then.
I imagine the same thing applies to agents. You can waste a lot of time by giving them tasks that are beyond them and then having to review complicated work that is more likely to be wrong than right. But once you develop an intuition for what they can and cannot do you can act appropriately.
I suspect that many engineers do not expend significant energy on reviewing code; especially if the change is lengthy.
>burden seems much greater than...
Because the burden is much lower than if you were authoring the same commit yourself without any automation?
Is that true? I'd like to think my commits are less burdensome to review than a fresh out of boot camp junior dev especially if all that's being done is fixing linter issues. Perhaps there's a small benefit, but doesn't seem like a major productivity boost.
A junior dev is not a good approximation of the strengths and weaknesses of these models.
Agreed! The comparison is great for estimating the scope of the tasks they're capable of--they do very well with bite-sized tasks that can be individually verified. But their world knowledge is that of a principal engineer!
I think this is why people struggle so much with agents--they see the agent perform magic, then assume it can be trusted with a larger task, where it completely falls down.
The post I originally commented on literally made that comparison when describing the models as a massive productivity boost.
My biggest issue is just how often these agents make subtle, hard to spot mistakes.
It can seem great for certain tasks at first. Yesterday I had to add license headers to the top of a lot of source code files. The reason why I let the AI try is because the repository contained lots of different programming languages.
It was able to do this but I then realized that it just removed the last sentence of the text it was supposed to add.
We've seen exponential improvements in LLM's coding abilities. Went from almost useless to somewhat useful in like two years.
Claude 3.5 is not bad really. I wanted to do a side project that has been on my mind for a few years, and Claude coded it in like 30 seconds.
So to say "we are not even close" seems strange. If LLMs continue to improve, they will be comparable to mid level developers in 2-3 years, senior developers in 4-5 years.
> So to say "we are not even close" seems strange. If LLMs continue to improve, they will be comparable to mid level developers in 2-3 years, senior developers in 4-5 years.
These sorts of things can’t be extrapolated. It could be 6-months, it could be a local maxima / dead end that’ll take another breakthrough in 10 years like transformers were. See self-driving cars.
I think the most you could say is we’ve had improvements - from gpt 4 to whatever the current model is has definitely not been exponential improvements.
My experience is acctually they’ve become dramatically less helpful over the past two years (past year in particular). Claude seems not to have backslid much but it’s progression also has not been very fast at all (I’ve noticed no difference since the 3.5 launch despite several updates).
Everything grows sinusoidally and I feel we’re well past the tipping point into diminishing rate of improvement
What does the cost look like for running OpenHands yourself? From your docs, it looks like you recommend Sonnet @ $3 / million tokens. But I could imagine this can add up quickly if you are sending large portions of the repository at a time as context.
I think one of the big problems with Devin (and AI agents in general) is that they're only ever as good as they are. Sometimes their intelligence feels magical and they accomplish things within minutes that even mid level or senior software engineers would take a few hours to do. Other times, they make simple mistakes and no matter how much help you give, they run around in circles.
A big quality that I value in junior engineers is coachability. If an AI agent can't be coached (and it doesn't look like it right now), then there's no way I'll ever enjoy using one.
My first job I spent so much time reading Python docs, and the ancient art of Stack Overflow spelunking. But I could intuitively explain a solution in seconds because of my CS background. I used to encounter a certain kind of programmer often, who did not understand algorithms well but had many years of experience with a language like Ruby, and thus was faster in completing tasks because they didn't need to do the reference work that I had to do. Now I think these kinds of programmers will slowly disappear and only the ones with the fast CS intuition will remain.
I've found the opposite true as well.
I disagree. If anything, CS degrees have proven time and time again they aren't translatable into software development (which is why there's an entire degree field called Software Engineering emerging).
If anything, my gut says that the CS concepts are very easy for LLMs to recall and will be the first things replaced (if ever) by AI. Software engineer{ing,s} (project construction, integrations, scaling, organizational/external factors, etc) will stick around for along time.
There's also the meme in the industry that self-taught, non-CS degree engineers are potentially of the most capable group. Though this is anecdotal.
> If anything, CS degrees have proven time and time again they aren't translatable into software development (which is why there's an entire degree field called Software Engineering emerging).
Emerging? I graduated in 2006 with a BEng in Software Engineering.
The difference between it and the BSc CompSci degree I started in, was that optional modules became mandatory — including an industrial placement year (paid internship).
> Software engineer{ing,s} (project construction, integrations, scaling, organizational/external factors, etc) will stick around for along time.
My gut disagrees, because LLMs are at about the same level in those things as they are in low level coding: not yet replacing humans in project level tasks any more than they do in coding tasks, but also being OK assistants for both coding and project domains. I have no reason to think either has less or more opportunity for self-training, so I expect progress to track for the foreseeable future.
(That said, the foreseeable future in this case is 1-2 years).
> the CS concepts are very easy for LLMs to recall
They're easy to recall, but you have to know what to recall in the first place. Or even know enough of the territory to realise there's something to recall. Without enough background, you'll get a whole set of amazing tools that you have no idea what to do with.
For example, you may be able to write a long description of your problem with some ideas how to steer the AI to give you possible solutions. And the AI may figure out what the problem is and that the hyperloglog is something that could be useful to you. And you may have the awesome programming skills to implement that. But that's a lot of maybes. It would be much faster/easier if you knew about hyperloglog ahead of time and just asked for the implementation or library recommendation.
Or even if you don't know about the actual solution, you'd have enough of CS vocabulary to ask: "how do I get a fast, approximate distinct count from a multiset". It would take a long imprecise description to get the same thing for a coder with no theory background.
To this point, I use AI programming assistants pretty heavily and find very frequently that they will write extremely inefficient or oddly baroque implementations of what I’m asking for in their first pass, that appear as if they don’t have the “knowledge” or ability to do it better, but then they can be prodded to re-do it very easily. Frequently I look at some generated code and write back the most cursory feedback like “looks o(n^2) can you make more efficient” or “use pointers instead of nested loops” or “how about using X approach” and it will often produce something dramatically better than the initial effort. For now at least I think these tools are still most powerful in the hands of experts. (I am a self-taught programmer but have a fair bit of experience)
I'm not convinced an LLM is really "recalling" any CS concepts when they try to solve a problem. IMHO, we're lucky if it matches the pattern of the request against the pattern of a solution and the two are actually related. I'm no expert but I don't think there's any reason to think that an LLM is taking a CS concept and applying it to something novel in order to get a solution. If they were, I believe their success rate would be much higher.
In many places where someone might reach for something they remember from their CS coursework, there's often an open-source library or tool doing much the same thing. Understanding how these libraries and tools function is certainly valuable but, much of the time, people can get by with only a vague hunch; indeed, this is why they exist! IMHO, I would be happier with the LLM assistant if it picked reliable library code rather than writing up a sketchy solution of its own.
I'm also familiar with this idea people who have managed to be successful in the field without a CS degree are more capable. In my opinion, this is hogwash. I think if we take a step back, we'll see that people graduating from established, top-tier CS programs are looking for higher pay than those who have come from a less expensive and (very often) business focused program. To be fair, people from each of these backgrounds has their strengths; in many organizations a developer who has done two semesters of accounting is a real benefit, in others the ability to turn a research paper into the prototype of a new product is going to be important.
Years of experience often washes out much of these differences. People who have started from business oriented education programs may end up taking a handful of CS courses as they navigate their career, likewise many people with a CS background end up accruing business centered skills.
In my opinion, people start out their education at a place that they can afford, a place that is familiar to them, often a place that they feel comfortable. Someone's economic background (and of their family) plays a big role on what kind of educational program they choose when they are on the cusp of adulthood. Smart and talented people can always learn what they need to learn, regardless of their starting point.
I think honestly the meme that non-CS degree engineers are most capable is selection bias.
If they had taken a CS degree they would likely be just as, of not more capable.
To self-learn the topics you need to make good software takes an immense amount of effort and although the data and material is out there, takes a lot of work to figure out.
I'm only recently starting to pick up on "magic" patterns that are actually extremely simple to understand given the right base knowledge... I can gain tons of insights from talks givem in the early 2010s but if I watched them without the correct practical experience and foundational knowledge it is the same as the title to a HN post this week[1], gibberish.
With the correct time playing with the foundational patterns and learning some of the backing knowledge it unlocks amazing patterns in my mind and makes the magic seem simple. A great example, CSP[2]. I've known about and used the actor model before, which I first discovered when I found Erlang, but now with CSP I could ask the question "Why should actors be heavy?", you can put an actor into a light-weight task and spawn tons of them and build a tree of connections. Stuff like oneTBB flow graph[3]now makes sense and looks like a beautiful pattern with some really interesting ideas that can be implemented in more general computing than the high performance computing it was designed for. It seems niche but golang is built on those foundations, and the true power of concurrency in golang comes from embracing that. It fundamentally changes the way I want to structure and layout code and I feel like a good CS course can get you there quicker...
Unfortunately a good CS course probably wouldn't accelerate the average CS grads understanding of that but can get someone dedicated and hungry there much much quicker. Someone fresh out of a JS bootcamp is maybe a decade away from that if they ever even want to search for that knowledge.
1. https://news.ycombinator.com/item?id=42711751
2. https://en.m.wikipedia.org/wiki/Communicating_sequential_pro...
3. https://oneapi-spec.uxlfoundation.org/specifications/oneapi/...
I completely agree with you. More precisely, I feel they are useful when you have specific tasks with limited scope.
For instance, just yesterday I was battling with a complex SQL query and I got halfway there. I gave our bot the query and an half assed description of what I wanted/what was missing and it got it right on the first try.
Are you sure that your SQL query is correct?
Can they be sure even if they wrote it theirself?
he’s certainly sure, but lord knows if it is
And when working with people it's fairly easy to intervene and improve when needed. I think the current working model with LLMs is definitely suboptimal when we cannot confine their solution space AND where they should apply a solution precisely, and timely.
It’s also often possible to know what a human will be bad at before they start. This allows you to delegate tasks better or vary the level of pre-work you do before getting started. This is pretty unpredictable with LLMs still.
As someone who uses AI coding tools daily and has done a fair amount of experimentation with different approaches (though not Devin), I feel like this tracks pretty well. The problem is that Devin and other "agentic" approaches take on more than they can handle. The best AI coders are positioned as tools for developers, rather than replacements for them.
Github Copilot is "a better tab complete". Sure, it's a neat demo that it can produce a fast inverse square root, but the real utility is that it completes repetitive code. It's like having a dynamic snippet library always available that I never have to configure.
Aider is the next step up the abstraction ladder. It can edit in more locations than just the current cursor position, so it can perform some more high-level edit operations. And although it also uses a smarter model than Copilot, it still isn't very "smart" at the end of the day, and will hallucinate functions and make pointless changes when you give it a problem to solve.
When I tried Copilot the "better tab complete" felt quite annoying, in that the constantly changing suggested completion kept dragging my focus away from what I was writing. That clearly doesn't happen for you. Was that something you got used to over time, or did that just not happen for you? There were elements of it I found useful, but I just couldn't get over the flickering of my attention from what I was doing to the suggested completions.
Edit: I also really want something that takes the existing codebase in the form of a VSCode project / GitHub repo and uses that as a basis for suggestions. Does Copilot do that now?
I tried to get used to the tab completion tools a few times but always found it distracting like you describe. often I’d have a complete thought, start writing the code, get a suggested completion, start reading it, realize it was wrong, but then I’d have lost my initial thought, or at least have to pause and bring myself back to it.
I have, however, fully adopted chat-to-patch style workflows like Aider, I find it much less intrusive and distracting than the tab completions, since I can give it my entire thought rather than some code to try to complete.
I do think there’s promise in more autonomous tools, but they still very much fall into the compounding-error traps that agents often do at the present.
I have the automatic suggestions turned off. I use a keybind to activate it when I want it.
> existing codebase in the form of a VSCode project / GitHub repo and uses that as a basis for suggestions
What are you actually looking for? Copilot uses "all of github" via training, and your current project in the context.
> I have the automatic suggestions turned off. I use a keybind to activate it when I want it.
I didn't realise you could do that. Might give it another go.
> Copilot uses "all of github" via training, and your current project in the context.
The current project context is the bit I didn't think it had. Thanks!
For cursor you can chat and ask @codebase and it will do rag (or equivalent) to answer your question
Copilot is also very slow. I'm surprised people use it to be honest. Just use Cursor.
Cursor requires you to use their specific IDE though, doesn't it? With Copilot I don't have to switch contexts as it lives in my Jetbrains IDE.
It's just vscode. I greatly prefer vim but the difference between vim + ai tools and cursor is just a no brainer in terms of productivity. Cursor isn't without problems but it's leagues ahead of the competition in my opinion.
I've been tempted to try Cursor because of vocal fans like yourself. Then I went to their website and forums yesterday. I am no longer tempted.
Can you say more?
pricing model, downtime, model support, pricing model, trying to take over the experience rather than assist within my experience. This last one is big, because Cursor wants to "reimagine" how developers work. The problem is the AIs are so far from being competent, they need to be kept on the sidelines and sub'd in occasionally, not be the quarterback. Oh, did I mention pricing model?
It's a cost per month, supports the top models (tbh sonnet 3.5 is the key model) and is VS code + some more UI.
> trying to take over the experience rather than assist within my experience
I'm not sure I understand. It's got autocomplete, chat, asking it to change files, but it's vscode plus some stuff. What's it "taking over"?
It is worth trying.
It is just a fashion choice though with UI.
Personally, I just prefer the chat interface directly with no Cursor UI.
For me, the best way is to write my prompt in a txt file, away from anything to do with LLMs. The bottleneck is not the update of the files like Cursor is good at.
The bottleneck is the clarity of my thoughts.
I looked at your website.
How to get past Barry Schwartz ideas is the main problem that we face in 2025.
The Godel, Escher, Bach stuff to me is just nonsense. As a huge Bach fan boy it is from when Bach was massively overrated in cultural importance.
Hierarchy Theory? How about O-information?
Doesn't seem the O-information wiki entry exists, yet.
Because of the complaints? If so, yeah I get it. I'm there amongst them. It's kind of like Tesla FSD. There are often setbacks in releases and they definitely need to work on their communication with the community. That said, for the current price it's still worth any misgivings.
The price is one of the issues I have with this space more generally.
I do not want to pay $20/m for a capped experience
I want to pay $10/m to support development, and pay for my AI usage on my own, per request, by choosing my own model and provider
If I was going to shell out money, it would be for Copilot, not Cursor. I prefer my AI to be a side dish, not the main course or core experience
You can use the free version and bring your own api keys if you want. You miss out on features that require cursors models.
I would try cursor. It’s pretty good at copy pasting the relevant parts of the codebase in and out of the chat window. I have the tab autocomplete disabled.
Cursor tab does that. Or at least, it takes other open tabs into account when making suggestions.
i’ve been very impressed with the gemini autocomplete suggestions in google colab, and it doesn’t feel more/less distracting than any IDEs built in tab suggestions
I think a lot of people who are enabling copilot in vs code (like I did a few days ago), are experiencing "suggested autocomplete as I type" for the first time where before there was no grey text below what I am writing personally.
It is a huge distraction, especially if it changes as I write more. I turned it off almost immediately.
I deeply regret turning on copilot in vscode. It (M$) immediately weaseled into so many places and settings. I'm still trying to scaled it back. Super annoying and distracting. I'd prefer a much more opt in for each feature than what they did.
> The best AI coders are positioned as tools for developers, rather than replacements for them.
I agree with this. However, we must not delude ourselves and understand that corporate is pushing for replacement. So there will be a big push to improve on tools like Devin. This is not a conspiracy theory, in many companies (my wife's, for example) they are openly stating this: we are going to reduce (aka "lay off") the engineering staff and use as much AI solutions as possible.
I wonder how many of us, here, understand that many jobs are going away if/when this works out for the companies. And the usual coping mechanism, "it will only be for low hanging fruit", "it will never happen to me because my $SKILL is not replaceable", will eventually not save you. Sure, if you are a unique expert on a unique field, but many of us don't have that luxury. Not everyone can be a top of the cream specialist. And it'll be used to drive down salaries, too.
I remember when I was first getting started in the industry the big fear of the time was that offshoring was going to take all of our jobs and drive down the salaries of those that remained. In fact the opposite happened: it was in the next 10 years that salaries ballooned and tech had a hiring bubble.
Companies always want to reduce staff and bad companies always try to do so before the solution has really proven itself. That's what we're seeing now. But having deep experience with these tools over many years, I'm very confident that this will backfire on companies in the medium term and create even more work for human developers who will need to come in and clean up what was left behind.
(Incidentally, this also happened with offshoring— many companies ended up with large convoluted code bases that they didn't understand and that almost did what they wanted but were wrong in important ways. These companies needed local engineers to untangle the mess and get things back on track.)
But having deep experience with these tools over many years, I'm very confident...
No one has had deep experience with these tools for any amount of time, let alone many years. They're literally just now hitting the market and are rapidly expanding their capabilities. We're at a fundamentally different place than we were just twelve months ago, and there's no reason to think 2025 will be any different.
I was building things with GPT-2 in 2019. I have as much experience engineering with them as anyone who wasn't an AI researcher before then.
And no, we're not at a fundamentally different place than we were just 12 months ago. The last 12 months had much slower growth than the 12 months before that, which had slower growth than the 12 months before that. And in the end these tools have the same weaknesses that I saw in GPT-2, just to a lesser degree.
The only aspect in which we are in a fundamentally different place is that the hype has gone through the roof. The tools themselves are better, but not fundamentally different.
It’s genuinely difficult to take seriously a claim that coding using Sonnet has “the same weaknesses” as GPT-2, which was effectively useless for the task. It’s like suggesting that a flamethrower has the same weaknesses as a matchstick because they both can be put out by water.
We’ll have to agree to disagree about whether the last 12 months has had as much innovation as the preceding 12 months. We started 2024 with no models better than GPT-4, and we ended the year with multiple open source models that beat GPT-4 and can run on your laptop, not to mention a bunch of models that trounce it. Plus tons of other innovations, dramatically cheaper training and inference costs, reasoning models, expanded multi-modal capabilities, etc, etc.
I’m guessing you’ve already seen and dismissed it, but in case you’re interested in an overview, this is a good one: https://simonwillison.net/2024/Dec/31/llms-in-2024/
I'm paying for o1-pro (just for one month) and have been using LLMs since GPT-2 (via AI Dungeon). Progress is absolutely flattering when you're looking at practical applications versus benchmarks.
o1 is actually surprisingly "meh" and I just don't see how they can justify the price when sonnet 3.5 latest is almost as good, 10x as fast and doesn't even have "reasoning".
I'm spending half my day every day for the past few years using LLMs in one way or another. They still confidently (and unpredictability) hallucinate, even o1. They have no memory, can't build up experience, performance rapidly degrades with long conversations, etc.
I'm not saying progress isn't being made, but the rate of progress is definitely slowing.
I think it's qualitatively different this time.
Unlike with offshoring, this is a technological solution, which understandably is received more enthusiastically on HN. I get it. It's interesting as tech! And it's achieved remarkable things. But unlike with offshoring (which is a people thing) or magical NOCODE/CASE/etc "solutions", it seems the consensus is that AI coding assistants will eventually get there. At least a portion of even HN seems to think so. And some are cheering!
The coping mechanism seems to be "it won't happen to me" or "my knowledge is too specialized" but I think this will become increasingly false. And even if your knoweldge is too specialized to be replaced by AI, most engineers aren't like that. "Well, become more specialized" is unrealistic advice, and in any case, the employment pool will shrink.
PS: I am offhsoring (in a way). I'm not based in the US but I work remotely for a US company.
> But unlike with offshoring (which is a people thing) or magical NOCODE/CASE/etc "solutions", it seems the consensus is that AI coding assistants will eventually get there.
There's no consensus to that point. There are a few loud hype artists, most of whom are employed in AI and have so have conflicts of interest and also are pre-filtered to the true believers. Their logic is basically "See this trend? Trends continue, so this is inevitable!"
That's bad logic. Trends do not always continue, they often slow or reverse, and this one is showing all signs of doing so already. OpenAI has come straight out and said that they don't expect to see another jump like GPT-3 to 4, and have resorted to throwing more tokens at the problems, which works with diminishing returns. I do not expect to see a return to the rapid growth we had for a year or two there.
> PS: I am offhsoring (in a way). I'm not based in the US but I work remotely for a US company.
Yes, and this is a good example: there's a place for offshoring, but it didn't replace US devs. The same thing will happen here.
Trends do not always continue, they often slow or reverse, and this one is showing all signs of doing so already. OpenAI has come straight out and said that they don't expect to see another jump like GPT-3 to 4, and have resorted to throwing more tokens at the problems, which works with diminishing returns. I do not expect to see a return to the rapid growth we had for a year or two there.
This feels like the declaration of someone who has spent almost no time playing with these models or keeping up with AI over the last two years. Go look at the benchmarks and leaderboards for the last 18 months and tell me we're not progressing far beyond GPT4. Meanwhile models are also getting faster, cheaper, getting multi-modal capabilities, cheaper to train for a given capability, etc.
And of course there are diminishing returns, the latest public models are in the 90s on many of their benchmarks!
> I wonder how many of us, here, understand that many jobs are going away if/when this works out for the companies. And the usual coping mechanism, "it will only be for low hanging fruit", "it will never happen to me because my $SKILL is not replaceable", will eventually not save you. Sure, if you are a unique expert on a unique field, but many of us don't have that luxury. And it'll be used to drive down salaries, too.
Yeah it's maddening.
The cope is bizarre too: "writing code is the least important part of the job"
Ok then why does nearly every company make people write code for interviews or do take home programming projects?
Why do people list programming languages on their resumes if it's "least important"?
Also bizarre to see people cheering on their replacements as they use all this stuff.
> Ok then why does nearly every company make people write code for interviews or do take home programming projects?
For the same reason they put leetcode problems to "test" an applicants skill. Or have them write mergesort on a chalkboard by hand. It gives them a warm fuzzy feeling in the tummy because now they can say "we did something to check they are competent". Why, you ask? Well it's mostly impossible to come up with a test to verify a competency you don't have yourself. Imagine you can't distinguish red and green, are not aware of it, but want to hire people who can. That's their situation, but they cannot admit it - because it would be clear evidence that they are no good fit for their current role. Use this information responsibly ;)
> Why do people list programming languages on their resumes if it's "least important"?
You put the programming languages in there alongside the HR-soothing stuff because you hope that an actual software person gets to see your resume and gives you an extra vote for being a good match. Notice that most guides recommend a relatively small amount of technical content vs. lots of "using my awesomeness i managed to blafoo the dingleberries in a more efficient manner to earn the company a higher bottom line"
If you don't want to be a software developer that's fine. But your questions point me towards the conclusion that you don't know a lot of things about software development in the first place which doesn't speak for your ability to estimate how easy it will be to automate it using LLMs.
Arguing about programming is not the point, in my opinion.
When AI becomes able to do most non-programming tasks too, say design or solving open-ended problems (yeah, except in trivial cases it cannot -- for now) we can have this conversation again...
I think saying "well, programming is not important, what matters is $THING" is a coping mechanism. Eventually AI will do $THING acceptably enough for the bean counters to push for more layoffs.
When AI can do the software engineering tasks that require expertise outside of coding like system design, scoping problems, cross-team/domain work, etc then it will be AGI, at which point the fact that SWE jobs are automated would be the least of everyones worries.
The main problem I perceive with AI being able to do that kind of work is that it requires an unprecedented level of agency and context-gathering. Right now agents are very much like juniors in that they work in an insular, not collaborative, way.
Another big problem is that these higher level problems often require piecing together a lot of fragmented context. If the AI already had access to the information, sure, it would probably be able to achieve the task. But the hard bit is finding the information. Some logs here, some code there, a conversation with someone on a different team, etc. It's often a highly intuitive and tacit process, not easily explicitly defined. There's a reason that defining what a "Senior" is tends to be very difficult.
> When AI can do the software engineering tasks that require expertise outside of coding like system design, scoping problems, cross-team/domain work, etc then it will be AGI
I think you're talking about the really general case, but in my opinion that's not as important. All that matters is that AI solutions manage (in the near future) to cover the average case -- where most engineers actually work -- in a mediocre but cost effective manner, for this to have huge repercussions on the job market and salaries.
> But the hard bit is finding the information. Some logs here, some code there, a conversation with someone on a different team, etc.
I've no problem believing they will become more and more successful at this. This is information retrieval which can be done faster by machines, and making sense of it all together is where advances in AI will need to happen. I think there's a high chance they'll happen eventually, at least in a way that's enough to cobble together projects that will make the leadership happy (maybe after some review/adjustment by a few human experts they retain?). They do not even have to be particularly successful -- how many human-populated engineering projects succeed, anyway?
Also, because the economy is no longer based on competition, but is controlled by a bunch of industry specific oligopolies, even if the bean counters are wrong it won’t matter, because every other company will be similarly inefficient. Everybody loses, but the people in charge are too dumb to know. Our free market is currently broken.
Is spending 4 years of your life on education that will likely only be 10-20% applicable to your job any less bizarre? It's just another hoop employers want to see you capable of jumping.
If you ignore the syntax programming is just writing detailed instructions. Just because AI is able to translate English to code doesn't mean the 100s of decisions that need to be made go away. Someone still needs to write very detailed instructions even if they are in English and it sure isn't going to be the people sitting in meetings all day.
And let's pretend that I can now be 10x more productive with AI. Great, now I can ship 10x more features in the same timeframe and nothing changes - the development backlog is literally infinite. There are always more features or bugs to work on.
> Just because AI is able to translate English to code doesn't mean the 100s of decisions that need to be made go away. Someone still needs to write very detailed instructions even if they are in English and it sure isn't going to be the people sitting in meetings all day.
What makes you think it will be you? The machines seem increasingly capable of converting English into different English, and if we take it as a given that they can convert English into code.. what are you there for? The people sitting in meetings might as well talk to the machine, to the extent they're willing to talk to you.
To be clear, the professional "meeting participants" are as much on the chopping block as we are, although that's not commonly pointed out.
It's weird to talk about aider hallucinating.
That's whatever model you chose to use with it. Aider can use any.l model you like.
One thing that surprised me a little is that there doesn't seem to be an "ask for help" escape hatch in it - it would work away for literally days on a task where any human would admit they were stuck?
One of the more important features of agents is supposedly that they can stop and ask for human input when necessary. It seems it does do this for "hard stops" - like when it needed a human to setup API keys in their cloud console - but for "soft stops" it wouldn't.
By contrast, a human dev would probably throw in the towel after a couple of hours and ask a senior dev for guidance. The chat interface definitely supports that with this system but apparently the agent will churn away in a sort of "infinite thinking loop". (This matches my limited experience with other agentic systems too.)
LLMs can create infinite worlds in the error message it’s receiving. It probably needs some outside signal to stop and re-assess. I don’t think LLMs have any ability to reason if they’re lost in their own world on their own. They’ll just keep creating new less and less coherent context for themselves
If you correct an LLM based agent coder, you are always right. Often, if you give it advice, it pretends like it understands you, then goes on to do something different from what it said it was going to do. Likewise, it will outright lie to you telling you it did things it didn't do. (In my experience)
So when people say these things are like junior developers, they really mean that they’re like the worst _stereotype_ of junior developers, then?
For sure - but if I'm paying for a tool like Devin then I'd expect the infrastructure around it to do things like stop it if it looks like that has happened.
What you often see with agentic systems is that there's an agent whose role is to "orchestrate", and that's the kind of thing the orchestrator would do: every 10 minutes or so, check the output and elapsed time and decide if the "developer" agent needs a reality check.
How would it decide if it needs a reality check? Would the thing checking have the same limitations?
Decision trees and random forests (funnily enough, this is not sarcasm).
You can maybe have a supervisor AI agent trigger a retry / new approach
They need impatience!
I think training it to do that would be the hard part.
- stopping is probably the easy part
- I assume this happens during RLFH phase
- Does the model simply stop or does it ask a question?
- You need a good response or interaction, depending on the query? So probably sets or decision trees of them, or agentic even? (chicken-egg problem?)
- This happens 10s of thousands of times, having humans do it, especially with coding, is probably not realistic
- Incumbents like M$ with Copilot may have an advantage in crafting a dataset
> One thing that surprised me a little is that there doesn't seem to be an "ask for help" escape hatch in it - it would work away for literally days on a task where any human would admit they were stuck?
You are over-estimating the sophistication of their platform and infrastructure. Everyone was talking about Cursor (or maybe was it astroturfing?) but once I checked it out, it was not far from avante on neovim.
Cursor isn't designed to do long running tasks. As someone mentioned in another comment it's closer to a function call than a process like Devin.
It will only do one task at a time that it's asked to do.
...for now.
They are pushing in this direction with the Composer Agent mode which can carry out a sequence of multi-file changes without you having to specify the files. It's pretty decent. If you're feeling brave there is also a beta "YOLO" mode that will auto approve these changes and run console commands.
Devin does ask for help when it can't do something. I think I have it asked me how to use a testing suite it had trouble running.
The problem is it really really hate asking for help if it had a skill issue, it would prefer running in circles than admitting it just can't do something.
So they perfectly nailed the junior engineer. It’s just that that isn’t what people are looking for.
Maybe. It's pretty weird and I'm still thinking about it.
You can't throw junior engineers working on an issue under the bus when they clearly can't do that. Or at least it takes some effort. In return you may coach them and hope they eventually improves.
Devin does look like junior engineers, but I've learned to just click "Terminate Session" immediately after I spotted that it was doing something hopeless. I've managed to get some real work done out of it, without much effort on my side (just check what it's doing every 10~15 minutes and type a few lines or restart session).
If they had built that from the beginning people would have said "every other tasks it asks me for help, how is it a developer then if I have to assist it all the time?"
But now since you are okay with that, I think it's the right time to add that feature.
You can set a "max work time" before it pauses so it wont go for days endlessly spending your credits. By default its set to 10 credits.
So I'm not sure how the author got it to go for days.
There should be an energy coefficient to problems. You only get a set amount of energy to solve per issue. When the energy runs out. A human must help.
I'm sure a lot of folks in these comments predicted these sorts of results with surprising accuracy.
Stuff like this is why I scoff when I hear about CEOs freezing engineering hiring or saying they just don't need mid-level engineers anymore because they have AI.
I'll start believing that when I see it happening, and see actual engineers saying that AI can replace a human.
I am long AI, but I think the winning formula is small, repetitive tasks with a little too much variation to make it worth it (or possible) to automate procedurally. Pulling data from Notion into Google sheets, like these folks did initially, is probably fine. Having it manage your infrastructure and app deployments, likely not.
This feels a bit like AI image generation in 2022. The fact that it works at all is pretty mindblowing, and sometimes it produces something really good, but most of the time there are obvious mistakes, errors, etc. Of course, it only took a couple more years to get photorealistic image outputs.
A lot of commenters here seem very quick to write off Devin / similar ideas permanently. But I'd guess in a few years the progress will be remarkable.
One stubborn problem – when I prompt Midjourney, what I get back is often very high-quality, but somehow different than what I expected. In other words, I wouldn't have been able to describe what I wanted, but once I see the output I know it's not quite right. I suspect tools like this will run into similar issues. Maybe there will be features that can help users 'iterate' quickly.
> Of course, it only took a couple more years to get photorealistic image outputs.
"Photorealistic" is a pretty subjective judgement, whereas "does this code produce the correct outputs" is an objective judgement. A blurry background character with three arms might not impact one's view of a "photorealistic" image, but a minor utility function returning the wrong thing will break a whole program.
If were comparing Devin to image generation, then Devin would be a version of Midjourney where you have no prompting skills, you only get one image and if you want something different you can only use the remix feature to make changes, oh and with each change the image resolution goes up and you get more jpeg artifacts.
Those “how I feel about Devin after using it” comments at the bottom are damning, when you compare them to the user testimonials of people using cursor.
Seems to me that agents just aren’t the answer people want them to be, just a hype wave obscuring real progress in other areas (eg. MCST) because they’re easy to implement.
…but really, if things are easy to implement, at this point, you have to ask why they haven’t been done yet.
Probably, it seems, because it’s harder to implement in a way that’s useful than it superficially appears…
Ie. If the smart folk working on Devin can only do something of this level, anyone working on agentic systems should be worried, because it’s unlikely you can do better, without better underlying models.
How is Devin different from cursor?
I recently used cursor and it has felt very capable in implementing tasks across files. I get that cursor is an IDE but it's ai functionality feels very agentic.. where do you draw the line?
Cursor Composer (both "normal" and "agent" mode) fit the colloquial definition of agent, for sure.
I had to look up MCST: it means Model-Centric Software Tools, as opposed to autonomous agents.
Devin is closer to a long-running process that you can interact with as it is processing tasks, whereas Cursor is closer to a function call: once you've made the call, the only think you can do is wait for the result.
It stands for Monte Carlo search tree.
Ie. Better outputs from models, not external tooling and prompt engineering.
https://github.com/zz1358m/MCTS-AHD-master
Thanks for the correction, I guess I was lured by yet another LLM confabulation
Agents are really new and would solve plenty of annoying things.
When I code with Claude, I have to copy paste files around.
But everything we do in AI is new and outdated a few weeks ago.
Claude is really good but blocks you in 1-3h for a bit due to context length.
That type of issues will be solved.
And local coding models are super fast on a 4090 already. Imagine a small project digits on your desktop were you allow these models also more thinking. But the thinking style models again are super new.
Things probably are not done yet because we humans are the bottleneck right now. Getting enough chips, energy, standards, training time, doing experiments with tech a while tech b starts to emerge from another corner of ai.
5090 just was announced and depending on benchmarks it might be 1.x-3 times faster. if it's faster above 1.5 that would again be huge.
Have you used Cursor, which GP actually refers to?
Disclosure: Working on a company in the space and have recently been compared to Devin in at least one public talk.
Devin has tried to do too much. There is value in producing a solid code artifact that can be handed off for review to other developers in limited capacities like P2s and minor bugs which pile up in business backlogs.
Focusing on specific elements of the development loop such as fix bugs, add small feature, run tests, produce pull request is enough.
Businesses like Factory AI or my own are taking that approach and we're seeing real interest in our products.
Not to take away from your opinion, but I guess time will tell? As models get better, it's possible that wide tools like Devin will work better and swallow tools that do one thing. I think companies much rather have a AI solution that works like what they already know (developers), than one that works in the IDE, another that watches to Github issues, another that reviews PRs, and one that hangs on Slack and makes small fixes.
> Businesses like Factory AI or my own are taking that approach and we're seeing real interest in our products.
Interest isn't what tools like Devin are lacking, (un)fortunately.
To be clear, I do share a lot of scepticism regarding all the businesses working around AI code generation. However, that isn't because I think they'll never be able to figure it out, but because I think they are all likely to figure it out at the end, at the same time, when better models come out. And none of them will have a real advantage over the other.
I've recently had several enterprise level conversations with different companies and what we're being asked for is specifically the simpler approach. I think that is the level of risk they're willing to tolerate and it will still ameliorate a real issue for them.
The key here is my product is no worse positioned to do more things if and when the time comes, but building a solid foundation and trust, and not having the quiet part be (which I heard as early as several months ago) that your product doesn't work means we'll hopefully still have the customer base to roll that out to.
I've talked to Devin's CEO once at Swyx's conference last June, they're very thoughtful and very kind so this must be very rough but between when they showed their demo then and what I'm hearing now the product has not evolved in a way where they are providing value commensurate with their marketing or hype.
I'm a fan of Guillermo Rauch's (Vercel CEO) take on these things. You earn the right to take on bigger challenges and no one in this space has earned the right yet including us.
Devin's investment was fueled by hyperspeculation early on when no one knew what the shape of the game was. In many ways we still don't, but if you burn your reputation before we get there you may not be able to capitalize on it.
To be completely fair to them, taking the long view and the bank account to go with it they may still be entirely fine.
> You earn the right to take on bigger challenges and no one in this space has earned the right yet including us.
Not entirely. We're in interesting times where products with better models can suddenly leapfrog and displace even current upstarts. Cursor won over Copilot from leveraging Claude Sonnet 3.5. They didn't "earn the right".
Improvements with models will help those with the existing infrastructure that can benefit from it. I'm not saying Devin will win when that time comes, but a similar product might find their space quickly.
I just want to note that Copilot is multi model now and can also run Sonnet.
You can get a much higher hit rate with more constrained agents, but unfortunately if it's too constrained it just doesn't excite people as much.
Ex. the Grit agent (my company) is designed to handle larger maintenance tasks. It has a much higher success rate, with <5% rejected tasks and 96% merged PRs (including some pretty huge repos).
It's also way less exciting. People want the flashy tool that can solve "everything."
Also trialed Devin, it's quite impressive when it understands the code formatting and local test setup, producing well formatted and test case passing code, but it seems to always add extraneous changes beyond the task that can break other things. And it can't seem to undo those changes if you ask. So everything requires more cleanup. Devin opened my eyes to the power of agentic workflows with closed loop feedback, and the coolness of a slack interface, but I am gonna recommend cancelling it because it's not actually saving time and it's quite expensive.
I’ve used Cursor a lot and the conclusion doesn’t surprise me. I feel like I’m the one *forcing* the system in a certain direction and sometimes an LLM gives a small snippet of useful code. Sometimes it goes in the wrong direction and I have to abort the suggestion and force it into another direction. For me, the main benefit is having a typing assistant which can save me from typing one line here and there. Especially refactorings is where Cursor shines. Things like moving argument order around or adding/removing a parameter at function callsites is great. Saved me a ton of typing and time already. I’m way more comfortable just quickly doing a refactoring when I see one.
Weird. I have such a different experience with Cursor.
Most changes occur with a quick back and forth about top level choices in chat.
Followed with me grabbing appropriate interfaces and files for context so Sonnet doesn't hallucinate API, and then code that I'll glance over and around half the time suggest one or more further changes.
It's been successful enough I'm currently thinking of how to adjust best practices to make things even smoother for that workflow, like better aggregating package interfaces into a single file for context, as well as some notes around encouraging more verbose commenting in a file I can provide as context as well on each generation.
Human-centric best practices aren't always the best fit, and it's finally good enough to start rethinking those for myself.
This! I've been using Cursor regularly since late 2023. It's all about building up effective resources to tactfully inject into prompts as needed. I'll even give it sample API responses in addition to API docs. Sometimes I'll have it first distill API docs down into a more tangible implementation guide and then save that as a file in the codebase.
I think I'm just a naturally verbose person by default, and I'm starting to think that has been very helpful in me getting a lot out of my use of LLMs and various LLM tools over the past 2+ years.
I treat them like the improv actors they are and always do the up front work to create (with their assistance) the necessary broader context and grounding necessary for them to do their "improv" as accurately as possible.
I honestly don't use them with the immediate assumption I'll save time (although that happens almost all the time), I use them because they help me tame my thoughts and focus my efforts. And that in and of itself saves me time.
Interesting. What project are you working on? For me it's writing a library in Rust.
This is what’s needed to get the most out of these tools. You understand deeply how the tool works and so you’re able to optimize its inputs in order to get good results.
This puts you in the top echelon of developers using AI assisted coding. Most developers don’t have this deep of an understanding and so they don’t get results as good as yours.
So there’s a big question here for AI tool vendors. Is AI assisted coding a power tool for experts, or is it a tool for the “Everyman” developer that’s easy to use?
Usage data shows that the most adopted AI coding tool is still ChatGPT, followed by Copilot (even if you’d think it’s Cursor from reading HN :-))
I'll add few things at which Cursor with Claude is better than us (at least in time/effort):
- explaining code. Enter some legacy part of your code nobody understands, LLMs aren't limited to keeping few things in memory like us. Even if the code is very obfuscated and poorly written it can understand what it does and the purpose and suggest refactors to make it understandable
- explaining and fixing bugs. Just the other day Antirez posted a bug of him debugging a Redis segfault on some C code providing context and stack trace. This might be a hit or miss at times, but more often than not it saves you hours
- writing tests. It often comes up with many more examples and edge cases than I thought of. If it doesn't, you can always ask it to.
In any case I want to stress that LLMs are only as good as your data and prompts. They lack the nuance of understanding lots of context, yet I see people talking to them like humans that understand the business, best practices and others.
That first one has always felt super crazy to me, I've figured out what lots of "arcane magic, don't touch" type of functions genuinely do since LLMs have become a thing.
Even if it's slightly wrong it's usually at least in the right ballpark so it gives you a very good starting point to work from. Almost everything is explainable now.
I can relate, I have been genuinely amazed more than once by how it could "understand" some very complex code nobody dared to touch like you mention.
Kinda reminds me of that Glados quote, haha:
"These next tests require cooperation. Consequently, they have never been solved by a human. That's where you come in. You don't know pride, you don't know fear, you don't know anything. You'll be perfect."
It takes someone with no ego, no preconceptions, and infinite patience to delve in and come back alive.
Agreed, AI has been a godsend for trying to understand snippets of perl code in our codebase that were basically unreadable before unless you were an expert.
I think the .cursorrules and .cursorignore files might be useful here.
Especially the .cursorrules file, as you can include a brief overview of the project and ground rules for suggestions, which are applied to your chat sessions and Cmd/Ctrl K inline edits.
For moving argument order and removing parameter is already doable by a mature IDE and they are more predictable than AI (Jetbrains IDEs support them well. In VSCode it may need extensions.)
But adding parameter is not well supported by IDE as it requires knowing which value to pass. This is where Cursor can shine.
So for anyone who doubted SWE-BENCH's relevance's to typical tasks, it seems that its stated 13.86% almost exactly matches this 3 successes out of 20 pilot outcome.
We're not quite there yet, but all of these issues seem to be solvable with current tech by applying additional training and/or "agentic" direction. I would now expect pretty much the textbook disruptive innovation process over the next decade or so, until the typical human dev role is pushed to something more akin to the responsibilities of current day architects and product managers. QA engineering though will likely see a big resurgence.
>> but all of these issues seem to be solvable with current tech by applying additional training and/or "agentic" direction.
Can you explain why you think this. From what I gather from other comments it seems like if we continue on current trajectory at best you'd still need a dev who understands the projects context to work in tandem w/ the agent so the code doesn't devolve into slop.
> so the code doesn't devolve into slop
As I see it, this is pretty much a given across all codebases, with a natural tendency of all projects to become balls of mud if the developer(s) don't actively strive to address technical debt and continuously refactor to address the new needs. But having said that, my experience is that for a given task in an unfamiliar codebase, an AI agent is already better at maintaining consistency than a typical junior developer, or even a mid-level developer who recently joined the team. And when explicitly given the task of refactoring the codebase while keeping the tests passing, the AI agents are already very capable.
The biggest issue, which is what you may be alluding to, is that AI agents are currently very bad at recognizing the limits of their capabilities and continue trying an approach when a human dev would have long since given up and went to their lead to ask for help or for the task specification to be redefined. That's definitely an issue, but I don't see any fundamental technological limitation here, but rather something addressable via an engineering effort.
In general, I've seen so many benchmarks fall to AI in the recent decade (including SWE-BENCH), that now I'm quite confident that if a task being performed by humans can be defined with clear numerical goals, then it's achievable by AI.
And another way I'm looking at it is that for any specific knowledge work competency, it seems to already be much easier and time effective to train an AI to do well on it than to create a curriculum for humans to learn it and then to have every single human to go through it.
This only reinforces my bias against AI agents. At this point, they are mostly just hype. I believe that for AI to replace a junior, we would need to achieve at least near-AGI, and we are far from that.
If by hype you mean that there isn't extreme real world value right here and right now, then I very much disagree.
Closing in on 20 years since I left school and for me AI is absolutely useful, right here and right now. It is really a bicycle for the mind:
It allows me to get much faster to where I want. (And like bicycles you will get a few crashes early on and possibly later as well, depending on how fast you move and how careful you are.)
I might be in some sweet spot where I am both old enough to know what is going on without using an AI but also young enough to pick up the use of AI relatively effortlessly.
If however by hype you mean that people still have overhyped expactations about the near future, then yes, I agree more and more.
I feel AI can also do simple monotonous coding tasks, but I don't think programming is something it's currently very good at. Samples, yes, trivial programs, sure, but anything non-trivial and it's rarely useful.
Where it really shines today is getting humans up to speed with new technologies, things that are well understood in general but maybe not well understood by you.
Want to say build a window manager in X11, despite never having worked with X11 before? Sure, Claude will point you in the right direction and give you a simple template to work with in 30 seconds. Enormous time saver compared to figuring out how to do that from scratch.
Never touched node in your life but want to build a simple electron app? Sure, here's how you get started. Few hours and several follow up questions later, you're comfortable and productive in the environment.
Getting off the ground with new technologies is so much easier with AI it's kind of ridiculous. The revolutionary part of AI coding is how it makes it much easier to be a true generalist, capable of working in any environment with any technology, whatever is appropriate.
Exactly. LLMs are gullible. They will believe anything you tell them, including incorrect things they have told themselves. This amplifies errors greatly, because they don't have the capacity to step back and try a different approach, or introspect why they failed. They need actual guidance from somebody with much common sense; if let loose in the world, they mostly just spin around in circles because they don't have this executive intelligence.
A regular single-pass LLM indeed cannot step back, but newer ones like o1/o3/Marco-o1/QwQ can, and a larger agentic system composed of multiple LLMs definitely can. There is no "fundamental" limitation here. And once we start training these larger systems from the ground up via full reinforcement learning (rather than composing existing models), the sky's the limit. I'd be very bullish about Deepmind, once they fully enter this race.
> And once we start training these larger systems from the ground up via full reinforcement learning (rather than composing existing models),
Agree with this totally.
I wouldn't call what the CoT models are doing exactly being able to step back - their "stepping back" still dumps tokens into the output, so it is still burdened with seeing all of these failed attempts as it searches for the right one. But my intuition on this can be wrong, and it's a much more advanced reasoning process than what "last-gen" (non-CoT) does, so I can see your point.
For an agentic system composed of multiple LLMs, I would strongly disagree if the LLMs are last-gen. In my experience, it is very hard to prompt a non-CoT LLM into rejecting an upstream assumption without making it paranoid and rejecting valid assumptions as well. This makes it hard to effectively create a robust agentic system that can self-correct.
I think that's different if the agents are o1-level, but I think it's hard to appreciate just how costly and slow doing this would be. Agents consume tokens like candy with all the back-and-forth, so a surprising number of tasks become economically infeasible.
(It seems everyone is waiting for an inference perf breakthrough that may or may not come.)
What model does Devin use? How would it change if it used o1 or even o3 for times when it gets stuck?
IE. Generate the initial code using GPT4o/Claude 3.5, then start testing the code, when it gets stuck, use o1/o3 to help.
Yea this is what I was wondering as well. I have o1 not o1 pro but I am gathering from reddit/youtube o1 pro if used correctly is superior for coding tasks.
Sounds exactly like my experience with the “agents” about a year ago. Autogpt or whatever it was called. Works great 1% of the time and the rest it gets stuck in the wrong places completely unable to back out.
I’m now using o1 or Claude Sonnet 3.5 and usually one of them gets it right.
The current frontier models are all neocortex. They have no midbrain or crocodile brain to reconcile any physical, legal or moral feedback. The current state of the art is to preprocess all LLM responses with a physical/legal/moral classifier and respond with a generic "I'm sorry Dave, I'm afraid I can't do that."
We are fooled into thinking these golems have a shred of humanity, but their method of processing information is completely backward. Humans begin with a fight/flight classifier, then a social consensus regression, and only after this do we start generating tokens ... and we do this every moment of every day of our lives, uncountably often, the only prerequisite being the calories in an occasional slice of bread and butter.
The whole idea of Devin is pointless and doomed to fail in my humble opinion, big tech will be quite capable on delivering A.I agents / assistants - very soon. I don't think wrappers over other people's LLMs like Devin make a lot of sense. Can someone help me understand what's the value proposition / moat of this company?
I'm confused here, aren't agents/assistants basically wrappers over LLMs or tools that interact with them as well? Devin seems to be in this category.
I recommend you look at tools like Aider or Codebuff... sure they need to call some LLM at some point (could be your own, could be external), but the key thing that they are doing complex modifications of source code using things like treesitter -> i.e. you don't rely directly on the LLM modifying code, but the LLM using trees to modify the code. See in Aider's sourcecode: https://github.com/Aider-AI/aider/tree/main/aider/queries
Simple copy-pasting of "here's my prompt, give me code" was always doomed from the start to be perfect every time, and DEFINITELY won't work for an agent. We need to start thinking about how to use these LLMs in smarter ways (like the above mentioned tools)
Can Aider sit inside VS Code, understand what files I have open, and use them as context? Their docs lead me to say no, that they are an inline chat/completion experience
Their is a /chat command and an /add command so I'd assume a plugin like that is possible.
"Even more telling was that we couldn’t discern any pattern to predict which tasks would work."
I think this cuts to the core of the problem for having a human in the loop. If we cannot learn how to best use the tool from repeated use and discern some kind of patterns of best and worst practices then it isn't really a tool.
The assumption with low-code tooling was that AI is so good at writing actual code in a way that it will make low-code tools redundant. Spending time with Windsurf, Cursor, and a bunch of VSCode extensions, while it was so impressive to see new projects being created autonomously, asking for new requirements or fixing bugs after >10 iterations was more complex.
I had to audit the code and give specific directions on how to restructure the code to avoid getting stuck when the project gets more complex. That makes me think autonomous agents will do much better on low-code tools, as their restrictions ensure the agent is on track. The problem with low-code tools is that they also get more complicated to scale after maybe like >200 iterations. (for a medium-sized project, on average 6 months)
The thing with AI agents I tend to find is they reveal how much heavy lifting the dev is actually doing.
A personal example, my best use out of AI so far has been cases where documentation was poor to nonexistent, and Claude was able to give me a solution. But the thing is, it wasn't a working solution, nowhere close, but it was enough for me to extrapolate and do my own research based on the structure, classes and functions it used. Basically, it gave me somewhere to start from. Whether that's worth the social, economic and environmental problems is another story.
I'm working on AI assistant in Python notebook. It aims to help with data science tasks. I'm not using it to do a full analysis. It will fail. What I ask is to create a code snippet for my next step in the analysis. Many times I need to manually change the code, but it is fine because LLM speed-up my coding a lot. And it is really fantastic in writing matplotlib code for visualization. I don't remember all matplotlib syntax to change axis labels, add annotations or change style, and LLM really can handle it good, in impressive speed.
I remain sceptical about the "Planet Tracker"-task. The task was to debunk claims about historical positions of Jupiter and Saturn. If the task was to find those planets were NOT in a certain (claimed) position an erroneous program would still appear to "debunk" the claims. Did they check if Devin's code's calculated positions were actually correct? Did they check in some NASA-database? If Devin gave arbitrary positions for the planets it's much more likely that they're different than any claim and appear to debunk it.
I was able to read the code it wrote, and check that (as hoped) it was using a good existing library to do the heavy lifting. And I had it make plots that I could visually use to check that the values were 'reasonable'. The value in that case was simply that I didn't have to leave the couch and write the code myself (although if the result was actually needed for anything more important than a smug 'i thought so' confirmation I would still have taken over and validated it kore carefully).
At some point people are going to realize that using these LLM AIs is a communications problem, and by that I mean the reason various attempts to use them fail is because they are not being effectively told what to do, vague and implied requests are not enough for a inhuman statistical construct to grasp what you're asking without clearer more details and more specific instructions.
I also wrote my first impressions on Devin, more focused on the user experience and analysis of its capabilities (with lots of screenshots):
https://thegroundtruth.substack.com/p/devin-first-impression...
Your take seems much more positive than theirs. What do you think the key differences are between your experience and the one here?
One possible reason is that I'm using popular tech stacks (Next.js, HTML/JS for demo website and SDK). No niche frameworks or tools like nbdev (I've never heard of that).
Also I've been prompting ChatGPT and Claude for over a year, that might help with communicating with Devin.
> Even more concerning was Devin’s tendency to press forward with tasks that weren’t actually possible. (...)
> Devin spent over a day attempting various approaches and hallucinating features that didn’t exist.
One of the big problems of GenAI is its inability to know what they don't know.
Because of that, they don't ask clarifying questions.
Humans, in the same situation, would spend a lot of time learning before they could be truly productive.
Your statement is factually wrong, Claude 3.5v2 asks clarifying questions when needed "natively", and you can add similar instructions in your prompt for any model.
The default system prompts are tuned for the naive case. LLMs being all purpose text handling tools, can be reprogrammed for any behavior you wish. This is the crux of skilled use of LLMs.
The better the LLMs get, the worse the average prompt quality.
Yep. It's fairly trivial to prompt an LLM to say "I don't know" when it doesn't know something.
I've been experimenting with code gen on and off for the last 18 months, and find this exactly in line with my experience.
Do you have good references about using AI coding assistants?
Techniques of prompt engineering help a lot, but I really think there will be created a body of knowledge about how to use, what's the good contexts of use, and good heuristics. They are a valuable tool, but I feel it's possible to extract more value.
If you're going to compare other tooling, I'm curious to know what you think of our long term goals: https://github.com/charperbonaroo/robtherobot/issues/2
Most the problems you mentioned will likely be solved with the next iterations of Devin or similar product.
I can say that because I work daily with Claude as an agent over mcp, and the problems you mentioned feel very familiar.
Based on the type of the issues you mentioned, Devin isn't likely using o1 yet. A workflow like o1 for planning, Claude for Coding, o1 for review, etc., would work better.
The problems you mentioned: ssh-key issue unrelated to script, code not following existing patterns or themes, instructions not being followed, extra abstractions, etc., fall into that category.
Some of the issues are likely due to context length problem. For example, LLM doesn't work well with jupyter notebook because of extra junk in ipynb, which will likely remain a problem.
We’ll see! We’re just one year away from AGI. Just like we were last year!
I can’t believe they named it after Devin Franco - guess it can take a lot of load!
No matter what happens with Devin specifically, I think this is a really important topic and I enjoy reading updates on this kind of review every time.
Please keep them coming.
An engineer that thinks it knows everything (but doesn't) and can't self-correct is about the worst combo I can think of.
Well, having read too much sci-fi, I am more afraid of an AI engineer that really does know everything.
AI in general doesn’t understand scope
Honestly, i have been bitten so many times by LLM hallucinations when I work in parallel with the LLM, I wouldn't trust it autonomously running anything at all. If you have tried to use imaginary APIs, imaginary configuration and imaginary cli arguments, you know what I mean
> If you have tried to use imaginary APIs, imaginary configuration and imaginary cli arguments, you know what I mean
I see this comment a lot but I can't help but feel it's 4 weeks out of date. The version of o1 released on 2024-12-17 so rarely hallucinates when asked code questions of basic to medium difficulty and provided with good context and a well written prompt, in my experience. If the context window is sub-10k tokens, I have very high confidence that the output will be correct. GPT-4o and o1-mini, on the other hand, hallucinates a lot and I have learned to put low trust in the output.
o1 is way to slow to keep up with my flow of thinking in order to be of any help in the scenario i am describing
How are you using LLMs? With o1 I've switched to spelling out in lots of details what I want, then asking it it to one shot the full file, so with this approach the wait time has been acceptable.
I'm using it to orient me when tackling something new. For instance, the other day i was making a web driver client in shell and i asked it something apong the lines of "is there an http endpoint to the webdriver to get the class name of an element?"
These are the sort of questions i mostly do. "What is the best practice to read output from a device file in C", "Is there a cli tool to find dead typescript interface fields?"
I have been feeling LLM burnout and favoring code it all my self after a year of LLM assistance. When it gets things wrong it is too annoying. Like, I would get mad and start to curse it, shouting loud and in the chat.
I mainly use it as a typing assist. If it suggests ahead what I was thinking it saves time.
Exactly this. At first started verbally abusing it untill it conformed, but i quickly realised that after the context gets very long it simply discards former instructions and abusing. So i get frustrated, toxic AND don't get my job done
I saw few people around testing and it is quite disappointing. Sometimes a task might take forever and deliver a bad result or fail completely.
It seems it is targeting few specific problems and whatever else is just too hard. I also think that, thought it is expensive, it is cheap for the technology behind it and it won't be able to keep that price for long
Now is the time for us to hold seemingly contradictory propositions: A child born today will live to see 99% of all computer code written by artificial intelligence, but the current AI boom is massively overcapitalized.
I don't put much stock in predictions about 100 years into the future.
would you like to buy a flying car?
I have a Tesla in space to sell ya
i accept.
NFT of the current coordinates incoming...
I’m going to make so much money
No, I want 140 characters.
How is it contradictory?
I'd argue that software is being written (either by humans or AI) in an order that it progressively adds less marginal value (if we define value in the capitalistic sense).
Most of the value that software will ever create has already been created.
The only truly valuable missing things are stuff whose value is not easy to translate to capitalists, or need some visionary work.
That's already the case if you call compilers/interpreters "AI". Just a new higher level abstraction for code.
[dead]