AI Development Tips So Far
How I use AI in Sept 2025 to speed software development
I work on Superphonic, the world’s best iOS podcast player 😁, as my full-time job. In the past six months, the way I use AI has completely changed given the latest developments in the models and toolsets. Here is how I currently use AI — and I’d love to hear your tips as well!
Which Model for What
Each model has strengths which I’m still learning and adjusting to:
Grok Fast. It’s fast, alright, but often fastest to the incorrect answer. I sometimes use it for very simple tasks where it’s hard to go wrong. Grok seems to have some fundamental reasoning issues where it’ll spin off and do seemingly illogical things — it takes a lot of babysitting. I use it least.
Sonnet 4 via Cursor. This makes the most reliable edits (i.e. code changes that make sense and don’t look dumb). It’s the model I use most via Cursor’s IDE agent.
Claude Code. While this is the same backing AI as Sonnet 4, I'm torn on whether accessing it through Claude Code is better than through Cursor's agent.
Claude's interface makes the model easily interruptible, which makes for more interactive sessions that feel like pair programming. You can redirect the AI before it goes too far afield. But it also often goes over the top, trying to build the code itself via the Xcode CLI or do other system-wide things from the command line.
When Sonnet is piloted by Cursor instead, it seems better at tool use (e.g. knowing how to properly find code files, issue the right commands for your project).
GPT5 Codex. Much better than its predecessors, and excels in reasoning. For instance, it’s likely to give you the best answer to questions about your codebase (e.g. “How can component Boo be refactored to software best practices?”). It can also do a great job speculating on why a bug is happening if you describe the symptoms. But it’s definitely not as good as Sonnet at plain vanilla code editing, often making changes you didn’t ask for or chasing baffling bunny trails.
I’ve got my hands full just trying to develop a deep intuitive understanding of each model’s strengths, so I haven’t played with DeepSeek and other models. The above four are already plenty to keep me very busy.
Multiplicity
Speaking of which, the models are now good enough that I often spawn them on tasks in the background. So in a typical day, I have all of the following going at once:
Two instances of Cursor in two separate copies of the repo. I know I can use git worktrees if I’m careful not to mess things up, but I find it easier just to have two copies of my relatively small repo. I flip between these two IDE instances to keep both AIs occupied.
One Cursor is usually running the Cursor built-in agent.
The other is sometimes running Cursor agent and sometimes running Claude Code within Cursor’s terminal. The latter allows me to use the rest of Cursor’s UI to quickly review code changes, so I prefer to run Claude Code within Cursor even though any terminal would work.
Codex running GPT5. I usually have 3-4 active tasks spinning in here at any one time. To avoid merge conflicts, I make each task smaller and more specific, and I also choose a mix of feature tasks vs. code cleanup tasks since the latter can often be done without interfering with other active work.
Cursor Agent running on cursor.com/agents. This, too, usually has 3-4 active tasks. I put smaller, easier tasks onto the web agent, reserving highly interactive tasks to the two Cursor IDEs I have running.
Since each AI is quite slow, there’s often a ton of available time to bounce between all these active tasks. The rate limiters are usually:
My ability to context switch without making mistakes. It's easy to misremember which copy of the repo you're working in and what each agent is doing.
The combinatoric explosion of merge conflicts. It depends on how well-factored your codebase is, but I tend to find that with ~10 agents running at once, I quickly run into enough merge-conflict complexity to slow progress.
Teamwork Makes the Dream Work
Given each AI’s strengths and weaknesses, it’s effective to play them off each other for best results. I’ll start with one AI to do a task, but if it gets stuck, I’ll flip to a different one to continue the work.
I’ve also found it very effective to start with GPT5 to analyze code and propose an approach, after which I hand the execution off to Sonnet. You can even ask GPT5 to make the prompt for Sonnet so everything gets done precisely.
Another thing I do with complex changes is have a different AI review the changes once the first AI is done. I use prompts like, “Changes have been made in this branch to blah blah blah. Please critique these changes against origin/main.” Often, the second AI is able to find implementation bugs the first one can’t. The result is usually better than one AI alone — though it matters which AI you end with (e.g. don’t end with Grok, which often goes through and makes things worse).
I recommend enabling both Codex code review and the Cursor review bot on your PRs. They usually find non-overlapping sets of issues, with relatively few false positives. Funnily enough, they often find bugs in code they themselves produced.
Speculative Execution
One amazing thing you can do when you have intelligence on tap is to launch AI on various speculative adventures which you aren’t even sure are ultimately a great idea.
For instance, I might think a certain refactoring would make the codebase better. Or I want to change the implementation of one component completely because it could simplify things.
I used to have to guess at the likelihood of success to weigh whether I should spend two hours doing something speculative. Now I spin AI off to do the work, usually via web agents, and I come back after a while to check on progress.
On some days, I've thrown away up to 80% of the tasks I set a web agent on, but it doesn't matter. In the 20% of cases where the speculative work pans out, it's a real win. And in the remaining cases, the time spent spinning on that work didn't come at the cost of other work being done. It's essentially all upside.
Simple Tasks FTW
Here are some common things which AI is very good at, and which I also personally find unpleasant (so I’m extra grateful for its help):
Rebasing and fixing merge conflicts. AI is often fairly decent at this, especially if you have good unit tests that let it self-correct.
Fast print debugging. I often describe a bug to the AI and ask it to insert print statements in enough places to narrow down the bug (see the sketch after this list). I then run the app and give the AI all the logs, which it analyzes instantly. This has proven to be a very fast way to find bugs.
Addressing simple code review feedback. Sometimes a code reviewer suggests renaming a variable or refactoring a function to share it amongst several callers. These are great cases to spin AI off on, especially since you can set the branch that any web agent works on (e.g. so you can get it to work on the PR’s branch while you go do other things).
Getting to 100% code coverage. There are many parts of my codebase which now have more comprehensive tests because I spin up AI on these tasks in the background.
Extending lint. This falls in the category of skills I would never take time to learn on my own — but asking AI to extend the linter to enforce custom rules almost feels like a superpower.
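To make the print-debugging pattern above concrete, here's a minimal, hypothetical sketch in TypeScript. Everything in it (the Episode type, feedCache, loadEpisodes, the [BUG-HUNT] tag) is made up for illustration rather than taken from Superphonic; the point is just to show the kind of tagged logging I ask the AI to sprinkle along a suspect code path, so the relevant lines are easy to pull out of the run and paste back to the model.

```typescript
// Hypothetical sketch of the "fast print debugging" pattern. None of these
// names come from the real Superphonic codebase; they're placeholders so the
// example is self-contained and runnable.
interface Episode {
  title: string;
  isHidden: boolean;
}

const feedCache = new Map<string, Episode[]>();

async function fetchFeed(feedUrl: string): Promise<Episode[]> {
  // Stand-in for a real network fetch.
  return [{ title: `Episode from ${feedUrl}`, isHidden: false }];
}

async function loadEpisodes(feedUrl: string): Promise<Episode[]> {
  // A consistent tag ([BUG-HUNT]) makes the relevant log lines easy to
  // collect in one pass and hand back to the AI.
  console.log(`[BUG-HUNT] loadEpisodes start feedUrl=${feedUrl}`);

  const cached = feedCache.get(feedUrl);
  console.log(`[BUG-HUNT] cache hit=${cached !== undefined} cachedCount=${cached?.length ?? 0}`);

  const episodes = cached ?? (await fetchFeed(feedUrl));
  console.log(`[BUG-HUNT] loaded count=${episodes.length}`);

  const visible = episodes.filter((e) => !e.isHidden);
  console.log(`[BUG-HUNT] visible after filter count=${visible.length}`);

  return visible;
}
```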
Designing Codebases for AI
As AI becomes more powerful, there are increasingly large payoffs for designing your codebase specifically so that AI can work in it effectively. Many of these principles are the same for human-only software projects, but I find AI further accentuates the benefits of these good software practices:
Comprehensive tests save time. A lot of my feedback to AI is about things it has obviously broken. But as I write more tests (well, as AI writes more tests), the AI needs less feedback from me because it readily detects when changes break things.
Tighter lint rules prevent cheating. These AIs all cheat the way junior engineers do (i.e. not maliciously, but because they miss the point of a task). For instance, when asked to hit 100% code coverage on a file, the AI will often just insert comments that cause the coverage tool (e.g. Jest) to ignore branches. To prevent this sort of cheating, I've introduced custom lint rules that trigger when these approaches are used (see the sketch after this list).
Cleaner factorings increase parallelism. The more AIs you run in parallel, the more you quickly realize that tight coupling in the codebase limits how many different changes you can make at once. You start running into a ton of merge conflicts, for example. This realization has caused me to refactor parts of the codebase in order to let more AIs be able to work simultaneously.
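As one concrete example of the anti-cheating lint rules mentioned above, here's a minimal sketch, assuming a JavaScript/TypeScript codebase with ESLint and Jest's Istanbul-based coverage. It's illustrative rather than the exact rule from my codebase, and the rule name noCoverageIgnore is made up; it simply flags coverage-suppression pragmas such as /* istanbul ignore next */ so the cheat fails lint instead of silently inflating the coverage numbers.

```typescript
// Sketch of a custom ESLint rule that flags coverage-suppression pragmas
// (e.g. "istanbul ignore next", which Jest's Istanbul coverage honors, or
// "c8 ignore"). Illustrative only; adapt the regex to your own tooling.
import type { Rule } from "eslint";

const noCoverageIgnore: Rule.RuleModule = {
  meta: {
    type: "problem",
    docs: {
      description: "Disallow coverage-ignore comments; write a real test instead.",
    },
    schema: [],
    messages: {
      noIgnore: "Do not suppress coverage with '{{ pragma }}'; add a test instead.",
    },
  },
  create(context) {
    const sourceCode = context.getSourceCode();
    return {
      Program() {
        // Scan every comment in the file for a coverage-ignore pragma.
        for (const comment of sourceCode.getAllComments()) {
          const match = /(istanbul|c8)\s+ignore/i.exec(comment.value);
          if (match && comment.loc) {
            context.report({
              loc: comment.loc,
              messageId: "noIgnore",
              data: { pragma: match[0] },
            });
          }
        }
      },
    };
  },
};

export default noCoverageIgnore;
```

You'd register something like this in a small local ESLint plugin and turn it on like any other rule; the same idea extends to whatever suppression pragmas your own coverage tooling honors.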
None of these things are new, of course. I just hadn't expected how much these basic software hygiene practices would reduce the amount of babysitting the AI needs.
Additional Tips
I’d love to hear your suggestions of how to get the most out of existing models. What’s your workflow? Which tools and approaches do you find most helpful? Please share in comments so everyone can benefit!