Discover how to harness the power of Cursor IDE's Composer Agent feature to optimize your code. In this lesson, you'll watch an AI agent do all the work: generating test suites, implementing performance benchmarks, and creating optimized function variations, all without you writing a single line of code. You'll learn how to validate performance improvements through data-driven benchmarking, ensuring that optimizations are backed by concrete metrics rather than speculation. This hands-on demonstration showcases a groundbreaking approach to automated function development that combines the precision of test-driven development with the efficiency of AI-powered optimization.
Transcript
[00:00] Let's see if we can use an agent to optimize this function, which is about 100 lines of calculating order prices. So I'm going to open a new Composer session and @-mention calculateOrderPrices. The first thing I want the agent to do is: please write a suite of 10 unit tests covering various scenarios of calculateOrderPrices, so that I can confidently make changes to the code when I attempt to refactor it for performance in the future. We'll paste that in, I'll switch this to agent mode, and I don't have any tests set up yet. So once I hit submit, hopefully it'll set up the dependencies and the tests and everything we need to get this covered.
[00:40] Always make sure to review at least a little of what it's doing here. You can see the tests it wrote, and everything looks pretty good to me. I'm going to go ahead and accept, then scroll back down and ask it to please install all the necessary dependencies and run the tests for me; if any of the tests fail, please fix them. Hit submit again.
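For reference, the generated suite looks roughly like this; the order shape, field names, and expected numbers below are hypothetical stand-ins, since the real calculateOrderPrices signature isn't shown on screen:

```ts
// calculateOrderPrices.test.ts (sketch): the real function's signature and
// order shape aren't shown in the lesson, so the fields and expected values
// here are hypothetical.
import { describe, it, expect } from 'vitest'
import { calculateOrderPrices } from './server' // assumes the export the agent adds shortly

describe('calculateOrderPrices', () => {
  it('multiplies unit price by quantity for each line item', () => {
    const [priced] = calculateOrderPrices([
      { id: 'order-1', items: [{ sku: 'A', unitPrice: 10, quantity: 3 }] },
    ])
    expect(priced.total).toBe(30)
  })

  it('returns a zero total for an order with no items', () => {
    const [priced] = calculateOrderPrices([{ id: 'order-2', items: [] }])
    expect(priced.total).toBe(0)
  })
})
```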
[01:01] Now it's going to be able to run terminal commands. So I'll accept that, hit run, and it did notice that I hadn't exported calculateOrderPrices yet, which it definitely needed to do. I don't know why it's trying to modify existing code here, or trying to modify my tsconfig, so I'm going to call it out on that: I already have a tsconfig and a Vitest config, and please don't make any modifications to code other than adding exports so that functions are accessible for tests. We'll hit submit again and hopefully it will do better this time.
[01:35] It did apologize, which is always silly from an AI, and I think it might have taken me too literally around exporting the functions and declared globals to get around it. So let's try: avoid declaring globals; instead you can export whatever you need from the server file, just don't modify any of the objects. Hit submit again, and this is looking better now. Let's go ahead and accept that.
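What it settles on is the plain module approach: add an export to the function in the server file and import it from the test, rather than hanging anything off globals. Roughly like this; the file name and the placeholder signature are assumptions, since the real code isn't shown:

```ts
// server.ts (sketch): the only change needed for testability is the `export`
// keyword; the ~100-line body stays exactly as it was. The signature and
// return below are placeholders so this sketch compiles.
export function calculateOrderPrices(orders: Array<Record<string, unknown>>) {
  // ...original pricing logic, unchanged...
  return orders
}

// What we asked the agent NOT to do:
//   (globalThis as any).calculateOrderPrices = calculateOrderPrices
// The tests simply import it instead:
//   import { calculateOrderPrices } from './server'
```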
[02:01] I actually hadn't looked at the tests, and maybe they were using globals because I had not exported the function. That would explain why it went off on that weird tangent. Let's accept those changes to the tests. And now we've arrived at this step, as you can see highlighted here.
[02:15] We'll go ahead and run the command to install the necessary packages for testing. The package.json is already initialized, so this probably won't fix it, but maybe there's something I don't know. It's trying to use yarn instead. I wonder if I can have it read this error log. I did a quick Google search, and it looks like, since I used pnpm to set up the project and it's trying to use npm, the solution is to always use pnpm.
[02:45] This error only happens when you start with pnpm and then switch to npm. So let's run this. Now it ran the tests, and it should notice that the tests failed and suggest, there we go, suggest some changes for us. It looks like they're all based on math, so I'm going to accept these changes.
[03:02] You can tell again by that purple box. And now it's down here. I'm going to run the command. Something's taking a while so I'm going to pop out the terminal. Not sure what happened so let's try running it again.
[03:11] What it looks like, or smells like, is that my tests are being watched rather than run one time, so each time I run a new command it's starting a new file watcher. If I get a chance to interrupt the agent I'll tell it that. Let's go ahead and accept, and before running this I'm gonna tell it: I think the tests are running in watch mode. Can you set it up so that they only run once, so that when you ask me to run commands it doesn't leave a process open watching for files to change? Hit submit here.
[03:43] All right let's run this. Pop this out to see what it did. Looks like it's still watching for files to change. Yeah, it does need a flag and not just kind of an argument. So let's run that.
[03:54] There we go, much better. I'm not sure what it's asking me to accept right here, maybe just to finalize and apply the changes. I'm going to come in and make sure I close out any running terminals with the little trash can icon or Command+Backspace. And we can move on to the next phase of our plan now that this is tested.
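For context, the "flag" it needed is Vitest's single-run mode: bare `vitest` watches files by default, while `vitest run` (or the `--run` flag) exits after one pass. The same behavior can also be set in the config; a minimal sketch, assuming an existing vitest.config.ts with more in it:

```ts
// vitest.config.ts (minimal sketch; a real project config will have more options)
import { defineConfig } from 'vitest/config'

export default defineConfig({
  test: {
    // Exit after a single pass instead of watching for file changes.
    // Equivalent to running `vitest run` (or passing --run) on the CLI.
    watch: false,
  },
})
```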
[04:16] So we'll go ahead and create a new Composer session. I'll select this block of code and hit Command+I to drop it into our Composer session, or you could type @ and then the name of the function; both ways work. Make sure we're on agent. Please create a performance.test.ts file with five performance scenarios on the calculateOrderPrices function so that we can begin benchmarking it. Add a way to run the benchmark from the package.json, and also set up a way to track the benchmarking history so that we can see improvements or regressions over time.
[04:50] So we'll make sure we're on agent, hit submit. Looks like it's doing some general inspection. Let's go ahead and accept it and allow it to benchmark. Looks like it's going to log benchmark results in a readme. What's going on in here?
[05:07] Create a directory for it, and let's go ahead and run this command. It should notice the error here and suggest a fix. Yep, it found a file naming convention that didn't match: the file name I requested didn't match the way Vitest typically handles benchmarking. So I'll allow that to happen.
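Vitest picks up benchmarks from `*.bench.ts` files and exposes a `bench` helper, which is presumably why it renamed my requested performance.test.ts. A sketch of what one scenario file might look like; the scenario names, fixture shape, and sizes are assumptions:

```ts
// calculateOrderPrices.bench.ts (sketch): the five scenarios the agent
// generated aren't shown in full, so this fixture and these sizes are made up.
import { bench, describe } from 'vitest'
import { calculateOrderPrices } from './server'

// Hypothetical fixture builder for benchmark input.
const makeOrders = (count: number) =>
  Array.from({ length: count }, (_, i) => ({
    id: `order-${i}`,
    items: [{ sku: 'A', unitPrice: 10, quantity: (i % 5) + 1 }],
  }))

describe('calculateOrderPrices benchmarks', () => {
  bench('small batch: 10 orders', () => {
    calculateOrderPrices(makeOrders(10))
  })

  bench('large batch: 10,000 orders', () => {
    calculateOrderPrices(makeOrders(10_000))
  })
})
```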
[05:25] I'll accept this and now let's try running again. It does look like a pattern I need to get used to is always telling it to use pnpm; I'll have to make sure to add that to my Cursor rules. And this is taking a while, it might be... Let's see...
[05:40] I don't see any results output, so let's read through this. I'm going to copy this, paste it in here, and say: I've noticed two things. One, this command starts a file-watching mode. Please make it a single-run mode.
[05:56] And two, I don't see any output in the bench section or the summary section telling me how fast this is running. Please fix these for me. All right, so it's adding that run flag to the benchmark. Let's accept that. I honestly didn't even look at the output of the server bench before.
[06:14] So let's accept that and run the benchmark. Looks like we're getting some results, so now we're in a good spot to start our new Composer session. Let's make sure to exit this out.
[06:28] We don't need that file watcher running. In my new session I'll take this function, hit Command+I to drop it into my Composer session, and say: generate three variations of this function that are optimized for performance. Make sure that each of them is run against the unit tests in the project and the benchmarks in the project. Always run the unit tests first so that it fails quickly, then run the benchmarks, and then report back which of the three variations is the fastest. Then we'll let this run, and it's going through and making all of these variations for me.
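Roughly, what the agent should produce is a set of drop-in replacements that keep the original signature, so the same tests and benchmarks can target any of them. A structural sketch only, since the real variation bodies aren't shown on screen; the strategy comments are illustrative and the placeholder bodies just delegate to the original:

```ts
// calculateOrderPricesVariations.ts (structural sketch; names and strategies
// are illustrative, and the bodies are placeholders that delegate to the original)
import { calculateOrderPrices } from './server'

type Impl = typeof calculateOrderPrices

// Variation A: e.g. collapse chained .map()/.filter()/.reduce() passes into one loop.
export const calculateOrderPricesA: Impl = (...args) => calculateOrderPrices(...args)

// Variation B: e.g. precompute lookups (discounts, tax rates) outside the item loop.
export const calculateOrderPricesB: Impl = (...args) => calculateOrderPrices(...args)

// Variation C: e.g. avoid intermediate array and object allocations.
export const calculateOrderPricesC: Impl = (...args) => calculateOrderPrices(...args)
```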
[07:07] So there's variation B; let's see variation C. I'm going to accept all, and I'm going to ask it to please run the unit tests and then the benchmarks against each of these variations and track which one is fastest. We'll hit submit here, always making sure agent is selected; sometimes it's easy to miss that. You'll also find this happens a lot: you're trying to get work done and it finds linter errors and tackles those first, which can be a little frustrating since the linter errors are not as important as the task at hand. I haven't found a way around that yet.
[07:40] So all right let's accept the changes for the linter errors. I'm not sure if it set up the test correctly to run all the variations, but we'll run it and see what happens. At this point I'm not sure if I should tell it there is a package.json. Yeah it dug into it before I could tell it that it existed. So let's accept and run.
[08:00] The thing I want to check here is that it's running the unit tests against every variation, and I didn't see that happen. So I'm going to interject here: I don't see the unit tests running against every variation of the function we're trying to benchmark. Please review the work so far and make sure that we test all of the variations and then benchmark them.
[08:22] It is critical that each of the variations pass the unit tests before we benchmark them. So we'll submit this and ignore this command. All right, so now it's going to test all variations. Let's accept this and then run our tests. It noticed what had failed, the variations weren't being exported, so let's accept that and run the tests again.
[08:44] And now it looks like it needs to sync up the variations with what's being called in the test. The variations not being properly defined is something I would have expected it to get right, and I'm trying to think back to what I could have said or done differently, how I could have injected myself into the process earlier to avoid this, like explicitly saying: please make sure the variations are defined in the test. Essentially, every time you go through one of these agent loops you're going to be learning, and experience is the best teacher here. All right, let's accept all the changes and run the benchmark.
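The pattern I'm pushing the agent toward is one shared suite that runs over every implementation, rather than duplicated tests per variation. Something like this sketch; the variation module and its export names follow the earlier sketch and are assumptions:

```ts
// calculateOrderPrices.variations.test.ts (sketch): run the same assertions
// against the original and every variation; import names are hypothetical.
import { describe, it, expect } from 'vitest'
import { calculateOrderPrices } from './server'
import {
  calculateOrderPricesA,
  calculateOrderPricesB,
  calculateOrderPricesC,
} from './calculateOrderPricesVariations'

const implementations = {
  original: calculateOrderPrices,
  'variation A': calculateOrderPricesA,
  'variation B': calculateOrderPricesB,
  'variation C': calculateOrderPricesC,
}

for (const [name, impl] of Object.entries(implementations)) {
  describe(`calculateOrderPrices (${name})`, () => {
    it('returns a zero total for an order with no items', () => {
      const [priced] = impl([{ id: 'order-1', items: [] }])
      expect(priced.total).toBe(0)
    })
    // ...the rest of the shared scenarios go here...
  })
}
```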
[09:16] Oh, it didn't ask me to run the tests, so I'm gonna have to inject myself in there: always run the tests before running the benchmarks. This is starting to feel like one of those days where the AI model is just being dumb. I've seen it do much better work than this, so I'm seriously questioning if there's something going on on the back end with the Claude model right now.
[09:38] All right we'll accept the test changes, we'll accept the benchmark changes. Okay now let's run the tests. Still have lots of failures. The variations are still not being defined. Okay now we're exporting them so let's accept that.
[09:52] Often I'm about to go look and see what was wrong, but then it finds the problem before I even get a chance to, which puts me as the user in a weird space where it kind of beats me in a race to find what needs to be fixed. All right, so I accepted the changes to the tests, and we'll give it another try. The original implementation is working, and it's going to create new files for the variations. So we have a variations file, which we'll accept now. It looks like it has each of the different options, and now it's going to test the variations as well, so we'll run this to test them.
[10:23] And it looks like all the tests passed. I just want to check here to make sure it's actually importing variations. So it is bringing in each of the variations so that looks good. And if we look for implementations you can see it's running the implementations for the original. Is it running variation A?
[10:42] I don't think it is running variation A. I'm going to accept this, but I'm also going to say: in the test it doesn't look like it's running the variations, it looks like it's only running the original. Am I wrong? If I'm right, please correct it so that it runs the original and all the variations. We'll submit this.
[10:59] This is honestly a huge oversight based on what we were doing, and I'm beginning to think that the AI is just laughing at me because it knows I'm recording. We're going to run the tests. The tests failed, and this is good news for us: it recognizes variation B is failing a test and variation C is failing a test.
[11:16] So this is definitely saving me a ton of work right there. Let's accept the changes and run the command again. We're down to one failed test in variation B, so we can accept this, run the tests again, and everything passes. So now, if it remembers, it's time to benchmark.
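When the benchmark step does run, grouping the original and all the variations under one `describe` makes Vitest report them side by side, so "which is fastest" falls straight out of the output. A sketch, again with hypothetical names and a hypothetical fixture:

```ts
// calculateOrderPrices.variations.bench.ts (sketch): compare every
// implementation on identical input; names and fixture are hypothetical.
import { bench, describe } from 'vitest'
import { calculateOrderPrices } from './server'
import {
  calculateOrderPricesA,
  calculateOrderPricesB,
  calculateOrderPricesC,
} from './calculateOrderPricesVariations'

// One shared fixture so every implementation is measured on the same data.
const orders = Array.from({ length: 5_000 }, (_, i) => ({
  id: `order-${i}`,
  items: [{ sku: 'A', unitPrice: 10, quantity: (i % 5) + 1 }],
}))

describe('calculateOrderPrices: original vs variations', () => {
  bench('original', () => {
    calculateOrderPrices(orders)
  })
  bench('variation A', () => {
    calculateOrderPricesA(orders)
  })
  bench('variation B', () => {
    calculateOrderPricesB(orders)
  })
  bench('variation C', () => {
    calculateOrderPricesC(orders)
  })
})
```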
[11:36] So let's see what happens. I'm going to pop this out in case there's, yeah, I thought there might be a bunch of stuff running, and I'm honestly not sure if this has access to all of the results. So I'm gonna ask it: do you have access to all of the results from the benchmark? If yes, select the fastest variation and explain why you selected it. If it doesn't have access, I can just copy and paste from my terminal here.
[12:07] Looks like an API call failed, so we'll just run it again. Looks like the API is still failing, so let's just open a chat with the results to kind of wrap this up. We'll say add to chat, and I'm going to add the codebase. Based on the performance results from the terminal, please select the fastest version, whether it's the original or one of the variations, as the most performant; explain why it's the most performant version, and give a brief comparison to the other versions.
[12:38] So we'll hit submit here. All right, it looks like it has variation C as the winner, about 30% faster, and you can read through all the reasons why. In real time this took about 45 minutes with the agent (obviously this video is heavily edited), and we've got ourselves a huge performance win on this function, plus all of the testing and benchmarking set up, even though we ran into a bunch of various failures while using the agent.
[13:08] I don't really remember writing any code, and this saved me a huge amount of time and a huge amount of testing. It just required a little bit of patience and a little bit of reasoning when the agent went in the wrong direction.