Generating & Preparing JSONL Training Data with GPT for Fine Tuning an OpenAI Model

Instructor: Colby Fayock
Published a year ago
Updated a year ago

Using the models that OpenAI provides will get you really far, but their knowledge can be limited for particular information, or for anything that happened after the cutoff date of the data the model was trained on.

To improve the experience, we can train models to have a better understanding of the data we want to work with, whether that's documentation for a new developer tool or guiding the model to respond closer to how we expect.

The problem is that you need a lot of data in a specific format to do this, and it's not very practical to write it all out by hand. Instead, we can use GPT: by providing our data, we can ask it to generate JSONL training data for us, formatted in a way that we can use to fine tune our model.

Instructor: [0:00] When working with different AI models, you're limited to the set of data that the models were trained on. Fine tuning gives us a way to customize these models in a way that's going to train it on the data that we want it to understand and know. But before we're able to create a fine tuning mechanism, we first need to have our data prepared in a way that the fine tune will actually understand.

[0:17] In particular, we're going to use the JSONL syntax where each line will contain its own JSON object, complete with a prompt and a completion. Now to start, I created a bunch of fake data, which is something that OpenAI wouldn't know about, which is a bunch of sci-fi characters.
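As a rough illustration of the JSONL shape described here, the sketch below builds two lines and reads them back. The question and answer text is paraphrased from the lesson's Xander Prime example; the exact wording in the real file will differ.

```javascript
// Each training example is one JSON object with a prompt and a completion.
const lines = [
  JSON.stringify({ prompt: 'What species is Xander Prime?', completion: 'Cyborg' }),
  JSON.stringify({ prompt: 'What does Xander Prime do for a living?', completion: 'Mercenary' }),
];

// JSONL is just those objects stacked one per line.
const jsonl = lines.join('\n');
console.log(jsonl);

// Reading it back: split on newlines and parse every line on its own.
const examples = jsonl.split('\n').map((line) => JSON.parse(line));
console.log(examples.length);
```

Note that the file as a whole is not valid JSON; only each individual line is.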

[0:31] To do this, I used methods similar to what we did in past lessons where I created a specific prompt that's going to ask ChatGPT to create that character data in JSON format. If you want to follow along or try this out for yourself, you can find a link to the code on the lesson video page.

[0:46] Now before we dive in too far with our specific example, you should know that OpenAI might recommend actually using embeddings rather than using fine tuning depending on the specific use case. You should also know, the more prompts you're able to give as part of your example, the better results you're going to have with your trained model.

[1:02] For the sake of this example, I'll be using a smaller data set so I'll be limited in the amount of prompts and training data I'll be able to produce. Starting off, while you're probably able to set this up inside of a serverless function, this is really going to be a one-time use thing unless we're building a service around this.

[1:17] We're going to actually just start off by building a script. Inside of my scripts directory, I'm going to create a new file called generate training data.js. Inside that file, I added some boilerplate code including requiring my local environment variable file. I'm going to require the OpenAI SDK. I'm going to configure it and I'm going to create a new instance of OpenAIApi.

[1:38] As far as what's inside of the script, I'm creating a new asynchronous function called run, which I immediately invoke, and inside it I'm going to use the createChatCompletion method in order to send my message.

[1:47] Now, as is the same for a lot of the chat completion lessons we've worked through, I'm going to first paste in the first line of our prompt where we're going to first say, you are an assistant that generates JSONL prompts based off of JSON data for fine tuning.

[2:01] What we're going to want to do is describe what we have and what we want GPT to do with the data to ultimately return a result. Let's start off by saying what we want our response to look like. I'm going to say, each response should be formatted as, and make sure I add that space, where I'm going to create a new variable where inside I'm going to JSON.stringify.

[2:21] I'm going to create a new object where I'm going to define my prompt and my completion. I can easily automate this because I already have my characters array so I can actually use a real example. I'm going to add my const characters and set that equal to require.

[2:36] I'm going to go up a level and then I'm going to navigate to /source/data/characters.json. For the question, I can say, "What species is Xander Prime?" and I can say, "They're a cyborg."

[2:46] For my prompt, I'll say, "What species is characters[0].name?" with a question mark, and then for the answer it's going to be characters[0].species. Then I'm going to say, "Please generate 10 questions based off of the JSON and provide it in JSONL format."

[3:09] Then we'll say, each response should come from the following JSON, where now I can use JSON.stringify again, and pass in that first character as a sample. To see how this works, I'm going to console.log out that response.
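The prompt assembled over the last few steps can be sketched roughly as follows. This is not the lesson's exact code: the buildMessages helper, the split into system and user messages, and the two-field character record are assumptions made for illustration.

```javascript
// Hypothetical character record; the real one is loaded from characters.json.
const sampleCharacter = { name: 'Xander Prime', species: 'Cyborg' };

// Build the chat messages that ask GPT for JSONL training examples.
// `example` supplies the sample line, `character` supplies the source data.
function buildMessages(character, example) {
  const format = JSON.stringify({
    prompt: `What species is ${example.name}?`,
    completion: example.species,
  });

  return [
    {
      role: 'system',
      content: 'You are an assistant that generates JSONL prompts based off of JSON data for fine tuning.',
    },
    {
      role: 'user',
      content: [
        `Each response should be formatted as: ${format}`,
        'Please generate 10 questions based off of the JSON and provide it in JSONL format.',
        `Each response should come from the following JSON: ${JSON.stringify(character)}`,
      ].join('\n'),
    },
  ];
}

const messages = buildMessages(sampleCharacter, sampleCharacter);
console.log(messages[1].content);
```

Building the sample line with JSON.stringify from real data keeps the example in the prompt guaranteed to be valid JSON, which gives the model a correct pattern to copy.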

[3:24] If I run, node scripts generate training data, we can see that I get a response that's just that, where I have my prompt and my completion, "What does Xander Prime do for a living?" and it looks like he's a mercenary that works for the highest bidder. I can even look inside of my actual character data, and I can find where we see that, "Xander is a mercenary that works for the highest bidder."

[3:45] As we've also seen in previous lessons, being able to generate responses in the exact format you want can sometimes be hit or miss. Now here, I seem to have had a success with generating all of my prompts and completion messages, so it seems like we're OK here. Just keep in mind that you might have to massage the prompt a little bit, in order to get the response that you want.

[4:05] Now that I have this, I ultimately want to create this for all of my characters, not just Xander Prime. I'm going to abstract this into a function, so, async function generatePrompts, where I'm going to take an argument of character, where inside, I'm going to grab that API call and I'm going to paste it right into that function.

[4:23] I want to, ultimately, return that same data that I was returning before, so I'm going to return that as my result. I also want to make sure that I update the character that I'm generating a response from to that single character, not just an instance from that array. Now I can loop through all my characters.

[4:38] Let's say, for const character of characters, I can say that I want to, const create a prompt, where I'm going to use my generatePrompts for that particular character. Ultimately, I want to start to collect all these prompts so that I can store inside of a file which I'll later use for my fine-tuning.

[5:01] I'm going to create a new constant of allPrompts, I'm going to set that equal to a new array, where I will then push into allPrompts my new prompts. Every time we have a new set of prompts, I'm going to update that file.
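A minimal sketch of the function and loop described above, under a few assumptions: the client is passed in as an argument, it behaves like the v3 openai SDK's OpenAIApi instance (createChatCompletion resolving to an object whose data.choices[0].message.content holds the text), and the model name and shortened prompt are placeholders. The generateAll wrapper and the fake client are hypothetical, added so the sketch runs without a network call.

```javascript
// Ask the model for JSONL prompts for a single character.
async function generatePrompts(client, character) {
  const completion = await client.createChatCompletion({
    model: 'gpt-3.5-turbo', // assumed model name
    messages: [
      // Shortened stand-in for the full prompt built earlier in the lesson.
      { role: 'user', content: `Generate JSONL prompts from: ${JSON.stringify(character)}` },
    ],
  });
  return completion.data.choices[0].message.content;
}

// Loop through every character one at a time and collect the results.
async function generateAll(client, characters) {
  const allPrompts = [];
  for (const character of characters) {
    const prompts = await generatePrompts(client, character);
    allPrompts.push(prompts);
    console.log(`Generated prompts for ${character.name}`); // simple progress log
  }
  return allPrompts;
}

// Stand-in client that mimics the assumed response shape.
const fakeClient = {
  async createChatCompletion() {
    const content = JSON.stringify({ prompt: 'Q', completion: 'A' });
    return { data: { choices: [{ message: { content } }] } };
  },
};

const collected = generateAll(fakeClient, [{ name: 'Xander Prime' }, { name: 'Another Character' }]);
collected.then((allPrompts) => console.log(allPrompts.length));
```

Awaiting inside the for...of loop sends one request at a time, which is slower than running them in parallel but keeps the order of results predictable.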

[5:17] To do that, I'm going to first import fs from the fs package, and I'm going to grab the promises version. I'm going to run, await fs.writeFile. Let's say, we want to add that file to /source/data/prompts.jsonl.

[5:34] For the data inside, I'm going to simply pass in allPrompts, but I'm going to join that with a newline, so that each new set of prompts stacks on top of the others every time one is added.
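The write step might look something like the sketch below. The writePrompts helper is hypothetical, and the output path points at the operating system's temp directory rather than the project's data folder so the sketch can be run without touching real files.

```javascript
const fs = require('fs').promises;
const os = require('os');
const path = require('path');

// Rewrite the whole file from everything collected so far,
// joining each set of prompts with a newline.
async function writePrompts(filePath, allPrompts) {
  await fs.writeFile(filePath, allPrompts.join('\n'));
}

// Hypothetical location; the lesson writes into its own data directory.
const outputPath = path.join(os.tmpdir(), 'prompts.jsonl');

const written = writePrompts(outputPath, [
  '{"prompt":"Q1","completion":"A1"}',
  '{"prompt":"Q2","completion":"A2"}',
]);
written.then(() => console.log(`Wrote ${outputPath}`));
```

Because writeFile replaces the file contents each time, calling it inside the loop with the full array keeps the file in sync with what has been generated so far.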

[5:48] Let's test this out to see what this looks like. If I run that same script again, it's definitely going to take a lot longer this time because it's going to be running that script for every single one of my characters. Keep in mind, if you have a huge data set, it's going to do that for each of them. It's going to take a little bit of time to run through all this.

[6:05] If you want to make sure you see some progress so that you know that it's working, you can always log out the prompts every time they're created, or the character name, or something to give you an idea of the progress, that it's working through those files.

[6:17] Once it's finished, we can now head back to our editor. I'm going to open up that data file, our prompts.jsonl, and we can see that we have all of our prompts listed in here.

[6:26] If you notice here, we do have an issue where it looks like we have some extra spacing. That's probably going to be deemed as a bad file if we're trying to upload this sample data. We want to make sure that we do some light cleanup on this to make sure that we have valid data for everything that we're sending.

[6:42] OpenAI provides a CLI in Python that allows you to prepare your data, but we can just use some simple rules to get a "what's probably going to work" version of our file.

[6:51] Inside of my generatePrompts function, instead of sending back those prompts right away, I'm going to first take that completion data, and let's say, const prompts = completion. Then I'm going to start to filter out these prompts and use it to break up the data and validate each one.

[7:06] The first thing I'm going to do is return and split those prompts. I'm going to split it by a newline, so I'm going to have an array, where each item in that array is going to be one of the lines of my JSONL file.

[7:18] I'm going to then create a map statement. For each of those prompts, I'm going to say that I want to use a try-catch statement. I'm going to try to JSON.parse() that prompt. Looks like I spelled that wrong. That way, I can make sure that it's valid JSON, which needs to be for every single line.

[7:36] If it works, I'm going to simply return that prompt. I'm going to also run the trim statement to make sure we get rid of some of that empty whitespace on the sides. If it doesn't work, we're going to catch that error and we can even log a message, such as, "Bad data."

[7:49] We can log the actual prompt if we want, but ultimately, return from that try-catch so that I can now add a filter statement, where I'm going to say, for each of those prompts, if my prompt is a truthy value, it's going to filter out all the bad data from that. Then finally, to bring it back all together, I'm going to now join all that data with a new line.
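Put together, the cleanup steps could be sketched like this. The cleanPrompts name and the sample input are made up for illustration, and skipping blank lines before parsing is an assumption added to avoid logging them as bad data.

```javascript
// Split the raw completion into lines, keep only lines that parse as JSON,
// trim stray whitespace, and join everything back into JSONL.
function cleanPrompts(raw) {
  return raw
    .split('\n')
    .map((line) => {
      const trimmed = line.trim();
      if (!trimmed) return undefined; // blank line, nothing to validate
      try {
        JSON.parse(trimmed); // throws if the line is not valid JSON
        return trimmed;
      } catch (error) {
        console.log('Bad data', line);
        return undefined;
      }
    })
    .filter((line) => !!line) // drop everything that failed validation
    .join('\n');
}

const messy = '  {"prompt":"Q1","completion":"A1"}\n\nnot json\n{"prompt":"Q2","completion":"A2"} ';
const cleaned = cleanPrompts(messy);
console.log(cleaned);
```

This only checks that each line is valid JSON. It does not confirm that a line actually has prompt and completion keys, so a stricter check may be worth adding for real training data.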

[8:11] Let's try running that script again. We can see that, while in my example, all my data looks pretty good so we shouldn't really see many changes, what we do see is we no longer have that spacing in front of the Thaxos the Magnificence answers.

[8:25] What we just did was generate all these different prompts, the questions and answers, based off of that character sheet we had, without having to manually go through and ask all these ourselves.

[8:35] Again, it should be noted that in my example, we did use a limited set of data, so we're going to have a limited set of prompts that we're going to be able to use for our fine-tuning, where generally, you want to make sure that you have a couple hundred of these, if not as many as you possibly can, because the more you have, the better the results.

[8:50] We can still see this as an example of how we can generate these programmatically.

[8:53] In review, in order to prepare our training data, we can take advantage of another OpenAI tool, GPT, in order to automatically generate all of our questions, or prompts and completions, based off of the data that we have.

[9:05] By using the createChatCompletion method, we can ask GPT to go through and create that JSONL set of prompts, where after running that for each of our data points, we can then dump that into a JSONL file, which we can later upload for our fine tuning.
