CLI 2: Data Preparation Tool

Instructor: Tom Chant
Published a year ago
Updated a year ago

In this lesson, we learn how to prepare data for fine-tuning using the OpenAI CLI. The process involves navigating to the folder where the data is stored and running the CLI's fine-tunes data preparation tool on the data file. A missing-dependency error (pandas) is resolved by installing the package with pip. The CLI reports on the formatting of the data, including the separator that ends each prompt, the single whitespace character that starts each completion, and the stop sequence that ends each completion. The tool applies these formatting changes automatically and converts the file to JSONL, saving time and effort.

[00:00] Okay, so we've got the OpenAI CLI up and running and we've given it our API key. Now let's use it to prepare our data. And again, I've listed the CLI commands that I'm going to use in this file right here. So our first task is to cd to the folder where we've stored our data.

[00:18] Now I've saved my data in a folder called we-wingit in apps, which is in documents. So let's head back to the terminal and I'm going to say cd documents/apps/we-wingit. cd here stands for change directory.

[00:34] And of course, you'll need to navigate to wherever it is you saved your data file. Okay, having navigated to the correct folder, we can start our fine-tune preparation. So we do that with this command. And what we're doing here is telling the terminal to use the OpenAI fine-tunes tool to prepare our data.

[00:54] Now this -f flag is going to identify the file of data that we want to prepare. Now my data was called we-wingit-data.csv. So what I need to do is just put that right there on the end. Now at this point, you might run into a problem. The first few times I did this, it worked fine.
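The command itself is shown on screen but is not captured in this transcript. As a rough sketch, assuming the legacy (pre-1.0) openai Python package, whose CLI exposed the data preparation tool as `openai tools fine_tunes.prepare_data`, the snippet below assembles that command line for a given file and prints it rather than running it. The helper name `prepare_data_command` is hypothetical and only there for illustration.

```python
import shlex


def prepare_data_command(data_file: str) -> list[str]:
    """Build the argument list for the legacy OpenAI CLI data preparation tool.

    Assumption: the pre-1.0 `openai` package is installed; -f names the
    local data file to prepare.
    """
    return ["openai", "tools", "fine_tunes.prepare_data", "-f", data_file]


# Print the command as it would be typed in the terminal.
command = prepare_data_command("we-wingit-data.csv")
print(shlex.join(command))
```

To actually run it from Python, the same list could be passed to `subprocess.run(command, check=True)`; typing the printed line into the terminal, as in the lesson, is equivalent.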

[01:12] And then suddenly at this point, I started to hit an error. Well, that's the nature of working with new technology. Things can change. And the error I got was this, missing pandas. Well, if OpenAI wants pandas, OpenAI gets pandas.

[01:28] So what you need to do then is say pip install openai pandas. Once you've done that, hit enter, let it do its thing, and then we can try the data preparation task once again. So this is the exact same command. This time it works and it gives us some information.
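If you would rather find out whether pandas is missing before running the tool, a quick check along these lines may help. This is only a sketch using the standard library; the function name `missing_packages` is made up for this example, and it only handles top-level package names.

```python
import importlib.util


def missing_packages(names: list[str]) -> list[str]:
    """Return the top-level packages from `names` that cannot be imported."""
    return [name for name in names if importlib.util.find_spec(name) is None]


missing = missing_packages(["pandas"])
if missing:
    # Suggest the same kind of pip command used in the lesson.
    print("Install first: pip install " + " ".join(missing))
else:
    print("pandas is available")
```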

[01:47] It knows our file is formatted as a CSV file. It also complains about the number of prompt-completion pairs we're using and says, "In general, we recommend having at least a few hundred examples." Well, we know that already, but this is for demonstration purposes, so it's absolutely fine.

[02:03] Now I'm going to save you reading the rest of this text here because actually it's telling us something we already know. We talked about the format of our data and all of the features that we need. And we know that we need a separator to inform the model when the prompt ends and the completion begins. We know that each completion needs to start with a single white space,

[02:22] and we know that each completion should end with a stop sequence to inform the model when the completion has ended. But this tool is great because it's basically going to do everything for us. So down at the bottom, it's already told us that it's necessary to convert this to JSONL. That's absolutely fine.
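The three formatting rules just described can be sketched in a few lines of Python. This is not the tool's own code, just an illustration of the rules. The separator `" ->"` is an assumption (the transcript does not name the one used), the newline stop sequence is the one accepted in the prompts that follow, and the sample pair is made up.

```python
import json

SEPARATOR = " ->"  # assumed separator; the transcript does not say which one is used
STOP = "\n"        # newline stop sequence, as recommended by the tool in this lesson


def format_pair(prompt: str, completion: str) -> dict[str, str]:
    """Apply the three formatting rules to one prompt-completion pair."""
    return {
        # Rule 1: the prompt ends with a separator.
        "prompt": prompt.rstrip() + SEPARATOR,
        # Rule 2: the completion starts with a single space.
        # Rule 3: the completion ends with the stop sequence.
        "completion": " " + completion.strip() + STOP,
    }


# Made-up example pair; one JSON object per line is what makes a file JSONL.
pair = format_pair("Where is my order?", "Let me check that for you.")
print(json.dumps(pair))
```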

[02:40] It's recommending that we add a separator suffix to all of our prompts. Well, let's say yes to that. Now it's recommending adding a newline character as an ending suffix to all of the completions. Let's say yes to that and press enter. And now it's recommending that we add a whitespace character to the beginning of all of the completions.

[02:59] Again, let's say yes. Now we just need to confirm that we're ready to proceed. And it goes through its process and it tells us that it has created this file for us. And it invites us to take a look, which is a really good idea.

[03:15] If you go back to your folder, wherever you stored your data, you should see this prepared JSONL file waiting for you. Now, just so we can see this data, I'm going to copy it into a file in the editor. And there we are. And you can see that it's added the prompt key and completion key to each pair.

[03:33] It's added whitespace before the start of every completion. It's added a newline character at the end of every completion. And of course, it's added a separator at the end of every prompt. And we could have done all of that by hand and just used the tool to check. But wow, what a lot of boring work that would have been.
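If you want to confirm those three additions yourself, a small checker like the one below could be run over the lines of the prepared file. It is a sketch, not part of the OpenAI tooling: the function name `check_prepared` is hypothetical, the `" ->"` separator is an assumption, and the two sample lines are made up (the second is deliberately malformed).

```python
import json


def check_prepared(lines, separator=" ->", stop="\n"):
    """Return a list of formatting problems found in prepared JSONL lines."""
    problems = []
    for number, line in enumerate(lines, start=1):
        record = json.loads(line)
        prompt = record.get("prompt", "")
        completion = record.get("completion", "")
        if not prompt.endswith(separator):
            problems.append(f"line {number}: prompt lacks separator")
        if not completion.startswith(" "):
            problems.append(f"line {number}: completion lacks leading space")
        if not completion.endswith(stop):
            problems.append(f"line {number}: completion lacks stop sequence")
    return problems


# Made-up sample: the first line is well formed, the second is not.
sample = [
    '{"prompt": "Where is my order? ->", "completion": " Let me check.\\n"}',
    '{"prompt": "Can I cancel?", "completion": "Yes.\\n"}',
]
for problem in check_prepared(sample):
    print(problem)
```

An open file object can be passed in place of `sample`, since iterating over a file yields one line at a time.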

[03:52] OK, next we need to fine-tune our model on this data. So let's come on to that in the next scrim.
