The Data We're Using

Instructor: Tom Chant
Published a year ago
Updated a year ago

In this lesson, we explore the process of preparing data for fine-tuning a model using OpenAI's data preparation tool. Tom begins by introducing a dataset organized in a spreadsheet using CSV format and explains the simple structure of prompt and completion pairs. The dataset consists of various customer questions and corresponding service agent responses.
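
As a rough illustration of the two-column structure described above, here is a minimal sketch that reads prompt/completion CSV text and writes one JSON object per line (JSON Lines), which spares you from writing JSON by hand. The sample rows and the function name `csv_to_jsonl` are hypothetical and are not taken from the lesson's dataset.

```python
import csv
import io
import json

# Hypothetical sample rows; the lesson's real dataset has 38 pairs.
SAMPLE_CSV = '''prompt,completion
"Where is my order?","Please share your order number and we will check."
"Can I change my delivery address?","Yes, as long as the order has not shipped."
'''


def csv_to_jsonl(csv_text: str) -> str:
    """Convert two-column prompt/completion CSV text to JSON Lines."""
    # DictReader uses the first row (prompt,completion) as the keys.
    reader = csv.DictReader(io.StringIO(csv_text))
    lines = [
        json.dumps({"prompt": row["prompt"], "completion": row["completion"]})
        for row in reader
    ]
    return "\n".join(lines)


if __name__ == "__main__":
    print(csv_to_jsonl(SAMPLE_CSV))
```

Fields that contain commas are wrapped in double quotes, so the comma between the two columns stays unambiguous.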

The prompts can vary in complexity, including conversations with multiple dialogue parts. This flexibility allows for training chatbots effectively. The instructor provides a sample dataset and guides learners on how to download and adjust it for use in the terminal. This lesson serves as a foundational understanding of data preparation and sets the stage for fine-tuning a model using the given dataset.
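
To make the multi-part prompt idea concrete, here is a small sketch that assembles a summary and a short conversation into a single prompt that ends on an open agent turn, leaving the final reply for the completion. The labels `Summary:`, `Customer:` and `Agent:`, the newline separators, and the function name `build_prompt` are assumptions for illustration; the lesson's dataset may lay these out differently.

```python
from typing import List, Tuple


def build_prompt(summary: str, turns: List[Tuple[str, str]]) -> str:
    """Join a summary and (speaker, text) turns, ending on an open agent turn."""
    parts = [f"Summary: {summary}"]
    for speaker, text in turns:
        parts.append(f"{speaker}: {text}")
    # Finish with the agent label and a space; the completion holds the reply.
    parts.append("Agent: ")
    return "\n".join(parts)


if __name__ == "__main__":
    # Hypothetical conversation, not taken from the lesson's data.
    prompt = build_prompt(
        "Customer asks about a late delivery.",
        [
            ("Customer", "My parcel is late."),
            ("Agent", "Sorry about that. Do you have the order number?"),
            ("Customer", "It is 12345."),
        ],
    )
    print(prompt)
```

Single-question prompts and conversation-style prompts like this one can sit side by side in the same file, since both are still just a prompt paired with a completion.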

[00:00] Okay, let's take a look at the data we're going to use. Now, writing JSON out by hand is a pain, so I've organized this in a spreadsheet using comma-separated values, or CSV. All we've got here is two columns: prompt and completion. Each prompt

[00:16] is a question from a customer, and each completion is an answer from customer service. So we've just got two columns, I think we can work with that, and I've got 38 pairs of prompts and completions. Ideally I would have ten times that number or even more, but this small dataset is going to allow us to see the

[00:36] principle of how fine-tuning works. Now, I've also pasted the CSV data into this file right here, and it doesn't look like it's formatted, but actually CSV formatting is really simple. We've got the two column headings right here, and then each line contains one prompt-completion pair. So this is the prompt

[00:56] right here, and this is the completion after the comma. Now, I just want to draw your attention to the last few pairs, because here I've done something which is just a little bit more complex. It's a little bit tricky to see, so I'm just going to take this last prompt-completion pair and space it out a

[01:14] little bit. Now, instead of just having one question and one answer, the prompt actually consists of several parts. The first part is a summary, then we've actually got a short conversation. So we've got something that the customer

[01:29] has asked, and then we've got the agent's reply, and then the customer has responded to the agent, and then we finish with the agent label and a space, and then we get the completion at the very end. So basically, all of that is the

[01:46] prompt, and that is the completion. Now, I've done that just to show you that these prompts don't have to just be one question and one answer. They can actually involve lots and lots of dialogue, so if you have got a lot of customer service data, you can format it in this way, with the completion just

[02:04] being the final answer to the query, and that is going to really help train your chatbot. Now, you don't need to stick to one data style. As long as you're working with prompts and completions, you can mix, as I've done here. Some of the prompts will be one question, and some of the prompts will actually be whole

[02:22] conversations. Okay, let me just put this data back as it was. Now, as we're going to be working with this data in the terminal, you need to download it or have some data of your own in a similar format. You can download it just by clicking on this slide, and that's going to take you through to a Google Sheets

[02:39] version, which you can save a local copy of. Or you can come down here to this cog icon, and when you click on that it will bring up a menu. Click download a zip, which will give you a zipped folder, and when you unzip it you'll see all of the files from this project. The one you're interested in, of course, is the we wing

[02:58] it data CSV, so I'm going to take that file and save it in my apps folder in a file called we wing it, and there we are: there is my data, ready for the next step. Okay, we've got some work to do with this data before we can actually upload it

[03:13] and fine-tune a model. We are going to use OpenAI's data preparation tool to do that for us, but before we can do that we need to set up our command line interface environment, so next let's open up the terminal and tackle that.
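
If you are bringing your own data rather than the lesson's file, a quick sanity check of the CSV before moving on can save time. The sketch below is not part of OpenAI's tooling; `check_pairs` is a hypothetical helper that only confirms the header is `prompt,completion` and that every row has both cells filled in.

```python
import csv
import io

EXPECTED_HEADER = ["prompt", "completion"]


def check_pairs(csv_text: str) -> int:
    """Return the number of prompt/completion pairs, raising on malformed rows."""
    reader = csv.reader(io.StringIO(csv_text))
    header = next(reader, None)
    if header != EXPECTED_HEADER:
        raise ValueError(f"unexpected header: {header}")
    count = 0
    # Row 1 is the header, so data rows are numbered from 2.
    for row_no, row in enumerate(reader, start=2):
        if not row:
            continue  # csv.reader yields an empty list for blank lines
        if len(row) != 2 or not row[0].strip() or not row[1].strip():
            raise ValueError(f"row {row_no} is not a complete pair")
        count += 1
    return count


if __name__ == "__main__":
    sample = 'prompt,completion\n"Where is my order?","Let me check."\n'
    print(check_pairs(sample))
```

To check a file on disk, read it with `open(path, newline="", encoding="utf-8").read()` and pass the text in; for the lesson's dataset the count should come out as 38.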