Create Synthetic Datasets with GPT
A new script to generate and label synthetic training data with GPT-3.5 Turbo
Natural Language Processing (NLP) has witnessed tremendous growth in recent years, with various applications ranging from sentiment analysis to machine translation. However, one critical challenge developers often face is insufficient training data to build and fine-tune their models. High-quality, labeled data is essential for model performance, but obtaining it can be time-consuming and costly.
To address this challenge, we are excited to announce the release of a Synthetic Data Generator script that leverages the power of OpenAI's GPT-3.5 Turbo. This script generates random, diverse, and labeled comments that can be used to train NLP classifiers, effectively solving the problem of scarce training data. The Synthetic Data Generator is available on GitHub.
In this article, we provide an overview of the problem the Synthetic Data Generator solves and discuss its importance in NLP model training. With GPT-3.5 Turbo, developers can create synthetic data to build and fine-tune classifiers without extensive data collection efforts. This approach has the potential to revolutionize how we train NLP models, making them more accessible and efficient for developers across the globe.
Overview
The Synthetic Data Generator is a Python script designed to create diverse, labeled data for training NLP classifiers using OpenAI's GPT-3.5 Turbo. The script simplifies generating synthetic data by automating the API calls and organizing the generated text and labels in a CSV file.
The main components of the script include:
Setting up the environment by signing up for an OpenAI API key and configuring the .env file
Defining the prompt and other global variables that determine the data generation process
Making API calls to GPT-3.5 Turbo to generate the synthetic data
Processing and validating the generated data
Saving the synthetic data, along with their labels, to a CSV file (output.csv) in the same directory (a minimal sketch of this end-to-end flow follows below)
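For the curious, here is a minimal sketch of what that end-to-end flow can look like. This is not the actual script: names like PROMPT, NUM_BATCHES, and OUTPUT_FILE are placeholders for illustration, and the sketch assumes the pre-1.0 openai Python package together with python-dotenv.

# Minimal sketch of the generate-and-save loop, not the actual script.
# PROMPT, NUM_BATCHES, and OUTPUT_FILE are hypothetical names used for
# illustration; see the repository for the real configuration.
import csv
import os

import openai
from dotenv import load_dotenv

load_dotenv()  # reads OPENAI_API_KEY from the .env file
openai.api_key = os.getenv("OPENAI_API_KEY")

PROMPT = (
    "Generate 10 random and diverse sentences in CSV format, each followed "
    "by a comma and a label indicating if it contains a suggestion (1) or not (0)."
)
NUM_BATCHES = 5
OUTPUT_FILE = "output.csv"

rows = []
for _ in range(NUM_BATCHES):
    response = openai.ChatCompletion.create(
        model="gpt-3.5-turbo",
        messages=[{"role": "user", "content": PROMPT}],
        temperature=1.0,  # a higher temperature encourages more diverse outputs
    )
    text = response["choices"][0]["message"]["content"]
    for line in text.strip().splitlines():
        # Expect "some sentence,label"; split on the last comma only.
        sentence, _, label = line.rpartition(",")
        sentence, label = sentence.strip().strip('"'), label.strip()
        if sentence and label in ("0", "1"):  # drop malformed rows
            rows.append((sentence, label))

with open(OUTPUT_FILE, "w", newline="") as f:
    writer = csv.writer(f)
    writer.writerow(["text", "label"])
    writer.writerows(rows)

Splitting on the last comma matters because the generated sentences can themselves contain commas; only the final field is treated as the label.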
The Synthetic Data Generator offers several advantages:
Reduces the need for manual data collection, saving time and resources
Provides customizable prompts and configurations to generate diverse and labeled data tailored to specific use cases
Enables a seamless process of generating and saving synthetic data for training NLP classifiers
To use the script, follow the setup instructions provided in the README, modify the global variables as necessary, and run the main.py script. The generated synthetic data will be saved in a CSV file named output.csv, which can be used to train your NLP classifier!
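Once output.csv exists, loading it for training is straightforward. For example, with pandas (the text and label column names match the sketch above and are my assumption, not a guarantee about the script's exact output):

import pandas as pd

# Assumes output.csv has "text" and "label" columns, as in the sketch above.
df = pd.read_csv("output.csv")
print(df["label"].value_counts())  # quick sanity check on class balance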
Limitations
While the Synthetic Data Generator offers a convenient way to produce labeled data for training NLP classifiers, it is important to be aware of its limitations. This section discusses some potential drawbacks and challenges that users may encounter while using the script.
Overfitting due to repeated data: Since the script relies on GPT-3.5 Turbo for generating synthetic data, there is a possibility that some generated data points may be repeated or very similar. This repetition can lead to overfitting during model training, as the classifier may learn to recognize specific patterns in the synthetic data rather than generalize to new, unseen data.
Limited diversity of generated data: The diversity of the generated data largely depends on the prompt used in the script. If the prompt is not designed to encourage a wide range of responses, the generated data may not be diverse enough to capture the full spectrum of possible inputs, limiting the classifier's performance on real-world data.
Dependence on GPT-3.5 Turbo: The quality and relevance of the generated data are directly tied to the performance of GPT-3.5 Turbo. While this language model is known for its impressive capabilities, it may still produce unexpected or irrelevant results in certain cases. Consequently, the quality of the synthetic data may not be consistent across different use cases.
Cost of API calls: Generating synthetic data with GPT-3.5 Turbo requires making multiple API calls, which can become expensive depending on the number of calls and OpenAI's current API pricing. A rough cost estimate is sketched below.
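Before a large run, a quick back-of-the-envelope estimate helps. The per-token rates below are deliberately left as placeholders; fill them in from OpenAI's current pricing page:

# Back-of-the-envelope cost estimate. The rates below are placeholders,
# not real prices -- fill them in from OpenAI's current pricing page.
PRICE_PER_1K_INPUT = 0.0     # USD per 1K prompt tokens (fill in)
PRICE_PER_1K_OUTPUT = 0.0    # USD per 1K completion tokens (fill in)

num_calls = 100          # how many API calls you plan to make
prompt_tokens = 60       # rough token count per prompt
completion_tokens = 300  # rough token count per 10-sentence response

cost = num_calls * (
    prompt_tokens / 1000 * PRICE_PER_1K_INPUT
    + completion_tokens / 1000 * PRICE_PER_1K_OUTPUT
)
print(f"Estimated cost: ${cost:.2f}")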
To mitigate these limitations, users should:
Monitor the generated data for repetitions and similarities, and remove or modify duplicate entries to reduce the risk of overfitting (see the deduplication sketch after this list).
Experiment with different prompts to encourage diverse responses from GPT-3.5 Turbo and ensure a more comprehensive dataset.
Manually review and curate the generated data to ensure its quality and relevance for the intended use case.
Balance the number of API calls based on the available budget and the desired dataset size to manage costs effectively.
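For the first point, even a simple exact-duplicate filter goes a long way. Here is one minimal approach over the output.csv produced above; a fuzzier check (for example, embedding similarity) would also catch paraphrases:

import csv

# Drop exact duplicates (case- and whitespace-insensitive) from output.csv.
seen = set()
unique_rows = []
with open("output.csv", newline="") as f:
    reader = csv.reader(f)
    header = next(reader)
    for text, label in reader:
        key = " ".join(text.lower().split())
        if key not in seen:
            seen.add(key)
            unique_rows.append((text, label))

with open("output_deduped.csv", "w", newline="") as f:
    writer = csv.writer(f)
    writer.writerow(header)
    writer.writerows(unique_rows)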
Examples
The Synthetic Data Generator script can be adapted for various NLP classifier training tasks by customizing the prompt and configuration parameters. Here are two examples of how this script could be used to generate synthetic data for different purposes:
Detecting Geographic Directions in Text:
To create a dataset for training an NLP classifier that detects whether a given text contains geographic directions, modify the prompt in the script as follows:
"Generate 10 random and diverse sentences in CSV format, each followed by a comma and a label indicating if it contains geographic directions (1) or not (0). "
The generated data can then be used to train a classifier that identifies text with geographic directions, which could be helpful in applications like navigation assistance, location-based recommendations, or parsing travel itineraries.
Identifying Constructive Criticism in Text:
To generate synthetic data for training an NLP classifier that recognizes constructive criticism, adjust the prompt in the script like this:
"Generate 10 random and diverse sentences in CSV format, each followed by a comma and a label indicating if it contains constructive criticism (1) or not (0). "
The resulting dataset can be employed to train a classifier that detects constructive criticism in various contexts, such as moderating online forums, analyzing customer feedback, or improving communication in collaborative environments.
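In both examples, the only change is the prompt string. Assuming the hypothetical PROMPT global from the sketch in the Overview, switching tasks looks like this:

# Swap the task by swapping the prompt; everything else stays the same.
# PROMPT is the hypothetical global from the sketch in the Overview.
PROMPT = (
    "Generate 10 random and diverse sentences in CSV format, each followed "
    "by a comma and a label indicating if it contains constructive "
    "criticism (1) or not (0)."
)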
These examples demonstrate the flexibility of the Synthetic Data Generator script in generating labeled data for different NLP classifier training tasks. By customizing the prompt and other parameters, developers can easily generate synthetic data tailored to their specific use cases, streamlining the process of training NLP classifiers for a wide range of applications.
OK Quentin, how well does this work?!
As you might know, I’m working on a side project called Sentivibe which helps YouTubers improve their channel by analyzing their comment section for keywords, themes, and suggestions.
I trained a simple NLP model (in PyTorch) on this dataset to test it. Specifically, I fine-tuned distilbert-base-uncased from HuggingFace because it's small and fast.
This model classifies whether a given text could be considered a suggestion (1) or not (0).
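For context, here is a condensed sketch of that fine-tuning setup. The file name, train/validation split, and hyperparameters are illustrative rather than my exact configuration; the losses below are from my actual run.

import pandas as pd
import torch
from torch.utils.data import DataLoader, Dataset
from transformers import AutoModelForSequenceClassification, AutoTokenizer

# Illustrative sketch: file name, split, and hyperparameters are examples.
df = pd.read_csv("output.csv")
train_df = df.sample(frac=0.8, random_state=42)
val_df = df.drop(train_df.index)

tokenizer = AutoTokenizer.from_pretrained("distilbert-base-uncased")
model = AutoModelForSequenceClassification.from_pretrained(
    "distilbert-base-uncased", num_labels=2
)

class CommentDataset(Dataset):
    def __init__(self, frame):
        self.enc = tokenizer(
            list(frame["text"]), truncation=True, padding=True, return_tensors="pt"
        )
        self.labels = torch.tensor(frame["label"].tolist())

    def __len__(self):
        return len(self.labels)

    def __getitem__(self, i):
        item = {k: v[i] for k, v in self.enc.items()}
        item["labels"] = self.labels[i]
        return item

train_loader = DataLoader(CommentDataset(train_df), batch_size=16, shuffle=True)
val_loader = DataLoader(CommentDataset(val_df), batch_size=16)

optimizer = torch.optim.AdamW(model.parameters(), lr=2e-5)
for epoch in range(5):
    model.train()
    train_loss = 0.0
    for batch in train_loader:
        optimizer.zero_grad()
        out = model(**batch)  # returns loss when "labels" is in the batch
        out.loss.backward()
        optimizer.step()
        train_loss += out.loss.item()
    model.eval()
    val_loss = 0.0
    with torch.no_grad():
        for batch in val_loader:
            val_loss += model(**batch).loss.item()
    print(f"Epoch {epoch + 1}/5")
    print("-" * 10)
    print(f"Train loss: {train_loss / len(train_loader)}")
    print(f"Validation loss: {val_loss / len(val_loader)}")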
Epoch 1/5
----------
Train loss: 0.11190189626067877
Validation loss: 0.6981576879819235
Epoch 2/5
----------
Train loss: 0.08233773624524474
Validation loss: 0.7425488233566284
Epoch 3/5
----------
Train loss: 0.0620751996524632
Validation loss: 0.9356076419353485
Epoch 4/5
----------
Train loss: 0.07985222465358674
Validation loss: 0.8932865262031555
Epoch 5/5
----------
Train loss: 0.04594298060983419
Validation loss: 0.929664134979248
Alas, we’re overfitting this dataset 😞 No biggie, let’s stick with one epoch and see what happens when I test the model with sample text…
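The spot check itself is only a few lines, continuing from the fine-tuning sketch above (model and tokenizer are reused). The comment text here is a placeholder, since the point is the shape of the check rather than the exact text:

import torch

# Placeholder text: substitute the actual YouTube comment being tested.
comment = "You should add chapters to your videos so topics are easier to find."

model.eval()
inputs = tokenizer(comment, return_tensors="pt", truncation=True)
with torch.no_grad():
    logits = model(**inputs).logits
probs = torch.softmax(logits, dim=-1)[0]
# Index 1 is assumed to be the "suggestion" class; e.g. 0.98 -> "98% certainty".
print(f"Suggestion probability: {probs[1].item():.2f}")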
OK, NOT BAD. This is a real YouTube comment I pulled from a Primeagen video, and it genuinely is a suggestion. The model correctly classified it as a suggestion with 98% certainty.
Links
Try this script for yourself and start using GPT to create datasets.
Let me know how it goes!
If you have any questions, hit me up on my socials: