The following steps will help ensure your dataset is well-prepared for synthetic or elaborative augmentation, leading to more effective and accurate AI model training.
Data is the lifeblood of AI. Your organization likely has invaluable data that can be used to train AI models very specific to your domain:
- Data from interactions with customers through various channels.
- Logs tracking user actions and behavior patterns.
- Customer relationship management data and contact details.
- Records of financial transactions and activities.
- Information on market trends and economic indicators.
- Logs and records related to fraud detection activities.
- Evaluations of potential risks and their impacts.
- Comprehensive records of patient health information.
- Images from medical scans like X-rays and MRIs.
- Data from laboratory tests and analyses.
- Information and results from clinical trials.
- Records of insurance claims and processing details.
It’s essential to prepare your dataset cleanly and systematically. This ensures the quality and relevance of the data, leading to better model performance.
The dataset needs to reflect the prompt-response nature of LLMs. So your dataset will need to have one system prompt, then a series of prompts and responses that the LLM will use to hone itself.
Structure them like this:
| System Prompt | User Prompt | Response |
|---|---|---|
| Hi, how can I help you today? | | |
| | What do these lab results suggest? | These lab results suggest that the patient is healthy, as no anomalous data has been detected. |
| | What is the sentiment of the last 5 customers who came into support chat? | The last five customers have a neutral to positive sentiment. |
| … | … | … |
Collect Data
Ensure you have collected all relevant data from various sources within your organization. This includes databases, CRM systems, transaction logs, and more.
Remove Duplicates
Ensure there are no duplicate entries in your dataset to avoid redundancy and bias.
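As a minimal sketch, assuming your records are loaded into a pandas DataFrame (the column names below are hypothetical), exact duplicates can be dropped like this:

```python
import pandas as pd

# Hypothetical customer-interaction records; replace with your own source.
df = pd.DataFrame({
    "customer_id": [101, 101, 102],
    "message": ["Where is my order?", "Where is my order?", "Reset my password"],
})

# Drop exact duplicate rows, keeping the first occurrence.
df = df.drop_duplicates(keep="first").reset_index(drop=True)
print(df)
```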
Handle Missing Values
Identify and appropriately handle missing data points. Options include filling in missing values using statistical methods or removing incomplete records.
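For example, with pandas (hypothetical columns; whether to fill or drop depends on how much each record matters):

```python
import numpy as np
import pandas as pd

# Hypothetical transaction records with gaps; replace with your own data.
df = pd.DataFrame({
    "amount": [120.0, np.nan, 75.5, np.nan],
    "channel": ["web", "branch", None, "web"],
})

# Fill numeric gaps with a statistic (here, the median) ...
df["amount"] = df["amount"].fillna(df["amount"].median())

# ... and drop records that are still incomplete.
df = df.dropna(subset=["channel"])
print(df)
```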
Correct Errors
Look for and correct any inaccuracies or inconsistencies in the data, such as typographical errors or misformatted entries.
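A small pandas sketch, assuming hypothetical columns, showing typical fixes for stray whitespace, inconsistent casing, known misspellings, and misformatted numbers:

```python
import pandas as pd

# Hypothetical records with typos and inconsistent formatting.
df = pd.DataFrame({
    "department": ["  Cardiology", "cardiology ", "Radioolgy"],
    "amount": ["1,200.50", "75", "3,010"],
})

# Trim whitespace and unify casing.
df["department"] = df["department"].str.strip().str.title()

# Fix known misspellings with an explicit mapping.
df["department"] = df["department"].replace({"Radioolgy": "Radiology"})

# Coerce misformatted numbers (thousands separators) into a numeric type.
df["amount"] = pd.to_numeric(df["amount"].str.replace(",", "", regex=False))
print(df)
```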
Normalize Data
Standardize the format and scale of your data. This might involve converting all dates to a standard format, ensuring numerical values are on a consistent scale, and normalizing text data.
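A minimal sketch of these three normalizations, assuming hypothetical columns and pandas 2.x for mixed-format date parsing:

```python
import pandas as pd

# Hypothetical records with mixed date formats and differing scales.
df = pd.DataFrame({
    "visit_date": ["2023-01-15", "15/02/2023", "March 3, 2023"],
    "lab_value": [4.2, 180.0, 92.5],
    "notes": ["  Patient Stable ", "FOLLOW-UP needed", "no change"],
})

# Standardize all dates to ISO format (format="mixed" requires pandas >= 2.0).
df["visit_date"] = pd.to_datetime(df["visit_date"], format="mixed").dt.strftime("%Y-%m-%d")

# Rescale numeric values to a 0-1 range (min-max normalization).
lo, hi = df["lab_value"].min(), df["lab_value"].max()
df["lab_value"] = (df["lab_value"] - lo) / (hi - lo)

# Normalize free text: trim whitespace and lowercase.
df["notes"] = df["notes"].str.strip().str.lower()
print(df)
```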
Categorize Data
Segment your data into meaningful categories or classes. This helps in training models more effectively by ensuring relevant data is grouped together.
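For instance, records pulled from different systems can be mapped to broader training categories (the source systems and category names below are hypothetical):

```python
import pandas as pd

# Hypothetical records from several source systems; replace with your own data.
df = pd.DataFrame({
    "text": ["Refund not received", "X-ray results question", "Suspicious card charge"],
    "source_system": ["crm", "ehr", "fraud_log"],
})

# Map each source system to a broader training category.
category_map = {
    "crm": "customer_support",
    "ehr": "healthcare",
    "fraud_log": "fraud_detection",
}
df["category"] = df["source_system"].map(category_map)

# Grouping lets you inspect or sample each category separately.
print(df.groupby("category").size())
```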
Feature Selection
Identify and select the most relevant features for your model. This reduces noise and improves model performance.
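A simple sketch of dropping zero-variance columns and obvious identifiers, assuming a hypothetical feature table; real feature selection may also use correlation analysis or model-based methods:

```python
import pandas as pd

# Hypothetical feature table; replace with your own columns.
df = pd.DataFrame({
    "transaction_amount": [120.0, 75.5, 3010.0, 42.0],
    "account_age_days": [400, 365, 12, 800],
    "constant_flag": [1, 1, 1, 1],                 # carries no information
    "internal_row_id": [1001, 1002, 1003, 1004],   # identifier, not a feature
})

# Drop zero-variance columns: they add noise without signal.
variances = df.var(numeric_only=True)
keep = variances[variances > 0].index.tolist()

# Also drop columns known to be identifiers rather than features.
keep = [c for c in keep if c != "internal_row_id"]

selected = df[keep]
print(selected.columns.tolist())
```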
Remove Personal Identifiers
Ensure all personal or sensitive information is anonymized or removed to comply with privacy regulations.
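As a rough sketch, common identifier patterns can be redacted with regular expressions; names and other free-text identifiers usually require a dedicated NER-based tool, so treat the patterns below as illustrative only:

```python
import re
import pandas as pd

# Hypothetical free-text records containing PII; replace with your own data.
df = pd.DataFrame({
    "note": [
        "Patient reported chest pain; SSN on file is 123-45-6789.",
        "Contact jane.smith@example.com about the pending claim.",
    ],
})

# Simple regex-based redaction for common identifier patterns.
patterns = {
    r"\b\d{3}-\d{2}-\d{4}\b": "[SSN]",           # US Social Security numbers
    r"\b[\w.+-]+@[\w-]+\.[\w.-]+\b": "[EMAIL]",  # email addresses
}

def redact(text: str) -> str:
    for pattern, placeholder in patterns.items():
        text = re.sub(pattern, placeholder, text)
    return text

df["note"] = df["note"].apply(redact)
print(df["note"].tolist())
```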
Tokenization
Replace sensitive data with tokens to maintain data integrity while protecting privacy. Store the mapping between tokens and the original sensitive data in a secure, separate location.
PII Redaction Example:
| Original Data | Tokenized Data |
|---|---|
| John Doe, 123-45-6789 | T123, T456 |
| Jane Smith, 987-65-4321 | T789, T654 |
In this example, “John Doe” is replaced with “T123” and “123-45-6789” with “T456,” ensuring that sensitive information remains secure.
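A minimal sketch of this tokenization step in Python, using a hypothetical token_map.csv file for the mapping (in practice the mapping store should be encrypted and access-controlled, separate from the training data):

```python
import csv
import uuid

# Hypothetical raw records containing PII; replace with your own data.
records = [
    {"name": "John Doe", "ssn": "123-45-6789"},
    {"name": "Jane Smith", "ssn": "987-65-4321"},
]

token_map = {}  # token -> original value; keep this mapping in secure storage

def tokenize(value: str) -> str:
    token = "T" + uuid.uuid4().hex[:8]
    token_map[token] = value
    return token

tokenized = [
    {field: tokenize(value) for field, value in record.items()}
    for record in records
]

# Persist the mapping separately from the training data.
with open("token_map.csv", "w", newline="") as f:
    writer = csv.writer(f)
    writer.writerows(token_map.items())

print(tokenized)
```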
Consistent Structure
Ensure all data follows a consistent structure and format, making it easier to process and analyze.
File Formats
Convert data into appropriate file formats (e.g., CSV, JSON) suitable for your training framework.
For Anarchy’s systems to be able to augment your data, it should be in CSV format with three mandatory fields: system prompt, prompt, and completion. Ensure that these fields appear strictly in the order specified and that the file has no header row.
A properly formatted .csv dataset should look like this (the header row below is shown for illustration only and must not appear in the file):
| System Prompt | User Prompt | Response |
|---|---|---|
| Hi, how can I help you today? | | |
| | What do these lab results suggest? | These lab results suggest that the patient is healthy, as no anomalous data has been detected. |
| | What is the sentiment of the last 5 customers who came into support chat? | The last five customers have a neutral to positive sentiment. |
| … | … | … |
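A minimal sketch of writing this structure with Python's csv module, using the example rows above and a hypothetical output filename:

```python
import csv

# Hypothetical prompt/response triples; replace with your curated data.
rows = [
    ("Hi, how can I help you today?", "", ""),
    ("", "What do these lab results suggest?",
     "These lab results suggest that the patient is healthy, as no anomalous data has been detected."),
    ("", "What is the sentiment of the last 5 customers who came into support chat?",
     "The last five customers have a neutral to positive sentiment."),
]

# Write system prompt, prompt, and completion in that order, with no header row.
with open("training_data.csv", "w", newline="", encoding="utf-8") as f:
    writer = csv.writer(f)
    writer.writerows(rows)
```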
Perform thorough quality checks to ensure the data is accurate, consistent, and complete.
Use cross-validation techniques to check the reliability and validity of your dataset.
Document metadata, including data sources, collection methods, and any preprocessing steps applied.
Maintain version control to track changes and updates to your dataset over time.