The following steps will help ensure your dataset is well-prepared for synthetic or elaborative augmentation, leading to more effective and accurate AI model training.
Data is the lifeblood of AI. Your organization likely has invaluable data that can be used to train AI models very specific to your domain:
- Data from interactions with customers through various channels.
- Logs tracking user actions and behavior patterns.
- Customer relationship management data and contact details.
- Records of financial transactions and activities.
- Information on market trends and economic indicators.
- Logs and records related to fraud detection activities.
- Evaluations of potential risks and their impacts.
- Comprehensive records of patient health information.
- Images from medical scans like X-rays and MRIs.
- Data from laboratory tests and analyses.
- Information and results from clinical trials.
- Records of insurance claims and processing details.
It’s essential to prepare your dataset cleanly and systematically. This ensures the quality and relevance of the data, leading to better model performance.
The dataset needs to reflect the prompt-response nature of LLMs. So your dataset will need to have one system prompt, then a series of prompts and responses that the LLM will use to hone itself.
Structure them like this:
| System Prompt | User Prompt | Response |
|---|---|---|
| Hi, how can I help you today? | | |
| | What do these lab results suggest? | These lab results suggest that the patient is healthy, as no anomalous data has been detected. |
| | What is the sentiment of the last 5 customers who came into support chat? | The last five customers have a neutral to positive sentiment. |
| … | … | … |
Collect Data
Ensure you have collected all relevant data from various sources within your organization. This includes databases, CRM systems, transaction logs, and more.
Remove Duplicates
Ensure there are no duplicate entries in your dataset to avoid redundancy and bias.
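As a minimal sketch, assuming your records are loaded into a pandas DataFrame (the column names below are hypothetical), exact duplicates can be dropped like this:

```python
import pandas as pd

# Hypothetical customer-interaction records; replace with your own source.
df = pd.DataFrame({
    "customer_id": [101, 101, 102],
    "message": ["Where is my order?", "Where is my order?", "Reset my password"],
})

# Drop exact duplicate rows, keeping the first occurrence.
df = df.drop_duplicates(keep="first").reset_index(drop=True)
print(df)
```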
Handle Missing Values
Identify and appropriately handle missing data points. Options include filling in missing values using statistical methods or removing incomplete records.
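For example, with pandas (hypothetical columns; whether to fill or drop depends on how much each record matters):

```python
import numpy as np
import pandas as pd

# Hypothetical transaction records with gaps; replace with your own data.
df = pd.DataFrame({
    "amount": [120.0, np.nan, 75.5, np.nan],
    "channel": ["web", "branch", None, "web"],
})

# Fill numeric gaps with a statistic (here, the median) ...
df["amount"] = df["amount"].fillna(df["amount"].median())

# ... and drop records that are still incomplete.
df = df.dropna(subset=["channel"])
print(df)
```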
Correct Errors
Look for and correct any inaccuracies or inconsistencies in the data, such as typographical errors or misformatted entries.
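A small pandas sketch, assuming hypothetical columns, showing typical fixes for stray whitespace, inconsistent casing, known misspellings, and misformatted numbers:

```python
import pandas as pd

# Hypothetical records with typos and inconsistent formatting.
df = pd.DataFrame({
    "department": ["  Cardiology", "cardiology ", "Radioolgy"],
    "amount": ["1,200.50", "75", "3,010"],
})

# Trim whitespace and unify casing.
df["department"] = df["department"].str.strip().str.title()

# Fix known misspellings with an explicit mapping.
df["department"] = df["department"].replace({"Radioolgy": "Radiology"})

# Coerce misformatted numbers (thousands separators) into a numeric type.
df["amount"] = pd.to_numeric(df["amount"].str.replace(",", "", regex=False))
print(df)
```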
Normalize Data
Standardize the format and scale of your data. This might involve converting all dates to a standard format, ensuring numerical values are on a consistent scale, and normalizing text data.
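A minimal sketch of these three normalizations, assuming hypothetical columns and pandas 2.x for mixed-format date parsing:

```python
import pandas as pd

# Hypothetical records with mixed date formats and differing scales.
df = pd.DataFrame({
    "visit_date": ["2023-01-15", "15/02/2023", "March 3, 2023"],
    "lab_value": [4.2, 180.0, 92.5],
    "notes": ["  Patient Stable ", "FOLLOW-UP needed", "no change"],
})

# Standardize all dates to ISO format (format="mixed" requires pandas >= 2.0).
df["visit_date"] = pd.to_datetime(df["visit_date"], format="mixed").dt.strftime("%Y-%m-%d")

# Rescale numeric values to a 0-1 range (min-max normalization).
lo, hi = df["lab_value"].min(), df["lab_value"].max()
df["lab_value"] = (df["lab_value"] - lo) / (hi - lo)

# Normalize free text: trim whitespace and lowercase.
df["notes"] = df["notes"].str.strip().str.lower()
print(df)
```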
Categorize Data
Segment your data into meaningful categories or classes. This helps in training models more effectively by ensuring relevant data is grouped together.
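For instance, records pulled from different systems can be mapped to broader training categories (the source systems and category names below are hypothetical):

```python
import pandas as pd

# Hypothetical records from several source systems; replace with your own data.
df = pd.DataFrame({
    "text": ["Refund not received", "X-ray results question", "Suspicious card charge"],
    "source_system": ["crm", "ehr", "fraud_log"],
})

# Map each source system to a broader training category.
category_map = {
    "crm": "customer_support",
    "ehr": "healthcare",
    "fraud_log": "fraud_detection",
}
df["category"] = df["source_system"].map(category_map)

# Grouping lets you inspect or sample each category separately.
print(df.groupby("category").size())
```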
Feature Selection
Identify and select the most relevant features for your model. This reduces noise and improves model performance.
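A simple sketch of dropping zero-variance columns and obvious identifiers, assuming a hypothetical feature table; real feature selection may also use correlation analysis or model-based methods:

```python
import pandas as pd

# Hypothetical feature table; replace with your own columns.
df = pd.DataFrame({
    "transaction_amount": [120.0, 75.5, 3010.0, 42.0],
    "account_age_days": [400, 365, 12, 800],
    "constant_flag": [1, 1, 1, 1],                 # carries no information
    "internal_row_id": [1001, 1002, 1003, 1004],   # identifier, not a feature
})

# Drop zero-variance columns: they add noise without signal.
variances = df.var(numeric_only=True)
keep = variances[variances > 0].index.tolist()

# Also drop columns known to be identifiers rather than features.
keep = [c for c in keep if c != "internal_row_id"]

selected = df[keep]
print(selected.columns.tolist())
```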
Remove Personal Identifiers
Ensure all personal or sensitive information is anonymized or removed to comply with privacy regulations.
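As a rough sketch, common identifier patterns can be redacted with regular expressions; names and other free-text identifiers usually require a dedicated NER-based tool, so treat the patterns below as illustrative only:

```python
import re
import pandas as pd

# Hypothetical free-text records containing PII; replace with your own data.
df = pd.DataFrame({
    "note": [
        "Patient reported chest pain; SSN on file is 123-45-6789.",
        "Contact jane.smith@example.com about the pending claim.",
    ],
})

# Simple regex-based redaction for common identifier patterns.
patterns = {
    r"\b\d{3}-\d{2}-\d{4}\b": "[SSN]",           # US Social Security numbers
    r"\b[\w.+-]+@[\w-]+\.[\w.-]+\b": "[EMAIL]",  # email addresses
}

def redact(text: str) -> str:
    for pattern, placeholder in patterns.items():
        text = re.sub(pattern, placeholder, text)
    return text

df["note"] = df["note"].apply(redact)
print(df["note"].tolist())
```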
Tokenization
Replace sensitive data with tokens to maintain data integrity while protecting privacy. Store the mapping between tokens and the original sensitive data in a secure, separate location.
PII Redaction Example:
| Original Data | Tokenized Data |
|---|---|
| John Doe, 123-45-6789 | T123, T456 |
| Jane Smith, 987-65-4321 | T789, T654 |
In this example, “John Doe” is replaced with “T123” and “123-45-6789” with “T456,” ensuring that sensitive information remains secure.
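A minimal sketch of this tokenization step in Python, using a hypothetical token_map.csv file for the mapping (in practice the mapping store should be encrypted and access-controlled, separate from the training data):

```python
import csv
import uuid

# Hypothetical raw records containing PII; replace with your own data.
records = [
    {"name": "John Doe", "ssn": "123-45-6789"},
    {"name": "Jane Smith", "ssn": "987-65-4321"},
]

token_map = {}  # token -> original value; keep this mapping in secure storage

def tokenize(value: str) -> str:
    token = "T" + uuid.uuid4().hex[:8]
    token_map[token] = value
    return token

tokenized = [
    {field: tokenize(value) for field, value in record.items()}
    for record in records
]

# Persist the mapping separately from the training data.
with open("token_map.csv", "w", newline="") as f:
    writer = csv.writer(f)
    writer.writerows(token_map.items())

print(tokenized)
```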
Consistent Structure
Ensure all data follows a consistent structure and format, making it easier to process and analyze.
File Formats
Convert data into appropriate file formats (e.g., CSV, JSON) suitable for your training framework.
For Anarchy’s systems to be able to augment your data, it should be in CSV format with three mandatory fields: system prompt, prompt, and completion. Ensure that these fields appear strictly in the order specified and that the file has no header row.
A properly formatted .csv dataset should look like this (the header row below is shown for illustration only and must not appear in the file):
| System Prompt | User Prompt | Response |
|---|---|---|
| Hi, how can I help you today? | | |
| | What do these lab results suggest? | These lab results suggest that the patient is healthy, as no anomalous data has been detected. |
| | What is the sentiment of the last 5 customers who came into support chat? | The last five customers have a neutral to positive sentiment. |
| … | … | … |
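A minimal sketch of writing this structure with Python's csv module, using the example rows above and a hypothetical output filename:

```python
import csv

# Hypothetical prompt/response triples; replace with your curated data.
rows = [
    ("Hi, how can I help you today?", "", ""),
    ("", "What do these lab results suggest?",
     "These lab results suggest that the patient is healthy, as no anomalous data has been detected."),
    ("", "What is the sentiment of the last 5 customers who came into support chat?",
     "The last five customers have a neutral to positive sentiment."),
]

# Write system prompt, prompt, and completion in that order, with no header row.
with open("training_data.csv", "w", newline="", encoding="utf-8") as f:
    writer = csv.writer(f)
    writer.writerows(rows)
```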
Perform thorough quality checks to ensure the data is accurate, consistent, and complete.
Use cross-validation techniques to check the reliability and validity of your dataset.
Document metadata, including data sources, collection methods, and any preprocessing steps applied.
Maintain version control to track changes and updates to your dataset over time.