Real-World Data Analysis Workflow from Scratch (EP.5)
- jason7m
- May 6
- 3 min read
Episode 5: Data Splitting and Scaling — Making Your Model Reliable
In previous episodes, we explored the problem, cleaned and explored the data, and selected features using Information Value. Before we move into full-scale modeling, we need to talk about two crucial technical steps: how to split your data and how to scale it. These steps ensure that your model is both robust and generalizable.
Why Splitting Matters
You can't train and evaluate your model on the same data. Doing so would be like giving students the exam questions ahead of time. To estimate how well your model will perform on unseen data, we split the dataset into separate subsets.
There are several ways to do this:
1) Holdout Validation
This is the most basic strategy:
Training Set: Usually 70%
Test Set: Usually 30%
The training set is used to fit the model, while the test set is held back and used only once to assess final performance.
from sklearn.model_selection import train_test_split

# hold out 30% of the rows for the final evaluation
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)
Pros:
Simple and fast
Good for large datasets
Cons:
Performance estimate can vary depending on the random split
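To see that variance in action, here is a minimal sketch (assuming the X and y arrays from the earlier episodes, and logistic regression as a stand-in model) that evaluates the same model on three different random splits:

from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression

# the holdout estimate shifts with the random split
for seed in [0, 1, 2]:
    X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3, random_state=seed)
    score = LogisticRegression().fit(X_tr, y_tr).score(X_te, y_te)
    print(f"seed {seed}: accuracy {score:.3f}")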
2) Simple Validation (Train/Validation/Test Split)
This method adds a validation set to the mix:
Training Set: 60%
Validation Set: 20%
Test Set: 20%
The validation set is used to tune hyperparameters and select the best model. The test set remains untouched until the very end.
# first carve off the untouched 20% test set
X_temp, X_test, y_temp, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
# then split the remaining 80%: 0.25 x 0.8 = 0.2 of the original becomes validation
X_train, X_val, y_train, y_val = train_test_split(X_temp, y_temp, test_size=0.25, random_state=42)
Use Case: When trying multiple models and tuning their parameters.
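As a minimal sketch of that workflow, the loop below tunes logistic regression's regularization strength C on the validation set (C is just a stand-in hyperparameter here; any model parameter works the same way):

from sklearn.linear_model import LogisticRegression

# try a few candidate values and keep the one with the best validation score
best_c, best_score = None, -1.0
for c in [0.01, 0.1, 1.0, 10.0]:
    model = LogisticRegression(C=c).fit(X_train, y_train)
    score = model.score(X_val, y_val)
    if score > best_score:
        best_c, best_score = c, score

# the test set is touched exactly once, at the very end
final_model = LogisticRegression(C=best_c).fit(X_train, y_train)
print("Test score:", final_model.score(X_test, y_test))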
3) K-Fold Cross Validation
In k-fold cross validation:
Split the training data into k equal parts (folds)
For each fold:
Train the model on the other k-1 folds
Validate on the held-out fold
Repeat until every fold has served as the validation set once
Average the k scores to get a reliable performance estimate
from sklearn.model_selection import cross_val_score
from sklearn.linear_model import LogisticRegression

model = LogisticRegression()
scores = cross_val_score(model, X, y, cv=5)  # 5 folds; returns one score per fold
print("Average CV Score:", scores.mean())
Pros:
More robust estimate than a single train-test split
Utilizes all data for both training and validation
4) Other Validation Techniques
Here are a few additional variations used in more advanced or specific cases:
Repeated K-Fold: K-fold done multiple times with different splits
Random Subsampling: Repeatedly draw random train/test splits; unlike k-fold, not every sample is guaranteed to appear in a test set
Leave-One-Out (LOO): Each sample acts once as a test set; computationally expensive
Leave-P-Out: Similar to LOO, but leaves p points out for testing each time
These are useful in special cases, like when the dataset is small and every point matters.
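All of these are available in scikit-learn. A minimal sketch of the first and third, reusing logistic regression as the model:

from sklearn.model_selection import RepeatedKFold, LeaveOneOut, cross_val_score
from sklearn.linear_model import LogisticRegression

model = LogisticRegression()

# Repeated K-Fold: 5 folds, repeated 3 times with different shuffles
rkf = RepeatedKFold(n_splits=5, n_repeats=3, random_state=42)
print("Repeated K-Fold:", cross_val_score(model, X, y, cv=rkf).mean())

# Leave-One-Out: each sample is the test set once; only practical for small datasets
loo = LeaveOneOut()
print("Leave-One-Out:", cross_val_score(model, X, y, cv=loo).mean())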
Scaling Your Features
After splitting, we need to ensure our model isn't skewed by features on vastly different scales. That's where normalization and standardization come in. One rule applies throughout: fit the scaler on the training set only, then apply the same transformation to the validation and test sets, so no information from held-out data leaks into training. The examples below follow that pattern.
Normalization
Normalization transforms data into a fixed range, typically between 0 and 1.
When to Use: Neural networks, distance-based models like k-NN, and anything else sensitive to feature magnitude
Common Methods:
Min-Max Scaling:
from sklearn.preprocessing import MinMaxScaler

scaler = MinMaxScaler()
X_train_scaled = scaler.fit_transform(X_train)  # learn min/max from training data only
X_test_scaled = scaler.transform(X_test)        # reuse the training min/max
Scales features to [0, 1]
Sensitive to outliers
MaxAbs Scaling:
Divides each feature by its maximum absolute value, preserving sign while scaling to [-1, 1]
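A minimal sketch with scikit-learn's MaxAbsScaler, following the same fit-on-train pattern:

from sklearn.preprocessing import MaxAbsScaler

scaler = MaxAbsScaler()
X_train_scaled = scaler.fit_transform(X_train)  # learn max |x| per feature from training data
X_test_scaled = scaler.transform(X_test)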
Robust Scaling:
from sklearn.preprocessing import RobustScaler

scaler = RobustScaler()
X_train_scaled = scaler.fit_transform(X_train)  # learn median/IQR from training data only
X_test_scaled = scaler.transform(X_test)
Based on median and IQR
Robust to outliers
Log Scaling:
Use when data is heavily right-skewed (e.g., price distributions)
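A quick sketch with NumPy; log1p computes log(1 + x), so it is safe for zero values:

import numpy as np

# compresses long right tails; in practice apply it only to the skewed columns
X_log = np.log1p(X)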
Standardization
Standardization transforms features to have:
Mean = 0
Standard Deviation = 1
This is useful for algorithms that are sensitive to feature scale or that work with variances and distances, such as:
Linear regression
Logistic regression
PCA
SVM
from sklearn.preprocessing import StandardScaler

scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)  # learn mean/std from training data only
X_test_scaled = scaler.transform(X_test)
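To combine scaling with the cross-validation from earlier without leaking data between folds, scikit-learn's Pipeline re-fits the scaler inside every fold; a minimal sketch:

from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

# the scaler is re-fit on each training fold, so no fold's
# validation data influences the scaling step
pipe = Pipeline([
    ("scaler", StandardScaler()),
    ("model", LogisticRegression()),
])
scores = cross_val_score(pipe, X, y, cv=5)
print("Average CV Score:", scores.mean())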
Wrapping Up
In this episode, we covered:
Different ways to split your dataset (holdout, validation, cross-validation)
Advanced validation strategies for smaller datasets
Scaling techniques: when and how to use normalization vs. standardization
These techniques make your model more reliable, fair, and reproducible. In the next episode, we’ll look at how to evaluate model performance and interpret metrics.