Real-World Data Analysis Workflow from Scratch (EP.5)
- jason7m
- May 6
- 3 min read
Episode 5: Data Splitting and Scaling — Making Your Model Reliable
In previous episodes, we explored the problem, cleaned and explored the data, and selected features using Information Value. Before we move into full-scale modeling, we need to talk about two crucial technical steps: how to split your data and how to scale it. These steps ensure that your model is both robust and generalizable.
Why Splitting Matters
You can't train and evaluate your model on the same data. Doing so would be like giving students the exam questions ahead of time. To estimate how well your model will perform on unseen data, we split the dataset into separate subsets.
There are several ways to do this:
1) Holdout Validation
This is the most basic strategy:
Training Set: Usually 70%
Test Set: Usually 30%
The training set is used to fit the model, while the test set is held back and used only once to assess final performance.
from sklearn.model_selection import train_test_split

# hold out 30% of the rows for the final evaluation
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)
Pros:
Simple and fast
Good for large datasets
Cons:
Performance estimate can vary depending on the random split
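To see that variance in action, here is a minimal sketch (assuming the X and y arrays from the earlier episodes, and logistic regression as a stand-in model) that evaluates the same model on three different random splits:

from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression

# the holdout estimate shifts with the random split
for seed in [0, 1, 2]:
    X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3, random_state=seed)
    score = LogisticRegression().fit(X_tr, y_tr).score(X_te, y_te)
    print(f"seed {seed}: accuracy {score:.3f}")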
2) Simple Validation (Train/Validation/Test Split)
This method adds a validation set to the mix:
Training Set: 60%
Validation Set: 20%
Test Set: 20%
The validation set is used to tune hyperparameters and select the best model. The test set remains untouched until the very end.
# first carve off the untouched 20% test set
X_temp, X_test, y_temp, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
# then split the remaining 80%: 0.25 x 0.8 = 0.2 of the original becomes validation
X_train, X_val, y_train, y_val = train_test_split(X_temp, y_temp, test_size=0.25, random_state=42)
Use Case: When trying multiple models and tuning their parameters.
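As a minimal sketch of that workflow, the loop below tunes logistic regression's regularization strength C on the validation set (C is just a stand-in hyperparameter here; any model parameter works the same way):

from sklearn.linear_model import LogisticRegression

# try a few candidate values and keep the one with the best validation score
best_c, best_score = None, -1.0
for c in [0.01, 0.1, 1.0, 10.0]:
    model = LogisticRegression(C=c).fit(X_train, y_train)
    score = model.score(X_val, y_val)
    if score > best_score:
        best_c, best_score = c, score

# the test set is touched exactly once, at the very end
final_model = LogisticRegression(C=best_c).fit(X_train, y_train)
print("Test score:", final_model.score(X_test, y_test))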
3) K-Fold Cross Validation
In k-fold cross validation:
Split the training data into k equal parts (folds)
For each fold:
Train the model on the other k-1 folds
Validate on the held-out fold
Repeat until every fold has served as the validation set once
Average the k scores to get a reliable performance estimate
from sklearn.model_selection import cross_val_score
from sklearn.linear_model import LogisticRegression

model = LogisticRegression()
scores = cross_val_score(model, X, y, cv=5)  # 5 folds; returns one score per fold
print("Average CV Score:", scores.mean())
Pros:
More robust estimate than a single train-test split
Utilizes all data for both training and validation
4) Other Validation Techniques
Here are a few additional variations used in more advanced or specific cases:
Repeated K-Fold: K-fold done multiple times with different splits
Random Subsampling: Repeatedly draw random train/test splits; unlike k-fold, not every sample is guaranteed to appear in a test set
Leave-One-Out (LOO): Each sample acts once as a test set; computationally expensive
Leave-P-Out: Similar to LOO, but leaves p points out for testing each time
These are useful in special cases, like when the dataset is small and every point matters.
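All of these are available in scikit-learn. A minimal sketch of the first and third, reusing logistic regression as the model:

from sklearn.model_selection import RepeatedKFold, LeaveOneOut, cross_val_score
from sklearn.linear_model import LogisticRegression

model = LogisticRegression()

# Repeated K-Fold: 5 folds, repeated 3 times with different shuffles
rkf = RepeatedKFold(n_splits=5, n_repeats=3, random_state=42)
print("Repeated K-Fold:", cross_val_score(model, X, y, cv=rkf).mean())

# Leave-One-Out: each sample is the test set once; only practical for small datasets
loo = LeaveOneOut()
print("Leave-One-Out:", cross_val_score(model, X, y, cv=loo).mean())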
Scaling Your Features
After splitting, we need to ensure our model isn't skewed by features on vastly different scales. That's where normalization and standardization come in. One rule applies throughout: fit the scaler on the training set only, then apply the same transformation to the validation and test sets, so no information from held-out data leaks into training. The examples below follow that pattern.
Normalization
Normalization transforms data into a fixed range, typically between 0 and 1.
When to Use: Neural networks, distance-based models like k-NN, and anything else sensitive to feature magnitude
Common Methods:
Min-Max Scaling:
from sklearn.preprocessing import MinMaxScaler

scaler = MinMaxScaler()
X_train_scaled = scaler.fit_transform(X_train)  # learn min/max from training data only
X_test_scaled = scaler.transform(X_test)        # reuse the training min/max
Scales features to [0, 1]
Sensitive to outliers
MaxAbs Scaling:
Divides each feature by its maximum absolute value, preserving sign while scaling to [-1, 1]
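A minimal sketch with scikit-learn's MaxAbsScaler, following the same fit-on-train pattern:

from sklearn.preprocessing import MaxAbsScaler

scaler = MaxAbsScaler()
X_train_scaled = scaler.fit_transform(X_train)  # learn max |x| per feature from training data
X_test_scaled = scaler.transform(X_test)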
Robust Scaling:
from sklearn.preprocessing import RobustScaler

scaler = RobustScaler()
X_train_scaled = scaler.fit_transform(X_train)  # learn median/IQR from training data only
X_test_scaled = scaler.transform(X_test)
Based on median and IQR
Robust to outliers
Log Scaling:
Use when data is heavily right-skewed (e.g., price distributions)
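A quick sketch with NumPy; log1p computes log(1 + x), so it is safe for zero values:

import numpy as np

# compresses long right tails; in practice apply it only to the skewed columns
X_log = np.log1p(X)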
Standardization
Standardization transforms features to have:
Mean = 0
Standard Deviation = 1
This is useful for algorithms that are sensitive to feature scale or that work with variances and distances, such as:
Linear regression
Logistic regression
PCA
SVM
from sklearn.preprocessing import StandardScaler

scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)  # learn mean/std from training data only
X_test_scaled = scaler.transform(X_test)
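To combine scaling with the cross-validation from earlier without leaking data between folds, scikit-learn's Pipeline re-fits the scaler inside every fold; a minimal sketch:

from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

# the scaler is re-fit on each training fold, so no fold's
# validation data influences the scaling step
pipe = Pipeline([
    ("scaler", StandardScaler()),
    ("model", LogisticRegression()),
])
scores = cross_val_score(pipe, X, y, cv=5)
print("Average CV Score:", scores.mean())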
Wrapping Up
In this episode, we covered:
Different ways to split your dataset (holdout, validation, cross-validation)
Advanced validation strategies for smaller datasets
Scaling techniques: when and how to use normalization vs. standardization
These techniques make your model more reliable, fair, and reproducible. In the next episode, we’ll look at how to evaluate model performance and interpret metrics.