Real-World Data Analysis Workflow from Scratch (EP.3)

jason7m
May 2

Updated: May 6

Episode 3: Exploratory Data Analysis — Understanding Customer Behavior

Welcome to Episode 3 of our data analysis journey. After framing the business problem and preparing a clean dataset, it's time to explore our data. In this phase, we don’t build models yet. Instead, we uncover insights that help us ask better questions and make more informed decisions later on.

The key goal of Exploratory Data Analysis (EDA) in this case is to understand customer purchase behavior and identify patterns among repeat vs one-time buyers.

What Are We Trying to Discover?

We want to answer questions like:

What does a typical repeat customer look like?
How frequently do they shop?
How much do they spend?
Are there seasonal or time-based trends?
How long does it take for someone to make a second purchase?

Creating RFM Features

Let’s begin by aggregating user behavior into customer-level metrics. A classic framework for this is RFM (Recency, Frequency, Monetary):

import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

# Convert InvoiceDate to datetime
df['InvoiceDate'] = pd.to_datetime(df['InvoiceDate'])

# Define snapshot date (e.g., max date + 1)
snapshot_date = df['InvoiceDate'].max() + pd.Timedelta(days=1)

# Aggregate RFM features
rfm = df.groupby('CustomerID').agg({
    'InvoiceDate': lambda x: (snapshot_date - x.max()).days,
    'InvoiceNo': 'nunique',
    'TotalSpent': 'sum'
})
rfm.columns = ['Recency', 'Frequency', 'Monetary']
rfm.head()
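To make the three metrics concrete, here is a minimal sketch on a tiny, made-up transaction table. The data and the names `toy`, `snapshot`, and `toy_rfm` are purely illustrative; only the column names follow the article. It uses pandas named aggregation, which is one possible alternative to renaming the columns afterwards.

```python
import pandas as pd

# Made-up transactions for illustration only; column names follow the article.
toy = pd.DataFrame({
    'CustomerID': [1, 1, 2, 3, 3, 3],
    'InvoiceNo': ['A1', 'A2', 'B1', 'C1', 'C1', 'C2'],
    'InvoiceDate': pd.to_datetime([
        '2024-01-05', '2024-03-01', '2024-02-10',
        '2024-01-20', '2024-01-20', '2024-03-09',
    ]),
    'TotalSpent': [20.0, 35.0, 15.0, 10.0, 5.0, 40.0],
})

# Snapshot is one day after the last recorded purchase (2024-03-10 here).
snapshot = toy['InvoiceDate'].max() + pd.Timedelta(days=1)

toy_rfm = toy.groupby('CustomerID').agg(
    Recency=('InvoiceDate', lambda x: (snapshot - x.max()).days),  # days since last order
    Frequency=('InvoiceNo', 'nunique'),                            # distinct invoices
    Monetary=('TotalSpent', 'sum'),                                # total spend
)
print(toy_rfm)
```

Customer 3 has two lines on invoice C1, but `nunique` counts it once, so their Frequency is 2 rather than 3.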

Visualizing Customer Segments

Let’s visualize how these metrics vary across our customer base.

Histogram: Recency

sns.histplot(rfm['Recency'], bins=30, kde=True)
plt.title('Customer Recency Distribution')

Box Plot: Monetary

sns.boxplot(x=rfm['Monetary'])
plt.title('Distribution of Total Spend')

Correlation Heatmap

sns.heatmap(rfm.corr(), annot=True, cmap='Blues')
plt.title("Correlation between RFM metrics")

These charts help us check, for instance, whether high-frequency buyers also tend to spend more (a positive correlation between Frequency and Monetary), how skewed total spend is, and how recently most customers last purchased.
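Spend data is often heavily right-skewed, so a few very large customers can dominate the default (Pearson) correlation. As a hedged sketch, the snippet below compares Pearson with the rank-based Spearman correlation on a hypothetical RFM table. The name `rfm_demo` and its values are invented for illustration; with the article's data you would call the same methods on `rfm`.

```python
import pandas as pd

# Hypothetical RFM table; in the article this would be the `rfm` DataFrame.
rfm_demo = pd.DataFrame({
    'Recency':   [5, 40, 12, 200, 90, 3],
    'Frequency': [12, 3, 8, 1, 2, 20],
    'Monetary':  [900.0, 120.0, 400.0, 30.0, 75.0, 15000.0],
})

pearson = rfm_demo.corr()                    # linear correlation (the default)
spearman = rfm_demo.corr(method='spearman')  # rank-based, less sensitive to outliers

print(pearson.round(2))
print(spearman.round(2))
```

If the two matrices disagree noticeably, that is usually a hint that outliers or a non-linear relationship are at play, and a log-scaled plot of Monetary may be worth a look.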

Cohort Analysis (Optional but Powerful)

Cohort analysis tracks how groups of customers behave over time based on their acquisition month.

df['InvoiceMonth'] = df['InvoiceDate'].dt.to_period('M')
df['CohortMonth'] = df.groupby('CustomerID')['InvoiceDate'].transform('min').dt.to_period('M')

# Whole calendar months since the customer's first purchase
df['CohortIndex'] = (df['InvoiceMonth'].dt.year - df['CohortMonth'].dt.year) * 12 + (df['InvoiceMonth'].dt.month - df['CohortMonth'].dt.month)

You can then pivot and visualize repeat behavior per cohort.

cohort_counts = df.groupby(['CohortMonth', 'CohortIndex'])['CustomerID'].nunique().unstack(0)
cohort_counts.plot(figsize=(12,6))
plt.title("Cohort Analysis: Repeat Customers Over Time")
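Raw counts make large cohorts look better than small ones, so a common follow-up is a retention table: each cohort's active customers divided by its size in month 0. The sketch below shows one way to build it on made-up orders, computing the cohort index as a calendar-month difference. The names `orders`, `counts`, and `retention` are illustrative assumptions, not part of the original workflow.

```python
import pandas as pd

# Made-up orders for illustration; column names follow the article.
orders = pd.DataFrame({
    'CustomerID': [1, 1, 1, 2, 2, 3, 4],
    'InvoiceDate': pd.to_datetime([
        '2024-01-15', '2024-02-03', '2024-03-20',
        '2024-01-28', '2024-03-02', '2024-02-11', '2024-02-25',
    ]),
})

orders['InvoiceMonth'] = orders['InvoiceDate'].dt.to_period('M')
orders['CohortMonth'] = (
    orders.groupby('CustomerID')['InvoiceDate'].transform('min').dt.to_period('M')
)
# Whole calendar months elapsed since the customer's first purchase.
orders['CohortIndex'] = (
    (orders['InvoiceMonth'].dt.year - orders['CohortMonth'].dt.year) * 12
    + (orders['InvoiceMonth'].dt.month - orders['CohortMonth'].dt.month)
)

# Rows: cohorts, columns: months since first purchase, values: active customers.
counts = (
    orders.groupby(['CohortMonth', 'CohortIndex'])['CustomerID']
    .nunique()
    .unstack('CohortIndex')
)
# Divide each row by its month-0 size to get the share of the cohort still buying.
retention = counts.div(counts[0], axis=0)
print(retention.round(2))
```

Cells are NaN where a cohort had no buyers in that month or has not been observed that long yet. A table like this can be passed to `sns.heatmap` in the same way as the correlation matrix earlier.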

Comparing One-Time vs Repeat Buyers

Let’s create a binary flag for repeat buyers:

purchase_counts = df.groupby('CustomerID')['InvoiceNo'].nunique()
# Vectorized lookup; rows with a missing CustomerID are flagged 0 instead of raising a KeyError
df['RepeatBuyer'] = df['CustomerID'].map(purchase_counts).gt(1).astype(int)

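Then, compare key metrics across the two groups. Aggregating to one row per customer first keeps the averages per customer rather than per invoice line:

customer_summary = df.groupby('CustomerID').agg(
    RepeatBuyer=('RepeatBuyer', 'first'),
    TotalSpent=('TotalSpent', 'sum'),
    Orders=('InvoiceNo', 'nunique')
)
repeat_summary = customer_summary.groupby('RepeatBuyer')[['TotalSpent', 'Orders']].mean()
repeat_summary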
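One of the opening questions, how long it takes someone to make a second purchase, can be approached along the same lines. The following is a minimal sketch on made-up invoices. The names `invoices`, `order_dates`, and `days_to_second` are assumptions for illustration, and it treats each distinct InvoiceNo as one purchase.

```python
import pandas as pd

# Made-up invoices for illustration; column names follow the article.
invoices = pd.DataFrame({
    'CustomerID': [1, 1, 1, 2, 3, 3],
    'InvoiceNo': ['A1', 'A1', 'A2', 'B1', 'C1', 'C2'],
    'InvoiceDate': pd.to_datetime([
        '2024-01-05', '2024-01-05', '2024-01-19',
        '2024-02-10', '2024-01-20', '2024-03-20',
    ]),
})

# One row per order, so multi-line invoices are not counted as separate purchases.
order_dates = (
    invoices.groupby(['CustomerID', 'InvoiceNo'])['InvoiceDate'].min()
    .reset_index()
    .sort_values(['CustomerID', 'InvoiceDate'])
)
# Number each customer's orders chronologically: 0 = first, 1 = second, ...
order_dates['OrderRank'] = order_dates.groupby('CustomerID').cumcount()

first = order_dates[order_dates['OrderRank'] == 0].set_index('CustomerID')['InvoiceDate']
second = order_dates[order_dates['OrderRank'] == 1].set_index('CustomerID')['InvoiceDate']

# One-time buyers have no second order and drop out of the result.
days_to_second = (second - first.reindex(second.index)).dt.days
print(days_to_second)
print(days_to_second.median())
```

The median is usually a safer summary than the mean here, since a handful of customers who return after a very long gap can pull the average up.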

Wrapping Up

EDA is about curiosity and discovery. In this episode, we:

Created RFM features to describe customer behavior
Visualized distributions and correlations
Built the foundation for segmentation
Created a repeat buyer flag for future modeling

Now that we know who our repeat buyers are and how they behave, we can move on to predictive modeling.

Stay tuned for Episode 4: Variable Selection with IV — Identifying Predictive Power