Loading the dataset and performing EDA

 
In this project, I’m looking at Waze user data to figure out why some drivers stop using the app. My goal is to build a model that can tell the difference between loyal users and those who are likely to leave. I’ll start by cleaning the data and creating new features that capture specific driving habits. Finally, I’ll fit a logistic regression model to see whether we can accurately predict when a user is about to quit, which helps us understand how to keep them happy and on the road.
In [1]:
import pandas as pd
import numpy as np

import matplotlib.pyplot as plt
import seaborn as sns

from sklearn.preprocessing import StandardScaler, OneHotEncoder
from sklearn.model_selection import train_test_split
from sklearn.metrics import classification_report, accuracy_score, precision_score, \
recall_score, f1_score, confusion_matrix, ConfusionMatrixDisplay
from sklearn.linear_model import LogisticRegression
 

Exploring data

In [45]:
#dataset available here https://career.skills.google/focuses/133285?parent=catalog
df = pd.read_csv("waze_dataset.csv")
 
We are going to begin by performing a basic EDA of this dataset.
In [46]:
print(df.shape)
df.info()
(14999, 13)
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 14999 entries, 0 to 14998
Data columns (total 13 columns):
 #   Column                   Non-Null Count  Dtype  
---  ------                   --------------  -----  
 0   ID                       14999 non-null  int64  
 1   label                    14299 non-null  object 
 2   sessions                 14999 non-null  int64  
 3   drives                   14999 non-null  int64  
 4   total_sessions           14999 non-null  float64
 5   n_days_after_onboarding  14999 non-null  int64  
 6   total_navigations_fav1   14999 non-null  int64  
 7   total_navigations_fav2   14999 non-null  int64  
 8   driven_km_drives         14999 non-null  float64
 9   duration_minutes_drives  14999 non-null  float64
 10  activity_days            14999 non-null  int64  
 11  driving_days             14999 non-null  int64  
 12  device                   14999 non-null  object 
dtypes: float64(3), int64(8), object(2)
memory usage: 1.5+ MB
In [47]:
df.head()
Out[47]:
ID label sessions drives total_sessions n_days_after_onboarding total_navigations_fav1 total_navigations_fav2 driven_km_drives duration_minutes_drives activity_days driving_days device
0 0 retained 283 226 296.748273 2276 208 0 2628.845068 1985.775061 28 19 Android
1 1 retained 133 107 326.896596 1225 19 64 13715.920550 3160.472914 13 11 iPhone
2 2 retained 114 95 135.522926 2651 0 0 3059.148818 1610.735904 14 8 Android
3 3 retained 49 40 67.589221 15 322 7 913.591123 587.196542 7 3 iPhone
4 4 retained 84 68 168.247020 1562 166 5 3950.202008 1219.555924 27 18 Android
In [36]:
df.drop('ID', axis = 1,  inplace = True)
In [37]:
print(df.isna().any())
na_count = df["label"].isna().sum()
print("The rate of NA values is", round(na_count * 100.0 / len(df), 2))
label                       True
sessions                   False
drives                     False
total_sessions             False
n_days_after_onboarding    False
total_navigations_fav1     False
total_navigations_fav2     False
driven_km_drives           False
duration_minutes_drives    False
activity_days              False
driving_days               False
device                     False
dtype: bool
The rate of NA values is 4.67
 
Our missing value check shows that only about 4.67% of our target label column is null. Since this is a very small fraction of our 15,000-row dataset, dropping these rows is a safe and straightforward approach that shouldn't skew our overall analysis.
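Before dropping, it's also worth a quick sanity check that the missing labels aren't concentrated in one device group, which would hint the data is not missing at random. A minimal sketch of that check on toy data (the tiny DataFrame below is illustrative only; on the real data we'd group `df` itself):

```python
import pandas as pd

# Toy stand-in for the Waze dataframe (illustrative values only).
toy = pd.DataFrame({
    "label": ["retained", None, "churned", "retained", None, "retained"],
    "device": ["Android", "Android", "iPhone", "iPhone", "iPhone", "Android"],
})

# Share of missing labels per device; roughly equal rates suggest
# the missingness is unrelated to device type.
na_rate_by_device = toy.groupby("device")["label"].apply(lambda s: s.isna().mean())
print(na_rate_by_device)
```

If the rates differ sharply between Android and iPhone users, dropping rows could bias the device-related signal.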
In [39]:
df.dropna(inplace = True)
 

Creating features

 
Since churn is generally strongly negatively correlated with usage, we will create two new features: the number of kilometers driven per driving day, and a binary flag marking frequent app users (here, a frequent user is one with at least 60 drives and at least 15 driving days in the month).
In [56]:
# Average distance per active driving day; users with 0 driving days
# produce inf, which we reset to 0.
df['km_per_driving_day'] = df['driven_km_drives'] / df['driving_days']
df.loc[df['km_per_driving_day'] == np.inf, 'km_per_driving_day'] = 0
# Frequent user: at least 60 drives and at least 15 driving days.
df['frequent_user'] = np.where((df['drives'] >= 60) & (df['driving_days'] >= 15), 1, 0)
In [57]:
df['km_per_driving_day'].describe()
Out[57]:
count    14999.000000
mean       578.963113
std       1030.094384
min          0.000000
25%        136.238895
50%        272.889272
75%        558.686918
max      15420.234110
Name: km_per_driving_day, dtype: float64
In [58]:
df['frequent_user'].describe()
Out[58]:
count    14999.000000
mean         0.172945
std          0.378212
min          0.000000
25%          0.000000
50%          0.000000
75%          0.000000
max          1.000000
Name: frequent_user, dtype: float64
In [65]:
df.groupby(['frequent_user'])['label'].value_counts(normalize = True)
Out[65]:
frequent_user  label   
0              retained    0.801202
               churned     0.198798
1              retained    0.924437
               churned     0.075563
Name: proportion, dtype: float64
 
The breakdown confirms our intuition: frequent users are significantly less likely to churn compared to non-frequent users. This new feature shows a strong signal and should give our upcoming logistic regression model a helpful predictive boost.
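To quantify how strong this signal is, a chi-squared test of independence between frequent_user and label is a quick follow-up. Here is a self-contained sketch on synthetic data with churn rates similar to those above (illustrative only; on the real dataframe the contingency table would come from `pd.crosstab(df['frequent_user'], df['label'])`):

```python
import numpy as np
import pandas as pd
from scipy.stats import chi2_contingency

# Synthetic data with a built-in association between frequency and churn,
# loosely mimicking the rates observed above (20% vs ~7.5%).
rng = np.random.default_rng(42)
frequent = rng.integers(0, 2, size=2000)
churned = rng.random(2000) < np.where(frequent == 1, 0.07, 0.20)

table = pd.crosstab(frequent, churned)
chi2, p, dof, expected = chi2_contingency(table)
print(f"chi2 = {chi2:.1f}, p = {p:.2e}, dof = {dof}")
```

A tiny p-value here confirms the association is far too large to be chance, which is what we would expect given the gap in churn rates.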
 

Checking assumptions and building the model

In [66]:
df['label2'] = np.where(df['label']=='churned', 1, 0)
df['device2'] = np.where(df['device']=='Android', 0, 1)
 
Before we build a Logistic Regression model, we need to ensure our features aren't too highly correlated with each other. If they are, it violates the no-multicollinearity assumption, which can make our model coefficients unstable. Let's visualize the relationships between our variables using a correlation heatmap.
In [76]:
corr_matrix = df.drop(['label', 'device'], axis = 1).corr(method = "pearson")
sns.heatmap(corr_matrix, vmin=-1, vmax=1, cmap='coolwarm')
plt.show()
Output: correlation heatmap of the numeric features
 
As we can observe in the correlation matrix, sessions and drives are almost perfectly correlated, as are driving_days and activity_days. We will therefore drop one column from each pair to satisfy the no-multicollinearity assumption.
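Beyond eyeballing the heatmap, a variance inflation factor (VIF) per column makes the multicollinearity check quantitative; values above roughly 5 to 10 are the usual red flag. Here is a minimal self-contained sketch using scikit-learn on toy data (on the real data, we would pass the feature matrix instead of `toy`):

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression

def vif(X: pd.DataFrame) -> pd.Series:
    """VIF_j = 1 / (1 - R^2_j), where R^2_j comes from regressing
    column j on all the other columns."""
    out = {}
    for col in X.columns:
        others = X.drop(columns=col)
        r2 = LinearRegression().fit(others, X[col]).score(others, X[col])
        out[col] = 1.0 / (1.0 - r2) if r2 < 1.0 else np.inf
    return pd.Series(out)

# Toy example: b is almost a copy of a, c is independent noise.
rng = np.random.default_rng(0)
a = rng.normal(size=500)
toy = pd.DataFrame({"a": a,
                    "b": a + rng.normal(scale=0.01, size=500),
                    "c": rng.normal(size=500)})
print(vif(toy))  # a and b blow up; c stays near 1
```

Columns whose VIF explodes (like the near-duplicate pair here) are exactly the ones worth dropping, mirroring our decision on sessions/drives and activity_days/driving_days.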
In [79]:
X = df.drop(columns = ['label', 'label2', 'device', 'sessions', 'driving_days'])
y = df['label2']
In [80]:
X_train, X_test, y_train, y_test = train_test_split(X, y, stratify=y, random_state=42)
In [90]:
model = LogisticRegression(penalty=None, max_iter=2000)

model.fit(X_train, y_train)
Out[90]:
LogisticRegression(max_iter=2000, penalty=None)
 

The Linearity Assumption



Logistic regression also assumes a linear relationship between the continuous predictor variables and the log-odds (logit) of the outcome. Let's plot this relationship for activity_days to ensure this assumption holds up before we evaluate our final metrics.
In [91]:
training_probabilities = model.predict_proba(X_train)

logit_data = X_train.copy()

# Log-odds (logit) of churn implied by the model for each training row.
logit_data['logit'] = np.log(training_probabilities[:, 1] / training_probabilities[:, 0])
In [92]:
sns.regplot(x='activity_days', y='logit', data=logit_data, scatter_kws={'s': 2, 'alpha': 0.5})
plt.show()
Output: regplot of activity_days against the model logit
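As a sanity check, the log-odds we computed from predict_proba should match the model's decision_function, since scikit-learn's LogisticRegression exposes the raw linear score directly. A small self-contained demonstration on synthetic data:

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

# Fit a toy logistic regression.
X, y = make_classification(n_samples=200, n_features=5, random_state=42)
model = LogisticRegression(max_iter=1000).fit(X, y)

# Log-odds reconstructed from the predicted probabilities...
proba = model.predict_proba(X)
logit_from_proba = np.log(proba[:, 1] / proba[:, 0])
# ...should equal the raw linear score up to floating-point error.
logit_direct = model.decision_function(X)

print(np.allclose(logit_from_proba, logit_direct))  # True
```

Using decision_function avoids the explicit log-ratio and is numerically safer when probabilities are close to 0 or 1.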
 

Evaluating the model

In [93]:
y_preds = model.predict(X_test)
In [94]:
model.score(X_test, y_test)
Out[94]:
0.832
In [95]:
cm = confusion_matrix(y_test, y_preds)
disp = ConfusionMatrixDisplay(confusion_matrix=cm, display_labels=['retained', 'churned'])
disp.plot()
Out[95]:
<sklearn.metrics._plot.confusion_matrix.ConfusionMatrixDisplay at 0x294050738f0>
Output: confusion matrix plot
In [96]:
target_labels = ['retained', 'churned']
print(classification_report(y_test, y_preds, target_names=target_labels))
              precision    recall  f1-score   support

    retained       0.84      0.99      0.91      3116
     churned       0.53      0.05      0.10       634

    accuracy                           0.83      3750
   macro avg       0.68      0.52      0.50      3750
weighted avg       0.79      0.83      0.77      3750
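Before reading too much into the 83% accuracy, it helps to compare it against the majority-class baseline: with 3116 of the 3750 test rows retained (the support counts from the report above), a model that always predicts "retained" already scores about the same. A quick check:

```python
# Support counts from the classification report on the test set.
retained, churned = 3116, 634

# Accuracy of a trivial model that always predicts "retained".
baseline_accuracy = retained / (retained + churned)
print(round(baseline_accuracy, 3))  # 0.831
```

So our model's 0.832 accuracy barely beats doing nothing at all, which is why the per-class metrics below matter far more here.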
 

Results and next steps



Well, the results are a mixed bag!

While our overall accuracy looks great (83%) and the model is excellent at predicting retained drivers, it struggles significantly with our actual goal: predicting churn.

* Our recall for churned users is a dismal 0.05. That means the model only catches 5% of the users who actually ended up churning. You'd have better luck flipping a coin!
* Logistic regression often struggles with highly imbalanced datasets or complex, non-linear patterns that feature engineering couldn't fully capture.
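Before jumping to new model families, one cheap remedy worth trying is reweighting the classes so that minority-class mistakes cost more during fitting; scikit-learn supports this via class_weight='balanced'. Here is a sketch on synthetic imbalanced data (in the real pipeline we would reuse X_train/y_train from above instead of the generated data):

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import recall_score
from sklearn.model_selection import train_test_split

# Synthetic ~85/15 imbalance, loosely mimicking the churn ratio.
X, y = make_classification(n_samples=5000, weights=[0.85, 0.15],
                           class_sep=0.8, random_state=42)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, stratify=y, random_state=42)

plain = LogisticRegression(max_iter=2000).fit(X_tr, y_tr)
balanced = LogisticRegression(max_iter=2000,
                              class_weight='balanced').fit(X_tr, y_tr)

print("plain recall:   ", recall_score(y_te, plain.predict(X_te)))
print("balanced recall:", recall_score(y_te, balanced.predict(X_te)))
```

The reweighted model typically trades some precision for a large jump in minority-class recall, which is the trade-off we actually want for catching churners.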

This is a perfect cue to try more robust, tree-based models. Algorithms like a Random Forest or an XGBoost classifier generally handle imbalanced data and non-linear relationships much better (to be continued...).