Loading the dataset and performing EDA

 
In this project, I’m looking at Waze user data to figure out why some drivers stop using the app. My goal is to build a model that can tell the difference between loyal users and those who are likely to leave. I’ll start by cleaning the data and creating new features that capture specific driving habits. Finally, I’ll fit a logistic regression model to see whether we can accurately predict when a user is about to quit, which helps us understand how to keep them happy and on the road.
In [1]:
import pandas as pd
import numpy as np

import matplotlib.pyplot as plt
import seaborn as sns

from sklearn.preprocessing import StandardScaler, OneHotEncoder
from sklearn.model_selection import train_test_split
from sklearn.metrics import classification_report, accuracy_score, precision_score, \
recall_score, f1_score, confusion_matrix, ConfusionMatrixDisplay
from sklearn.linear_model import LogisticRegression
 

Exploring data

In [45]:
#dataset available here https://career.skills.google/focuses/133285?parent=catalog
df = pd.read_csv("waze_dataset.csv")
 
We are going to begin by performing a basic EDA of this dataset.
In [46]:
print(df.shape)
df.info()
(14999, 13)
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 14999 entries, 0 to 14998
Data columns (total 13 columns):
 #   Column                   Non-Null Count  Dtype  
---  ------                   --------------  -----  
 0   ID                       14999 non-null  int64  
 1   label                    14299 non-null  object 
 2   sessions                 14999 non-null  int64  
 3   drives                   14999 non-null  int64  
 4   total_sessions           14999 non-null  float64
 5   n_days_after_onboarding  14999 non-null  int64  
 6   total_navigations_fav1   14999 non-null  int64  
 7   total_navigations_fav2   14999 non-null  int64  
 8   driven_km_drives         14999 non-null  float64
 9   duration_minutes_drives  14999 non-null  float64
 10  activity_days            14999 non-null  int64  
 11  driving_days             14999 non-null  int64  
 12  device                   14999 non-null  object 
dtypes: float64(3), int64(8), object(2)
memory usage: 1.5+ MB
In [47]:
df.head()
Out[47]:
ID label sessions drives total_sessions n_days_after_onboarding total_navigations_fav1 total_navigations_fav2 driven_km_drives duration_minutes_drives activity_days driving_days device
0 0 retained 283 226 296.748273 2276 208 0 2628.845068 1985.775061 28 19 Android
1 1 retained 133 107 326.896596 1225 19 64 13715.920550 3160.472914 13 11 iPhone
2 2 retained 114 95 135.522926 2651 0 0 3059.148818 1610.735904 14 8 Android
3 3 retained 49 40 67.589221 15 322 7 913.591123 587.196542 7 3 iPhone
4 4 retained 84 68 168.247020 1562 166 5 3950.202008 1219.555924 27 18 Android
In [36]:
df.drop('ID', axis = 1,  inplace = True)
In [37]:
print(df.isna().any())
na_count = df["label"].isna().sum()
print("The rate of NA values is", round(na_count * 100.0 / len(df), 2))
label                       True
sessions                   False
drives                     False
total_sessions             False
n_days_after_onboarding    False
total_navigations_fav1     False
total_navigations_fav2     False
driven_km_drives           False
duration_minutes_drives    False
activity_days              False
driving_days               False
device                     False
dtype: bool
The rate of NA values is 4.67
 
Our missing value check shows that only about 4.67% of our target label column is null. Since this is a very small fraction of our 15,000-row dataset, dropping these rows is a safe and straightforward approach that shouldn't skew our overall analysis.
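Before dropping, it's also worth a quick sanity check that the missing labels aren't concentrated in one device group, which would hint the data is not missing at random. A minimal sketch of that check on toy data (the tiny DataFrame below is illustrative only; on the real data we'd group `df` itself):

```python
import pandas as pd

# Toy stand-in for the Waze dataframe (illustrative values only).
toy = pd.DataFrame({
    "label": ["retained", None, "churned", "retained", None, "retained"],
    "device": ["Android", "Android", "iPhone", "iPhone", "iPhone", "Android"],
})

# Share of missing labels per device; roughly equal rates suggest
# the missingness is unrelated to device type.
na_rate_by_device = toy.groupby("device")["label"].apply(lambda s: s.isna().mean())
print(na_rate_by_device)
```

If the rates differ sharply between Android and iPhone users, dropping rows could bias the device-related signal.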
In [39]:
df.dropna(inplace = True)
 

Creating features

 
Since churn is generally strongly negatively correlated with usage, we will create two new features: the number of kilometers driven per driving day, and a binary flag marking frequent app users (here, a frequent user is one with at least 60 drives and at least 15 driving days in the month).
In [56]:
# Average distance per active driving day; users with 0 driving days
# produce inf, which we reset to 0.
df['km_per_driving_day'] = df['driven_km_drives'] / df['driving_days']
df.loc[df['km_per_driving_day'] == np.inf, 'km_per_driving_day'] = 0
# Frequent user: at least 60 drives and at least 15 driving days.
df['frequent_user'] = np.where((df['drives'] >= 60) & (df['driving_days'] >= 15), 1, 0)
In [57]:
df['km_per_driving_day'].describe()
Out[57]:
count    14999.000000
mean       578.963113
std       1030.094384
min          0.000000
25%        136.238895
50%        272.889272
75%        558.686918
max      15420.234110
Name: km_per_driving_day, dtype: float64
In [58]:
df['frequent_user'].describe()
Out[58]:
count    14999.000000
mean         0.172945
std          0.378212
min          0.000000
25%          0.000000
50%          0.000000
75%          0.000000
max          1.000000
Name: frequent_user, dtype: float64
In [65]:
df.groupby(['frequent_user'])['label'].value_counts(normalize = True)
Out[65]:
frequent_user  label   
0              retained    0.801202
               churned     0.198798
1              retained    0.924437
               churned     0.075563
Name: proportion, dtype: float64
 
The breakdown confirms our intuition: frequent users are significantly less likely to churn compared to non-frequent users. This new feature shows a strong signal and should give our upcoming logistic regression model a helpful predictive boost.
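To quantify how strong this signal is, a chi-squared test of independence between frequent_user and label is a quick follow-up. Here is a self-contained sketch on synthetic data with churn rates similar to those above (illustrative only; on the real dataframe the contingency table would come from `pd.crosstab(df['frequent_user'], df['label'])`):

```python
import numpy as np
import pandas as pd
from scipy.stats import chi2_contingency

# Synthetic data with a built-in association between frequency and churn,
# loosely mimicking the rates observed above (20% vs ~7.5%).
rng = np.random.default_rng(42)
frequent = rng.integers(0, 2, size=2000)
churned = rng.random(2000) < np.where(frequent == 1, 0.07, 0.20)

table = pd.crosstab(frequent, churned)
chi2, p, dof, expected = chi2_contingency(table)
print(f"chi2 = {chi2:.1f}, p = {p:.2e}, dof = {dof}")
```

A tiny p-value here confirms the association is far too large to be chance, which is what we would expect given the gap in churn rates.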
 

Checking assumptions and building the model

In [66]:
df['label2'] = np.where(df['label']=='churned', 1, 0)
df['device2'] = np.where(df['device']=='Android', 0, 1)
 
Before we build a Logistic Regression model, we need to ensure our features aren't too highly correlated with each other. If they are, it violates the no-multicollinearity assumption, which can make our model coefficients unstable. Let's visualize the relationships between our variables using a correlation heatmap.
In [76]:
corr_matrix = df.drop(['label', 'device'], axis = 1).corr(method = "pearson")
sns.heatmap(corr_matrix, vmin=-1, vmax=1, cmap='coolwarm')
plt.show()
Output: correlation heatmap of the numeric features
 
As we can observe in the correlation matrix, sessions and drives are almost perfectly correlated, as are driving_days and activity_days. We will therefore drop one column from each pair to satisfy the no-multicollinearity assumption.
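Beyond eyeballing the heatmap, a variance inflation factor (VIF) per column makes the multicollinearity check quantitative; values above roughly 5 to 10 are the usual red flag. Here is a minimal self-contained sketch using scikit-learn on toy data (on the real data, we would pass the feature matrix instead of `toy`):

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression

def vif(X: pd.DataFrame) -> pd.Series:
    """VIF_j = 1 / (1 - R^2_j), where R^2_j comes from regressing
    column j on all the other columns."""
    out = {}
    for col in X.columns:
        others = X.drop(columns=col)
        r2 = LinearRegression().fit(others, X[col]).score(others, X[col])
        out[col] = 1.0 / (1.0 - r2) if r2 < 1.0 else np.inf
    return pd.Series(out)

# Toy example: b is almost a copy of a, c is independent noise.
rng = np.random.default_rng(0)
a = rng.normal(size=500)
toy = pd.DataFrame({"a": a,
                    "b": a + rng.normal(scale=0.01, size=500),
                    "c": rng.normal(size=500)})
print(vif(toy))  # a and b blow up; c stays near 1
```

Columns whose VIF explodes (like the near-duplicate pair here) are exactly the ones worth dropping, mirroring our decision on sessions/drives and activity_days/driving_days.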
In [79]:
X = df.drop(columns = ['label', 'label2', 'device', 'sessions', 'driving_days'])
y = df['label2']
In [80]:
X_train, X_test, y_train, y_test = train_test_split(X, y, stratify=y, random_state=42)
In [90]:
model = LogisticRegression(penalty=None, max_iter=2000)

model.fit(X_train, y_train)
Out[90]:
LogisticRegression(max_iter=2000, penalty=None)
 

The Linearity Assumption



Logistic regression also assumes a linear relationship between the continuous predictor variables and the log-odds (logit) of the outcome. Let's plot this relationship for activity_days to ensure this assumption holds up before we evaluate our final metrics.
In [91]:
training_probabilities = model.predict_proba(X_train)

logit_data = X_train.copy()

# Log-odds (logit) of churn implied by the model for each training row.
logit_data['logit'] = np.log(training_probabilities[:, 1] / training_probabilities[:, 0])
In [92]:
sns.regplot(x='activity_days', y='logit', data=logit_data, scatter_kws={'s': 2, 'alpha': 0.5})
plt.show()
Output: regplot of activity_days against the model logit
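As a sanity check, the log-odds we computed from predict_proba should match the model's decision_function, since scikit-learn's LogisticRegression exposes the raw linear score directly. A small self-contained demonstration on synthetic data:

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

# Fit a toy logistic regression.
X, y = make_classification(n_samples=200, n_features=5, random_state=42)
model = LogisticRegression(max_iter=1000).fit(X, y)

# Log-odds reconstructed from the predicted probabilities...
proba = model.predict_proba(X)
logit_from_proba = np.log(proba[:, 1] / proba[:, 0])
# ...should equal the raw linear score up to floating-point error.
logit_direct = model.decision_function(X)

print(np.allclose(logit_from_proba, logit_direct))  # True
```

Using decision_function avoids the explicit log-ratio and is numerically safer when probabilities are close to 0 or 1.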
 

Evaluating the model

In [93]:
y_preds = model.predict(X_test)
In [94]:
model.score(X_test, y_test)
Out[94]:
0.832
In [95]:
cm = confusion_matrix(y_test, y_preds)
disp = ConfusionMatrixDisplay(confusion_matrix=cm, display_labels=['retained', 'churned'])
disp.plot()
Out[95]:
<sklearn.metrics._plot.confusion_matrix.ConfusionMatrixDisplay at 0x294050738f0>
Output: confusion matrix plot
In [96]:
target_labels = ['retained', 'churned']
print(classification_report(y_test, y_preds, target_names=target_labels))
              precision    recall  f1-score   support

    retained       0.84      0.99      0.91      3116
     churned       0.53      0.05      0.10       634

    accuracy                           0.83      3750
   macro avg       0.68      0.52      0.50      3750
weighted avg       0.79      0.83      0.77      3750
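Before reading too much into the 83% accuracy, it helps to compare it against the majority-class baseline: with 3116 of the 3750 test rows retained (the support counts from the report above), a model that always predicts "retained" already scores about the same. A quick check:

```python
# Support counts from the classification report on the test set.
retained, churned = 3116, 634

# Accuracy of a trivial model that always predicts "retained".
baseline_accuracy = retained / (retained + churned)
print(round(baseline_accuracy, 3))  # 0.831
```

So our model's 0.832 accuracy barely beats doing nothing at all, which is why the per-class metrics below matter far more here.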
 

Results and next steps



Well, the results are a mixed bag!

While our overall accuracy looks great (83%) and the model is excellent at predicting retained drivers, it struggles significantly with our actual goal: predicting churn.

* Our recall for churned users is a dismal 0.05. That means the model only catches 5% of the users who actually ended up churning. You'd have better luck flipping a coin!
* Logistic regression often struggles with highly imbalanced datasets or complex, non-linear patterns that feature engineering couldn't fully capture.
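Before jumping to new model families, one cheap remedy worth trying is reweighting the classes so that minority-class mistakes cost more during fitting; scikit-learn supports this via class_weight='balanced'. Here is a sketch on synthetic imbalanced data (in the real pipeline we would reuse X_train/y_train from above instead of the generated data):

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import recall_score
from sklearn.model_selection import train_test_split

# Synthetic ~85/15 imbalance, loosely mimicking the churn ratio.
X, y = make_classification(n_samples=5000, weights=[0.85, 0.15],
                           class_sep=0.8, random_state=42)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, stratify=y, random_state=42)

plain = LogisticRegression(max_iter=2000).fit(X_tr, y_tr)
balanced = LogisticRegression(max_iter=2000,
                              class_weight='balanced').fit(X_tr, y_tr)

print("plain recall:   ", recall_score(y_te, plain.predict(X_te)))
print("balanced recall:", recall_score(y_te, balanced.predict(X_te)))
```

The reweighted model typically trades some precision for a large jump in minority-class recall, which is the trade-off we actually want for catching churners.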

This is a perfect cue to try more robust, tree-based models. Algorithms like a Random Forest or an XGBoost classifier generally handle imbalanced data and non-linear relationships much better (to be continued...).