santiviquez committed
Commit 882c546 • 1 Parent(s): 1f1c5c3
app
app.py ADDED
import streamlit as st
import nannyml as nml
from sklearn.metrics import f1_score

st.title('Is your model degrading?')
st.caption('### :violet[_Estimate_] the performance of an ML model. :violet[_Without ground truth_].')

st.markdown("""
If you have been exposed to concepts like [covariate shift or concept drift](https://www.nannyml.com/blog/types-of-data-shift),
you may be aware that changes in the distribution of the production data can affect a model's performance.
""")

st.markdown("""A recent paper from MIT, Harvard and other institutions showed that [91% of their ML model
experiments degraded](https://www.nannyml.com/blog/91-of-ml-perfomance-degrade-in-time) over time.""")

st.markdown("""Typically, to know whether a model is degrading we need access to ground truth. But most of the time,
getting new labeled data is expensive, slow or outright impossible. So we end up blind, not
knowing how the model is performing in production.
""")

st.markdown("""
To overcome this issue, we at NannyML created two methods to :violet[_estimate_] the performance of ML models without access to
new labeled data. This demo shows the **Confidence-based Performance Estimation (CBPE)** method, specifically designed to estimate
the performance of **classification** models.
""")

# reference: a period with known targets (e.g. the test set); analysis: production data
# without targets; analysis_target: the withheld targets, used later for comparison
reference_df, analysis_df, analysis_target_df = nml.load_synthetic_car_loan_dataset()
test_f1_score = f1_score(reference_df['repaid'], reference_df['y_pred'])

st.markdown("#### The prediction task")

st.markdown("""
A model was trained to predict whether or not a person will repay their car loan. The model uses features like
car_value, salary_range, loan_length, etc.
""")

st.dataframe(analysis_df.head(3))

st.markdown(f"""
We know that the model had a **test F1-score of {test_f1_score:.3f}**. But what guarantees that the F1-score
will stay that high on production data?
""")

st.markdown("#### Estimating the Model Performance")
st.markdown("""
Instead of waiting for ground truth, we can use NannyML's
[CBPE](https://nannyml.readthedocs.io/en/stable/tutorials/performance_estimation/binary_performance_estimation/standard_metric_estimation.html)
method to estimate the performance of an ML model.

CBPE's trick is to use the confidence scores of the ML model. It calibrates the scores to turn them into actual probabilities.
Once the probabilities are calibrated, it can estimate any performance metric that can be computed from the confusion matrix elements.
""")
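# A back-of-the-envelope sketch of that idea (illustration only: the real method
# first calibrates the scores on reference data and works chunk by chunk): with
# calibrated probabilities p and hard predictions y_hat, the expected confusion
# matrix can be filled in without labels, and any metric derived from it follows.
_p, _y_hat = analysis_df['y_pred_proba'], analysis_df['y_pred']
_exp_tp = _p[_y_hat == 1].sum()        # expected true positives among predicted positives
_exp_fp = (1 - _p[_y_hat == 1]).sum()  # expected false positives among predicted positives
_exp_fn = _p[_y_hat == 0].sum()        # expected false negatives among predicted negatives
_sketch_f1 = 2 * _exp_tp / (2 * _exp_tp + _exp_fp + _exp_fn)  # F1 estimated with no labels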
chunk_size = st.slider('Chunk/Sample Size', 2500, 7500, 5000, 500)
metric = st.selectbox(
    'Performance Metric',
    ('f1', 'roc_auc', 'precision', 'recall', 'specificity', 'accuracy'))
plot_realized_performance = st.checkbox('Compare NannyML estimation with actual outcomes')

if st.button('**_Estimate_ Performance**'):
    with st.spinner('Running...'):
        # fit CBPE on the reference period, then estimate performance
        # on the analysis (production) period, chunk by chunk
        estimator = nml.CBPE(
            y_pred_proba='y_pred_proba',
            y_pred='y_pred',
            y_true='repaid',
            timestamp_column_name='timestamp',
            metrics=[metric],
            chunk_size=chunk_size,
            problem_type='classification_binary'
        )

        estimator.fit(reference_df)
        estimated_performance = estimator.estimate(analysis_df)

        if plot_realized_performance:
            # bring back the withheld targets to compute the realized performance
            analysis_with_targets = analysis_df.merge(analysis_target_df, left_index=True, right_index=True)
            calculator = nml.PerformanceCalculator(
                y_pred_proba='y_pred_proba',
                y_pred='y_pred',
                y_true='repaid',
                timestamp_column_name='timestamp',
                metrics=[metric],
                chunk_size=chunk_size,
                problem_type='classification_binary'
            )

            calculator.fit(reference_df)
            realized_performance = calculator.calculate(analysis_with_targets)

            st.plotly_chart(estimated_performance.compare(realized_performance).plot(), use_container_width=False)
        else:
            st.plotly_chart(estimated_performance.plot(), use_container_width=False)


st.divider()


st.markdown("""Created by [santiviquez](https://twitter.com/santiviquez) from NannyML""")

st.markdown("""
NannyML is an open-source library for post-deployment data science. Leave us a 🌟 on [GitHub](https://github.com/NannyML/nannyml)
or [check our docs](https://nannyml.readthedocs.io/en/stable/landing_page.html) to learn more.
""")
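# To try the app locally, the usual Streamlit workflow should work, e.g.:
#     pip install streamlit nannyml
#     streamlit run app.py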