santiviquez committed on
Commit
882c546
1 Parent(s): 1f1c5c3
Files changed (1)
  1. app.py +109 -0
app.py ADDED
@@ -0,0 +1,109 @@
+ import streamlit as st
+ import nannyml as nml
+ from sklearn.metrics import f1_score
+ import numpy as np
+
+ st.title('Is your model degrading?')
+ st.caption('### :violet[_Estimate_] the performance of an ML model. :violet[_Without ground truth_].')
+
+ st.markdown("""
+ If you have previously been exposed to concepts like [covariate shift or concept drift](https://www.nannyml.com/blog/types-of-data-shift),
+ you may be aware that changes in the distribution of
+ the production data can affect the model's performance.
+ """)
+
+ st.markdown("""Recently, a paper from MIT, Harvard, and other institutions showed how [91% of their ML model
+ experiments degraded](https://www.nannyml.com/blog/91-of-ml-perfomance-degrade-in-time) over time.""")
+
+ st.markdown("""Typically, to know whether a model is degrading we need access to ground truth. But most of the time,
+ getting new labeled data is expensive, slow, or impossible. So we end up blind, not
+ knowing how the model is performing in production.
+ """)
+
+ st.markdown("""
+ To overcome this issue, we at NannyML created two methods to :violet[_estimate_] the performance of ML models without needing access to
+ new labeled data. In this demo, we show the **Confidence-based Performance Estimation (CBPE)** method, specifically designed to estimate
+ the performance of **classification** models.
+ """)
+
+ reference_df, analysis_df, analysis_target_df = nml.load_synthetic_car_loan_dataset()
+ test_f1_score = f1_score(reference_df['repaid'], reference_df['y_pred'])
+
+ st.markdown("#### The prediction task")
+
+ st.markdown("""
+ A model was trained to predict whether or not a person will repay their car loan. The model used features like
+ car_value, salary_range, loan_length, etc.
+ """)
+
+ st.dataframe(analysis_df.head(3))
+
+ st.markdown("""
+ We know that the model had a **test F1-score of 0.943**. But what guarantees that the F1-score
+ will stay that good on production data?
+ """)
+
+ st.markdown("#### Estimating the Model Performance")
+ st.markdown("""
+ Instead of waiting for ground truth, we can use NannyML's
+ [CBPE](https://nannyml.readthedocs.io/en/stable/tutorials/performance_estimation/binary_performance_estimation/standard_metric_estimation.html)
+ method to estimate the performance of an ML model.
+
+ CBPE's trick is to use the confidence scores of the ML model. It calibrates the scores to turn them into actual probabilities.
+ Once the probabilities are calibrated, it can estimate any performance metric that can be computed from the elements of the confusion matrix.
+ """)
+
+ chunk_size = st.slider('Chunk/Sample Size', 2500, 7500, 5000, 500)
+ metric = st.selectbox(
+     'Performance Metric',
+     ('f1', 'roc_auc', 'precision', 'recall', 'specificity', 'accuracy'))
+ plot_realized_performance = st.checkbox('Compare NannyML estimation with actual outcomes')
+
+ if st.button('**_Estimate_ Performance**'):
+     with st.spinner('Running...'):
+         estimator = nml.CBPE(
+             y_pred_proba='y_pred_proba',
+             y_pred='y_pred',
+             y_true='repaid',
+             timestamp_column_name='timestamp',
+             metrics=[metric],
+             chunk_size=chunk_size,
+             problem_type='classification_binary'
+         )
+
+         estimator.fit(reference_df)
+         estimated_performance = estimator.estimate(analysis_df)
+
+         if plot_realized_performance:
+             analysis_with_targets = analysis_df.merge(analysis_target_df, left_index=True, right_index=True)
+             calculator = nml.PerformanceCalculator(
+                 y_pred_proba='y_pred_proba',
+                 y_pred='y_pred',
+                 y_true='repaid',
+                 timestamp_column_name='timestamp',
+                 metrics=[metric],
+                 chunk_size=chunk_size,
+                 problem_type='classification_binary'
+             )
+
+             calculator.fit(reference_df)
+             realized_performance = calculator.calculate(analysis_with_targets)
+
+             st.plotly_chart(estimated_performance.compare(realized_performance).plot(), use_container_width=False)
+         else:
+             st.plotly_chart(estimated_performance.plot(), use_container_width=False)
+
+ st.divider()
+
+ st.markdown("""Created by [santiviquez](https://twitter.com/santiviquez) from NannyML""")
+
+ st.markdown("""
+ NannyML is an open-source library for post-deployment data science. Leave us a 🌟 on [GitHub](https://github.com/NannyML/nannyml)
+ or [check our docs](https://nannyml.readthedocs.io/en/stable/landing_page.html) to learn more.
+ """)