import streamlit as st
import nannyml as nml
from sklearn.metrics import f1_score
import numpy as np

st.title('Is your model degrading?')
st.caption('### :violet[_Estimate_] the performance of an ML model. :violet[_Without ground truth_].')

st.markdown("""
If you have previously been exposed to concepts like [covariate shift or concept drift](https://www.nannyml.com/blog/types-of-data-shift),
you may be aware that changes in the distribution of
the production data can affect a model's performance.
""")

st.markdown("""A recent paper from MIT, Harvard, and other institutions showed that [91% of their ML model
experiments degraded](https://www.nannyml.com/blog/91-of-ml-perfomance-degrade-in-time) over time.""")

st.markdown("""Typically, we need access to ground truth to know if a model is degrading.
But most of the time, getting new labeled data is expensive, time-consuming, or impossible.
So we end up blindless without knowing how the model performs in production. 
""")

st.markdown("""
To overcome this issue, we at NannyML created two methods to :violet[_estimate_] the performance of ML models without needing access to
new labeled data. In this demo, we show the **Confidence-based Performance Estimation (CBPE)** method, specially designed to estimate
the performance of **classification** models.
""")

# Load NannyML's synthetic car loan dataset: the reference set has predictions and
# ground truth ('repaid'), while the analysis set plays the role of unlabeled production data.
reference_df, analysis_df, analysis_target_df = nml.load_synthetic_car_loan_dataset()
test_f1_score = f1_score(reference_df['repaid'], reference_df['y_pred'])

st.markdown("#### The prediction task")

st.markdown("""
A model was trained to predict whether or not a person will repay their car loan, using features like
car_value, salary_range, loan_length, etc.
""")

st.dataframe(analysis_df.head(3))

st.markdown(f"""
We know that the model had a **test F1-score of {test_f1_score:.3f}**. But what guarantees that the F1-score
will stay that good on production data?
""")

st.markdown("#### Estimating the Model Performance")
st.markdown("""
Instead of waiting for ground truth, we can use NannyML's
[CBPE](https://nannyml.readthedocs.io/en/stable/tutorials/performance_estimation/binary_performance_estimation/standard_metric_estimation.html)
method to estimate the performance of an ML model.

CBPE's trick is to use the confidence scores of the ML model. It calibrates the scores to turn them into actual probabilities.
Once the probabilities are calibrated, it can estimate any performance metric that can be computed from the elements of the confusion matrix.
""")

# Let the user pick the chunk size (rows per chunk) and the metric to estimate.
chunk_size = st.slider('Chunk/Sample Size', 2500, 7500, 5000, 500)
metric = st.selectbox(
    'Performance Metric',
    ('f1', 'roc_auc', 'precision', 'recall', 'specificity', 'accuracy'))
plot_realized_performance = st.checkbox('Compare NannyML estimation with actual outcomes')

if st.button('**_Estimate_ Performance**'):
    with st.spinner('Running...'):
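        # Configure CBPE with the column names used in the reference/analysis dataframes;
        # performance is estimated per chunk of `chunk_size` rows.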
        estimator = nml.CBPE(
            y_pred_proba='y_pred_proba',
            y_pred='y_pred',
            y_true='repaid',
            timestamp_column_name='timestamp',
            metrics=[metric],
            chunk_size=chunk_size,
            problem_type='classification_binary'
        )

        # Fit on the reference set (where outcomes are known), then estimate
        # performance on the analysis set, where no targets are available.
        estimator.fit(reference_df)
        estimated_performance = estimator.estimate(analysis_df)

        if plot_realized_performance:
            # Join the delayed ground truth back onto the analysis set so the realized
            # (actual) performance can be computed and compared with CBPE's estimate.
            analysis_with_targets = analysis_df.merge(analysis_target_df, left_index=True, right_index=True)
            calculator = nml.PerformanceCalculator(
                y_pred_proba='y_pred_proba',
                y_pred='y_pred',
                y_true='repaid',
                timestamp_column_name='timestamp',
                metrics=[metric],
                chunk_size=chunk_size,
                problem_type='classification_binary'
            )
            
            # The PerformanceCalculator computes the realized (actual) performance from
            # the now-available targets, giving us a baseline to compare against.
            calculator.fit(reference_df)
            realized_performance = calculator.calculate(analysis_with_targets)

            st.plotly_chart(estimated_performance.compare(realized_performance).plot(), use_container_width=False)

        else:
            st.plotly_chart(estimated_performance.plot(), use_container_width=False)


st.divider()

st.markdown("""Created by [santiviquez](https://twitter.com/santiviquez) from NannyML.""")

st.markdown("""
NannyML is an open-source library for post-deployment data science. Leave us a 🌟 on [GitHub](https://github.com/NannyML/nannyml)
or [check out our docs](https://nannyml.readthedocs.io/en/stable/landing_page.html) to learn more.
""")