.. DO NOT EDIT.
.. THIS FILE WAS AUTOMATICALLY GENERATED BY SPHINX-GALLERY.
.. TO MAKE CHANGES, EDIT THE SOURCE PYTHON FILE:
.. "auto_examples/plot_example.py"
.. LINE NUMBERS ARE GIVEN BELOW.

.. only:: html

    .. note::
        :class: sphx-glr-download-link-note

        :ref:`Go to the end <sphx_glr_download_auto_examples_plot_example.py>`
        to download the full example code

.. rst-class:: sphx-glr-example-title

.. _sphx_glr_auto_examples_plot_example.py:


Calculate stopping criteria
===========================

.. GENERATED FROM PYTHON SOURCE LINES 8-25

Introduction
------------------------

When we do machine learning-prioritised screening in a systematic review, we
can only save work if we have a method for deciding when to stop.
Several methods have been suggested, but those which rely on rules of thumb
(like stopping after N consecutive irrelevant records) are supported by neither
theory nor empirical evaluation. A particular value of N may work well in one
review but badly in another. The right value depends on the size of the dataset,
the prevalence of relevant documents, the effectiveness of the machine learning
algorithm, and a bit of luck. Moreover, such a criterion does not allow
us to say anything about our expected recall, or about our confidence in achieving it.
In `Callaghan and Müller-Hansen, 2020 <https://doi.org/10.1186/s13643-020-01521-4>`_
we offered a theoretically well-motivated stopping
criterion, which we demonstrated was safe to use. It allows you to communicate
your confidence in achieving any arbitrary recall target. This package aims to
make this stopping criterion easy to use for Python users.

.. GENERATED FROM PYTHON SOURCE LINES 27-35

Data
------------------------

Let's initialise some test data. The `generate_dataset` function takes a number
of documents and the prevalence of relevant documents (or uses default values
for both), and simulates a prioritised screening-like process. Documents are
sampled one at a time, and at each draw we are `bias` times more likely to
select a random relevant document than a random irrelevant document.
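
The sampling process described above can be sketched with NumPy alone. This is
an illustrative approximation, not the implementation of `generate_dataset`:
the function name `simulate_biased_screening`, its parameters, and its default
values are all hypothetical.

.. code-block:: python

    import numpy as np


    def simulate_biased_screening(n_docs=2000, prevalence=0.05, bias=10, seed=0):
        """Return relevance labels (1/0) in a simulated screening order."""
        rng = np.random.default_rng(seed)
        n_relevant = int(round(n_docs * prevalence))
        labels = np.zeros(n_docs, dtype=int)
        labels[:n_relevant] = 1
        # Relevant documents get weight `bias`, irrelevant ones weight 1.
        weights = np.where(labels == 1, float(bias), 1.0)
        # Weighted sampling without replacement gives the screening order.
        order = rng.choice(n_docs, size=n_docs, replace=False,
                           p=weights / weights.sum())
        return labels[order]


    screened = simulate_biased_screening()
    # Relevant documents should be concentrated early in the sequence.
    first_half = screened[:1000].sum()
    second_half = screened[1000:].sum()

Plotting ``screened.cumsum()`` should give a curve that rises steeply at first
and then flattens, similar in shape to the one produced by the real function.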

.. GENERATED FROM PYTHON SOURCE LINES 35-45

.. code-block:: default


    import numpy as np
    import pandas as pd
    import matplotlib.pyplot as plt
    from buscarpy import generate_dataset

    df = generate_dataset(random_seed=12345)
    df.relevant.cumsum().plot()
    df.head()


.. image-sg:: /auto_examples/images/sphx_glr_plot_example_001.png
   :alt: plot example
   :srcset: /auto_examples/images/sphx_glr_plot_example_001.png
   :class: sphx-glr-single-img


.. raw:: html

    <div class="output_subarea output_html rendered_html output_result">
    <div>
    <style scoped>
        .dataframe tbody tr th:only-of-type {
            vertical-align: middle;
        }

        .dataframe tbody tr th {
            vertical-align: top;
        }

        .dataframe thead th {
            text-align: right;
        }
    </style>
    <table border="1" class="dataframe">
      <thead>
        <tr style="text-align: right;">
          <th></th>
          <th>relevant</th>
          <th>id</th>
        </tr>
      </thead>
      <tbody>
        <tr>
          <th>0</th>
          <td>0</td>
          <td>0</td>
        </tr>
        <tr>
          <th>1</th>
          <td>0</td>
          <td>1</td>
        </tr>
        <tr>
          <th>2</th>
          <td>0</td>
          <td>2</td>
        </tr>
        <tr>
          <th>3</th>
          <td>0</td>
          <td>3</td>
        </tr>
        <tr>
          <th>4</th>
          <td>0</td>
          <td>4</td>
        </tr>
      </tbody>
    </table>
    </div>
    </div>
    <br />
    <br />

.. GENERATED FROM PYTHON SOURCE LINES 46-70

When is it safe to stop?
------------------------

Let's imagine we've seen just the first 10,000 documents. We can use our stopping criterion
to calculate a p score for the null hypothesis that we have missed so many relevant documents
that we have not achieved our recall target.

If the p score is low, then we can reject that null hypothesis and stop safely.
The lower the score, the more confident we can be about doing this. The p score
is given by calculating the probability of observing the previous sequence
of relevant and irrelevant documents, if there were enough remaining relevant
documents to mean that our recall target had not been achieved.

For example, if we have seen 95 relevant documents,
and our recall target is 95%, then there would have to be at least 6 relevant documents
remaining for us to have missed our target. If we have just observed a sequence of 100 irrelevant
documents in a row, we ask how likely it would be to observe that by random sampling
if there were 6 relevant documents remaining.
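
The arithmetic in this example can be checked with the standard library. The
sketch below is not the code used by the package. The example does not say how
many documents remain, so the urn size of 1,000 is an assumption made purely
for illustration.

.. code-block:: python

    from math import comb

    seen_relevant = 95
    recall_target = 0.95

    # Smallest number of missed relevant documents that puts recall below target.
    min_missed = 0
    while seen_relevant / (seen_relevant + min_missed) >= recall_target:
        min_missed += 1

    # Assumed for illustration: 1,000 documents were unseen before the last
    # 100 draws, all of which turned out to be irrelevant.
    urn_size = 1000
    n_drawn = 100

    # Hypergeometric probability of drawing zero relevant documents.
    p = comb(urn_size - min_missed, n_drawn) / comb(urn_size, n_drawn)

With these assumed numbers, ``min_missed`` is 6 and ``p`` is about 0.53, so a
run of 100 irrelevant documents would be nowhere near enough evidence to stop.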

We can calculate this using the buscarpy package, by passing a list of 1s and 0s
to our `calculate_h0` function, along with the total number of documents
in the dataset. The list of 1s and 0s represents INCLUDE and EXCLUDE decisions
by human screeners, and should be in the order in which the documents were
screened.
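
If your screening decisions are stored as a log rather than as an ordered list,
they need to be sorted and recoded first. A minimal sketch with pandas follows;
the column names (`doc_id`, `screened_at`, `decision`) and the example rows are
hypothetical.

.. code-block:: python

    import pandas as pd

    # Hypothetical screening log; column names and labels are assumptions.
    log = pd.DataFrame({
        "doc_id": [17, 4, 23, 8],
        "screened_at": pd.to_datetime(
            ["2024-01-02", "2024-01-01", "2024-01-04", "2024-01-03"]
        ),
        "decision": ["EXCLUDE", "INCLUDE", "EXCLUDE", "INCLUDE"],
    })

    # Sort into screening order, then map INCLUDE to 1 and EXCLUDE to 0.
    ordered = log.sort_values("screened_at")
    decisions = ordered["decision"].map({"INCLUDE": 1, "EXCLUDE": 0}).tolist()

The resulting ``decisions`` list is in the form that the paragraph above
describes as input for `calculate_h0`.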

.. GENERATED FROM PYTHON SOURCE LINES 70-81

.. code-block:: default


    from buscarpy import calculate_h0
    seen_documents = df['relevant'][:10000]

    fig, ax = plt.subplots()
    seen_documents.cumsum().plot()
    ax.set_xlim(xmax=df.shape[0])

    calculate_h0(seen_documents, df.shape[0])


.. image-sg:: /auto_examples/images/sphx_glr_plot_example_002.png
   :alt: plot example
   :srcset: /auto_examples/images/sphx_glr_plot_example_002.png
   :class: sphx-glr-single-img


.. rst-class:: sphx-glr-script-out

 .. code-block:: none


    0.10980177863883214


.. GENERATED FROM PYTHON SOURCE LINES 82-84

Our p score indicates that we are not yet confident enough to stop screening.
If we "see" an additional 2,000 documents, this will change.

.. GENERATED FROM PYTHON SOURCE LINES 84-93

.. code-block:: default


    seen_documents = df['relevant'][:12000]

    fig, ax = plt.subplots()
    seen_documents.cumsum().plot()
    ax.set_xlim(xmax=df.shape[0])

    calculate_h0(seen_documents, df.shape[0])


.. image-sg:: /auto_examples/images/sphx_glr_plot_example_003.png
   :alt: plot example
   :srcset: /auto_examples/images/sphx_glr_plot_example_003.png
   :class: sphx-glr-single-img


.. rst-class:: sphx-glr-script-out

 .. code-block:: none


    0.009418930806106517


.. GENERATED FROM PYTHON SOURCE LINES 94-95

We can now be **very confident** that we have **not missed** our recall target.

.. GENERATED FROM PYTHON SOURCE LINES 97-102

Changing recall targets
------------------------

We can calculate the same stopping criterion for a different **recall target**,
simply by using the `recall_target` argument in `calculate_h0`.

.. GENERATED FROM PYTHON SOURCE LINES 102-105

.. code-block:: default


    calculate_h0(seen_documents, df.shape[0], recall_target=0.99)


.. rst-class:: sphx-glr-script-out

 .. code-block:: none


    0.2803368882417732


.. GENERATED FROM PYTHON SOURCE LINES 106-114

If we increase the recall target, we become **less** confident that we
have **not missed** our target.

In many practical cases, we may not be very confident in one target, but much
more confident in a target that is only a little smaller. The `recall_frontier`
function calculates and plots the p score for several different recall targets,
helping to inform and transparently communicate our decision about the safety
of stopping screening at any given point.

.. GENERATED FROM PYTHON SOURCE LINES 114-119

.. code-block:: default


    from buscarpy import recall_frontier

    recall_frontier(seen_documents, df.shape[0])


.. image-sg:: /auto_examples/images/sphx_glr_plot_example_004.png
   :alt: plot example
   :srcset: /auto_examples/images/sphx_glr_plot_example_004.png
   :class: sphx-glr-single-img


.. rst-class:: sphx-glr-script-out

 .. code-block:: none


    {'recall_target': [0.99, 0.985, 0.98, 0.975, 0.97, 0.965, 0.96, 0.955, 0.95], 'p': [0.2803368882417732, 0.18345778193694434, 0.12005306689318333, 0.07855821491332465, 0.051403320783917274, 0.033633492389149526, 0.022005638245933937, 0.014397171706505495, 0.009418930806106517]}
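
One way to read this output is to look for the highest recall target whose
null hypothesis can be rejected at a chosen threshold. The sketch below uses
plain Python on values copied (and rounded) from the output above. The
threshold `alpha = 0.05` is an assumption, not a package default.

.. code-block:: python

    # Values copied (rounded) from the `recall_frontier` output above.
    frontier = {
        "recall_target": [0.99, 0.985, 0.98, 0.975, 0.97, 0.965, 0.96, 0.955, 0.95],
        "p": [0.2803, 0.1835, 0.1201, 0.0786, 0.0514, 0.0336, 0.0220, 0.0144, 0.0094],
    }
    alpha = 0.05  # assumed significance threshold

    # Recall targets whose null hypothesis can be rejected at level alpha.
    supported = [
        target
        for target, p in zip(frontier["recall_target"], frontier["p"])
        if p < alpha
    ]
    best_target = max(supported) if supported else None

Here ``best_target`` is 0.965: at this point in screening we could reject the
hypothesis of having missed a 96.5% recall target, but not a 97% one.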


.. GENERATED FROM PYTHON SOURCE LINES 120-128

Retrospective stopping criteria
-------------------------------

The package also includes a helper function to calculate the stopping criterion
at each point on a curve that has already been seen. By default we calculate
this after each batch of 1,000 documents. Change the `batch_size` to alter this,
though be warned that reducing it will increase the number of calculations that
need to be made.

.. GENERATED FROM PYTHON SOURCE LINES 128-133

.. code-block:: default


    from buscarpy import retrospective_h0

    retrospective_h0(seen_documents, df.shape[0])


.. image-sg:: /auto_examples/images/sphx_glr_plot_example_005.png
   :alt: plot example
   :srcset: /auto_examples/images/sphx_glr_plot_example_005.png
   :class: sphx-glr-single-img


.. rst-class:: sphx-glr-script-out

 .. code-block:: none


    {'batch_sizes': array([ 1000,  2000,  3000,  4000,  5000,  6000,  7000,  8000,  9000,
           10000, 11000]), 'p': array([0.99606155, 0.99302676, 0.97631598, 0.99875051, 0.97048041,
           0.89350356, 0.68316963, 0.81479109, 0.31343407, 0.10980178,
           0.0344359 ])}


.. GENERATED FROM PYTHON SOURCE LINES 134-157

Biased urns
------------------------

The stopping criterion, which is described in
`Callaghan and Müller-Hansen, 2020 <https://doi.org/10.1186/s13643-020-01521-4>`_,
assumes that the documents we have screened previously were drawn *at random*
from the remaining records. This assumption is *conservative*, as the
machine-learning process should make it more likely that we pick a relevant
document than an irrelevant document.

Being conservative, this stopping criterion is safe to use (and evaluations
show that it is wrong less than 5% of the time if the confidence level is set
to 95%), but its conservative nature means that we will stop later than we
strictly need to.

Biased urn theory offers us a more realistic set of assumptions, as it
describes the probability distribution in a situation where we are more
likely to select one type of item than another. We can implement this in
buscarpy by setting the `bias` parameter of our functions. `bias` describes
how much more likely we are to select a relevant document than an irrelevant one.

However, estimating this parameter is non-trivial, and work
on how to do this safely is currently ongoing.
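
The effect of a bias on the underlying probability can be sketched with SciPy.
The example assumes Wallenius' noncentral hypergeometric distribution as the
biased urn model (the text above does not say which variant buscarpy uses), and
the urn size, number of draws, and odds of 5 are illustrative values only.

.. code-block:: python

    from scipy.stats import hypergeom, nchypergeom_wallenius

    urn_size = 1000   # assumed number of documents in the urn
    n_relevant = 6    # relevant documents remaining under the null hypothesis
    n_drawn = 100     # documents drawn, none of which was relevant

    # Unbiased urn: ordinary hypergeometric distribution.
    p_unbiased = hypergeom.pmf(0, urn_size, n_relevant, n_drawn)

    # Biased urn: relevant documents are 5 times more likely to be drawn.
    p_biased = nchypergeom_wallenius.pmf(0, urn_size, n_relevant, n_drawn, 5)

With a bias greater than 1, drawing no relevant documents is less likely under
the null hypothesis, so ``p_biased`` is smaller than ``p_unbiased`` and the
same evidence supports stopping earlier.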

.. GENERATED FROM PYTHON SOURCE LINES 157-172

.. code-block:: default


    h0_1 = retrospective_h0(seen_documents, df.shape[0], bias=1, plot=False)
    h0_5 = retrospective_h0(seen_documents, df.shape[0], bias=5, plot=False)

    fig, ax = plt.subplots()

    ax.plot(seen_documents.cumsum())
    ax2 = ax.twinx()
    ax2.scatter(h0_1['batch_sizes'], h0_1['p'], label='Unbiased')
    ax2.scatter(h0_5['batch_sizes'], h0_5['p'], label='Bias==5')
    ax.set_xlabel('Documents screened')
    ax.set_ylabel('Relevant documents identified')
    ax2.set_ylabel('p score for H0 that recall target missed')

    ax2.legend()


.. image-sg:: /auto_examples/images/sphx_glr_plot_example_006.png
   :alt: plot example
   :srcset: /auto_examples/images/sphx_glr_plot_example_006.png
   :class: sphx-glr-single-img


.. rst-class:: sphx-glr-script-out

 .. code-block:: none


    <matplotlib.legend.Legend object at 0x7f55e0951cf0>


.. rst-class:: sphx-glr-timing

   **Total running time of the script:** (0 minutes 14.254 seconds)


.. _sphx_glr_download_auto_examples_plot_example.py:

.. only:: html

  .. container:: sphx-glr-footer sphx-glr-footer-example


    .. container:: sphx-glr-download sphx-glr-download-python

      :download:`Download Python source code: plot_example.py <plot_example.py>`

    .. container:: sphx-glr-download sphx-glr-download-jupyter

      :download:`Download Jupyter notebook: plot_example.ipynb <plot_example.ipynb>`


.. only:: html

 .. rst-class:: sphx-glr-signature

    `Gallery generated by Sphinx-Gallery <https://sphinx-gallery.github.io>`_