
.. DO NOT EDIT.
.. THIS FILE WAS AUTOMATICALLY GENERATED BY SPHINX-GALLERY.
.. TO MAKE CHANGES, EDIT THE SOURCE PYTHON FILE:
.. "examples/lmfit_emcee_model_selection.py"
.. LINE NUMBERS ARE GIVEN BELOW.

.. only:: html

    .. note::
        :class: sphx-glr-download-link-note

        :ref:`Go to the end <sphx_glr_download_examples_lmfit_emcee_model_selection.py>`
        to download the full example code.

.. rst-class:: sphx-glr-example-title

.. _sphx_glr_examples_lmfit_emcee_model_selection.py:


Model Selection using lmfit and emcee
=====================================

FIXME: this is a useful example; however, it no longer runs correctly because
the parallel-tempering `PTSampler` was removed in emcee v3...

.. GENERATED FROM PYTHON SOURCE LINES 10-13

`lmfit.emcee` can be used to obtain the posterior probability distribution
of parameters, given a set of experimental data. This notebook shows how it
can be used for Bayesian model selection.

.. GENERATED FROM PYTHON SOURCE LINES 13-19

.. code-block:: Python


    import matplotlib.pyplot as plt
    import numpy as np

    import lmfit


.. GENERATED FROM PYTHON SOURCE LINES 20-21

Define a Gaussian lineshape and generate some data:

.. GENERATED FROM PYTHON SOURCE LINES 21-34

.. code-block:: Python



    def gauss(x, a_max, loc, sd):
        return a_max * np.exp(-((x - loc) / sd)**2)


    x = np.linspace(3, 7, 250)
    np.random.seed(0)
    y = 4 + 10 * x + gauss(x, 200, 5, 0.5) + gauss(x, 60, 5.8, 0.2)
    dy = np.sqrt(y)
    y += dy * np.random.randn(y.size)



.. GENERATED FROM PYTHON SOURCE LINES 35-36

Plot the data:

.. GENERATED FROM PYTHON SOURCE LINES 36-38

.. code-block:: Python

    plt.errorbar(x, y, yerr=dy)


.. GENERATED FROM PYTHON SOURCE LINES 39-40

Define the normalised residual for the data:

.. GENERATED FROM PYTHON SOURCE LINES 40-55

.. code-block:: Python



    def residual(p, just_generative=False):
        v = p.valuesdict()
        generative = v['a'] + v['b'] * x
        # add however many Gaussian components are present in the parameter set
        M = 0
        while f'a_max{M}' in v:
            generative += gauss(x, v[f'a_max{M}'], v[f'loc{M}'], v[f'sd{M}'])
            M += 1

        if just_generative:
            return generative
        return (generative - y) / dy



.. GENERATED FROM PYTHON SOURCE LINES 56-57

Create a Parameter set for the initial guesses:

.. GENERATED FROM PYTHON SOURCE LINES 57-79

.. code-block:: Python

    def initial_peak_params(M):
        p = lmfit.Parameters()

        # a and b give a linear background
        a = np.mean(y)
        b = 1

        # a_max, loc and sd are the amplitude, location and SD of each Gaussian
        # component
        a_max = np.max(y)
        loc = np.mean(x)
        sd = (np.max(x) - np.min(x)) * 0.5

        p.add_many(('a', a, True, 0, 10), ('b', b, True, 1, 15))

        for i in range(M):
            p.add_many((f'a_max{i}', 0.5 * a_max, True, 10, a_max),
                       (f'loc{i}', loc, True, np.min(x), np.max(x)),
                       (f'sd{i}', sd, True, 0.1, np.max(x) - np.min(x)))
        return p



.. GENERATED FROM PYTHON SOURCE LINES 80-81

Solving with `minimize` gives the Maximum Likelihood solution.

.. GENERATED FROM PYTHON SOURCE LINES 81-86

.. code-block:: Python

    p1 = initial_peak_params(1)
    mi1 = lmfit.minimize(residual, p1, method='differential_evolution')

    lmfit.printfuncs.report_fit(mi1.params, min_correl=0.5)


.. GENERATED FROM PYTHON SOURCE LINES 87-103

From inspection of the data above we can tell that there is going to be more
than 1 Gaussian component, but how many are there? A Bayesian approach can
be used for this model selection problem. We can do this with `lmfit.emcee`,
which uses the `emcee` package to do a Markov Chain Monte Carlo sampling of
the posterior probability distribution. `lmfit.emcee` requires a function
that returns the log-posterior probability. The log-posterior probability is
the sum of the log-prior probability and the log-likelihood.

The log-prior probability encodes information about what you already believe
about the system. `lmfit.emcee` assumes that this log-prior probability is
zero if all the parameters are within their bounds and `-np.inf` if any of
the parameters are outside their bounds. As such it's a uniform prior.

The log-likelihood function is given below. To use non-uniform priors you
would need to include those terms in `lnprob` yourself. The function below is
the log-likelihood alone, and it is what is used for the sampling.

.. GENERATED FROM PYTHON SOURCE LINES 103-110

.. code-block:: Python



    def lnprob(p):
        # evaluate the generative model for these parameters, then return the
        # Gaussian log-likelihood of the data given that model
        generative = residual(p, just_generative=True)
        return -0.5 * np.sum(((generative - y) / dy)**2 + np.log(2 * np.pi * dy**2))



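For intuition, the flat prior described above behaves like the following
log-prior (a minimal sketch, not part of the generated example; `lmfit.emcee`
performs this bounds check internally, so you do not need to write it):

.. code-block:: Python

    def lnprior(p):
        # sketch of the implicit uniform prior: 0 if every parameter lies
        # inside its bounds, -np.inf if any parameter falls outside them
        for par in p.values():
            if not (par.min <= par.value <= par.max):
                return -np.inf
        return 0.0
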
.. GENERATED FROM PYTHON SOURCE LINES 111-117

To start with we have to create the minimizers and *burn* them in. We create
4 different minimizers representing 0, 1, 2 or 3 Gaussian contributions. To
do the model selection we have to integrate over the log-posterior
distribution to see which model has the highest probability. This is done
using the
`thermodynamic_integration_log_evidence` method of the `sampler` attribute
contained in the `lmfit.Minimizer` object.

.. GENERATED FROM PYTHON SOURCE LINES 117-148

.. code-block:: Python


    # Work out the log-evidence for different numbers of peaks:
    total_steps = 310
    burn = 300
    thin = 10
    ntemps = 15
    workers = 1  # the multiprocessing does not work with sphinx-gallery
    log_evidence = []
    res = []

    # set up the Minimizers
    for i in range(4):
        p0 = initial_peak_params(i)
        # you can't use lnprob as a userfcn with minimize because it needs to be
        # maximised
        mini = lmfit.Minimizer(residual, p0)
        out = mini.minimize(method='differential_evolution')
        res.append(out)

    mini = []
    # burn in the samplers
    for i in range(4):
        # do the sampling
        mini.append(lmfit.Minimizer(lnprob, res[i].params))
        out = mini[i].emcee(steps=total_steps, ntemps=ntemps, workers=workers,
                            reuse_sampler=False, float_behavior='posterior',
                            progress=False)
        # get the evidence
        print(i, total_steps, mini[i].sampler.thermodynamic_integration_log_evidence())
        log_evidence.append(mini[i].sampler.thermodynamic_integration_log_evidence()[0])


.. GENERATED FROM PYTHON SOURCE LINES 149-151

Once we've burned in the samplers we have to do a collection run. We thin
out the MCMC chain to reduce autocorrelation between successive samples.

.. GENERATED FROM PYTHON SOURCE LINES 151-166

.. code-block:: Python

    for j in range(6):
        total_steps += 100
        for i in range(4):
            # do the sampling
            res = mini[i].emcee(burn=burn, steps=100, thin=thin, ntemps=ntemps,
                                workers=workers, reuse_sampler=True, progress=False)
            # get the evidence
            print(i, total_steps, mini[i].sampler.thermodynamic_integration_log_evidence())
            log_evidence.append(mini[i].sampler.thermodynamic_integration_log_evidence()[0])


    plt.plot(log_evidence[-4:])
    plt.ylabel('Log-evidence')
    plt.xlabel('number of peaks')


.. GENERATED FROM PYTHON SOURCE LINES 167-171

The Bayes factor is the exponential of the difference between the
log-evidence values. Thus, zero peaks is very unlikely compared to one peak,
one peak is not as good as two peaks, and three peaks is not much better than
two peaks.
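
Concretely, for two models with evidences :math:`Z_i` and :math:`Z_j` the
Bayes factor is

.. math::

    B_{ij} = \frac{Z_i}{Z_j} = \exp(\log Z_i - \log Z_j),

which is what the ratios computed below represent.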

.. GENERATED FROM PYTHON SOURCE LINES 171-177

.. code-block:: Python

    r01 = np.exp(log_evidence[-4] - log_evidence[-3])
    r12 = np.exp(log_evidence[-3] - log_evidence[-2])
    r23 = np.exp(log_evidence[-2] - log_evidence[-1])

    print(r01, r12, r23)


.. GENERATED FROM PYTHON SOURCE LINES 178-198

These numbers tell us that the zero-peak model is essentially ruled out
compared to one peak. Two peaks is about 7e49 times more likely than one
peak, and three peaks is only about 1.1 times more likely than two peaks.
With this data one would say that two peaks is sufficient. Some caution is
needed with these values, however. The log-priors for
this sampling are uniform but improper, i.e. they are not normalised properly.
Internally the lnprior probability is calculated as 0 if all parameters are
within their bounds and `-np.inf` if any parameter is outside the bounds.
The `lnprob` function defined above is the log-likelihood alone. Remember,
that the log-posterior probability is equal to the sum of the log-prior and
log-likelihood probabilities. Extra terms can be added to the lnprob function
to calculate the normalised log-probability. These terms would look something
like:

.. math::

    \log (\prod_i \frac{1}{max_i - min_i})

where :math:`max_i` and :math:`min_i` are the upper and lower bounds for the
parameter, and the prior is a uniform distribution. Other types of prior are
possible. For example, you might expect the prior to be Gaussian.
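
As a minimal sketch (assuming you wanted to modify this example; the helper
name `lnprob_normalised` is hypothetical and not part of the original code),
the normalisation term for the uniform priors could be added like this:

.. code-block:: Python

    def lnprob_normalised(p):
        # sketch only: log(prod_i 1 / (max_i - min_i)) for the uniform priors,
        # added to the log-likelihood defined earlier
        log_prior_norm = -np.sum([np.log(par.max - par.min) for par in p.values()])
        return log_prior_norm + lnprob(p)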


.. _sphx_glr_download_examples_lmfit_emcee_model_selection.py:

.. only:: html

  .. container:: sphx-glr-footer sphx-glr-footer-example

    .. container:: sphx-glr-download sphx-glr-download-jupyter

      :download:`Download Jupyter notebook: lmfit_emcee_model_selection.ipynb <lmfit_emcee_model_selection.ipynb>`

    .. container:: sphx-glr-download sphx-glr-download-python

      :download:`Download Python source code: lmfit_emcee_model_selection.py <lmfit_emcee_model_selection.py>`

    .. container:: sphx-glr-download sphx-glr-download-zip

      :download:`Download zipped: lmfit_emcee_model_selection.zip <lmfit_emcee_model_selection.zip>`


.. only:: html

 .. rst-class:: sphx-glr-signature

    `Gallery generated by Sphinx-Gallery <https://sphinx-gallery.github.io>`_
