Batchfitting of multiple datasets

This notebook is a demonstration of how to batch fit multiple datasets using refnx. Batch fitting is essential for bulk analysis of hundreds or thousands of datasets. Such situations are becoming more common as faster instruments (such as those being built at the ESS) come online. This example is based on a deuterated polymer film that is gradually being swollen by an hydrogenous solvent. 314 datasets were acquired over a period of a couple of hours.

# some initial imports
import time  # for figuring out how long the fitting takes
from multiprocessing import Pool  # for doing the fitting in parallel
from copy import deepcopy  # for copying objects
import glob  # for listing all files with a certain pattern
from tqdm import tqdm  # progress bars
from pathlib import Path  # file path manipulation
import matplotlib.pyplot as plt  # graphing everything
from natsort import natsorted  # sort the files in alphanumeric order
import numpy as np  # all maths-wise things

# grabs the data
# it's been archived in the refnx-testdata repository.
import zipfile  # unzipping files
import shutil  # copying files
import urllib.request  # for retrieving URL

url = "https://github.com/refnx/refnx-testdata/archive/master.zip"

# download the refnx-test data repo as a zip, and write the zip file
# to the current directory.
with (
    urllib.request.urlopen(url, timeout=5) as response,
    open("master.zip", "wb") as f,
):
    shutil.copyfileobj(response, f)

# extract all data from the zip file
with zipfile.ZipFile("master.zip") as zf:
    zf.extractall()

# the refnx imports
import refnx
from refnx.dataset import Data1D
from refnx.analysis import Objective, CurveFitter
from refnx.reflect import SLD, Slab, ReflectModel, Structure
from refnx.dataset import ReflectDataset

# we print out the version so that others can repeat our analysis
print(refnx.version.version)

0.1.62

# natsort is used for alphanumeric sort, but it's not essential
# An alphanumeric sort typically ensures that the datasets are loaded in
# the order in which they were run.
pth = Path(".", "refnx-testdata-master", "data", "reflect", "batch", "*.dat")
files = natsorted(list(glob.iglob(str(pth))))
print(len(files), files[0])

314 refnx-testdata-master/data/reflect/batch/PLP0027325_0.dat

"""
initial model setup
"""

# set up the SLD objects for each layer
air = SLD(0.0 + 0.0j, name="air_sld")
polymer = SLD(6.14127029941648 + 0.0j, name="d-polymer")
sio2 = SLD(3.47 + 0.0j, name="native SiO2")
si = SLD(2.07 + 0.0j, name="Si")

# set up Slabs corresponding to each layer
polymer_l = Slab(520, polymer, 8.0, name="polymer layer")
sio2_l = Slab(15.0, sio2, 5.0, name="sio2 layer")
si_l = Slab(0.0, si, 3.0, name="Si")

# SLD limits for polymer film
# we allow the lower SLD bound to go low, as it's absorbing
# hydrogenous solvent
polymer.real.setp(vary=True, bounds=(0, 7.0))

# limits for thickness and roughness of the layers
polymer_l.thick.setp(vary=True, bounds=(200, 1020.0))
polymer_l.rough.setp(vary=True, bounds=(2.0, 20.0))
sio2_l.thick.setp(vary=True, bounds=(5.0, 10.0))
sio2_l.rough.setp(vary=True, bounds=(2.0, 10.0))

# set up the Structure object from the Slabs
structure = air | polymer_l | sio2_l | si_l

# make the reflectometry model
model = ReflectModel(structure, scale=1.0, bkg=1e-10, dq=8.6, dq_type="constant")
model.scale.setp(vary=True, bounds=(0.5, 1.5))
model.bkg.setp(vary=True, bounds=(1e-10, 1e-5))
model.threads = 1

"""
Create lists of all the objectives to fit
"""

filenames = []
models = []
objectives = []

# go through the 314 files we need to fit
for idx, file in enumerate(files):
    # load the data
    data = Data1D(data=file)

    # make an objective for each dataset. deepcopy is used so that all the
    # models are independent of each other
    objective = Objective(deepcopy(model), data)

    filenames.append(data.filename)
    models.append(objective.model)
    objectives.append(objective)

Since all the objectives and curvefits are independent we can fit the datasets in parallel. To do this in Python one can use a multiprocessing.Pool object to map over all the datasets. Use of a Pool object requires us to create a fit_an_objective function in a separate Python file to fit each objective (this is because the function needs to be pickleable). The tqdm package can be used to display a progress bar.

from parallel_curvefitter import fit_an_objective

"""
# parallel_curvefitter.py

def fit_an_objective(objective):
    # make the curvefitter and do the fit
    fitter = CurveFitter(objective)
    fitter.fit('differential_evolution', verbose=False, tol=0.05)
    return objective
"""

with Pool() as p:
    obj_out = list(p.map(fit_an_objective, tqdm(objectives)))

100%|██████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 314/314 [00:21<00:00, 14.79it/s]

We fitted all 314 datasets in ~22 seconds, which is 0.07 seconds per fit. If we had fitted in a serial manner this would have taken a lot longer.

# now process the output
thickness = []
sld = []
for objective in obj_out:
    # grab the model from the objective
    model = objective.model

    # extract the thickness/sld parameter from the model structure
    # and append to the thickness/sld list
    thickness.append(model.structure[1].thick.value)
    sld.append(model.structure[1].sld.real.value)

# make a fancy twin plot of the thickness and SLD vs dataset number
fig, ax1 = plt.subplots()
ax1.set_xlabel("dataset #")
ax1.set_ylabel("thickness /$\\AA$")
l_0 = ax1.plot(thickness, "b", label="thickness")
ax2 = ax1.twinx()
l_1 = ax2.plot(sld, "r", label="sld")
ax2.set_ylabel("SLD /$10^{-6}\\AA^{-2}$")

# setup the legend
lines = l_0 + l_1
labels = [line.get_label() for line in lines]
ax1.legend(lines, labels, loc="right");

_images/6c08317cba945a3a4979932edf660cd3317e822ddbe4b623b334e5c02391a9fd.png

Here we see an induction period during which the film thickness and SLD remains constant (even a possible loss of solvent?). After this period the the thickness grows rapidly before with the rate than slowing down. The SLD has a continuous decrease, but the decrease is not significantly faster during the period in which the thickness of the film grows rapidly.