Preprocessing
Preprocessing lives in the alchemrs::prep module, which holds most of the scientifically important workflow details.
Core concepts
The prep crate supports:
- duplicate-time cleanup
- sorting by time
- optional time slicing
- equilibration detection
- decorrelation / subsampling
The main option struct is DecorrelationOptions.
Default values are:
- drop_duplicates = true
- sort = true
- conservative = true
- remove_burnin = false
- fast = false
- nskip = 1
- lower = None
- upper = None
- step = None
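As a reading aid, the defaults listed above can be mirrored in a small stand-alone struct. This is only a sketch: DecorrelationOptionsSketch is a hypothetical name, and the field types (usize, Option<f64>, Option<usize>) are assumptions, not the crate's actual definition.

```rust
/// Illustrative mirror of the documented defaults.
/// NOT the crate's real struct; the field types are assumptions of this sketch.
#[derive(Debug, Clone, PartialEq)]
pub struct DecorrelationOptionsSketch {
    pub drop_duplicates: bool,
    pub sort: bool,
    pub conservative: bool,
    pub remove_burnin: bool,
    pub fast: bool,
    pub nskip: usize,
    pub lower: Option<f64>,
    pub upper: Option<f64>,
    pub step: Option<usize>,
}

impl Default for DecorrelationOptionsSketch {
    fn default() -> Self {
        Self {
            drop_duplicates: true,
            sort: true,
            conservative: true,
            remove_burnin: false,
            fast: false,
            nskip: 1,
            lower: None,
            upper: None,
            step: None,
        }
    }
}
```

With a struct of this shape, a single option can be changed through struct update syntax, for example DecorrelationOptionsSketch { fast: true, ..Default::default() }.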
Time cleanup
Before decorrelation or equilibration detection, the prep crate can:
- drop duplicate time values
- sort by time
This matters because the timeseries logic assumes one contiguous series ordered in time.
For UNkMatrix, cleanup preserves row/state alignment by selecting or reordering entire rows, not individual values.
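The row-level behaviour can be sketched as follows. Row and clean_rows are hypothetical names used for illustration, and keeping the first row seen for a repeated time is an assumption of the sketch; the crate may resolve duplicates differently.

```rust
/// One sample: a time stamp plus the full row of reduced energies.
#[derive(Debug, Clone, PartialEq)]
pub struct Row {
    pub time: f64,
    pub values: Vec<f64>,
}

/// Sort rows by time, then drop rows whose time repeats.
/// Assumption of this sketch: the first row seen for a given time is kept.
pub fn clean_rows(mut rows: Vec<Row>) -> Vec<Row> {
    // sort_by is stable, so rows with equal times keep their input order.
    rows.sort_by(|a, b| a.time.total_cmp(&b.time));
    // dedup_by removes `later` when the closure returns true;
    // `earlier` is the element that stays in the vector.
    rows.dedup_by(|later, earlier| later.time == earlier.time);
    rows
}
```

Because whole Row values are moved, each retained time stays attached to the energies that were recorded with it.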
Time slicing
lower, upper, and step allow simple time-domain slicing before decorrelation.
These options are applied to:
- the series itself for DhdlSeries
- both the selected scalar observable and the aligned u_nk rows for UNkMatrix
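A minimal sketch of index selection for a time slice is shown below. The exact semantics are assumptions here: bounds are treated as inclusive and step as a stride over the rows that survive the bounds (the alchemlyb-style reading), which may not match alchemrs in every detail. slice_indices is a hypothetical helper.

```rust
/// Indices of the samples kept by a time slice.
/// Assumptions of this sketch: `lower` and `upper` are inclusive time bounds,
/// and `step` is a row stride applied after the bounds, not a time interval.
pub fn slice_indices(
    times: &[f64],
    lower: Option<f64>,
    upper: Option<f64>,
    step: Option<usize>,
) -> Vec<usize> {
    // A missing or zero step falls back to keeping every row.
    let stride = step.unwrap_or(1).max(1);
    times
        .iter()
        .enumerate()
        .filter(|&(_, &t)| {
            lower.map_or(true, |lo| t >= lo) && upper.map_or(true, |hi| t <= hi)
        })
        .map(|(i, _)| i)
        .step_by(stride)
        .collect()
}
```

Returning indices rather than values lets the same selection be applied to a scalar observable and to the aligned matrix rows.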
Statistical inefficiency and g
Decorrelation is based on an estimate of statistical inefficiency g.
The implementation follows the same basic pymbar.timeseries logic:
- center the series
- estimate autocorrelation contributions at increasing lag
- stop once the correlation function becomes non-positive after a minimum lag
- clamp g to at least 1
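The steps above can be sketched in a few lines of Rust. This follows the pymbar.timeseries recipe for a single series with one-lag increments; it is not the crate's source. The function name, the min_lag parameter (standing in for pymbar's mintime, which defaults to 3), and the choice to return None for a zero-variance series are assumptions of the sketch.

```rust
/// Estimate the statistical inefficiency g of a scalar series.
/// Returns None for fewer than two samples or zero variance
/// (a choice made for this sketch).
pub fn statistical_inefficiency(series: &[f64], min_lag: usize) -> Option<f64> {
    let n = series.len();
    if n < 2 {
        return None;
    }
    let nf = n as f64;

    // Center the series.
    let mean = series.iter().sum::<f64>() / nf;
    let centered: Vec<f64> = series.iter().map(|x| x - mean).collect();
    let variance = centered.iter().map(|d| d * d).sum::<f64>() / nf;
    if variance == 0.0 {
        return None;
    }

    let mut g = 1.0;
    let mut lag = 1;
    while lag < n - 1 {
        // Normalized autocorrelation at this lag.
        let c = centered[..n - lag]
            .iter()
            .zip(&centered[lag..])
            .map(|(a, b)| a * b)
            .sum::<f64>()
            / ((n - lag) as f64 * variance);
        // Stop once the correlation is non-positive after the minimum lag.
        if c <= 0.0 && lag > min_lag {
            break;
        }
        g += 2.0 * c * (1.0 - lag as f64 / nf);
        lag += 1;
    }
    // Clamp g to at least 1.
    Some(g.max(1.0))
}
```

An anti-correlated series such as +1, -1, +1, ... accumulates a raw value below 1, which is why the final clamp matters.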
fast
fast changes how g is estimated.
- fast = false evaluates lag contributions one lag at a time
- fast = true increases the lag increment as it goes, which is cheaper but less accurate
This affects:
- plain decorrelation
- equilibration detection
Equilibration detection is affected because it repeatedly estimates g on suffixes of the series.
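To make the cost difference concrete, the sketch below lists which lags would be visited for a series of length n. It assumes the pymbar scheme, in which the increment grows by one after every visited lag and each visited lag's contribution is weighted by the current increment; visited_lags is a hypothetical helper, not a crate function.

```rust
/// Lags visited while estimating g for a series of length `n`.
/// Assumption: with `fast`, the increment grows by one after each visited lag.
pub fn visited_lags(n: usize, fast: bool) -> Vec<usize> {
    let mut lags = Vec::new();
    let mut lag = 1;
    let mut increment = 1;
    // Equivalent to `lag < n - 1`, written to avoid underflow for n == 0.
    while lag + 1 < n {
        lags.push(lag);
        lag += increment;
        if fast {
            increment += 1;
        }
    }
    lags
}
```

For n = 12 this visits lags 1 through 10 with fast = false but only 1, 2, 4, 7 with fast = true. The skipped lags are where the accuracy is lost.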
conservative
conservative changes how the code turns g into retained sample indices.
- conservative = true uses a uniform stride of ceil(g)
- conservative = false uses rounded multiples of fractional g
The conservative mode keeps fewer samples and is intentionally cautious.
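The two index rules can be compared with a short sketch. subsample_indices is a hypothetical name, and the rounding shown is Rust's f64::round (half away from zero), which can differ from Python's rounding at exact .5 values; treat this as an illustration rather than the crate's code.

```rust
/// Retained sample indices for a series of length `n` with inefficiency `g`.
pub fn subsample_indices(n: usize, g: f64, conservative: bool) -> Vec<usize> {
    // g below 1 is not meaningful; clamp defensively.
    let g = g.max(1.0);
    if conservative {
        // Uniform stride of ceil(g).
        let stride = g.ceil() as usize;
        (0..n).step_by(stride).collect()
    } else {
        // Rounded multiples of the fractional g: round(0*g), round(1*g), ...
        let mut indices = Vec::new();
        let mut k = 0usize;
        loop {
            let t = (k as f64 * g).round() as usize;
            if t >= n {
                break;
            }
            // Guard against emitting the same index twice.
            if indices.last() != Some(&t) {
                indices.push(t);
            }
            k += 1;
        }
        indices
    }
}
```

With n = 20 and g = 2.4, the conservative rule keeps 7 samples (stride 3) while the fractional rule keeps 9.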
Equilibration detection
Equilibration detection returns:
- t0: the chosen start index for equilibrated data
- g: statistical inefficiency for the retained suffix
- neff_max: the estimated number of effectively uncorrelated samples
The heuristic scans possible start positions and chooses the one that maximizes:
Neff = (N - t0) / g
This is the same style of automated equilibration detection used by pymbar, but alchemrs counts the actual retained suffix length N - t0 when reporting Neff_max. That means Neff_max can differ by 1 from pymbar/alchemlyb, while the broader workflow remains the same. For further reading on the topic, see Chodera 2016.
Note: when nskip > 1, alchemrs maximizes Neff only over the sampled candidate time origins (0, nskip, 2*nskip, ...). This intentionally differs from pymbar's current implementation, which pre-fills arrays for all indices before taking the maximum, because the documented meaning of nskip is to try only every nskip-th sample as a potential origin.
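The scan described above can be sketched as follows. This is an illustration under stated assumptions, not the crate's implementation: Equilibration, inefficiency, and detect_equilibration are names chosen for the sketch, the minimum lag of 3 mirrors pymbar's default, and suffixes with zero variance are simply skipped.

```rust
/// Result of the scan, mirroring the documented fields t0, g, and neff_max.
#[derive(Debug, Clone, Copy, PartialEq)]
pub struct Equilibration {
    pub t0: usize,
    pub g: f64,
    pub neff_max: f64,
}

/// Plain (fast = false) statistical inefficiency; None for zero variance.
fn inefficiency(x: &[f64]) -> Option<f64> {
    let n = x.len();
    if n < 2 {
        return None;
    }
    let nf = n as f64;
    let mean = x.iter().sum::<f64>() / nf;
    let d: Vec<f64> = x.iter().map(|v| v - mean).collect();
    let var = d.iter().map(|v| v * v).sum::<f64>() / nf;
    if var == 0.0 {
        return None;
    }
    let mut g = 1.0;
    let mut lag = 1;
    while lag < n - 1 {
        let c = d[..n - lag].iter().zip(&d[lag..]).map(|(a, b)| a * b).sum::<f64>()
            / ((n - lag) as f64 * var);
        if c <= 0.0 && lag > 3 {
            break;
        }
        g += 2.0 * c * (1.0 - lag as f64 / nf);
        lag += 1;
    }
    Some(g.max(1.0))
}

/// Try every nskip-th index as a time origin and keep the one that
/// maximizes Neff = (N - t0) / g. Ties keep the earliest origin.
pub fn detect_equilibration(series: &[f64], nskip: usize) -> Option<Equilibration> {
    let n = series.len();
    let mut best: Option<Equilibration> = None;
    for t0 in (0..n.saturating_sub(1)).step_by(nskip.max(1)) {
        // Assumption of this sketch: skip suffixes where g is undefined.
        if let Some(g) = inefficiency(&series[t0..]) {
            let neff = (n - t0) as f64 / g;
            if best.map_or(true, |b| neff > b.neff_max) {
                best = Some(Equilibration { t0, g, neff_max: neff });
            }
        }
    }
    best
}
```

Only the sampled origins 0, nskip, 2*nskip, ... are ever compared, so the returned t0 is always a multiple of nskip.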
remove_burnin
In the prep crate, remove_burnin = true means:
- run equilibration detection
- discard all data before t0
- then decorrelate the remaining suffix using the g estimated for that suffix
This is different from the CLI flag --remove-burnin <N>, which is a fixed-count trim done before any automated detection.
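Once t0 and g are known, the last two steps of the remove_burnin = true path reduce to an index calculation. The sketch below assumes the conservative stride of ceil(g) and uses a hypothetical helper name; the detection step itself is taken as given.

```rust
/// Original-series indices kept after discarding [0, t0) and
/// striding the remaining suffix by ceil(g).
/// Assumption of this sketch: the conservative (uniform stride) rule is used.
pub fn burnin_then_decorrelate(n: usize, t0: usize, g: f64) -> Vec<usize> {
    let stride = g.max(1.0).ceil() as usize;
    // t0 is clamped so an out-of-range origin yields an empty selection.
    (t0.min(n)..n).step_by(stride).collect()
}
```

For n = 20, t0 = 5, and g = 2.4, the retained indices are 5, 8, 11, 14, 17: nothing before t0 survives, and the stride comes from the suffix's own g.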
u_nk observables
u_nk decorrelation needs a scalar time series. The prep crate supports two native derived observables and one external-observable path.
UNkSeriesMethod::DE
For each sample row:
- identify the sampled-state column
- use the next evaluated-state column if it exists
- otherwise use the previous column for the last state
- compute:
DE_t = u_nk[t, other] - u_nk[t, sampled]
This matches the alchemlyb dE convention.
Important detail:
- “adjacent” means adjacent in evaluated-state order, not nearest by numeric lambda distance
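The per-row rule can be written as a small function. de_for_row is a hypothetical helper for illustration; it takes one row of evaluated-state energies and the column index of the sampled state, and returns None for inputs where no adjacent column exists.

```rust
/// DE for one row: u_nk[t, other] - u_nk[t, sampled].
/// `other` is the next evaluated-state column, or the previous one
/// when the sampled state is the last column.
pub fn de_for_row(row: &[f64], sampled: usize) -> Option<f64> {
    if row.len() < 2 || sampled >= row.len() {
        return None;
    }
    let other = if sampled + 1 < row.len() {
        sampled + 1
    } else {
        sampled - 1
    };
    Some(row[other] - row[sampled])
}
```

The neighbour is chosen purely by column position, which is the evaluated-state order described above.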
UNkSeriesMethod::All
For each sample row, sum all evaluated-state reduced energies in that row.
This is a generic matrix-derived scalar, but it is usually less targeted than DE.
decorrelate_u_nk_with_observable
This path accepts an external scalar observable, most commonly EPtot.
The observable determines which sample indices are retained, and those retained indices are then applied back to the full u_nk rows.
This is useful when:
- the matrix contains infinite energy values that make DE invalid
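The idea of selecting indices from one series and applying them to whole matrix rows can be sketched as below. This is not the signature of decorrelate_u_nk_with_observable; rows_kept_by_observable is a hypothetical helper, g is passed in rather than estimated, and a uniform stride of ceil(g) is assumed.

```rust
/// Choose rows with a uniform stride of ceil(g) over the external
/// observable, then copy those whole rows out of u_nk.
/// Returns None when the lengths disagree or the observable is not finite.
pub fn rows_kept_by_observable(
    u_nk: &[Vec<f64>],
    observable: &[f64],
    g: f64,
) -> Option<Vec<Vec<f64>>> {
    if u_nk.len() != observable.len() || observable.iter().any(|x| !x.is_finite()) {
        return None;
    }
    let stride = g.max(1.0).ceil() as usize;
    // Only the observable is checked for finiteness; the matrix rows are
    // copied as they are, so infinite energies pass through untouched.
    Some(u_nk.iter().step_by(stride).cloned().collect())
}
```

Because the matrix values never enter the selection, rows containing infinite reduced energies do not block decorrelation.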
Non-finite u_nk behavior
decorrelate_u_nk rejects non-finite derived scalar series.
In practice:
- DE or All can fail if the derived scalar is not finite
- decorrelate_u_nk_with_observable can still succeed if the supplied observable itself is finite
That is why the CLI exposes epot as an observable choice instead of only supporting de.
CLI preprocessing order
For CLI commands, preprocessing order is:
- fixed-count --remove-burnin <N>
- --auto-equilibrate
- --decorrelate
TI uses dH/dlambda as the scalar series.
BAR, MBAR, EXP, and DEXP use the observable chosen by --u-nk-observable <de|all|epot>.
CLI auto-equilibrate overrides
When --auto-equilibrate is enabled in the CLI, the effective preprocessing flags become:
- fast = true
- conservative = false
even if the user supplied different --fast or --conservative values.
This is intentional and mirrors the alchemlyb / pymbar automated-equilibration workflow:
- detect_equilibration(..., fast=True)
- then subsample_correlated_data(..., conservative=False)
Those effective values are still recorded in CLI provenance so the output remains auditable.
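The override rule amounts to a two-branch mapping from requested flags to effective flags. The function below is a hypothetical illustration of that rule, not the CLI's actual code.

```rust
/// Effective (fast, conservative) pair after the CLI override.
/// With auto-equilibration on, the user's values are ignored.
pub fn effective_flags(auto_equilibrate: bool, fast: bool, conservative: bool) -> (bool, bool) {
    if auto_equilibrate {
        (true, false)
    } else {
        (fast, conservative)
    }
}
```

Recording the returned pair, rather than the requested one, is what keeps the provenance consistent with what was actually run.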