wdfkit package
Python package for WDF data treatment.
- class wdfkit.CosmicRayRemover(sensitivity: float = 0.01, width: float = 0.02, disk_radius: int = 3, single_spectrum_method: Literal['median', 'interpolate', 'derivative'] = 'median', kernel_size: int = 5, threshold: float = 5.0, max_passes: int = 3, spectral_dim: str | None = None, map_mad_multiplier: float = 7.0, map_noisy_channel_relax_min: float = 0.82, map_spectral_dilate_cap: int = 5, map_max_spectral_repair_extent: int | None = 12, map_min_residual_over_cutoff: float = 1.05, map_require_spatial_local_max: bool = True)[source]
Bases: `object`

Cosmic-ray removal: spatial median for maps; robust 1D for singles.

Optionally removes broad Nd:YAG laser harmonics on ~355 nm excitation before narrow spike removal (`harmonic_check()`, `remove()`).

Maps (3D): spatial disk median on a min/median-normalized cube; per-λ scaled MAD cutoffs and noisy-band `relax_λ`; repair by spectral interpolation (not copying the full median surface).

Single spectrum / line scan (2D or 1×1 map): see `remove_cosmic_rays_1d()` — up to `max_passes` iterations of `scipy.signal.medfilt`-based MAD detection, mask dilation by 1 channel, and linear-interpolation repair from the original signal.

- Parameters:
sensitivity (float) – Map path: scales aggressiveness. The cutoff includes `(0.01 / sensitivity)` times the per-channel MAD level (`0.01` is the legacy default reference). Larger `sensitivity` → more hits.
width (float) – Map path: spectral dilation of the CR mask (fraction of the spectral length).
disk_radius (int) – Map path: spatial disk radius for the reference median filter.
map_mad_multiplier (float) – Map path: multiplier on `noise_λ × relax_λ` (like the 1D `threshold`; larger → fewer false positives).
map_noisy_channel_relax_min (float) – Map path: floor on `relax_λ` in noisy channels. Higher → weaker boost in noisy bands → fewer false positives.
map_spectral_dilate_cap (int) – Map path: maximum footprint length (in spectral channels) when dilating hits along λ. Caps `width × N` so repair stays narrow and interpolation stays accurate.
map_require_spatial_local_max (bool) – Map path: if True (default), keep only voxels that are strict maxima in their `(y, x)` slice at fixed λ (8-neighbour), reducing the chance that extended bright features are treated as CRs.
map_max_spectral_repair_extent (int | None) – Map path: after spectral dilation, each contiguous repair segment along λ is clipped to at most this many channels (centered on the maximum residual in the segment). `None` disables the clip (not recommended for noisy maps).
map_min_residual_over_cutoff (float) – Map path: require `residual > cutoff *` this factor (> 1 is stricter, giving fewer false positives). Use `1.0` for the legacy strict inequality.
single_spectrum_method (Literal['median', 'interpolate', 'derivative']) – `"median"` or `"interpolate"` (both use medfilt detection and linear-interpolation repair — equivalent in practice), or `"derivative"` (neighbour-difference peak test).
kernel_size (int) – Odd, `>= 3`. Passed to `medfilt` for single-spectrum median-based methods.
threshold (float) – Single-spectrum only: spike cutoff is `threshold * MAD_noise`. Lower → more aggressive (try `3.5`–`4.0` for noisy spectra).
max_passes (int) – Single-spectrum only: number of detection–repair iterations (default 3). Each pass runs on the already-repaired signal so that large spikes no longer mask smaller ones. `1` replicates the old single-pass behaviour.
spectral_dim (str | None) – Name of the spectral axis (default: last dimension). Used for harmonic cleanup and when the spectral dimension is not last.
- harmonic_check(spectrum: DataArray) DataArray[source]
Notch broad harmonics when `LaserWaveLength` is ~355 nm (Nd:YAG).

If `spectrum.attrs['LaserWaveLength']` is outside 354–356 nm, returns `spectrum` unchanged. Searches 1064 / 532 / 355 / 266 nm (±2.5 nm); replaces ~1 nm around each found peak with linear interpolation. Prints one line per removal.
- remove_cosmic_rays_with_diagnostics(spectrum: DataArray) tuple[DataArray, dict[str, Any]][source]
Like `remove_cosmic_rays()`, but returns a diagnostics dict for visualization / QC (not written to `DataArray.attrs`).

For 3D maps, `diagnostics` includes the boolean `core_mask` and `repair_mask`, and the float arrays `residual`, `preprocessed`, `spatial_median_reference`, `cutoff`, `per_spectrum_median`, etc. Use matplotlib to overlay masks or compare spectra at selected `(y, x)`.

For 2D single-spectrum input, the diagnostics contain `cosmic_mask` and `corrected_1d` (the 1D corrected intensity).
- class wdfkit.SpectraCleaner(method: Literal['pca'] = 'pca', n_components: int | float | str | None = 'mle', subtract_min: bool = True, restore_min: bool = False, spectral_dim: str | None = None, pca_kwargs: dict[str, Any] = <factory>)[source]
Bases: `object`

Denoise a population of spectra by low-rank reconstruction.

Designed for 3D map cubes `(ny, nx, n_spectral)` and 2D stacks `(n_spectra, n_spectral)`. PCA reconstruction needs more than one spectrum to separate shared signal from per-channel noise — a single spectrum is rejected with a clear error (use a 1D smoother instead).

- Parameters:
method (Literal['pca']) – Denoising method. Currently only `"pca"` is implemented; the switch is kept for forward compatibility.
n_components (int | float | str | None) – Forwarded to `sklearn.decomposition.PCA`. `"mle"` (default), a `float` in `(0, 1)` for variance explained, an `int` count, or `None` for `min(n_spectra, n_spectral)`.
subtract_min (bool) – Subtract the per-spectrum min before the fit (legacy default `True`). PCA also mean-centers internally, so this only changes the baseline offset fed to the fit.
restore_min (bool) – Add the saved per-spectrum min back after reconstruction. Off by default (legacy behavior); enable to preserve absolute intensities.
spectral_dim (str | None) – Name of the spectral axis in DataArray inputs. Defaults to the last dimension; pass when spectra are not last (e.g. `"raman_shift"` with a leading spectral axis).
pca_kwargs (dict[str, Any]) – Extra kwargs forwarded to `sklearn.decomposition.PCA` (e.g. `{"svd_solver": "full"}`).
- clean(spectra: DataArray) DataArray[source]
Return a denoised copy of `spectra` (no decomposition payload).
- clean_with_decomposition(spectra: DataArray) tuple[DataArray, dict[str, Any]][source]
Like `clean()`, but also returns the PCA decomposition.

The returned `decomposition` dict has keys `components` (shape `(n_components, n_spectral)`), `coeffs` (per-spectrum scores reshaped to the input’s spatial layout plus a components axis), `mean`, `explained_variance`, `explained_variance_ratio`, and `noise_variance`. These arrays can be large — they are returned separately rather than written to `DataArray.attrs`.
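The low-rank reconstruction idea behind `SpectraCleaner` can be sketched with a plain numpy truncated SVD (a stand-in for `sklearn.decomposition.PCA`; function and variable names here are illustrative, not wdfkit API):

```python
import numpy as np

def lowrank_denoise(stack: np.ndarray, n_components: int) -> np.ndarray:
    """Reconstruct each spectrum from the leading principal components.

    stack: (n_spectra, n_spectral). Mean-center, truncate the SVD, and
    project back -- uncorrelated per-channel noise lives mostly in the
    discarded components.
    """
    mean = stack.mean(axis=0)
    u, s, vt = np.linalg.svd(stack - mean, full_matrices=False)
    k = n_components
    return (u[:, :k] * s[:k]) @ vt[:k] + mean

rng = np.random.default_rng(0)
x = np.linspace(0.0, 1.0, 200)
shape = np.exp(-((x - 0.5) / 0.05) ** 2)      # shared peak shape
amps = rng.uniform(0.5, 2.0, size=(50, 1))    # per-spectrum amplitude
noisy = amps * shape + rng.normal(0.0, 0.05, (50, 200))
clean = lowrank_denoise(noisy, n_components=1)

# Mean absolute error against the noise-free cube shrinks after denoising.
err_noisy = np.abs(noisy - amps * shape).mean()
err_clean = np.abs(clean - amps * shape).mean()
```

With a single dominant component, the rank-1 reconstruction keeps the shared peak and amplitude variation while most per-channel noise is dropped.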
- class wdfkit.SpectralAxisSpec(dim_name: str, units: str)[source]
Bases: `object`

Resolved spectral coordinate used as the xarray dimension name plus coord attrs.
- class wdfkit.WDFReader(path: str | PathLike[str], *, verbose: bool = False, time_coord: str = 'seconds_elapsed', spectral_dim: str | None = None, chunks: bool | int = False)[source]
Bases: `object`

Load spectra and metadata from a Renishaw WiRE `.wdf` binary file.

Typical usage:

data_array, white_light_image = WDFReader(path)

After construction, `.data` and `.image` hold the same objects as the unpacked tuple.

- Parameters:
spectral_dim – Name for the spectral axis coordinate (default `None` / `"auto"`). The WiRE XLST `XlistDataUnits` value selects the default (e.g. `RamanShift` → dimension `"raman_shift"`). Set to `"shifts"` for legacy notebooks.
chunks – Enable lazy Dask-backed reading. `False` (default) = eager; `True` = auto-chunk at ~128 MB per chunk; `int` = target chunk size in MB.
- wdfkit.classify(path: str | PathLike[str]) dict[source]
Return scan classification for a WiRE `.wdf` file without loading the spectral data.

- Returns:
Keys: `kind`, `measurement_type`, `scan_type`, `wmap_flag`, `nspectra`, `npoints`, `nsteps`.
- Return type:
dict
- wdfkit.normalize(input_spectra: DataArray | ndarray, method: str = 'robust_scale', *, spectral_dim: str | None = None, **kwargs) DataArray | ndarray[source]
Scale spectra along the spectral axis.
For `xarray.DataArray` input, the spectral axis defaults to the last dimension (e.g. `nm`, `raman_shift`, `shifts`, …). Pass `spectral_dim` to select another dimension when spectra are not last.

Dask-backed DataArrays are handled transparently:

Per-spectrum methods (`"l1"`, `"l2"`, `"max"`, `"min_max"`, `"area"`) are processed chunk-by-chunk via `xr.apply_ufunc` — no data is loaded into RAM beyond the current chunk.
Global methods (`"robust_scale"`, `"wave_number"`) require statistics across all spectra, so the full array is computed first. A `UserWarning` is emitted so you know RAM is being used.

- Parameters:
input_spectra – DataArray or 2D ndarray of shape `(n_spectra, n_points)`.
method – One of `"l1"`, `"l2"`, `"max"`, `"min_max"`, `"wave_number"`, `"robust_scale"`, `"area"`.
spectral_dim – Spectral dimension name when `input_spectra` is a DataArray.
x_values – Spectral abscissa for ndarray input (default `arange(n_points)`).
- Returns:
Same type as `input_spectra`, with updated `attrs["treatments"]` for DataArray output.
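The per-spectrum methods are simple row-wise reductions; a minimal numpy sketch of `"l2"` and `"min_max"` scaling (illustrative only, not the wdfkit implementation, which also handles DataArrays and Dask):

```python
import numpy as np

def normalize_rows(spectra: np.ndarray, method: str) -> np.ndarray:
    """Scale each spectrum (row) independently along the spectral axis."""
    if method == "l2":
        # Divide each row by its Euclidean norm.
        return spectra / np.linalg.norm(spectra, axis=-1, keepdims=True)
    if method == "min_max":
        # Rescale each row so its values span [0, 1].
        lo = spectra.min(axis=-1, keepdims=True)
        hi = spectra.max(axis=-1, keepdims=True)
        return (spectra - lo) / (hi - lo)
    raise ValueError(f"unknown method: {method}")

spectra = np.array([[1.0, 2.0, 2.0], [10.0, 0.0, 5.0]])
l2 = normalize_rows(spectra, "l2")       # each row has unit Euclidean norm
mm = normalize_rows(spectra, "min_max")  # each row spans [0, 1]
```

Global methods like `"robust_scale"` differ in that the statistics are computed over the whole population rather than row by row, which is why the real implementation must materialize Dask arrays first.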
- wdfkit.read(path: str | PathLike[str], *, verbose: bool = False, spectral_dim: str | None = None, chunks: bool | int = False) DataArray[source]
Read a WiRE `.wdf` file and return an `xarray.DataArray`.

- Parameters:
path – Path to the `.wdf` file.
spectral_dim – Override for the spectral-axis dimension name.
chunks – Dask chunking: `False` (eager), `True` (auto), or an int (target MB).
- Returns:
Shape and dims depend on the scan kind; the spectral axis is always last.
- Return type:
DataArray
- wdfkit.remove_cosmic_rays_1d(y: ndarray, method: Literal['median', 'interpolate', 'derivative'], *, kernel_size: int = 5, threshold: float = 5.0, max_passes: int = 3) tuple[ndarray, ndarray][source]
Remove sharp positive spikes from one 1D spectrum (PL-style).
Operates on the raw counts / intensity array (only masked indices change). Cosmic rays are treated as positive excursions relative to a robust noise model.

The algorithm runs up to `max_passes` iterations. Each pass:

1. Detects new spikes on the current (already-repaired) signal.
2. Dilates the new spike mask by 1 channel on each side to catch sub-threshold spike edges.
3. Accumulates the hits into a single cumulative mask across all passes.
4. Repairs by linear interpolation from the original signal at all cumulatively masked positions — this avoids chaining interpolation errors.

A pass that finds no new spikes terminates the loop early.
- Parameters:
y – One spectral trace (any numeric dtype; cast to float).
method – `"median"` or `"interpolate"`: `scipy.signal.medfilt` reference signal, MAD on the residual for detection (both methods now repair identically via linear interpolation). `"derivative"`: neighbour-difference test against the MAD of `diff(y)`; interior points only.
kernel_size – Odd length `>= 3` for `medfilt` (median / interpolate methods).
threshold – Multiplier on the MAD-derived noise (larger → fewer detections).
max_passes – Maximum number of detection–repair iterations (default 3). Use `1` for the old single-pass behaviour.
- Returns:
corrected_y – Same shape as `y`; unchanged if no spikes are found or if the noise estimate is degenerate.
cosmic_mask – Boolean mask, same shape as `y`; `True` at all channels that were corrected (including dilation neighbours). All `False` when nothing was found or when the mask would cover the entire spectrum.
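One detection–repair pass of the median-based method can be sketched as follows (a simplified single-pass version for illustration; `despike_once` is a hypothetical name, not the wdfkit source):

```python
import numpy as np
from scipy.signal import medfilt
from scipy.ndimage import binary_dilation

def despike_once(y, kernel_size=5, threshold=5.0):
    """One pass: medfilt reference, MAD cutoff, dilate by 1, interp repair."""
    y = np.asarray(y, dtype=float)
    residual = y - medfilt(y, kernel_size)
    mad = np.median(np.abs(residual - np.median(residual)))
    noise = 1.4826 * mad                     # scaled MAD ~ Gaussian sigma
    if noise == 0.0:                         # degenerate noise: do nothing
        return y, np.zeros_like(y, dtype=bool)
    mask = residual > threshold * noise      # positive spikes only
    mask = binary_dilation(mask)             # +1 channel each side
    if not mask.any() or mask.all():
        return y, np.zeros_like(y, dtype=bool)
    idx = np.arange(y.size)
    repaired = y.copy()
    # Linear interpolation across masked channels from good neighbours.
    repaired[mask] = np.interp(idx[mask], idx[~mask], y[~mask])
    return repaired, mask

rng = np.random.default_rng(0)
y = np.sin(np.linspace(0, 6, 200)) + 10.0 + rng.normal(0.0, 0.1, 200)
y[80] += 50.0                                # synthetic cosmic ray
fixed, mask = despike_once(y)
```

The multi-pass version documented above re-runs this detection on the repaired signal and always interpolates from the original array at the cumulative mask, so a huge spike removed in pass 1 cannot hide a smaller one next to it.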
- wdfkit.resolve_spectral_axis(xlist_data_units: str, spectral_dim: str | None) SpectralAxisSpec[source]
Choose the spectral coordinate dimension name and `coord.attrs["units"]`.

- Parameters:
xlist_data_units – `DATA_UNITS` label resolved from the raw XLST (e.g. `"Nanometre"`).
spectral_dim – `None` or `"auto"`: derive the dim name from `xlist_data_units`. Any other string: force this dimension name (units still come from the table when known; unknown WiRE enums fall back to `units="unknown"`).
- Returns:
`dim_name` is safe as an xarray dimension identifier (ASCII tokens).
- Return type:
SpectralAxisSpec
Submodules
wdfkit.reader
Public WDFReader API plus module-level `read()` and `classify()`.
- class wdfkit.reader.WDFReader(path: str | PathLike[str], *, verbose: bool = False, time_coord: str = 'seconds_elapsed', spectral_dim: str | None = None, chunks: bool | int = False)[source]
Bases: `object`

Load spectra and metadata from a Renishaw WiRE `.wdf` binary file.

Typical usage:

data_array, white_light_image = WDFReader(path)

After construction, `.data` and `.image` hold the same objects as the unpacked tuple.

- Parameters:
spectral_dim – Name for the spectral axis coordinate (default `None` / `"auto"`). The WiRE XLST `XlistDataUnits` value selects the default (e.g. `RamanShift` → dimension `"raman_shift"`). Set to `"shifts"` for legacy notebooks.
chunks – Enable lazy Dask-backed reading. `False` (default) = eager; `True` = auto-chunk at ~128 MB per chunk; `int` = target chunk size in MB.
- wdfkit.reader.classify(path: str | PathLike[str]) dict[source]
Return scan classification for a WiRE `.wdf` file without loading the spectral data.

- Returns:
Keys: `kind`, `measurement_type`, `scan_type`, `wmap_flag`, `nspectra`, `npoints`, `nsteps`.
- Return type:
dict
- wdfkit.reader.read(path: str | PathLike[str], *, verbose: bool = False, spectral_dim: str | None = None, chunks: bool | int = False) DataArray[source]
Read a WiRE `.wdf` file and return an `xarray.DataArray`.

- Parameters:
path – Path to the `.wdf` file.
spectral_dim – Override for the spectral-axis dimension name.
chunks – Dask chunking: `False` (eager), `True` (auto), or an int (target MB).
- Returns:
Shape and dims depend on the scan kind; the spectral axis is always last.
- Return type:
DataArray
wdfkit.cosmic_ray
High-level cosmic-ray removal: CosmicRayRemover for maps and singles.
- class wdfkit.cosmic_ray.CosmicRayRemover(sensitivity: float = 0.01, width: float = 0.02, disk_radius: int = 3, single_spectrum_method: Literal['median', 'interpolate', 'derivative'] = 'median', kernel_size: int = 5, threshold: float = 5.0, max_passes: int = 3, spectral_dim: str | None = None, map_mad_multiplier: float = 7.0, map_noisy_channel_relax_min: float = 0.82, map_spectral_dilate_cap: int = 5, map_max_spectral_repair_extent: int | None = 12, map_min_residual_over_cutoff: float = 1.05, map_require_spatial_local_max: bool = True)[source]
Bases: `object`

Cosmic-ray removal: spatial median for maps; robust 1D for singles.

Optionally removes broad Nd:YAG laser harmonics on ~355 nm excitation before narrow spike removal (`harmonic_check()`, `remove()`).

Maps (3D): spatial disk median on a min/median-normalized cube; per-λ scaled MAD cutoffs and noisy-band `relax_λ`; repair by spectral interpolation (not copying the full median surface).

Single spectrum / line scan (2D or 1×1 map): see `remove_cosmic_rays_1d()` — up to `max_passes` iterations of `scipy.signal.medfilt`-based MAD detection, mask dilation by 1 channel, and linear-interpolation repair from the original signal.

- Parameters:
sensitivity (float) – Map path: scales aggressiveness. The cutoff includes `(0.01 / sensitivity)` times the per-channel MAD level (`0.01` is the legacy default reference). Larger `sensitivity` → more hits.
width (float) – Map path: spectral dilation of the CR mask (fraction of the spectral length).
disk_radius (int) – Map path: spatial disk radius for the reference median filter.
map_mad_multiplier (float) – Map path: multiplier on `noise_λ × relax_λ` (like the 1D `threshold`; larger → fewer false positives).
map_noisy_channel_relax_min (float) – Map path: floor on `relax_λ` in noisy channels. Higher → weaker boost in noisy bands → fewer false positives.
map_spectral_dilate_cap (int) – Map path: maximum footprint length (in spectral channels) when dilating hits along λ. Caps `width × N` so repair stays narrow and interpolation stays accurate.
map_require_spatial_local_max (bool) – Map path: if True (default), keep only voxels that are strict maxima in their `(y, x)` slice at fixed λ (8-neighbour), reducing the chance that extended bright features are treated as CRs.
map_max_spectral_repair_extent (int | None) – Map path: after spectral dilation, each contiguous repair segment along λ is clipped to at most this many channels (centered on the maximum residual in the segment). `None` disables the clip (not recommended for noisy maps).
map_min_residual_over_cutoff (float) – Map path: require `residual > cutoff *` this factor (> 1 is stricter, giving fewer false positives). Use `1.0` for the legacy strict inequality.
single_spectrum_method (Literal['median', 'interpolate', 'derivative']) – `"median"` or `"interpolate"` (both use medfilt detection and linear-interpolation repair — equivalent in practice), or `"derivative"` (neighbour-difference peak test).
kernel_size (int) – Odd, `>= 3`. Passed to `medfilt` for single-spectrum median-based methods.
threshold (float) – Single-spectrum only: spike cutoff is `threshold * MAD_noise`. Lower → more aggressive (try `3.5`–`4.0` for noisy spectra).
max_passes (int) – Single-spectrum only: number of detection–repair iterations (default 3). Each pass runs on the already-repaired signal so that large spikes no longer mask smaller ones. `1` replicates the old single-pass behaviour.
spectral_dim (str | None) – Name of the spectral axis (default: last dimension). Used for harmonic cleanup and when the spectral dimension is not last.
- harmonic_check(spectrum: DataArray) DataArray[source]
Notch broad harmonics when `LaserWaveLength` is ~355 nm (Nd:YAG).

If `spectrum.attrs['LaserWaveLength']` is outside 354–356 nm, returns `spectrum` unchanged. Searches 1064 / 532 / 355 / 266 nm (±2.5 nm); replaces ~1 nm around each found peak with linear interpolation. Prints one line per removal.
- remove_cosmic_rays_with_diagnostics(spectrum: DataArray) tuple[DataArray, dict[str, Any]][source]
Like `remove_cosmic_rays()`, but returns a diagnostics dict for visualization / QC (not written to `DataArray.attrs`).

For 3D maps, `diagnostics` includes the boolean `core_mask` and `repair_mask`, and the float arrays `residual`, `preprocessed`, `spatial_median_reference`, `cutoff`, `per_spectrum_median`, etc. Use matplotlib to overlay masks or compare spectra at selected `(y, x)`.

For 2D single-spectrum input, the diagnostics contain `cosmic_mask` and `corrected_1d` (the 1D corrected intensity).
- wdfkit.cosmic_ray.remove_cosmic_rays_1d(y: ndarray, method: Literal['median', 'interpolate', 'derivative'], *, kernel_size: int = 5, threshold: float = 5.0, max_passes: int = 3) tuple[ndarray, ndarray][source]
Remove sharp positive spikes from one 1D spectrum (PL-style).
Operates on the raw counts / intensity array (only masked indices change). Cosmic rays are treated as positive excursions relative to a robust noise model.

The algorithm runs up to `max_passes` iterations. Each pass:

1. Detects new spikes on the current (already-repaired) signal.
2. Dilates the new spike mask by 1 channel on each side to catch sub-threshold spike edges.
3. Accumulates the hits into a single cumulative mask across all passes.
4. Repairs by linear interpolation from the original signal at all cumulatively masked positions — this avoids chaining interpolation errors.

A pass that finds no new spikes terminates the loop early.
- Parameters:
y – One spectral trace (any numeric dtype; cast to float).
method – `"median"` or `"interpolate"`: `scipy.signal.medfilt` reference signal, MAD on the residual for detection (both methods now repair identically via linear interpolation). `"derivative"`: neighbour-difference test against the MAD of `diff(y)`; interior points only.
kernel_size – Odd length `>= 3` for `medfilt` (median / interpolate methods).
threshold – Multiplier on the MAD-derived noise (larger → fewer detections).
max_passes – Maximum number of detection–repair iterations (default 3). Use `1` for the old single-pass behaviour.
- Returns:
corrected_y – Same shape as `y`; unchanged if no spikes are found or if the noise estimate is degenerate.
cosmic_mask – Boolean mask, same shape as `y`; `True` at all channels that were corrected (including dilation neighbours). All `False` when nothing was found or when the mask would cover the entire spectrum.
wdfkit.spectra_cleaner
High-level spectral denoising: SpectraCleaner.
Currently implements PCA-based reconstruction (legacy pca_clean); the method switch is kept so other denoisers can be added (e.g. Savitzky-Golay or wavelet) without breaking callers.
- class wdfkit.spectra_cleaner.SpectraCleaner(method: Literal['pca'] = 'pca', n_components: int | float | str | None = 'mle', subtract_min: bool = True, restore_min: bool = False, spectral_dim: str | None = None, pca_kwargs: dict[str, Any] = <factory>)[source]
Bases: `object`

Denoise a population of spectra by low-rank reconstruction.

Designed for 3D map cubes `(ny, nx, n_spectral)` and 2D stacks `(n_spectra, n_spectral)`. PCA reconstruction needs more than one spectrum to separate shared signal from per-channel noise — a single spectrum is rejected with a clear error (use a 1D smoother instead).

- Parameters:
method (Literal['pca']) – Denoising method. Currently only `"pca"` is implemented; the switch is kept for forward compatibility.
n_components (int | float | str | None) – Forwarded to `sklearn.decomposition.PCA`. `"mle"` (default), a `float` in `(0, 1)` for variance explained, an `int` count, or `None` for `min(n_spectra, n_spectral)`.
subtract_min (bool) – Subtract the per-spectrum min before the fit (legacy default `True`). PCA also mean-centers internally, so this only changes the baseline offset fed to the fit.
restore_min (bool) – Add the saved per-spectrum min back after reconstruction. Off by default (legacy behavior); enable to preserve absolute intensities.
spectral_dim (str | None) – Name of the spectral axis in DataArray inputs. Defaults to the last dimension; pass when spectra are not last (e.g. `"raman_shift"` with a leading spectral axis).
pca_kwargs (dict[str, Any]) – Extra kwargs forwarded to `sklearn.decomposition.PCA` (e.g. `{"svd_solver": "full"}`).
- clean(spectra: DataArray) DataArray[source]
Return a denoised copy of `spectra` (no decomposition payload).
- clean_with_decomposition(spectra: DataArray) tuple[DataArray, dict[str, Any]][source]
Like `clean()`, but also returns the PCA decomposition.

The returned `decomposition` dict has keys `components` (shape `(n_components, n_spectral)`), `coeffs` (per-spectrum scores reshaped to the input’s spatial layout plus a components axis), `mean`, `explained_variance`, `explained_variance_ratio`, and `noise_variance`. These arrays can be large — they are returned separately rather than written to `DataArray.attrs`.
wdfkit.spectral
Spectral-axis naming from WiRE XLST unit enums.
- class wdfkit.spectral.SpectralAxisSpec(dim_name: str, units: str)[source]
Bases: `object`

Resolved spectral coordinate used as the xarray dimension name plus coord attrs.
- wdfkit.spectral.resolve_spectral_axis(xlist_data_units: str, spectral_dim: str | None) SpectralAxisSpec[source]
Choose the spectral coordinate dimension name and `coord.attrs["units"]`.

- Parameters:
xlist_data_units – `DATA_UNITS` label resolved from the raw XLST (e.g. `"Nanometre"`).
spectral_dim – `None` or `"auto"`: derive the dim name from `xlist_data_units`. Any other string: force this dimension name (units still come from the table when known; unknown WiRE enums fall back to `units="unknown"`).
- Returns:
`dim_name` is safe as an xarray dimension identifier (ASCII tokens).
- Return type:
SpectralAxisSpec
wdfkit.preprocessing
Spectral preprocessing (normalization).
Cosmic-ray removal: see wdfkit.cosmic_ray.
PCA-based denoising: see wdfkit.spectra_cleaner.
- wdfkit.preprocessing.denoise_spectra_pca(values: ndarray, *, n_components: int | float | str | None = 'mle', subtract_min: bool = True, restore_min: bool = False, pca_kwargs: dict[str, Any] | None = None, return_decomposition: bool = False) tuple[ndarray, dict[str, Any]] | tuple[ndarray, dict[str, Any], dict[str, Any]][source]
Denoise a stack / cube of spectra by PCA reconstruction.
The input is reshaped to `(n_spectra, n_spectral)` for the fit, then reshaped back to the original spatial layout on return. PCA itself mean-centers internally; the optional per-spectrum min subtraction below only changes the baseline offset fed to the decomposition.

- Parameters:
values – Array of shape `(..., n_spectral)`. Typical inputs: a `(ny, nx, n_spectral)` map cube, or a `(n_spectra, n_spectral)` stack. Needs more than one spectrum (PCA on a single spectrum is degenerate).
n_components – Forwarded to `sklearn.decomposition.PCA`. `"mle"` (default) picks the number with Minka’s MLE; a `float` in `(0, 1)` keeps the components that explain that fraction of the variance; an `int` fixes the count; `None` uses `min(n_spectra, n_spectral)`.
subtract_min – If True (default, matches legacy `pca_clean`), subtract the per-spectrum minimum before the fit so PCA models the spectral shape rather than offsets.
restore_min – If True, add the saved per-spectrum minimum back to the cleaned output. Off by default to match legacy `pca_clean`; turn on to preserve absolute intensities.
pca_kwargs – Extra kwargs passed straight to `sklearn.decomposition.PCA` (e.g. `{"svd_solver": "full"}`).
return_decomposition – If True, also return a third dict with the components, per-spectrum coefficients, mean, and explained-variance arrays (large; not suitable for `DataArray.attrs`).
- Returns:
cleaned – Same shape and dtype flavor (float) as `values`.
meta – Small dict with the parameters actually used and summary stats — safe to attach to `DataArray.attrs`.
decomposition_payload – Only when `return_decomposition=True`. Has keys `components`, `coeffs`, `mean`, `explained_variance`, `explained_variance_ratio`, `noise_variance`.
- wdfkit.preprocessing.normalize(input_spectra: DataArray | ndarray, method: str = 'robust_scale', *, spectral_dim: str | None = None, **kwargs) DataArray | ndarray[source]
Scale spectra along the spectral axis.
For `xarray.DataArray` input, the spectral axis defaults to the last dimension (e.g. `nm`, `raman_shift`, `shifts`, …). Pass `spectral_dim` to select another dimension when spectra are not last.

Dask-backed DataArrays are handled transparently:

Per-spectrum methods (`"l1"`, `"l2"`, `"max"`, `"min_max"`, `"area"`) are processed chunk-by-chunk via `xr.apply_ufunc` — no data is loaded into RAM beyond the current chunk.
Global methods (`"robust_scale"`, `"wave_number"`) require statistics across all spectra, so the full array is computed first. A `UserWarning` is emitted so you know RAM is being used.

- Parameters:
input_spectra – DataArray or 2D ndarray of shape `(n_spectra, n_points)`.
method – One of `"l1"`, `"l2"`, `"max"`, `"min_max"`, `"wave_number"`, `"robust_scale"`, `"area"`.
spectral_dim – Spectral dimension name when `input_spectra` is a DataArray.
x_values – Spectral abscissa for ndarray input (default `arange(n_points)`).
- Returns:
Same type as `input_spectra`, with updated `attrs["treatments"]` for DataArray output.
wdfkit.preprocessing.normalize
Per-spectrum normalization (dynamic spectral coordinate).
- wdfkit.preprocessing.normalize.normalize(input_spectra: DataArray | ndarray, method: str = 'robust_scale', *, spectral_dim: str | None = None, **kwargs) DataArray | ndarray[source]
Scale spectra along the spectral axis.
For `xarray.DataArray` input, the spectral axis defaults to the last dimension (e.g. `nm`, `raman_shift`, `shifts`, …). Pass `spectral_dim` to select another dimension when spectra are not last.

Dask-backed DataArrays are handled transparently:

Per-spectrum methods (`"l1"`, `"l2"`, `"max"`, `"min_max"`, `"area"`) are processed chunk-by-chunk via `xr.apply_ufunc` — no data is loaded into RAM beyond the current chunk.
Global methods (`"robust_scale"`, `"wave_number"`) require statistics across all spectra, so the full array is computed first. A `UserWarning` is emitted so you know RAM is being used.

- Parameters:
input_spectra – DataArray or 2D ndarray of shape `(n_spectra, n_points)`.
method – One of `"l1"`, `"l2"`, `"max"`, `"min_max"`, `"wave_number"`, `"robust_scale"`, `"area"`.
spectral_dim – Spectral dimension name when `input_spectra` is a DataArray.
x_values – Spectral abscissa for ndarray input (default `arange(n_points)`).
- Returns:
Same type as `input_spectra`, with updated `attrs["treatments"]` for DataArray output.
wdfkit.preprocessing.pca_clean
PCA-based spectral denoising for stacks of spectra and 3D map cubes.
PCA decomposes a population of spectra into orthogonal components and
reconstructs each spectrum from the leading ones. Components dominated by
uncorrelated per-channel noise are dropped, so the reconstruction is a
denoised version of the input. This requires more than one spectrum — see
wdfkit.spectra_cleaner.SpectraCleaner for the user-facing API.
- wdfkit.preprocessing.pca_clean.denoise_spectra_pca(values: ndarray, *, n_components: int | float | str | None = 'mle', subtract_min: bool = True, restore_min: bool = False, pca_kwargs: dict[str, Any] | None = None, return_decomposition: bool = False) tuple[ndarray, dict[str, Any]] | tuple[ndarray, dict[str, Any], dict[str, Any]][source]
Denoise a stack / cube of spectra by PCA reconstruction.
The input is reshaped to `(n_spectra, n_spectral)` for the fit, then reshaped back to the original spatial layout on return. PCA itself mean-centers internally; the optional per-spectrum min subtraction below only changes the baseline offset fed to the decomposition.

- Parameters:
values – Array of shape `(..., n_spectral)`. Typical inputs: a `(ny, nx, n_spectral)` map cube, or a `(n_spectra, n_spectral)` stack. Needs more than one spectrum (PCA on a single spectrum is degenerate).
n_components – Forwarded to `sklearn.decomposition.PCA`. `"mle"` (default) picks the number with Minka’s MLE; a `float` in `(0, 1)` keeps the components that explain that fraction of the variance; an `int` fixes the count; `None` uses `min(n_spectra, n_spectral)`.
subtract_min – If True (default, matches legacy `pca_clean`), subtract the per-spectrum minimum before the fit so PCA models the spectral shape rather than offsets.
restore_min – If True, add the saved per-spectrum minimum back to the cleaned output. Off by default to match legacy `pca_clean`; turn on to preserve absolute intensities.
pca_kwargs – Extra kwargs passed straight to `sklearn.decomposition.PCA` (e.g. `{"svd_solver": "full"}`).
return_decomposition – If True, also return a third dict with the components, per-spectrum coefficients, mean, and explained-variance arrays (large; not suitable for `DataArray.attrs`).
- Returns:
cleaned – Same shape and dtype flavor (float) as `values`.
meta – Small dict with the parameters actually used and summary stats — safe to attach to `DataArray.attrs`.
decomposition_payload – Only when `return_decomposition=True`. Has keys `components`, `coeffs`, `mean`, `explained_variance`, `explained_variance_ratio`, `noise_variance`.
wdfkit.preprocessing.cosmic_ray_1d
1D spectrum cosmic-ray (positive spike) removal.
- wdfkit.preprocessing.cosmic_ray_1d.linear_interpolate_masked_channels_1d(y: ndarray, bad_channel_mask: ndarray) ndarray[source]
Fill masked channels by linear interpolation from good ones.
- wdfkit.preprocessing.cosmic_ray_1d.positive_spike_mask_from_derivative_peaks(y: ndarray, threshold_multiplier: float) ndarray[source]
Interior `i` where `y[i]` is above both neighbours by `threshold_multiplier * noise`, where `noise` is the scaled MAD of `diff(y)`.
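A plausible sketch of that derivative test (illustrative only; `derivative_spike_mask` is a hypothetical name, not the wdfkit source): a point is flagged when it exceeds both neighbours by a multiple of the scaled MAD of `diff(y)`.

```python
import numpy as np

def derivative_spike_mask(y, threshold_multiplier=5.0):
    """Flag interior points sitting above BOTH neighbours by > k * noise,
    where noise is the scaled MAD of the first difference."""
    y = np.asarray(y, dtype=float)
    d = np.diff(y)
    mad = np.median(np.abs(d - np.median(d)))
    noise = 1.4826 * mad
    mask = np.zeros(y.shape, dtype=bool)
    if noise == 0.0:
        return mask
    above_left = y[1:-1] - y[:-2] > threshold_multiplier * noise
    above_right = y[1:-1] - y[2:] > threshold_multiplier * noise
    mask[1:-1] = above_left & above_right   # endpoints are never flagged
    return mask

y = np.linspace(0.0, 1.0, 100) + np.random.default_rng(1).normal(0, 0.01, 100)
y[40] += 5.0                                # single-channel spike
mask = derivative_spike_mask(y)
```

Because the spike contributes only two of the ~100 first differences, the MAD-based noise estimate stays robust and the spike itself clears the two-sided test.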
- wdfkit.preprocessing.cosmic_ray_1d.positive_spike_mask_vs_median_smooth(y: ndarray, median_smoothed_y: ndarray, threshold_multiplier: float) tuple[ndarray, float][source]
Mask where the positive residual exceeds `threshold_multiplier * noise`. The residual is `y - median_smoothed_y`; `noise` is the scaled MAD of the residual.
- wdfkit.preprocessing.cosmic_ray_1d.remove_cosmic_rays_1d(y: ndarray, method: Literal['median', 'interpolate', 'derivative'], *, kernel_size: int = 5, threshold: float = 5.0, max_passes: int = 3) tuple[ndarray, ndarray][source]
Remove sharp positive spikes from one 1D spectrum (PL-style).
Operates on the raw counts / intensity array (only masked indices change). Cosmic rays are treated as positive excursions relative to a robust noise model.

The algorithm runs up to `max_passes` iterations. Each pass:

1. Detects new spikes on the current (already-repaired) signal.
2. Dilates the new spike mask by 1 channel on each side to catch sub-threshold spike edges.
3. Accumulates the hits into a single cumulative mask across all passes.
4. Repairs by linear interpolation from the original signal at all cumulatively masked positions — this avoids chaining interpolation errors.

A pass that finds no new spikes terminates the loop early.
- Parameters:
y – One spectral trace (any numeric dtype; cast to float).
method – `"median"` or `"interpolate"`: `scipy.signal.medfilt` reference signal, MAD on the residual for detection (both methods now repair identically via linear interpolation). `"derivative"`: neighbour-difference test against the MAD of `diff(y)`; interior points only.
kernel_size – Odd length `>= 3` for `medfilt` (median / interpolate methods).
threshold – Multiplier on the MAD-derived noise (larger → fewer detections).
max_passes – Maximum number of detection–repair iterations (default 3). Use `1` for the old single-pass behaviour.
- Returns:
corrected_y – Same shape as `y`; unchanged if no spikes are found or if the noise estimate is degenerate.
cosmic_mask – Boolean mask, same shape as `y`; `True` at all channels that were corrected (including dilation neighbours). All `False` when nothing was found or when the mask would cover the entire spectrum.
wdfkit.preprocessing.cosmic_ray_map
Spatial (3D map) cosmic-ray detection and replacement.
- wdfkit.preprocessing.cosmic_ray_map.correct_cosmic_rays_on_map_cube(values: ndarray, *, sensitivity: float, spectral_width_fraction: float, disk_radius: int, map_mad_multiplier: float = 7.0, map_noisy_channel_relax_min: float = 0.82, map_spectral_dilate_cap: int = 5, map_max_spectral_repair_extent: int | None = 12, map_min_residual_over_cutoff: float = 1.05, map_require_spatial_local_max: bool = True, return_diagnostic_masks: bool = False) tuple[ndarray, dict[str, Any]] | tuple[ndarray, dict[str, Any], dict[str, Any]][source]
Spatial disk median on a per-spectrum normalized cube; robust positive residual test per wavelength.
Per channel λ, the cutoff is `map_mad_multiplier * (0.01 / sensitivity) * relax_λ * noise_λ`, where `noise_λ` is the scaled MAD of `(preprocessed - spatial_median_reference)` in the `(y, x)` plane, and `relax_λ` comes from `map_noisy_channel_relax_min` (noisy bands are made more sensitive).

The spectral dilation length is `min(width × N, map_spectral_dilate_cap)`. After dilation, each contiguous `True` segment along λ at fixed `(y, x)` is clipped to at most `map_max_spectral_repair_extent` channels (`None` disables the clip) so repair stays localized.

Detection uses `residual > map_min_residual_over_cutoff * cutoff`.

If `map_require_spatial_local_max`, a voxel must be a strict spatial maximum in its λ slice among its 8 neighbours (reduces false positives).

Repair: dilate core hits along λ, then for each `(y, x)` interpolate the masked samples along λ from `spatial_median_reference[y, x, :]`; unmasked λ keep `preprocessed`.

If `return_diagnostic_masks` is True, returns a third dict of diagnostics (large numpy arrays — do not put them in `DataArray.attrs`).
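The spatial reference and residual test can be sketched with `scipy.ndimage` (a simplified stand-in: square footprint instead of a disk, one global cutoff per channel, no normalization; names are illustrative, not the wdfkit implementation):

```python
import numpy as np
from scipy.ndimage import median_filter

def map_cr_residual(cube, size=3):
    """Per-wavelength spatial median reference and residual.

    cube: (ny, nx, n_spectral). The median runs over the (y, x)
    neighbourhood at each fixed spectral channel, never along lambda,
    so a cosmic ray hitting one pixel stands out against its neighbours.
    """
    reference = median_filter(cube, size=(size, size, 1))
    return cube - reference, reference

rng = np.random.default_rng(0)
cube = rng.normal(100.0, 1.0, size=(16, 16, 32))
cube[5, 7, 10] += 500.0                    # single-voxel cosmic ray
residual, reference = map_cr_residual(cube)

# Robust cutoff for one channel: k times the scaled MAD of the plane.
plane = residual[:, :, 10]
noise = 1.4826 * np.median(np.abs(plane - np.median(plane)))
hits = plane > 7.0 * noise
```

The documented routine layers per-λ `relax_λ` scaling, the spatial-local-max test, and capped spectral dilation on top of this core residual-versus-MAD comparison.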
- wdfkit.preprocessing.cosmic_ray_map.interpolate_cosmic_ray_regions_spectrally(preprocessed: ndarray, spatial_median_reference: ndarray, repair_mask: ndarray) ndarray[source]
Inpaint `repair_mask` points by interpolation along λ.

The reference curve is `spatial_median_reference[y, x, :]`; other channels keep their original `preprocessed` values.