Predictive Soil Model overview

Collecting soil samples from every field in a large-scale carbon project is rarely practical. Regrow's Predictive Soil Model bridges this gap by taking measurements from a strategically selected subset of fields and extending those values across the entire project area. The result is a field-level digital soil map providing estimates of soil organic carbon (SOC) and bulk density — the key inputs needed to run DNDC simulations and quantify carbon outcomes.

This article explains how the model works, how its outputs are used, and how uncertainty is handled. It covers the third phase in Regrow's soil workflow, following the Soil Sample Design and soil measurements.

What you'll learn:

Model inputs
Predictive modeling methodology
How predicted values initialize DNDC
Uncertainty handling
Edge cases for non-sampled or out-of-domain fields

Why a predictive model?

Protocols such as Verra VM0042 or CAR SEP require DNDC to be initialized with field-level measurements of SOC and bulk density. Directly sampling every field is often cost-prohibitive at scale. Regrow's approach uses a stratified sampling design to collect measurements from a representative subset of fields, then fits a statistical model to extrapolate those measurements across all fields in the project.

This approach aligns with the alternative method described on page 55 of VM0042, which permits modeling at the level of a homogeneous management unit — in this case, the individual field.

Model inputs

The Predictive Soil Model relies on two categories of inputs:

Measured soil samples collected at stratified sample locations as part of the Soil Sample Design Plan. Each sample provides values for:
- SOC percentage (SOC%)
- Fine Soil Bulk Density (g/cm³)
Auxiliary geospatial data recorded at every point on the 30-meter discretized field grid, including:

- Soil texture and properties from SoilGrids (or SSURGO for CONUS fields)
- IPCC climate zone classification
- Elevation and topographic indices
- Remotely sensed vegetation indices
- Country or region (as a proxy for land management practices)

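These auxiliary variables are the same data layers used to define strata in the Soil Sample Design Model. They inform the extrapolation indirectly: they determine which stratum each grid point belongs to, and the stratum in turn determines the predicted value.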

Model methodology

Regrow's Predictive Soil Model uses a stratified additive approach to estimate soil properties. Rather than assuming a single average for the entire project, the model identifies the specific statistical "offset" associated with each stratum.

Tip: Think of it like estimating the price of a house based on its neighborhood. If you know the average price in "Neighborhood A," you can provide a grounded estimate for any house in that area, even if you haven't stepped inside. The model learns from the fields you did sample to make informed, group-based predictions about the fields you didn't.

Using an ANOVA approach within a GAM framework

ANOVA (Analysis of Variance) is a statistical method used to determine whether the means of two or more independent groups differ significantly. It is most often applied when there are three or more groups.

While a standard ANOVA tells us whether a difference exists between strata, using ANOVA within a GAM framework allows us to quantify those differences as specific "offsets". The components of the model are:

The "Treatment" (Factor): The strata_id.
The "Response": The SOC% or bulk density measurement.
The "Fixed Effect": The model calculates a unique mean for each strata_id.
The "Pooled Variance": The model assumes the "spread" (variance) of the data is the same across all strata.

By using an Additive Factor Model (which is functionally an ANOVA), we aren't trying to force the soil data into a trend line that might not exist, as approaches that fit a continuous trend can. Instead, this method groups fields by their shared characteristics. The model then calculates the most likely mean for each group, and the ANOVA framing lets us check whether the differences between groups are larger than chance alone would explain.
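The idea of per-stratum means and "offsets" can be illustrated with a minimal sketch. The sample values are invented for illustration, and the function names are hypothetical. The article does not say whether offsets are expressed relative to a reference stratum or to a project-wide mean, so the reference-stratum convention used here is an assumption.

```python
from collections import defaultdict
from statistics import mean

# Hypothetical lab results as (strata_id, SOC%) pairs. Values are illustrative only.
samples = [
    ("A", 1.8), ("A", 2.2), ("A", 2.0),
    ("B", 3.1), ("B", 2.9),
    ("C", 1.2), ("C", 1.0), ("C", 1.4),
]


def stratum_means(samples):
    """Group measurements by strata_id and return the mean of each group."""
    groups = defaultdict(list)
    for strata_id, value in samples:
        groups[strata_id].append(value)
    return {strata_id: mean(values) for strata_id, values in groups.items()}


def stratum_offsets(samples, reference):
    """Express each stratum mean as an offset from a chosen reference stratum.

    Assumption: reference-level coding. The production model may use a
    different parameterization.
    """
    means = stratum_means(samples)
    return {strata_id: m - means[reference] for strata_id, m in means.items()}


means = stratum_means(samples)
offsets = stratum_offsets(samples, reference="A")
```

Any field in stratum B would then be predicted at the reference mean plus B's offset, which is the same as B's own mean.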

Fitting the model

Regrow fits two independent statistical models, one for SOC% and one for bulk density. While these are implemented using the Generalized Additive Model (GAM) framework, the model is structured to treat the strata_id (derived from country, soil texture class, and climate zone) as a discrete factor effect, similar to an ANOVA.

For each response variable, the model calculates a unique mean for every individual stratum identified during the sample design phase. Crucially, while these means are unique to each stratum, a common variance is estimated for the entire project. This creates a "pooling" effect: by assuming a shared error structure across all groups, the model borrows statistical strength from the full dataset when estimating variability. The result is a more stable variance estimate, and therefore more reliable 90% prediction intervals, than could be achieved by looking at any single stratum in isolation, especially for strata with limited sample counts.
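The pooling effect can be sketched in a few lines. This is a simplified illustration, not Regrow's implementation: the function names and sample values are hypothetical, and the interval uses a normal quantile where a full treatment would typically use a t quantile with the residual degrees of freedom. The article also does not specify how point-level intervals become field-level intervals, so this shows only the interval for a single new observation in a stratum.

```python
from collections import defaultdict
from math import sqrt
from statistics import NormalDist, mean

# Hypothetical (strata_id, SOC%) samples. Stratum "D" has a single sample.
samples = [
    ("A", 1.8), ("A", 2.2), ("A", 2.0),
    ("B", 3.1), ("B", 2.9),
    ("C", 1.2), ("C", 1.0), ("C", 1.4),
    ("D", 2.5),
]


def fit_stratum_model(samples):
    """Fit one mean per stratum and a single variance pooled across strata."""
    groups = defaultdict(list)
    for strata_id, value in samples:
        groups[strata_id].append(value)
    means = {sid: mean(vals) for sid, vals in groups.items()}
    counts = {sid: len(vals) for sid, vals in groups.items()}
    # Residuals are measured against each sample's own stratum mean.
    rss = sum((value - means[sid]) ** 2 for sid, value in samples)
    dof = len(samples) - len(groups)  # n observations minus k strata
    if dof <= 0:
        raise ValueError("need more samples than strata to estimate variance")
    return {"means": means, "counts": counts, "pooled_var": rss / dof}


def prediction_interval(model, strata_id, level=0.90):
    """Approximate prediction interval for a new observation in a stratum.

    Assumption: normal quantile used in place of a t quantile.
    """
    if strata_id not in model["means"]:
        raise KeyError(f"stratum {strata_id!r} has no usable samples")
    z = NormalDist().inv_cdf(0.5 + level / 2)
    n = model["counts"][strata_id]
    se = sqrt(model["pooled_var"] * (1 + 1 / n))
    center = model["means"][strata_id]
    return center - z * se, center + z * se


model = fit_stratum_model(samples)
low_a, high_a = prediction_interval(model, "A")
low_d, high_d = prediction_interval(model, "D")
```

Stratum D has only one sample, so it could not produce a variance estimate on its own. With the pooled variance it still receives an interval, one that is somewhat wider than stratum A's because its mean rests on fewer samples.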

Why this approach is well-suited for this use case:

- Isolates landscape-scale variance: By treating the strata_id as a discrete factor, the model can quantify exactly how much soil properties differ between various climate zones and texture classes.
- Data efficiency: Factor-based models can extract a reliable signal from a modest number of well-placed samples, which is ideal for stratified soil sampling programs.
- Honest uncertainty estimates: The model generates interpretable prediction intervals based on the global variance of the dataset, providing the transparency required by carbon program auditors.

Limitations: The model may perform less well for smaller programs and smaller field areas.

Standard fit metrics like R² can be unreliable indicators of model quality, particularly for small programs or sparse datasets. When a project has fewer than roughly 15–20 observations in total, treat fit metrics with caution and evaluate model outputs in the context of known soil variability in the project area.

Prediction domain

The model is applied to every discretized 30-meter grid point across the entire project area. Its valid domain is limited to points belonging to strata where at least one lab measurement was returned. Points in strata with no usable samples are excluded from model predictions — see the Edge Cases section below for how these fields are handled.

Point-level predictions are averaged by field to produce a single mean SOC% and mean bulk density for each field. These field-level values are what get passed to DNDC.
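A minimal sketch of this aggregation step follows. The data layout and the function name are hypothetical, and it assumes that a field's mean is taken over its in-domain grid points only, which is one reading of the description above.

```python
from collections import defaultdict
from statistics import mean


def field_means(grid_points, stratum_means):
    """Average point-level predictions by field.

    grid_points: iterable of (field_id, strata_id), one entry per 30 m point.
    stratum_means: predicted value for each stratum with usable samples.
    Points in strata missing from stratum_means are out of domain and skipped,
    so a field with no in-domain points is absent from the result.
    """
    per_field = defaultdict(list)
    for field_id, strata_id in grid_points:
        if strata_id in stratum_means:
            per_field[field_id].append(stratum_means[strata_id])
    return {field_id: mean(values) for field_id, values in per_field.items()}


# Illustrative values only. Stratum "Z" has no usable samples.
stratum_means = {"A": 2.0, "B": 3.0}
grid_points = [
    ("f1", "A"), ("f1", "A"), ("f1", "B"),  # spans two sampled strata
    ("f2", "Z"),                            # entirely out of domain
    ("f3", "A"), ("f3", "Z"),               # partly out of domain
]
result = field_means(grid_points, stratum_means)
```

Here `f1` averages its three point predictions, `f3` is computed from its single in-domain point, and `f2` receives no prediction.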

Handling uncertainty

The Predictive Soil Model doesn't just provide a single number; it quantifies the confidence of its estimates. Because the model uses a pooled variance approach, the uncertainty is calculated based on the global behavior of the dataset rather than just the samples within a single stratum.

Global Variance: The model estimates a single, common variance for all strata. This assumes that while the average SOC% or bulk density changes between environments, the "noise" or spread of the data remains relatively consistent across the project.
Borrowing Strength: By using this shared error term, the model generates a reliable 90% prediction interval for every field. This is particularly valuable for fields in strata with few samples; instead of having an uncertainty range based on limited data, they benefit from the statistical power of the entire project dataset.
Propagating Uncertainty: For protocols like Verra VM0042, this uncertainty is propagated through DNDC. We run simulations initialized at both the lower and upper bounds of the 90% prediction interval to ensure the final carbon outcomes account for the inherent variability in the initial soil state.

Model domain and edge cases

The table below summarizes how common scenarios are handled when fields fall outside or at the boundary of the model’s valid domain.

Scenario	How It's Handled
Field is in a stratum with ≥1 usable sample	Field receives predicted soil properties from the soil model.
Field is in a stratum with 0 usable samples	Field is excluded from model domain; SoilGrids defaults may be used as fallback, or the field may be excluded from quantification.
Field is dropped from the program after sample design	Samples from the dropped field are excluded; remaining fields in the same stratum are unaffected.
New fields added after sample design	If new fields fall within an existing stratum, Soil Model predictions apply. If in a new or unseen stratum, SoilGrids may be used as fallback.
Fields spanning multiple strata	A field can belong to multiple strata; it is not at risk as long as at least one of its strata has a usable sample.
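
The rules in the table can be expressed as a small decision function. This is an illustrative sketch with hypothetical names and status labels. It does not decide between the SoilGrids fallback and exclusion from quantification, since the article leaves that choice open.

```python
def usable_strata(lab_samples, dropped_fields=frozenset()):
    """Return the strata that still have at least one usable sample.

    lab_samples: iterable of (field_id, strata_id) for usable lab results.
    Samples taken on dropped fields are ignored.
    """
    return {sid for field_id, sid in lab_samples if field_id not in dropped_fields}


def classify_field(field_strata, sampled_strata):
    """Classify a field by whether any of its strata has a usable sample."""
    if set(field_strata) & set(sampled_strata):
        return "predicted"
    return "fallback_or_excluded"


# Illustrative data: field f2 was dropped after sample design.
lab_samples = [("f1", "A"), ("f2", "B"), ("f3", "B")]
sampled = usable_strata(lab_samples, dropped_fields={"f2"})

status = {
    "existing_field_in_B": classify_field(["B"], sampled),
    "new_field_in_A": classify_field(["A"], sampled),
    "new_field_in_unseen_stratum": classify_field(["Q"], sampled),
    "field_spanning_A_and_Q": classify_field(["A", "Q"], sampled),
}
```

In this example stratum B remains covered by the sample from `f3` after `f2` is dropped, so other fields in B are still predicted by the model.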

Model outputs and interpretation

The main outputs of the Predictive Soil Model are:

Field-level estimates of SOC% and bulk density for every field in the project
A project-specific digital soil map providing spatially continuous predictions across the project area
90% prediction intervals quantifying predictive uncertainty at the field level

These outputs serve two purposes. First, they initialize DNDC, which produces estimates of SOC change (dSOC), nitrous oxide (N₂O) emissions, and in some cases methane (CH₄). Second, they communicate model assumptions and uncertainty to auditors and stakeholders, supporting the credibility of carbon offset claims.

DNDC output uncertainty is combined with the model’s structural uncertainty, which is based on literature-derived estimates, to produce a final uncertainty range for project outcomes.

Using predicted soil properties to initialize DNDC

The predicted SOC% and bulk density values for each field are used as initialization inputs for the DNDC biogeochemical model, which then simulates changes in soil carbon and greenhouse gas emissions over time.

How these inputs are used depends on whether input uncertainty is considered negligible (de minimis):

When input uncertainty is de minimis — In most protocols, the same SOC and bulk density values are used to initialize DNDC for both the baseline and practice-change scenarios. Because the uncertainty in these inputs cancels out when calculating the difference, it is treated as negligible. In this case, each field runs a single deterministic DNDC simulation using the field mean SOC and bulk density from the digital soil map.

When input uncertainty must be propagated — For some protocols, such as Verra VM0042, the de minimis assumption may not be justified. In these cases, input uncertainty is propagated through DNDC using the 90% prediction interval from the model. Two DNDC simulations are run per field: one initialized at the lower bound and one at the upper bound of the prediction interval. The resulting range brackets the potential effect of input uncertainty on key outputs.
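
The two cases can be sketched as a function that builds the list of initializations for one field. Everything here is hypothetical: the function name, the dictionary keys, and the example values are for illustration and do not reflect DNDC's actual input format. Pairing the lower SOC bound with the lower bulk density bound (and upper with upper) is also an assumption, since the article does not say how the two intervals are combined.

```python
def dndc_initializations(field_id, soc, bulk_density, propagate_uncertainty):
    """Build the soil initializations to simulate for one field.

    soc and bulk_density are (lower, mean, upper) tuples taken from the
    field-level prediction and its 90% prediction interval.
    """
    soc_low, soc_mean, soc_high = soc
    bd_low, bd_mean, bd_high = bulk_density

    if not propagate_uncertainty:
        # De minimis case: a single deterministic run at the field means.
        return [
            {"field_id": field_id, "run": "mean",
             "soc_pct": soc_mean, "bulk_density": bd_mean},
        ]

    # Propagation case: one run at each end of the prediction interval.
    # Assumption: lower bounds are paired together, as are upper bounds.
    return [
        {"field_id": field_id, "run": "lower",
         "soc_pct": soc_low, "bulk_density": bd_low},
        {"field_id": field_id, "run": "upper",
         "soc_pct": soc_high, "bulk_density": bd_high},
    ]


# Illustrative values only.
soc = (1.6, 2.0, 2.4)            # SOC%
bulk_density = (1.20, 1.30, 1.40)  # g/cm³

single_run = dndc_initializations("f1", soc, bulk_density, propagate_uncertainty=False)
bounded_runs = dndc_initializations("f1", soc, bulk_density, propagate_uncertainty=True)
```

The difference between the outputs of the lower and upper runs then gives the range attributable to input uncertainty for that field.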