CORE: Context-Robust Remasking for Diffusion Language Models

University of Central Florida

CORE is a training-free inference-time revision method for masked diffusion language models that identifies context-brittle tokens by stress-testing them under jointly-masked context perturbations.

Abstract

Standard decoding in Masked Diffusion Models (MDMs) is hindered by context rigidity: tokens are retained based on transient high confidence, even though early predictions are made without full context. This creates cascade effects in which initial inconsistencies misguide the rest of the generation. Existing revision strategies attempt to mitigate this with static confidence scores, but such signals are inherently myopic; inconsistent tokens frequently look confident to the model itself. To address this, we propose Context-Robust Remasking (CORE), a training-free framework for inference-time revision. We introduce a new selection paradigm: rather than trusting static token probabilities, we identify context-brittle tokens by probing their sensitivity to targeted perturbations. We formalize revision as a robust optimization problem targeting worst-case context shifts, and CORE efficiently approximates this objective to expose unstable tokens and prioritize them for revision. On LLaDA-8B-Base, CORE delivers consistent improvements across reasoning and code benchmarks, boosting code-generation accuracy on MBPP by up to 9.2 points.

TL;DR

Problem

In masked diffusion decoding, early token choices made under partial context can later become inconsistent, causing cascading structural errors.

Idea

Identify context-brittle tokens by stress-testing their support under jointly-masked context perturbations, then remask only those tokens for revision.

Method

Restrict robustness scoring to a small candidate subset, perturb the context by masking that subset as a group, score each candidate's instability under the perturbation, and revise the most unstable tokens.

Outcome

Largest gains appear on structure-sensitive tasks (e.g., code, math, and logical reasoning), where brittle early decisions can propagate widely.

Method Overview

CORE prioritizes tokens based on context brittleness: if the current token loses support under a jointly-masked perturbed context, it is a strong candidate for revision. One revision step proceeds in four stages (see the sketch after this list):

  1. Select a small candidate subset using a lightweight uncertainty proxy (e.g., small top-2 margin).
  2. Perturb context by masking the candidate subset jointly.
  3. Score instability by evaluating each current token’s likelihood under the perturbed pass.
  4. Revise the most unstable tokens using predictions from the perturbed pass.
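
For concreteness, below is a minimal PyTorch sketch of one CORE revision step on a single sequence. It is an illustration under stated assumptions rather than the released implementation: `model` is assumed to map a [1, L] token grid to [1, L, V] logits in one pass, `MASK_ID` stands in for the model-specific mask token, and `N_CANDIDATES` / `N_REVISE` are hypothetical defaults.

    import torch

    MASK_ID = 0          # illustrative mask-token id (model-specific in practice)
    N_CANDIDATES = 8     # size of the candidate subset (hypothetical default)
    N_REVISE = 2         # how many unstable tokens to remask per step

    @torch.no_grad()
    def core_revision_step(model, x):
        """One CORE revision step on a 1-D token sequence x of shape [L]."""
        # Step 1: candidate selection via a lightweight uncertainty proxy
        # (top-2 probability margin at currently unmasked positions).
        probs = model(x.unsqueeze(0)).squeeze(0).softmax(dim=-1)   # [L, V]
        top2 = probs.topk(2, dim=-1).values                        # [L, 2]
        margin = top2[:, 0] - top2[:, 1]                           # small = uncertain
        unmasked = x != MASK_ID
        margin[~unmasked] = float("inf")                           # only probe decided tokens
        k = min(N_CANDIDATES, int(unmasked.sum()))
        cand = margin.topk(k, largest=False).indices               # most uncertain positions

        # Step 2: context perturbation -- jointly mask the whole candidate set.
        x_pert = x.clone()
        x_pert[cand] = MASK_ID

        # Step 3: instability scoring -- how strongly does the perturbed pass
        # still support each current token at its own (now masked) position?
        pert_logits = model(x_pert.unsqueeze(0)).squeeze(0)        # [L, V]
        support = pert_logits.log_softmax(dim=-1)[cand, x[cand]]   # [k]

        # Step 4: selective revision -- remask the least-supported tokens and
        # fill them with the perturbed pass's predictions.
        unstable = cand[support.topk(min(N_REVISE, k), largest=False).indices]
        x_new = x.clone()
        x_new[unstable] = pert_logits[unstable].argmax(dim=-1)
        return x_new

In a full decoder this step would presumably be interleaved with the usual unmasking schedule, spending one extra forward pass per revision step.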

Candidate Selection

To control compute, CORE restricts robustness scoring to a small candidate set proposed by an inexpensive uncertainty heuristic (e.g., the top-2 probability margin), then applies the stronger robustness test only to those positions.
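
A minimal sketch of such a proxy, assuming the margin is computed on softmax probabilities (a raw-logit gap would be an equally plausible reading):

    import torch

    def top2_margin(logits: torch.Tensor) -> torch.Tensor:
        """Per-position gap between the two most likely tokens; a small gap
        means the model is torn between alternatives, making the position a
        cheap-to-find candidate for the stronger robustness test."""
        top2 = logits.softmax(dim=-1).topk(2, dim=-1).values  # [..., 2]
        return top2[..., 0] - top2[..., 1]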

Context Perturbation

Jointly masking the candidate set creates a controlled context shift that exposes conditional dependencies and hidden structural roles.
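
A sketch of the perturbation itself; the group mask is a single in-place edit of a copied sequence:

    import torch

    def perturb_context(x: torch.Tensor, cand: torch.Tensor, mask_id: int) -> torch.Tensor:
        """Return a copy of the sequence with every candidate position masked
        at once, so a single forward pass reveals how the candidates support
        one another rather than testing each in isolation."""
        x_pert = x.clone()
        x_pert[cand] = mask_id
        return x_pert

At least as presented here, masking the set jointly keeps the overhead at a single extra forward pass per revision step, rather than one pass per candidate.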

Instability Scoring

CORE ranks candidates by how well the current token is supported under the jointly-masked perturbed context (the masked-context support signal).
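
A sketch of this signal under the same assumed interface as above (`model` returning per-position logits over the vocabulary):

    import torch

    @torch.no_grad()
    def masked_context_support(model, x, x_pert, cand):
        """Log-probability the perturbed pass assigns to each current token at
        its own (now masked) position; low support marks a context-brittle
        token."""
        logp = model(x_pert.unsqueeze(0)).squeeze(0).log_softmax(dim=-1)  # [L, V]
        return logp[cand, x[cand]]                                        # [|cand|]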

Selective Revision

Only the most unstable tokens are remasked and regenerated, preserving stable regions while fixing brittle dependencies.
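
A sketch of the final step; taking the perturbed pass's argmax as the replacement is an assumption of this sketch, chosen as the simplest way to "use predictions from the perturbed pass":

    import torch

    def selective_revision(x, cand, support, pert_logits, n_revise):
        """Remask only the n_revise least-supported candidates and fill them
        with the perturbed pass's predictions; all other positions stay
        untouched."""
        unstable = cand[support.topk(min(n_revise, len(cand)), largest=False).indices]
        x_new = x.clone()
        x_new[unstable] = pert_logits[unstable].argmax(dim=-1)
        return x_new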

Results

CORE improves over the base sampler and shows its largest gains on structure-sensitive code benchmarks. Notably, the confidence-based baseline (ReMDM-conf) can degrade code performance under some unmasking strategies, while CORE yields consistent improvements on HumanEval and MBPP. On GSM8K, CORE remains competitive and does not trade away reasoning accuracy for code gains.

Table 1: Accuracy (%) under two unmasking strategies

Method             GSM8K (4)   MATH (4)   BBH (3)   HumanEval (0)   MBPP (3)

Low-Confidence unmasking
  Base               51.40       16.72      45.81      12.20           15.60
  + ReMDM-conf       52.31       16.56      46.05      10.98           15.20
  + CORE (Ours)      52.69       17.06      47.18*     17.07*          24.80*

Top-k Margin unmasking
  Base               50.27       17.54      48.33      17.07           21.20
  + ReMDM-conf       51.78       18.20      46.31      14.02           14.80
  + CORE (Ours)      51.40       18.34      49.01*     22.56*          29.60*

Values are accuracy (%). Parentheses indicate the number of few-shot examples. * denotes statistically significant improvements over the best baseline in that block. ReMDM-conf follows the notation in the paper.


Table 2: Accuracy (%) for remasking-selection controls

Method             GSM8K (4)   MATH (4)   BBH (3)   HumanEval (0)   MBPP (3)

Low-Confidence unmasking
  Base               51.40       16.72      45.81      12.20           15.60
  + Random Remask    51.55       16.72      45.77      13.41           16.60
  + Margin Remask    51.33       16.74      46.29      13.41           17.40
  + CORE (Ours)      52.69       17.06      47.18      17.07           24.80

Selection-mechanism controls under Low-Confidence unmasking (compute-matched).

BibTeX

@misc{zhai2026corecontextrobustremaskingdiffusion,
    title={CORE: Context-Robust Remasking for Diffusion Language Models}, 
    author={Kevin Zhai and Sabbir Mollah and Zhenyi Wang and Mubarak Shah},
    year={2026},
    eprint={2602.04096},
    archivePrefix={arXiv},
    primaryClass={cs.LG},
    url={https://arxiv.org/abs/2602.04096}, 
}