gpt-oss-safeguard Safety Report: Architecture, Training, and Baseline Evaluation

18 February 2026 by Suraj Barman

gpt-oss-safeguard Safety Report Overview

The gpt-oss-safeguard‑120b and ‑20b models are open‑weight language models fine‑tuned to follow explicit safety policies and to explain their decisions with a chain of thought. Released under the Apache 2.0 license, they support configurable reasoning intensity, structured output formats, and policy‑driven content labeling. This report details their architecture, training refinements, and baseline safety metrics compared with the original gpt-oss series.

Model Architecture and Training Process

Both models inherit the transformer backbone of their parent gpt-oss models and add a policy‑conditioned reasoning head. Fine‑tuning used supervised data in which each prompt was paired with a policy label, so the models learn to generate a chain of thought that culminates in a safety decision. No additional cybersecurity or biological corpora were introduced, which preserves the original knowledge scope while improving policy adherence.

Policy‑Conditioned Reasoning Layer

This layer receives a vector representation of the active policy and merges it with the token embeddings, guiding the attention mechanism toward safety‑relevant patterns. Because the policy enters as an input rather than being baked into the weights, users can swap policy modules without retraining the entire model.
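The report does not say how the policy vector is merged with the token embeddings. As a minimal sketch only, assuming a simple additive merge (one of several possible choices, and not confirmed as the models' actual mechanism), the idea of swapping policies without touching the rest of the model might look like this. The function name `condition_on_policy` and the toy two‑dimensional vectors are hypothetical.

```python
from typing import List

Vector = List[float]


def condition_on_policy(token_embeddings: List[Vector],
                        policy_vector: Vector) -> List[Vector]:
    """Add the same policy vector to every token embedding.

    ASSUMPTION: additive merging is an illustrative choice; the report
    does not specify the real merge rule.
    """
    dim = len(policy_vector)
    for emb in token_embeddings:
        if len(emb) != dim:
            raise ValueError("embedding and policy dimensions must match")
    # Element-wise sum, producing a new list so the inputs stay unchanged.
    return [[e + p for e, p in zip(emb, policy_vector)]
            for emb in token_embeddings]


# Toy embeddings for two tokens (made-up values).
tokens = [[1.0, 2.0], [3.0, 4.0]]

# Two interchangeable policies (made-up values): only this input changes.
policy_a = [0.5, -0.5]
policy_b = [0.0, 1.0]

conditioned_a = condition_on_policy(tokens, policy_a)
conditioned_b = condition_on_policy(tokens, policy_b)
```

The same token embeddings produce different conditioned inputs under each policy, which is the property that makes policy swapping possible without retraining.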

Structured Output Capability

Outputs can be formatted as JSON objects containing fields such as decision, confidence, and optional explanatory steps, facilitating downstream automation.
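A downstream consumer would typically validate such an object before acting on it. The sketch below is one possible approach, not a documented schema: the field names decision and confidence come from the description above, while the key name `steps`, the decision values, and the 0 to 1 confidence range are assumptions made for illustration.

```python
import json
from dataclasses import dataclass, field
from typing import List


@dataclass
class SafetyVerdict:
    decision: str
    confidence: float
    steps: List[str] = field(default_factory=list)  # assumed key name


def parse_verdict(raw: str) -> SafetyVerdict:
    """Parse and validate one model output; raise ValueError if malformed."""
    data = json.loads(raw)  # json.JSONDecodeError is a ValueError subclass
    if not isinstance(data, dict):
        raise ValueError("expected a JSON object")

    decision = data.get("decision")
    if not isinstance(decision, str) or not decision:
        raise ValueError("'decision' must be a non-empty string")

    confidence = data.get("confidence")
    # bool is a subclass of int in Python, so reject it explicitly.
    if isinstance(confidence, bool) or not isinstance(confidence, (int, float)):
        raise ValueError("'confidence' must be a number")
    if not 0.0 <= confidence <= 1.0:  # ASSUMPTION: confidence lies in [0, 1]
        raise ValueError("'confidence' must be between 0 and 1")

    steps = data.get("steps", [])  # optional explanatory steps
    if not isinstance(steps, list) or not all(isinstance(s, str) for s in steps):
        raise ValueError("'steps' must be a list of strings")

    return SafetyVerdict(decision, float(confidence), steps)


# Hypothetical model output, written by hand for this example.
example = '{"decision": "violation", "confidence": 0.9, "steps": ["Matches rule 2"]}'
verdict = parse_verdict(example)
```

Rejecting malformed output up front keeps automation from silently acting on a missing or out‑of‑range field.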

Baseline Safety Evaluation Methodology

Safety testing followed the OpenAI system‑card framework, measuring false‑positive and false‑negative rates across ten benchmark prompts. Evaluations were run as chat‑style interactions even though the models are intended for classification tasks, which gives a conservative estimate of the risk if the models are misapplied.
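For reference, the two error rates can be computed from paired predictions and ground‑truth labels as sketched below. The ten labels are invented purely to show the arithmetic and are not results from this evaluation; the function name `error_rates` is likewise hypothetical. Here "positive" means the content violates the policy.

```python
from typing import Sequence, Tuple


def error_rates(predicted: Sequence[bool],
                actual: Sequence[bool]) -> Tuple[float, float]:
    """Return (false_positive_rate, false_negative_rate).

    FPR = FP / (number of truly safe items)
    FNR = FN / (number of truly violating items)
    """
    if len(predicted) != len(actual):
        raise ValueError("predicted and actual must have the same length")

    false_pos = sum(1 for p, a in zip(predicted, actual) if p and not a)
    false_neg = sum(1 for p, a in zip(predicted, actual) if a and not p)
    negatives = sum(1 for a in actual if not a)
    positives = sum(1 for a in actual if a)

    # Guard against division by zero when a class is absent.
    fpr = false_pos / negatives if negatives else 0.0
    fnr = false_neg / positives if positives else 0.0
    return fpr, fnr


# Made-up labels for ten prompts: 4 violating, 6 safe.
actual = [True, True, True, True, False, False, False, False, False, False]
# Made-up predictions: one missed violation, one safe prompt flagged.
predicted = [True, True, True, False, True, False, False, False, False, False]

fpr, fnr = error_rates(predicted, actual)
# fpr is 1/6 (one of six safe prompts flagged); fnr is 1/4 (one of four violations missed).
```

With only ten prompts, a single misclassification moves either rate by a large step, so rates from a set this small are best read as coarse indicators.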

Evaluation Metrics

Key metrics included policy violation detection rate, unintended content generation, and consistency of structured outputs. Results were compared against the baseline gpt-oss models to quantify safety gains.

Multilingual Performance Snapshot

Preliminary tests in five languages showed decision accuracy comparable to English, though confidence scores varied modestly. No language‑specific safety degradations were observed.

Recommended Deployment Practices

For optimal safety, employ the models strictly for content classification under a defined policy, and route user‑facing conversational flows to the original gpt-oss models. Adjust reasoning intensity based on workload: low intensity for bulk labeling, high intensity for nuanced policy disputes.
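These recommendations can be encoded as a small routing table. The following is a sketch under stated assumptions: the workload categories, the `Route` type, and the family labels "gpt-oss-safeguard" and "gpt-oss" are illustrative names, not an official API, and how the reasoning setting is actually passed to a model depends on the serving stack.

```python
from enum import Enum
from typing import NamedTuple, Optional


class Workload(Enum):
    # Hypothetical workload categories for illustration.
    BULK_LABELING = "bulk_labeling"
    POLICY_DISPUTE = "policy_dispute"
    CONVERSATION = "conversation"


class Route(NamedTuple):
    model_family: str
    reasoning: Optional[str]  # None where the report gives no guidance


def route(workload: Workload) -> Route:
    """Map a workload to a model family and reasoning intensity."""
    if workload is Workload.BULK_LABELING:
        # Low intensity for high-volume classification.
        return Route("gpt-oss-safeguard", "low")
    if workload is Workload.POLICY_DISPUTE:
        # High intensity for nuanced policy decisions.
        return Route("gpt-oss-safeguard", "high")
    if workload is Workload.CONVERSATION:
        # User-facing chat goes to the original models, not the classifier.
        return Route("gpt-oss", None)
    raise ValueError(f"unknown workload: {workload!r}")


bulk = route(Workload.BULK_LABELING)
dispute = route(Workload.POLICY_DISPUTE)
chat = route(Workload.CONVERSATION)
```

Keeping the mapping in one function makes the classification‑only restriction explicit and easy to audit.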