gpt-oss-safeguard provides open‑weight reasoning models that classify content based on developer‑supplied safety policies at inference time, enabling explainable and adaptable safety pipelines.
Core Capabilities
The models combine chain‑of‑thought reasoning with policy input to deliver transparent decisions.
- Accepts a policy and target content simultaneously.
- Outputs a classification label plus step‑by‑step reasoning.
- Supports two sizes: 120 billion and 20 billion parameters.
- Licensed under Apache 2.0 for unrestricted modification and redistribution.
- Built on large language model technology.
Deployment Options
Models are hosted on Hugging Face and can be run locally or in cloud environments.
- Download via Hugging Face repositories.
- Run on GPU‑enabled servers for low‑latency inference.
- Integrate with existing moderation pipelines using standard REST calls.
- Use cloud computing architecture for auto‑scaling.
- Reference Choosing the Right AI Model for hardware sizing.
Policy Reasoning Workflow
Developers write policies in natural language; the model interprets them at runtime.
- Write a policy document (e.g., "no cheating discussion").
- Pass the policy string as an input field during inference.
- Model generates a rationale explaining its decision.
- Review the rationale to audit policy interpretation.
- Iterate quickly by editing the policy without retraining.
Performance and Evaluation
Benchmarks show competitive accuracy on multi‑policy tasks despite smaller size.
- Outperforms baseline open models on internal multi‑policy tests.
- Matches or exceeds public moderation datasets in safety recall.
- Latency higher than static classifiers; suitable where explainability matters.
- Evaluation details are in the released technical report.
- See Securing Development Environments for safe deployment practices.
Known Limitations and Mitigations
Understanding current constraints helps plan realistic usage.
- Reasoning adds compute cost; pair with fast pre‑filters to limit load.
- For high‑volume streams, use a lightweight classifier to triage content.
- Policy quality directly impacts accuracy; test policies on representative samples.
- Model may lag behind specialized classifiers on niche risks.
- Community feedback is encouraged via the ROOST Model Community repository.