DPO
ai_safety
Overview
Developed by: Stanford University
Open source: ✓ Yes
Use case: directly optimizing a language model policy without a reward model
Also see:
Alternative to: RLHF
Knowledge graph stats
Claims: 9
Avg confidence: 94%
Avg freshness: 99%
Last updated: yesterday
Trust distribution
100% unverified
Governance
Not assessed
DPO
concept
Direct Preference Optimization (DPO): a simplified alternative to RLHF that optimizes the policy directly on preference data, without training a separate reward model.
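For context, the DPO objective introduced by Rafailov et al. (2023) replaces the learned reward model of RLHF with a log-ratio against a frozen reference policy:

$$
\mathcal{L}_{\mathrm{DPO}}(\pi_\theta;\pi_{\mathrm{ref}}) = -\,\mathbb{E}_{(x,\,y_w,\,y_l)\sim\mathcal{D}}\!\left[\log\sigma\!\left(\beta\log\frac{\pi_\theta(y_w\mid x)}{\pi_{\mathrm{ref}}(y_w\mid x)} - \beta\log\frac{\pi_\theta(y_l\mid x)}{\pi_{\mathrm{ref}}(y_l\mid x)}\right)\right]
$$

where $y_w$ is the preferred completion, $y_l$ the dispreferred one, $\sigma$ the logistic function, and $\beta$ a temperature controlling the strength of the implicit KL constraint toward $\pi_{\mathrm{ref}}$.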
used by
| Value | Trust | Confidence | Freshness | Sources |
|---|---|---|---|---|
| Mistral AI | ○Unverified | High | Fresh | 1 |
primary use case
| Value | Trust | Confidence | Freshness | Sources |
|---|---|---|---|---|
| directly optimizing language model policy without a reward model | ○Unverified | High | Fresh | 1 |
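The "primary use case" claim above can be made concrete with a short sketch of the loss computation. This is an illustrative reference implementation, not taken from any particular library; the tensor and function names are placeholders.

```python
import torch
import torch.nn.functional as F

def dpo_loss(policy_chosen_logps: torch.Tensor,
             policy_rejected_logps: torch.Tensor,
             ref_chosen_logps: torch.Tensor,
             ref_rejected_logps: torch.Tensor,
             beta: float = 0.1) -> torch.Tensor:
    """DPO loss over a batch of preference pairs.

    Each tensor holds the summed log-probability of the chosen or
    rejected completion under the trainable policy or the frozen
    reference model (one value per pair in the batch).
    """
    # Implicit rewards: how much more likely each completion is under
    # the policy than under the reference model, scaled by beta.
    chosen_margin = beta * (policy_chosen_logps - ref_chosen_logps)
    rejected_margin = beta * (policy_rejected_logps - ref_rejected_logps)
    # Logistic loss pushing the chosen margin above the rejected one;
    # no explicit reward model is involved anywhere.
    return -F.logsigmoid(chosen_margin - rejected_margin).mean()

# Toy usage with random log-probabilities for a batch of 4 pairs.
if __name__ == "__main__":
    lp = {k: torch.randn(4) for k in
          ("pol_chosen", "pol_rejected", "ref_chosen", "ref_rejected")}
    print(dpo_loss(lp["pol_chosen"], lp["pol_rejected"],
                   lp["ref_chosen"], lp["ref_rejected"]))
```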
first released
| Value | Trust | Confidence | Freshness | Sources |
|---|---|---|---|---|
| 2023 | ○Unverified | High | Fresh | 1 |
developed by
| Value | Trust | Confidence | Freshness | Sources |
|---|---|---|---|---|
| Stanford University | ○Unverified | High | Fresh | 1 |
alternative to
| Value | Trust | Confidence | Freshness | Sources |
|---|---|---|---|---|
| RLHF | ○Unverified | High | Fresh | 1 |
open source
| Value | Trust | Confidence | Freshness | Sources |
|---|---|---|---|---|
| true | ○Unverified | High | Fresh | 1 |
implemented by
| Value | Trust | Confidence | Freshness | Sources |
|---|---|---|---|---|
| TRL library | ○Unverified | High | Fresh | 1 |
| OpenAI | ○Unverified | Moderate | Fresh | 1 |
| Anthropic | ○Unverified | Moderate | Fresh | 1 |
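As a usage illustration for the TRL library claim above, here is a minimal sketch of DPO fine-tuning with TRL's `DPOTrainer`. The base model and dataset names are placeholders, and exact argument names (for example `processing_class` vs. `tokenizer`) vary across TRL versions.

```python
# Minimal DPO fine-tuning sketch with TRL; names below are placeholders.
from datasets import load_dataset
from transformers import AutoModelForCausalLM, AutoTokenizer
from trl import DPOConfig, DPOTrainer

model_name = "Qwen/Qwen2-0.5B-Instruct"  # placeholder base model
model = AutoModelForCausalLM.from_pretrained(model_name)
tokenizer = AutoTokenizer.from_pretrained(model_name)

# Preference dataset with "prompt", "chosen", and "rejected" fields.
train_dataset = load_dataset("trl-lib/ultrafeedback_binarized", split="train")

args = DPOConfig(output_dir="dpo-model", beta=0.1)

# When no ref_model is passed, TRL builds a frozen copy of the policy
# to serve as the reference model in the DPO loss.
trainer = DPOTrainer(
    model=model,
    args=args,
    train_dataset=train_dataset,
    processing_class=tokenizer,
)
trainer.train()
```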