false-refusal-rate

Here is 1 public repository matching this topic...

MidnightDarling / when-better-means-less

When Better Means Less: Quantifying What Benchmarks Miss Between Model Generations. 2,310 controlled comparisons show GPT-5 series lost 6.7x creativity and gained 4.4x false refusals vs chatgpt-4o-latest — invisible to standard benchmarks.

ai-safety model-comparison gpt-5 llm-evaluation benchmark-evaluation chatgpt-4o-latest keep4o alignment-tax false-refusal-rate creativity-measurement

Updated Feb 23, 2026
Python

