Reproducible Diffusion Model Benchmarks
In Hours, Not Weeks
Automate prompt pack setup, seed control, and metric scoring so you can run experiments faster, publish sooner, and stay ahead ✨
Take the Pain Out of Diffusion Benchmarking
Compatible with all major image generation APIs and open-source models
Leaderboard Overview
200 Prompts Benchmarked in 45 Minutes per Model
See how top diffusion models compare using reproducible metrics like CLIP Score, FID, and Composition Correctness. DreamLayer’s automated pipeline handled generation, scoring, and output aggregation.

To add your model to the leaderboard, contact us.
Methodology:
Microsoft's COCO dataset
Evaluated on the CIFAR training set
Same prompts, seeds, and configs for every model
Released in September 2025
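The "same prompts, seeds, and configs" requirement is what makes these runs reproducible: every model sees an identical prompt pack, an identical seed schedule, and an identical sampler configuration. Below is a minimal sketch of such a generation loop using Hugging Face diffusers; the model ID, prompt-pack file, and output layout are illustrative assumptions, not DreamLayer's actual pipeline.

```python
# Minimal sketch of a reproducible generation loop (assumption: DreamLayer's
# real pipeline differs; this only illustrates fixed prompts, seeds, and configs).
import json
import os

import torch
from diffusers import DiffusionPipeline

PROMPT_PACK = "prompts/coco_200.json"   # hypothetical prompt-pack file
SEED = 42                               # fixed base seed shared by every model run
CONFIG = {"num_inference_steps": 30, "guidance_scale": 7.5}  # frozen sampler config

pipe = DiffusionPipeline.from_pretrained(
    "stabilityai/stable-diffusion-2-1", torch_dtype=torch.float16
).to("cuda")

with open(PROMPT_PACK) as f:
    prompts = json.load(f)              # list of prompt strings

os.makedirs("outputs/sd21", exist_ok=True)
for i, prompt in enumerate(prompts):
    # Re-seeding per prompt keeps each image independent of generation order.
    generator = torch.Generator(device="cuda").manual_seed(SEED + i)
    image = pipe(prompt, generator=generator, **CONFIG).images[0]
    image.save(f"outputs/sd21/{i:04d}.png")
```

Running the same loop with a different model ID but the same prompt pack, seeds, and config is what makes the per-model scores below comparable.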
CLIP Score
Measures how closely a generated image matches its text prompt; higher is better.
Rank | Company | Model | CLIP Score
1 | Luma Labs | Photon | 0.265
2 | Black Forest Labs | Flux Pro | 0.263
3 | OpenAI | Dall-E 3 | 0.259
4 | Google Gemini | Nano Banana | 0.258
5 | Runway AI | Runway Gen 4 | 0.2505
6 | Ideogram | Ideogram V3 | 0.2501
7 | Stability AI | Stability SD Turbo | 0.249
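CLIP Score is typically computed as the cosine similarity between CLIP embeddings of the prompt and of the generated image, averaged over the prompt pack. Here is a minimal sketch using the openai/clip-vit-base-patch32 checkpoint via Hugging Face transformers; the checkpoint choice and preprocessing are assumptions for illustration, not necessarily the exact setup behind the numbers above.

```python
# Hedged sketch: prompt-image CLIP similarity with Hugging Face transformers.
# The CLIP checkpoint and averaging scheme are assumptions, not DreamLayer's exact setup.
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

def clip_score(image_path: str, prompt: str) -> float:
    """Cosine similarity between the image and prompt embeddings (roughly 0-1)."""
    inputs = processor(text=[prompt], images=Image.open(image_path),
                       return_tensors="pt", padding=True)
    with torch.no_grad():
        out = model(**inputs)
    img = out.image_embeds / out.image_embeds.norm(dim=-1, keepdim=True)
    txt = out.text_embeds / out.text_embeds.norm(dim=-1, keepdim=True)
    return (img * txt).sum(dim=-1).item()

# A benchmark-level CLIP Score is the mean over all prompt/image pairs.
```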
FID Score
Assesses how close AI-generated images are to real images; lower is better.
Rank | Company | Model | FID Score
1 | Ideogram | Ideogram V3 | 305.60
2 | OpenAI | Dall-E 3 | 306.08
3 | Runway AI | Runway Gen 4 | 317.52
4 | Luma Labs | Photon | 318.55
5 | Black Forest Labs | Flux Pro | 318.63
6 | Google Gemini | Nano Banana | 318.80
7 | Stability AI | Stability SD Turbo | 321.75
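FID compares the distribution of Inception-v3 features from generated images against features from a real reference image set. A minimal sketch of the final distance computation, assuming the 2048-dimensional feature arrays have already been extracted, is shown below; feature extraction and the choice of reference set are left out.

```python
# Hedged sketch of the Fréchet Inception Distance given pre-extracted
# Inception-v3 features; feature extraction itself is omitted.
import numpy as np
from scipy import linalg

def fid(real_feats: np.ndarray, gen_feats: np.ndarray) -> float:
    """real_feats, gen_feats: arrays of shape (num_images, 2048)."""
    mu_r, mu_g = real_feats.mean(axis=0), gen_feats.mean(axis=0)
    cov_r = np.cov(real_feats, rowvar=False)
    cov_g = np.cov(gen_feats, rowvar=False)
    # Matrix square root of the product of the two covariance matrices.
    covmean, _ = linalg.sqrtm(cov_r @ cov_g, disp=False)
    if np.iscomplexobj(covmean):
        covmean = covmean.real
    diff = mu_r - mu_g
    return float(diff @ diff + np.trace(cov_r + cov_g - 2.0 * covmean))
```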
F1 Score
Combines precision and recall to show overall image accuracy.
Rank | Company | Model | F1 Score
1 | Luma Labs | Photon | 0.463
2 | Stability AI | Stability SD Turbo | 0.447
3 | Runway AI | Runway Gen 4 | 0.445
4 | Black Forest Labs | Flux Pro | 0.421
5 | Ideogram | Ideogram V3 | 0.415
6 | OpenAI | Dall-E 3 | 0.380
7 | Google Gemini | Nano Banana | 0.351
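F1 is the harmonic mean of precision and recall, so a model only scores well when both are reasonably high. The helper below shows the textbook formula with made-up example numbers; how the leaderboard aggregates F1 across images is not specified here, so treat this as the standard definition rather than DreamLayer's exact scoring code.

```python
# Textbook F1: the harmonic mean of precision and recall.
def f1_score(precision: float, recall: float) -> float:
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)

# Example (illustrative numbers): precision 0.40 and recall 0.50 give F1 ≈ 0.444.
print(f1_score(0.40, 0.50))  # 0.4444...
```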
Precision
Measures how many of the generated images came out correct, out of the total number of images the model generated.
Rank | Company | Model | Precision Score
1 | Luma Labs | Photon | 0.448
2 | Stability AI | Stability SD Turbo | 0.432
3 | Runway AI | Runway Gen 4 | 0.423
4 | Black Forest Labs | Flux Pro | 0.406
5 | Ideogram | Ideogram V3 | 0.397
6 | OpenAI | Dall-E 3 | 0.358
7 | Google Gemini | Nano Banana | 0.339
Recall
Measures how many of the correct images the model produced, out of all the correct images it could have generated.
Rank | Company | Model | Recall Score
1 | Stability AI | Stability SD Turbo | 0.533
2 | Luma Labs | Photon | 0.532
3 | Runway AI | Runway Gen 4 | 0.522
4 | Ideogram | Ideogram V3 | 0.497
5 | Black Forest Labs | Flux Pro | 0.495
6 | OpenAI | Dall-E 3 | 0.477
7 | Google Gemini | Nano Banana | 0.415
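Under the descriptions above, precision and recall reduce to simple ratios over per-image correctness judgments: precision divides the correct generations by everything the model produced, while recall divides them by everything it could have gotten right. The sketch below works from raw counts and is an illustration of those definitions, not DreamLayer's evaluator; the count names are assumptions.

```python
# Hedged sketch of precision and recall from per-image correctness counts.
# num_correct: generated images judged correct; num_generated: all generated images;
# num_possible_correct: how many correct images the model could have produced.
def precision(num_correct: int, num_generated: int) -> float:
    return num_correct / num_generated if num_generated else 0.0

def recall(num_correct: int, num_possible_correct: int) -> float:
    return num_correct / num_possible_correct if num_possible_correct else 0.0

# Example (illustrative numbers): 90 correct images out of 200 generated, with
# 180 achievable correct images, gives precision 0.45 and recall 0.50.
print(precision(90, 200), recall(90, 180))  # 0.45 0.5
```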