Reproducible Diffusion Model Benchmarks
In Hours, Not Weeks
Automate prompt pack setup, seed control, and metric scoring so you can run experiments faster, publish sooner, and stay ahead ✨
Take the Pain Out of Diffusion Benchmarking
Compatible with all major image generation APIs and open-source models
Leaderboard Overview
200 Prompts Benchmarked in 45 Minutes per Model
See how top diffusion models compare using reproducible metrics like CLIP Score, FID, and Composition Correctness. DreamLayer’s automated pipeline handled generation, scoring, and output aggregation.

To add your model to the leaderboard, contact us.
Methodology:
Microsoft's COCO dataset
Evaluated on the CIFAR training set
Same prompts, seeds, and configs for every model
Released in September 2025
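The "same prompts, seeds, and configs" requirement is what makes these runs reproducible: every model sees an identical prompt pack, an identical seed schedule, and an identical sampler configuration. Below is a minimal sketch of such a generation loop using Hugging Face diffusers; the model ID, prompt-pack file, and output layout are illustrative assumptions, not DreamLayer's actual pipeline.

```python
# Minimal sketch of a reproducible generation loop (assumption: DreamLayer's
# real pipeline differs; this only illustrates fixed prompts, seeds, and configs).
import json
import os

import torch
from diffusers import DiffusionPipeline

PROMPT_PACK = "prompts/coco_200.json"   # hypothetical prompt-pack file
SEED = 42                               # fixed base seed shared by every model run
CONFIG = {"num_inference_steps": 30, "guidance_scale": 7.5}  # frozen sampler config

pipe = DiffusionPipeline.from_pretrained(
    "stabilityai/stable-diffusion-2-1", torch_dtype=torch.float16
).to("cuda")

with open(PROMPT_PACK) as f:
    prompts = json.load(f)              # list of prompt strings

os.makedirs("outputs/sd21", exist_ok=True)
for i, prompt in enumerate(prompts):
    # Re-seeding per prompt keeps each image independent of generation order.
    generator = torch.Generator(device="cuda").manual_seed(SEED + i)
    image = pipe(prompt, generator=generator, **CONFIG).images[0]
    image.save(f"outputs/sd21/{i:04d}.png")
```

Running the same loop with a different model ID but the same prompt pack, seeds, and config is what makes the per-model scores below comparable.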
CLIP Score
Measures how closely a generated image matches its text prompt; higher is better.
Rank | Company | Model | CLIP Score
1 | Luma Labs | Photon | 0.265
2 | Black Forest Labs | Flux Pro | 0.263
3 | OpenAI | Dall-E 3 | 0.259
4 | Google Gemini | Nano Banana | 0.258
5 | Runway AI | Runway Gen 4 | 0.2505
6 | Ideogram | Ideogram V3 | 0.2501
7 | Stability AI | Stability SD Turbo | 0.249
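CLIP Score is typically computed as the cosine similarity between CLIP embeddings of the prompt and of the generated image, averaged over the prompt pack. Here is a minimal sketch using the openai/clip-vit-base-patch32 checkpoint via Hugging Face transformers; the checkpoint choice and preprocessing are assumptions for illustration, not necessarily the exact setup behind the numbers above.

```python
# Hedged sketch: prompt-image CLIP similarity with Hugging Face transformers.
# The CLIP checkpoint and averaging scheme are assumptions, not DreamLayer's exact setup.
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

def clip_score(image_path: str, prompt: str) -> float:
    """Cosine similarity between the image and prompt embeddings (roughly 0-1)."""
    inputs = processor(text=[prompt], images=Image.open(image_path),
                       return_tensors="pt", padding=True)
    with torch.no_grad():
        out = model(**inputs)
    img = out.image_embeds / out.image_embeds.norm(dim=-1, keepdim=True)
    txt = out.text_embeds / out.text_embeds.norm(dim=-1, keepdim=True)
    return (img * txt).sum(dim=-1).item()

# A benchmark-level CLIP Score is the mean over all prompt/image pairs.
```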
FID Score
Assesses how close AI-generated images are to real images; lower is better.
Rank | Company | Model | FID Score
1 | Ideogram | Ideogram V3 | 305.60
2 | OpenAI | Dall-E 3 | 306.08
3 | Runway AI | Runway Gen 4 | 317.52
4 | Luma Labs | Photon | 318.55
5 | Black Forest Labs | Flux Pro | 318.63
6 | Google Gemini | Nano Banana | 318.80
7 | Stability AI | Stability SD Turbo | 321.75
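FID compares the distribution of Inception-v3 features from generated images against features from a real reference image set. A minimal sketch of the final distance computation, assuming the 2048-dimensional feature arrays have already been extracted, is shown below; feature extraction and the choice of reference set are left out.

```python
# Hedged sketch of the Fréchet Inception Distance given pre-extracted
# Inception-v3 features; feature extraction itself is omitted.
import numpy as np
from scipy import linalg

def fid(real_feats: np.ndarray, gen_feats: np.ndarray) -> float:
    """real_feats, gen_feats: arrays of shape (num_images, 2048)."""
    mu_r, mu_g = real_feats.mean(axis=0), gen_feats.mean(axis=0)
    cov_r = np.cov(real_feats, rowvar=False)
    cov_g = np.cov(gen_feats, rowvar=False)
    # Matrix square root of the product of the two covariance matrices.
    covmean, _ = linalg.sqrtm(cov_r @ cov_g, disp=False)
    if np.iscomplexobj(covmean):
        covmean = covmean.real
    diff = mu_r - mu_g
    return float(diff @ diff + np.trace(cov_r + cov_g - 2.0 * covmean))
```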
F1 Score
Combines precision and recall to show overall image accuracy.
Rank | Company | Model | F1 Score
1 | Luma Labs | Photon | 0.463
2 | Stability AI | Stability SD Turbo | 0.447
3 | Runway AI | Runway Gen 4 | 0.445
4 | Black Forest Labs | Flux Pro | 0.421
5 | Ideogram | Ideogram V3 | 0.415
6 | OpenAI | Dall-E 3 | 0.380
7 | Google Gemini | Nano Banana | 0.351
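F1 is the harmonic mean of precision and recall, so a model only scores well when both are reasonably high. The helper below shows the textbook formula with made-up example numbers; how the leaderboard aggregates F1 across images is not specified here, so treat this as the standard definition rather than DreamLayer's exact scoring code.

```python
# Textbook F1: the harmonic mean of precision and recall.
def f1_score(precision: float, recall: float) -> float:
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)

# Example (illustrative numbers): precision 0.40 and recall 0.50 give F1 ≈ 0.444.
print(f1_score(0.40, 0.50))  # 0.4444...
```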
Precision
Measures how many of the generated images came out correct, out of the total number of images the model generated.
Rank | Company | Model | Precision Score
1 | Luma Labs | Photon | 0.448
2 | Stability AI | Stability SD Turbo | 0.432
3 | Runway AI | Runway Gen 4 | 0.423
4 | Black Forest Labs | Flux Pro | 0.406
5 | Ideogram | Ideogram V3 | 0.397
6 | OpenAI | Dall-E 3 | 0.358
7 | Google Gemini | Nano Banana | 0.339
Recall
Measures how many of the correct images the model produced, out of all the correct images it could have generated.
Rank | Company | Model | Recall Score
1 | Stability AI | Stability SD Turbo | 0.533
2 | Luma Labs | Photon | 0.532
3 | Runway AI | Runway Gen 4 | 0.522
4 | Ideogram | Ideogram V3 | 0.497
5 | Black Forest Labs | Flux Pro | 0.495
6 | OpenAI | Dall-E 3 | 0.477
7 | Google Gemini | Nano Banana | 0.415
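Under the descriptions above, precision and recall reduce to simple ratios over per-image correctness judgments: precision divides the correct generations by everything the model produced, while recall divides them by everything it could have gotten right. The sketch below works from raw counts and is an illustration of those definitions, not DreamLayer's evaluator; the count names are assumptions.

```python
# Hedged sketch of precision and recall from per-image correctness counts.
# num_correct: generated images judged correct; num_generated: all generated images;
# num_possible_correct: how many correct images the model could have produced.
def precision(num_correct: int, num_generated: int) -> float:
    return num_correct / num_generated if num_generated else 0.0

def recall(num_correct: int, num_possible_correct: int) -> float:
    return num_correct / num_possible_correct if num_possible_correct else 0.0

# Example (illustrative numbers): 90 correct images out of 200 generated, with
# 180 achievable correct images, gives precision 0.45 and recall 0.50.
print(precision(90, 200), recall(90, 180))  # 0.45 0.5
```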