Perceptual Graphics Quality Evaluation Metrics
Published:
(work in progress…)
Similar to the conventional graphics (image) quality metrics such as MSE, PSNR, and SSIM, the perceptual graphics quality metrics are also full-reference. However, the perceptual quality metrics mostly depend on deep neural networks. Consequently, they are sometimes also called black-box approaches.
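To make "full-reference" concrete: every metric here takes both a reference and a test image. Here is a minimal sketch of the two simplest conventional metrics, MSE and PSNR, on flat lists of 8-bit pixel values. The function names and the peak value of 255 are assumptions chosen for illustration; real pipelines would normally work on image arrays.

```python
import math


def mse(reference, test):
    """Mean squared error between two equally sized pixel sequences."""
    if len(reference) != len(test) or not reference:
        raise ValueError("reference and test must be non-empty and equally long")
    return sum((r - t) ** 2 for r, t in zip(reference, test)) / len(reference)


def psnr(reference, test, peak=255.0):
    """Peak signal-to-noise ratio in dB; infinite for identical inputs."""
    error = mse(reference, test)
    if error == 0:
        return math.inf
    return 10.0 * math.log10(peak ** 2 / error)
```

Note the opposite directions: a lower MSE means a closer match, while a higher PSNR does.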
1. NVIDIA’s FLIP
FLIP is an excellent tool for visualizing and communicating errors in rendered images, for both low dynamic range (LDR) and high dynamic range (HDR) content. Big thanks to NVlabs for making the tool publicly available with the source code. FLIP is a command line interface (CLI) tool. On Windows, follow these steps:
- Clone:
git clone --recursive https://github.com/NVlabs/flip.git
cd flip
- Build:
mkdir build
cd build
cmake ..
cmake --build .
After this, flip.exe should be under build/Debug. In a similar way, the build can be split into separate Debug and Release versions. The Release build runs faster.
# for Release (run from the repository root)
mkdir Release
cd Release
cmake -DCMAKE_BUILD_TYPE=Release ..
cmake --build . --config Release
cd ..
# for Debug
mkdir Debug
cd Debug
cmake -DCMAKE_BUILD_TYPE=Debug ..
cmake --build . --config Debug
- Run:
# go to the Release or Debug folder and run
.\flip.exe -r .\reference.png -t .\test.png
# similarly, for .exr images
.\flip.exe -r .\reference.exr -t .\test.exr
To save the metrics (recommended), write them to a .csv file:
.\flip.exe -r .\reference.png -t .\test.png -c nameFile.csv
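When many image pairs need to be compared, the same command can be driven from Python. The following is a minimal sketch, not part of the FLIP distribution: the helper names `build_flip_command` and `run_flip` are hypothetical, and only the `-r`, `-t`, and `-c` options shown above are used. How the tool treats an already existing CSV file is not covered here.

```python
import subprocess


def build_flip_command(flip_exe, reference, test, csv_path=None):
    """Return the argument list for one FLIP comparison."""
    cmd = [str(flip_exe), "-r", str(reference), "-t", str(test)]
    if csv_path is not None:
        # -c writes the pooled statistics to a CSV file
        cmd += ["-c", str(csv_path)]
    return cmd


def run_flip(flip_exe, reference, test, csv_path=None):
    """Run FLIP once; raises CalledProcessError on a non-zero exit code."""
    cmd = build_flip_command(flip_exe, reference, test, csv_path)
    return subprocess.run(cmd, check=True, capture_output=True, text=True)
```

For example, `run_flip(r".\flip.exe", "reference.png", "test.png", "nameFile.csv")` would launch the same comparison as the command above.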
Discussion
The FLIP metric maps perceptual error into the [0.0, 1.0] range, and the official tool reports pooled values such as mean, weighted median, quartiles, min, and max from that FLIP error map. 0.0 means identical (no visible difference), and larger values (closer to 1.0) indicate a stronger perceived difference. The default output contains the following columns:
| Column | Use |
| --- | --- |
| Reference | |
| Test | |
| Mean, e.g., 0.048645 | Main scalar score; use this for comparison across rendering settings. |
| Weighted median, e.g., 0.117111 | Secondary/legacy statistic; do not use as the main score. |
| 1st / 3rd weighted quartile | Spread of the perceptual error distribution. |
| Max, e.g., 0.908822 | Worst localized artifact; useful if VRS/SDF creates small but strong artifacts. |
| Min | Usually not very informative. |
| Evaluation time | In seconds. |
The mean is the score we are looking for: it gives the overall perceptual difference. Besides the mean, the max represents the largest local perceptual error anywhere in the image, so it warns about strong localized artifacts such as edge discontinuities, shading-rate block boundaries, missing fragments, or peripheral degradation artifacts. In some contexts, this can be significant.
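Reading the mean and the max together can be automated once the results are saved as CSV. The sketch below assumes the CSV header uses the column names `Test`, `Mean`, and `Max`, as in the table above; check them against an actual output file. The 0.9 threshold for flagging a localized artifact is an arbitrary, hypothetical value.

```python
import csv
import io


def summarize_flip_csv(csv_text, max_threshold=0.9):
    """Parse FLIP CSV text and return rows sorted by mean error (lowest first)."""
    rows = []
    reader = csv.DictReader(io.StringIO(csv_text), skipinitialspace=True)
    for record in reader:
        worst = float(record["Max"])
        rows.append({
            "test": record["Test"],
            "mean": float(record["Mean"]),
            "max": worst,
            # a high max with a low mean points to a small but strong artifact
            "localized_artifact": worst >= max_threshold,
        })
    return sorted(rows, key=lambda row: row["mean"])
```

The first entry of the returned list is the test image closest to the reference on average; the `localized_artifact` flag marks images that still deserve a look at the error map.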
Visually, the error map is color-coded: black means no perceived error, and yellow represents high error. For more details, see the Technical Blog and the FLIP publication.
There is also a UI version of the FLIP implementation, named FLOP. It is available in a Git repository, together with a blog post that explains every step clearly. However, the repository does not seem to be well maintained (as of 2025), and there are several bugs that need to be resolved.
2. LPIPS
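LPIPS is also a full-reference perceptual distance metric. A lower LPIPS score means the target is perceptually closer to the reference, and vice versa. Unlike FLIP, LPIPS is not strictly range-bounded: the score is 0 for identical images and grows with the perceived difference, without a fixed upper limit. The [-1, 1] range applies to the input tensors, which the lpips package expects to be normalized that way, not to the score.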
Install
pip install lpips torch torchvision pillow
Use
Then you can use a Python script such as:
import torch
import lpips
from PIL import Image
import torchvision.transforms as T


def load_image_as_tensor(path):
    img = Image.open(path).convert("RGB")
    transform = T.Compose([
        T.ToTensor(),                  # [0, 1], shape: C x H x W
        T.Lambda(lambda x: x * 2 - 1)  # convert to [-1, 1]
    ])
    return transform(img).unsqueeze(0)  # shape: 1 x 3 x H x W


# LPIPS model
loss_fn = lpips.LPIPS(net='alex')  # good default for perceptual comparison

# Load images (both must have the same resolution)
ref = load_image_as_tensor("reference.png")
test = load_image_as_tensor("target.png")

# Compute LPIPS
with torch.no_grad():
    score = loss_fn(ref, test)

print("LPIPS:", score.item())
3. Deep Image Structure and Texture Similarity Metric (DISTS)
The full paper can be found here. DISTS scores lie in the range [0, 1], where values closer to 0 indicate higher similarity to the reference.
Shift-Tolerant LPIPS (ST-LPIPS)
4. TOPIQ
TOPIQ: A Top-Down Approach From Semantics to Distortions for Image Quality Assessment. It uses the range [0.0, 1.0]. Unlike the previous metrics (FLIP, LPIPS, DISTS), a higher TOPIQ score represents better perceptual quality.
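Because the direction flips between metrics, it is easy to pick the wrong "best" image when comparing several candidates. A minimal sketch of a direction-aware selection follows; the table only encodes the directions stated in this post, and the names `LOWER_IS_BETTER` and `best_candidate` are hypothetical.

```python
# True means a lower score is better, as stated for each metric in this post.
LOWER_IS_BETTER = {
    "FLIP": True,
    "LPIPS": True,
    "DISTS": True,
    "TOPIQ": False,
}


def best_candidate(metric, scores):
    """Return the candidate name with the best score for the given metric.

    `scores` maps a candidate name to its score against the same reference.
    """
    pick = min if LOWER_IS_BETTER[metric] else max
    return pick(scores, key=scores.get)
```

For example, `best_candidate("TOPIQ", {"a": 0.8, "b": 0.6})` selects the higher score, while the same call with `"FLIP"` would select the lower one.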
Install
pip install pyiqa
# or upgrade
pip install --upgrade pyiqa
Then run the command python -c "import pyiqa; print('topiq_fr' in pyiqa.list_models())". If the output is True, the installation succeeded; otherwise, there is a problem.
Run
python -u .\topiq_sample_code.py
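The sample script itself is not listed here. As a rough idea of what such a batch script could look like, here is a sketch (not the actual topiq_sample_code.py). It assumes reference and test images share file names across two folders and uses `pyiqa.create_metric` with the `topiq_fr` model name checked above. The helper names and the folder layout are hypothetical.

```python
from pathlib import Path


def pair_images(ref_dir, test_dir, pattern="*.png"):
    """Pair reference and test images that share the same file name."""
    ref_dir, test_dir = Path(ref_dir), Path(test_dir)
    pairs = []
    for ref_path in sorted(ref_dir.glob(pattern)):
        test_path = test_dir / ref_path.name
        if test_path.is_file():
            pairs.append((ref_path, test_path))
    return pairs


def score_pairs(pairs, metric_name="topiq_fr"):
    """Score (reference, test) pairs with a pyiqa full-reference metric."""
    # Imported lazily so pair_images can be used without pyiqa installed.
    import pyiqa
    import torch

    device = "cuda" if torch.cuda.is_available() else "cpu"
    metric = pyiqa.create_metric(metric_name, device=device)
    results = {}
    for ref_path, test_path in pairs:
        # pyiqa full-reference metrics take the distorted image first,
        # then the reference
        score = metric(str(test_path), str(ref_path))
        results[test_path.name] = score.item()
    return results
```

A call such as `score_pairs(pair_images("ref", "test"))` would return one score per test image. The first call may download the pretrained weights.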
ST-LPIPS
1. If you ran `pip install pyiqa`, then __ST-LPIPS__ is already installed. Otherwise, upgrade with `python -m pip install --upgrade pyiqa`
1.1. `python -m pip install stlpips_pytorch torch torchvision pillow`
2. Check that it is correctly installed through `pyiqa`
```
python -c "import pyiqa; models=pyiqa.list_models(); print('ST-LPIPS:', 'stlpips' in models)"
```
If the output is _True_, everything is fine.
3. Install the additional packages and verify that the imports work
```
python -m pip install --upgrade pip
python -m pip install IPython
python -m pip install torchsummary
python -c "import IPython; import stlpips_pytorch; print('Imports successful')"
```
4. Run `python -u .\stlpips_batch.py`
## PieAPP
1. If you ran `pip install pyiqa`, then __PieAPP__ is already installed. Otherwise, upgrade with `python -m pip install --upgrade pyiqa`
1.1. `python -m pip install --upgrade pyiqa torch torchvision pillow`
2. Check that it is correctly installed through `pyiqa`
```
python -c "import pyiqa; models=pyiqa.list_models(); print('PieAPP:', 'pieapp' in models)"
```
If the output is _True_, everything is fine.
Lower is better. Near 0 means low perceptual error. It does not have a strict universal upper bound.
***
## 5. MILO
__MILO: A Lightweight Perceptual Quality Metric for Image and Latent-Space Optimization__ is a lightweight multiscale perceptual metric that outputs both a global score and a spatial distortion map. The public implementation uses an error scale where _0.0_ denotes perceptually identical images and values toward _1.0_ indicate stronger disruption. This would be my first recommendation for static molecular-rendering images.
1. Clone the repository
```
git clone https://github.com/ugurcogalan06/MILO.git
cd MILO
```
2. If CUDA-enabled PyTorch is already installed and working, install only the remaining dependencies
```
python -m pip install torchvision pillow numpy
```
3. Otherwise, note that the repository currently specifies CUDA-enabled PyTorch and provides this installation example
```
python -m pip install torch torchvision --index-url https://download.pytorch.org/whl/cu118
python -m pip install pillow numpy
```
4. Create a "batch script" if required
5. Run
```
python -u .\sample_milo_batch_code.py
```
For more details, see the project page and the Git repository.
