Perceptual Graphics Quality Evaluation Metrics 01

Published: January 13, 2026

(work in progress…)

Similar to the conventional graphics (image) quality metrics such as MSE, PSNR, and SSIM, the perceptual graphics quality metrics are also full-reference. However, most perceptual quality metrics depend on deep neural networks and are therefore sometimes called black-box approaches.
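For contrast with the learned metrics that follow, here is a minimal sketch of two conventional full-reference metrics, MSE and PSNR, computed directly from pixel values without any network. The flat pixel lists and the 8-bit peak value of 255 are assumptions chosen for illustration.

```python
import math

def mse(reference, test):
    """Mean squared error between two equally sized pixel sequences."""
    if len(reference) != len(test):
        raise ValueError("images must have the same number of pixels")
    return sum((r - t) ** 2 for r, t in zip(reference, test)) / len(reference)

def psnr(reference, test, max_value=255.0):
    """Peak signal-to-noise ratio in dB; infinite for identical inputs."""
    error = mse(reference, test)
    if error == 0:
        return math.inf
    return 10.0 * math.log10(max_value ** 2 / error)

# Toy 4-pixel grayscale "images" (hypothetical values, 8-bit range).
reference = [10, 20, 30, 40]
test = [12, 18, 30, 44]
print("MSE:", mse(reference, test))    # (4 + 4 + 0 + 16) / 4 = 6.0
print("PSNR (dB):", psnr(reference, test))
```

Both functions need the reference image, which is what "full-reference" means here; the perceptual metrics below keep that requirement but replace the per-pixel difference with a perceptual model.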

NVIDIA’s FLIP

FLIP is an excellent tool for visualizing and communicating errors in rendered images, both low dynamic range (LDR) and high dynamic range (HDR). Big thanks to NVlabs for making the tool publicly available with its source code. FLIP is a command line interface (CLI) tool. On Windows, follow these steps:

Clone: git clone --recursive https://github.com/NVlabs/flip.git, then cd flip
BUILD:
```
mkdir build
cd build
cmake ..
cmake --build .
```
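After this, flip.exe should be under build/Debug. In a similar way, separate Release and Debug builds can be set up. Release mode runs faster, as always. Note that multi-config generators such as Visual Studio ignore CMAKE_BUILD_TYPE and select the configuration at build time through --config.

// for Release
mkdir Release
cd Release
cmake -DCMAKE_BUILD_TYPE=Release ..
cmake --build . --config Release

// for Debug
mkdir Debug
cd Debug
cmake -DCMAKE_BUILD_TYPE=Debug ..
cmake --build . --config Debug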

RUNNING

// go to Release/Debug folder and run
.\flip.exe -r .\reference.png -t .\test.png
// similarly
.\flip.exe -r .\reference.exr -t .\test.exr
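When many image pairs need to be compared, the same call can be driven from Python. The following is a minimal sketch that only uses the `-r` and `-t` flags shown above; the helper names `build_flip_command` and `run_flip` are my own and not part of FLIP.

```python
import subprocess

def build_flip_command(flip_exe, reference, test):
    """Return the argument list for one FLIP comparison (-r reference, -t test)."""
    return [str(flip_exe), "-r", str(reference), "-t", str(test)]

def run_flip(flip_exe, reference, test):
    """Run FLIP once and return its console output; raises if the tool fails."""
    result = subprocess.run(
        build_flip_command(flip_exe, reference, test),
        capture_output=True,
        text=True,
        check=True,
    )
    return result.stdout

# Only builds and prints the command; call run_flip(...) to execute it.
print(build_flip_command("flip.exe", "reference.png", "test.png"))
```

Passing the arguments as a list avoids quoting problems when file paths contain spaces.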

To save the metrics (recommended), write them to a .csv file:

.\flip.exe -r .\reference.png -t .\test.png -c nameFile.csv
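The saved file can then be read back for further analysis. This is a minimal sketch that assumes the CSV starts with a header row whose column names match the table in the Discussion section (for example `Test`, `Mean`, and `Max`); check the header of the actual file before relying on it. The sample row is a simplified stand-in, not real tool output.

```python
import csv
import io

def read_flip_column(csv_file, column="Mean"):
    """Map each test image name to the value in `column`, parsed as float."""
    scores = {}
    for row in csv.DictReader(csv_file):
        scores[row["Test"]] = float(row[column])
    return scores

# Simplified stand-in for nameFile.csv; the header names are assumed.
sample = io.StringIO(
    "Reference,Test,Mean,Weighted median,Max\n"
    "reference.png,test.png,0.048645,0.117111,0.908822\n"
)
scores = read_flip_column(sample)
print(scores)
```

For a real file, open it with `open("nameFile.csv", newline="")` and pass the file object to `read_flip_column`.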

Discussion

The FLIP metric maps perceptual error into the [0.0, 1.0] range, and the official tool reports pooled values such as the mean, weighted median, quartiles, min, and max of that FLIP error map. A value of 0.0 means identical (no visible difference), and larger values (closer to 1.0) indicate a stronger perceived difference. The default output looks like this:

| Column                          | Use                                                                             |
| ------------------------------- | ------------------------------------------------------------------------------- |
| Reference                       | Reference image file.                                                           |
| Test                            | Test image file.                                                                |
| Mean, e.g., 0.048645            | Main scalar score; use this for comparison across rendering settings.           |
| Weighted median, e.g., 0.117111 | Secondary/legacy statistic; do not use as the main score.                       |
| 1st / 3rd weighted quartile     | Spread of the perceptual error distribution.                                    |
| Max, e.g., 0.908822             | Worst localized artifact; useful if VRS/SDF creates small but strong artifacts. |
| Min                             | Usually not very informative.                                                   |
| Evaluation time                 | In seconds.                                                                     |

The mean is the main score to look at: it summarizes the overall perceptual difference. Beyond the mean, the _max_ is the largest local perceptual error anywhere in the image, so it warns about localized visible artifacts, which can be significant in some contexts. It is useful for detecting strong localized artifacts, such as edge discontinuities, shading-rate block boundaries, missing fragments, or peripheral degradation artifacts.
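To see why the mean and the max answer different questions, here is a small sketch that pools a synthetic error map. The error values are made up, and the statistics are plain unweighted ones, so this does not reproduce FLIP's weighted median or weighted quartiles.

```python
import statistics

def pool_error_map(error_map):
    """Pool a flat list of per-pixel errors in [0, 1] into summary statistics."""
    q1, median, q3 = statistics.quantiles(error_map, n=4)
    return {
        "mean": statistics.fmean(error_map),
        "median": median,
        "q1": q1,
        "q3": q3,
        "min": min(error_map),
        "max": max(error_map),
    }

# 10,000 pixels with a low uniform error, then the same map with a
# 10-pixel strong artifact (hypothetical values).
clean = [0.02] * 10_000
artifact = [0.02] * 9_990 + [0.9] * 10

for name, error_map in (("clean", clean), ("artifact", artifact)):
    pooled = pool_error_map(error_map)
    print(name, "mean:", pooled["mean"], "max:", pooled["max"])
```

The artifact covers only 0.1% of the pixels, so the mean moves from 0.02 to about 0.0209, while the max jumps from 0.02 to 0.9. That is the case where the max is worth reporting next to the mean.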

Visually, the error map is color-coded: black means no perceived error, and yellow represents high error. For more details, see the Technical Blog and the FLIP publication.

There is also a UI version of the FLIP implementation, named FLOP. It can be found on git, together with a blog that explains every step clearly. However, I guess the repository is not well maintained (as of 2025), and there are several bugs that need to be resolved.

LPIPS

Learned Perceptual Image Patch Similarity (LPIPS) is a full-reference perceptual distance metric. The interpretation is: the lower the LPIPS score, the more perceptually similar the target is to the reference, and vice versa. Unlike FLIP, the LPIPS score is not hard-bounded to a fixed range, so the interpretation is different. The [-1, 1] range concerns the input: the script below rescales the image tensors to [-1, 1] before scoring.

Install

pip install lpips torch torchvision pillow

Use

Then a Python script like the following can be used:

import torch
import lpips
from PIL import Image
import torchvision.transforms as T

def load_image_as_tensor(path):
    img = Image.open(path).convert("RGB")

    transform = T.Compose([
        T.ToTensor(),                 # [0, 1], shape: C x H x W
        T.Lambda(lambda x: x * 2 - 1) # convert to [-1, 1]
    ])

    return transform(img).unsqueeze(0) # shape: 1 x 3 x H x W

# LPIPS model
loss_fn = lpips.LPIPS(net='alex')  # good default for perceptual comparison

# Load images
ref = load_image_as_tensor("reference.png")
test = load_image_as_tensor("target.png")

# Compute LPIPS
with torch.no_grad():
    score = loss_fn(ref, test)

print("LPIPS:", score.item())

ShiftTolerant-LPIPS (ST-LPIPS)

1. If pyiqa is already installed (pip install pyiqa), ST-LPIPS is included. Otherwise, upgrade it: python -m pip install --upgrade pyiqa

1.1. Install the standalone package and its dependencies: python -m pip install stlpips_pytorch torch torchvision pillow

2. Check that it is correctly installed through pyiqa:

python -c "import pyiqa; models=pyiqa.list_models(); print('ST-LPIPS:', 'stlpips' in models)"

If the output is True, everything is fine.

3. Install the remaining helper packages and verify the imports:

```
python -m pip install --upgrade pip
python -m pip install IPython
python -m pip install torchsummary
python -c "import IPython; import stlpips_pytorch; print('Imports successful')"
```

4. run `python -u .\stlpips_batch.py`

***
## R-LPIPS
Robust Learned Perceptual Image Patch Similarity (R-LPIPS) uses adversarially trained deep features (see the full [paper](https://arxiv.org/abs/2307.15157) and [git repo](https://github.com/SaraGhazanfari/R-LPIPS)).

Similar to LPIPS, R-LPIPS is a distance where lower is better, and it is not formally restricted to the range [0.0,1.0]. A score of 0 represents identical inputs, and larger values represent a greater predicted perceptual difference. Values above 1.0 are possible. A very small negative value is also theoretically possible because the learned linear weights are not explicitly constrained to be positive, although normal evaluations with the released checkpoint will generally produce nonnegative values. For example, an R-LPIPS score of __0.02429617__ indicates a relatively small feature-space perceptual difference between the target and the reference, but it does not mean __2.43% error__ or __97.57%__ quality. There is probably no universal interpretation table.

Unlike LPIPS, R-LPIPS does not provide a `pip install` option. One possible way to use it is:

#### Install
- `git clone --recursive https://github.com/SaraGhazanfari/R-LPIPS.git`
- install the required packages: `python -m pip install torch torchvision numpy scipy pillow tqdm`
- confirm which `lpips` module Python imports: `python -c "import lpips; print(lpips.__file__)"`
- check whether CUDA-enabled torch and torchvision are available: `python -c "import torch, torchvision; print('Torch:', torch.__version__); print('TorchVision:', torchvision.__version__); print('Compiled CUDA:', torch.version.cuda); print('CUDA available:', torch.cuda.is_available()); print('Device count:', torch.cuda.device_count())"`
- a custom inference script can help to evaluate one target/reference pair (TODO: rlpips_test.py)
- run `python -u .\rlpips_test.py --device cuda:0` with the reference and target paths set inside the Python script, or pass them on the command line: `python -u .\rlpips_test.py --target .\test_img\breakfast_fov_1.png --reference .\test_img\breakfast_uni_1.png --checkpoint .\checkpoints\latest_net_linf_x0.pth --device cuda:0`

## Graphics-LPIPS
- `git clone --recursive https://github.com/MEPP-team/Graphics-LPIPS.git`
- `cd Graphics-LPIPS` -> `code .` -> open terminal and create virtual environment `python -m venv .venv` -> `.\.venv\Scripts\Activate.ps1`
- Now, packages installation:
  - `python -m pip install --upgrade pip setuptools wheel`
  - `python -m pip install statsmodels`
  - `python -m pip install torch==2.13.0 torchvision==0.28.0 --index-url https://download.pytorch.org/whl/cu126`
  - `python -m pip install numpy scipy scikit-image opencv-python matplotlib tqdm jupyter pillow`
- check that the GPU installation (as on my machine) is successful with `python -c "import torch, torchvision, PIL; print('PyTorch:', torch.__version__); print('CUDA available:', torch.cuda.is_available()); print('GPU:', torch.cuda.get_device_name(0) if torch.cuda.is_available() else 'CPU')"`. My output:

PyTorch: 2.13.0+cu126
CUDA available: True
GPU: NVIDIA GeForce RTX 3090

- now, run `python -u graphics_lpips_batch.py`
  

***
## DISTS
- Deep Image Structure and Texture Similarity (DISTS) metric.
- The full paper can be found [here](https://ieeexplore.ieee.org/document/9298952), and the public git repository is [DISTS](https://github.com/dingkeyan93/DISTS).
- DISTS scores lie in the range __[0,1]__, where values closer to 0 indicate higher perceptual similarity to the reference.


***
## [TOPIQ](https://github.com/chaofengc/iqa-pytorch)
TOPIQ (A Top-Down Approach From Semantics to Distortions for Image Quality Assessment) reports scores in the range __[0.0,1.0]__. Unlike the previous metrics (FLIP, LPIPS, DISTS), a higher TOPIQ score represents better perceptual quality.
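Because the score direction differs between metrics, it is easy to rank results the wrong way round. A minimal sketch of a direction-aware ranking follows; the direction table reflects what is stated in this post, and the candidate names and scores are hypothetical.

```python
# Score direction as described in this post (True = lower is better).
LOWER_IS_BETTER = {
    "flip": True,
    "lpips": True,
    "dists": True,
    "topiq_fr": False,
}

def rank_candidates(metric, scores):
    """Return candidate names ordered from best to worst for `metric`."""
    reverse = not LOWER_IS_BETTER[metric]
    return sorted(scores, key=scores.get, reverse=reverse)

# Hypothetical scores for three rendering settings.
lpips_scores = {"vrs_2x2": 0.08, "vrs_4x4": 0.21, "full_rate": 0.01}
topiq_scores = {"vrs_2x2": 0.83, "vrs_4x4": 0.61, "full_rate": 0.97}

print(rank_candidates("lpips", lpips_scores))
print(rank_candidates("topiq_fr", topiq_scores))
```

With these made-up numbers both metrics agree on the order, even though the best LPIPS score is the smallest and the best TOPIQ score is the largest.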

### Install

pip install pyiqa

// or upgrade
pip install --upgrade pyiqa

Then run the command `python -c "import pyiqa; print('topiq_fr' in pyiqa.list_models())"`. If the output is __True__, the installation succeeded; otherwise, there is a problem.

### Run
- `python -u .\topiq_sample_code.py`
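The sample script itself is not listed here, so the following is only a sketch of what such a script might contain, not the actual `topiq_sample_code.py`. It assumes the usual `pyiqa` pattern of `pyiqa.create_metric(name)` returning a callable that accepts image paths and exposes a `lower_better` attribute. The `create_metric` parameter is my own addition so that the function can be tried without downloading model weights.

```python
def score_pair(metric_name, test_path, ref_path, create_metric=None):
    """Score one test/reference pair and report the metric's direction.

    `create_metric` defaults to pyiqa.create_metric; it is a parameter only so
    the function can be exercised without loading a real model.
    """
    if create_metric is None:
        import pyiqa  # imported lazily; requires `pip install pyiqa`
        create_metric = pyiqa.create_metric
    metric = create_metric(metric_name)
    # Assumed pyiqa argument order for full-reference metrics:
    # the distorted (test) image first, then the reference image.
    score = metric(test_path, ref_path)
    return float(score), bool(metric.lower_better)
```

A call such as `score_pair("topiq_fr", "test.png", "reference.png")` would then return the score together with a flag that says whether lower is better. For TOPIQ that flag is expected to be false.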

## PieAPP

1. If `pyiqa` is already installed (`pip install pyiqa`), then __PieAPP__ is included. Otherwise, upgrade it: `python -m pip install --upgrade pyiqa`

1.1. `python -m pip install --upgrade pyiqa torch torchvision pillow`

2. Check that it is correctly installed through `pyiqa`:

python -c "import pyiqa; models=pyiqa.list_models(); print('PieAPP:', 'pieapp' in models)"

If the output is _True_, everything is fine.


Lower is better. A score near 0 means low perceptual error. PieAPP does not have a strict universal upper bound.

***
## MILO
__MILO: A Lightweight Perceptual Quality Metric for Image and Latent-Space Optimization__ is a lightweight multiscale perceptual metric that outputs both a global score and a spatial distortion map. The public implementation uses an error scale where _0.0_ denotes perceptually identical images and values toward _1.0_ indicate stronger disruption. This would be my first recommendation for static molecular-rendering images.

1. clone the repository

git clone https://github.com/ugurcogalan06/MILO.git
cd MILO

2. if CUDA-enabled PyTorch is already installed and working, install the remaining dependencies

python -m pip install torchvision pillow numpy

3. the repository currently specifies CUDA-enabled PyTorch and provides this installation example

python -m pip install torch torchvision --index-url https://download.pytorch.org/whl/cu118
python -m pip install pillow numpy

4. create a "batch script" if required

5. run

python -u .\sample_milo_batch_code.py
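A batch script for any of the metrics in this post follows roughly the same pattern: pair each reference image with the test image of the same name, score the pair, and write one CSV row per pair. Here is a minimal sketch of that skeleton. The `metric` argument is a hypothetical callable that wraps whichever model is used; this is not MILO's own API.

```python
import csv
from pathlib import Path

def pair_images(reference_dir, test_dir, pattern="*.png"):
    """Pair files that share a file name in the reference and test folders."""
    test_dir = Path(test_dir)
    pairs = []
    for ref in sorted(Path(reference_dir).glob(pattern)):
        candidate = test_dir / ref.name
        if candidate.is_file():
            pairs.append((ref, candidate))
    return pairs

def run_batch(pairs, metric, out_csv):
    """Apply `metric(reference_path, test_path)` to each pair and save a CSV."""
    with open(out_csv, "w", newline="") as handle:
        writer = csv.writer(handle)
        writer.writerow(["reference", "test", "score"])
        for ref, test in pairs:
            writer.writerow([ref.name, test.name, metric(ref, test)])
```

Keeping the metric behind a plain callable means the same skeleton can be reused for the other batch scripts mentioned above by swapping a single function.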

For more details, see the project page and the git repo.


Bipul Mohanto
