Pairwise Comparison for Vendor Assessment & Risk Rating

A photo of two apples, indicating a useful pairwise comparison

Fluvial now includes Pairwise Comparison scoring for vendor assessment and risk rating workflows. This feature combines large language model (LLM) automation with structured human oversight, delivering AI efficiency whilst maintaining the auditability and accuracy that high-stakes assessments demand.

The Problem: AI Efficiency Without AI Opacity

LLMs can process vendor questionnaire responses at scale, extracting insights and identifying patterns across thousands of data points. However, directly using LLM-generated scores for critical assessments presents two problems:

Accuracy: LLMs can misinterpret nuanced responses, particularly those involving domain-specific context, implicit trade-offs, or subtle indicators of risk. A plausible-sounding score may miss critical details that specialist humans would catch.

Auditability: When an LLM assigns a score of 6.5/10, the reasoning is opaque. Regulators, clients, and internal stakeholders cannot trace the decision path. For compliance-sensitive assessments, this opacity is unacceptable.

The traditional alternative—asking humans to assign subjective scores using 1–10 scales—solves the AI problem but introduces inconsistency. Different evaluators interpret scales differently. Cognitive biases creep in.

The Solution: LLM Pairwise Analysis + Human Seeding

Fluvial’s approach applies pairwise comparison at two levels—first for AI analysis, then for human validation:

Human Seeding: Evaluators begin by identifying the best and worst vendor responses from the full set. These anchors establish the boundaries and calibrate the LLM’s understanding of quality for this specific criterion.

LLM Pairwise Analysis: The system instructs the LLM to perform pairwise comparisons across all vendor responses, explicitly logging its reasoning for each preference. Rather than asking the LLM to assign absolute scores—which encourages vague, uncalibrated outputs—the pairwise approach forces the LLM to articulate specific, comparative judgements: “Vendor A’s response demonstrates concrete incident response procedures, whilst Vendor B’s remains abstract.”

This comparative reasoning is more accurate than direct scoring. It’s also auditable—each LLM preference includes documented rationale that can be reviewed by humans or challenged by stakeholders.
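To make the idea concrete, a pairwise prompt might look something like the sketch below; the helper name, prompt wording, and output format are illustrative assumptions rather than Fluvial's actual prompts.

```python
# Illustrative sketch of a pairwise comparison prompt. The wording, the use of
# the human-seeded anchors for calibration, and the output format are
# assumptions for illustration, not Fluvial's actual prompt.

def build_pairwise_prompt(criterion: str, best_anchor: str, worst_anchor: str,
                          response_a: str, response_b: str) -> str:
    """Ask the LLM to pick the stronger response and state its reasoning."""
    return f"""You are assessing vendor responses against this criterion:
{criterion}

For calibration, a human evaluator judged this response the BEST in the set:
{best_anchor}

...and this one the WORST:
{worst_anchor}

Compare the two responses below. State which is stronger (A, B, or TIE) and
give specific, comparative reasoning that refers to concrete details.

Response A:
{response_a}

Response B:
{response_b}

Answer in the form:
WINNER: <A|B|TIE>
REASONING: <your comparative judgement>"""
```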

Selective Human Validation: The system identifies ambiguous or contested comparisons—cases where the LLM’s reasoning seems weak, where vendors are closely ranked, or where responses diverge significantly from the human-seeded anchors. These cases are presented to human evaluators for direct comparison.
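The post doesn't spell out exactly how ambiguous comparisons are detected; as a minimal sketch, one could flag pairs where the LLM declared a tie or where provisional ratings sit close together:

```python
# Hypothetical heuristic for routing a comparison to human validation.
# The rating-gap threshold and the tie check are illustrative assumptions.

def needs_human_review(rating_a: float, rating_b: float,
                       llm_verdict: str, gap_threshold: float = 50.0) -> bool:
    """Flag the pair when the LLM called a tie or the vendors are closely ranked."""
    return llm_verdict == "TIE" or abs(rating_a - rating_b) < gap_threshold
```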

Algorithmic Integration: An Elo-based ranking algorithm integrates both LLM and human comparisons into consistent, continuous scores. Human decisions override LLM preferences where conflicts arise. For a typical assessment of 10 vendors, the LLM handles the bulk of comparisons whilst humans validate 10–15 critical pairs, resulting in stable rankings that combine AI efficiency with human oversight.

Every final score traces to a combination of LLM reasoning (logged and reviewable) and human decisions (timestamped and attributed). The audit trail is complete.
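A minimal sketch of the precedence rule, assuming comparison outcomes are keyed by vendor pair (the data layout is an assumption, not the production design):

```python
# Sketch of merging LLM and human comparison outcomes before rating.
# Outcomes are encoded as 1.0 (first vendor preferred), 0.0 (second preferred)
# or 0.5 (tie). Dictionary-based storage is an assumption for illustration.

def merge_comparisons(llm_outcomes: dict[tuple[str, str], float],
                      human_outcomes: dict[tuple[str, str], float]
                      ) -> dict[tuple[str, str], float]:
    """Human decisions override LLM preferences wherever both exist for a pair."""
    merged = dict(llm_outcomes)
    merged.update(human_outcomes)  # human validation takes precedence
    return merged
```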

flowchart TD
    A[Vendor Responses] --> B[Human: Find Best & Worst]
    B --> C[LLM: Compare All Pairs]
    C --> D[Log Reasoning for Each]
    D --> E{Comparison Ambiguous?}
    E -->|Yes| F[Human Validation]
    E -->|No| G[Accept LLM Decision]
    F --> H[Elo Algorithm]
    G --> H
    H --> I[Normalised Score 1-10]
    
    style B fill:#e1f5ff
    style F fill:#e1f5ff
    style C fill:#fff4e1
    style D fill:#fff4e1
    style H fill:#e8f5e9

How It Works

Identify Tricky Criteria: Administrators flag questions that require pairwise comparison—typically those involving strategic fit, cultural alignment, or complex trade-offs that resist simple rule-based automation.

Seed with Human Judgement: The evaluator reviews all responses and identifies the best and worst examples. These anchors calibrate the assessment.

LLM Comparative Analysis: The system instructs the LLM to compare response pairs, logging specific reasoning for each preference. The LLM processes all pairwise combinations, building initial rankings.

Human Validation: The system surfaces ambiguous or contested comparisons for human review. The evaluator confirms, overrides, or refines the LLM’s judgements on critical pairs.

Generate Scores: The Elo algorithm integrates LLM and human comparisons, with human decisions taking precedence. Ratings converge quickly as the system combines AI breadth with human depth.

Normalise and Integrate: Final Elo ratings are normalised to a 1–10 scale and integrated into the overall assessment. The pairwise component receives its own weight, allowing organisations to balance automated scoring with expert judgement.
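As a rough sketch of that final blend, assuming a simple weighted average over criteria (the weights and criterion names are hypothetical):

```python
# Hypothetical weighted blend of the pairwise score with other criterion
# scores, all on a 1-10 scale. Weights and names are illustrative only.

def overall_score(criterion_scores: dict[str, float],
                  weights: dict[str, float]) -> float:
    """Weighted average of per-criterion scores."""
    total_weight = sum(weights[name] for name in criterion_scores)
    return sum(score * weights[name]
               for name, score in criterion_scores.items()) / total_weight

# Example: a 40% weight on the pairwise component
# overall_score({"security": 8.0, "pairwise": 7.2},
#               {"security": 0.6, "pairwise": 0.4})  # -> 7.68
```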

Mathematical Foundation

The system assumes transitivity: if Vendor A outperforms Vendor B, and B outperforms C, then A should outperform C. This assumption—reasonable for consistent evaluation criteria—allows the algorithm to infer broader rankings from limited comparisons.

The Elo rating update follows the standard formulation:

$$R_{new} = R_{old} + K(S - E)$$

Where $S$ is the actual outcome (1 for win, 0 for loss, 0.5 for draw) and $E$ is the expected outcome based on current ratings. This approach naturally handles ties and produces continuous scores rather than discrete ranks.

Final scores are normalised linearly:

$$S_{pairwise} = 1 + (10 - 1) \times \frac{R_i - R_{min}}{R_{max} - R_{min}}$$
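In code, the standard Elo update and the linear normalisation above look roughly as follows; K = 32 and the 400-point logistic expected-score formula are the conventional Elo defaults, and the exact constants used in production are not stated here.

```python
# Standard Elo update and the linear 1-10 normalisation described above.
# K = 32 and the 400-point logistic scale are conventional Elo defaults;
# the constants actually used in production are not stated in this post.

def expected_score(r_a: float, r_b: float) -> float:
    """Expected outcome E for the first rating under the logistic Elo model."""
    return 1.0 / (1.0 + 10 ** ((r_b - r_a) / 400.0))

def elo_update(r_a: float, r_b: float, outcome: float,
               k: float = 32.0) -> tuple[float, float]:
    """Update both ratings given the actual outcome S for A (1, 0, or 0.5)."""
    e_a = expected_score(r_a, r_b)
    new_a = r_a + k * (outcome - e_a)
    new_b = r_b + k * ((1.0 - outcome) - (1.0 - e_a))
    return new_a, new_b

def normalise(ratings: dict[str, float]) -> dict[str, float]:
    """Map final Elo ratings linearly onto the 1-10 scale."""
    r_min, r_max = min(ratings.values()), max(ratings.values())
    span = (r_max - r_min) or 1.0  # guard against all-equal ratings
    return {vendor: 1 + 9 * (r - r_min) / span for vendor, r in ratings.items()}
```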

Benefits

Maximum AI Leverage: The LLM handles the combinatorial explosion of pairwise comparisons (45 pairs for 10 vendors), whilst humans focus on validating 10–15 critical decisions. This division of labour dramatically reduces human workload without sacrificing oversight.
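The 45 is simply the number of unordered pairs among ten vendors:

$$\binom{10}{2} = \frac{10 \times 9}{2} = 45$$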

Improved Accuracy: Pairwise comparison improves both LLM and human performance. LLMs generate more precise reasoning when comparing specific alternatives rather than assigning abstract scores. Humans make more consistent decisions when validating concrete comparisons rather than rating in isolation. Human seeding provides crucial calibration.

Complete Auditability: Every comparison—whether performed by LLM or human—is logged with explicit reasoning. LLM preferences include documented rationale. Human decisions include timestamps and evaluator identity. Regulatory reviews can trace any score through its complete decision chain and challenge any step.
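One way to picture that trail is as a record per comparison; the fields below are illustrative, not Fluvial's actual schema.

```python
# Illustrative shape of a logged comparison record. Field names are
# assumptions, not the actual audit schema.

from dataclasses import dataclass
from datetime import datetime
from typing import Optional

@dataclass(frozen=True)
class ComparisonRecord:
    vendor_a: str
    vendor_b: str
    outcome: float            # 1.0 = A preferred, 0.0 = B preferred, 0.5 = tie
    source: str               # "llm" or "human"
    reasoning: str            # LLM rationale or evaluator's note
    decided_at: datetime      # timestamp of the decision
    evaluator: Optional[str]  # evaluator identity for human decisions, else None
```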

Consistency: The Elo algorithm ensures mathematical consistency across all comparisons. Transitivity is preserved: if A beats B, and B beats C, then A will rank above C in the final scoring, regardless of whether those comparisons came from LLM or human decisions.

Explainability: When stakeholders question a score, you can present the specific comparisons that drove it: “Vendor X ranked 7.2 because the LLM determined it outperformed Vendors Y and Z on incident response detail (reasoning logged), whilst the human evaluator confirmed it fell short of Vendor W on cultural alignment (comparison dated 15 Nov 2025).”

Integration: Pairwise scores blend seamlessly with existing weighted criteria. Organisations retain their established scoring frameworks whilst addressing the limitations of fully automated assessment.

Use Cases

This feature particularly benefits organisations conducting:

  • Third-party risk assessments where vendor responses reveal nuanced differences in security culture or operational maturity that require specialist interpretation
  • Due diligence reviews requiring expert judgement on strategic alignment or management quality, with full audit trails for investment committees
  • Compliance evaluations where regulatory interpretation varies, context matters, and decisions must be defensible to auditors

The combination of LLM efficiency with human-validated, auditable decision-making addresses the central challenge of AI adoption in high-stakes assessments: gaining speed without sacrificing accountability.