How Pairwise Comparison Reveals Fuzzy Questions

Fluvial's new consistency tracking reveals when assessment criteria are insufficiently fine-grained, transforming pairwise comparison from a scoring tool into a question-design diagnostic.

[Image: Three ice creams representing preference consistency: if you prefer vanilla over chocolate and chocolate over strawberry, you should prefer vanilla over strawberry.]

Circular Preferences Indicate Unfocused Questions

Pairwise comparison delivers more consistent and auditable scoring than direct ranking. We announced this feature in September, emphasising its benefits for AI-assisted evaluation. But the system produces a second, less obvious insight: it identifies questions that are asking too much at once.

When evaluators consistently judge Vendor A superior to Vendor B, B superior to C, yet C superior to A, something is wrong. That circular pattern reveals that the question conflates multiple unrelated concerns. The evaluators are not being inconsistent; the question is forcing them to compare incomparable things.

The ice cream principle

Suppose Alice is choosing ice cream. She prefers pistachio over mango, and mango over vanilla. We would expect her to prefer pistachio over vanilla as well. These preferences show consistency: if A is better than B, and B is better than C, then A should be better than C.

This property is sometimes called transitivity, which simply means that preferences should not loop back on themselves. If they do — if Alice somehow prefers vanilla over pistachio despite her other choices — something deeper is at work. Perhaps she is judging on multiple dimensions: flavour intensity, sweetness, texture. Different pairs invoke different trade-offs, and no single “best” exists.
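To make the property concrete, here is a minimal Python sketch of a transitivity check over three flavours. It is illustrative only — the flavour names are placeholders and this is not Fluvial code:

```python
# Preferences recorded as (winner, loser) pairs from pairwise comparisons.
preferences = {("pistachio", "mango"), ("mango", "vanilla"), ("pistachio", "vanilla")}

def is_transitive_triple(a: str, b: str, c: str, prefs: set[tuple[str, str]]) -> bool:
    """If a beats b and b beats c, transitivity requires that a also beats c."""
    if (a, b) in prefs and (b, c) in prefs:
        return (a, c) in prefs
    return True  # the premise does not apply, so nothing is violated

print(is_transitive_triple("pistachio", "mango", "vanilla", preferences))  # True

# Swap the last pair for ("vanilla", "pistachio") and the triple becomes circular.
```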

In competitive games — chess, tennis — this consistency emerges naturally. The stronger player wins more often, and rankings converge. But in subjective assessments, circular preferences can arise for an instructive reason: the criterion being judged is not actually one thing.

When circular preferences appear: a diagnostic, not an error

Consider a due diligence question:

"Do you implement an information security programme?"
 - Yes
 - No
 - Yes, with qualifications

In practice, most questionnaire responses are not simple yes/no answers. Vendors typically respond "Yes, with qualifications", and the substance lies in those qualifications. When comparing three vendors who all answered "yes" to the same security question, the evaluator must judge whose qualifications are most acceptable. This is where circular preferences often emerge: Vendor A's technical mitigations outweigh B's process gaps, B's governance structure addresses C's documentation weaknesses, yet C's modern architecture compensates for A's incident-response limitations.

An evaluator compares three vendors:

  • Vendor A has robust technical controls (encryption, monitoring, patching) but poor documentation and weak incident-response procedures.
  • Vendor B has exemplary policies, thorough documentation, and clear escalation paths, but relies on outdated infrastructure.
  • Vendor C has modern infrastructure and reasonable documentation, but technical controls are incomplete.

Asked to choose the “better” security programme in pairwise comparisons, the evaluator faces an impossible task:

  • A over B: Technical capability matters more than paperwork.
  • B over C: Clear processes and escalation outweigh infrastructure modernity.
  • C over A: Modern infrastructure with reasonable processes beats strong controls hampered by weak incident response.

The result: A beats B, B beats C, C beats A. A perfect circle.

The evaluator is not confused. The question is. It conflates at least three distinct concerns: technical controls, operational maturity, and infrastructure modernity. Each comparison forces a subjective trade-off between dimensions that cannot be reconciled into a single preference.

Recording circular preferences as questionnaire diagnostics

Fluvial now records these circular patterns for each pairwise-scored question. When the system detects consistent circles — patterns where evaluators produce looping preferences — it flags the question for review.

This is not an error in the scoring process. It is a signal that the question needs to be split. Instead of asking evaluators to assess “information security programme” as a single monolith, break it into components:

  • Technical controls: Encryption, access management, vulnerability scanning.
  • Operational maturity: Documentation, training, incident response.
  • Infrastructure modernity: Architecture, patching, lifecycle management.

Each sub-question now has a coherent focus. Evaluators can express consistent preferences within each category, and the system can apply weights to combine them. Circular preferences disappear because each question isolates a single dimension.
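As an illustration of what a decomposed criterion might look like, here is a hypothetical structure — the field names and weights are invented for this example and are not Fluvial's schema:

```python
# Hypothetical structure for a decomposed criterion; field names and weights
# are illustrative only.
information_security = {
    "criterion": "Information security programme",
    "sub_questions": [
        {"id": "technical_controls", "weight": 0.40,
         "focus": "Encryption, access management, vulnerability scanning"},
        {"id": "operational_maturity", "weight": 0.35,
         "focus": "Documentation, training, incident response"},
        {"id": "infrastructure_modernity", "weight": 0.25,
         "focus": "Architecture, patching, lifecycle management"},
    ],
}

# Each sub-question is scored by pairwise comparison on its own, so evaluators
# never trade one dimension off against another inside a single question.
assert abs(sum(q["weight"] for q in information_security["sub_questions"]) - 1.0) < 1e-9
```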

Why this matters for AI-assisted evaluation

When LLMs perform pairwise comparisons, circular preferences become even more revealing. The LLM is not subject to cognitive fatigue, personal bias, or inconsistent interpretation of scales. If the LLM produces circular judgements, the most likely explanation is that the question genuinely contains multiple conflicting dimensions.

The LLM’s comparative reasoning — logged for each pairwise decision — makes this explicit:

"Vendor A preferred over Vendor B: stronger technical controls outweigh weaker documentation."
"Vendor B preferred over Vendor C: comprehensive processes more valuable than infrastructure age."
"Vendor C preferred over Vendor A: modern architecture with adequate processes preferable to strong controls with poor incident response."

These notes reveal the trade-offs being made. The system can analyse them across multiple comparisons, identify the recurring tensions, and suggest how the question should be split.
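One naive way to surface those recurring tensions — a sketch only, not Fluvial's actual analysis — is to tally the dimensions mentioned across the logged reasoning notes:

```python
from collections import Counter

# Illustrative reasoning notes, paraphrased from the comparisons above.
reasoning_notes = [
    "stronger technical controls outweigh weaker documentation",
    "comprehensive processes more valuable than infrastructure age",
    "modern architecture with adequate processes preferable to strong controls with poor incident response",
]

# A hand-picked keyword map; a real analysis might cluster embeddings or use an LLM pass.
dimension_keywords = {
    "technical controls": ["technical controls", "controls"],
    "operational maturity": ["documentation", "processes", "incident response"],
    "infrastructure modernity": ["infrastructure", "architecture", "modern"],
}

tally = Counter()
for note in reasoning_notes:
    for dimension, keywords in dimension_keywords.items():
        if any(keyword in note.lower() for keyword in keywords):
            tally[dimension] += 1

# Dimensions that recur across many comparisons suggest how the question should be split.
print(tally.most_common())
```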

Human evaluators produce the same trade-offs, but rarely articulate them so consistently. The LLM makes the pattern explicit, turning a vague sense that “this question is hard to answer” into a structured diagnostic.

Measuring consistency

Fluvial measures preference consistency using a standard coefficient that quantifies how often preferences form circles versus consistent rankings. The system produces a score between 0 and 1:

  • Score of 1: Perfect consistency. All preferences form a clear ranking with no circles.
  • Score of 0: Maximum inconsistency. Preferences are essentially random.
  • Between 0 and 1: Partial consistency. Some structure exists, but circular patterns persist.

When the score falls below 0.7, the question is flagged for review. Persistent low scores indicate that the criterion conflates multiple independent dimensions.

For those interested, the metric is Kendall’s coefficient of consistency (ζ), calculated as:

$$\zeta = 1 - \frac{4d}{n(n-1)(n-2)}$$

where d is the number of circular triples and n is the number of items being compared. The formula simply counts how many three-way comparisons form circles rather than consistent orderings.
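A minimal Python sketch of that calculation, using the formula as stated above (variable names are illustrative; this is not Fluvial's implementation):

```python
from itertools import combinations

def count_circular_triples(prefs: set[tuple[str, str]], items: list[str]) -> int:
    """Count triples whose pairwise results form a cycle (each item beats exactly one other)."""
    circular = 0
    for triple in combinations(items, 3):
        wins = {x: 0 for x in triple}
        for x, y in combinations(triple, 2):
            winner = x if (x, y) in prefs else y  # assumes exactly one preference per pair
            wins[winner] += 1
        if all(w == 1 for w in wins.values()):  # 1-1-1 is a cycle; 2-1-0 is a clear order
            circular += 1
    return circular

def consistency_score(prefs: set[tuple[str, str]], items: list[str]) -> float:
    """Zeta as given above: 1 - 4d / (n(n-1)(n-2))."""
    n = len(items)
    d = count_circular_triples(prefs, items)
    return 1 - (4 * d) / (n * (n - 1) * (n - 2))

# The worked example from this article: A beats B, B beats C, C beats A.
prefs = {("A", "B"), ("B", "C"), ("C", "A")}
score = consistency_score(prefs, ["A", "B", "C"])
print(score)        # 1 - 4*1/(3*2*1) = 0.333... for this three-vendor circle
print(score < 0.7)  # True: the question would be flagged for review
```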

The broader implication: pairwise comparison as question validation

Most assessment methodologies treat question design as a one-time activity: define criteria, write questions, deploy the questionnaire. Quality control focuses on response completeness and evaluator consistency, not on whether the questions themselves are well-formed.

Pairwise comparison with consistency tracking inverts this. The scoring process becomes a continuous test of question quality. Each round of comparisons reveals whether the criterion isolates a single dimension or inadvertently conflates several. Circular preferences are not noise to be suppressed; they are signal to be heeded.

This turns pairwise comparison into something more than a scoring technique. It becomes a question-design diagnostic — a system that reveals which criteria need further decomposition and where evaluators are being forced to make impossible trade-offs.

Practical workflow: from circular patterns to refinement

When Fluvial flags a question for low consistency:

  1. Review LLM reasoning logs: Examine the comparative notes for that question. What trade-offs are being articulated? Which dimensions appear repeatedly?

  2. Interview evaluators: For human-validated comparisons, ask evaluators to describe difficult decisions. What made certain pairs hard to judge?

  3. Decompose the question: Identify the independent dimensions being conflated. Split the question into sub-criteria, each addressing a single concern.

  4. Re-score with refined questions: Re-run the assessment with the decomposed criteria. Consistency should improve as each question becomes more focused.

  5. Weight and combine: Use Fluvial’s weighted scoring to combine the sub-criteria into a composite score, making the trade-offs explicit rather than forcing evaluators to resolve them implicitly.
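For step 5, the combination itself is straightforward arithmetic. A minimal sketch follows; the weights and per-dimension scores are illustrative, not Fluvial's API:

```python
# Hypothetical per-dimension scores for one vendor, e.g. normalised results from
# pairwise comparison within each sub-criterion (values are illustrative).
sub_scores = {
    "technical_controls": 0.8,
    "operational_maturity": 0.5,
    "infrastructure_modernity": 0.6,
}

weights = {
    "technical_controls": 0.40,
    "operational_maturity": 0.35,
    "infrastructure_modernity": 0.25,
}

# The composite makes the trade-off explicit: the weights are chosen up front
# rather than resolved implicitly inside each pairwise judgement.
composite = sum(sub_scores[k] * weights[k] for k in weights)
print(round(composite, 3))  # 0.8*0.40 + 0.5*0.35 + 0.6*0.25 = 0.645
```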

This workflow transforms circular preferences from a problem to be fixed into a quality-improvement mechanism. The system does not just score responses; it improves the questions being asked.

Complementing pairwise comparison’s other benefits

Consistency tracking complements the core advantages of pairwise comparison:

  • Better judgements: Evaluators make more reliable decisions when comparing two concrete examples rather than rating in isolation.
  • AI efficiency: LLMs produce more accurate reasoning when comparing alternatives rather than assigning abstract scores.
  • Complete audit trails: Every comparison is logged with explicit reasoning, creating a complete decision trail.

Consistency analysis adds a fourth benefit: criterion validation. The system does not assume that the questions are well-formed; it tests that assumption continuously and flags problems.

Conclusion: scoring as feedback

Traditional scoring treats the questionnaire as fixed and the responses as variable. Pairwise comparison with consistency tracking flips this: the responses reveal problems with the questionnaire itself.

When evaluators struggle to produce consistent rankings, it is often because they are being asked to judge multiple things at once. Circular preferences make this visible, transforming pairwise comparison from a method for generating scores into a tool for refining the assessment framework.

Fluvial’s implementation records these patterns, quantifies them, and exposes them for review. The result is a system that not only produces better scores but also produces better questions — questions that isolate single dimensions, permit consistent judgement, and generate rankings that reflect genuine, defensible preferences.

That is the deeper promise of pairwise comparison: not just more accurate assessment, but more intelligent question design.
