Are LLMs Vulnerable to Preference-Undermining Attacks (PUA)? A Factorial Analysis Methodology for Diagnosing the Trade-off between Preference Alignment and Real-World Validity
Abstract
Research examines how large language models can be manipulated through preference-undermining attacks that exploit alignment objectives, revealing model vulnerabilities and proposing a factorial evaluation method for diagnosing alignment risks.
Large Language Model (LLM) training often optimizes for preference alignment, rewarding outputs that are perceived as helpful and interaction-friendly. However, this preference-oriented objective can be exploited: manipulative prompts can steer responses toward user-appeasing agreement and away from truth-oriented correction. In this work, we investigate whether aligned models are vulnerable to Preference-Undermining Attacks (PUA), a class of manipulative prompting strategies designed to exploit the model's tendency to appease user preferences at the expense of truthfulness. We propose a diagnostic methodology that provides a finer-grained and more actionable analysis than aggregate benchmark scores, using a factorial evaluation framework to decompose prompt-induced shifts into interpretable effects of system objectives (truth- vs. preference-oriented) and PUA-style dialogue factors (directive control, personal derogation, conditional approval, reality denial) within a controlled 2 × 2^4 design. Surprisingly, more advanced models are sometimes more susceptible to manipulative prompts. Beyond the dominant reality-denial factor, we observe model-specific sign reversals and interactions among PUA-style factors, suggesting that tailored defenses are needed rather than assuming uniform robustness. These findings yield a reproducible factorial evaluation methodology that provides finer-grained diagnostics for post-training processes such as RLHF, supporting better trade-offs during LLM product iteration through a more nuanced understanding of preference-alignment risks and the impact of manipulative prompts.
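To make the design concrete, here is a minimal sketch of how the 2 × 2^4 factorial grid and its effect decomposition could be set up in Python (assuming pandas and statsmodels). The factor names follow the abstract; the truthfulness outcome, the scoring stub, and the choice of a linear model with objective-by-factor interactions are illustrative assumptions, not the authors' released code.

```python
# Sketch of the 2 x 2^4 factorial evaluation described in the abstract.
# Assumptions: a scalar "truthfulness" outcome per condition and an OLS
# decomposition into main effects plus objective-by-factor interactions.
from itertools import product

import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

OBJECTIVES = ["truth", "preference"]  # system-objective factor (2 levels)
PUA_FACTORS = [                       # binary on/off dialogue factors
    "directive_control",
    "personal_derogation",
    "conditional_approval",
    "reality_denial",
]


def score_response(objective: str, flags: dict) -> float:
    """Placeholder: generate a response under this condition and score its
    truthfulness. Replace with an actual generation + judging pipeline."""
    raise NotImplementedError


def build_design() -> pd.DataFrame:
    """Enumerate all 2 * 2**4 = 32 experimental conditions."""
    rows = []
    for objective, *flags in product(OBJECTIVES, *([[0, 1]] * len(PUA_FACTORS))):
        rows.append({"objective": objective, **dict(zip(PUA_FACTORS, flags))})
    return pd.DataFrame(rows)


def fit_effects(df: pd.DataFrame):
    """Decompose truthfulness shifts into main effects and
    objective-by-factor interactions."""
    formula = (
        "truthfulness ~ C(objective) + "
        + " + ".join(PUA_FACTORS)
        + " + "
        + " + ".join(f"C(objective):{f}" for f in PUA_FACTORS)
    )
    return smf.ols(formula, data=df).fit()


if __name__ == "__main__":
    design = build_design()
    # Placeholder outcome: random scores stand in for real model evaluations.
    design["truthfulness"] = np.random.default_rng(0).uniform(size=len(design))
    print(fit_effects(design).summary())
```

In a real run, each of the 32 conditions would be evaluated over many prompts and the fitted coefficients would indicate which manipulation factors (e.g., reality denial) most strongly shift the model away from truth-oriented responses.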
Community
This paper treats preference undermining as an experimental object, not a vibe. A clean factorial design isolates manipulation factors and quantifies when truth yields to compliance. Conclusion, stated politely: yes, a large model can be PUA-ed, and it may rationalize the outcome as alignment.
Interesting
ummm, I guess someone has hurt the author's heart
This is an automated message from the Librarian Bot. I found the following papers similar to this paper.
The following papers were recommended by the Semantic Scholar API
- PARROT: Persuasion and Agreement Robustness Rating of Output Truth - A Sycophancy Robustness Benchmark for LLMs (2025)
- Single LLM Debate, MoLaCE: Mixture of Latent Concept Experts Against Confirmation Bias (2025)
- Breaking Minds, Breaking Systems: Jailbreaking Large Language Models via Human-like Psychological Manipulation (2025)
- Epistemic Fragility in Large Language Models: Prompt Framing Systematically Modulates Misinformation Correction (2025)
- Does Tone Change the Answer? Evaluating Prompt Politeness Effects on Modern LLMs: GPT, Gemini, LLaMA (2025)
- Are LLM Decisions Faithful to Verbal Confidence? (2026)
- BiasLab: A Multilingual, Dual-Framing Framework for Robust Measurement of Output-Level Bias in Large Language Models (2026)