About This Work
What guidelines should scholars follow when reporting LLM-integrated systems (software systems that query a large language model, or LLM, at some point during program execution) in HCI?
We offer the following eight guidelines for reporting LLM-integrated systems in HCI research papers, distilled from a study with 18 authors and reviewers and refined through additional feedback from 6 expert HCI researchers. Authors can use these guidelines to improve the clarity and rigor of their reporting, while reviewers can use them to assess whether papers provide sufficient detail to validate claims involving LLM components.
Guidelines for Reporting LLM-Integrated Systems in HCI
As an author, the central problem of reporting an LLM-integrated system is cultivating trust with reviewers who have imperfect information about the quality of your system and development process. While trust-building has always been important, LLMs make it significantly harder due to their unpredictability and variability. In response, authors must typically do more work, both in validating their system and in reporting details to reviewers. The following guidelines summarize the specific components your reporting should include to repair trust in the face of LLM uncertainty.
Authors might use these guidelines to improve the quality of their reporting, while reviewers can use them to calibrate expectations. Compared to previous HCI norms for systems papers, one essential difference is the reporting of technical evaluations (i.e., evaluations conducted outside of user studies), which nearly all of our participants stated is expected when LLM components are central to the contributions. This is a departure from traditional HCI system papers without ML components, where user studies often sufficed to validate system claims.
The centrality of the LLM component(s) to author claims dictates the "weight" of these guidelines. Papers reporting systems where LLMs are peripheral features (e.g., simple summary buttons) need not follow all guidelines with equal rigor. The sensitivity of the topic (e.g., health) also shapes the expected rigor.
Justifying and Framing the Contribution
1. Justify why an LLM is appropriate.
Briefly justify the choice to use an LLM in the system, compared to alternative technical approaches for the
same purpose. This justification can remain speculative and persuasive (i.e., in most cases, there is no need
to compare to alternative methods like decision trees, trained classifiers, etc.). Note that this guideline
refers to language models in general, rather than a specific model. While authors may also justify choosing a
specific model, for the vast majority of HCI work, justifying model selection is not required: "it
worked well enough for our prototype" or "we had credits with this provider" can be justification enough.
Example:
"We chose an LLM approach because our system needed to handle open-ended user queries and generate contextually
appropriate responses in natural language—capabilities difficult to achieve with rule-based systems or traditional
classifiers."
Ian's Comment
Many participants and senior researchers emphasized this guideline.
I also take this recommendation to mean authors should situate their work inside a broader landscape and history of technical approaches to the problem at hand. "AI" isn't a magic word that sets a system apart from all prior art. While it would be impractical to expect authors to compare their LLM approach to every possible alternative, authors should at least acknowledge prior approaches and explain why an LLM is a suitable choice for their specific use case.
2. De-emphasize LLMs/AI in paper framing when not relevant to the contribution.
If LLMs are not central to the contribution, but only an enabling implementation detail, consider
de-emphasizing the terms LLM/AI in the title, abstract, and introduction. Overemphasis can contribute to
"AI fatigue" and distort the paper's primary contribution. The exception is when the research directly advances the
understanding of LLMs themselves (e.g., prompt engineering tools).
Example:
Remove the LLM term from the title: "FoodChainer: An LLM-powered Mobile Game for Teaching Ecosystem Dynamics" becomes "FoodChainer: A Mobile Game for Teaching Ecosystem Dynamics".
Remove the AI term from the abstract: "We present FoodChainer, a mobile game that uses an LLM to generate dynamic food webs based on user input. By leveraging AI, the game engages users in learning about ecosystem interactions through interactive storytelling." becomes "We present FoodChainer, a mobile game that uses an LLM to generate dynamic food webs based on user input. The game engages users in learning about ecosystem interactions through interactive storytelling."
Ian's Comment
At UIST 2025, over 1 in 3 papers reported an LLM-integrated system. At this point, noting that a system is "LLM-based" or "AI-powered" is simply redundant: it's increasingly expected that systems will at some point query an LLM. Papers that overemphasize AI/LLMs when they are not central can also come across as trying too hard to ride AI hype, which can backfire with reviewers. Today, it's rarely surprising or interesting that "LLMs can help address X problem."
3. Consider how future improvements in AI may render some claimed contributions obsolete.
How "future-proof" is the research? Consider whether claimed contributions would still be relevant in the near
future, when AI models improve. Although speculative, this is especially relevant for claims of novel technical
contributions with complex architectures. In our study, authors employed three framing strategies to help future-proof their research: 1) centering a
novel interaction design/paradigm as their chief contribution, 2) framing systems as probes to generate insights
into user behavior, and 3) tackling a niche problem domain that has received comparatively little attention.
Ian's Comment
This is also a note about paper framing. While you can claim a technical contribution in LLM-integrated work, in my experience that is not an effective strategy, since the spectre of "LLMs will get better soon" looms in reviewers' minds. Researchers told us they framed their systems as "probes" instead, or as prototypes of novel interaction paradigms, deliberately avoiding claims of technical novelty.
Reporting System Engineering and Development
4. Report prompts and configuration details that directly affect claimed contributions.
Report prompts and configuration details that directly affect readers' ability to validate core claims around
system or user behavior. Include suggestive input-output examples for each reported prompt or LLM component.
For systems with many prompts, prioritize those essential for understanding and evaluating key claims. Non-critical
components (e.g., a text summarization button) need not be exhaustively documented; however, authors should note
that such components were straightforward to engineer. Always report the exact model name and version.
Example:
"Our system uses GPT-4 (gpt-4-0613) at default settings to generate conflict resolution feedback. The prompt template operationalizes Fisher and Ury's principled negotiation framework, and is structured as follows:
'Given the following conflict scenario and conversation history, identify: (1) the parties' interests vs. positions, (2) objective criteria that could resolve the dispute, and (3) a mutually beneficial option. Respond in 2-3 sentences appropriate for an undergraduate student. Scenario: {scenario} Conversation history: {history}' Appendix A provides the full prompt alongside an example input and output."
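One low-effort way to make this reporting easy is to keep prompts and model configuration in one place in the codebase, so they can be copied verbatim into the paper or appendix. Below is a minimal Python sketch of this practice; the constant names, parameter values, and helper function are illustrative, not a prescribed setup.

# Illustrative sketch: keeping prompts and model configuration in one place makes
# them easy to report exactly. Model name, parameters, and template are examples.
FEEDBACK_MODEL = "gpt-4-0613"                               # report the exact model name and version
FEEDBACK_PARAMS = {"temperature": 1.0, "max_tokens": 256}   # report any non-default settings

FEEDBACK_PROMPT = (
    "Given the following conflict scenario and conversation history, identify: "
    "(1) the parties' interests vs. positions, (2) objective criteria that could "
    "resolve the dispute, and (3) a mutually beneficial option. Respond in 2-3 "
    "sentences appropriate for an undergraduate student.\n"
    "Scenario: {scenario}\nConversation history: {history}"
)

def build_feedback_prompt(scenario: str, history: str) -> str:
    # The filled template plus one input-output pair can go in an appendix.
    return FEEDBACK_PROMPT.format(scenario=scenario, history=history)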
Ian's Comment
Fairly uncontroversial, yet the most commonly violated guideline in my reviewing experience, and easily fixable. One note here: we found you don't need to report every single prompt in the system, just the ones that are central to claims. Experienced developers are satisfied with notes like "other prompts were straightforward and used default settings."
5. Document the engineering methodology of LLM components.
Include a short "engineering methodology" subsection near the system implementation section, describing how LLM components
were developed and refined. Explain design choices and the iterative process used for prompt and architecture
development. This practice parallels how HCI papers document qualitative data analysis methods, clarifies the effort and care that went into building the system (which cannot be deduced from the final prompt alone), and offers insights for developers seeking to replicate or build upon the work.
Example:
"We iteratively refined our prompts through three stages: (1) exploratory testing with 15 example inputs from
our formative study, (2) team-based review where three researchers independently evaluated outputs and discussed
failure modes, and (3) validation with five domain experts who provided feedback on appropriateness and accuracy."
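One lightweight way to gather raw material for such a subsection is to log each prompt revision with its date and rationale while iterating. A hypothetical Python sketch (the versions, dates, and change notes are invented examples):

# Hypothetical prompt changelog kept alongside the code during development; entries
# like these can later be summarized in an "engineering methodology" subsection.
PROMPT_CHANGELOG = [
    {"version": "v1", "date": "2025-02-03",
     "change": "Initial draft based on formative-study examples."},
    {"version": "v2", "date": "2025-02-10",
     "change": "Added a 2-3 sentence length constraint after outputs ran long."},
    {"version": "v3", "date": "2025-02-21",
     "change": "Reworded tone instructions after expert review flagged overly clinical phrasing."},
]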
Ian's Comment
Reviewers in our study appreciated some insight into how authors engineered their LLM components: was it "thoughtful," or was it slapped together? Including a brief description of the iterative process authors used to refine prompts and system architecture helps reviewers understand the effort behind LLM component design. It should also prompt authors to reflect on whether they're adopting iterative design practices and seriously engaging with the LLM's behavior, rather than just throwing something together for a paper.
6. Provide a concrete sense of LLM integration without overwhelming detail.
Provide readers with a concrete understanding of how each LLM component interacts with other system components.
Use representative input-output examples or prompt sketches that convey each component's design, input format, and output behavior in relation to its integration in the larger system. These may be presented concisely in data-flow diagrams or tables, with further details such as exact prompts relegated to an appendix or supplementary material.
Example:
Create a system architecture diagram showing how user input flows through your interface, gets processed by the
LLM, and returns to the user. Include a truncated example like:
Input: "User query: ..." → Prompt template X with Y inputs to LLM → Output: "It is clear that the..."
Ian's Comment
Tricky to get right in limited space, but effective when done well. Reviewers expect to see the "big picture" of how LLM components fit into the overall system architecture, in the main text and not an appendix, but don't want to be overwhelmed with minutiae.
Evaluating Robustness and Generalizability
7. Conduct a small technical evaluation of LLM components that are central to claims.
LLMs are stochastic and never 100% correct. When an LLM component is integral to claims, authors should perform a small technical evaluation of the LLM component on a dataset of representative inputs, outside of a user study, to provide insight into the robustness and
generalizability of its behavior. In most cases,
datasets should be custom-tailored to the use case and smaller than benchmarks (e.g., 30-100 samples could be
appropriate), consistent with the notion of "evals" in LLM-integrated software engineering. Metrics can be automated, based on expert ratings, or both. Unlike in ML/NLP research, the aim is not to rigorously benchmark model behavior but to provide a sanity check that contextualizes the LLM component's robustness without the confounds of a user study.
Example:
"We evaluated our LLM-generated feedback on 50 representative student qualitative coding excerpts from prior studies. Two expert qualitative researchers independently rated each piece of feedback on accuracy (Cohen's κ=0.78) and actionability (κ=0.82). The system achieved 86% accuracy (43/50 cases) and provided actionable suggestions in 92% of cases (46/50). Common failure modes included overgeneralizing from limited context (6 cases) and paraphrasing codes already present (4 cases)."
Ian's Comment
This is the biggest difference for authors used to writing traditional HCI system papers without ML components. Reviewers now expect some form of technical evaluation of LLM components, especially when they are central to the claims. We have personally experienced reviewers demanding this (e.g., for our paper ChainBuddy, CHI 2025). In my experience, a small-scale evaluation with 30-100 representative inputs, evaluated with automated metrics or expert ratings, is usually sufficient to satisfy reviewer skepticism and ground performance claims.
8. Describe failure modes and limitations of LLM components.
Briefly report errors, biases, or limitations of LLM components in the system. Summarize these failure modes
qualitatively, providing concrete examples, ideally drawn from the results of a technical evaluation. As LLMs
are stochastic, they can always err for some inputs; thus, avoid sweeping claims of positive performance
that imply the system is infallible. Reporting failure modes clarifies system reliability, provides transparency
about the boundary conditions of claims, and aligns with ongoing efforts to report potential negative impacts.
Example:
"We observed that the system occasionally generates overly formal language that participants found stilted. In 8% of our test cases (4/50 inputs), the LLM also failed to correctly interpret the domain-specific meaning of terms, requiring
users to rephrase their queries."
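If each case in the technical evaluation (Guideline 7) is tagged with a failure-mode label during review, percentages like those above fall out of a simple tally. A hypothetical Python sketch with invented labels:

# Hypothetical tally of failure-mode labels assigned while reviewing eval cases;
# None marks cases with no observed failure. Labels and counts are invented.
from collections import Counter

failure_labels = ["misread_domain_term", None, "overly_formal", "misread_domain_term",
                  None, "overly_formal", None, None]

total = len(failure_labels)
for mode, count in Counter(l for l in failure_labels if l is not None).most_common():
    print(f"{mode}: {count}/{total} cases ({count / total:.0%})")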
Ian's Comment
Being upfront about limitations and failure modes not only builds trust with reviewers but also demonstrates a mature, humble understanding of the technology's capabilities. Yet many submitted papers aren't upfront about limitations, and that's a big red flag. In papers I have reviewed, authors make sweeping claims like "our system provides accurate feedback" or "our component understands user intent for XYZ types of users" without any evidence to back them. Why are authors so smitten with their systems? Research shows that people tend to over-trust AI, so authors might be prone to this too. Rhetoric around AI has also become so hype-laden that authors might adopt similar language, influenced by social media.
What Does "LLM Wrapper" Mean?
The term "LLM wrapper" is often used pejoratively by reviewers. Our interviewees gave varied, often contradictory definitions. However, some common dimensions were:
- A system that appears trivial to build
- A paper whose primary contribution is "an LLM solved X problem"
- An interface or interaction that closely resembles a chatbot UI like ChatGPT
- A paper where you don't learn anything new
- A paper that doesn't engage with past HCI literature, especially work predating the current AI era
We recommend that reviewers avoid using "LLM wrapper" in formal reviews since its meaning is highly subjective.
For Authors: To avoid some of this stigma, position your work within pre-LLM-era HCI literature and frame your contributions around insights, design principles, or user understanding rather than technical novelty. Follow the guidelines above, especially de-emphasizing LLM/AI in framing when appropriate (Guideline 2), to demonstrate a substantive contribution. Consider what the reader learns from your work beyond "you can use an LLM to address X problem."
Citation
If you find these guidelines useful, please cite our paper:
@article{navarro2026reportingllm,
title={Reporting and Reviewing LLM-Integrated Systems in HCI: Challenges and Recommendations},
author={Felix Navarro, Karla and Syriani, Eugene and Arawjo, Ian},
year={2026}
}