Guidelines for Reporting and Reviewing
LLM-Integrated Systems in HCI

Karla Felix Navarro, Eugene Syriani, Ian Arawjo
Department of Computer Science and Operations Research,
Université de Montréal, Montréal QC, Canada
Conditionally accepted to CHI 2026

About This Work

What guidelines should scholars follow when reporting LLM-integrated systems (software systems that query a large language model (LLM) at some point during program execution) in HCI? Distilled from a study with 18 authors and reviewers and refined through additional feedback from 6 expert HCI researchers, we offer eight guidelines for reporting LLM-integrated systems in HCI research papers. Authors can use these guidelines to improve the clarity and rigor of their reporting, while reviewers can use them to assess whether papers provide sufficient detail to validate claims involving LLM components.

Guidelines for Reporting LLM-Integrated Systems in HCI

For authors, the central problem of reporting an LLM-integrated system is cultivating trust with reviewers, who have imperfect information about the quality of the system and its development process. While trust-building has always been important, LLMs make it significantly harder due to their unpredictability and variability. In response, authors must typically do more work, both in validating their system and in reporting details to reviewers. The following guidelines summarize the specific components your reporting should include in order to repair trust in the face of LLM uncertainty.

Authors might use these guidelines to improve the quality of their reporting, while reviewers can use them to calibrate expectations. Compared to previous HCI norms for systems papers, one essential difference is the reporting of technical evaluations (i.e., evaluations conducted outside of user studies), which nearly all of our participants stated was expected when LLM components are central to the contributions. This is a departure from traditional HCI systems papers without ML components, where user studies often sufficed to validate system claims.

The centrality of the LLM component(s) to the authors' claims dictates the "weight" of these guidelines. Papers reporting systems where LLMs are peripheral features (e.g., simple summary buttons) need not follow all guidelines with equal rigor. The sensitivity of the application domain (e.g., health) also raises the expected rigor.

Justifying and Framing the Contribution

1. Justify why an LLM is appropriate.

Briefly justify the choice to use an LLM in the system, compared to alternative technical approaches for the same purpose. This justification can remain speculative and persuasive (i.e., in most cases, there is no need to empirically compare to alternative methods like decision trees, trained classifiers, etc.). Note that this guideline refers to language models in general, rather than a specific model. While authors may also justify choosing a specific model, for the vast majority of HCI work, justifying model selection is not required: "it worked well enough for our prototype" or "we had credits with this provider" can be justification enough.
Example:
"We chose an LLM approach because our system needed to handle open-ended user queries and generate contextually appropriate responses in natural language—capabilities difficult to achieve with rule-based systems or traditional classifiers."
Ian's Comment
Ian: Many participants and senior researchers emphasized this guideline. I also take this recommendation to mean authors should situate their work inside a broader landscape and history of technical approaches to the problem at hand. "AI" isn't a magic word that sets a system apart from all prior art. While it would be impractical to expect authors to compare their LLM approach to every possible alternative, authors should at least acknowledge prior approaches and explain why an LLM is a suitable choice for their specific use case.

2. De-emphasize LLMs/AI in paper framing when not relevant to the contribution.

If LLMs are not central to the contribution, but only an enabling implementation detail, consider de-emphasizing the terms LLM/AI in the title, abstract, and introduction. Overemphasis can contribute to "AI fatigue" and distort the paper's primary contribution. The exception is when the research directly advances the understanding of LLMs themselves (e.g., prompt engineering tools).
Example:
Remove the LLM term from the title: "FoodChainer: An LLM-powered Mobile Game for Teaching Ecosystem Dynamics" → "FoodChainer: A Mobile Game for Teaching Ecosystem Dynamics"
Remove the AI term from the abstract: "We present FoodChainer, a mobile game that uses an LLM to generate dynamic food webs based on user input. By leveraging AI, the game engages users in learning about ecosystem interactions through interactive storytelling." → "We present FoodChainer, a mobile game that uses an LLM to generate dynamic food webs based on user input. The game engages users in learning about ecosystem interactions through interactive storytelling."
Ian's Comment
Ian: At UIST 2025, over 1 in 3 papers reported an LLM-integrated system. At this point, noting a system is "LLM-based" or "AI-powered" is simply redundant: it's increasingly expected that systems will at some point query an LLM. Papers that overemphasize AI/LLMs when they are not central can also come across as trying too hard to ride AI hype, which can backfire with reviewers. Today, it's rarely surprising or interesting that "LLMs can help address X problem."

3. Consider how future improvements in AI may render some claimed contributions obsolete.

How "future-proof" is the research? Consider whether claimed contributions would still be relevant in the near future, when AI models improve. Although speculative, this is especially relevant for claims of novel technical contributions with complex architectures. In our study, authors employed three framing strategies to help future-proof their research: 1) centering a novel interaction design/paradigm as their chief contribution, 2) framing systems as probes to generate insights into user behavior, and 3) tackling a niche problem domain that has received considerably less attention.
Ian's Comment
Ian: This is also a note about paper framing. While you can claim a technical contribution in LLM-integrated work, in my experience that is not an effective strategy, since the spectre of "LLMs will get better soon" looms in reviewers' minds. Researchers told us they instead framed their systems as "probes" or as prototypes of novel interaction paradigms, deliberately avoiding claims of technical novelty.

Reporting System Engineering and Development

4. Report prompts and configuration details that directly affect claimed contributions.

Report the prompts and configuration details that directly affect readers' ability to validate core claims about system or user behavior. Include illustrative input-output examples for each reported prompt or LLM component. For systems with many prompts, prioritize those essential for understanding and evaluating key claims. Non-critical components (e.g., a text summarization button) need not be exhaustively documented; however, authors should note that such components were straightforward to engineer. Always report the exact model name and version.
Example:
"Our system uses GPT-4 (gpt-4-0613) at default settings to generate conflict resolution feedback. The prompt template operationalizes Fisher and Ury's principled negotiation framework, and is structured as follows: 'Given the following conflict scenario and conversation history, identify: (1) the parties' interests vs. positions, (2) objective criteria that could resolve the dispute, and (3) a mutually beneficial option. Respond in 2-3 sentences appropriate for an undergraduate student. Scenario: {scenario} Conversation history: {history}' Appendix A provides the full prompt alongside an example input and output."
Ian's Comment
Ian: Fairly uncontroversial, yet the most commonly violated guideline in my reviewing experience, and easily fixable. One note here: we found you don't need to report every single prompt in the system, just the ones that are central to claims. Experienced developers are satisfied with notes like "other prompts were straightforward and used default settings."

5. Document the engineering methodology of LLM components.

Include a short "engineering methodology" subsection near the system implementation, describing how LLM components were developed and refined. Explain design choices and the iterative process used for prompt and architecture development. This practice parallels how HCI papers document qualitative data analysis methods, conveys the effort and care that went into building the system (which cannot be deduced from reading the final chosen prompt alone), and offers insights for developers seeking to replicate or build upon the work.
Example:
"We iteratively refined our prompts through three stages: (1) exploratory testing with 15 example inputs from our formative study, (2) team-based review where three researchers independently evaluated outputs and discussed failure modes, and (3) validation with five domain experts who provided feedback on appropriateness and accuracy."
Ian's Comment
Ian: Reviewers in our study appreciated some insight into how authors engineered their LLM components: was it "thoughtful", or was it slapped together? Including a brief description of the iterative process authors used to refine prompts and system architecture helps reviewers understand the effort behind LLM component design. It should also prompt authors to reflect on whether they are adopting iterative design practices and seriously engaging with the LLM's behavior, rather than just throwing something together for a paper.

6. Provide a concrete sense of LLM integration without overwhelming detail.

Provide readers with a concrete understanding of how each LLM component interacts with other system components. Use representative input-output examples or prompt sketches that give readers a clear sense of each LLM component's design, input format, and output behavior, in relation to its integration into the larger system. These may be presented concisely in data-flow diagrams or tables, with further details, such as exact prompts, relegated to an appendix or supplementary material.
Example:
Create a system architecture diagram showing how user input flows through your interface, gets processed by the LLM, and returns to the user. Include a truncated example like: Input: "User query: ..." → Prompt template X with Y inputs to LLM → Output: "It is clear that the..."
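If space allows, a compact pseudocode-style sketch can complement the diagram. The sketch below is hypothetical: retrieve_context, call_llm, and render_response are placeholder names standing in for whatever components a given system actually has.

# Hypothetical data-flow sketch for one LLM component; all names are placeholders.

def retrieve_context(user_query: str) -> str:
    """Placeholder: gather whatever application state the prompt needs (history, documents, ...)."""
    return "..."

def call_llm(prompt: str) -> str:
    """Placeholder: a single call to the provider API with the reported model and settings."""
    return "It is clear that the..."

def render_response(raw_output: str) -> dict:
    """Placeholder: parse and format the raw LLM output for display in the interface."""
    return {"display_text": raw_output}

def handle_user_query(user_query: str) -> dict:
    # Interface input -> prompt template X (with Y inputs) -> LLM -> interface output.
    context = retrieve_context(user_query)
    prompt = f"User query: {user_query}\nContext: {context}\nInstructions: ..."
    return render_response(call_llm(prompt))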
Ian's Comment
Ian: Tricky to get right in limited space, but effective when done well. Reviewers expect to see the "big picture" of how LLM components fit into the overall system architecture, in the main text and not an appendix, but don't want to be overwhelmed with minutiae.

Evaluating Robustness and Generalizability

7. Conduct a small technical evaluation of LLM components that are central to claims.

LLMs are stochastic and never 100% correct. When an LLM component is integral to claims, authors should perform a small technical evaluation of that component on a dataset of representative inputs, outside of a user study, to provide insight into the robustness and generalizability of its behavior. In most cases, datasets should be custom-tailored to the use case and smaller than benchmarks (e.g., 30-100 samples could be appropriate), consistent with the notion of "evals" in LLM-integrated software engineering. Metrics can be automated, based on expert ratings, or both. Unlike in ML/NLP research, the aim is not to rigorously benchmark model behavior, but to provide a sanity check that contextualizes the LLM component's robustness without the confounders of a user study.
Example:
"We evaluated our LLM-generated feedback on 50 representative student qualitative coding excerpts from prior studies. Two expert qualitative researchers independently rated each piece of feedback on accuracy (Cohen's κ=0.78) and actionability (κ=0.82). The system achieved 86% accuracy (43/50 cases) and provided actionable suggestions in 92% of cases (46/50). Common failure modes included overgeneralizing from limited context (6 cases) and paraphrasing codes already present (4 cases)."
Ian's Comment
Ian: This is the biggest difference for authors used to writing traditional HCI system papers without ML components. Reviewers now expect some form of technical evaluation of LLM components, especially when they are central to the claims. We have personally experienced reviewers demanding this (e.g., for our paper ChainBuddy, CHI 2025). In my experience, a small-scale evaluation with 30-100 representative inputs, evaluated with automated metrics or expert ratings, is usually sufficient to satisfy reviewer skepticism and ground performance claims.

8. Describe failure modes and limitations of LLM components.

Briefly report errors, biases, or limitations of LLM components in the system. Summarize these failure modes qualitatively, providing concrete examples, ideally drawn from the results of a technical evaluation. Because LLMs are stochastic, they can always err on some inputs; thus, avoid sweeping claims of positive performance that imply the system is infallible. Reporting failure modes clarifies system reliability, makes the boundary conditions of claims transparent, and aligns with ongoing efforts to report the potential negative impacts of systems.
Example:
"We observed that the system occasionally generates overly formal language that participants found stilted. In 8% of our test cases (4/50 inputs), the LLM also failed to correctly interpret the domain-specific meaning of terms, requiring users to rephrase their queries."
Ian's Comment
Ian: Being upfront about limitations and failure modes not only builds trust with reviewers but also demonstrates a mature, humble understanding of technology's capabilities. Yet, so many submitted papers aren't upfront about limitations, and it's a big red flag. In papers I have reviewed, authors can make sweeping claims like "our system provides accurate feedback" or "our component understands user intent for XYZ types of users" without any evidence backing their claims. Why are authors so smitten with their systems? Research shows that people tend to over-trust AI, so authors might be prone to this too. Also, rhetoric around AI has become so hype-laden that authors might adopt similar language, influenced by social media.

Some Example "Reviewer Thoughts" and How the Guidelines Address Them

Reviewer thought → Relevant guideline(s)
"Where are the prompts used? I can't get a sense of how it works." → G4: Report prompts that affect core claims
"Why does this system need an LLM at all? There's prior work that addresses this issue without an LLM." → G1: Justify LLM use
"I don't understand what the LLM is actually doing in this system. It seems like it's doing a lot of the heavy lifting." → G6: Concrete sense of integration; G4: Report prompts that affect core claims
"The authors say they refined their LLM pipeline to be good at X, but how? Should I just take their word for it?" → G7: Technical evaluation; G5: Engineering methodology
"Authors could be cherry-picking the moments that it worked. They only show one task in the screenshots, yet claim it generalizes across many tasks and users." → G7: Technical evaluation; G8: Failure modes
"This agent architecture is very complex; I'm not sure whether all these LLM components were necessary to improve performance." → G7: Technical evaluation (here, an ablation study)
"This feels like an LLM wrapper." → G2: De-emphasize LLMs in framing; G3: Future-proof contributions; G5: Engineering methodology
"Did the authors just throw this together? It's unclear how much thought went into building this." → G5: Engineering methodology
"12 study participants liked the system... OK. But they may be biased toward liking an AI approach more than a non-AI one, not be experienced enough to tell when the AI is wrong, or just want to appease the researchers. How can we know the system really performs well?" → G7: Technical evaluation
"This complex technical architecture won't matter once models get better." → G3: Future-proof contributions

Exemplar Papers

The following papers demonstrate good practices for reporting LLM-integrated systems in HCI and follow the majority of the guidelines presented above:

Feel free to propose more by submitting a PR to our GitHub repository!

What Does "LLM Wrapper" Mean?

The term "LLM wrapper" is often used pejoratively by reviewers. Our interviewees gave varied, often contradictory definitions. However, some common dimensions were:

We recommend that reviewers avoid using "LLM wrapper" in formal reviews since its meaning is highly subjective.

For Authors: To avoid some of this stigma, position your work well in pre-LLM-era HCI literature and frame your contributions around insights, design principles, or user understanding rather than technical novelty. Follow the guidelines above—especially de-emphasizing LLM/AI in framing when appropriate (Guideline 2)—to demonstrate substantive contribution. Consider what the reader learns from your work beyond "you can use an LLM to address X problem."

Citation

If you find these guidelines useful, please cite our paper:

@article{navarro2026reportingllm,
  title={Reporting and Reviewing LLM-Integrated Systems in HCI: Challenges and Recommendations},
  author={Felix Navarro, Karla and Syriani, Eugene and Arawjo, Ian},
  year={2026}
}

Contact and Feedback

We welcome community feedback. For comments or suggestions, please raise a Discussion or PR on our GitHub repository. These guidelines were our best shot at distilling current best practices from a need-finding study with 18 "users" (authors and reviewers of such papers), iterated on with six senior HCI researchers. Our paper also includes common pitfalls for reviewers of LLM-integrated systems and suggestions for HCI communities, which we omitted here for now. Please see the full paper for those details.

Acknowledgments

We thank the 18 HCI researchers who participated in our interviews and the six expert HCI researchers who provided valuable feedback on our guidelines.