FeedbackLogs: Recording and Incorporating Stakeholder Feedback into Machine Learning Pipelines

Even though machine learning (ML) pipelines affect an increasing array of stakeholders, there is little work on how input from stakeholders is recorded and incorporated. We propose FeedbackLogs, addenda to existing documentation of ML pipelines, to track the input of multiple stakeholders. Each log records important details about the feedback collection process, the feedback itself, and how the feedback is used to update the ML pipeline. In this paper, we introduce and formalise a process for collecting a FeedbackLog. We also provide concrete use cases where FeedbackLogs can be employed as evidence for algorithmic auditing and as a tool to record updates based on stakeholder feedback.


INTRODUCTION
Stakeholders, who interact with or are affected by machine learning (ML) models, should be involved in the model development process [2,22,27]. Their unique perspectives, however, may not be adequately accounted for by practitioners, who are responsible for developing and deploying models (e.g., ML engineers, data scientists, UX researchers) [16].
We notice a gap in the existing literature around documenting how stakeholder input is collected and incorporated in the ML pipeline, which we define as a model's end-to-end lifecycle, from data collection to model development to system deployment and ongoing usage. A lack of documentation can create difficulties when practitioners attempt to justify why certain design decisions were made throughout the pipeline: this may be important for compiling defensible evidence of compliance with governance practices [6], anticipating stakeholder needs [90], or participating in the model auditing process [52]. While existing documentation literature (e.g., Model Cards [51] and FactSheets [3]) focuses on providing static snapshots of an ML model, as shown in Figure 1 (Left), we propose FeedbackLogs, a systematic way of recording the iterative process of collecting and incorporating stakeholder feedback.
The FeedbackLog is constructed during the development and deployment of the ML pipeline, and updated as necessary throughout the model lifecycle. While the FeedbackLog contains a starting point and final summary to document the start and end of stakeholder involvement, the core of a FeedbackLog consists of the records that document practitioners' interactions with stakeholders. Each record contains the content of the feedback provided by a particular stakeholder, as well as how it was incorporated into the ML pipeline. The process for adding records to a FeedbackLog is shown in purple in Figure 1 (Right). Over time, a FeedbackLog reflects how the ML pipeline has evolved as a result of these interactions between practitioners and stakeholders.
To explore how FeedbackLogs would be used in practice, we engaged directly with ML practitioners. Through interviews, we surveyed the perceived practicality of FeedbackLogs. Furthermore, we collected three real-world examples of FeedbackLogs from practitioners across different industries. Each example FeedbackLog was recorded at a different stage in the ML model development process, demonstrating the flexibility of FeedbackLogs to account for feedback from various stakeholders. The examples show how FeedbackLogs serve as a defensibility mechanism in algorithmic auditing and as a tool for recording updates based on stakeholder feedback.
In summary, the main contributions of this work are:
(1) A new documentation structure, FeedbackLogs, that captures the iterative process of collecting and incorporating stakeholder feedback (Sections 2.2 and 3).
(2) Findings from practitioner interviews on the benefits and challenges of implementing FeedbackLogs in practice (Section 4.1) and an interactive demo tool to make FeedbackLogs more accessible and easy to use for practitioners (Section 4.2).

Background
Prior work has focused on documentation that provides a snapshot of the ML pipeline at a specific stage of the ML lifecycle (Figure 1 (Left)). We discuss a few, non-exhaustive, examples of these documentation strategies below. Model Cards describe how a model was developed, including who trained the model, when it was trained, and what data was used in the learning procedure, along with details of model development and the performance of the model on various metrics [51]. Similarly, FactSheets describe relevant information at each phase of the model's development: pre-training, during training, and post-training [4]. Explainability Fact Sheets summarise key features that lead a model to be more explainable [71]. Reward reports [31] frame an ML system as a reinforcement learning model, and record the decisions taken to optimise the system. Application-specific documentation aims to contextualise more general techniques for use within the domain of interest. For example, Healthsheet [63] is a questionnaire adapted from datasheets for datasets [30] to improve accountability for data collection and usage in the health domain. Unlike prior forms of documentation, we propose FeedbackLogs, which provide information on the iterative process of eliciting and incorporating feedback from multiple stakeholders throughout the model's lifecycle (Figure 1 (Right)). To the best of our knowledge, this is the first work that introduces a systematic way to record how stakeholder feedback has been incorporated into an ML pipeline. We note that FeedbackLogs can be used alongside existing documentation tools, which we describe further in Section 3.3.
The rise of participatory ML [43] has resulted in the incorporation of feedback from a diverse set of stakeholders. This raises issues such as "participation washing" [70] and a lack of clarity as to what is expected from stakeholders [9].
FeedbackLogs aim to clarify exactly what is expected from stakeholders and the effect of their participation. In addition to documenting model development, previous work has argued for a comprehensive understanding of the usage of a system, including algorithmic auditing [21,56] and critical refusal [29]. By tracking the reasons for decisions prompted by feedback, FeedbackLogs address the accountability gap [59] in the development of ML systems that elicit feedback from numerous stakeholders. A FeedbackLog provides more information than a one-off certification [36] and captures the iterative development process rather than a static snapshot [69].

FeedbackLog Components
To motivate the design of FeedbackLogs, we set out three desiderata against which their added value to the documentation process can be evaluated.
(1) Completeness: FeedbackLogs should provide comprehensive details about stakeholder feedback and subsequent practitioner updates.
(2) Flexibility: FeedbackLogs should be able to be integrated into the ML pipeline at any point. FeedbackLogs should also be able to handle the variability in the types and amount of stakeholder feedback, as well as the types of updates a practitioner may consider.
(3) Ease of Use: FeedbackLogs should come with minimal overhead for practitioners to adopt.
We propose a template-like design for FeedbackLogs with three distinct components (shown in Figure 2): a starting point, one or more records, and a final summary. We describe both the starting point and the final summary now, and the records in greater detail in the subsequent section. To illustrate how FeedbackLogs can be instantiated, we provide practical examples that have been completed by practitioners in Section 4.3.
The starting point describes the state of the ML pipeline before the practitioner reaches out to any relevant stakeholders. The starting point might contain information on the objectives, assumptions, and current plans of the practitioner. More generally, a starting point may consist of descriptions of the data, such as Data Sheets [30]; metrics used to evaluate the models; or policies regarding deployment of the system [79]. This component provides flexibility, since the FeedbackLog can capture any arbitrary starting point in the development process. A proper starting point allows auditors and practitioners to understand when in the development process the gathered feedback was incorporated, and defensibly demonstrates how specific feedback led to changes in the metrics.

Fig. 2. The FeedbackLog includes three sections: a starting point, one or more records, and a final summary. The records section is further divided into the interactions with the stakeholder(s) (elicitation and feedback) and the resulting updates taken by the practitioners (incorporation and summary).
The feedback from stakeholders is contained in the records section, which can house multiple records. A single record logs how the stakeholder was asked for feedback, the stakeholder's response, and how the practitioner used the stakeholder input to update the ML pipeline. Figure 2 shows the structure of one record, which contains the elicitation, feedback, incorporation, and summary sections for one source of feedback. Each record conveys enough information to satisfy completeness while not being so burdensome as to hinder ease of use.

The final summary consists of the same questions as the starting point, i.e. which dataset(s) and models are used after the updates, as well as the metrics used to track model performance. This component provides completeness by encapsulating the net effect of feedback from all the relevant experts. Proper documentation of the finishing point of the FeedbackLog allows reviewers to clearly establish how the documented feedback leads to concrete and quantifiable changes within the ML pipeline.
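To make the template concrete, the three components can be expressed as a lightweight schema. The sketch below is our own illustration in Python, not a prescribed format; the field names and types are assumptions:

```python
# A minimal sketch of the FeedbackLog template as a schema.
# Field names are illustrative assumptions, not a mandated format.
from dataclasses import dataclass, field
from typing import List

@dataclass
class Record:
    elicitation: str    # who was consulted, why, and what model information they saw
    feedback: str       # the stakeholder's response
    incorporation: str  # updates considered, their pipeline stage, and measured impact
    summary: str        # which updates were implemented and their effect on metrics

@dataclass
class FeedbackLog:
    starting_point: str              # datasets, models, and metrics before any feedback
    records: List[Record] = field(default_factory=list)
    final_summary: str = ""          # datasets, models, and metrics after all updates
```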

RECORDS
Each record in a FeedbackLog is a self-contained interaction between the practitioner and a relevant stakeholder. It consists of how the stakeholder was asked for feedback (elicitation), the stakeholder's response (feedback), and how the practitioner used the stakeholder input to update the ML pipeline (incorporation).

Elicitation
Every record in a FeedbackLog begins with a practitioner's request for feedback. Tracking how the request was made gives vital context for deciding how to act on the advice [74], and surfaces potential downstream issues, such as the use of leading prompts, omission of key information, or other problems in the feedback collection process.

Which stakeholder(s) are being consulted and why?
There are many stakeholders who can provide feedback to improve models [75]. Stakeholders may be internal to a practitioner's organisation (e.g., senior leadership, compliance officers, account executives) or external (e.g., regulators, auditors, review boards, end users) [7,25]. Documenting which stakeholder was consulted is an important part of the feedback procedure, since credit attribution is key to responsible innovation [39,72]. Crediting the source of feedback also helps stakeholders gauge if and when their comments are incorporated into the pipeline [46,55]. Additionally, it may be important to document why the particular stakeholder is being asked for feedback. For example, experts from different fields may be consulted to see whether something noteworthy (e.g., fairness considerations in a specific jurisdiction) has been overlooked. When many stakeholders are consulted for the same reason, as is the case in participatory ML, it is up to the practitioner's discretion whether each stakeholder should be in a separate record, or combined into the same record.
How is the relevant model information presented to stakeholders? While acquiring stakeholder feedback over a series of interactions [45], practitioners will need to decide what information about a model should be shown to the stakeholder. The information should help the stakeholder develop an appropriate understanding of the current pipeline. Approaches to communicate such information include socio-technical details [30,51], performance metrics [38,58], model explanations [15,26], and confidence estimates [8,41]. The content and presentation of model information will affect the stakeholder's downstream feedback [66].

Feedback
The content of feedback elicited from stakeholders is tracked in each record. Different stakeholders may tend to provide different kinds of feedback, and we illustrate examples below:
• End Users are individuals who may be affected by the pipeline. End users can provide feedback on desired model behaviour or on issues with existing model behaviour. For example, they might specify the kinds of behaviour that a model should not exhibit (e.g., a model should not be able to generate hate speech [12,48]).
• Regulators include compliance officers, internal review boards, and independent evaluators. Their feedback may include how to be compliant with regulations [18,78], policies [17,73], or industry standards [17,53]. These pieces of feedback would need to be translated into concrete actionable updates, which we discuss shortly.
• Domain Experts are individuals with prior experience and knowledge about the context of the ML pipeline.

Incorporation
Once stakeholders have provided feedback, practitioners can leverage their input to improve the model. It is imperative to document the update process, as there are many different ways (i.e., types of updates) in which a single piece of stakeholder feedback could be incorporated. These updates to the ML pipeline can be largely clustered into model updates or ecosystem updates, which we now describe in more detail.

Model updates.
It is often feasible to incorporate targeted feedback by making direct changes to the ML model.
We focus our discussion on the common supervised learning setting, where a practitioner minimises a loss function on a dataset to learn a model that has many parameters; any one of these aspects of the model could be changed in response to feedback provided. Common model updates include dataset, loss function, and parameter space updates (a more extensive list can be found in Chen et al. [14]):
• Dataset updates. Feedback can be incorporated by adjusting the dataset of a model, i.e. by adding, modifying, or removing data [23,79,85]. In addition to active data collection [40], dataset updates may take place in an unsupervised way [32,37,47,61].
• Loss function updates. Feedback can also be used to update the loss function, thus changing the optimisation objective of the model. It is possible to add constraints to the model which may capture normative notions, such as fairness or transparency [44,89], as well as practical considerations, like resourcing or robustness [28].
• Parameter space updates. Feedback can be incorporated by changing the architecture or features of the model [20,62], which affects the model parameter space. These updates traditionally require more technical users, although user-friendly interfaces have been developed to allow even non-technical experts to edit the model in a more direct manner [80,87], even in models with many billions of parameters [33,49,50].
Implementing such changes to a model requires the practitioner to translate stakeholder feedback into a concrete update, which can be challenging. Not all updates naturally fit in this decomposition. For instance, in large language models [10,12], the structure and context of the prompts used to elicit generations can have a substantial impact on the model's output [81,92,93]. Prompts are not necessarily "data", nor parameters; their updates are nonetheless worth tracking, and naturally fit within the purview of a FeedbackLog.
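As one illustration of such a translation, the sketch below shows how a fairness request might become a loss function update. It is a hypothetical example with assumed names, not drawn from any of the pipelines discussed here:

```python
# Hypothetical sketch of a loss function update: a stakeholder asks that the
# model not favour one demographic group, so a demographic parity penalty is
# added to a standard binary cross-entropy objective.
import numpy as np

def bce_loss(y_true, y_prob):
    eps = 1e-12  # avoid log(0)
    return -np.mean(y_true * np.log(y_prob + eps)
                    + (1 - y_true) * np.log(1 - y_prob + eps))

def parity_gap(y_prob, group):
    # Absolute difference in mean predicted positive rate between two groups.
    return abs(y_prob[group == 0].mean() - y_prob[group == 1].mean())

def updated_loss(y_true, y_prob, group, lam=1.0):
    # lam trades accuracy off against the fairness constraint; the chosen
    # value would itself be justified in the record's incorporation section.
    return bce_loss(y_true, y_prob) + lam * parity_gap(y_prob, group)
```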

Ecosystem updates.
In many practical settings, making only model updates may be insufficient or ineffective to account for a piece of feedback, requiring modifications to the broader ecosystem. Here, ecosystem refers to the socio-technical realm in which the ML pipeline lives. We now describe parts of the ecosystem that can be altered upon receiving feedback.
• Documentation. Feedback can increase the need for documentation. For instance, if the practitioners are made aware of audit requirements (e.g. as outlined in the drafts of the EU AI Act [19] and the Canadian AI and Data Act [54]), then practitioners might be required to log aspects of the model and its development that have not been considered before. Such aspects could be an additional metric to include in the Model Cards, properties of the dataset that should be in the Datasheet, or a set of specifications that must be reflected in policy documentation.
• Interface or UX Updates. Feedback from end users is essential to ensure a smooth user experience (UX) [86].
Insights into their perception and usability issues with the interface are required to tailor it to their needs. Changes may include considering the perceived trustworthiness of the model [57], the required level of interpretability of how the model arrived at a specific decision [75], or even the emotional relationship with a model [67]. These aspects are often addressed via interface changes (e.g. providing forms of explanation [83] or recourse [65], or anthropomorphising the model [91]).
• Accountability Structure. Stakeholders might provide insights into risks that are inherent to a pipeline's use case. Whilst it could be difficult to directly incorporate such feedback into an ML model [74], it might prompt practitioners to identify appropriate strategies to address these risks. For instance, they could establish monitoring processes that detect the manifestation of such risks early on, paired with an action plan with clearly defined responsibilities [54]. This increased awareness would help ensure that the practitioners understand the risks and their role in preventing potential harms [60,64].
• Deployment Details. It may be appropriate to update the intended usage and scope of the pipeline. This includes details of scenarios in which the model is expected to function appropriately, scenarios that should be avoided (e.g. due to data or model drift), or the recommended level of human oversight (and the required expertise of the monitoring individual) [13]. This could, for instance, be realised in a guidance document issued with the model, similar to a manual, that details the best practices of pipeline implementation and usage, as recommended in [19,54]. Such guidance could include where and why pipeline failures may occur with a higher likelihood, how to prevent such failures, what data can and cannot be used in certain circumstances, and generally how to ensure optimal model operation [69]. By outlining the context of proper system operation, operators can quickly establish best practices.
Model and ecosystem updates are not necessarily exclusive, since both forms of update may be suitable for a given source of feedback. For example, a practitioner may change both a dataset and loss function, while also adding further details regarding best practices of model use. We note that some types of feedback (e.g., subjective or qualitative feedback) may be more difficult to translate into updates, which should be noted in the record. The incorporation section of a record also tracks the following two aspects of the implemented updates:
At which stage of the ML pipeline is the update located? The feasibility of updates is partly dictated by the current stage of the ML pipeline. Thus, the documentation of where in the pipeline an update is located is part of the justification for the choice of update. Common updates for each of the stages are described below:
• Data Collection (pre-training): This is typically when updates are made to the ecosystem or to the dataset (e.g., adding data from underrepresented groups). Other updates might also include feature engineering or model class selection [76].
• Model Deployment (post-training): Even after the model has been developed, ecosystem-level updates (e.g., interface updates and changes to deployment details) can still occur. We note that the lifecycle of the ML pipeline is not linear; it may be necessary to return to earlier stages and consider their relevant updates.
How do we measure the impact of the update? The final part of this section is a description of how the update(s) affected downstream metrics of interest that were spelled out in the starting point. To the extent possible, practitioners should explore performing individual updates, rather than implementing multiple updates simultaneously, to disentangle the isolated effects of the individual updates. This measurement can be used when comparing multiple updates to explain the reasoning for selecting from a set of updates, thus demonstrating that alternatives were considered and ruled out for legitimate reasons. The practitioners may choose to refrain from implementing potential updates, making the justification for inaction in the FeedbackLog even more important.
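The sketch below illustrates this practice under assumed, toy names: each candidate update is applied to a copy of the baseline in isolation, so the record can attribute metric changes to individual updates:

```python
# Hypothetical sketch: evaluating candidate updates one at a time so that the
# record can justify which update was selected. The models, updates, and
# metric are toy stand-ins, not from any pipeline discussed in this paper.

def accuracy(model, data):
    return sum(model(x) == y for x, y in data) / len(data)

def compare_updates(base_model, candidate_updates, data):
    """Return the metric for the baseline and for each update in isolation."""
    results = {"baseline": accuracy(base_model, data)}
    for name, apply_update in candidate_updates.items():
        results[name] = accuracy(apply_update(base_model), data)
    return results  # logged in the incorporation section of the record

# Toy usage: two candidate updates compared against a trivial baseline.
data = [(0, 0), (1, 1), (2, 0)]
baseline = lambda x: 0
candidates = {
    "threshold_update": lambda m: (lambda x: int(x >= 1)),
    "constant_update": lambda m: (lambda x: 1),
}
print(compare_updates(baseline, candidates, data))
```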

Summary
Each record contains a summary of the updates that describes what updates were considered and what their effect was on the metrics of interest. Since each record may consider multiple potential updates, it is important to state which updates were ultimately implemented. To enhance readability, the summary should capture the impact of updates while minimising the amount of technical detail about the specific updates.

TOWARDS FEEDBACKLOGS IN PRACTICE
We intend to make FeedbackLogs effective for real-world projects. The following section describes three steps that we undertook to bring the FeedbackLog concept closer to practice, as well as to uncover considerations which could affect implementation and usage in real scenarios. First, we collected practitioner perspectives on the concrete implementation of FeedbackLogs. Second, we created an open-source FeedbackLog generator to make the concept accessible to practitioners, as well as to ease the collection of practitioner feedback. Third, we completed example FeedbackLogs based on consultations with practitioners working on ML pipelines.

Practitioner Perspectives & Future Developments
We conducted semi-structured interviews with three practitioners to gain insight into how FeedbackLogs could be implemented in practice (see Appendix C for the interview guide and details of the method). The responses are summarised below.
Responsibilities. All practitioners expected that a single person would be responsible for the completion of FeedbackLogs for a specific system, i.e. the FeedbackLog owner. This person might be the UX researcher, product manager, analyst, or engineering manager, depending on the type of feedback and development stage. The FeedbackLog owner would frequently draw on the expertise of other roles to provide input, e.g. on developers to establish the feasibility of technical updates, or the UX designer to propose potential UI solutions. Thus, future versions of FeedbackLogs should have the ability to assign the completion of FeedbackLog sections to a specific role or person.
Timing of FeedbackLog Completion. The timing of when to complete a FeedbackLog evoked varied responses from the practitioners. For smaller, more confined rounds of feedback collection, as in the image recognition example below (Figure 4), post-hoc completion by the analyst was deemed sufficient by a practitioner. However, they agreed that for feedback loops involving more participating parties, the FeedbackLog should be filled out alongside the stakeholder involvement process to provide a common point of reference for everyone involved.
Expected Benefits of Implementing FeedbackLogs. The practitioners confirmed many of the benefits of FeedbackLogs mentioned in the previous sections, e.g. the predefined structure that allows for fast information gathering and the benefits regarding audits, accountability, and transparency. The practitioners also suggested that FeedbackLogs might improve communication and knowledge-sharing within organisations. One practitioner mentioned that the product team around the ML model was working with a different information management software than the technical team. They mentioned that this was especially true for A/B tests: the technical team members often had no context around why specific versions were developed and compared, and even lost track of the different versions themselves due to distributed and contradictory information. This resulted in communication issues. FeedbackLogs could serve as a single source of truth that includes links to the other, more specific software. Additionally, an interviewee described FeedbackLogs as a repository of past mistakes, solutions, and best practices. If an issue emerged, the log could be used to trace the source of the issue, as well as to identify past reactions to similar issues and the (long-term) effects of these reactions.
Expected Challenges of Implementing FeedbackLogs. The practitioners anticipated several challenges during the practical implementation of FeedbackLogs, which are listed below.
Log Access. It is essential to consider who would be able to view a FeedbackLog, who could amend it, and who would be able to assign these access rights. Since one of the main benefits of FeedbackLogs is that they can increase transparency and accountability, we propose maximum internal viewing access with minimum edit rights. However, this should be customisable to the specific needs of a team. Thus, we plan to incorporate the ability to assign and restrict access in further versions of the FeedbackLogs.
Scalability: Search and Linking FeedbackLogs. FeedbackLogs will be created by different FeedbackLog owners along the entire ML pipeline. Additionally, large organisations often have numerous teams working on various ML models, each of which might require input from many stakeholders. Two practitioners mentioned concerns around organising FeedbackLogs and establishing a structure between the individual entries. Future versions of FeedbackLogs could address this concern via the ability to link and search FeedbackLog entries. In many cases, linking FeedbackLogs is essential to trace decisions: for example, initial exploratory user research often scopes product requirements first. These are refined with further user research, as well as consultations with the technical team regarding feasibility, both resulting in further FeedbackLogs with more detailed technical requirements. The FeedbackLogs of these different steps should be linked, so it is clear which insights prompted which technical solution.
Logistical Trade-offs. Completing a FeedbackLog involves a compromise between detail (e.g. the number of different incorporation strategies considered, or the level of description of the final update) and labour. Two practitioners mentioned that it might be a nuisance for the FeedbackLog owner to chase the different required inputs from several team members. However, they agreed that future auditing processes will require detailed process logs for many systems.
The current version of FeedbackLogs already offers a high degree of flexibility regarding the depth and detail provided, allowing practitioners to complete it following the depth-labour balance that they deem fit.We plan to maintain and further develop this flexibility in future FeedbackLog versions.

Summary. The collected practitioner perspectives offered valuable insights into aspects of the FeedbackLogs that could be improved to increase their fit within existing ML pipelines. In addition to the concerns mentioned by the practitioners, we identified three further challenges for practical applications of the FeedbackLogs, given in Appendix D. To facilitate the collection of stakeholder insights, as well as to make FeedbackLogs accessible for first practical use cases, we introduce an online demo that allows for the quick generation of a FeedbackLog.

FeedbackLog Demo
To ease and encourage the adoption of FeedbackLogs, we provide an open-source FeedbackLog generator. We acknowledge that this demo is a prototype, solely meant to illustrate the components of FeedbackLogs and to gather feedback on how they may be incorporated into existing workflows. Our tool consists of two components: a web interface for stakeholders, practitioners, and auditors to interact with; and a command-line interface (CLI), shown in Figure 7, to enable practitioners to track updates at the source code level. The FeedbackLog generator addresses the three desiderata described in Section 2.2:
(1) Completeness: The tool covers the spectrum of possible update types: all feedback and ecosystem-level updates are logged in the web interface, while model-level updates are tracked by the CLI.
(2) Flexibility: The web interface is designed to be ecosystem-agnostic, providing a universal interface that can be used alone or with other logging methods. At the time of writing, the CLI only supports Python [77].
(3) Ease of Use: The web interface contains prompts for expert feedback and structures a FeedbackLog automatically. To ensure all feedback is incorporated, the CLI has a built-in checklist consisting of the FeedbackLog components, which can be integrated into a practitioner's existing workflow.
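To give a sense of what such a generator captures, a single record might serialise along the following lines. This is our hypothetical sketch, loosely based on the asthma example in Section 4.3; the field names are assumptions, not the tool's actual format:

```python
# Hypothetical serialised record; fields mirror the record structure in Fig. 2.
record = {
    "elicitation": {
        "stakeholder": "clinician (domain expert on asthma)",
        "reason": "no statistical metric was defined at the starting point",
    },
    "feedback": "Clinicians advised which clinical considerations the "
                "evaluation metrics should reflect.",
    "incorporation": {
        "updates_considered": ["add details to metrics", "fine-tune the model"],
        "update_implemented": "add details to metrics",
        "pipeline_stage": "pre-training",
    },
    "summary": "Evaluation metrics defined in consultation with clinicians.",
}
```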
In the future, our tool can be extended to a richer interface with which both stakeholders and practitioners can interact. This would ease the creation of, and updates to, FeedbackLogs, as stakeholders could provide feedback within the tool and practitioners would update the pipeline using our CLI integration. Such a tool would also reduce the burden of maintaining a FeedbackLog.

Example Logs
We now walk through concrete examples of FeedbackLogs: three FeedbackLogs obtained from industry practitioners on ML pipelines still in development and one demonstration log using a real dataset and model that shows a completed pipeline.

4.3.1 FeedbackLogs in Industry. We collected FeedbackLogs from three practitioners for ML pipelines that they are working on or have recently worked on. They were provided with a blank FeedbackLog template that they completed in their own time. More details on the methods can be viewed in Appendix C. Since the practitioners chose ongoing projects, we refrain from providing the Final Summary section. Additionally, to avoid sharing specific information about proprietary ML models, these FeedbackLogs focused on higher-level pieces of feedback that practitioners have encountered. As such, more complete FeedbackLogs in practice may be much lengthier or messier than the examples that we provide. We describe the projects and the learnings from each FeedbackLog below.
Asthma Conversation Agent: This FeedbackLog describes the project of a national healthcare body to develop a conversational agent for asthma patients, operating via WhatsApp. The aim is to help patients with the management of their condition, including the prediction of the onset of asthma attacks. The FeedbackLog (Figure 3) contains two records that demonstrate how practitioners can track the needs of stakeholders. At the starting point, no statistical metric had been defined by the practitioners; however, the log provides evidence that the metrics eventually used in the project were informed by consultations with clinicians, who are domain experts on asthma. In case of an audit, practitioners can demonstrate how alternate methods of incorporating the feedback were considered, in this case adding details to metrics or fine-tuning the model. The summary captures why a particular update (i.e., metric details) was selected.
Recommender Systems: This FeedbackLog describes a model developed by a large streaming platform, aimed at increasing the engagement of subscribed users. The FeedbackLog (Figure 5) shows how the structure of the log is capable of capturing end-user needs and translating this feedback into a concrete UX update. This update manifests as the addition of a "like" button to gauge user preferences over repeated interactions, and improves the click-through rate metric used to measure performance for this application.
Sexual Health: This FeedbackLog concerns the healthcare domain, focusing on sexual health. A national healthcare provider developed a model to automatically offer treatment to patients suffering from chlamydia symptoms, based on their answers to a questionnaire. The aim of the described stakeholder involvement was to identify accessibility and usability issues for vulnerable demographic groups, who risk inaccurate treatment allocation. While both the previous FeedbackLogs document feedback provided to projects where the ML pipelines are already set up, this FeedbackLog (Figure 6) captures changes that occur in the data collection phase, before a model is even trained. The feedback collected from patients and psychologists informed practitioners that their data collection must better accommodate individuals from vulnerable demographic groups. This log could be used as evidence to demonstrate how the organisation took into account the conditions of vulnerable patients, who now have an alternative method for having their data collected in a way that minimises the risk of unrepresentative data.
The example FeedbackLogs provided useful insights into the template's ability to represent the feedback collection and model updating process. The FeedbackLogs concisely tracked the incorporation of feedback for each project, showcasing the flexibility of the FeedbackLog to describe changes to the pipeline at various stages.

4.3.2 Demonstration of a Complete FeedbackLog. While the three industry examples demonstrate how FeedbackLogs can be used in the real world, industry practitioners are prevented from sharing proprietary information about the exact models that are being developed. As such, we provide a demonstration of a complete FeedbackLog, which uses a real dataset and model, and includes details of technical updates. We consider a hypothetical scenario wherein a practitioner is developing an image recognition model for automotive vehicles.

Image Recognition: This FeedbackLog (Figure 4) shows records that track non-technical, ecosystem updates as well as technical, model updates. In this case, two updates (to the parameter space and dataset) needed to be applied simultaneously, since no individual update was sufficient to meet the metric requirements. However, we note that individual updates are still tracked. This FeedbackLog contains a final summary, as the updates from the second record satisfy the specified metrics.

CONCLUSION
Stakeholder engagement is important to consider when deploying ML pipelines. However, even when stakeholders are consulted by practitioners, their feedback is rarely tracked and incorporated in a systematic manner. In this work, we propose FeedbackLogs: a tool for practitioners to document the process of collecting and incorporating stakeholder feedback into the ML pipeline. FeedbackLogs are designed to be complete, flexible, and easy to use. Through real-world examples, we demonstrate how FeedbackLogs can record a wide variety of stakeholder feedback and capture the resulting updates made to ML pipelines. We hope FeedbackLogs usher in the development of extensible tools for practitioners to empower the voice of a diverse set of stakeholders.

C PRACTITIONER ENGAGEMENT: METHODS
The following section provides details on the methods of the practitioner engagement steps described in Section 4, i.e. the semi-structured interviews (Section 4.1) and the example FeedbackLogs (Section 4.3). Both steps were approved by the Ethics Committee of the University of Cambridge.

C.1 Semi-Structured Interviews
The ML practitioners for the semi-structured interviews were recruited via the personal or professional networks of the researchers. Each interview lasted between 45 and 60 minutes and was conducted via a video call. The three practitioners had different roles along the ML pipeline, i.e. UX researcher, developer, and engineering manager, with varying levels of experience (from one year to over five years). The interviews followed an interview guide in a semi-structured manner. The guide included sections on (1) the practitioners' experience and role within ML, (2) their awareness and practices around current stakeholder involvement and their perception of this, (3) high-level feedback regarding the idea and usefulness of FeedbackLogs, and lastly (4) feedback on the timings, responsibilities, and challenges they foresee when applying FeedbackLogs in a specific scenario. There was time for the participants to ask questions and add additional thoughts. The fourth section was supported with a Miro board that displayed an empty FeedbackLog template. This template was used to discuss the order in which the different sections would be completed in practice, the responsibilities for completing the different sections, and the agency of the different roles in determining the content of these sections. The interviews were recorded, summarised, and analysed using thematic analysis [11].

C.2 Example FeedbackLogs
As with the semi-structured interviews, the practitioners consulted for the example FeedbackLogs were recruited via personal and professional networks. Two practitioners worked in UX research and design, and the third was a developer. They had between three and nine years of experience in their roles. The practitioners were provided with an online document that included the sections of the FeedbackLogs as headers, with a short description of the content that such a section would entail. They were then asked to complete each section for an ML project they are working on or have recently worked on. This could be done asynchronously in their own time. The completed documents form the core of the example FeedbackLogs, with slight edits and cuts to increase conciseness.

D ADDITIONAL CONSIDERATIONS
In addition to the concerns mentioned by the practitioners, we identified three further challenges for practical applications of the FeedbackLogs.
Measurability of Impact. Assessing the impact of an update implemented in response to stakeholder feedback can be challenging. Some updates have effects which are hard to define empirically, such as trust or accessibility. In such cases, practitioners could consider expanding the tracked metrics to give a more holistic picture of the pipeline and its objectives.
Reproducibility. If third parties rely on FeedbackLogs to reproduce models and replicate a development process, it is essential that practitioners meticulously create and maintain their FeedbackLogs with sufficient detail. For some pipelines, this may include the need to track how much of the pipeline was procured from third-party vendors. For instance, if a practitioner fine-tunes a procured large language model [10,12] for a particular task, they should denote this in the FeedbackLog, but also request thorough documentation of the base model.

Fig. 1. (Left) Existing documentation uses static snapshots of a model to document an ML pipeline. (Right) In contrast, we propose FeedbackLogs to track the iterative development process. Herein, practitioners engage stakeholders for feedback and update the ML pipeline accordingly. While a FeedbackLog contains a starting point and final summary to bookend stakeholder involvement, the bulk of the FeedbackLog is the records that document practitioners' interactions with stakeholders (shown in purple).
