Apple's Knowledge Navigator: Why Doesn't that Conversational Agent Exist Yet?

Apple's 1987 Knowledge Navigator video contains a vision of a sophisticated digital personal assistant, but the natural human-agent conversational dialog shown does not currently exist. To investigate why, the authors analyzed the video using three theoretical frameworks: the DiCoT framework, the HAT Game Analysis framework, and the Flows of Power framework. These were used to codify the human-agent interactions and classify the agent's capabilities. While some barriers to creating such agents are technological, other barriers arise from privacy, social and situational factors, trust, and the financial business case. The social roles and asymmetric interactions of the human and agent are discussed in the broader context of HAT research, along with the need for a new term for these agents that does not rely on a human social relationship metaphor. This research offers designers of conversational agents a research roadmap to build more highly capable and trusted non-human teammates.


INTRODUCTION
In 1987, before the advent of widespread mobile phones, email, or web browsing, Apple released the Knowledge Navigator (KN) video (https://en.wikipedia.org/wiki/Knowledge_Navigator, https://youtu.be/umJsITGzXd0), a vision of what human-computer interaction might look like in the future of 2011. A salient component of that vision is a rich conversational dialogue between a professor and his digital assistant. While news media have generally praised the Knowledge Navigator video for the accuracy with which Apple forecast Siri [62,63], the conversational ability and shared context between human and agent shown in the video still exceed those of today's digital assistants [65], even in the era of ChatGPT. In the present era of emerging human-agent teaming and human-robot interaction, it is worth analyzing why the sophisticated human-agent interaction presented in the KN video does not yet exist. The present paper considers technological barriers (computational requirements) as well as whether users would be willing to share enough personal information to support that form of interaction (cultural requirements of shared context).
Just as a service blueprint elaborates on the technical infrastructure needed to support a customer journey [6,78], quotations from the KN video can be used to envision the potential technology infrastructure required to support those utterances, as well as the user expectations and norms that would lead to the words being said. For example, Professor Mike Bradford says to his digital assistant, Phil, "Pull up all the new articles I haven't read yet." Phil replies, "Journal articles only?" This exchange suggests that Phil and Mike have a shared context of Mike's reading history. This implies that Phil has a way of monitoring everything academic that Mike reads, can integrate that knowledge with an overall model of Mike's knowledge, and that Mike willingly accepts that level of surveillance. It also implies that Phil, the agent, understands that in the academic publishing domain of many disciplines, journal papers have higher desirability than other forms of articles. This interaction is one of several intelligent conversations in the KN video (others noted below) that exceed current agent capabilities.
Because of shared context and common ground, Mike's trust in the agent Phil is very high, Phil is highly adept with ambiguous tasks, and Phil can maintain a conversational topical thread over multiple utterances. This gap between current chatbot technology and the abilities demonstrated in the KN video motivates the following research question: What constraints prevent the widespread adoption of agents that can hold these kinds of conversations?
To analyze the entire KN video more systematically, the authors adopted three theoretical frameworks that can be useful for analyzing teaming between humans and agents: the Distributed Cognition for Teamwork (DiCoT) model [31,32], the Human-Agent Team Game Analysis Framework [69], and the Flows of Power (FoP) framework [51]. These frameworks enabled a thorough examination of the cognitive dynamics, human-agent interactions, and power relations within the video. To contextualize the following research, it is worth briefly reviewing the KN video itself, whether it is valuable to examine future-focused media, what the existing research field of human-agent teaming has to offer, and why the three theoretical frameworks (DiCoT, the HAT Game Analysis Framework, and FoP) have been chosen for this analysis.

Apple Knowledge Navigator Video
The KN video was produced in 1987 by Apple in only six weeks as part of their presentation at Educom, a higher education conference [17]. The Educom presentation, including the KN video, was repeated at Macworld in January 1988, which was the public's and the media's first glimpse of Apple's newest vision of human-computer interaction [17]. A key component of this interaction was the main character, Mike, a professor at UC Berkeley, speaking with an intelligent digital assistant, Phil. Phil appears human and exists on the Knowledge Navigator, a foldable touchscreen tablet [4].
In the video, Mike enters his office, opens the Knowledge Navigator tablet computer, and Phil (the agent) begins speaking: "You have three messages..." Phil then describes Mike's daily schedule, and when "lecture at 4:15" comes up, Mike engages Phil to help him prepare slides for that talk. Along the way Mike notes, "There's an article about five years ago, Dr. Flemson or something," and Phil finds it: "John Fleming of Upsala University." During this search, they mention Mike's colleague Jill, who has published a recent article. Phil demonstrates the ability to pull information from multiple sources, combine it in real time, and offer simulations per Mike's requests. Eventually, they engage Jill in a videoconference to discuss her findings. The call has many features of Zoom, including screensharing (but no one says, "Can you see my screen?"). Phil is an equal participant in the video conference. At one point, when Mike and Jill are talking, Mike forgets what time his lecture is: "That's not until, um..." and Phil takes the initiative to fill in for him helpfully by saying, "4:15." Mike and Jill collaborate further and wrap up the call. Then Mike prepares to go to lunch and leaves Phil with several requests to fulfill while he is gone. This 5.5-minute video offers a vision of what conversational agent collaboration might look like. Phil is dressed in a bowtie, somewhat connoting a butler appearance, and serves as the consummate trusted personal assistant.

How Analysis of Fiction Can Aid Design: The Value of Design Fiction
One may wonder whether analyzing an imaginary technology can aid in the design of a real technology, since the underlying assumptions of the imaginary system might be unrealistic. Common UX practices such as personas, user stories, and customer journeys point to the power of narrative and storytelling, and HCI researchers have benefitted from narrative as well, e.g., [45,50,75,88], especially the genre of "design fiction," to use a term coined by Bruce Sterling, as noted by Mark Blythe [7]. Creating a story of a user's positive experience with a future technology challenges current designers to ask, "Why couldn't we create that now?" and "Should we create that?" This challenge is part of the motivation for the current paper. More broadly, design fiction like the KN video allows users to ask themselves, "Would I want to use a system like that?" Since science fiction creators are focused more on entertainment than systematically predicting the future [84], analyzing the accuracy of design fiction predictions is not the goal for the HCI researcher [4]; instead, the goal is eliciting possible user desires. Although some movies, like Stanley Kubrick's 2001: A Space Odyssey (1968), are later praised for their technological accuracy and, in the case of 2001, are even used by NASA as part of training [25], science fiction scenarios typically only correctly posit general trends. The Age of Spiritual Machines [41] predicted that by 2009, people would wear over a dozen portable computers embedded within their clothing and jewelry. While this prediction broadly aligns with the trends of decreasing size of mobile technology and increasing use of wearables, the timeline and extent of these trends do not align with the prediction. This gap reinforces the idea that the value of analyzing past speculation about the future lies beyond the direct evaluation of predictions. Researchers seeking design inspiration from future-focused media have proposed a collaborative relationship in which design fiction influences the trajectory of current products in development, and new technology, in turn, inspires creators' environments [68]. Workshops and courses at CHI and SIGGRAPH on how HCI can benefit from science fiction have also amplified this interest [48,53]. Based on this relationship, this paper focuses on the design requirements of the attractive human-agent interaction envisioned in the KN video. To explore this area, it is important to be aware of existing research in human-agent teaming.

Human-Agent Teaming (HAT)
Human-agent teaming (HAT) explores synergistic collaboration between humans and agents. It encompasses diverse aspects like shared goals, effective communication, mutual understanding, and coordinated actions [44]. Related terms for this concept include human-autonomy teaming, human-AI teaming, collaborative AI, cooperative human-machine interaction, augmented intelligence, and joint human-agent operations. Several components of HAT overlap with research in human-robot interaction, though physical interactions and issues of proximity are typically left to the robotic discipline. In HAT discourse, one central debate in the field is whether autonomous agents should be framed as teammates or tools [10]. Some researchers propose regarding autonomous agents as potential teammates, while others view these agents as smart, yet limited, assistants, like trained dogs or horses [77]. Researchers taking the teammate perspective may measure the non-human agents' social and team-related skills when focusing on "human-likeness" or "teammate-likeness" [87,89]. When researchers treat the collaborative agents as members of a different species, on the other hand, the concept of machine teammates becomes less about human-like characteristics and more about their ability to contribute effectively to team interactions [18].
If a user can adopt the familiar mental schema of a tool or teammate, the implications for interacting with an agent can be easily inferred, reducing cognitive load. If an agent is a tool, it is designed with a limited function for specific situations, and it is not used outside of its intended function (i.e., the agent does no learning, has no initiative, has little authority, and does little sensing of context). It may also need regular maintenance, like sharpening a kitchen knife. If an agent is a teammate, it should have the capabilities to learn, make its own decisions, and coordinate with humans. This includes ensuring that its human teammates have an awareness of its relevant capabilities, plans, assumptions, and reasoning processes (automation awareness) [14]. Providing an agent design with affordances that enable the human to learn these capabilities easily could lead to the agents being considered teammates.
Moreover, another critical factor that shapes how users interact with autonomous agents in HAT is anthropomorphism, which refers to the degree to which an agent is perceived as human-like. Humans tend to attribute human-like qualities to objects, even when they do not appear visually human-like [54]. For example, in the domain of computing, as well as in the domain of naming ships, the smarter the machine (or the larger and more complex the ship), the more likely it will be named and ascribed some human characteristics [12,37]. Naming a self-driving car and giving it a human-like voice can increase trust and competence perception [83]. This anthropomorphism is complex and varied, affecting the treatment of agents [67] and likely affecting whether the agent is treated as a teammate or a tool. It is worth noting that human-like appearance (e.g., having a face) is not the critical focus of this anthropomorphism; a key focus is human-like interactions that support information exchange [54].
As robots and software agents have progressed, people have envisioned them playing a wide variety of social roles, including security guard, butler, housecleaner, and gardener [33]. The attempt to leverage these metaphors (previously known mental schemas for social relationships) makes sense, in that doing so can save people the cognitive load of establishing new schemas or metaphors [21]. However, the current schemas for assistants are not even clearly defined for humans. Current human resources titles for these roles vary, ranging from administrative assistant to personal assistant, executive assistant, and even hybrid terms like executive/personal assistant [15]. Also, the extent to which such an employee handles personal tasks for the employer varies, and previous research on domestic workers' relationships with employers has documented the tension resulting from unclear boundaries on the assistant's work and the conflicting roles that can arise, e.g., "Am I a housecleaner or a caregiver for your children?" [1,29]. An employee in an ambiguous position such as this may be concerned about how much work there is to do or about the prestige of their job role. This phenomenon illustrates how difficult it can be to define a specific working relationship among humans.
One might consider human-agent relationships simple, since the human is typically in charge, but they often are not. Studies of soldier-robot relationships in the social robotics discipline have described extensive anthropomorphism and emotional attachment to robots [13]. When a robot company produced an online video of their robotic dog product and showed an engineer kicking it to demonstrate the robot's ability to remain balanced, some people were concerned about this "violent" act [74]. Similarly, one student co-author reported being asked by their parents to speak nicely to the digital assistant in their home, rather than telling the agent to "shut up." The key issue is whether a human role schema like "butler" or "assistant" is an appropriate fit for an agent. Theoretically, a software agent is indefatigable and can perform as much work as a user can assign. Also, the agent will have no concern about the prestige or status of its work. Finally, because the agent is not human, with feelings that can be hurt, human users can feel free to treat the agent more like a tool than a teammate. The fact that the insult "corporate tool" exists, suggesting that someone is just a gear in a giant machine-like corporation, with little initiative or agency, acknowledges that humans already periodically treat other humans as tools, and that attitude could easily apply to agents. More technically, this phenomenon could be described as high power distance in the workplace [59], with the employer viewing the employee as beneath them in status rather than as a friend or colleague. Overall, it is worth analyzing the human-agent dialogue within the KN video to explore the role that Phil plays in Mike's work and life and how that role might be described.

Analysis Frameworks
Multiple frameworks were considered for analyzing the KN video, particularly concerning the interaction dynamics between human and agent. Activity theory [5], common ground theory [52], and communication theory [76] were deemed less focused on the specific dynamics of collaborative cognition or lacked explicit treatment of collaborative cognitive processes. The Distributed Cognition for Teamwork (DiCoT) framework was chosen based on its alignment with the joint problem-solving and knowledge-sharing activities observed in the video involving Mike and Phil. The HAT Game Analysis Framework was chosen based on its explicit characterization of agent characteristics and the human-agent relationship. Other HAT frameworks, such as HACO [22], RL-HAT [49], and modeling using TDF-T diagrams [26], are focused more on the underlying software development and simulation aspects of human-agent interaction, rather than on the user's interactions. For examining power dynamics, the Flows of Power (FoP) framework [51] was selected for its capacity to uncover real-time power shifts not explicitly addressed by other frameworks. For example, Dugger's power framework [23] concentrated on macro-level power structures, Delnitz & Rödder's framework [20] primarily focused on static power structures, and feminist standpoint theory [3] focused on broader societal structures.

1.4.1 Distributed Cognition for Teamwork (DiCoT). Distributed Cognition [34] focuses on cognition's distributed character among team members. Hutchins extended the purview of distributed cognition beyond individuals by illustrating how team members collaboratively transform and propagate the information within a system. To achieve a more organized approach to analyzing sociotechnical systems, Furniss & Blandford [28] coded the distributed cognition model and proposed a framework named Distributed Cognition for Teamwork (DiCoT). The DiCoT framework groups 18 principles of distributed cognition into three fundamental components that collectively shape the cognitive processes within a team: (1) physical layout, (2) information flow, and (3) artifacts. Physical layout pertains to the spatial arrangement and distribution of team members and resources, which influences their interactions and access to information. Information flow focuses on how information is exchanged and shared among team members, highlighting communication patterns and channels that affect overall teamwork efficiency. Artifacts refers to the external tools, technologies, and objects employed by the team to support their cognitive tasks, emphasizing how these aids enhance problem-solving and decision-making capabilities.
1.4.2 Human-Agent Team Game Analysis Framework. Various frameworks have been proposed to answer questions about an agent itself, such as what type of agent it is, how much autonomy it has, and what roles it plays [39,55,66,69,81]. Tokadli et al. [81] introduced a human-agent interaction framework, presenting five characteristics of HAT interaction. Building upon Tokadli et al.'s work [81], Sepich et al. [69] further expanded that approach to propose the HAT Game Analysis Framework. This framework is not limited to analyzing games; it enables characterization of HAT relationships more generally, using dimensions of team configuration, agent type, level of autonomy and intelligence, control mode, interdependence, type of interactions, and interaction timing.

1.4.3 Flows of Power (FoP). To extend this analysis beyond the cognitive and agential aspects, the authors used the FoP framework [51]. FoP offers a lens for understanding micro-level power dynamics within collaborative settings. Unlike traditional frameworks, FoP describes the relationship between the intended interactions within a collaboration (designed interaction orders) and the actual interactions (emerging interaction orders), sometimes characterizing "impasses" that arise and make collaboration difficult. The framework identifies several relevant "arenas" based on the actors, the environment, and the task. Additionally, FoP introduces the concept of power flows, which may change the designed interactions to emerging interactions based on the power dynamics of the participants. This framework can be particularly useful for explaining impasses that arise during collaboration among team members with different levels of power [51].

METHODS
To analyze the KN video, a log of the video's events (transcriptions of audio, notes of actors' behaviors) was placed in a spreadsheet. The dialogue, actions, and agent capabilities were then coded using the DiCoT and HAT Game Analysis frameworks. Then, events were coded as being feasible and common today, feasible but not common today, or not feasible today. Feasibility was determined by comparing the demonstrated agent capabilities to those of widely adopted agents like Apple's Siri and to current trends in HCI research and development. These characterizations were then used to consider why the Phil agent differs from today's personal digital assistants. As shown in Table 1, each row in the spreadsheet contained timestamps, the identity of the speaker, the transcribed dialogue, the corresponding physical actions of the speaker, and the triggering factor for the action. Some spreadsheet cells remained empty if there was no trigger shown within the video. Two of the authors transcribed the video and segmented events into distinct rows. To ensure accuracy, two additional authors conducted quality checks on the spreadsheet rows to validate that the designations made sense and were internally consistent. The final spreadsheet is available at http://OSF.io/4p3fm/.
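To make the coding scheme concrete, a minimal Python sketch of one event row and its feasibility code follows. The field names and example values are illustrative assumptions, not the actual column names of the OSF spreadsheet.

from dataclasses import dataclass
from enum import Enum
from typing import Optional

class Feasibility(Enum):
    FEASIBLE_COMMON = "feasible and common today"
    FEASIBLE_UNCOMMON = "feasible but not common today"
    NOT_FEASIBLE = "not feasible today"

@dataclass
class Event:
    timestamp: str                      # e.g., "0:31"
    speaker: str                        # "Mike", "Phil", "Jill", or "KN device"
    dialogue: str                       # transcribed utterance; empty for pure actions
    action: str                         # physical action, e.g., "opens the tablet"
    trigger: str                        # triggering factor; empty if none shown in the video
    feasibility: Optional[Feasibility] = None  # assigned during the feasibility pass

# Illustrative row paraphrasing the video's opening moment.
row = Event(
    timestamp="0:31",
    speaker="Phil",
    dialogue="You have three messages...",
    action="",
    trigger="Mike opens the Knowledge Navigator",
    feasibility=Feasibility.FEASIBLE_COMMON,
)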

DiCoT Data Analysis
While using the DiCoT framework to analyze the video, the team focused on its second and third components, information flow and artifacts. Every speech utterance or action taken by any of the human actors, Phil, or the KN device itself (e.g., "A new flashing icon appears...") was logged as an event and considered. The first DiCoT component, physical space, was excluded from this analysis because the focus of this research was on human-agent interactions. The team first met to align on a common understanding of how to apply the DiCoT framework and coded the first five lines of the transcript together. To analyze the second DiCoT component, information flow, the direction and modality of information flow were coded. For example, an utterance from Phil to Mike is coded as "auditory, Phil to Mike." This part of the coding focused on Principle #8 of the DiCoT framework, Information Transfer. After coding the first five events collectively to align on the coding method, the remaining events were divided roughly equally among four of the authors. A fifth author independently reviewed the coded transcript to check for consistency. The third DiCoT component, artifacts, includes less tangible ideas, like knowledge and capabilities that the KN agent possesses. The same group of four authors reviewed the transcript, aligned on the coding method, and made note of each artifact. This was again independently reviewed by the fifth author for consistency.
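As a sketch of how this direction-and-modality coding can be tallied into the counts later reported in the Results and Figure 1, consider the following Python fragment. The event tuples shown are hypothetical stand-ins for the coded transcript.

from collections import Counter

# Each coded event is a (modality, source, target) triple, e.g.,
# ("auditory", "Phil", "Mike") for an utterance spoken by Phil to Mike.
flows = [
    ("auditory", "Phil", "Mike"),
    ("auditory", "Mike", "Phil"),
    ("touch", "Mike", "KN"),
    ("visual", "Phil", "Mike"),
    # ...remaining coded events from the transcript
]

# Tally each channel per DiCoT Principle #8 (Information Transfer).
channels = Counter(f"{modality}, {src} to {dst}" for modality, src, dst in flows)
for channel, count in channels.most_common():
    print(f"{channel}: {count}")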

Analysis of Agent Capabilities
Coding the KN transcript with the information flow and artifact components of the DiCoT framework also generated a list of agent capabilities. For example, coding the information flow identified not just transfer of information, but also transformation of information (e.g., Phil summarizes messages for Mike) and buffering of information (e.g., Phil waits for the appropriate time to notify Mike of a missed call). Likewise, coding the artifacts identified Phil's capabilities and shared context (e.g., Phil knows Mike's reading history).
The authors reviewed the list of agent capabilities and merged repeat entries (e.g., there were multiple instances of displaying knowledge of the user's relationships). Next, they classified the agent capabilities into the three categories of feasibility described above. The team then discussed barriers to developing the agent capabilities that are not common today. Finally, emergent themes were identified among this list of barriers.

The research team also reviewed the agent capabilities using the HAT Game Analysis framework. Two of the authors each coded half of the transcript according to the seven dimensions in this framework. Then, they both reviewed the other half and met to discuss any discrepancies. Because the agent sometimes demonstrated behaviors that fit multiple categories within a HAT Game Analysis dimension, the authors generated a table to show which capabilities the agent displayed. To compare these capabilities with today's technology, the authors also classified the capabilities of Apple's Siri assistant as of August 2023. The authors considered including a comparison with ChatGPT, since it is one of the few current conversational agents that handles both pronoun antecedents and detailed knowledge, but it is not designed to be a personal assistant. An attempt to have a conversational dialogue with it results in detailed, discursive responses of inappropriate length for a human conversation.
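The dimension-by-dimension comparison lends itself to a simple tabular encoding, sketched below in Python. The dictionary values condense the codings reported in the Results (Table 3) and are paraphrases for illustration, not the exact cell contents.

# Condensed Phil vs. Siri codings across six of the seven HAT Game Analysis
# dimensions (agent type is omitted because it spans multiple values for Phil).
hat_codes = {
    "team configuration":  {"Phil": "multi-human team",         "Siri": "one user at a time"},
    "level of automation": {"Phil": "assistant to partner",     "Siri": "servant/assistant"},
    "control mode":        {"Phil": "agent-initiated adaptive", "Siri": "supervisory"},
    "interdependence":     {"Phil": "high",                     "Siri": "low"},
    "type of interaction": {"Phil": "dialog level",             "Siri": "direct input"},
    "interaction timing":  {"Phil": "real time",                "Siri": "turn-based"},
}

# Count the dimensions on which the two agents differ.
differing = [dim for dim, codes in hat_codes.items() if codes["Phil"] != codes["Siri"]]
print(f"{len(differing)} of {len(hat_codes)} coded dimensions differ: {differing}")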

Analysis of Power Relations
The authors applied the FoP framework to analyze power relations in the Knowledge Navigator video. While a complete sociotechnical analysis would explore the relationships between all characters mentioned (Mike, Phil, Mike's mother, Jill Gilbert, Kathy, Robert Jordan, Tom Lee, John Fleming, and Mike's graduate team), the present analysis focused on Mike and Phil, asking what circumstances would need to be present to create the interaction that they demonstrate. The authors then compared these power flows with those of a Siri user and Siri. Describing the arenas in designed interaction orders from the FoP framework, the researchers systematically examined how the Knowledge Navigator's Phil and Apple's Siri assistant have different collaborative dynamics. Three authors started the analysis using the 10 arenas described in Molinengo's paper [51]. The three authors used each arena as a lens to examine the designed interaction orders of Phil and Siri. Then, the power flows that arose from the emerging interaction orders could be used to count the differences between Phil and Siri.

RESULTS
The DiCoT framework was used to analyze information flow between Mike and Phil. There were fifteen utterances spoken from Mike to Phil and twelve utterances spoken from Phil to Mike (Figure 1). The rest of the communication occurred through touch and visual modalities, with the video showing eight instances of Mike interacting with the KN via touch and eight instances of Phil communicating with Mike via visual display updates.
Analyzing the KN video transcript with the DiCoT framework generated a list of 26 agent capabilities, such as "Knowledge of contacts and relationships" (e.g., Mike's mother) and "Can accurately extract data from a publication" (e.g., Phil summarizes the results of an academic paper using a graph). Related capabilities were then grouped (Table 2). Capabilities that were categorized as "currently feasible but not common today" or "not currently feasible" were tagged with constraints that restrict their adoption or development; capabilities that are currently feasible and common were not of interest to the current research question. Several types of feasibility constraints were noted with each infeasible agent capability. Some were based on the user, such as trust or privacy, and some were based on the available technology itself. The authors used categories similar to those used in previous studies of barriers to technology adoption [36,40,43] to group the constraints into three user-centered categories (privacy, social and situational, trust and perceived reliability) and one technology category. Any constraints that involved issues in computer ethics and the safety of personal or confidential information were tagged Privacy. Any constraints that involved conflicts arising from environmental factors and social norms were tagged Social and Situational. Constraints that involved hesitation to give agents control over important tasks because of a lack of trust in their perceived reliability were tagged Trust and Perceived Reliability. Constraints involving capabilities that have not yet been achieved with today's technology were tagged Technological. While these four categories are not novel in themselves, they serve as useful groupings for analyzing the capabilities of any new agent, and these particular results demonstrate that process.
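A hypothetical Python sketch of this constraint-tagging step is shown below; the capability names and tag assignments are illustrative examples in the spirit of Table 2, not the paper's actual codings.

# The four constraint categories used to tag uncommon or infeasible capabilities.
CONSTRAINTS = ("Privacy", "Social and Situational",
               "Trust and Perceived Reliability", "Technological")

# Illustrative entries; names and tags are examples, not actual Table 2 rows.
capabilities = [
    {"name": "Knows the user's full reading history",
     "feasibility": "feasible but not common today",
     "tags": {"Privacy", "Trust and Perceived Reliability"}},
    {"name": "Tracks a conversational thread across ambiguous utterances",
     "feasibility": "not currently feasible",
     "tags": {"Technological"}},
]

# Group the tagged capabilities by constraint category for the barrier discussion.
for constraint in CONSTRAINTS:
    matches = [c["name"] for c in capabilities if constraint in c["tags"]]
    print(f"{constraint}: {matches}")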
Using the HAT Game Analysis framework to compare Phil and Siri revealed differences across all seven dimensions (Table 3). Phil participates in a multi-human team by engaging in the conversation between Mike and his colleague, Jill Gilbert, and it is implied that he may collaborate with other users' agents when managing messages and scheduling. In contrast, Siri responds to one user at a time and may send messages on a user's behalf but will not engage in a dialogue with outside users or agents. Phil also supports a broader range of agent types. In terms of level of automation, Phil demonstrates higher levels of intelligence and autonomy than Siri. This higher level of autonomy is reflected in the differences in control mode and interdependence as well. In the dimension of interaction, Phil operates at a dialog level, sharing responsibilities with humans, while Siri primarily carries out actions based on human inputs (direct input). Regarding timing, Phil supports real-time interactions, while Siri operates on a turn-based system.
Comparing Phil and Siri using the Flows of Power framework revealed differences in all 10 arenas described in Molinengo's paper [51] (Table 4). In analyzing usage with both technologies, distinctive patterns emerge across the various arenas. The agenda of the Mike-Phil team, driven by the KN, aims at co-managing Mike's work life, contrasting with Siri's role in facilitating life as an informational resource for users or a tool to simplify certain procedures in life, such as making hands-free calls and setting reminders. Actors in the KN scenario depict a dynamic interplay where Mike holds control, yet Phil can initiate actions based on context. Siri primarily responds to commands but can also take initiative by offering suggestions such as routes for a user's next meeting based on their travel habits and current traffic patterns. Forms of interaction vary, with Phil using real-time speech and touch of almost anything in the user interface, while Siri relies more on voice commands and touch control of specific digital objects such as calendar appointments and contacts. Contextual settings reveal Phil's rich contextual awareness of Mike's world, in contrast to Siri's limited context understanding. Facilitation materials and documentation differ more dramatically, with Phil having access to a much greater range of digital objects than Siri. The results of these interactions distinguish the collaborative accomplishments between Mike and Phil from Siri's user-centric outcomes. Expertise distribution, funding considerations, and time dynamics further highlight the nuanced disparities between Phil and Siri.
Comparing the power dynamics more broadly between a user and Siri vs. Mike and Phil, a notable power difference appears when Phil and Mike want to share information with each other. At the video's onset, when Mike enters the room, Phil initiates information sharing without Mike's explicit request. Phil's distinctive ability to decide the extent of information to share with Mike stands in contrast to current technologies, where information is typically presented uniformly per a user's request. Phil's power in sharing information is further exemplified when Mike requests articles, and Phil summarizes Jill Gilbert's paper and John Fleming's papers automatically based on the rhetorical goals of the current conversation. This information-sharing prowess influences several arenas, including Agenda, Actors, Results, Expertise, and Time, distinguishing Phil from today's technology.
Appropriate and accepted interruption of each other by both Phil and Mike is another significant aspect of power dynamics. Mike interrupts Phil when he has the idea of a surprise birthday party, which Phil presumably notes before continuing to read Mike's daily agenda. Phil's decision to interrupt Mike while he examines the university research network with a timely inquiry ("Jill Gilbert is calling back") demonstrates both Phil's context monitoring ability and a level of authority Mike has given to Phil. This ability for both parties to interrupt each other directly impacts the interaction forms and documentation arenas. While Siri can recommend entering "do not disturb" mode if a user is starting a meeting, it does not interrupt the user proactively in the way Phil does.

Another essential element of power in KN is the mutual awareness between team members. As Mike searches for an article, Phil shares information based on an awareness of Mike's knowledge and preferences. In contrast, in today's technology, while the user has the power to know about Siri's capabilities, Siri typically provides information irrespective of the user's prior knowledge. On the other hand, Phil knows Mike's preferences and is thus able to make choices such as interrupting Mike to announce Jill's incoming call but instead taking a message when Mike's mother calls. Mike's ability to confidently give Phil directives with no confirmatory acknowledgment from Phil ("Surprise birthday party next Sunday") is part of the setting arena (context awareness) and emphasizes a unique trust dynamic between Phil and Mike.
Interaction patterns with other non-human entities within the KN scenario are also distinctive. One example is Phil's intuitive understanding of what to do when Mike inserts a memory card and says, "Copy the last 30 years at this location at 1-month intervals." Mike doesn't reference the memory card itself, and Phil knows that "this location" relates to the map on the screen. Another example is Phil's management of Jill's screen-shared data. After Jill and Mike view two different maps, and Mike says, "let's put these together," Phil knows how to interpret that request for merged data. The absence of Siri's power to reference data objects implicitly (or to work in a multi-user context) underscores the unusually extensive facilitation material needs required for KN's functionality.
The funding arena raises the question of what financial business model might be required to support an agent like Phil. Even if the development of an adaptive, personalized learning agent were open-sourced, the amount of storage required to record what Phil knows about Mike would be significant. The original intended value proposition for Siri (before it was bought by Apple) was to generate revenue by recommending users to restaurants or websites that had paid for this lead generation service [86]. After the purchase by Apple, Siri became a feature that drove hardware purchases; Siri is free, but you must buy Apple products. One could argue that Siri still facilitates finding desired information with fewer clicks, and that information could have its own advertising-based or subscription-based model. But despite the frequency of business models based on personal data mining and advertising today, Phil does not advertise airport transportation services or premium subscriptions to relevant podcasts to Mike. Perhaps Phil's deep knowledge of Mike's preferences is being sold to other providers so that they can customize their offerings to Mike with more personalized sales efforts, e.g., "Mike, I'm a realtor from Acme Real Estate. I know you like to keep your mother at a distance but that she needs your care, so I have a perfect house for you with an accessory dwelling unit out back for her..." It is unclear whether current business models could support the creation of an agent like Phil. The future may hold a pay-per-knowledge-upload model; perhaps Phil began as a generic administrative assistant agent with a high school education, but Mike (or his university) purchased the "liberal arts master's degree" knowledge upload and the "academic workplace" upload, and then Phil was able to speak thoughtfully about graduate students and deans. Similarly, if Mike were a surgeon, medical uploads could be purchased for Phil. These "uploads" (using the term from the movie The Matrix) could include not only factual knowledge, but skills, attitudes, cultural awareness, and collaboration skills within a given domain. An ongoing subscription fee might be required to keep them up to date. A new marketplace of personalized agent knowledge sales could supplement the market, enabling agents such as Phil to become widespread. A risk in this domain would be the assumption that purchased knowledge is authoritative; ensuring the quality of uploads could be difficult. For example, someone with extreme viewpoints might desire an agent that supports their claims and thus purchase uploads that consist of conspiracy theories framed as truth. Also, in a new form of information economy, when Mike retires, Phil could be sold to a younger colleague as containing multiple uploads as well as deep experience on the job.

DISCUSSION
The original research question explored what barriers prevent the widespread adoption of agents that can hold these kinds of conversations. The results illustrate multiple barriers at different levels. First, the KN agent's capabilities pose a variety of constraints in terms of privacy, social norms, trust and perceived reliability, and technological challenges. These constraints, discussed below, offer us design requirements for future conversational agents. The agent capability analysis offers details that can supplement those requirements. The touch input by the human and the display control by the agent raise a discussion of asymmetric user interfaces that might be present in human-agent teaming. Finally, the analysis of the roles played by the agent suggests it is worth exploring the ramifications of metaphorical labels for agents such as butler, assistant, or coach.

HAT Limitations and Design Requirements
The agent in the KN video can also be used to explore today's HAT limitations and create a research roadmap of design requirements that could make this type of agent a reality. The following is a discussion of the four themes that emerged from an assessment of the KN agent's feasibility.
4.1.1 Privacy. One of the defining capabilities of the KN agent is its ability to seamlessly follow the user's conversation, working style, and overall intent. The agent has a vast amount of stored knowledge about the user; it recalls what research articles he has previously read and understands his relationship with other users. Its flawless window management on the tablet display suggests that it understands the user's working style. While the ability to define relationships (e.g., defining one of Mike's contacts as "mom") is commonly accepted in today's smart assistants, the extent of stored user history demonstrated in this video would be far more controversial for users today. They would want to know who has access to that data and whether it is being shared or sold. Recent researchers note that agents like Alexa now allow verbal consent for data sharing [70]. However, users' privacy concerns could be so great that they could pose an insurmountable barrier to the development of an agent like Phil, even if all other barriers discussed in this study were removed, as noted by previous research on privacy issues with digital assistants [8,82]. Data laws such as GDPR might make Phil illegal to start with, depending on how Phil's knowledge is acquired, stored, and deployed.
Even more striking is that the KN agent is always actively listening. Previous research on user preferences with digital assistants has shown that this feature raises privacy concerns [47]. In this video, the user never starts an utterance with "Hey Knowledge Navigator" or "Ok, Phil!" and at times, the agent enters the conversation of its own volition. The agent understands what the user is doing well enough to know that it is appropriate to interrupt with a relevant phone call ("Excuse me; Jill Gilbert is calling back"). Later, the agent notices the user pausing to remember the time of his upcoming lecture and answers his unspoken question by stating the time for the lecture. Its knowledge of the user's reading history implies that it could also know about the user's offline activities.
Although this video demonstrates how natural and complementary an agent teammate could be, the amount of user knowledge and tracking that it requires could be a significant barrier to adoption for many users. These capabilities may address common HAT challenges like miscalibrated trust and lack of shared situational awareness [72], but they have also been labeled "creepy" in a discussion of technology ethics and privacy [79], and more recently, researchers have developed a validated scale just to measure the creepiness of voice assistants [61]. This analysis notes the slippery slope from innocent social listening and data analytics that improve users' experience to misuse of user data by both individuals and companies.
Human assistants in organizational environments and domestic workers in the home are often entrusted with significant private and confidential information. It will be critical for future agent designers to consider how the similarities and differences between that human sharing and sharing information with an agent can be navigated so that intelligent listening agents are not perceived as creepy. Also, within business organizations, agents will need to be wise enough to know which information is confidential within a single workgroup, which is confidential within the company, and which information is not confidential.

4.1.2 Social and Situational. One striking feature of this video to a modern audience is the high dependence on verbal, auditory communication. The user speaks to the KN agent fifteen times, almost double the eight times he provides touch input to the tablet. While this style of interaction works well with the agent acting as an assistant and in a private office, previous research about the practicality of voice interfaces [36] suggests that voice would not be a primary input modality. In an open office layout, this would likely create a distracting amount of intelligible background noise, which has been found to negatively impact employees' psychological and physical well-being [16]. Additionally, the spoken dialogue may be situationally inappropriate (e.g., working in a quiet library) and cause frustration for users with accents [35,58].
Another striking feature of this video is Phil's human-like appearance. While some HAT research has found that the use of realistic [19] and gendered [30] avatars is associated with increased trust, it has also been suggested that this approach may not be desirable. Realistic avatars have also been labeled "eerie" [80] and associated with decreased trust [71], with these effects attributed to the uncanny valley.
An additional social constraint emerges from the KN agent's scheduling and message management. Existing tools allow users to send hands-free SMS messages, view shared calendars, and set reminders. However, few have taken the next step of empowering an AI agent to interact with other users (or other users' agents) based on informal instructions, such as: "Find out if I can set up a meeting with, um, Tom Lee." Google Duplex, after its 2018 demo of a digital assistant scheduling a salon appointment over the phone [42], is being deprecated and perhaps redesigned [85]. This change raises the question of the social acceptance of relegating these social interactions to a computer. Some users may refuse to work with the agent, as has been observed with self-service kiosks and online ordering [9].
More broadly, the widespread presence of such agents in society (Does everyone have one, or only the wealthy?) raises questions that have been raised previously in research on the effect of ongoing automation and digitalization on society [36]. It remains to be seen, for example, whether some important human connection is lost when communication is automated, or whether this offloading of administrative tasks enables people to finish their work more quickly and meet a friend for lunch, as Mike does.
4.1.3 Trust and Perceived Reliability. Setting aside the privacy and social concerns of empowering an agent to manage one's messages, schedule, and work, a further question remains: would users trust an agent to perform this work? The user in this video appears to trust the agent as implicitly as he would a reliable human assistant. This is exemplified in the request: "If Kathy calls, tell her I'll be there at 2:00." The user provides no additional context and does not expect an acknowledgment or confirmation from the agent; he seems to trust that Phil can handle this task. If Kathy instead calls to say that her flight was cancelled or she got a ride to the airport with a friend, the agent will presumably know to adjust its response. In comparison, in a field study of Amber, an actual work assistant agent, users had a difficult time building trust [38]. This gap between user ratings and full two-way trust illustrates how difficult it would be to achieve the level of trust and reliability demonstrated by Mike and Phil.
In existing HAT research, an agent's resilience when functioning outside normal operating conditions is associated with increased trust [56]. In this case, mobile devices unanticipated by the KN video offer an alternative solution, albeit one that places more workload on the user. While Mike leaves Phil and the KN tablet behind to act as an answering service, it would be more common today to bring a smart watch or phone along to remain connected. An agent today could handle an unexpected message from Kathy with a push notification to inform the user and, if needed, request clarification on an appropriate response.
The user in this video also appears to trust the agent to correctly summarize messages to and from his colleagues. Today's large language models like ChatGPT have advanced language capabilities, but many have urged caution in trusting the accuracy of their work because they lack real knowledge of the user and understanding of the world (shared context). The agent in the video may have those capabilities, as it was shown to have knowledge of the user and access to academic publications. If an agent possessed this much knowledge today, would it be trusted with more advanced communication and analysis tasks? Perhaps greater transparency and shared situational awareness between the user and agent would be enough to earn this trust. Or, as Wynne and Lyons [87] proposed, a positive feedback loop may generate this trust over time. The authors suggest that an initial transitional period in which the agent explicitly explained its actions would generate enough trust over time that less transparency would then be required.
4.1.4 Technological. Several technological constraints prevent such intelligent conversational agents from existing. For example, the agent's sophisticated conversational skills exceed those of today's personal assistants and text generators. The agent understands the thread of a conversation between two humans about "last minute lecture material" and reminds the user about when that lecture starts with flawless timing. At another point, the agent understands "I'd like to recheck his figures" as a request to show a new figure on the screen and correctly describes it: "Here's the rate of deforestation he predicted." The technological constraint underlying this gap is the ability to establish shared context.
Human-human conversation is efficient due to a shared context or common ground between the participants. Conversation between humans is a continuous process of negotiated meaning through a process called grounding [11]. As humans converse, they provide the minimum detail needed to move the conversation forward. This is possible because ambiguous statements are resolved based on the participants' shared knowledge and common ground. Each utterance is assumed to be relevant and helpful. However, minor breakdowns in conversation are frequent. Redundancy in conversational cues (verbal and non-verbal) makes detecting breakdowns easy, and humans are very good at repairing conversations quickly. Current agents and humans do not share the same level of knowledge and meaning, requiring information exchange to be more detailed and specific. Furthermore, less shared context or common ground can make conversational breakdowns more severe and repair harder. The ease with which Mike and Phil understand ambiguous statements, repair conversational breakdowns, and exchange information efficiently is currently beyond the ability of today's conversational agents, which cannot yet account for the complexities of natural language interactions [2].
Additionally, though they are not explicitly related to the conversational dynamics, it is worth noting other technological challenges that must be overcome to support the envisioned conversation. The first is smart window management and turn-taking, in which Phil adjusts the tablet's display contents based on Phil and Mike's ongoing work and dialog (i.e., Phil knows when to minimize, maximize, or change the contents of a window and when to let Mike take control). Second, Phil's ability to digest a published academic paper and represent its data reliably is beyond current capabilities, given the current active research focused on extracting data from published papers [73]. Third, the KN scenario demonstrated no forms of authentication that might explain how Phil knew it was Mike who opened the KN. Perhaps Mike is carrying a digital key and Phil can sense his presence, much like cars do today with a key fob. Also, while Mike says at one point, "I'm going to lunch now," he continues giving Phil instructions after that statement, and then simply leaves the office without closing the KN, requiring Phil to infer that it is time to enter "away" mode and respond to Mike's mother's phone call during Mike's absence. While this proximity-based form of authentication is currently feasible, effectively similar to carrying one's smart phone at all times, the security implications of this form of authentication with an agent that knows deep personal information would need to be explored.
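A minimal Python sketch of how such proximity-based authentication and an inferred "away" mode might work is given below. All names, thresholds, and behaviors are speculative assumptions for illustration; the KN video itself shows no authentication mechanism.

import time

AWAY_AFTER_SECONDS = 120  # assumed grace period before inferring "away" mode

class ProximitySession:
    """Speculative presence-based session logic for a personal agent."""

    def __init__(self, paired_key_id: str):
        self.paired_key_id = paired_key_id
        self.last_seen = None

    def on_key_detected(self, key_id: str, now: float) -> str:
        if key_id != self.paired_key_id:
            return "locked"        # unknown key: keep personal context sealed
        self.last_seen = now
        return "active"            # owner present: full shared context available

    def mode(self, now: float) -> str:
        if self.last_seen is None or now - self.last_seen > AWAY_AFTER_SECONDS:
            return "away"          # take messages, defer non-urgent notifications
        return "active"

session = ProximitySession("mikes-key-fob")
print(session.on_key_detected("mikes-key-fob", time.time()))  # -> "active"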

Agent Capabilities and HAT
The HAT research community has identified common challenges in HAT interfaces, such as a lack of shared awareness, insufficient transparency, and miscalibrated trust [72]. That research proposed three opportunities to address these challenges: greater transparency, bi-directional communication, and operator-directed interfaces [72]. Similarly, other researchers have focused on agent capabilities that lead users to treat an agent as a teammate: agency, communication, and anthropomorphism [24,46,87]. Despite being created before HAT research began in the 1990s and grew more prominent in the past decade [56], the KN agent embodies many of these capabilities.
Analyzing the KN agent, Phil, through the HAT Game Analysis framework indicated a high degree of agency. The agent type ranged from companion to advisor throughout the video, adjusting its behavior for the task. At times, Phil waited for the user to assign it a task (e.g., "Print this paper"), while at other times it was an advisor (e.g., providing key information from a journal article in real time).
This agency can also be seen in Phil's interaction style (primarily dialog level) and timing (primarily real-time). The rich, naturalistic, bi-directional communication between the user and agent has been proposed as a way to address challenges in HAT interfaces, such as a lack of transparency and shared awareness [72,87]. The KN video demonstrates what this bi-directional communication might look like in a sophisticated agent. The number of utterances spoken between Mike and Phil is roughly equal (15 user to agent, and 12 agent to user), creating a real-time, dialog-level interaction. While this equality can serve as a possible indicator of a human-like conversational dialog, a pure utterance count is insufficient; an agent programmed to acknowledge every human utterance with a simple "Ok" could also achieve a near-equal utterance count. Also, studies of whether communication frequency is an indicator of high-performing teams have yielded mixed results [27]. What is more interesting within the KN video is the number of things that Mike does not have to say because he and Phil already have a shared awareness. For instance, Mike does not have to specify to Phil what "more recent literature" means in terms of dates, or what "I'd like to recheck his figures" means in terms of a programmable data query.

Asymmetric Interactions: Touch and Authority
An exploration of asymmetric interactions, which occur when users collaborate but have access to different information or a different set of available actions, can offer insight into the design of technology experiences [57]. The asymmetric interactions between Mike and Phil based on touch vs. display control illustrate how asymmetry can be advantageous in a conversational context. Mike utilizes touch-based interaction, which is asymmetric since Phil does not use hands. Although Phil can control which objects appear on screen and where they are located, Phil cannot reach out to "tap" Mike to make him be quiet, as Mike does to Phil. One could imagine a version of Phil that can "point" to screen elements to call Mike's attention to them, or perhaps a version of Phil that could vibrate a smart watch that Mike is wearing to signal him, but those do not occur in the KN video. Mike's direct-touch input gives him the ability to manipulate screen objects directly, which is useful in certain contexts, such as dragging the South American and African simulations together. Phil facilitates Mike's actions by relinquishing control to him via touch. This negotiation of screen control appears effortless in the scripted KN video, but previous research on collaborative real-world touch-screens shows that such collaboration is non-trivial [60].
While the FoP analysis showed that Mike and Phil share an ability to interrupt each other and also participate in shared awareness of each other's state, asymmetries arise as well. The difference in authority and accountability, as noted with agents previously [46], is perhaps most important. Phil can control the display but does so based on Mike's commands or Mike's previously established goals (e.g., to have his daily schedule shown and reviewed by Phil). This asymmetry leverages the agent's capability to manipulate display elements rapidly and accurately, while Mike's role focuses on high-level decision-making. This division of responsibilities showcases an advantageous asymmetry, where the agent complements the user's cognitive strengths with its own technical aptitude. However, the design of this division is also non-trivial, as shown by the various possible levels of autonomy described in the HAT Game Analysis framework. For example, in the KN video, after Mike mutes Phil by tapping on the agent's face (Would such a gesture be socially acceptable if Phil were further anthropomorphized?), Phil has the power to unmute himself and continue speaking (as well as the knowledge of the right timing to do so). Also, Phil has the power to interrupt Mike ("Excuse me, Jill Gilbert is calling back"). Finally, Phil has the power to represent Mike in interactions with other humans (e.g., giving a message to Kathy and setting up a meeting the next day with a colleague). Just as employees who work with human administrative assistants negotiate complex sets of permissions around delegation and authority, Mike likely had to establish his preferences with Phil explicitly.

Neither Administrative Assistant nor Butler: A New Term Is Needed
Mike's strong trust in Phil and the amount of power Phil has in managing Mike's schedule, along with the fluid conversation, raise the idea that Phil could be treated similarly to an administrative assistant. Phil's level of knowledge about Mike, and Phil's bowtie, suggest that the interaction might even cross over into the realm of butler or other household employee. However, as noted in the Introduction, while these human role schemas may feel comfortable, they are likely not the best fit, given that Mike could ask for Phil's help endlessly without concerning himself about Phil's welfare, and that Phil knows far more about Mike's personal preferences and details than a typical human assistant. In the KN video, as Mike leaves the office, Phil says "Enjoy your lunch!" Was Mike supposed to reply with "You too!"? These sorts of behaviors, with humans and agents being polite to each other even when it is not required, could be viewed as a part of the psychological support provided by team members for each other in a healthy team environment [64]. These issues represent another form of asymmetry in the human-agent interaction. The above discussion suggests that the human need not care about the agent's general welfare, but the human, in order to feel comfortable with the agent, might desire that the agent monitor the human's welfare. The agent must also spend computer cycles being polite to the human, as if it were human itself, even though it is not. For this reason, the authors suggest that for the conversational dialogue envisioned in the KN video to occur, we need a new mental schema that is less of a human metaphor like butler or assistant. Jarring terms like helpbot or favagent are less likely to go viral than a slang neologism like jent (short for agent). Overall, the authors recommend a term that moves away from a human metaphor.

CONCLUSIONS
In conclusion, our analysis of the Knowledge Navigator video as design fiction investigated constraints preventing the widespread adoption of highly capable and trusted conversational agents. Privacy considerations emerge as the first constraint, noting the stark differences between a trusted human administrative assistant and a trusted agent that needs to store extensive user knowledge and track the user continuously. Social and situational constraints, notably the reliance on verbal communication, underscore the importance of designing agents that align with diverse communication preferences and situational changes. Trust and perceived reliability, as another constraint, requires agents to inspire confidence in users through transparent interactions and reliable performance. Furthermore, technological advancements are also necessary to fill the gap in conversational abilities, particularly in the realm of tracking a conversational thread through a series of ambiguous utterances. Underlying these constraints is the extensive range of digital objects (facilitation materials) that a KN agent must be able to process. Also, the exploration of asymmetric interactions highlights the potential advantages of dynamic power processes in human-agent collaborations. Lastly, the recognition of a psychological need for a novel, less human-centric term for agents like Phil ("jent"?) is another important factor to consider.
HAT research often connects human-likeness with the agent-as-a-teammate paradigm, e.g., [77]. The utmost goal seems to be an agent that emulates a human teammate, which would then enable the type of rich communication, trust, and teamwork that is seen in effective human-human teams. Long before the teammate-versus-tool dialogue gained the traction it has today, the KN video showed an example of how such an agent might be experienced. However, the results of this analysis suggest that the human teammate is perhaps not the best analogy and not the ideal standard for comparison. Instead, we want tireless agents that help us while treating us like teammates, even if we don't treat them like teammates. The results of this research offer a set of guidelines to make that happen.

Figure 1: Analysis of information flow showed a roughly equal exchange of spoken utterances from Mike to Phil (15) and Phil to Mike (12), as well as several asymmetric information exchanges using touch input by Mike (8 instances) and display control by Phil (8 instances).

Table 1: Excerpt from the KN dataset showing speech or action events in each row.

Table 2: Agent capabilities and barriers to feasibility, based on DiCoT analysis.

Table 3: Agent capabilities per the HAT Game Analysis Framework of KN's Phil vs. Siri. Rows recovered from the source:
Agent type. Phil: context responsive; changing its behavior in response to the environment.
Levels of automation (autonomy). Phil: Assistant, Associate, Partner; Phil's level of autonomy varies from Assistant to Partner, where the agent has equal authority. Siri: Servant, Assistant; the agent is not capable of actions without the operator's permission.
Control mode. Phil: agent-initiated adaptive; the agent self-regulates when it waits for inputs vs. when it collaborates with the user in real time. Siri: supervisory; the user assigns tasks to the agent.
Interdependence. Phil: high; most responsibility is shared between the user and agent. Siri: low; use of this agent is optional; most tasks can be completed without it.
Interaction timing. Phil: real time. Siri: turn-based.

Table 4: Flows of Power arenas in designed interaction orders of KN's Phil vs. Siri.