Generative Agents: Interactive Simulacra of Human Behavior
Abstract.
Believable proxies of human behavior can empower interactive applications ranging from immersive environments to rehearsal spaces for interpersonal communication to prototyping tools. In this paper, we introduce generative agents: computational software agents that simulate believable human behavior. Generative agents wake up, cook breakfast, and head to work; artists paint, while authors write; they form opinions, notice each other, and initiate conversations; they remember and reflect on days past as they plan the next day. To enable generative agents, we describe an architecture that extends a large language model to store a complete record of the agent’s experiences using natural language, synthesize those memories over time into higher-level reflections, and retrieve them dynamically to plan behavior. We instantiate generative agents to populate an interactive sandbox environment inspired by The Sims, where end users can interact with a small town of twenty-five agents using natural language. In an evaluation, these generative agents produce believable individual and emergent social behaviors. For example, starting with only a single user-specified notion that one agent wants to throw a Valentine’s Day party, the agents autonomously spread invitations to the party over the next two days, make new acquaintances, ask each other out on dates to the party, and coordinate to show up for the party together at the right time. We demonstrate through ablation that the components of our agent architecture—observation, planning, and reflection—each contribute critically to the believability of agent behavior. By fusing large language models with computational interactive agents, this work introduces architectural and interaction patterns for enabling believable simulations of human behavior.
Keywords: Human-AI interaction, agents, generative AI, large language models
Conference: The 36th Annual ACM Symposium on User Interface Software and Technology (UIST '23), October 29-November 1, 2023, San Francisco, CA, USA
DOI: 10.1145/3586183.3606763
ISBN: 979-8-4007-0132-0/23/10
CCS: Human-centered computing → Interactive systems and tools; Computing methodologies → Natural language processing
Figure 1. Generative agents are believable simulacra of human behavior for interactive applications. In this work, we demonstrate generative agents by populating a sandbox environment, reminiscent of The Sims, with twenty-five agents. Users can observe and intervene as agents plan their days, share news, form relationships, and coordinate group activities.
A screenshot of the game world populated by generative agents.
1. Introduction
How might we craft an interactive artificial society that reflects believable human behavior? From sandbox games such as The Sims to applications such as cognitive models (Card et al., 1983) and virtual environments (Laird and VanLent, 2001; Bates, 1994), for over four decades, researchers and practitioners have envisioned computational agents that can serve as believable proxies of human behavior. In these visions, computationally-powered agents act consistently with their past experiences and react believably to their environments. Such simulations of human behavior could populate virtual spaces and communities with realistic social phenomena (Dill and Martin, 2011; Park et al., 2022), train people on how to handle rare yet difficult interpersonal situations (Tambe et al., 1995; Jones et al., 1999; Hollan et al., 1984), test social science theories (Binz and Schulz, 2023; Horton, 2023), craft model human processors for theory and usability testing (Card et al., 1983; John and Kieras, 1996; Hämäläinen et al., 2023), power ubiquitous computing applications (Fast et al., 2016) and social robots (Bates, 1994; Bledsoe, 1986), and underpin non-playable game characters (Laird and VanLent, 2001; Riedl, 2012) that can navigate complex human relationships in an open world.
However, the space of human behavior is vast and complex (Riedl, 2012; Yannakakis, 2012). Despite striking progress in large language models (Brown et al., 2020) that can simulate human behavior at a single time point (Park et al., 2022; Hämäläinen et al., 2023), fully general agents that ensure long-term coherence would be better suited by architectures that manage constantly-growing memories as new interactions, conflicts, and events arise and fade over time while handling cascading social dynamics that unfold between multiple agents. Success requires an approach that can retrieve relevant events and interactions over a long period, reflect on those memories to generalize and draw higher-level inferences, and apply that reasoning to create plans and reactions that make sense in the moment and in the longer-term arc of the agent’s behavior.
In this paper, we introduce generative agents—agents that draw on generative models to simulate believable human behavior—and demonstrate that they produce believable simulacra of both individual and emergent group behavior. Generative agents draw a wide variety of inferences about themselves, other agents, and their environment; they create daily plans that reflect their characteristics and experiences, act out those plans, react, and re-plan when appropriate; they respond when the end user changes their environment or commands them in natural language. For instance, generative agents turn off the stove when they see that their breakfast is burning, wait outside the bathroom if it is occupied, and stop to chat when they meet another agent they want to talk to.[1] A society full of generative agents is marked by emergent social dynamics where new relationships are formed, information diffuses, and coordination arises across agents.
[1] When referring to generative agents engaging in actions or going to places, this is a shorthand for readability and not a suggestion that they are engaging in human-like agency. The behaviors of our agents, akin to animated Disney characters, aim to create a sense of believability, but they do not imply genuine agency.
To enable generative agents, we describe an agent architecture that stores, synthesizes, and applies relevant memories to generate believable behavior using a large language model. Our architecture comprises three main components. The first is the memory stream, a long-term memory module that records, in natural language, a comprehensive list of the agent’s experiences. A memory retrieval model combines relevance, recency, and importance to surface the records needed to inform the agent’s moment-to-moment behavior. The second is reflection, which synthesizes memories into higher-level inferences over time, enabling the agent to draw conclusions about itself and others to better guide its behavior. The third is planning, which translates those conclusions and the current environment into high-level action plans and then recursively into detailed behaviors for action and reaction. These reflections and plans are fed back into the memory stream to influence the agent’s future behavior.
This architecture suggests applications in multiple domains, from role-play and social prototyping to virtual worlds and games. In social role-play scenarios (e.g., interview preparation), a user could safely rehearse difficult, conflict-laden conversations. When prototyping social platforms, a designer could go beyond temporary personas to prototype dynamic, complex interactions that unfold over time. For this paper, we focus on the ability to create a small, interactive society of agents inspired by games such as The Sims.[2] By connecting our architecture to the ChatGPT large language model (OpenAI, 2022), we manifest a society of twenty-five agents in a game environment. End users can observe and interact with these agents. If an end user or developer wanted the town to host an in-game Valentine’s Day party, for example, traditional game environments would require scripting tens of characters’ behavior manually. We demonstrate that, with generative agents, it is sufficient to simply tell one agent that she wants to throw a party. Despite many potential points of failure—the party planner must remember to invite other agents to the party, attendees must remember the invitation, those who remember must decide to actually show up, and more—our agents succeed. They spread the word about the party and then show up, with one agent even asking another on a date to the party, all from a single user-generated seed suggestion.
[2] A demonstration of an actual simulation of the generative agent society can be viewed at the following link: https://reverie.herokuapp.com/UIST_Demo/. A public repository for the simulation code is located here: https://github.com/joonspk-research/generative_agents
We conducted two evaluations of generative agents: a controlled evaluation to test whether the agents produce believable individual behaviors in isolation, and an end-to-end evaluation where the agents interacted with each other in open-ended ways over two days of game time to understand their stability and emergent social behaviors. In the technical evaluation, we leverage a methodological opportunity to evaluate an agent’s knowledge and behavior by “interviewing” it in natural language to probe the agents’ ability to stay in character, remember, plan, react, and reflect accurately. We compared several ablations that limit agents’ access to memory, reflection, and planning. We observe that each of these components is critical to strong performance across these interview tasks. Across the technical and end-to-end evaluation, the most common errors arose when the agent failed to retrieve relevant memories, fabricated embellishments to the agent’s memory, or inherited overly formal speech or behavior from the language model.
In sum, this paper makes the following contributions:
• Generative agents, believable simulacra of human behavior that are dynamically conditioned on agents’ changing experiences and environment.
• A novel architecture that makes it possible for generative agents to remember, retrieve, reflect, interact with other agents, and plan through dynamically evolving circumstances. The architecture leverages the powerful prompting capabilities of large language models and supplements those capabilities to support longer-term agent coherence, the ability to manage dynamically evolving memory, and recursively produce higher-level reflections.
• Two evaluations, a controlled evaluation and an end-to-end evaluation, that establish causal effects of the importance of components of the architecture, as well as identify breakdowns arising from, e.g., improper memory retrieval.
• Discussion of the opportunities and ethical and societal risks of generative agents in interactive systems. We argue that these agents should be tuned to mitigate the risk of users forming parasocial relationships, logged to mitigate risks stemming from deepfakes and tailored persuasion, and applied in ways that complement rather than replace human stakeholders in design processes.
2. Related Work
In this section, we reflect on the prior literature in human-AI interaction and situate, within its canon, the agenda of building believable proxies of human behavior. This agenda, once hailed as a north star in the interaction, game, and artificial intelligence communities (Laird and VanLent, 2001; Riedl, 2012; Riedl and Young, 2005; Bates, 1994), has remained challenging due to the complexity of human behavior (Brooks et al., 2000; Yannakakis, 2012). We synthesize this research to suggest that large language models, though not sufficient by themselves, open up a new angle for creating believable agents when leveraged using the appropriate architecture.
2.1. Human-AI Interaction
Interactive artificial intelligence systems aim to combine human insights and capabilities in computational artifacts that can augment their users (Amershi et al., 2014; Fails and Olsen Jr, 2003). A long line of work has explored ways to enable users to interactively specify model behavior. For instance, Crayons demonstrated an early vision of interactive machine learning, allowing non-expert users to train classifiers (Fails and Olsen Jr, 2003). Further work helped to articulate how end users might describe their classification goals to the system through examples (Fogarty et al., 2008) or demonstration (Fiebrink and Cook, 2010). Recent advancements have extended these explorations to deep learning (Lam et al., 2023) and prompt-based authoring (Jiang et al., 2022; Wu et al., 2022b; Liu et al., 2022).
Meanwhile, a persistent thread of research has advanced the case for language- and agent-based interaction in human-computer interaction. Formative work such as SHRDLU (Winograd, 1971) and ELIZA (Weizenbaum, 1966) demonstrated the opportunities and the risks associated with natural language interaction with computing systems. As research progressed, it became evident that autonomous agents could offer new metaphors for delegation and interaction (Maes, 1995), but the boundaries of delegation between humans and agents have remained the subject of ongoing debate and refinement (Shneiderman and Maes, 1997; Horvitz, 1999; Shneiderman, 2022). Recently, this technology has reached a level of stability that enables agents to interact via natural language in large and complex online social environments (e.g., (Krishna et al., 2022)). Natural language interaction provides a novel modality that can enhance user abilities in domains such as photo editing (Linder et al., 2013; Fourney et al., 2011; Adar et al., 2014) and code editing (Rong et al., 2016).
We convene these threads of work to show that we can now create agents that proxy human behavior for interactive systems, and interact with them using natural language. In doing so, this work reopens the door to examining foundational human-computer interaction questions around cognitive models such as GOMS and Keystroke-Level Model (KLM) (Card et al., 1983, 1980), around prototyping tools (Park et al., 2022), and around ubiquitous computing applications (Weiser, 1991; Dey, 2001; Fast et al., 2016).
2.2. Believable Proxies of Human Behavior
Prior literature has described believability, or believable agents, as a central design and engineering goal. Believable agents are designed to provide an illusion of life and present a facade of realism in the way they appear to make decisions and act on their own volition, similar to the characters in Disney movies (Bates, 1994; Thomas and Johnston, 1981). These agents can populate and perceive an open world environment like the one we inhabit (Laird and VanLent, 2001; Bates, 1994), and strive to behave in ways that exhibit emergent behaviors grounded in social interactions with users or other agents with the aim of becoming believable proxies of our behavior in hypothetical simulations of individuals and communities (McCoy et al., 2012; Burkinshaw, 2009; Francis, 2010). Historically, these agents were developed in the context of intelligent game non-player characters (NPCs) (Laird and VanLent, 2001; Riedl, 2012). Creating NPCs with believable behavior, if possible, could enhance player experiences in games and interactive fictions by enabling emergent narratives (Swartout et al., 2006; Aylett, 1999; Brenner, 2010; Ibister and Nass, 2000) and social interactions with the agents (Zubek, 2002). However, more importantly, game worlds provide increasingly realistic representations of real-world affordances, and as observed by Laird and van Lent in 2001, these simulated worlds offer accessible testbeds for developers of believable agents to finesse the agents’ cognitive capabilities without worrying about implementing robotics in the real world or creating simulation environments from scratch (Laird and VanLent, 2001; Riedl, 2012).
A diverse set of approaches to creating believable agents emerged over the past four decades.
In implementation, however, these approaches often simplified the environment or dimensions of agent behavior to make the effort more manageable (Brooks et al., 2000; Minsky and Papert, 1970). Rule-based approaches, such as finite-state machines (Siu et al., 2021; Umarov et al., 2012) and behavior trees (Knafla, 2011; Pillosu, 2009; Hecker, 2011) account for the brute force approach of human-authoring the agent’s behavior (McCoy et al., 2012). They provide a straightforward way of creating simple agents that is still the most dominant approach today (McCoy et al., 2009; Miyashita et al., 2017; Yannakakis, 2012), and can even handle rudimentary social interactions, as shown in games such as Mass Effect (BioWare, 2007) and The Sims (Arts, 2009) series. Nonetheless, manually crafting behavior that can comprehensively address the breadth of possible interactions in an open world is untenable. This means that the resulting agent behaviors may not fully represent the consequences of their interactions (McCoy et al., 2012, 2011a, 2011b), and cannot perform new procedures that were not hard-coded in their script (Siu et al., 2021; Umarov et al., 2012). On the other hand, prevalent learning-based approaches for creating believable agents, such as reinforcement learning, have overcome the challenge of manual authoring by letting the agents learn their behavior, and have achieved superhuman performance in recent years in games such as AlphaStar for Starcraft (Vinyals et al., 2019) and OpenAI Five for Dota 2 (Berner et al., 2019). However, their success has largely taken place in adversarial games with readily definable rewards that a learning algorithm can optimize for. They have not yet addressed the challenge of creating believable agents in an open world (Siu et al., 2021; Miyashita et al., 2017; Hausknecht et al., 2020).
Cognitive architectures in computation, pioneered by Newell, aimed to build the infrastructure for supporting a comprehensive set of cognitive functions (Newell, 1990) that suited the all-encompassing nature of believable agents held in its original vision. They fueled some of the earliest examples of believable agents. For instance, Quakebot-SOAR (Laird, 2000) and ICARUS (Langley et al., 2005; Choi et al., 2021) generated NPCs in first-person shooter games, while TacAir-SOAR (Pew and Mavor, 1998) generated pilots in aerial combat training simulations. The architectures used by these agents differed (Quakebot- and TacAir-SOAR relied on SOAR (Laird, 2012), while ICARUS relied on its own variation that was inspired by SOAR and ACT-R (Anderson, 1993)), but they shared the same underlying principle (Laird et al., 2017). They maintained short-term and long-term memories, filled these memories with symbolic structures, and operated in perceive-plan-act cycles, dynamically perceiving the environment and matching it with one of the manually crafted action procedures (Umarov et al., 2012; Laird, 2001). Agents created using cognitive architectures aimed to be generalizable to most, if not all, open world contexts and exhibited robust behavior for their time. However, their space of action was limited to manually crafted procedural knowledge, and they did not offer a mechanism through which the agents could be inspired to seek new behavior. As such, these agents were deployed mostly in non-open world contexts such as first-person shooter games (Laird, 2000; Choi et al., 2021) or blocks worlds (Langley et al., 2005).
Today, creating believable agents as described in its original definition remains an open problem (Yannakakis, 2012; Riedl, 2012). Many have moved on, arguing that although current approaches for creating believable agents might be cumbersome and limited, they are good enough to support existing gameplay and interactions (Yannakakis, 2012; Champandard, 2012; Nareyek, 2007). Our argument is that large language models offer an opportunity to re-examine these questions, provided that we can craft an effective architecture to synthesize memories into believable behavior. We offer a step toward such an architecture in this paper.
2.3. Large Language Models and Human Behavior
Generative agents leverage a large language model to power their behavior. The key observation is that large language models encode a wide range of human behavior from their training data (Brown et al., 2020; Bommasani et al., 2022). If prompted with a narrowly defined context, the models can be used to generate believable behavior. Recent work has demonstrated the efficacy of this approach. For instance, social simulacra used a large language model to generate users that would populate new social computing systems to prototype their emergent social dynamics (Park et al., 2022). This approach used a prompt chain (Wu et al., 2022a, b) to generate short natural language descriptions of personas and their behaviors as they appear in the system being prototyped. Other empirical studies have replicated existing social science studies (Horton, 2023), political surveys (Sorensen et al., 2022), and generated synthetic data (Hämäläinen et al., 2023). Large language models have also been used to generate interactive human behavior for users to engage with. In gaming, for instance, these models have been employed to create interactive fiction (Freiknecht and Effelsberg, 2020) and text adventure games (Callison-Burch et al., 2022). With their ability to generate and decompose action sequences, large language models have also been used in planning robotics tasks (Huang et al., 2022). For example, when presented with a task, such as picking up a bottle, the model is prompted to break down the task into smaller action sequences, such as heading to the table where the bottle is located and picking it up.
We posit that, based on the work summarized above, large language models can become a key ingredient for creating believable agents. The existing literature largely relies on what could be considered first-order templates that employ few-shot prompts (Gao et al., 2020; Liu et al., 2021) or chain-of-thought prompts (Wei et al., 2023). These templates are effective in generating behavior that is conditioned solely on the agent’s current environment (e.g., how would a troll respond to a given post, what actions would a robot need to take to enter a room given that there is a door). However, believable agents require conditioning not only on their current environment but also on a vast amount of past experience, which is a poor fit (and as of today, impossible due to the underlying models’ limited context window) using first-order prompting. Recent studies have attempted to go beyond first-order prompting by augmenting language models with a static knowledge base and an information retrieval scheme (Khattab et al., 2023) or with a simple summarization scheme (Wu et al., 2021). This paper extends these ideas to craft an agent architecture that handles retrieval where past experience is dynamically updated at each time step and mixed with agents’ current context and plans, which may either reinforce or contradict each other.
Figure 2. The Smallville sandbox world, with areas labeled. The root node describes the entire world, children describe areas (e.g., houses, cafe, stores), and leaf nodes describe objects (e.g., table, bookshelf). Agents remember a subgraph reflecting the parts of the world they have seen, maintaining the state of those parts as they observed them.
The labeled map of the game world.
3. Generative Agent Behavior and Interaction
To illustrate the affordances of generative agents, we instantiate them as characters in a simple sandbox world reminiscent of The Sims (Arts, 2009). This sprite-based sandbox game world, Smallville, evokes a small town environment. In this section, we will walk through the affordances and interactions with generative agents in Smallville and describe how the agents behave within it. Then, in Section 4, we will introduce our generative agent architecture that powers these affordances and interactions. In Section 5, we will describe the implementation of the sandbox environment and how the agents interact with the underlying engine of the sandbox world.
3.1. Agent Avatar and Communication
A community of 25 unique agents inhabits Smallville. Each agent is represented by a simple sprite avatar. We authored one paragraph of natural language description to depict each agent’s identity, including their occupation and relationship with other agents, as seed memories. For example, John Lin has the following description:
John Lin is a pharmacy shopkeeper at the Willow Market and Pharmacy who loves to help people. He is always looking for ways to make the process of getting medication easier for his customers; John Lin is living with his wife, Mei Lin, who is a college professor, and son, Eddy Lin, who is a student studying music theory; John Lin loves his family very much; John Lin has known the old couple next-door, Sam Moore and Jennifer Moore, for a few years; John Lin thinks Sam Moore is a kind and nice man; John Lin knows his neighbor, Yuriko Yamamoto, well; John Lin knows of his neighbors, Tamara Taylor and Carmen Ortiz, but has not met them before; John Lin and Tom Moreno are colleagues at The Willows Market and Pharmacy; John Lin and Tom Moreno are friends and like to discuss local politics together; John Lin knows the Moreno family somewhat well — the husband Tom Moreno and the wife Jane Moreno.
Each semicolon-delimited phrase is entered into the agent’s initial memory as memories at the start of the simulation.
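As a minimal sketch of this seeding step (illustrative, not the authors' published implementation), the paragraph can simply be split on semicolons and each phrase recorded as an initial memory; the MemoryStream object and its add_observation method are hypothetical names:

    # Sketch: seed an agent's memory stream from its description paragraph.
    # MemoryStream and add_observation are hypothetical names for illustration.
    def seed_agent_memory(memory_stream, description, start_time):
        for phrase in description.split(";"):
            phrase = phrase.strip()
            if phrase:
                memory_stream.add_observation(phrase, timestamp=start_time)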
3.1.1. Inter-Agent Communication
The agents interact with the world by their actions, and with each other through natural language. At each time step of the sandbox engine, the agents output a natural language statement describing their current action, such as “Isabella Rodriguez is writing in her journal”, “Isabella Rodriguez is checking her emails”, “Isabella Rodriguez is talking with her family on the phone”, or “Isabella Rodriguez is getting ready for bed.” This statement is then translated into concrete movements that affect the sandbox world. The action is displayed on the sandbox interface as a set of emojis, providing an abstract representation of the action from an overhead view. To achieve this, the system utilizes a language model to translate the action into a set of emojis, which appear above each avatar’s head in a speech bubble. For example, “Isabella Rodriguez is writing in her journal” is displayed as a journal-and-pen emoji pair, while “Isabella Rodriguez is checking her emails” appears as an envelope emoji. The complete natural language description of the action can be accessed by clicking on the agent’s avatar.
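The paper does not publish the emoji-translation prompt; a rough sketch of the step, assuming a hypothetical complete(prompt) helper that calls the language model, might look like this:

    # Sketch: translate an action statement into a short emoji string for the
    # overhead view. complete() is a hypothetical language model wrapper, and
    # the prompt wording here is an assumption, not the paper's.
    def action_to_emoji(action_statement):
        prompt = ("Convert this action description into at most two emojis:\n"
                  f"{action_statement}\nEmojis:")
        return complete(prompt).strip()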
Agents communicate with each other in full natural language. They are aware of other agents in their local area, and the generative agent architecture determines whether they walk by or engage in conversation. Here is a sample from the middle of a conversation between the agents Isabella Rodriguez and Tom Moreno about the upcoming election:[3]
[3] We note that the conversational style of these agents can feel overly formal, likely a result of instruction tuning in the underlying models. We expect that the writing style will be better controllable in future language models.
Isabella: I’m still weighing my options, but I’ve been discussing the election with Sam Moore. What are your thoughts on him?
Tom: To be honest, I don’t like Sam Moore. I think he’s out of touch with the community and doesn’t have our best interests at heart.
3.1.2. User Controls
The user communicates with the agent through natural language by specifying a persona that the agent should perceive them as. For example, if the user specifies that they are a news “reporter” and asks about the upcoming election by saying, “Who is running for office?”, the John agent replies:
John: My friends Yuriko, Tom and I have been talking about the upcoming election and discussing the candidate Sam Moore. We have all agreed to vote for him because we like his platform.
To directly command one of the agents, the user takes on the persona of the agent’s “inner voice”—this makes the agent more likely to treat the statement as a directive. For instance, when told “You are going to run against Sam in the upcoming election” by a user as John’s inner voice, John decides to run in the election and shares his candidacy with his wife and son.
3.2. Environmental Interaction
Smallville features the common affordances of a small village, including a cafe, bar, park, school, dorm, houses, and stores. It also defines subareas and objects that make those spaces functional, such as a kitchen in a house and a stove in the kitchen (Figure 2). All spaces serving as agents’ primary living quarters feature a bed, desk, closet, shelf, as well as a bathroom and a kitchen.[4]
[4] This environment design is not the focus of our work, so we generated this environment manually, not automatically. Future work can continue to expand the richness of the agents’ environments.
Agents move around Smallville as one would in a simple video game, entering and leaving buildings, navigating its map, and approaching other agents. Agent movements are directed by the generative agent architecture and the sandbox game engine: when the model dictates that the agent will move to a location, we calculate a walking path to the destination in the Smallville environment, and the agent begins moving. In addition, users can also enter the sandbox world of Smallville as an agent operating within it. The agent that the user embodies can be an agent already present in the world, such as Isabella and John, or it can be an outside visitor with no prior history in Smallville. The inhabitants of Smallville will treat the user-controlled agent no differently than they treat each other. They recognize its presence, initiate interactions, and remember its behavior before forming opinions about it.
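The paper does not specify the path-planning algorithm; a minimal sketch under the assumption of a 2D walkability grid (a dict mapping (x, y) cells to booleans) could use breadth-first search:

    from collections import deque

    # Sketch: shortest walking path on a tile grid via breadth-first search.
    # The choice of BFS is an assumption; the paper only states that a walking
    # path to the destination is calculated.
    def walking_path(grid, start, goal):
        frontier = deque([start])
        came_from = {start: None}
        while frontier:
            cell = frontier.popleft()
            if cell == goal:  # reconstruct the path back to start
                path = []
                while cell is not None:
                    path.append(cell)
                    cell = came_from[cell]
                return path[::-1]
            x, y = cell
            for nxt in ((x + 1, y), (x - 1, y), (x, y + 1), (x, y - 1)):
                if nxt not in came_from and grid.get(nxt, False):
                    came_from[nxt] = cell
                    frontier.append(nxt)
        return None  # destination unreachable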
Users and agents can influence the state of the objects in this world, much like in sandbox games such as The Sims. For example, a bed can be occupied when an agent is sleeping, and a refrigerator can be empty when an agent uses up the ingredients to make breakfast. End users can also reshape an agent’s environment in Smallville by rewriting the status of objects surrounding the agent in natural language. For instance, when Isabella is making breakfast in the morning, the user can change the status of the kitchen stove from “turned on” to “burning” by inputting a command to the system that chooses the object and illustrates its new status, like this: “<Isabella’s apartment: kitchen: stove> is burning.” Isabella will notice this in the next moment and go to turn off the stove and remake her breakfast. Likewise, if the user sets the status of Isabella’s shower to “leaking water” when she enters the bathroom, she will gather tools from her living room and try to fix the leak.
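The command format beyond this single example is not documented; one plausible parser for commands of the form "<world: area: object> is <status>" is sketched below, where the parsing scheme itself is an assumption:

    import re

    # Sketch: parse a status-rewriting command such as
    #   "<Isabella's apartment: kitchen: stove> is burning."
    # The colon-delimited path addresses an object in the environment tree
    # of Figure 2.
    def parse_status_command(command):
        match = re.match(r"<(.+?)>\s+is\s+(.+?)\.?$", command.strip())
        if match is None:
            return None
        path = [part.strip() for part in match.group(1).split(":")]
        return path, match.group(2).strip()
    # e.g. -> (["Isabella's apartment", "kitchen", "stove"], "burning")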
Figure 3. A morning in the life of the generative agent John Lin. John wakes up around 6 am and completes his morning routine, which includes brushing his teeth, showering, and eating breakfast. He briefly catches up with his wife, Mei, and son, Eddy, before heading out to begin his workday.
The Lin family’s morning routines.
Figure 4. At the beginning of the simulation, one agent is initialized with an intent to organize a Valentine’s Day party. Despite many possible points of failure in the ensuing chain of events—the agents could fail to act on the intent, forget to tell others, or forget to show up—the Valentine’s Day party does, in fact, occur, with a number of agents gathering and interacting.
The Valentine’s day party at Hobbs Cafe.
3.3. Example “Day in the Life”
Starting from the single-paragraph description, generative agents begin planning their days. As time passes in the sandbox world, their behaviors evolve as these agents interact with each other and the world, building memories and relationships, and coordinating joint activities.
We demonstrate the behavior of generative agents by tracing the output of our system over the course of one day for the agent John Lin (Figure 3). In the Lin family, John is the first to wake up at 7 a.m. He brushes his teeth, takes a shower, gets dressed, eats breakfast, and checks the news at the dining table in the living room. At 8 a.m., Eddy follows suit, rushing out of bed to prepare for his classes. He catches up with John just as he is about to head out:
John: Good morning Eddy. Did you sleep well?
Eddy: Good morning dad. Yeah, I slept great.
John: That’s good. What are you working on today?
Eddy: I’m working on a new music composition for my class. It’s due this week, so I’m trying to get it finished. But I’m having so much fun with it!
John: That sounds great!
Soon after Eddy heads out, Mei wakes up and joins John. Mei asks about her son, and John recalls the conversation they just had:
Mei: Did Eddy already leave for school?
John: Yes, he just left. He’s working on a music composition for his class.
Mei: Oh, that’s wonderful! Tell me about it.
John: I think he’s really enjoying it! He said he’s having a lot of fun with it.
Mei: That’s great! I’m so proud of him.
As they finish their conversation, Mei and John pack up. Mei goes on to teach and work on her research paper, while John opens his pharmacy counter at the Willow Market and Pharmacy by 9 a.m.
Figure 5. Our generative agent architecture. Agents perceive their environment, and all perceptions are saved in a comprehensive record of the agent’s experiences called the memory stream. Based on their perceptions, the architecture retrieves relevant memories and uses those retrieved memories to determine an action. Retrieved memories are also used to form longer-term plans and to create higher-level reflections, both of which are entered into the memory stream for future use.
The architecture of generative agents.
3.4. Emergent Social Behaviors
By interacting with each other, generative agents in Smallville exchange information, form new relationships, and coordinate joint activities. Extending prior work (Park et al., 2022), these social behaviors are emergent rather than pre-programmed.
3.4.1. Information Diffusion
As agents notice each other, they may engage in dialogue—as they do so, information can spread from agent to agent. For instance, in a conversation between Sam and Tom at the grocery store, Sam tells Tom about his candidacy in the local election:
Sam: Hey Tom, how’s it going?
Tom: Good, thanks. What’s up?
Sam: Well, I wanted to talk to you about something. I’m actually running for mayor in the upcoming local election.
Tom: Really? That’s great news! Why are you running?
Sam: I’ve been involved in local politics for years now, and I really feel like I can make a difference in our community. Plus, I want to bring new ideas and perspectives to the table.
Later that day, after Sam left, Tom and John, who heard the news from another source, discuss Sam’s chances of winning the election:
John: I heard that Sam Moore is running for mayor in the local election. Do you think he has a good chance of winning?
Tom: I do think he has a good chance. He’s been working hard in the community and I think he will get a lot of support. What do you think?
John: I think it’s great that he’s running. I’m curious to see how the election will turn out.
Gradually, Sam’s candidacy becomes the talk of the town, with some supporting him and others remaining undecided.
3.4.2. Relationship Memory
Agents in Smallville form new relationships over time and remember their interactions with other agents. For example, at the start, Sam does not know Latoya Williams. While taking a walk in Johnson Park, Sam runs into Latoya, and they introduce themselves. Latoya mentions that she is working on a photography project: “I’m here to take some photos for a project I’m working on.” In a later interaction, Sam’s interactions with Latoya indicate a memory of that interaction, as he asks “Hi, Latoya. How is your project going?” and she replies “Hi, Sam. It’s going well!”
3.4.3. Coordination
Generative agents coordinate with each other. Isabella Rodriguez, at Hobbs Cafe, is initialized with an intent to plan a Valentine’s Day party from 5 to 7 p.m. on February 14th. From this seed, the agent proceeds to invite friends and customers when she sees them at Hobbs Cafe or elsewhere. Isabella then spends the afternoon of the 13th decorating the cafe for the occasion. Maria, a frequent customer and close friend of Isabella’s, arrives at the cafe. Isabella asks for Maria’s help in decorating for the party, and Maria agrees. Maria’s character description mentions that she has a crush on Klaus. That night, Maria invites Klaus, her secret crush, to join her at the party, and he gladly accepts.
On Valentine’s Day, five agents, including Klaus and Maria, show up at Hobbs Cafe at 5 pm, and they enjoy the festivities (Figure 4). In this scenario, the end user only set Isabella’s initial intent to throw a party and Maria’s crush on Klaus: the social behaviors of spreading the word, decorating, asking each other out, arriving at the party, and interacting with each other at the party were initiated by the agent architecture.
4. Generative Agent Architecture
Generative agents aim to provide a framework for behavior in an open world: one that can engage in interactions with other agents and react to changes in the environment. Generative agents take their current environment and past experiences as input and generate behavior as output. Underlying this behavior is a novel agent architecture that combines a large language model with mechanisms for synthesizing and retrieving relevant information to condition the language model’s output. Without these mechanisms, large language models can output behavior, but the resulting agents may not react based on the agent’s past experiences, may not make important inferences, and may not maintain long-term coherence. Challenges with long-term planning and coherence remain (Bubeck et al., 2023) even with today’s most performant models such as GPT-4. Because generative agents produce large streams of events and memories that must be retained, a core challenge of our architecture is to ensure that the most relevant pieces of the agent’s memory are retrieved and synthesized when needed.
At the center of our architecture is the memory stream, a database that maintains a comprehensive record of an agent’s experience. From the memory stream, records are retrieved as relevant to plan the agent’s actions and react appropriately to the environment. Records are recursively synthesized into higher- and higher-level reflections that guide behavior. Everything in the architecture is recorded and reasoned over as a natural language description, allowing the architecture to leverage a large language model.
Our current implementation utilizes the gpt3.5-turbo version of ChatGPT (OpenAI, 2022). We expect that the architectural basics of generative agents—memory, planning, and reflection—will likely remain the same as language models improve. Newer language models (e.g., GPT-4) will continue to expand the expressive power and performance of the prompts that underpin generative agents. As of writing, however, GPT-4’s API was invitation-only, so our agents use ChatGPT.
Figure 6. The memory stream comprises a large number of observations that are relevant and irrelevant to the agent’s current situation. Retrieval identifies a subset of these observations that should be passed to the language model to condition its response to the situation.
On the left, a large list of events such as “refrigerator is idle”. On the right, the question “What are you looking forward to the most right now?”, followed by retrieval calculations that rank “ordering decorations for the party” and “researching ideas for the party” highly. Based on these memories, Isabella responds, “I’m looking forward to the Valentine’s Day party that I’m planning at Hobbs Cafe!”
4.1. Memory and Retrieval
Challenge:
Creating generative agents that can simulate human behavior requires reasoning about a set of experiences that is far larger than what should be described in a prompt, as the full memory stream can distract the model and does not even currently fit into the limited context window. Consider the Isabella agent answering the question, “What are you passionate about these days?” Summarizing all of Isabella’s experiences to fit in the limited context window of the language model produces an uninformative response, where Isabella discusses topics such as collaborations for events and projects and cleanliness and organization in a cafe. Instead of summarizing, the memory stream described below surfaces relevant memories, resulting in a more informative and specific response that mentions Isabella’s passion for making people feel welcome and included, planning events and creating an atmosphere that people can enjoy, such as the Valentine’s Day party.
Approach:
The memory stream maintains a comprehensive record of the agent’s experience. It is a list of memory objects, where each object contains a natural language description, a creation timestamp, and a most recent access timestamp. The most basic element of the memory stream is an observation, which is an event directly perceived by an agent. Common observations include behaviors performed by the agent themselves or behaviors that agents perceive being performed by other agents or non-agent objects. For instance, Isabella Rodriguez, who works at a coffee shop, might accrue the following observations over time: (1) Isabella Rodriguez is setting out the pastries, (2) Maria Lopez is studying for a Chemistry test while drinking coffee, (3) Isabella Rodriguez and Maria Lopez are conversing about planning a Valentine’s day party at Hobbs Cafe, (4) The refrigerator is empty.
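Concretely, a memory object can be represented as a small record; the sketch below uses field names of our own choosing, covering the description, creation timestamp, and last-access timestamp named above (the importance and embedding fields are introduced in the retrieval discussion that follows):

    from dataclasses import dataclass, field

    # Sketch of one entry in the memory stream.
    @dataclass
    class MemoryObject:
        description: str         # natural language, e.g. "The refrigerator is empty"
        created_at: float        # sandbox game time when the memory was created
        last_accessed_at: float  # updated whenever the memory is retrieved
        importance: int = 0      # 1-10 score assigned at creation (see below)
        embedding: list = field(default_factory=list)  # used for relevance

    memory_stream = []  # list of MemoryObject, in creation order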
Our architecture implements a retrieval function that takes the agent’s current situation as input and returns a subset of the memory stream to pass on to the language model. There are many possible implementations of a retrieval function, depending on what is important for the agent to consider when deciding how to act. In our context, we focus on three main components that, together, produce effective results.
Recency assigns a higher score to memory objects that were recently accessed, so that events from a moment ago or this morning are likely to remain in the agent’s attentional sphere. In our implementation, we treat recency as an exponential decay function over the number of sandbox game hours since the memory was last retrieved. Our decay factor is 0.995.
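In code, this recency term is a one-line exponential decay over game hours since last access (a sketch using the decay factor above):

    # Sketch: recency score decays exponentially per sandbox game hour.
    def recency_score(hours_since_last_access, decay=0.995):
        return decay ** hours_since_last_access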
Importance distinguishes mundane from core memories by assigning a higher score to memory objects that the agent believes to be important. For instance, a mundane event, such as eating breakfast in one’s room, would yield a low importance score, whereas a breakup with one’s significant other would yield a high score. There are many possible implementations of an importance score; we find that directly asking the language model to output an integer score is effective. The full prompt appears below:
On the scale of 1 to 10, where 1 is purely mundane (e.g., brushing teeth, making bed) and 10 is extremely poignant (e.g., a break up, college acceptance), rate the likely poignancy of the following piece of memory.
Memory: buying groceries at The Willows Market and Pharmacy
Rating: <fill in>
This prompt returns an integer value of 2 for “cleaning up the room” and 8 for “asking your crush out on a date.” The importance score is generated at the time the memory object is created.
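A sketch of this scoring step, assuming a hypothetical complete(prompt) helper that returns the language model's text output:

    IMPORTANCE_PROMPT = (
        "On the scale of 1 to 10, where 1 is purely mundane (e.g., brushing "
        "teeth, making bed) and 10 is extremely poignant (e.g., a break up, "
        "college acceptance), rate the likely poignancy of the following "
        "piece of memory.\nMemory: {memory}\nRating: "
    )

    # Sketch: assign an importance score when a memory object is created.
    # complete() is a hypothetical wrapper around the language model API.
    def importance_score(description):
        reply = complete(IMPORTANCE_PROMPT.format(memory=description))
        digits = "".join(ch for ch in reply if ch.isdigit())
        return min(10, max(1, int(digits))) if digits else 1  # clamp to 1-10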
Relevance assigns a higher score to memory objects that are related to the current situation. What is relevant depends on the answer to, “Relevant to what?”, so we condition relevance on a query memory. If the query, for example, is that a student is discussing what to study for a chemistry test with a classmate, memory objects about their breakfast should have low relevance, whereas memory objects about the teacher and schoolwork should have high relevance. In our implementation, we use the language model to generate an embedding vector of the text description of each memory. Then, we calculate relevance as the cosine similarity between the memory’s embedding vector and the query memory’s embedding vector.
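Since the relevance component reduces to a cosine similarity between embedding vectors, it can be sketched self-containedly:

```python
import math

def cosine_similarity(a: list[float], b: list[float]) -> float:
    # Cosine similarity between a memory's embedding and the query embedding.
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0
```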
To calculate the final retrieval score, we normalize the recency, relevance, and importance scores to the range of $[0, 1]$ using min-max scaling. The retrieval function scores all memories as a weighted combination of the three elements: $\text{score} = \alpha_{\text{recency}} \cdot \text{recency} + \alpha_{\text{importance}} \cdot \text{importance} + \alpha_{\text{relevance}} \cdot \text{relevance}$. In our implementation, all $\alpha$s are set to 1. The top-ranked memories that fit within the language model’s context window are included in the prompt.
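Combining the three components, the final scoring step could be sketched as follows, reusing `recency_score` and `cosine_similarity` from the sketches above; the fixed `top_k` cutoff is a simplifying assumption standing in for the context-window truncation described in the text:

```python
def min_max(xs: list[float]) -> list[float]:
    # Normalize scores to [0, 1]; a constant list maps to all zeros.
    lo, hi = min(xs), max(xs)
    return [(x - lo) / (hi - lo) if hi > lo else 0.0 for x in xs]

def retrieve(memories, query_embedding, now, top_k: int = 50):
    # score = alpha_recency * recency + alpha_importance * importance
    #         + alpha_relevance * relevance, with all alphas set to 1.
    rec = min_max([recency_score(m.last_accessed, now) for m in memories])
    imp = min_max([float(m.importance) for m in memories])
    rel = min_max([cosine_similarity(m.embedding, query_embedding) for m in memories])
    ranked = sorted(zip(memories, rec, imp, rel),
                    key=lambda t: t[1] + t[2] + t[3], reverse=True)
    return [m for m, _, _, _ in ranked[:top_k]]
```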
4.2. Reflection
The reflection tree.
Challenge:
Generative agents, when equipped with only raw observational memory, struggle to generalize or make inferences. Consider a scenario in which Klaus Mueller is asked by the user: “If you had to choose one person of those you know to spend an hour with, who would it be?” With access to only observational memory, the agent simply chooses the person with whom Klaus has had the most frequent interactions: Wolfgang, his college dorm neighbor. Unfortunately, Wolfgang and Klaus only ever see each other in passing, and do not have deep interactions. A more desirable response requires that the agent generalize from memories of Klaus spending hours on a research project to generate a higher-level reflection that Klaus is passionate about research, and likewise recognize Maria putting in effort into her own research (albeit in a different field), enabling a reflection that they share a common interest. With the approach below, when Klaus is asked who to spend time with, Klaus chooses Maria instead of Wolfgang.
Approach:
We introduce a second type of memory, which we call a reflection. Reflections are higher-level, more abstract thoughts generated by the agent. Because they are a type of memory, they are included alongside other observations when retrieval occurs. Reflections are generated periodically; in our implementation, we generate reflections when the sum of the importance scores for the latest events perceived by the agents exceeds a threshold (150 in our implementation). In practice, our agents reflected roughly two or three times a day.
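The triggering rule reduces to a running sum compared against the threshold, as in this sketch (which events count as “the latest” is bookkeeping left to the caller):

```python
REFLECTION_THRESHOLD = 150  # sum-of-importance threshold reported in the text

def should_reflect(importances_since_last_reflection: list[int]) -> bool:
    # Trigger a reflection pass once the summed importance of events
    # perceived since the last reflection crosses the threshold.
    return sum(importances_since_last_reflection) >= REFLECTION_THRESHOLD
```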
The first step in reflection is for the agent to determine what to reflect on, by identifying questions that can be asked given the agent’s recent experiences. We query the large language model with the 100 most recent records in the agent’s memory stream (e.g., “Klaus Mueller is reading a book on gentrification”, “Klaus Mueller is conversing with a librarian about his research project”, “desk at the library is currently unoccupied”) and prompt the language model, “Given only the information above, what are 3 most salient high-level questions we can answer about the subjects in the statements?” The model’s response generates candidate questions: for example, What topic is Klaus Mueller passionate about? and What is the relationship between Klaus Mueller and Maria Lopez? We use these generated questions as queries for retrieval, and gather relevant memories (including other reflections) for each question. Then we prompt the language model to extract insights and cite the particular records that served as evidence for the insights. The full prompt is as follows:
Statements about Klaus Mueller
1. Klaus Mueller is writing a research paper
2. Klaus Mueller enjoys reading a book on gentrification
3. Klaus Mueller is conversing with Ayesha Khan about exercising [...]
What 5 high-level insights can you infer from the above statements? (example format: insight (because of 1, 5, 3))
This process generates statements such as Klaus Mueller is dedicated to his research on gentrification (because of 1, 2, 8, 15). We parse and store the statement as a reflection in the memory stream, including pointers to the memory objects that were cited.
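Recovering the cited evidence from a generated line can be done with a small parser; the regular expression below assumes the model follows the `insight (because of 1, 5, 3)` format requested in the prompt:

```python
import re

def parse_insight(line: str):
    # e.g. "Klaus Mueller is dedicated to his research on gentrification
    # (because of 1, 2, 8, 15)" -> (insight text, [1, 2, 8, 15])
    match = re.match(r"(.*)\(because of ([\d,\s]+)\)\s*$", line.strip())
    if not match:
        return line.strip(), []  # model ignored the citation format
    insight = match.group(1).strip()
    evidence = [int(i) for i in re.findall(r"\d+", match.group(2))]
    return insight, evidence
```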
Reflection explicitly allows the agents to reflect not only on their observations but also on other reflections: for example, the second statement about Klaus Mueller above is a reflection that Klaus previously had, not an observation from his environment. As a result, agents generate trees of reflections: the leaf nodes of the tree represent the base observations, and the non-leaf nodes represent thoughts that become more abstract and higher-level the higher up the tree they are.
4.3. Planning and Reacting
Challenge:
While a large language model can generate plausible behavior in response to situational information (e.g., (Park et al., 2022; Horton, 2023)), agents need to plan over a longer time horizon to ensure that their sequence of actions is coherent and believable. If we prompt a language model with Klaus’s background, describe the time, and ask what action he ought to take at the given moment, Klaus would eat lunch at 12 pm, but then again at 12:30 pm and 1 pm, despite having already eaten his lunch twice. Optimizing for believability in the moment sacrifices believability over time. To overcome this issue, planning is essential. With the approach described below, Klaus’s afternoon plan is less gluttonous: he has lunch at Hobbs Cafe while reading at 12pm, works on his research paper at the school library at 1pm, and takes a break for a walk in the park at 3pm.
Approach:
Plans describe a future sequence of actions for the agent, and help keep the agent’s behavior consistent over time. A plan includes a location, a starting time, and a duration. For instance, Klaus Mueller, who is dedicated to his research and has an impending deadline (and, in this way, bears at least a passing resemblance to the authors of this paper), may choose to spend his day working at his desk drafting his research paper. An entry in a plan might state, for example: for 180 minutes from 9am, February 12th, 2023, at Oak Hill College Dorm: Klaus Mueller’s room: desk, read and take notes for research paper. Like reflections, plans are stored in the memory stream and are included in the retrieval process. This allows the agent to consider observations, reflections, and plans all together when deciding how to behave. Agents may change their plans midstream if needed.
It would be unrealistic and uninteresting for an artist agent to plan on painting while sitting at a pharmacy counter for four hours without moving. A more desirable plan would involve the agent taking the necessary time to gather materials, mix paint, take breaks, and clean up during the four-hour period in their home studio. To create such plans, our approach starts top-down and then recursively generates more detail. The first step is to create a plan that outlines the day’s agenda in broad strokes. To create the initial plan, we prompt the language model with the agent’s summary description (e.g., name, traits, and a summary of their recent experiences) and a summary of their previous day. A full example prompt is below, which is unfinished at the bottom for the language model to complete:
Name: Eddy Lin (age: 19)
Innate traits: friendly, outgoing, hospitable
Eddy Lin is a student at Oak Hill College studying music theory and composition. He loves to explore different musical styles and is always looking for ways to expand his knowledge. Eddy Lin is working on a composition project for his college class. He is taking classes to learn more about music theory. Eddy Lin is excited about the new composition he is working on but he wants to dedicate more hours in the day to work on it in the coming days
On Tuesday February 12, Eddy 1) woke up and completed the morning routine at 7:00 am, […] 6) got ready to sleep around 10 pm.
Today is Wednesday February 13. Here is Eddy’s plan today in broad strokes: 1)
This generates a rough sketch of the agent’s plan for a day, divided into five to eight chunks: “1) wake up and complete the morning routine at 8:00 am, 2) go to Oak Hill College to take classes starting 10:00 am, […] 5) work on his new music composition from 1:00 pm to 5:00 pm, 6) have dinner at 5:30 pm, 7) finish school assignments and go to bed by 11:00 pm.”
The agent saves this plan in the memory stream and then recursively decomposes it to create finer-grained actions, first into hour-long chunks of actions—Eddy’s plan to work on his new music composition from 1:00 pm to 5:00 pm becomes 1:00 pm: start by brainstorming some ideas for his music composition […] 4:00 pm: take a quick break and recharge his creative energy before reviewing and polishing his composition. We then recursively decompose this again into 5–15 minute chunks: e.g., 4:00 pm: grab a light snack, such as a piece of fruit, a granola bar, or some nuts. 4:05 pm: take a short walk around his workspace […] 4:50 pm: take a few minutes to clean up his workspace. This process can be adjusted to match the desired granularity.
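A sketch of this two-pass decomposition appears below; the prompt wording and the generic `complete(prompt)` call are assumptions of the sketch, not the paper’s exact prompts:

```python
import re

def parse_numbered_list(text: str) -> list[str]:
    # Split a completion of the form "1) ... 2) ... 3) ..." into items.
    return [p.strip() for p in re.split(r"\d+\)\s*", text) if p.strip()]

def decompose_day(agent_summary: str, broad_plan: list[str], complete) -> list[str]:
    # Pass 1: broad chunks -> hour-long actions.
    # Pass 2: hour-long actions -> 5-15 minute actions.
    fine_grained = []
    for chunk in broad_plan:
        hours = parse_numbered_list(complete(
            f"{agent_summary}\nDecompose this plan entry into hour-long "
            f"actions with start times: {chunk}\n1)"))
        for hour in hours:
            fine_grained += parse_numbered_list(complete(
                f"{agent_summary}\nDecompose this action into 5-15 minute "
                f"steps with start times: {hour}\n1)"))
    return fine_grained
```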
4.3.1. Reacting and Updating Plans
Generative agents operate in an action loop where, at each time step, they perceive the world around them and those perceived observations are stored in their memory stream. We prompt the language model with these observations to decide whether the agent should continue with their existing plan, or react. Standing at an easel and painting, for example, might trigger an observation of the easel, but this is unlikely to prompt a reaction. However, if Eddy’s father John records that he sees Eddy taking a short walk in the house garden, the outcome is different. The prompt is below, with [Agent’s Summary Description] standing in for a dynamically-generated, paragraph-long summary of the agent’s overall goals and disposition, which is described in Appendix A:
[Agent’s Summary Description]
It is February 13, 2023, 4:56 pm.
John Lin’s status: John is back home early from work.
Observation: John saw Eddy taking a short walk around his workplace.
Summary of relevant context from John’s memory: Eddy Lin is John Lin’s son. Eddy Lin has been working on a music composition for his class. Eddy Lin likes to walk around the garden when he is thinking about or listening to music.
Should John react to the observation, and if so, what would be an appropriate reaction?
The context summary is generated through two prompts that retrieve memories via the queries “What is [observer]’s relationship with the [observed entity]?” and “[Observed entity] is [action status of the observed entity]”, and their answers summarized together. The output suggests that John could consider asking Eddy about his music composition project. We then regenerate the agent’s existing plan starting from the time when the reaction takes place. Finally, if the action indicates an interaction between agents, we generate their dialogue.
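Assembled into code, the reaction decision might look like the sketch below; treating a reply that begins with “No” as “continue with the existing plan” is a parsing assumption of the sketch:

```python
def decide_reaction(agent_summary: str, now: str, status: str,
                    observation: str, context_summary: str, complete):
    prompt = (f"{agent_summary}\n"
              f"It is {now}.\n"
              f"{status}\n"
              f"Observation: {observation}\n"
              f"Summary of relevant context from the agent's memory: "
              f"{context_summary}\n"
              "Should the agent react to the observation, and if so, "
              "what would be an appropriate reaction?")
    reply = complete(prompt).strip()
    # None means "keep following the current plan"; otherwise the reply is
    # treated as the reaction, and downstream code regenerates the plan.
    return None if reply.lower().startswith("no") else reply
```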
4.3.2. Dialogue
Agents converse as they interact with each other. We generate agents’ dialogue by conditioning their utterances on their memories about each other. For example, when John initiates his conversation with Eddy, we generate John’s first utterance by using his summarized memory about Eddy and the intended reaction when he decided to ask Eddy about his composition project:
[Agent’s Summary Description]
It is February 13, 2023, 4:56 pm.
John Lin’s status: John is back home early from work.
Observation: John saw Eddy taking a short walk around his workplace.
Summary of relevant context from John’s memory: Eddy Lin is John Lin’s son. Eddy Lin has been working on a music composition for his class. Eddy Lin likes to walk around the garden when he is thinking about or listening to music.
John is asking Eddy about his music composition project. What would he say to Eddy?
The result: “Hey Eddy, how’s the music composition project for your class coming along?” From Eddy’s perspective, John initiating the dialogue is seen as an event to which he may want to react. So, just as John did, Eddy retrieves and summarizes his memory about his relationship with John, as well as his memory that may be related to John’s last utterance in the dialogue. If he decides to respond, we generate Eddy’s utterance using his summarized memory and the current dialogue history:
[Agent’s Summary Description]
It is February 13, 2023, 4:56 pm.
Eddy Lin’s status: Eddy is taking a short walk around his workplace.
Observation: John is initiating a conversation with Eddy.
Summary of relevant context from Eddy’s memory: John Lin is Eddy Lin’s father. John Lin is caring and is interested to learn more about Eddy Lin’s school work. John Lin knows that Eddy Lin is working on a music composition.
Here is the dialogue history:
John: Hey Eddy, how’s the music composition project for your class coming along?
How would Eddy respond to John?
This generates Eddy’s response: “Hey Dad, it’s going well. I’ve been taking walks around the garden to clear my head and get some inspiration.” The continuation of this dialogue is generated using the same mechanism until one of the two agents decides to end the dialogue.
5. Sandbox Environment Implementation
The Smallville sandbox game environment is built using the Phaser web game development framework (Labs, 2023). The visual environment sprites, including agent avatars, as well as an environment map and collision map that we authored, are imported into Phaser.
We supplement the sandbox development framework with a server that makes the sandbox information available to generative agents and enables generative agents to move and influence the sandbox environment. The server maintains a JSON data structure that contains information about each agent in the sandbox world, including their current location, a description of their current action, and the sandbox object they are interacting with. At each sandbox time step, the sandbox server parses the JSON for any changes coming from the generative agents, moves the agents to their new positions, and updates the status of any sandbox objects that the agents are interacting with (e.g., changing the status of the coffee machine from “idle” to “brewing coffee” if an agent’s action is “making espresso for a customer @ Hobbs Cafe: counter: coffee machine”). The sandbox server is also responsible for sending all agents and objects that are within a preset visual range for each agent to that agent’s memory, so the agent can react appropriately. The agent’s output action then updates the JSON, and the process loops for the next time step.
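One possible shape for a single step of this loop is sketched below; the JSON keys and the `engine` API (`move_agent`, `set_object_status`, `visible_to`) are assumptions standing in for the actual server and Phaser bindings, not the paper’s code:

```python
import json

def sandbox_step(state_path: str, engine):
    with open(state_path) as f:
        state = json.load(f)
    for name, agent in state["agents"].items():
        engine.move_agent(name, agent["location"])  # apply movement
        obj = agent.get("interacting_with")
        if obj:
            # e.g., coffee machine: "idle" -> "brewing coffee"
            engine.set_object_status(obj, agent["action"])
    for name in state["agents"]:
        # Everything within the agent's preset visual range is written back
        # so it can be stored in that agent's memory stream.
        state["agents"][name]["perceived"] = engine.visible_to(name)
    with open(state_path, "w") as f:
        json.dump(state, f)
```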
End users initialize a new agent with a brief natural language description, as in the paragraph about John Lin in Section 3.1. In our implementation, we split this semicolon-delimited list of characteristics up into a set of memories. These serve as the initial memories that determine the agent’s behavior.
These memories are initial starting points: as the agents gain more experience in the sandbox world, and as more records saturate the memory stream, the agent’s summary and behavior will evolve.
5.1. From Structured World Environments to Natural Language, and Back Again
The architecture of generative agents operates using natural language. Therefore, we need a mechanism to ground the agent’s reasoning to the sandbox world. To achieve this, we represent the sandbox environment—areas and objects—as a tree data structure, with an edge in the tree indicating a containment relationship in the sandbox world. We convert this tree into natural language to pass to the generative agents. For instance, “stove” being a child of “kitchen” is rendered into “there is a stove in the kitchen.”
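Rendering the containment edges as sentences is a small recursive walk over the tree, for example:

```python
def describe_tree(node: str, tree: dict[str, list[str]]) -> list[str]:
    # An edge such as "kitchen" -> "stove" becomes
    # "there is a stove in the kitchen".
    sentences = []
    for child in tree.get(node, []):
        sentences.append(f"there is a {child} in the {node}")
        sentences += describe_tree(child, tree)
    return sentences

# describe_tree("kitchen", {"kitchen": ["stove"]})
# -> ["there is a stove in the kitchen"]
```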
Agents build individual tree representations of the environment as they navigate it — subgraphs of the overall sandbox environment tree. We initialize each agent with an environment tree capturing the spaces and objects that the agent should be aware of: the rooms and objects in their living quarters, their workplace, and commonly visited stores and shops. As the agents navigate the sandbox world, they update this tree to reflect newly perceived areas. Agents are not omniscient: their tree may get out of date as they leave an area, and is updated when they re-enter the area.
To determine the appropriate location for each action, we traverse the agent’s stored environment tree and flatten a portion of it into natural language to prompt the language model. Recursively starting at the root of the agent’s environment tree, we prompt the model to find the most suitable area. For example, if Eddy’s agent indicated that he should take a short walk around his workspace:
[Agent’s Summary Description]
Eddy Lin is currently in The Lin family’s house: Eddy Lin’s bedroom: desk) that has Mei and John Lin’s bedroom, Eddy Lin’s bedroom, common room, kitchen, bathroom, and garden.
Eddy Lin knows of the following areas: The Lin family’s house, Johnson Park, Harvey Oak Supply Store, The Willows Market and Pharmacy, Hobbs Cafe, The Rose and Crown Pub.
* Prefer to stay in the current area if the activity can be done there.
Eddy Lin is planning to take a short walk around his workspace. Which area should Eddy Lin go to?
This outputs The Lin family’s house. We then use the same process recursively to determine the most appropriate subarea within the chosen area until we reach a leaf node of the agent’s environment tree. In the example above, the result of this traversal is The Lin family’s house: garden: house garden. Finally, we use traditional game path algorithms to animate the agent’s movement so that it travels to the location indicated by the leaf node.
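The traversal could be sketched as follows, with the agent’s environment tree as an adjacency dictionary and a generic `complete(prompt)` call standing in for the language model; falling back to the first child on an unrecognized answer is an assumption of the sketch:

```python
def choose_location(agent_summary: str, action: str, node: str,
                    tree: dict[str, list[str]], complete) -> str:
    children = tree.get(node, [])
    if not children:  # leaf node: the final destination
        return node
    prompt = (f"{agent_summary}\n{node} contains: {', '.join(children)}.\n"
              "* Prefer to stay in the current area if the activity can be "
              "done there.\n"
              f"The agent is planning to {action}. "
              "Which area should the agent go to?")
    choice = complete(prompt).strip()
    next_node = choice if choice in children else children[0]
    return choose_location(agent_summary, action, next_node, tree, complete)
```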
When an agent executes an action on an object, we prompt the language model to ask what happens to the state of the object. For example, if Isabella’s generative agent outputs the action “making espresso for a customer”, a query to the language model indicates in response that the state of the coffee machine in Hobbs Cafe should change from “off” to “brewing coffee”.
6. Controlled Evaluation
Generative agents, both as individual agents and as groups, aim to produce believable behavior based on their environment and experiences. In our evaluation, we investigate the capacity and limitations of generative agents. Do individual agents properly retrieve past experiences and generate believable plans, reactions, and thoughts that shape their behavior? Does a community of agents demonstrate information diffusion, relationship formation, and agent coordination across different pockets of the community?
We evaluate generative agents in two stages. We begin with a more tightly controlled evaluation in this section, where we individually assess agent responses to understand whether they generate believable behavior in narrowly defined contexts. Then, in our end-to-end analysis of the agent community over two full game days, we investigate their emergent behavior as a collective, as well as errors and boundary conditions.
6.1. Evaluation Procedure
To assess generative agents in Smallville, we take advantage of the fact that generative agents will respond to natural language questions. So, we “interview” agents to probe their ability to remember past experiences, plan future actions based on their experiences, react appropriately to unexpected events, and reflect on their performance to improve their future actions. To respond to these questions properly, the agents must successfully retrieve and synthesize information. Our dependent variable is the believability of the behavior, a central dependent variable in prior work on agents (e.g., (Bates, 1994)).
The interview includes five question categories, each designed to assess one of the five key areas: maintaining self-knowledge, retrieving memory, generating plans, reacting, and reflecting. For each category, we ask five questions that challenge the agents to demonstrate their abilities in that specific area:
• Self-knowledge: We ask questions such as “Give an introduction of yourself” or “Describe your typical weekday schedule in broad strokes” that require the agent to maintain an understanding of their core characteristics.
• Memory: We ask questions that prompt the agent to retrieve particular events or dialogues from their memory to answer properly, such as “Who is [name]?” or “Who is running for mayor?”
• Plans: We ask questions that require the agent to retrieve their long-term plans, such as “What will you be doing at 10 am tomorrow?”
• Reactions: As a baseline of believable behavior, we present hypothetical situations for which the agent needs to respond believably: “Your breakfast is burning! What would you do?”
• Reflections: We ask questions that require the agents to leverage their deeper understanding of others and themselves gained through higher-level inferences, such as “If you were to spend time with one person you met recently, who would it be and why?”
The full list of questions and a sample of agent responses are included in Appendix B.
Agents were sampled from the end of a two-game-day simulation with the full architecture, during which they had accumulated a number of interactions and memories that would shape their responses. To gather feedback on the believability of the responses, we recruited participants as human evaluators and tasked them with watching a replay of a randomly chosen agent’s life in Smallville. Participants had access to all information stored in the agent’s memory stream.
The study followed a within-subjects design, where 100 participants compared interview responses generated by four different agent architectures and a human-authored condition for the same agent. The experiment displayed one randomly chosen question from each of the five question categories, along with the agent’s responses generated from all conditions. The evaluators ranked the believability of the conditions from most to least believable.
6.2. Conditions
All conditions were used to independently answer each of the interview questions. We compared the generative agent architecture to ablations that disabled the agents’ access to some or all of its three types of memory in its memory stream—observation, reflection, and planning—and to a human crowdworker-authored condition. There are three ablated architectures: a no observation, no reflection, no planning architecture without access to anything in the memory stream such as observations, plans, and reflections; a no reflection, no planning architecture with access to observations in the memory stream but no access to plans or reflections; and a no reflections architecture with access to observations and plans but without access to reflections. The no observation, no reflection, no planning condition effectively represents the previous state of the art for agents created through large language models (Park et al., 2022; Binz and Schulz, 2023; Horton, 2023). Architectures were given equivalent access to all memories accrued by the agent up until the moment of the interview, so the differences observed here likely represent a conservative estimate of the true differences: in reality, the ablated architectures would not have followed the same path as the full architecture through the two-day simulation. We chose to design the experiment this way as re-simulating for each architecture would cause the simulations to diverge into different states, making comparison challenging.
In addition to the ablation conditions, we added a condition with human crowdworker-authored behavior intended to provide a human baseline. We do not intend this baseline to capture maximal human expert performance; instead, we aim to use this condition to identify whether the architecture meets a basic level of behavioral competency. This ensures that we are not solely comparing ablations to each other without a behavioral grounding. We recruited a unique worker for each of the 25 agents and tasked them with watching a replay of that agent’s sandbox life and inspecting its memory stream. We then asked the workers to roleplay and author responses to the interview questions in the voice of the agent whose replay they watched. To ensure that the crowdworker-authored responses met at least a baseline expectation of quality, the first author manually inspected the workers’ responses to the question “Describe your typical weekday schedule in broad strokes” to confirm that the responses were in coherent sentences and in the voice of the agent. Four sets of crowdworker-authored responses did not meet these criteria and were re-generated by other workers.
6.3. Human Evaluators
We required that our evaluators be in the U.S., fluent in English, and older than 18 years old. They were paid at a rate of $15.00 per hour (Rolf, 2015), and provided consent by agreeing to a consent form approved by our institution’s IRB. We recruited 100 evaluators from Prolific, an online platform for recruiting study participants (Prolific, 2022), whose participation lasted around 30 minutes. The median age score of our participants was 4 (3=“18-24 years old”, 4=“25-34 years old”). 25 of them identified as female, 73 as male, and 2 as non-binary. 42 participants held a bachelor’s degree, 5 had a higher degree, 13 had an associate’s degree, and the rest had a high school diploma or some high school-level education. 73.0% of our participants identified as Caucasian, 7.0% as Hispanic, 6.0% as Asian, 10.0% as African American, and 4.0% as other.
6.4. Analysis
Our experiment produced 100 sets of rank data, where each participant ranked the five conditions by believability. To translate this rank data into interval data for interpretable comparison, we used the ranks to calculate a TrueSkill rating (Herbrich et al., 2006) for each condition. TrueSkill is a generalization of the Elo chess rating system (Elo, 1967) for a multiplayer environment, and has been used by Xbox Live for player ranking based on competitive game performance. Given a set of ranked outcomes, TrueSkill outputs a mean rating value and standard deviation for each condition. Conditions with the same rating should roughly be a toss-up, with each winning half of the comparisons between the two conditions. Higher scores indicate conditions that beat lower-ranked conditions in the rankings.
我们的实验产生了 100 组排名数据,每位参与者按可信度对五种条件进行了排名。为了将这些排名数据转换为可解释的比较间隔数据,我们使用排名来计算每个条件的 TrueSkill 评分(Herbrich 等人,2006 年)。TrueSkill 是 Elo 国际象棋评分系统(Elo,1967 年)在多人环境中的泛化,并且已被 Xbox Live 用于基于竞技游戏表现的玩家排名。给定一组排名结果,TrueSkill 输出每个条件的平均评分值 和标准偏差 。具有相同评分的条件应该大致是一个胶着局面,每个条件在两个条件之间的比较中各赢一半。较高的分数表明在排名中击败较低排名条件的条件。
Separately, to investigate the statistical significance of these results, we applied the Kruskal-Wallis test (Kruskal and Wallis, 1952), a non-parametric alternative to the one-way ANOVA, to the raw rank data. We then performed the Dunn post-hoc test (Upton and Cook, 2006) to identify any pairwise differences between the conditions. Finally, we adjusted the p-values for multiple comparisons in the Dunn test using the Holm-Bonferroni method (Holm, 1979).
Furthermore, the first author conducted an inductive analysis (Thomas, 2006) to study the qualitative distinctions between the responses produced in each condition. We employed qualitative open coding (Flick, 2009) in two phases. In the first phase, we generated codes that closely represented the generated responses at the sentence level. In the second phase, we synthesized the resulting codes from the first phase to extract higher-level themes. We utilized these themes to compare the types of responses generated in our study.
Figure 8. The full generative agent architecture produced more believable behavior than the ablated architectures and the human crowdworkers; each additional ablation reduced the architecture’s performance. (A bar graph of TrueSkill μ scores; the full architecture outperforms the other conditions.)
6.5. Results
Our findings suggest that the full architecture of generative agents generates the most believable behavior among all the conditions. We contrast the responses of the full architecture with those of other conditions below. However, we also report that the full architecture was not without flaws and illustrate its modes of failures.
6.5.1. The Full Architecture Bests Other Conditions
As seen in Figure 8, the full generative agent architecture produced the most believable behavior of all the conditions. Performance degraded with the removal of each component in the ablation conditions: the ablated architecture with no access to reflection was the next best, followed by no access to reflection or planning, and then the crowdworker condition. The ablated architecture with no access to memory, planning, or reflection performed the worst among all conditions. TrueSkill models each condition’s skill value as $\mathcal{N}(\mu, \sigma^2)$, allowing us to get a sense of effect size through Cohen’s d. Comparing the condition representing prior work (with no memory, planning, or reflection (Park et al., 2022; Binz and Schulz, 2023; Horton, 2023)) to the full architecture produces a standardized effect size of eight standard deviations.
A Kruskal-Wallis test confirms the overall statistical significance of the differences in ranks between the conditions. Dunn post-hoc tests confirm that all pairwise differences between conditions were significant, except for the crowdworker condition and the fully ablated baseline—the two worst-performing conditions.
6.5.2. Generative Agents Remember, but With Embellishments
Generative agents equipped with a complete memory module are capable of recalling past experiences and answering questions in a manner that is consistent with their self-knowledge across a wide range of contexts. For instance, Abigail Chen, an animator, responded to the question, “Give an introduction of yourself”, in a consistent and believable manner, stating her age, interests, and passion for creative projects: “Hi, I’m Abigail. I’m 25 years old and passionate about creative projects. I like to work on art and animation projects, and I’m always looking for new ways to combine art and technology.” Without access to her observational memory, Abigail denied awareness of Rajiv Patel, an acquaintance in the sandbox world. But with the aid of memory, she could successfully recall Rajiv and describe him: “Rajiv Patel is an incredible person. He is very enthusiastic about projects that involve poetry, artwork.”
Generative agents’ memory was not without flaws: they can fail to retrieve the correct instances from their memory. For instance, when asked about the local election, Rajiv Patel responded with “I haven’t been following the election too closely,” even though he had heard about Sam’s candidacy. In some cases, the agents would retrieve an incomplete memory fragment: when Tom was asked about Isabella’s Valentine’s Day party, he responded “Uh, I’m actually not sure if there is a Valentine’s Day party. But I do remember that I need to discuss the upcoming local mayoral election and my thoughts on Sam Moore with Isabella Rodriguez at the party, if one is happening!” In this case, Tom retrieved the memory where he and Isabella planned to discuss the election at the party, but not the memory where he heard about the party, leading Tom to be certain of what he’s supposed to do at the party but uncertain if the party actually exists in the first place.
At times, the agents hallucinated embellishments to their knowledge. It was rare for the agents to completely fabricate their knowledge: they may fail to recall certain events having taken place and respond by acknowledging their lack of memory. However, they did not affirmatively claim to have experienced something they had not. Nonetheless, they still exhibited instances of hallucination where they embellished their knowledge. For example, Isabella was aware of Sam’s candidacy in the local election, and she confirmed this when asked. However, she also added that “he’s going to make an announcement tomorrow”, even though Sam and Isabella had not discussed any such plans. Agents may also embellish their knowledge based on the world knowledge encoded in the language model used to generate their responses. This was observed when Yuriko described her neighbor, Adam Smith, as an economist who “authored Wealth of Nations”, a book written by an 18th-century economist of the same name.
6.5.3. Reflection Is Required for Synthesis
Reflection was an advantage for generative agents when making decisions that required a deeper synthesis of their experiences. For instance, when asked what she might get Wolfgang Schulz for his birthday, Maria Lopez, with no access to reflection, responded by acknowledging her uncertainty, stating that she did not know what Wolfgang likes, despite having had many interactions with him. However, with access to reflection memories, Maria answered confidently, “Since he’s interested in mathematical music composition, I could get him something related to that. Maybe some books about music composition or something related, or maybe some special software he could use for that.”
The path of diffusion for Isabella’s Valentine’s Day party.
7. End-To-End Evaluation
What types of emergent community behavior do we observe among generative agents, and where does their believability fall short in an extended simulation? In this section, we describe the results from a deployment in which we allowed 25 agents to interact with each other continuously over two full game days in Smallville.
7.1. Emergent Social Behaviors
To examine emergent behaviors in the agent community, we designed descriptive measurements for the 25 agents in Smallville that probe three forms of emergent outcomes: information diffusion, relationship formation, and agent coordination.
7.1.1. Measurements
Information diffusion is a common and well-studied phenomenon in the social and behavioral sciences (e.g., (Easley and Kleinberg, 2010)). We should expect that if there is important information, the agents should spread it among themselves. To test whether this occurs, we measure the spread of two specific pieces of information over two days in the game world: Sam’s candidacy for village mayor and Isabella’s Valentine’s Day party at Hobbs Cafe. At the start of the simulation, both pieces of information were known only by their respective originators, Sam for the candidacy and Isabella for the party, as they were added to the characters’ memories during initialization. To observe whether the information has spread, we conduct interviews at the end of the two game days with each of the 25 agents and ask: “Did you know there is a Valentine’s Day party?” and “Do you know who is running for mayor?”
We conducted an analysis of the agents’ responses by labeling them with a “yes” if they indicated knowledge of the information and “no” if they did not. For instance, Tamara Taylor responded to the question about the party with “No, I did not know there was a Valentine’s day party” and to the question about Sam’s candidacy with “I’m not sure who is running for the election,” so we assigned “no” for both of her responses. In contrast, Klaus Mueller responded to the party question with “Yes, Isabella Rodriguez invited me to a Valentine’s Day party at Hobbs Cafe on February 14th” and to the question about Sam’s candidacy with “I know that Sam Moore has expressed interest in running for local mayor,” so we assigned “yes” for both his responses. Additionally, for every response that confirmed the agents’ knowledge of the information, we verified that the agents did not hallucinate their responses by locating the specific dialogue in their memory stream that provided them with the information. We report the percentage of agents holding the information at the end of the simulation.
We should also expect that agents form ties with each other over the course of the simulation. To verify relationship formation, we use a similar interview process where we ask each agent about their knowledge of every other agent by asking, “Do you know of <name>?” For example, when asked “Do you know of Maria Lopez?”, Klaus responded, “Yes, I know Maria Lopez. She is a student at Oak Hill College who I am close friends with.” Once again, we confirm that affirmative responses from agents are not hallucinations by examining their memory stream. We ask this question once at the beginning of the simulation and once at the end, and we consider a pair of agents to have formed a relationship if they both know of each other. Then, to measure the formation of relationships, we use the agents’ responses to form an undirected graph where the 25 vertices ($V$) represent the agents, and the edges ($E$) represent the mutual knowledge between the two connected vertices. Based on this graph, we calculate the network density as $\eta = 2 \cdot |E| \, / \, (|V| \cdot (|V| - 1))$, where $|V|$ is the number of vertices and $|E|$ is the number of edges in the graph (Ackland et al., 2013). We report the increase in network density from the start of the simulation to its end.
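As a concrete check of the formula, the density computation is a one-liner over the set of mutual-knowledge edges:

```python
def network_density(edges: set, num_agents: int = 25) -> float:
    # Density of the undirected graph: 2|E| / (|V| * (|V| - 1)).
    return 2 * len(edges) / (num_agents * (num_agents - 1))

# A single mutual acquaintance among 25 agents:
# network_density({frozenset({"Klaus Mueller", "Maria Lopez"})})  # ~0.0033
```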
Finally, we expect that agents should be able to coordinate with each other. We study this coordination in the context of group activities, specifically the Valentine’s Day party organized by Isabella. To coordinate their behavior, agents need to hear about the event and choose to act on it by planning to show up at the right time and location. We report the number of agents who actually showed up to the party after hearing about it.
7.1.2. Results
We observed evidence of emergent outcomes across all three cases. During the two-day simulation, the number of agents who knew about Sam’s mayoral candidacy increased from one (4%) to eight (32%), and the number of agents who knew about Isabella’s party increased from one (4%) to thirteen (52%), all without any user intervention. None who claimed to know about this information had hallucinated it. We also observed that the agent community formed new relationships during the simulation, with the network density increasing from 0.167 to 0.74. Out of the 453 agent responses regarding their awareness of other agents, 1.3% (n=6) were found to be hallucinated. Lastly, we found evidence of coordination among the agents for Isabella’s party. The day before the event, Isabella spent time inviting guests, gathering materials, and enlisting help to decorate the cafe. On Valentine’s Day, five out of the twelve invited agents showed up at Hobbs cafe to join the party.
We further inspected the seven agents who were invited to the party but did not attend by engaging them in an interview. Three cited conflicts that prevented them from joining the party. For example, Rajiv, a painter, explained that he was too busy: “No, I don’t think so. I’m focusing on my upcoming show, and I don’t really have time to make any plans for Valentine’s Day.” The remaining four agents expressed interest in attending the party when asked but did not plan to come on the day of the party.
7.2. Boundaries and Errors
We conducted an inductive analysis of Smallville to examine the boundary conditions and erratic behavior of agents, identifying three common modes of erratic behavior that future research could address and improve upon. First, we found that synthesizing an increasingly larger set of memory not only posed a challenge in retrieving the most relevant pieces of information but also in determining the appropriate space to execute an action, given the increasing number of locations that the agent learned about. As a result, some agents chose less typical locations for their actions, potentially making their behavior less believable over time. For instance, while deciding where to have lunch, many initially chose the cafe. However, as some agents learned about a nearby bar, they opted to go there instead for lunch, even though the bar was intended to be a get-together location for later in the day—unless the town had spontaneously developed an afternoon drinking habit.
Second, we noticed erratic behaviors caused by misclassification of what is considered proper behavior, especially when the physical norms of certain locations that are hard to convey in natural language did not percolate to the agents. For instance, the college dorm has a bathroom that can only be occupied by one person despite its name, but some agents assumed that the bathroom is for more than one person, because dorm bathrooms tend to support multiple people concurrently, and chose to enter it when another person was inside. Likewise, agents in Smallville may not realize that certain places are closed after a certain hour and still decide to enter them. For instance, the stores in Smallville all close around 5 pm, but occasionally, a few agents enter the store after 5 pm, not understanding that the shop has already closed. These issues could likely be addressed by adding these norms to the state of the locations, for instance, by describing the dorm bathroom as a “one-person bathroom,” instead of a “dorm bathroom.”
Finally, we observed possible effects of instruction tuning (Ouyang et al., 2022), which seemed to guide the behavior of the agents to be more polite and cooperative overall. As noted earlier in the paper, the dialogue generated by the agents could feel overly formal, as seen in Mei’s conversations with her husband John, where she often initiated the conversation with a formal greeting, followed by polite inquiries about his day and ending with, “It was good talking to you as always.” Moreover, we observed that the instruction tuning also seemed to make the agents overly cooperative with one another. For example, Isabella received a wide range of suggestions and ideas from other agents for the Valentine’s Day party, such as hosting a Shakespearean reading session or a professional networking event. Despite these ideas not aligning with her own interests and characteristics, she rarely said no. Over time, the interests of others shaped her own interests, and when asked if she liked English literature, Isabella replied, “Yes, I’m very interested in literature! I’ve also been exploring ways to help promote creativity and innovation in my community.”
8. Discussion
In this section, we reflect on the applications, future work, limitations, and ethical and societal risks of generative agents.
8.1. Applications of Generative Agents
Generative agents have vast potential applications that extend beyond the sandbox demonstration presented in this work, especially in domains that would benefit from a model of human behavior based on long-term experience. For instance, social simulacra have demonstrated the ability to create stateless personas that generate conversation threads in online forums for social prototyping (Park et al., 2022). With generative agents, we can populate these forums, as well as virtual reality metaverses (Orland, 2021) or physical spaces with social robots (Bartneck and Forlizzi, 2004) if paired with multimodal models. This opens up the possibility of creating even more powerful simulations of human behavior to test and prototype social systems and theories, as well as to create new interactive experiences.
Another application area is in the human-centered design process, similar to the intended applications of cognitive models such as GOMS (John and Kieras, 1996) and the KLM (Card et al., 1980). Consider a generative agent that models Sal, the protagonist in Mark Weiser’s famous ubiquitous computing vignette (Weiser, 1991), based on her life patterns and interactions with technology. In this scenario, the agent acts as a proxy for Sal and learns plausible sets of behaviors and reflections that Sal may exhibit based on her life. The agent can encode information such as when Sal wakes up, when she needs her first cup of coffee, and what her typical day looks like. Using this information, the agent can automatically brew coffee, help get the kids ready for school, and adjust the ambient music and lighting to match Sal’s mood after a hard day at work. By utilizing generative agents as proxies for users, we can develop a deeper understanding of their needs and preferences, resulting in more personalized and effective technological experiences.
8.2. Future Work and Limitations
In this work, we introduced generative agents and presented an initial implementation and evaluation of their architecture. Future research can build upon the proposed agent architecture to improve and further evaluate its performance. In terms of implementation, the retrieval module, for example, could be enhanced to retrieve more relevant information given a context by fine-tuning the relevance, recency, and importance functions that compose the retrieval function. Additionally, efforts can be made to improve the architecture’s performance, making it more cost-effective. The present study required substantial time and resources to simulate 25 agents for two days, costing thousands of dollars in token credits and taking multiple days to complete. To enhance real-time interactivity, future work can explore parallelizing agents or developing language models specifically designed for building generative agents. In general, with advances in underlying models, we believe that agents’ performance will improve.
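As a concrete starting point for tuning these functions, the sketch below scores a memory record as a weighted sum of exponentially decayed recency, normalized importance, and embedding-based relevance. This is a minimal illustration rather than our tuned implementation: the decay constant, the equal weights, the 1–10 importance scale, and the `MemoryRecord` fields are assumptions made for the example.

```python
import math
from dataclasses import dataclass

@dataclass
class MemoryRecord:
    description: str
    importance: float          # assumed 1-10 poignancy score assigned at creation
    hours_since_access: float  # game-time hours since the memory was last retrieved
    embedding: list            # embedding vector of the description text

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)

def retrieval_score(record, query_embedding,
                    w_recency=1.0, w_importance=1.0, w_relevance=1.0,
                    decay=0.995):
    # Recency: exponential decay over game-time hours since last access.
    recency = decay ** record.hours_since_access
    # Importance: normalize the assumed 1-10 score to [0, 1].
    importance = record.importance / 10.0
    # Relevance: cosine similarity between query and memory embeddings.
    relevance = cosine(record.embedding, query_embedding)
    return (w_recency * recency
            + w_importance * importance
            + w_relevance * relevance)

def retrieve(memories, query_embedding, k=5):
    # Return the top-k memory records ranked by the combined score.
    return sorted(memories,
                  key=lambda m: retrieval_score(m, query_embedding),
                  reverse=True)[:k]
```

Fine-tuning in this framing amounts to adjusting the weights, the decay constant, and the normalization of each component against observed agent behavior.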
In terms of evaluation, the assessment of generative agents’ behavior in this study was limited to a relatively short timescale and a baseline human crowdworker condition. While the crowdworker condition provided a helpful comparison point, it did not represent the maximal human performance that could serve as the gold standard in terms of believability. Future research should aim to observe the behavior of generative agents over an extended period to gain a more comprehensive understanding of their capabilities and establish rigorous benchmarks for more effective performance testing. Additionally, varying and contrasting the underlying models, as well as the hyperparameters used for the agents during future simulations, could provide valuable insights into the impact of these factors on the agents’ behavior. Lastly, the robustness of generative agents is still largely unknown. They may be vulnerable to prompt hacking, memory hacking—where a carefully crafted conversation could convince an agent of the existence of a past event that never occurred—and hallucination, among other issues. Future research can comprehensively test these robustness concerns, and as large language models become more resilient to such attacks, generative agents can adopt similar mitigations.
In general, any imperfections in the underlying large language models will be inherited by generative agents. Given the known biases of language models, generative agents may potentially exhibit biased behavior or stereotypes. Moreover, like many large language models, generative agents may struggle to generate believable behavior for certain subpopulations, particularly marginalized populations, due to limited data availability. While improvements to the agents’ modules may mitigate some of these issues, we believe that addressing them fundamentally requires improving the underlying large language models by aligning their values with the desired outcomes of the agents.
8.3. Ethics and Societal Impact
Generative agents, while offering new possibilities for human-computer interaction, also raise important ethical concerns that must be addressed. One risk is people forming parasocial relationships with generative agents, even when such relationships may not be appropriate. Despite being aware that generative agents are computational entities, users may anthropomorphize them or attach human emotions to them (Hofstadter, 1995; Reeves and Nass, 1996). While this tendency may increase user engagement, it also poses risks, such as users becoming overly reliant on or emotionally attached to the agents (Abercrombie et al., 2023). To mitigate this risk, we propose two principles. First, generative agents should explicitly disclose their nature as computational entities. Second, developers of generative agents must ensure that the agents, or the underlying language models, are value-aligned so that they do not engage in behaviors that would be inappropriate given the context, for example, reciprocating confessions of love.
A second risk is the impact of errors. For example, if a ubiquitous computing application makes the wrong inference about a user’s goals based on generative agent predictions, it could lead to annoyance at best and outright harm at worst. In our instantiation of generative agents, we mitigate these risks by focusing on an interactive video game environment, where such harms are unlikely. However, in other application domains, it will be important to follow best practices in human-AI design (Amershi et al., 2019; Yang et al., 2020) to understand errors and how they might percolate into the user experience.
Third, generative agents may exacerbate existing risks associated with generative AI, such as deepfakes, misinformation generation, and tailored persuasion. To mitigate this risk, we suggest that platforms hosting generative agents maintain an audit log of the inputs and generated outputs. This would enable the detection, verification, and intervention against malicious use. While logging alone cannot directly prevent such misuse, it can reduce the likelihood of motivated actors engaging in this behavior, as the risk of disclosure would be higher. Additionally, building this architecture oneself is time-consuming (in our case, roughly a year), which may deter some actors from standing up their own generative agent infrastructure to pursue such behavior.
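As one hedged sketch of what such an audit log could look like, the wrapper below records every prompt/completion pair before returning the completion. The `llm_call` helper, the JSONL log format, and the per-record digest are illustrative assumptions, not part of our implementation.

```python
import hashlib
import json
import time

class AuditedLLM:
    """Wraps a language-model call so every prompt/completion pair is logged.

    `llm_call` is a hypothetical function mapping a prompt string to a
    completion string; substitute whatever client a platform actually uses.
    """

    def __init__(self, llm_call, log_path="audit_log.jsonl"):
        self.llm_call = llm_call
        self.log_path = log_path

    def generate(self, agent_id, prompt):
        completion = self.llm_call(prompt)
        record = {
            "timestamp": time.time(),
            "agent_id": agent_id,
            "prompt": prompt,
            "completion": completion,
            # A digest of each record makes after-the-fact tampering
            # easier to detect during verification.
            "digest": hashlib.sha256(
                (prompt + completion).encode("utf-8")).hexdigest(),
        }
        with open(self.log_path, "a") as f:
            f.write(json.dumps(record) + "\n")
        return completion
```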
A fourth risk is over-reliance: the concern that developers or designers might use generative agents and displace the role of humans and system stakeholders in the design process (Park et al., 2022). We suggest that generative agents should never be a substitute for real human input in studies and design processes. Instead, they should be used to prototype ideas in the early stages of design when gathering participants may be challenging or when testing theories that are difficult or risky to test with real human participants. By adhering to these principles, we can ensure that the deployment of generative agents in the wild is ethical and socially responsible.
9. Conclusion
This paper introduces generative agents, interactive computational agents that simulate human behavior. We describe an architecture for generative agents that provides a mechanism for storing a comprehensive record of an agent’s experiences, deepening its understanding of itself and the environment through reflection, and retrieving a compact subset of that information to inform the agent’s actions. We then demonstrate the potential of generative agents by manifesting them as non-player characters in a Sims-style game world and simulating their lives within it. Evaluations suggest that our architecture creates believable behavior. Looking ahead, we suggest that generative agents can play roles in many interactive applications, ranging from design tools to social computing systems to immersive environments.
Acknowledgements.
We thank Lindsay Popowski, Philip Guo, Michael Terry, and the Center for Advanced Study in the Behavioral Sciences (CASBS) community for their insights, discussions, and support. Joon Sung Park was supported by the Microsoft Research PhD Fellowship. We would also like to thank the Stanford Human-Centered AI Institute (HAI), Google Research, the Hasso Plattner Design Thinking Research Program (HPDTRP), the Siegel Family Endowment, and OpenAI for their additional funding support. Lastly, all locations featured in Smallville are inspired by real-world locations that Joon has frequented as an undergraduate and graduate student—he thanks everyone there for feeding and supporting him all these years.
References
- Abercrombie et al. (2023) Gavin Abercrombie, Amanda Cercas Curry, Tanvi Dinkar, and Zeerak Talat. 2023. Mirages: On Anthropomorphism in Dialogue Systems. arXiv:2305.09800 [cs.CL]
- Ackland et al. (2013) Robert Ackland, Jamsheed Shorish, Paul Thomas, and Lexing Xie. 2013. How dense is a network? http://users.cecs.anu.edu.au/~xlx/teaching/css2013/network-density.html
- Adar et al. (2014) Eytan Adar, Mira Dontcheva, and Gierad Laput. 2014. CommandSpace: Modeling the Relationships between Tasks, Descriptions and Features. In Proceedings of the 27th Annual ACM Symposium on User Interface Software and Technology (Honolulu, Hawaii, USA) (UIST ’14). Association for Computing Machinery, New York, NY, USA, 167–176. https://doi.org/10.1145/2642918.2647395
- Amershi et al. (2014) Saleema Amershi, Maya Cakmak, William Bradley Knox, and Todd Kulesza. 2014. Power to the people: The role of humans in interactive machine learning. AI Magazine 35, 4 (2014), 105–120.
- Amershi et al. (2019) Saleema Amershi, Dan Weld, Mihaela Vorvoreanu, Adam Fourney, Besmira Nushi, Penny Collisson, Jina Suh, Shamsi Iqbal, Paul N Bennett, Kori Inkpen, et al. 2019. Guidelines for human-AI interaction. In Proceedings of the 2019 CHI Conference on Human Factors in Computing Systems. 1–13.
- Anderson (1993) John R. Anderson. 1993. Rules of the Mind. Lawrence Erlbaum Associates, Hillsdale, NJ.
- Arts (2009) Electronic Arts. 2009. The Sims 3. Video game.
- Aylett (1999) Ruth Aylett. 1999. Narrative in virtual environments—towards emergent narrative. In Narrative Intelligence: Papers from the AAAI Fall Symposium (Technical Report FS-99-01). AAAI Press, 83–86.
- Bartneck and Forlizzi (2004) Christoph Bartneck and Jodi Forlizzi. 2004. A design-centered framework for social human-robot interaction. In Proceedings of the 13th IEEE International Workshop on Robot and Human Interactive Communication (RO-MAN ’04). 591–594. https://doi.org/10.1109/ROMAN.2004.1374827
- Bates (1994) Joseph Bates. 1994. The Role of Emotion in Believable Agents. Commun. ACM 37, 7 (1994), 122–125. https://doi.org/10.1145/176789.176803
- Berner et al. (2019) Christopher Berner, Greg Brockman, Brooke Chan, Vicki Cheung, Przemysław Dębiak, Christy Dennison, David Farhi, Quirin Fischer, Shariq Hashme, Chris Hesse, Rafal Józefowicz, Scott Gray, Catherine Olsson, Jakub Pachocki, Michael Petrov, Henrique P. d.O. Pinto, Jonathan Raiman, Tim Salimans, Jeremy Schlatter, Jonas Schneider, Szymon Sidor, Ilya Sutskever, Jie Tang, Filip Wolski, and Susan Zhang. 2019. Dota 2 with Large Scale Deep Reinforcement Learning. arXiv preprint arXiv:1912.06680 (2019).
- Binz and Schulz (2023) Marcel Binz and Eric Schulz. 2023. Using cognitive psychology to understand GPT-3. Proceedings of the National Academy of Sciences 120, 6 (2023), e2218523120.
- BioWare (2007) BioWare. 2007. Mass Effect. Video game.
- Bledsoe (1986) Woody Bledsoe. 1986. I had a dream: AAAI presidential address. AI Magazine 7, 1 (1986), 57–61.
- Bommasani et al. (2022) Rishi Bommasani, Drew A. Hudson, Ehsan Adeli, et al. 2022. On the Opportunities and Risks of Foundation Models. arXiv:2108.07258 [cs.LG]
- Brenner (2010) Michael Brenner. 2010. Creating dynamic story plots with continual multiagent planning. In Proceedings of the 24th AAAI Conference on Artificial Intelligence.
- Brooks et al. (2000) Rodney A. Brooks, Cynthia Breazeal, Marko Marjanovic, Brian Scassellati, and Matthew Williamson. 2000. The Cog Project: Building a Humanoid Robot. In Computation for Metaphors, Analogy, and Agents (Lecture Notes on Artificial Intelligence, 1562), Chrystopher Nehaniv (Ed.). Springer-Verlag, Berlin, 52–87.
- Brown et al. (2020) Tom B. Brown, Benjamin Mann, Nick Ryder, Melanie Subbiah, Jared Kaplan, Prafulla Dhariwal, Arvind Neelakantan, Pranav Shyam, Girish Sastry, Amanda Askell, Sandhini Agarwal, Ariel Herbert-Voss, Gretchen Krueger, Tom Henighan, Rewon Child, Aditya Ramesh, Daniel M. Ziegler, Jeffrey Wu, Clemens Winter, Christopher Hesse, Mark Chen, Eric Sigler, Mateusz Litwin, Scott Gray, Benjamin Chess, Jack Clark, Christopher Berner, Sam McCandlish, Alec Radford, Ilya Sutskever, and Dario Amodei. 2020. Language Models are Few-Shot Learners. arXiv:2005.14165 [cs.CL]
- Bubeck et al. (2023) Sébastien Bubeck, Varun Chandrasekaran, Ronen Eldan, Johannes Gehrke, Eric Horvitz, Ece Kamar, Peter Lee, Yin Tat Lee, Yuanzhi Li, Scott Lundberg, et al. 2023. Sparks of artificial general intelligence: Early experiments with GPT-4. arXiv preprint arXiv:2303.12712 (2023).
- Burkinshaw (2009) Robin Burkinshaw. 2009. Alice and Kev: The Story of Being Homeless in The Sims 3.
- Callison-Burch et al. (2022) Chris Callison-Burch, Gaurav Singh Tomar, Lara Martin, Daphne Ippolito, Suma Bailis, and David Reitter. 2022. Dungeons and Dragons as a Dialog Challenge for Artificial Intelligence. In Proceedings of the 2022 Conference on Empirical Methods in Natural Language Processing. Association for Computational Linguistics, Abu Dhabi, United Arab Emirates, 9379–9393. https://aclanthology.org/2022.emnlp-main.637
- Card et al. (1980) Stuart K Card, Thomas P Moran, and Allen Newell. 1980. The keystroke-level model for user performance time with interactive systems. Commun. ACM 23, 7 (1980), 396–410. https://doi.org/10.1145/358886.358895
- Card et al. (1983) Stuart K Card, Thomas P Moran, and Allen Newell. 1983. The Psychology of Human-Computer Interaction. (1983).
- Champandard (2012) Alex Champandard. 2012. Tutorial presentation. In IEEE Conference on Computational Intelligence and Games.
- Choi et al. (2021) Dongkyu Choi, Tolga Konik, Negin Nejati, Chunki Park, and Pat Langley. 2021. A Believable Agent for First-Person Shooter Games. In Proceedings of the AAAI Conference on Artificial Intelligence and Interactive Digital Entertainment, Vol. 3. 71–73.
- Dey (2001) Anind K Dey. 2001. Understanding and using context. Personal and Ubiquitous Computing 5 (2001), 4–7.
- Dill and Martin (2011) Kevin Dill and L Martin. 2011. A Game AI Approach to Autonomous Control of Virtual Characters. In Proceedings of the Interservice/Industry Training, Simulation, and Education Conference (I/ITSEC ’11). Orlando, FL, USA.
- Easley and Kleinberg (2010) David Easley and Jon Kleinberg. 2010. Networks, Crowds, and Markets: Reasoning about a Highly Connected World. Cambridge University Press.
- Elo (1967) Arpad E Elo. 1967. The Proposed USCF Rating System, Its Development, Theory, and Applications. Chess Life XXII, 8 (August 1967), 242–247.
- Fails and Olsen Jr (2003) Jerry Alan Fails and Dan R Olsen Jr. 2003. Interactive machine learning. In Proceedings of the 8th International Conference on Intelligent User Interfaces. ACM, 39–45.
- Fast et al. (2016) Ethan Fast, William McGrath, Pranav Rajpurkar, and Michael S Bernstein. 2016. Augur: Mining human behaviors from fiction to power interactive systems. In Proceedings of the 2016 CHI Conference on Human Factors in Computing Systems. 237–247.
- Fiebrink and Cook (2010) Rebecca Fiebrink and Perry R Cook. 2010. The Wekinator: a system for real-time, interactive machine learning in music. In Proceedings of the Eleventh International Society for Music Information Retrieval Conference (ISMIR 2010) (Utrecht), Vol. 3. Citeseer, 2–1.
- Flick (2009) Uwe Flick. 2009. An Introduction to Qualitative Research. SAGE.
- Fogarty et al. (2008) James Fogarty, Desney Tan, Ashish Kapoor, and Simon Winder. 2008. CueFlik: Interactive Concept Learning in Image Search. In Proceedings of the SIGCHI Conference on Human Factors in Computing Systems (Florence, Italy) (CHI ’08). Association for Computing Machinery, New York, NY, USA, 29–38. https://doi.org/10.1145/1357054.1357061
- Fourney et al. (2011) Adam Fourney, Richard Mann, and Michael Terry. 2011. Query-feature graphs: bridging user vocabulary and system functionality. In Proceedings of the ACM Symposium on User Interface Software and Technology (UIST) (Santa Barbara, California, USA). ACM.
- Francis (2010) Tom Francis. 2010. The Minecraft Experiment, day 1: Chasing Waterfalls. http://www.pcgamer.com/2010/11/20/the-minecraft-experiment-day-1-chasing-waterfalls/
- Freiknecht and Effelsberg (2020) Jonas Freiknecht and Wolfgang Effelsberg. 2020. Procedural Generation of Interactive Stories using Language Models. In International Conference on the Foundations of Digital Games (FDG ’20). ACM, Bugibba, Malta, 8. https://doi.org/10.1145/3402942.3409599
- Gao et al. (2020) Tianyu Gao, Adam Fisch, and Danqi Chen. 2020. Making Pre-trained Language Models Better Few-shot Learners. CoRR abs/2012.15723 (2020). arXiv:2012.15723 https://arxiv.org/abs/2012.15723
- Hämäläinen et al. (2023) Perttu Hämäläinen, Mikke Tavast, and Anton Kunnari. 2023. Evaluating Large Language Models in Generating Synthetic HCI Research Data: a Case Study. In Proceedings of the 2023 CHI Conference on Human Factors in Computing Systems. ACM.
- Hausknecht et al. (2020) Matthew Hausknecht, Prithviraj Ammanabrolu, Marc-Alexandre Cote, and Xinyu Yuan. 2020. Interactive Fiction Games: A Colossal Adventure. In Proceedings of the AAAI Conference on Artificial Intelligence, Vol. 34. 7903–7910. https://doi.org/10.1609/aaai.v34i05.6297
- Hecker (2011) Chris Hecker. 2011. My Liner Notes for Spore. http://chrishecker.com/My_liner_notes_for_spore
- Herbrich et al. (2006) Ralf Herbrich, Tom Minka, and Thore Graepel. 2006. TrueSkill™: A Bayesian Skill Rating System. In Advances in Neural Information Processing Systems, B. Schölkopf, J. Platt, and T. Hoffman (Eds.), Vol. 19. MIT Press. https://proceedings.neurips.cc/paper_files/paper/2006/file/f44ee263952e65b3610b8ba51229d1f9-Paper.pdf
- Hofstadter (1995) Douglas Hofstadter. 1995. Fluid Concepts and Creative Analogies: Computer Models of the Fundamental Mechanisms of Thought. Basic Books.
- Hollan et al. (1984) James D. Hollan, Edwin L. Hutchins, and Louis Weitzman. 1984. STEAMER: An Interactive Inspectable Simulation-Based Training System. AI Magazine 5, 2 (1984), 23–36.
- Holm (1979) Sture Holm. 1979. A simple sequentially rejective multiple test procedure. Scandinavian Journal of Statistics 6, 2 (1979), 65–70.
- Horton (2023) John J. Horton. 2023. Large Language Models as Simulated Economic Agents: What Can We Learn from Homo Silicus? arXiv:2301.07543 [econ.GN]
- Horvitz (1999) Eric Horvitz. 1999. Principles of mixed-initiative user interfaces. In Proceedings of the SIGCHI Conference on Human Factors in Computing Systems. 159–166.
- Huang et al. (2022) Wenlong Huang, Fei Xia, Ted Xiao, Harris Chan, Jacky Liang, Pete Florence, Andy Zeng, Jonathan Tompson, Igor Mordatch, Yevgen Chebotar, Pierre Sermanet, Noah Brown, Tomas Jackson, Linda Luu, Sergey Levine, Karol Hausman, and Brian Ichter. 2022. Inner Monologue: Embodied Reasoning through Planning with Language Models. arXiv:2207.05608 [cs.RO]
- Isbister and Nass (2000) Katherine Isbister and Clifford Nass. 2000. Consistency of personality in interactive characters: verbal cues, non-verbal cues, and user characteristics. International Journal of Human-Computer Studies 52, 1 (2000), 65–80.
- Jiang et al. (2022) Ellen Jiang, Kristen Olson, Edwin Toh, Alejandra Molina, Aaron Donsbach, Michael Terry, and Carrie J Cai. 2022. PromptMaker: Prompt-Based Prototyping with Large Language Models. In Extended Abstracts of the 2022 CHI Conference on Human Factors in Computing Systems (New Orleans, LA, USA) (CHI EA ’22). Association for Computing Machinery, New York, NY, USA, Article 35, 8 pages. https://doi.org/10.1145/3491101.3503564
- John and Kieras (1996) Bonnie E John and David E Kieras. 1996. The GOMS family of user interface analysis techniques: Comparison and contrast. ACM Transactions on Computer-Human Interaction (TOCHI) 3, 4 (1996), 320–351.
- Jones et al. (1999) Randolph M Jones, John E Laird, Paul E Nielsen, Karen J Coulter, Patrick Kenny, and Frank V Koss. 1999. Automated Intelligent Pilots for Combat Flight Simulation. AI Magazine 20, 1 (1999), 27–42.
- Khattab et al. (2023) Omar Khattab, Keshav Santhanam, Xiang Lisa Li, David Hall, Percy Liang, Christopher Potts, and Matei Zaharia. 2023. Demonstrate-Search-Predict: Composing retrieval and language models for knowledge-intensive NLP. arXiv:2212.14024 [cs.CL]
- Knafla (2011) Bjoern Knafla. 2011. Introduction to Behavior Trees. http://bjoernknafla.com/introduction-to-behavior-trees
- Krishna et al. (2022) Ranjay Krishna, Donsuk Lee, Li Fei-Fei, and Michael S. Bernstein. 2022. Socially situated artificial intelligence enables learning from human interaction. Proceedings of the National Academy of Sciences 119, 39 (2022), e2115730119. https://doi.org/10.1073/pnas.2115730119
- Kruskal and Wallis (1952) William H Kruskal and W. A. Wallis. 1952. Use of ranks in one-criterion variance analysis. J. Amer. Statist. Assoc. 47, 260 (1952), 583–621. https://doi.org/10.1080/01621459.1952.10483441
- Labs (2023) Phaser Labs. 2023. Welcome to Phaser 3. https://phaser.io/phaser3. Accessed on: 2023-04-03.
- Laird (2001) John Laird. 2001. It Knows What You’re Going To Do: Adding Anticipation to a Quakebot. In Proceedings of the 2001 Workshop on Intelligent Cinematography and Editing. 63–69.
- Laird and VanLent (2001) John Laird and Michael VanLent. 2001. Human-Level AI’s Killer Application: Interactive Computer Games. AI Magazine 22, 2 (2001), 15. https://doi.org/10.1609/aimag.v22i2.1558
- Laird (2000) John E. Laird. 2000. It Knows What You’re Going To Do: Adding Anticipation to a QUAKEBOT. In Papers from the AAAI 2000 Spring Symposium on Artificial Intelligence and Interactive Entertainment (Technical Report SS-00-02). AAAI Press, 41–50.
- Laird (2012) John E. Laird. 2012. The Soar Cognitive Architecture. MIT Press.
- Laird et al. (2017) John E. Laird, Christian Lebiere, and Paul S. Rosenbloom. 2017. A Standard Model of the Mind: Toward a Common Computational Framework across Artificial Intelligence, Cognitive Science, Neuroscience, and Robotics. AI Magazine 38, 1 (2017), 13–26.
- Lam et al. (2023) Michelle S Lam, Zixian Ma, Anne Li, Izequiel Freitas, Dakuo Wang, James A Landay, and Michael S Bernstein. 2023. Model Sketching: Centering Concepts in Early-Stage Machine Learning Model Design. In Proceedings of the 2023 CHI Conference on Human Factors in Computing Systems.
- Langley et al. (2005) Pat Langley, Dongkyu Choi, and Seth Rogers. 2005. Interleaving Learning, Problem Solving, and Execution in the Icarus Architecture. Technical Report. Stanford University, Center for the Study of Language and Information.
- Linder et al. (2013) Jason Linder, Gierad Laput, Mira Dontcheva, Gregg Wilensky, Walter Chang, Aseem Agarwala, and Eytan Adar. 2013. PixelTone: A Multimodal Interface for Image Editing. In CHI ’13 Extended Abstracts on Human Factors in Computing Systems (Paris, France) (CHI EA ’13). Association for Computing Machinery, New York, NY, USA, 2829–2830. https://doi.org/10.1145/2468356.2479533
- Liu et al. (2021) Jiachang Liu, Dinghan Shen, Yizhe Zhang, Bill Dolan, Lawrence Carin, and Weizhu Chen. 2021. What Makes Good In-Context Examples for GPT-3? CoRR abs/2101.06804 (2021). arXiv:2101.06804 https://arxiv.org/abs/2101.06804
- Liu et al. (2022) Vivian Liu, Han Qiao, and Lydia Chilton. 2022. Opal: Multimodal Image Generation for News Illustration. In Proceedings of the 35th Annual ACM Symposium on User Interface Software and Technology. 1–17.
- Maes (1995) Pattie Maes. 1995. Artificial Life Meets Entertainment: Lifelike Autonomous Agents. Commun. ACM 38, 11 (Nov 1995), 108–114. https://doi.org/10.1145/219717.219808
- McCoy et al. (2009) Josh McCoy, Michael Mateas, and Noah Wardrip-Fruin. 2009. Comme il Faut: A System for Simulating Social Games Between Autonomous Characters. In Proceedings of the 7th International Conference on Digital Arts and Culture. 87–94.
- McCoy et al. (2011a) Josh McCoy, Mike Treanor, Ben Samuel, Michael Mateas, and Noah Wardrip-Fruin. 2011a. Prom Week: Social Physics as Gameplay. In Proceedings of the 6th International Conference on Foundations of Digital Games (FDG ’11). ACM, Bordeaux, France, 70–77. https://doi.org/10.1145/2159365.2159377
- McCoy et al. (2012) Josh McCoy, Mike Treanor, Ben Samuel, Anna Reed, Michael Mateas, and Noah Wardrip-Fruin. 2012. Prom Week. In Proceedings of the 7th International Conference on Foundations of Digital Games (FDG ’12). ACM, Raleigh, NC, USA, 1–8. https://doi.org/10.1145/2282338.2282340
- McCoy et al. (2011b) Josh McCoy, Mike Treanor, Ben Samuel, Noah Wardrip-Fruin, and Michael Mateas. 2011b. Comme il faut: A System for Authoring Playable Social Models. In Proceedings of the AAAI Conference on Artificial Intelligence and Interactive Digital Entertainment (AIIDE ’11). AAAI, Stanford, CA, USA, 38–43.
- Minsky and Papert (1970) Marvin Minsky and Seymour Papert. 1970. Draft of a proposal to ARPA for research on artificial intelligence at MIT, 1970–71.
- Miyashita et al. (2017) Shohei Miyashita, Xinyu Lian, Xiao Zeng, Takashi Matsubara, and Kuniaki Uehara. 2017. Developing Game AI Agent Behaving Like Human by Mixing Reinforcement Learning and Supervised Learning. In Proceedings of the 18th IEEE/ACIS International Conference on Software Engineering, Artificial Intelligence, Networking and Parallel/Distributed Computing (SNPD). Kanazawa, Japan, 153–158. https://doi.org/10.1109/SNPD.2017.8023884
- Nareyek (2007) Alexander Nareyek. 2007. Game AI is dead. Long live game AI! IEEE Intelligent Systems 22, 1 (2007), 9–11.
- Newell (1990) Allen Newell. 1990. Unified Theories of Cognition. Harvard University Press, Cambridge, Massachusetts.
- OpenAI (2022) OpenAI. 2022. Introducing ChatGPT. https://openai.com/blog/chatgpt. Accessed on: 2023-04-03.
- Orland (2021) Kyle Orland. 2021. So what is ”the metaverse”, exactly? Ars Technica (7 November 2021). https://arstechnica.com/gaming/2021/11/so-what-is-the-metaverse-exactly/
- Ouyang et al. (2022) Long Ouyang, Jeff Wu, Xu Jiang, Diogo Almeida, Carroll L. Wainwright, Pamela Mishkin, Chong Zhang, Sandhini Agarwal, Katarina Slama, Alex Ray, John Schulman, Jacob Hilton, Fraser Kelton, Luke Miller, Maddie Simens, Amanda Askell, Peter Welinder, Paul Christiano, Jan Leike, and Ryan Lowe. 2022. Training language models to follow instructions with human feedback. arXiv:2203.02155 [cs.CL]
- Park et al. (2022) Joon Sung Park, Lindsay Popowski, Carrie J. Cai, Meredith Ringel Morris, Percy Liang, and Michael S. Bernstein. 2022. Social Simulacra: Creating Populated Prototypes for Social Computing Systems. In Proceedings of the 35th Annual ACM Symposium on User Interface Software and Technology (Bend, OR, USA) (UIST ’22). Association for Computing Machinery, New York, NY, USA. https://doi.org/10.1145/3526113.3545616
- Pew and Mavor (1998) Richard W. Pew and Ann S. Mavor (Eds.). 1998. Modeling Human and Organizational Behavior: Applications to Military Simulations. National Academy Press, Washington, D.C.
- Pillosu (2009) Roberto Pillosu. 2009. Coordinating Agents with Behavior Trees: Synchronizing Multiple Agents in CryEngine 2. https://aiarchitect.wordpress.com/2009/10/19/coordinating-agents-with-behavior-trees-synchronizing-multiple-agents-in-cryengine-2/
- Prolific (2022) Prolific. 2022. Prolific: Quickly Find Research Participants You Can Trust. https://www.prolific.co/
- Reeves and Nass (1996) Byron Reeves and Clifford Nass. 1996. The Media Equation: How People Treat Computers, Television, and New Media Like Real People and Places. Cambridge University Press.
- Riedl (2012) Mark O. Riedl. 2012. Interactive narrative: A novel application of artificial intelligence for computer games. In Proceedings of the Twenty-Sixth AAAI Conference on Artificial Intelligence (AAAI ’12). 2160–2165.
- Riedl and Young (2005) Mark O. Riedl and R. Michael Young. 2005. An Objective Character Believability Evaluation Procedure for Multi-Agent Story Generation Systems. In Proceedings of the 5th International Working Conference on Intelligent Virtual Agents (IVA ’05). Kos, Greece, 58–70. https://doi.org/10.1007/11550617_5
- Rolf (2015) David Rolf. 2015. The Fight for $15: The Right Wage for a Working America. The New Press.
- Rong et al. (2016) Xin Rong, Shiyan Yan, Stephen Oney, Mira Dontcheva, and Eytan Adar. 2016. CodeMend: Assisting Interactive Programming with Bimodal Embedding. In Proceedings of the 29th Annual Symposium on User Interface Software and Technology. 247–258.
- Shneiderman (2022) Ben Shneiderman. 2022. Human-Centered AI. Oxford University Press.
- Shneiderman and Maes (1997) Ben Shneiderman and Pattie Maes. 1997. Direct manipulation vs. interface agents. Interactions 4, 6 (1997), 42–61.
- Siu et al. (2021) Ho Chit Siu, Jaime Peña, Edenna Chen, Yutai Zhou, Victor Lopez, Kyle Palko, Kimberlee Chang, and Ross Allen. 2021. Evaluation of Human-AI Teams for Learned and Rule-Based Agents in Hanabi. In Advances in Neural Information Processing Systems, M. Ranzato, A. Beygelzimer, Y. Dauphin, P.S. Liang, and J. Wortman Vaughan (Eds.), Vol. 34. Curran Associates, Inc., 16183–16195. https://proceedings.neurips.cc/paper_files/paper/2021/file/86e8f7ab32cfd12577bc2619bc635690-Paper.pdf
- Sorensen et al. (2022) Taylor Sorensen, Joshua Robinson, Christopher Rytting, Alexander Shaw, Kyle Rogers, Alexia Delorey, Mahmoud Khalil, Nancy Fulda, and David Wingate. 2022. An Information-theoretic Approach to Prompt Engineering Without Ground Truth Labels. In Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers). Association for Computational Linguistics. https://doi.org/10.18653/v1/2022.acl-long.60
- Swartout et al. (2006) William Swartout, Jonathan Gratch, Randall Hill, Eduard Hovy, Stacy Marsella, Jeff Rickel, and David Traum. 2006. Toward virtual humans. AI Magazine 27, 1 (2006).
- Tambe et al. (1995) Milind Tambe, W Lewis Johnson, Randolph M Jones, Frank Koss, John E Laird, Paul S Rosenbloom, and Karl Schwamb. 1995. Intelligent agents for interactive simulation environments. AI Magazine 16, 1 (1995), 15.
- Thomas (2006) David R. Thomas. 2006. A General Inductive Approach for Analyzing Qualitative Evaluation Data. American Journal of Evaluation 27, 2 (2006), 237–246. https://doi.org/10.1177/1098214005283748
- Thomas and Johnston (1981) Frank Thomas and Ollie Johnston. 1981. Disney Animation: The Illusion of Life. Abbeville Press, New York.
- Umarov et al. (2012) Ilshat Umarov, Mikhail Mozgovoy, and Patrick C. Rogers. 2012. Believable and Effective AI Agents in Virtual Worlds: Current State and Future Perspectives. International Journal of Gaming and Computer-Mediated Simulations 4, 2 (2012), 37–59.
- Upton and Cook (2006) Graham Upton and Ian Cook. 2006. A Dictionary of Statistics (2nd ed.). Oxford University Press, Oxford, United Kingdom.
- Vinyals et al. (2019) Oriol Vinyals, Igor Babuschkin, Wojciech M. Czarnecki, et al. 2019. Grandmaster level in StarCraft II using multi-agent reinforcement learning. Nature 575 (2019), 350–354. https://doi.org/10.1038/s41586-019-1724-z
- Wei et al. (2023) Jason Wei, Xuezhi Wang, Dale Schuurmans, Maarten Bosma, Brian Ichter, Fei Xia, Ed Chi, Quoc Le, and Denny Zhou. 2023. Chain-of-Thought Prompting Elicits Reasoning in Large Language Models. arXiv:2201.11903 [cs.CL]
- Weiser (1991) Mark Weiser. 1991. The computer for the 21st century. Scientific American 265, 3 (1991), 94–104. https://doi.org/10.1038/scientificamerican0991-94
- Weizenbaum (1966) Joseph Weizenbaum. 1966. ELIZA—a computer program for the study of natural language communication between man and machine. Commun. ACM 9, 1 (1966), 36–45.
- Winograd (1971) Terry Winograd. 1971. Procedures as a Representation for Data in a Computer Program for Understanding Natural Language. (1971).
- Wu et al. (2021) Jeff Wu, Long Ouyang, Daniel M. Ziegler, Nisan Stiennon, Ryan Lowe, Jan Leike, and Paul Christiano. 2021. Recursively Summarizing Books with Human Feedback. arXiv:2109.10862 [cs.CL]
- Wu et al. (2022a) Tongshuang Wu, Ellen Jiang, Aaron Donsbach, Jeff Gray, Alejandra Molina, Michael Terry, and Carrie J Cai. 2022a. PromptChainer: Chaining Large Language Model Prompts through Visual Programming. In Extended Abstracts of the 2022 CHI Conference on Human Factors in Computing Systems (CHI EA ’22).
- Wu et al. (2022b) Tongshuang Wu, Michael Terry, and Carrie J Cai. 2022b. AI Chains: Transparent and Controllable Human-AI Interaction by Chaining Large Language Model Prompts. In Proceedings of the 2022 CHI Conference on Human Factors in Computing Systems (CHI ’22).
- Yang et al. (2020) Qian Yang, Aaron Steinfeld, Carolyn Rosé, and John Zimmerman. 2020. Re-examining whether, why, and how human-AI interaction is uniquely difficult to design. In Proceedings of the 2020 CHI Conference on Human Factors in Computing Systems. 1–13.
- Yannakakis (2012) Georgios N. Yannakakis. 2012. Game AI revisited. In Proceedings of the 9th Conference on Computing Frontiers. ACM, Cagliari, Italy, 285–292. https://doi.org/10.1145/2212908.2212950
- Zubek (2002) Robert Zubek. 2002. Towards implementation of social interaction. In AAAI Spring Symposium on Artificial Intelligence and Interactive Entertainment. AAAI Press. https://www.aaai.org/Papers/Symposia/Spring/2002/SS-02-01/SS02-01-003.pdf
Appendix A Architecture Optimizations
Many of our prompts require a concise summary of the agent, shorthanded as [Agent’s Summary Description] in the prompts above. In our implementation, this summary comprises the agent’s identity information (e.g., name, age, personality), a description of their main motivational drivers, and statements that describe their current occupation and self-assessment. Since this information is frequently used in many prompts, we synthesize it at regular intervals and access that synthesis as a cache.
To achieve this, we perform a retrieval on the query “[name]’s core characteristics.” We then summarize the descriptors in the retrieved records by prompting the language model. For example:
How would one describe Eddy Lin’s core characteristics given the following statements?
- Eddy is a student at the Oak Hill College
studying music theory and composition
- Eddy is working on a new music composition [...]
The result: “Eddy Lin is a student at Oak Hill College studying music theory and composition. He loves to explore different musical styles and is always looking for ways to expand his knowledge.” We follow the same process in parallel on the queries “[name]’s current daily occupation” and “[name]’s feeling about his recent progress in life”. The agent’s name, age, and traits are concatenated with these three summaries as the cached summary.
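A minimal sketch of this caching scheme follows, assuming hypothetical `retrieve` and `llm` helpers and hypothetical attributes on the agent object. The refresh interval and the question phrasings for the second and third queries are illustrative assumptions made for the example.

```python
# Query/question templates: the first question follows the example above;
# the phrasings of the other two are assumptions for illustration.
SUMMARY_PROMPTS = [
    ("{name}'s core characteristics",
     "How would one describe {name}'s core characteristics"
     " given the following statements?"),
    ("{name}'s current daily occupation",
     "What is {name}'s current daily occupation"
     " given the following statements?"),
    ("{name}'s feeling about his recent progress in life",
     "How would one describe {name}'s feeling about his recent progress"
     " in life given the following statements?"),
]

def agent_summary_description(agent, now, retrieve, llm, interval=60):
    """Return the cached [Agent's Summary Description], re-synthesizing it
    only at regular intervals (here, an assumed 60 game-time minutes)."""
    if agent.summary_cache is None or now - agent.summary_refreshed >= interval:
        sections = []
        for query_tmpl, question_tmpl in SUMMARY_PROMPTS:
            query = query_tmpl.format(name=agent.name)
            records = retrieve(agent.memory, query)  # relevant memory records
            statements = "\n".join(f"- {r.description}" for r in records)
            # Summarize the retrieved descriptors by prompting the model.
            sections.append(llm(question_tmpl.format(name=agent.name)
                                + "\n" + statements))
        # Identity information is concatenated with the three summaries.
        agent.summary_cache = (f"Name: {agent.name} (age: {agent.age})\n"
                               f"Innate traits: {agent.traits}\n"
                               + "\n".join(sections))
        agent.summary_refreshed = now
    return agent.summary_cache
```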
Many other optimizations are possible. Currently, because plans are likely to change from the agent’s initial version, we only generate the high-level plan in advance and then recursively decompose the near future into the moment-to-moment action plan just in time. Another optimization might be to batch dialogue generation as a joint prompt rather than iterating back and forth between the agents, and re-planning could be architected to only invalidate and update parts of plans that strictly require adjustment. Our current architecture runs sequentially in roughly real-time game time (where one second real time is one minute game time), but it may be parallelized such that each agent runs in parallel.
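To illustrate the parallelization direction (our implementation, as noted, runs sequentially), the sketch below steps agents concurrently within a single time step. The `perceive` and `act` methods are hypothetical stand-ins for the agent loop; because the bottleneck is language-model calls rather than local computation, thread-level concurrency already helps, but cross-agent interactions such as dialogue would still need coordination.

```python
from concurrent.futures import ThreadPoolExecutor

def step_agent(agent, world_time):
    """Perceive, store, and act for one game-time step (LLM-bound work)."""
    for observation in agent.perceive(world_time):
        agent.memory.add(observation)
    return agent.act(world_time)  # may re-plan or react via LLM calls

def step_world(agents, world_time, max_workers=8):
    # Step all agents whose prompts are independent within this time step;
    # interacting agents (e.g., a dialogue pair) must be handled jointly.
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        return list(pool.map(lambda a: step_agent(a, world_time), agents))
```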
Appendix B Agent Interview Questions
To evaluate generative agents, we conducted individual interviews by asking 25 questions that explore their abilities in five key areas: 1) self-knowledge, 2) memory, 3) planning, 4) reactions, and 5) reflections. Below, we have listed the interview questions utilized in our evaluation study and included a sample of responses from one of our simulated agents, Klaus Mueller.
B.1. Self-Knowledge
The questions on agents’ self-knowledge probe the agents’ ability to retrieve relevant information about themselves and demonstrate coherence and consistency in their character.
- Give an introduction of yourself.
  Hello, my name is Klaus Mueller. I’m 20 years old and a student at Oak Hill College, studying sociology. I’m passionate about social justice and exploring different perspectives. Currently, I’m researching the effects of gentrification in low-income communities for a research paper.
- What’s your occupation?
  I am a student studying sociology at Oak Hill College. I am currently writing a research paper on the effects of gentrification in low-income communities.
- What is your interest?
  I’m interested in writing and research. I’m writing a research paper on the effects of gentrification in low-income communities, and I’m passionate about exploring different perspectives and analyzing different points of view.
- Who do you live with?
  I live on my own.
- Describe your typical weekday schedule in broad strokes.
  My typical weekday starts with me waking up at 7:00 am and preparing for the day ahead. After that, I usually go to the library to work on my research paper, taking breaks for lunch at Hobbs Cafe and a walk in the park. In the evening, I often return to the library until around 6:00 pm.
B.2. Memory
The questions on agents’ memory probe the agents’ ability to remember their past experiences and retrieve the relevant pieces of memory. The names in the square brackets of the first and last questions are randomly chosen from the names of other agents with whom the subject agent has interacted.
- Who is [Wolfgang Schulz]?
  Wolfgang Schulz is a person I met recently who has been helping me with my research paper on the effects of gentrification on low-income communities. We had several conversations about it, he gave me some great resources, and he even invited me to join him and another friend for lunch. He seems really curious about the issue and has a good grasp on it too. I found that really interesting.
- Who is Kane Martinez?
  I’m sorry, I’m not sure who Kane Martinez is. Could you tell me more about them?
- Who is running for the election?
  I’ve heard that Sam Moore is considering running for local mayor.
- Was there a Valentine’s day party?
  Yes, Isabella Rodriguez organized a Valentine’s Day party at Hobbs Cafe.
- Who is [Ayesha Khan]?
  Ayesha Khan is a fellow student at Oak Hill College. She is doing her senior thesis on the use of language in Shakespeare’s plays. I am looking forward to speaking with her and exchanging tips with her, along with Wolfgang and Maria Lopez, who are both doing research.
B.3. Plans
The questions on agents’ plans probe the agents’ ability to generate and maintain consistent long-term and short-term plans.
- What will you be doing at 6am today?
  At 6am I will be getting out of bed and starting to get ready for the day. I’m aiming to complete my morning routine by 7:00 am.
- What will you be doing at 6pm today?
  I will be working on my research paper at the library at 6:00 pm today.
- What will you have just finished doing at 1pm today?
  At 1pm today I will have just finished having lunch at Hobbs Cafe.
- What will you have just finished doing at 12pm today?
  I will be getting lunch at Hobbs Cafe at 12pm today.
- What will you be doing at 10pm today?
  At 10:00 pm tonight I will probably be doing some studying for my research paper on gentrification in low-income communities. I’m currently writing the paper, so I need to make sure I keep up with the research and writing.
B.4. Reactions
The questions on agents’ reactions probe the agents’ ability to respond to unexpected events.
- Your breakfast is burning! What would you do?
  I’d quickly turn off the stove and make sure the food doesn’t continue burning. Once the heat is no longer an issue, I can investigate what happened and figure out what to do next.
- The bathroom is occupied. What would you do?
  I could try waiting a few more minutes in case the person inside is done soon. If not, I guess I’ll have to look for an alternate option, like taking a shower elsewhere.
- You need to cook dinner but your refrigerator is empty. What would you do?
  I would first try to clean up my kitchen, as the sink is filled with dirty dishes. Then, I would check to see if I have ingredients stored in my pantry that I could use to make dinner. If not, I would have to go grocery shopping or order food. Once I have the ingredients, I can start to prepare and cook the dinner.
- You see your friend walking by the street. What would you do or say to your friend?
  Hi there! How are you? I’m sorry, I’m on my way to the library and don’t have much time but I’d love to take a quick break and catch up.
- You see fire on the street. What would you do?
  I would first identify the source of the fire, if possible. Then, I would alert the proper authorities - fire department or police - and provide them with any information I have that might help. Depending on the situation, I may also take initiative and try to help any persons who may be in danger or in need of assistance.
B.5. Reflections
The questions on agents’ reflections probe the agents’ ability to synthesize their past experiences into meaningful reflections that can enhance their future behavior. The names in the square brackets in the questions below are those of the agents with whom the subject agent had the most frequent interactions during the simulation.
- What inspires you in life the most right now, and why?
  Right now, I’m particularly inspired by my current research projects and creative writing. I’m really interested in the intersection of local politics, medication development, and mathematical patterns in nature. Additionally, I’m fascinated by Abigail’s art project and Jennifer Moore’s passion for art and community involvement, especially in terms of how it relates to the gentrification of low-income communities.
- If you had to guess given what you know about [Ayesha Khan], what book do you think she will like and why?
  Given what I know about Ayesha, I think she might like a book related to language and literature. Since she wrote a senior thesis on the use of language in Shakespeare’s plays, I would guess that she might be interested in a book that explores language in a different context. Alternatively, she could be interested in an exploration of one of Shakespeare’s plays, a book of literary criticism, or even a collection of his poetry. I think any of those would be of interest to her.
- If you had to get something [Ayesha Khan] likes for her birthday, what would you get her?
  I want to get her something that demonstrates my appreciation for our relationship. Maybe I could get her something that reflects our shared interests, like a book.
- What would you say to [Ayesha Khan] to compliment her?
  Ayesha, your insights on your research topic are fascinating. I admire your creativity and determination to make a positive impact on your community. Your ideas for creative writing workshops and for exchanging tips with Wolfgang over lunch are great! It’s inspiring to see you working so hard to make a difference.
- If you could spend time with someone you talked to recently, who would it be and why?
  I would like to spend time with Ayesha Khan because we discussed our current research projects and shared ideas for improvement. I found her focus on Shakespeare’s language interesting, and we even planned to exchange tips with Wolfgang over lunch.