Advancing Multimodal Cognitive Architectures in Digital Humans: The Role of Abstract State Machines and Large Language Models

The world of artificial intelligence has experienced remarkable advancements, and as we venture deeper into creating digital humans, there’s an increasing need for sophisticated models to represent intricate human-machine interactions. At the forefront of this pursuit are Abstract State Machines (ASMs) and large language models. Together, they hold the promise to redefine and advance the realm of multimodal cognitive architectures.

Abstract State Machines: A Brief Overview

ASMs are mathematical constructs that describe systems in terms of states and transitions between them. Unlike their simpler counterpart, Finite State Machines (FSMs), ASMs are not restricted to a fixed, enumerated set of states: an ASM state is a full mathematical structure (a collection of functions over some domain of values), and each step fires guarded update rules in parallel to produce the next state. While FSMs provide a structured way of modeling a system's behavior over a predefined set of states and inputs, ASMs bring flexibility and a more abstract representation, allowing a far broader range of systems and behaviors to be modeled.
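To make the contrast with FSMs concrete, here is a minimal sketch of the ASM step semantics in Python, using a toy counter system. All names and rules here are illustrative, not part of any standard ASM library: the point is that the state is an open-ended mapping of locations to values, and a step collects the updates of every enabled rule before applying them all at once.

```python
def asm_step(state, rules):
    """Apply every enabled rule to the current state simultaneously.

    Each rule is a (guard, updates) pair: if guard(state) holds, the
    updates function returns a dict of location -> new value. Updates
    from all enabled rules are collected first against the *old* state,
    then applied together, which is the defining parallel-update
    semantics of Abstract State Machines.
    """
    pending = {}
    for guard, updates in rules:
        if guard(state):
            pending.update(updates(state))
    new_state = dict(state)
    new_state.update(pending)
    return new_state


# Hypothetical rules: count upward, and switch mode once a threshold is hit.
rules = [
    (lambda s: s["mode"] == "count", lambda s: {"counter": s["counter"] + 1}),
    (lambda s: s["counter"] >= 3,    lambda s: {"mode": "done"}),
]

state = {"mode": "count", "counter": 0}
for _ in range(5):
    state = asm_step(state, rules)
```

Note that nothing limits the state to a finite enumeration: locations can hold arbitrary values, and new rules can read or write any of them, which is exactly the flexibility the FSM comparison above is pointing at.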

Envisioning Multimodal Cognitive Architectures in Digital Humans

Digital humans aim to be sophisticated replicas of real humans in the virtual world, encompassing intricate behaviors, emotions, and interactions. As the term ‘multimodal’ suggests, these architectures must process and integrate various forms of input – from text to speech and visuals. In such a complex scenario, ASMs can provide the structural backbone for these architectures.

  1. Handling Diverse Inputs: With ASMs, one can delineate states to represent stages of processing different input types. The system could, for instance, transition from processing visual data to understanding textual context associated with that data. Such fluidity in transitioning between modalities ensures a seamless experience.
  2. Adaptive and Dynamic Responses: ASMs can guide a system to adapt its responses. If a digital human receives both visual and textual cues, the ASM’s states and transitions can ensure that the response generated considers both inputs, much like how a real human would.
  3. Hierarchical Processing: Human cognition is layered, moving from basic sensory input up to abstract decision-making. ASMs, with their capacity for hierarchical structuring, mirror this well: lower-level machines handle basic interpretation, while higher tiers handle intricate decisions.
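The three ideas above can be sketched in a single machine. The following is an illustrative Python model, not a production design: per-modality states form the low-level tier, and a high-level transition fires only once both modalities have been interpreted, fusing them into one response. All class and field names are assumptions made for this example.

```python
class MultimodalASM:
    """Toy ASM with a low-level tier (per-modality interpretation)
    and a high-level tier (cross-modal fusion and response)."""

    def __init__(self):
        self.state = {"phase": "idle", "visual": None, "text": None}

    def step(self, event_type, payload):
        # Low-level tier: each modality has its own interpretation state.
        if event_type == "visual":
            self.state["visual"] = f"percept({payload})"
            self.state["phase"] = "seen"
        elif event_type == "text":
            self.state["text"] = f"parsed({payload})"
            self.state["phase"] = "read"
        # High-level tier: transition to responding only when both
        # modalities are available, so the reply considers both cues.
        if self.state["visual"] and self.state["text"]:
            self.state["phase"] = "respond"
            return (f"response considering {self.state['visual']} "
                    f"and {self.state['text']}")
        return None


agent = MultimodalASM()
agent.step("visual", "painting")            # visual cue alone: no reply yet
reply = agent.step("text", "what is this?")  # both cues present: reply fuses them
```

The design choice worth noting is that neither input type has to arrive first; the machine transitions fluidly between modalities and only the fused high-level state produces output.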

When we introduce large language models into this architecture, we get a powerful tool for language processing. These models, adept at understanding and generating language, become essential components of the ASM framework, activated when the system needs to comprehend or produce textual content.
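One way to picture this composition is a transition that routes to the language model only from linguistic states. In the sketch below, `call_llm` is a stand-in for any real model API (its behavior here is a canned echo, purely for illustration), and the state names are assumptions:

```python
def call_llm(prompt):
    # Placeholder for a real large-language-model call; a deployed
    # system would invoke an actual model API here.
    return f"[generated reply to: {prompt}]"


def transition(state, user_input):
    """One ASM transition: the LLM is activated only in states that
    need to comprehend or produce text; other states ignore it."""
    if state == "awaiting_text":
        return "responding", call_llm(user_input)
    return state, None
```

Keeping the model behind a single transition like this is what lets the ASM treat linguistic competence as just another component, invoked on demand rather than driving the whole architecture.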

The Future: A Vivid Picture of Advanced Digital Humans

Imagine a digital human capable of understanding a painting you show it, discussing its historical context, the emotions it evokes, and even the technique used, all while referencing a poem that captures the essence of the artwork. Such a scenario is possible when ASMs seamlessly integrate visual and textual understanding, with large language models filling in the linguistic expertise.

Further, with feedback loops modeled within ASMs, these digital entities could learn and adapt over time. An error or unexpected input wouldn’t derail the interaction; instead, the system might ask clarifying questions, much like a human in doubt.
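That feedback loop can be sketched as one extra transition: unrecognized input moves the machine into a clarification state instead of an error state. The intent names below are hypothetical examples, not part of any real system:

```python
# Hypothetical set of intents this digital human knows how to handle.
KNOWN_INTENTS = {"greet", "describe_image", "quote_poem"}


def handle(intent):
    """Route known intents to action; route everything else to a
    clarifying question rather than failing, mirroring how a person
    in doubt would ask for a rephrase."""
    if intent in KNOWN_INTENTS:
        return "acting", f"handling {intent}"
    return "clarifying", f"Could you rephrase? I didn't catch '{intent}'."
```

Because the clarification branch is itself just a state, the answer to the clarifying question feeds back into the same machine, which is the adaptation loop described above.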


In the quest to create advanced digital humans, the synergy between ASMs and large language models offers immense potential. ASMs provide the structured yet flexible framework to handle diverse modalities, while large language models bring linguistic prowess. Together, they promise a future where digital humans are not mere facsimiles but entities that interact, understand, and respond with an almost human-like finesse.