Notes:
This white paper presents a consolidated account of how Rhetorical Structure Theory (RST) can be operationalized in dialog systems for virtual beings, including digital humans, virtual humans, virtual influencers, and VTubers. It defines core RST concepts such as elementary discourse units and rhetorical relations, explains their relevance to conversational analysis and generation, and outlines a practical architecture that integrates RST-driven discourse parsing and planning with natural language understanding, dialog management, and natural language generation. The paper surveys representative applications—summarization, dialog act classification, stylistically adaptive generation, affect-sensitive interaction, and multimodal coordination—then discusses datasets, annotation practices, evaluation methods, implementation considerations, and open challenges. It concludes with a forward-looking agenda for robust, multilingual, real-time RST in service of highly coherent, controllable, and transparent conversational behaviors for virtual beings.
Rhetorical Structure Theory as a Framework for Coherent and Adaptive Dialog in Virtual Beings
Rhetorical Structure Theory (RST) models the coherence of texts as a tree of relations linking elementary discourse units (EDUs) into nucleus–satellite and multinuclear structures that explain why a message “hangs together” (Mann & Thompson, 1988). When applied to dialog, RST provides a principled layer between surface utterances and communicative intent: it helps identify what is central versus supportive, organizes turns into coherent plans, and supplies constraints for generation. In virtual beings, this enables consistent persona-driven behavior, controllable discourse strategies, and faithful summarization of long interactions. A practical system segments conversational transcripts into EDUs, predicts rhetorical relations, incrementally builds discourse trees, and exposes these structures to the dialog manager and generator. Intrinsic evaluation measures relation labeling and tree quality; extrinsic evaluation measures task success, efficiency, user-reported quality, and downstream gains in classifiers or generators. The key challenges are reliable segmentation in spontaneous speech, handling interruptions and repairs, limited annotated dialog corpora, multilingual transfer, and real-time constraints, which can be addressed by incremental parsing, domain adaptation, and careful data and tooling choices.
RST analyzes discourse as hierarchies of elementary discourse units that are connected by relations such as Elaboration, Evidence, Contrast, Justify, and Condition, producing trees in which nuclei carry the central message and satellites provide support (Mann & Thompson, 1988). While originally developed for written text, RST principles have been extended to spoken interaction and planning for generation, where the tree serves as a blueprint for information ordering and cueing. Compared to shallow discourse frameworks that annotate local connectives, RST encodes global structure and functional roles, which is particularly useful for multi-turn conversations where intent unfolds over time.
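The nucleus and satellite roles described above can be made concrete with a small tree type. The following is a minimal sketch, not a standard RST library: the class `RSTNode`, its field names, the helper `central_edus`, and the example sentences are all hypothetical and chosen only for illustration.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class RSTNode:
    """One node of a discourse tree (hypothetical schema for illustration)."""
    role: str = "N"                  # role relative to the parent: "N" nucleus, "S" satellite
    relation: Optional[str] = None   # relation joining the children; None on a leaf
    text: Optional[str] = None       # EDU text; None on an internal node
    children: List["RSTNode"] = field(default_factory=list)


def central_edus(node: RSTNode) -> List[str]:
    """Collect the EDUs reached from this node by following nucleus links only."""
    if not node.children:
        return [node.text]
    found: List[str] = []
    for child in node.children:
        if child.role == "N":
            found.extend(central_edus(child))
    return found


# Evidence is a nucleus-satellite relation; Contrast is multinuclear,
# so both of its children carry the "N" role.
tree = RSTNode(relation="Contrast", children=[
    RSTNode(role="N", relation="Evidence", children=[
        RSTNode(role="N", text="The basic plan is enough for you."),
        RSTNode(role="S", text="You used 2 GB last month."),
    ]),
    RSTNode(role="N", text="The premium plan adds roaming."),
])

print(central_edus(tree))
```

Following only nucleus links drops the Evidence satellite and keeps both sides of the Contrast, which is the sense in which nuclei "carry the central message".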
Dialog systems require models that bridge intent, context, and surface form; RST offers an account of coherence that complements dialog act schemes by linking acts into functional structures over multiple turns. In adaptive and stylistically aware generation, RST trees guide what to foreground and how to realize relations linguistically, which supports persona and style control in virtual beings (Mairesse & Walker, 2010; Mairesse, 2008). RST-aware planners have been used to structure cooperative exchanges and to coordinate language with nonverbal behaviors in embodied and multimodal settings without depending on any specific hardware form, which aligns well with virtual human production pipelines (Isard & Matheson, 2012; Meza et al., 2010; Pineda et al., 2010).
An RST-enabled conversational architecture for virtual beings consists of four layers connected by a shared discourse state. The segmentation and interpretation layer detects EDUs from noisy, spontaneous inputs and aligns them with turns. The relation prediction layer classifies nucleus–satellite roles and relation labels, incrementally maintaining a discourse tree over the session. The dialog management layer consults the tree to determine next intents, ensuring that responses either advance nuclei or supply appropriate satellites such as Evidence or Elaboration. The generation layer plans text from tree structures, realizing rhetorical relations with lexical cues, sentence aggregation, and referential choices while maintaining persona and affect constraints drawn from the character’s profile. The architecture exposes programmatic hooks so summaries, highlights, and explanations can be derived by extracting nuclei and their supporting satellites.
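As one possible reading of how the dialog management layer might consult the shared discourse state, the sketch below chooses between advancing the current nucleus and supplying a missing satellite. Everything in it is an assumption made for illustration: the `DiscourseState` fields, the `next_move` function, the user-signal names, and the mapping from signals to relations are not taken from any published system.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional, Set, Tuple

# Hypothetical mapping from a detected user signal to the satellite that answers it.
SUPPORT_FOR_SIGNAL = {"doubt": "Evidence", "confusion": "Elaboration"}


@dataclass
class DiscourseState:
    """Shared state that the layers read and write (illustrative fields only)."""
    open_nuclei: List[str] = field(default_factory=list)           # most recent last
    satellites: Dict[str, Set[str]] = field(default_factory=dict)  # nucleus id -> relations supplied


def next_move(state: DiscourseState,
              user_signal: Optional[str]) -> Tuple[str, Optional[str], Optional[str]]:
    """Return (action, nucleus_id, relation) for the next system turn."""
    if not state.open_nuclei:
        return ("advance", None, None)
    focus = state.open_nuclei[-1]
    wanted = SUPPORT_FOR_SIGNAL.get(user_signal)
    if wanted and wanted not in state.satellites.get(focus, set()):
        # Record the satellite so the same support is not offered twice.
        state.satellites.setdefault(focus, set()).add(wanted)
        return ("support", focus, wanted)
    return ("advance", focus, None)


state = DiscourseState(open_nuclei=["recommend_basic_plan"])
print(next_move(state, "doubt"))  # supplies Evidence for the open nucleus
print(next_move(state, "doubt"))  # Evidence was already given, so the plan moves on
```

Keeping the decision a pure function of the state and one signal makes the policy easy to log and audit; a deployed manager would presumably weigh more context than this.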
Segmentation can be approached with supervised sequence models trained on RST-style corpora adapted to dialog, incorporating prosodic cues and punctuation surrogates where available. Relation prediction typically uses feature-rich classifiers or neural encoders trained to label relations and nuclearity; incremental parsers can update the tree as turns arrive to support real-time interaction. For generation, schema-based planners can map target intents into RST skeletons that are fleshed out by sentence planners and surface realizers; stylistic and personality parameters conditioned by psychologically informed models can modulate lexical choice and aggregation while respecting the discourse plan (Mairesse & Walker, 2010; Isard & Matheson, 2012). Affect-sensitive variants can select supportive relations such as Justify or Enablement to align with user state without altering factual content (Skowron, 2010).
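To illustrate the turn-by-turn flow of segmentation followed by relation prediction, here is a deliberately simple sketch. It replaces the trained segmenter and classifier described above with a regular expression and a three-word cue lexicon, and it attaches each new EDU to the immediately preceding one instead of building a full tree. The names `segment`, `Attachment`, `IncrementalParser`, and `CUE_RELATIONS` are hypothetical, and the cue-to-relation mapping is a toy assumption.

```python
import re
from dataclasses import dataclass
from typing import List, Optional, Tuple

# Toy cue lexicon standing in for a trained relation classifier (an assumption).
# Each cue maps to (relation, role of the new EDU): "S" satellite, "N" co-nucleus.
CUE_RELATIONS = {
    "because": ("Justify", "S"),
    "if": ("Condition", "S"),
    "but": ("Contrast", "N"),
}

# Split after sentence-final punctuation, or at a comma that precedes a cue word.
_BOUNDARY = re.compile(r"(?<=[.!?])\s+|,\s+(?=(?:because|if|but)\b)", re.IGNORECASE)


def segment(utterance: str) -> List[str]:
    """Rule-based stand-in for a learned EDU segmenter."""
    return [part.strip() for part in _BOUNDARY.split(utterance) if part.strip()]


@dataclass
class Attachment:
    edu: int                 # index of the new EDU
    head: Optional[int]      # EDU it attaches to; None for the first EDU
    relation: Optional[str]
    role: str                # role of the new EDU: "N" or "S"


class IncrementalParser:
    """Extends the discourse structure one turn at a time."""

    def __init__(self) -> None:
        self.edus: List[Tuple[str, str]] = []   # (speaker, text)
        self.links: List[Attachment] = []

    def add_turn(self, speaker: str, utterance: str) -> List[Attachment]:
        added = []
        for text in segment(utterance):
            index = len(self.edus)
            self.edus.append((speaker, text))
            if index == 0:
                link = Attachment(index, None, None, "N")
            else:
                first_word = re.sub(r"\W+$", "", text.split()[0]).lower()
                relation, role = CUE_RELATIONS.get(first_word, ("Elaboration", "S"))
                link = Attachment(index, index - 1, relation, role)
            self.links.append(link)
            added.append(link)
        return added


parser = IncrementalParser()
parser.add_turn("agent", "I recommend the basic plan, because you rarely travel.")
for link in parser.add_turn("user", "But I need roaming next month."):
    print(link)
```

Because `add_turn` returns only the links it just created, a dialog manager can react to new structure without rescanning the whole session. Revising earlier attachments when later context arrives is left out of this sketch.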
Effective use of RST in dialog requires annotated corpora with EDU boundaries, nuclearity, and relation labels. Existing RST practices and tools, such as tree annotation utilities and multi-layer discourse annotation environments, provide starting points that can be adapted to conversational data and to languages beyond English (O’Donnell, 2000; Van der Vliet et al., 2011; Redeker et al., 2012). For Spanish and other languages with active dialog-system research, multimodal corpora integrating gestures and gaze with RST-style rhetorical acts have been assembled to support cross-channel fusion and fission modules (Meza et al., 2010; Pineda et al., 2010). Where full RST annotation is costly, partial labels for high-frequency relations and nuclearity can still yield practical gains for summarization and planning.
RST supports several high-value functions for virtual beings. In dialog summarization, extracting nuclei and their satellites yields concise, faithful recaps tailored to user goals. In dialog act classification and intent tracking, discourse roles help disambiguate functionally similar acts by their structural position. In personality- and style-based generation, discourse plans align content selection with persona traits while preserving coherence (Mairesse & Walker, 2010). In multimodal coordination, planned relations inform timing and emphasis of visual behaviors and other modalities, ensuring that supportive content is appropriately cued (Isard & Matheson, 2012). Additional applications include argument presentation, question generation that targets gaps implied by relations, and committed-belief analysis where discourse role informs epistemic stance in conversation (Prabhakaran et al., 2010; Cadilhac et al., 2011; Kanayama et al., 2012).
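For the summarization use case, one simple heuristic is to rank each EDU by how many satellite links lie between it and the root, then keep the EDUs under a chosen threshold. The sketch below assumes that heuristic; it is not a method prescribed by the cited work. The `RSTNode` schema, the function names, and the example dialog content are hypothetical.

```python
from dataclasses import dataclass, field
from typing import List, Optional, Tuple


@dataclass
class RSTNode:
    """Hypothetical tree node: leaves hold EDU text, internal nodes hold a relation."""
    role: str = "N"                  # "N" nucleus or "S" satellite, relative to the parent
    relation: Optional[str] = None
    text: Optional[str] = None
    children: List["RSTNode"] = field(default_factory=list)


def score_edus(node: RSTNode, demotions: int = 0) -> List[Tuple[str, int]]:
    """Return (text, demotions) per EDU in reading order.

    demotions counts the satellite links on the path from the root to the EDU.
    """
    if node.role == "S":
        demotions += 1
    if not node.children:
        return [(node.text, demotions)]
    scored: List[Tuple[str, int]] = []
    for child in node.children:
        scored.extend(score_edus(child, demotions))
    return scored


def summarize(tree: RSTNode, max_demotions: int = 0) -> List[str]:
    """Keep EDUs whose satellite depth does not exceed max_demotions."""
    return [text for text, depth in score_edus(tree) if depth <= max_demotions]


recap = RSTNode(relation="Elaboration", children=[
    RSTNode(role="N", relation="Justify", children=[
        RSTNode(role="N", text="The agent recommended the basic plan."),
        RSTNode(role="S", text="The user rarely travels."),
    ]),
    RSTNode(role="S", relation="Condition", children=[
        RSTNode(role="N", text="Roaming can be added later"),
        RSTNode(role="S", text="if the user starts travelling."),
    ]),
])

print(summarize(recap, max_demotions=0))
print(summarize(recap, max_demotions=1))
```

Raising `max_demotions` lengthens the recap by admitting progressively less central material, which gives a single knob for tailoring summary length to user goals.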
Intrinsic evaluation measures the accuracy of EDU segmentation, nuclearity assignment, and relation labeling, as well as tree similarity metrics that compare predicted and reference structures. Extrinsic evaluation measures impact on downstream tasks: improvements in task success rates, fewer turns to completion, higher user ratings of coherence and appropriateness, and better performance of summarizers and generators compared with non-RST baselines. For stylistically adaptive systems, human judgments of persona consistency and pragmatic appropriateness complement automatic metrics. In multimodal settings, alignment between rhetorical emphasis and nonverbal behaviors can be assessed via expert review and user studies.
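The intrinsic measures can be computed as set overlap between predicted and reference annotations. The sketch below assumes EDU boundaries are represented as token offsets and tree constituents as `(first_edu, last_edu, nuclearity, relation)` tuples; that encoding, the function name `prf`, and the example values are illustrative assumptions, and published evaluations differ in details such as how leaves and the root are counted.

```python
from typing import Dict, Hashable, Set


def prf(gold: Set[Hashable], predicted: Set[Hashable]) -> Dict[str, float]:
    """Precision, recall, and F1 over two sets of annotation items."""
    true_positives = len(gold & predicted)
    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / len(gold) if gold else 0.0
    total = precision + recall
    f1 = 2 * precision * recall / total if total else 0.0
    return {"precision": precision, "recall": recall, "f1": f1}


# EDU segmentation: token offsets at which a boundary was placed.
gold_boundaries = {3, 7, 12}
pred_boundaries = {3, 8, 12}

# Tree constituents: (first_edu, last_edu, nuclearity, relation).
gold_spans = {(0, 1, "N", "Evidence"), (2, 3, "S", "Condition"), (0, 3, "N", "Contrast")}
pred_spans = {(0, 1, "N", "Elaboration"), (2, 3, "S", "Condition"), (1, 3, "N", "Contrast")}

segmentation = prf(gold_boundaries, pred_boundaries)
# Unlabeled structure: compare span extents only.
structure = prf({s[:2] for s in gold_spans}, {s[:2] for s in pred_spans})
# Structure plus nuclearity: a span counts only if its role also matches.
nuclearity = prf({s[:3] for s in gold_spans}, {s[:3] for s in pred_spans})
# Fully labeled: extent, nuclearity, and relation must all match.
relation = prf(gold_spans, pred_spans)

for name, scores in [("segmentation", segmentation), ("structure", structure),
                     ("nuclearity", nuclearity), ("relation", relation)]:
    print(f"{name:12s} F1 = {scores['f1']:.3f}")
```

Reporting the three tree scores side by side shows where a parser loses accuracy: in this toy case two of three spans are bracketed correctly, but only one also carries the correct relation label.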
Deployments for virtual beings must handle interruptions, overlaps, repairs, and elliptical turns. Incremental parsing and partial-tree maintenance allow the system to revise structures as more context arrives. Domain adaptation is essential because relation distributions shift across tasks and genres; fine-tuning on in-domain dialog while regularizing toward general RST constraints mitigates overfitting. Multilingual support requires language-specific segmentation features and relation lexicons, although shared nuclearity and structural biases tend to transfer across languages. Tooling should enable annotator guidance, consistency checks on nuclearity and relation constraints, and export of trees to the dialog manager and generator in a stable interchange format.
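A consistency check and export step of the kind mentioned above might look like the following sketch. It assumes trees are held as nested dictionaries and serialized as JSON; the envelope fields (`format`, `version`, `tree`), the function names, and the short list of multinuclear relations are assumptions made for illustration, not an established interchange standard.

```python
import json
from typing import Any, Dict, List

# Relations treated as multinuclear in this sketch (a small assumed subset).
MULTINUCLEAR = {"Contrast", "Joint", "Sequence"}


def check_node(node: Dict[str, Any], path: str = "root") -> List[str]:
    """Return a list of human-readable constraint violations under this node."""
    problems: List[str] = []
    children = node.get("children", [])
    if not children:
        if not node.get("text"):
            problems.append(f"{path}: leaf without EDU text")
        return problems
    roles = [child.get("role") for child in children]
    relation = node.get("relation")
    if relation is None:
        problems.append(f"{path}: internal node without a relation")
    elif relation in MULTINUCLEAR:
        if roles.count("N") < 2 or "S" in roles:
            problems.append(f"{path}: {relation} needs two or more nuclei and no satellite")
    elif roles.count("N") != 1:
        problems.append(f"{path}: {relation} needs exactly one nucleus")
    for index, child in enumerate(children):
        problems.extend(check_node(child, f"{path}.{index}"))
    return problems


def export_tree(tree: Dict[str, Any]) -> str:
    """Validate a tree, then serialize it with a versioned envelope."""
    problems = check_node(tree)
    if problems:
        raise ValueError("; ".join(problems))
    envelope = {"format": "rst-tree", "version": 1, "tree": tree}
    return json.dumps(envelope, sort_keys=True, ensure_ascii=False)


tree = {
    "relation": "Evidence",
    "children": [
        {"role": "N", "text": "The basic plan is enough for you."},
        {"role": "S", "text": "You used 2 GB last month."},
    ],
}

payload = export_tree(tree)
print(payload)
```

Refusing to export an invalid tree keeps malformed structures out of the dialog manager and generator, and the explicit version number lets consumers detect schema changes.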
RST presupposes tree-structured coherence, but real conversations often exhibit cross-cutting dependencies, topic resumption, and parallel threads that are not strictly tree-like. Spontaneous speech complicates EDU boundaries, especially with backchannels and incremental constructions. Relation inventories vary across corpora, and label granularity impacts both learnability and utility. There is limited availability of richly annotated dialog corpora for many languages and domains, and building them is resource-intensive. Finally, achieving low-latency, incrementally updated discourse structures that remain stable enough for planning remains a central engineering tension.
Promising avenues include hybrid models that combine RST with shallow discourse cues and information-state representations, incremental neural parsers trained with latency-aware objectives, and controllable generators that accept RST trees as planning inputs for reliable content ordering and persona realization. Semi-automatic annotation and active learning can reduce data costs, while cross-lingual transfer and projection can extend coverage. For transparency and safety, exposing simplified discourse views to users and creators can make virtual beings’ reasoning auditable and adjustable without compromising real-time performance.
RST supplies a functional account of coherence that is directly actionable in dialog systems for virtual beings. By structuring conversations into nuclei and satellites linked by explicit rhetorical relations, developers can build systems that plan, explain, and summarize interactions more effectively, control style without sacrificing clarity, and coordinate multimodal behaviors around meaningful discourse units. Despite practical challenges in segmentation, relation prediction, and real-time stability, integrating RST into the conversational stack can materially improve coherence and controllability, and it provides a principled foundation for future, more transparent virtual beings.