Reinforcement Learning for Adaptive Dialogue Systems

by Verena Rieser and Oliver Lemon (2011)

Contents

1 Introduction . . . 1

1.1 The Design Problem for Spoken Dialogue Systems . . . 1

1.2 Overview . . . 2

1.3 Structure of the Book . . . 4

Part I Fundamental Concepts

2 Background . . . 9

2.1 Human-Computer Interaction . . . 10

2.2 Dialogue Strategy Development . . . 11

2.2.1 Conventional Development Lifecycle . . . 12

2.2.2 Evaluation and Strategy Quality Control . . . 13

2.2.3 Strategy Implementation . . . 17

2.2.4 Challenges for Strategy Development . . . 19

2.3 Literature Review: Learning Dialogue Strategies . . . 21

2.3.1 Machine Learning Paradigms . . . 21

2.3.2 Supervised Learning for Dialogue Strategies . . . 22

2.3.3 Dialogue as Decision Making under Uncertainty . . . 23

2.3.4 Reinforcement Learning for Dialogue Strategies . . . 24

2.4 Summary . . . 26

3 Reinforcement Learning . . . 29

3.1 The Nature of Dialogue Interaction . . . 30

3.1.1 Dialogue is Temporal . . . 30

3.1.2 Dialogue is Dynamic . . . 31

3.2 Reinforcement Learning-based Dialogue Strategy Learning . . . 32

3.2.1 Dialogue as a Markov Decision Process . . . 32

3.2.2 The Reinforcement Learning Problem . . . 36

3.2.3 Model-based vs. Simulation-based Strategy Learning . . . 42

3.3 Dialogue Simulation . . . 45

3.3.1 Wizard-of-Oz Studies . . . 45

3.3.2 Computer-based Simulations . . . 46

3.3.3 Discussion . . . 47

3.4 Application Domains . . . 48

3.4.1 Information-Seeking Dialogue Systems . . . 48

3.4.2 Multimodal Output Planning and Information Presentation . . . 49

3.4.3 Multimodal Dialogue Systems for In-Car Digital Music Players . . . 52

3.5 Summary . . . 52

4 Proof-of-Concept: Information Seeking Strategies . . . 53

4.1 Introduction . . . 53

4.1.1 A Proof-of-Concept Study . . . 54

4.2 Simulated Learning Environments . . . 55

4.2.1 Problem Representation . . . 55

4.2.2 Database Retrieval Simulations . . . 56

4.2.3 Noise Model . . . 57

4.2.4 User Simulations . . . 58

4.2.5 Objective and Reward Function . . . 59

4.2.6 Application Scenarios . . . 60

4.3 Threshold-based Baseline . . . 61

4.4 Reinforcement Learning Method . . . 63

4.4.1 Training the Policies . . . 63

4.5 Results . . . 65

4.6 Summary . . . 69

Part II Policy Learning in Simulated Environments

5 The Bootstrapping Approach to Developing Reinforcement Learning-based Strategies . . . 73

5.1 Motivation . . . 74

5.1.1 Term Definition . . . 75

5.1.2 Related Work . . . 76

5.2 Advantages for Learning from WOZ Data . . . 77

5.2.1 Challenges for Learning from WOZ Data . . . 78

5.3 The Bootstrapping Method . . . 79

5.3.1 Step 1: Data Collection in a Wizard-of-Oz Experiment . . . 79

5.3.2 Step 2: Build a Simulated Learning Environment . . . 81

5.3.3 Step 3: Train and Test a Strategy in Simulation . . . 81

5.3.4 Step 4: Test with Real Users . . . 82

5.3.5 Step 5: Post-Evaluation . . . 82

5.4 Summary . . . 82

6 Data Collection in a Wizard-of-Oz Experiment . . . 85

6.1 Experimental Setup . . . 86

6.1.1 Recruited Subjects: Wizards and Users . . . 89

6.1.2 Experimental Procedure and Task Design . . . 90

6.2 Noise Simulation . . . 90

6.2.1 Related Work . . . 90

6.2.2 Method . . . 91

6.2.3 Results and Discussion . . . 91

6.3 Corpus Description . . . 92

6.4 Analysis . . . 94

6.4.1 Qualitative Measures . . . 94

6.4.2 Subjective Ratings from the User Questionnaires . . . 95

6.5 Summary and Discussion . . . 98

7 Building Simulation Environments from Wizard-of-Oz Data . . . 101

7.1 Dialogue Strategy Learning with Simulated Environments . . . 101

7.1.1 Method and Related Work . . . 103

7.1.2 Outline . . . 106

7.2 Database Description . . . 107

7.3 Action Set Selection . . . 108

7.3.1 Method and Related Work . . . 108

7.3.2 Annotation Scheme . . . 108

7.3.3 Manual Annotation . . . 110

7.3.4 Action Set for Learning . . . 111

7.4 State Space Selection . . . 112

7.4.1 Method and Related Work . . . 112

7.4.2 Task-based State Space Features . . . 113

7.4.3 Feature Selection Techniques for Domain-specific State Space Features . . . 114

7.5 MDP and Strategy Design . . . 118

7.5.1 Motivation . . . 118

7.5.2 Implementation . . . 118

7.5.3 Hierarchical Reinforcement Learning in the ISU Approach . . . 119

7.5.4 Further System Behaviour . . . 120

7.6 Wizard Behaviour . . . 122

7.6.1 Method and Related Work . . . 122

7.6.2 Supervised Learning: Rule-based Classification . . . 124

7.7 Noise Simulation: Modelling the Effects of Mis-Communication . . . 125

7.7.1 Method and Related Work . . . 125

7.7.2 Simulating the Effects of Non- and Mis-Understandings . . . 127

7.8 User Simulation . . . 128

7.8.1 Method and Related Work . . . 129

7.8.2 User Actions . . . 132

7.8.3 A Simple Bi-gram Model . . . 133

7.8.4 Cluster-based User Simulation . . . 134

7.8.5 Smoothed Bi-gram User Simulation . . . 136

7.8.6 Evaluation of User Simulations . . . 138

7.8.7 Speech Act Realisation Dependent on the User Goal . . . 139

7.9 Reward and Objective Functions . . . 142

7.9.1 Method and Related Work . . . 142

7.9.2 Linear Regression for Information Acquisition . . . 146

7.9.3 Non-linear Rewards for Information Presentation . . . 148

7.9.4 Final Reward . . . 150

7.10 State-Space Discretisation . . . 151

7.11 Learning Experiments . . . 152

7.11.1 Training with SHARSHA . . . 152

7.11.2 Results for Testing in Simulation . . . 154

7.11.3 Qualitative Strategy Description . . . 155

7.11.4 Strategy Implementation . . . 157

7.11.5 Discussion and Error Analysis . . . 158

7.12 Summary . . . 162

Part III Evaluation and Application

8 Comparing Reinforcement and Supervised Learning of Dialogue Policies with Real Users . . . 167

8.1 Policy Integration into a Dialogue System . . . 168

8.1.1 The DUDE Rapid Dialogue Development Tools . . . 168

8.1.2 Extensions to DUDE . . . 170

8.2 Experimental Setup . . . 174

8.2.1 Technical Setup . . . 174

8.2.2 Primary Driving Task . . . 174

8.2.3 Subjects and Procedure . . . 175

8.2.4 Task Types . . . 176

8.2.5 User Questionnaires . . . 176

8.3 Results . . . 177

8.3.1 Subjective User Ratings . . . 178

8.3.2 Objective Dialogue Performance . . . 181

8.4 Discussion of Real User Evaluation Results . . . 182

8.5 Meta-Evaluation . . . 183

8.5.1 Transfer Between Simulated and Real Environments . . . 183

8.5.2 Evaluation of the Learned Reward Function . . . 184

8.6 Summary . . . 188

9 Adaptive Natural Language Generation . . . 189

9.1 Introduction . . . 190

9.1.1 Previous Work on Information Presentation in SDS . . . 190

9.2 NLG as Planning Under Uncertainty . . . 192

9.3 Wizard-of-Oz Data Collection . . . 192

9.3.1 Experimental Setup and Data Collection . . . 193

9.3.2 Surface Realiser . . . 193

9.3.3 Human “Wizard” Baseline Strategy . . . 194

9.4 The Simulation / Learning Environment . . . 195

9.4.1 User Simulations . . . 195

9.4.2 Database Matches and “Focus of Attention” . . . 197

9.4.3 Data-driven Reward Function . . . 197

9.5 Reinforcement Learning Experiments . . . 198

9.5.1 Experimental Set-up . . . 199

9.5.2 Results . . . 199

9.6 Evaluation with Real Users . . . 202

9.7 Conclusion . . . 203

10 Conclusion . . . 205

10.1 Contributions . . . 206

10.2 Discussion . . . 207

10.2.1 Lessons Learned . . . 208

10.2.2 RL for Commercial Dialogue Strategy Development . . . 209

10.3 Outlook: Challenges for Future Statistical Dialogue Systems . . . 210

A Example Dialogues . . . 213

A.1 Wizard-of-Oz Example Dialogues . . . 213

A.2 Example Dialogues from Simulated Interaction . . . 216

A.3 Example Dialogues from User Testing . . . 218

B Learned State-Action Mappings . . . 223

References . . . 229

About the Authors . . . 253