Self-Learning Speaker Identification (2011), by Wolfgang Minker et al.
Contents
1 Introduction
1.1 Motivation
1.2 Overview
2 Fundamentals
2.1 Speech Production
2.2 Front-End
2.3 Speaker Change Detection
2.3.1 Motivation
2.3.2 Bayesian Information Criterion
2.4 Speaker Identification
2.4.1 Motivation and Overview
2.4.2 Gaussian Mixture Models
2.4.3 Detection of Unknown Speakers
2.5 Speech Recognition
2.5.1 Motivation
2.5.2 Hidden Markov Models
2.5.3 Implementation of an Automated Speech Recognizer
2.6 Speaker Adaptation
2.6.1 Motivation
2.6.2 Applications for Speaker Adaptation in Speaker Identification and Speech Recognition
2.6.3 Maximum A Posteriori
2.6.4 Maximum Likelihood Linear Regression
2.6.5 Eigenvoices
2.6.6 Extended Maximum A Posteriori
2.7 Feature Vector Normalization
2.7.1 Motivation
2.7.2 Stereo-based Piecewise Linear Compensation for Environments
2.7.3 Eigen-Environment
2.8 Summary
3 Combining Self-Learning Speaker Identification and Speech Recognition
3.1 Audio Signal Segmentation
3.2 Multi-Stage Speaker Identification and Speech Recognition
3.3 Phoneme Based Speaker Identification
3.4 First Conclusion
4 Combined Speaker Adaptation
4.1 Motivation
4.2 Algorithm
4.3 Evaluation
4.3.1 Database
4.3.2 Evaluation Setup
4.3.3 Results
4.4 Summary
5 Unsupervised Speech Controlled System with Long-Term Adaptation
5.1 Motivation
5.2 Joint Speaker Identification and Speech Recognition
5.2.1 Speaker Specific Speech Recognition
5.2.2 Speaker Identification
5.2.3 System Architecture
5.3 Reference Implementation
5.3.1 Speaker Identification
5.3.2 System Architecture
5.4 Evaluation
5.4.1 Evaluation of Joint Speaker Identification and Speech Recognition
5.4.2 Evaluation of the Reference Implementation
5.5 Summary and Discussion
6 Evolution of an Adaptive Unsupervised Speech Controlled System
6.1 Motivation
6.2 Posterior Probability Depending on the Training Level
6.2.1 Statistical Modeling of the Likelihood Evolution
6.2.2 Posterior Probability Computation at Run-Time
6.3 Closed-Set Speaker Tracking
6.4 Open-Set Speaker Tracking
6.5 System Architecture
6.6 Evaluation
6.6.1 Closed-Set Speaker Identification
6.6.2 Open-Set Speaker Identification
6.7 Summary
7 Summary and Conclusion
8 Outlook
A Appendix
A.1 Expectation Maximization Algorithm
A.2 Bayesian Adaptation
A.3 Evaluation Measures
References
Index