
Wednesday, July 12, 2017

AUTOMATED QUALITY MONITORING IN THE CALL CENTER WITH ASR AND MAXIMUM ENTROPY

http://www.danielpovey.com/files/icassp06_call_center.pdf


AUTOMATED QUALITY MONITORING IN THE CALL CENTER WITH ASR AND MAXIMUM ENTROPY

G. Zweig, O. Siohan, G. Saon, B. Ramabhadran, D. Povey, L. Mangu and B. Kingsbury
IBM T.J. Watson Research Center, Yorktown Heights, NY 10598

ABSTRACT

This paper describes an automated system for assigning quality scores to recorded call center conversations. The system combines speech recognition, pattern matching, and maximum entropy classification to rank calls according to their measured quality. Calls at both ends of the spectrum are flagged as "interesting" and made available for further human monitoring. In this process, pattern matching on the ASR transcript is used to answer a set of standard quality control questions such as "did the agent use courteous words and phrases," and to generate a question-based score. This is interpolated with the probability of a call being "bad," as determined by a maximum entropy classifier operating on a set of ASR-derived features such as "maximum silence length" and the occurrence of selected n-gram word sequences. The system is trained on a set of calls with associated manual evaluation forms. We present precision and recall results from IBM's North American Help Desk indicating that for a given amount of listening effort, this system triples the number of bad calls that are identified, over the current policy of randomly sampling calls.

1. INTRODUCTION

Every day, tens of millions of help-desk calls are recorded at call centers around the world. As part of a typical call center operation, a random sample of these calls is re-played to human monitors who score the calls with respect to a variety of quality-related questions, e.g.:

- Was the account successfully identified by the agent?
- Did the agent request error codes/messages to help determine the problem?
- Was the problem resolved?
- Did the agent maintain appropriate tone, pitch, volume and pace?

This process suffers from a number of important problems. First, the monitoring at least doubles the cost of each call (first an operator is paid to take it, then a monitor to evaluate it). This leads to the second problem: only a very small sample of calls, e.g. a fraction of a percent, is typically evaluated. The third problem arises from the fact that most calls are ordinary and uninteresting; with random sampling, the human monitors spend most of their time listening to uninteresting calls.

This paper describes an automated quality-monitoring system that addresses these problems. Automatic speech recognition is used to transcribe 100% of the calls coming in to a call center, and default quality scores are assigned based on features such as key words, key phrases, the number and type of hesitations, and the average silence durations. The default score is used to rank the calls from worst to best, and this sorted list is made available to the human evaluators, who can thus spend their time listening only to calls for which there is some a priori reason to expect something interesting.

The automatic quality-monitoring problem is interesting in part because of the variability in how hard it is to answer the questions. Some questions, for example "Did the agent use courteous words and phrases?", are relatively straightforward to answer by looking for key words and phrases. Others, however, require essentially human-level knowledge to answer; for example, one company's monitors are asked to answer the question "Did the agent take ownership of the problem?" Our work focuses on calls from IBM's North American call centers, where there is a set of 31 questions that are used to evaluate call quality. Because of the high degree of variability found in these calls, we have investigated two approaches:

1. Use a partial score based only on the subset of questions that can be reliably answered.
2. Use a maximum entropy classifier to map directly from ASR-generated features to the probability that a call is bad (defined as belonging to the bottom 20% of calls).

We have found that both approaches are workable, and we present final results based on an interpolation between the two scores. These results indicate that for a fixed amount of listening effort, the number of bad calls that are identified approximately triples with our call-ranking approach. Surprisingly, while there has been significant previous scholarly research in automated call routing and classification in the call center, e.g. [1, 2, 3, 4, 5], there has been much less in automated quality monitoring per se.

2. ASR FOR CALL CENTER TRANSCRIPTION

2.1. Data

The speech recognition systems were trained on approximately 300 hours of 6kHz, mono audio data collected at one of the IBM call centers located in Raleigh, NC. The audio was manually transcribed, and speaker turns were explicitly marked in the word transcriptions but not the corresponding times. In order to detect speaker changes in the training data, we did a forced alignment of the data and chopped it at speaker boundaries. The test set consists of 50 calls with 113 speakers, totaling about 3 hours of speech.

2.2. Speaker Independent System

The raw acoustic features used for segmentation and recognition are perceptual linear prediction (PLP) features. For the speaker independent system, the features are mean-normalized on a per-speaker basis. Every 9 consecutive 13-dimensional PLP frames are concatenated and projected down to 40 dimensions using LDA+MLLT. The SI acoustic model consists of 50K Gaussians trained with MPE and uses a quinphone cross-word acoustic context. The techniques are the same as those described in [6].
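The splicing-and-projection step above is mostly shape bookkeeping; the following NumPy sketch illustrates it under stated assumptions. The transform matrix lda_mllt of shape (40, 117) stands in for the LDA+MLLT projection that would in practice be estimated from training data (117 = 9 frames x 13 dimensions); it is not the authors' code.

```python
import numpy as np

def splice_and_project(plp, lda_mllt, context=4):
    """Concatenate +/-4 neighboring 13-dim PLP frames (9 in total) and
    project each 117-dim spliced vector down to 40 dimensions.

    plp      : (T, 13) array of mean-normalized PLP frames
    lda_mllt : (40, 117) projection assumed to be estimated with LDA
               followed by MLLT (estimation itself is not shown here)
    """
    T, _ = plp.shape
    # Pad the edges by repeating the first/last frame so every frame has full context.
    padded = np.vstack([plp[:1].repeat(context, axis=0),
                        plp,
                        plp[-1:].repeat(context, axis=0)])
    # Build the (T, 9*13) spliced matrix from 9 shifted copies of the input.
    spliced = np.hstack([padded[i:i + T] for i in range(2 * context + 1)])
    # Linear projection into the 40-dimensional recognition feature space.
    return spliced @ lda_mllt.T

# Toy usage with random data, just to show the shapes involved.
plp = np.random.randn(500, 13)         # 500 frames of 13-dim PLP
lda_mllt = np.random.randn(40, 117)    # stand-in for the learned transform
feats = splice_and_project(plp, lda_mllt)
assert feats.shape == (500, 40)
```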
2.3. Incremental Speaker Adaptation

In the context of speaker-adaptive training, we use two forms of feature-space normalization, vocal tract length normalization (VTLN) and feature-space MLLR (fMLLR, also known as constrained MLLR), to produce canonical acoustic models in which some of the non-linguistic sources of speech variability have been reduced. To this canonical feature space, we then apply a discriminatively trained transform called fMPE [7]. The speaker-adapted recognition model is trained in this resulting feature space using MPE.

We distinguish between two forms of adaptation: off-line and incremental adaptation. For the former, the transformations are computed per conversation side using the full output of a speaker independent system. For the latter, the transformations are updated incrementally using the decoded output of the speaker-adapted system up to the current time; the speaker-adaptive transforms are then applied to the future sentences. The advantage of incremental adaptation is that it requires only a single decoding pass (as opposed to two passes for off-line adaptation), resulting in a decoding process which is twice as fast. In Table 1, we compare the performance of the two approaches. Most of the gain of full off-line adaptation is retained in the incremental version.

Table 1. ASR results depending on segmentation/clustering and adaptation type.

  Segmentation/clustering   Adaptation      WER
  Manual                    Off-line        30.2%
  Manual                    Incremental     31.3%
  Manual                    No adaptation   35.9%
  Automatic                 Off-line        33.0%
  Automatic                 Incremental     35.1%

2.3.1. Segmentation and Speaker Clustering

We use an HMM-based segmentation procedure for segmenting the audio into speech and non-speech prior to decoding. The reason is that we want to eliminate the non-speech segments in order to reduce the computational load during recognition. The speech segments are clustered together in order to identify segments coming from the same speaker, which is crucial for speaker adaptation. The clustering is done via k-means, each segment being modeled by a single diagonal-covariance Gaussian; the metric is the symmetric K-L divergence between two Gaussians. The impact of the automatic segmentation and clustering on the error rate is indicated in Table 1.
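As a concrete illustration of the clustering metric just described, here is a minimal sketch (not the authors' implementation) of the symmetric K-L divergence between two diagonal-covariance Gaussians; a k-means-style loop could use it as the distance between a segment's Gaussian and a cluster centroid. The variance floor and the toy data are assumptions.

```python
import numpy as np

def kl_diag_gauss(mu_p, var_p, mu_q, var_q):
    """KL(p || q) for diagonal-covariance Gaussians, summed over dimensions."""
    return 0.5 * np.sum(np.log(var_q / var_p)
                        + (var_p + (mu_p - mu_q) ** 2) / var_q
                        - 1.0)

def symmetric_kl(mu_p, var_p, mu_q, var_q):
    """Symmetric K-L divergence used as the segment-to-segment distance."""
    return (kl_diag_gauss(mu_p, var_p, mu_q, var_q)
            + kl_diag_gauss(mu_q, var_q, mu_p, var_p))

def segment_gaussian(frames):
    """Summarize a speech segment (T, 40 feature frames) by a diagonal Gaussian."""
    return frames.mean(axis=0), frames.var(axis=0) + 1e-6  # variance floor (assumed)

# Toy usage: distance between two randomly generated segments.
seg_a = np.random.randn(200, 40)
seg_b = np.random.randn(150, 40) + 0.5
d = symmetric_kl(*segment_gaussian(seg_a), *segment_gaussian(seg_b))
print(f"symmetric KL distance: {d:.2f}")
```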
3. CALL RANKING

3.1. Question Answering

This section presents automated techniques for evaluating call quality. These techniques were developed using a training/development set of 676 calls with associated manually generated quality evaluations. The test set consists of 195 calls.

The quality of the service provided by the help-desk representatives is commonly assessed by having human monitors listen to a random sample of the calls and then fill in evaluation forms. The form for IBM's North American Help Desk contains 31 questions. A subset of the questions can be answered easily using automatic methods, among them the ones that check that the agent followed the guidelines, e.g.:

- Did the agent follow the appropriate closing script?
- Did the agent identify herself to the customer?

But some of the questions require human-level knowledge of the world to answer, e.g.:

- Did the agent ask pertinent questions to gain clarity of the problem?
- Were all available resources used to solve the problem?

We were able to answer 21 out of the 31 questions using pattern matching techniques. For example, if the question is "Did the agent follow the appropriate closing script?", we search for "THANK YOU FOR CALLING", "ANYTHING ELSE" and "SERVICE REQUEST". Any of these is a good partial match for the full script, "Thank you for calling, is there anything else I can help you with before closing this service request?" Based on the answers to the 21 questions, we compute a score for each call and use it to rank the calls.

We label a call in the test set as being bad/good if it has been placed in the bottom/top 20% by human evaluators. We report the accuracy of our scoring system on the test set by computing the number of bad calls that occur in the bottom 20% of our sorted list and the number of good calls found in the top 20% of our list. The accuracy numbers can be found in Table 2.

Table 2. Accuracy for the Question Answering system.

  Accuracy   Top 20%   Bottom 20%
  Random     20%       20%
  QA         41%       30%
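To make the question-answering step concrete, here is a minimal sketch of how a script-compliance question could be answered by phrase spotting on an uppercase ASR transcript, and how the per-question answers could be pooled into a ranking score. This is not the authors' implementation: only the closing-script phrases come from the paper, and the other phrase lists and the simple averaging are hypothetical.

```python
# Hedged sketch of the pattern-matching question answering in Section 3.1.
# Phrase lists other than the closing-script example are made up for illustration.
QUESTION_PATTERNS = {
    "closing_script": ["THANK YOU FOR CALLING", "ANYTHING ELSE", "SERVICE REQUEST"],
    "agent_identified_self": ["MY NAME IS", "THIS IS"],           # hypothetical
    "courteous_phrases": ["PLEASE", "YOU ARE WELCOME", "SORRY"],  # hypothetical
}

def answer_question(transcript, phrases):
    """A question is answered 'yes' if any of its key phrases occurs in the transcript."""
    return any(p in transcript for p in phrases)

def qa_score(transcript):
    """Fraction of automatically answerable questions answered 'yes'.

    The real system answers 21 of the 31 evaluation-form questions; this toy
    version just averages over whatever patterns are defined above.
    """
    answers = [answer_question(transcript, p) for p in QUESTION_PATTERNS.values()]
    return sum(answers) / len(answers)

transcript = ("THANK YOU FOR CALLING MY NAME IS ALICE "
              "IS THERE ANYTHING ELSE I CAN HELP YOU WITH")
print(qa_score(transcript))   # two of the three toy questions fire -> 0.666...

# Calls are then ranked from worst to best by this score.
calls = {"call_001": transcript, "call_002": "HELLO GOODBYE"}
ranked_worst_first = sorted(calls, key=lambda c: qa_score(calls[c]))
```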
3.2. Maximum Entropy Ranking

Another alternative for scoring calls is to find arbitrary features in the speech recognition output that correlate with a call being in the bottom 20% or not. The goal is to estimate the probability of a call being bad based on features extracted from the automatic transcription. To achieve this, we build a maximum entropy based system which is trained on a set of calls with associated transcriptions and manual evaluations. The score of a call with extracted feature set F is given by the maximum entropy model [8]

  P(bad | F) = (1 / Z(F)) * exp( Σ_i λ_i f_i(F) ),   (1)

where the f_i are the feature functions, the λ_i are the weights learned in training, and Z(F) is the normalization term.

Due to the fact that our training set contained under 700 calls, we used a hand-guided method for defining features. Specifically, we generated a list of VIP phrases as candidate features, e.g. "THANK YOU FOR CALLING" and "HELP YOU". We also created a pool of generic ASR features, e.g. "number of hesitations", "total silence duration", and "longest silence duration". A decision tree was then used to select the most relevant features and the threshold associated with each feature. The final set of features contained 5 generic features and 25 VIP phrases. Looking at the weights learned for the different features, we can see that a call with many hesitations and long silences is most likely a bad call. We use P(bad | F), as given by Equation 1, to rank all the calls. Table 3 shows the accuracy of this system for the bottom and top 20% of the test calls.

Table 3. Accuracy for the Maximum Entropy system.

  Accuracy   Top 20%   Bottom 20%
  Random     20%       20%
  ME         49%       36%

At this point we have two scoring mechanisms for each call: one that relies on answering a fixed number of evaluation questions, and a more global one that looks across the entire call for hints. These two scores are both between 0 and 1, and can therefore be interpolated to generate one unique score. After optimizing the interpolation weights on a held-out set, we obtained a slightly higher weight (0.6) for the maximum entropy model. As can be seen in Table 4, the accuracy of the combined system is greater than the accuracy of each individual system, suggesting that the two initial systems are complementary.

Table 4. Accuracy for the combined system.

  Accuracy   Top 20%   Bottom 20%
  Random     20%       20%
  ME + QA    53%       44%
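The sketch below shows one way the maximum entropy score of Section 3.2 could be computed and interpolated with the question-answering score of Section 3.1. It is illustrative only: scikit-learn's LogisticRegression is used as a stand-in binary maximum entropy trainer, the feature names and toy training data are assumptions, and flipping the QA score so that both terms point in the "badness" direction is a guess; only the 0.6/0.4 interpolation weights and the kinds of features (VIP phrases plus generic ASR statistics) come from the paper.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression  # binary maxent equivalent

# A couple of VIP-phrase indicators plus generic ASR features (illustrative subset).
VIP_PHRASES = ["THANK YOU FOR CALLING", "HELP YOU"]

def call_features(transcript, silences_sec, num_hesitations):
    """Assemble one feature vector per call from the ASR output."""
    generic = [float(num_hesitations),
               float(sum(silences_sec)),               # total silence duration
               float(max(silences_sec, default=0.0))]  # longest silence duration
    vip = [1.0 if p in transcript else 0.0 for p in VIP_PHRASES]
    return np.array(generic + vip)

# Training data (toy): y marks calls in the bottom 20% of the manual ranking as bad (1).
X = np.array([call_features("THANK YOU FOR CALLING HOW CAN I HELP YOU", [0.4, 1.1], 2),
              call_features("UH UM HOLD ON", [5.2, 9.7, 12.0], 14)])
y = np.array([0, 1])
maxent = LogisticRegression().fit(X, y)

def combined_score(transcript, silences_sec, num_hesitations, qa_score):
    """Interpolate P(bad | features) with the question-answering score.

    The paper reports a 0.6 weight for the maximum entropy model after tuning
    on held-out data; using (1 - qa_score) for the QA term is an assumption.
    """
    x = call_features(transcript, silences_sec, num_hesitations).reshape(1, -1)
    p_bad = maxent.predict_proba(x)[0, 1]
    return 0.6 * p_bad + 0.4 * (1.0 - qa_score)
```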
Fig. 2. Interface to listen to audio and update the evaluation form.

4. END-TO-END SYSTEM PERFORMANCE

4.1. User Interface

This section describes the user interface of the automated quality-monitoring application. As explained in Section 1, the evaluator scores calls with respect to a set of quality-related questions after listening to the calls. To aid this process, the user interface provides an efficient mechanism for the human evaluator to select calls, e.g.:

- all calls from a specific agent, sorted by score;
- the top 20% or the bottom 20% of the calls from a specific agent, ranked by score;
- the top 20% or the bottom 20% of all calls from all agents.

The automated quality-monitoring user interface is a J2EE web application that is supported by back-end databases and content management systems. (In our case, the back end consists of DB2 and IBM's WebSphere Information Integrator for Content, and the application is hosted on WebSphere 5.1.) The displayed list of calls provides a link to the audio, the automatically filled evaluation form, the overall score for the call, the agent's name, server location, call id, and the date and duration of the call (see Figure 1). This interface gives the evaluator the ability to listen to interesting calls and update the answers in the evaluation form if necessary (the audio player and evaluation form are illustrated in Figure 2). In addition, this interface provides the evaluator with the ability to view summary statistics (average score) and additional information about the quality of the calls.

Fig. 1. Display of selected calls.

4.2. Precision and Recall

This section presents precision and recall numbers for the identification of "bad" calls. The test set consists of 195 calls that were manually evaluated by call center personnel. Based on these manual scores, the calls were ordered by quality, and the bottom 20% were deemed to be "bad." To retrieve calls for monitoring, we sort the calls based on the automatically assigned quality score and return the worst. In our summary figures, precision and recall are plotted as a function of the number of calls that are selected for monitoring. This is important because in reality only a small number of calls can receive human attention. Precision is the ratio of bad calls retrieved to the total number of calls monitored, and recall is the ratio of the number of bad calls retrieved to the total number of bad calls in the test set. Three curves are shown in each plot: the actually observed performance, the performance of random selection, and oracle or ideal performance. Oracle performance shows what would happen if a perfect automatic ordering of the calls were achieved.

Figure 3 shows precision performance. We see that in the monitoring regime where only a small fraction of the calls are monitored, we achieve over 60% precision. (Further, if 20% of the calls are monitored, we still attain over 40% precision.) Figure 4 shows the recall performance. In the regime of low-volume monitoring, the recall is midway between what could be achieved with an oracle and the performance of random selection. Figure 5 shows the ratio of the number of bad calls found with our automated ranking to the number found with random selection. This indicates that in the low-monitoring regime, our automated technique triples efficiency.

Fig. 3. Precision for the bottom 20% of the calls as a function of the number of calls retrieved.
Fig. 4. Recall for the bottom 20% of the calls.
Fig. 5. Ratio of bad calls found with QTM to random selection as a function of the number of bad calls retrieved.
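The curves in Figures 3-5 are straightforward to reproduce for any ranking; the sketch below computes precision and recall as a function of the number of calls monitored, under the paper's definition that the bottom 20% of manually scored calls are "bad". The variable names and toy data are assumptions.

```python
def precision_recall_at_k(ranked_call_ids, bad_call_ids):
    """Precision and recall after monitoring the k worst-ranked calls, for every k.

    ranked_call_ids : calls sorted worst-first by the automatic quality score
    bad_call_ids    : set of calls in the bottom 20% of the manual ranking
    """
    curves = []
    bad_found = 0
    for k, call_id in enumerate(ranked_call_ids, start=1):
        if call_id in bad_call_ids:
            bad_found += 1
        precision = bad_found / k                   # bad calls among those monitored
        recall = bad_found / len(bad_call_ids)      # bad calls found so far
        curves.append((k, precision, recall))
    return curves

# Toy example: 10 calls, the 2 genuinely bad ones ranked 1st and 4th by the system.
ranked = ["c7", "c3", "c9", "c1", "c5", "c2", "c8", "c0", "c4", "c6"]
bad = {"c7", "c1"}
for k, p, r in precision_recall_at_k(ranked, bad)[:4]:
    print(f"monitor {k} call(s): precision={p:.2f} recall={r:.2f}")
# Random selection finds on average k * len(bad) / len(ranked) bad calls after k calls;
# the Fig. 5-style ratio compares bad_found against that expectation.
```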
4.3. Human vs. Computer Rankings

As a final measure of performance, in Figure 6 we present a scatter plot comparing human to computer rankings. We do not have calls that are scored by two humans, so we cannot present a human-human scatter plot for comparison.

Fig. 6. Scatter plot of human vs. computer rank.

5. CONCLUSION

This paper has presented an automated system for quality monitoring in the call center. We propose a combination of maximum-entropy classification based on ASR-derived features and question answering based on simple pattern matching. The system can either be used to replace human monitors or to make them more efficient. Our results show that we can triple the efficiency of human monitors, in the sense of identifying three times as many bad calls for the same amount of listening effort.

6. REFERENCES

[1] J. Chu-Carroll and B. Carpenter, "Vector-based natural language call routing," Computational Linguistics, 1999.
[2] P. Haffner, G. Tur, and J. Wright, "Optimizing SVMs for complex call classification," 2003.
[3] M. Tang, B. Pellom, and K. Hacioglu, "Call-type classification and unsupervised training for the call center domain," in ASRU-2003, 2003.
[4] D. Hakkani-Tur, G. Tur, M. Rahim, and G. Riccardi, "Unsupervised and active learning in automatic speech recognition for call classification," in ICASSP-04, 2004.
[5] C. Wu, J. Kuo, E. E. Jan, V. Goel, and D. Lubensky, "Improving end-to-end performance of call classification through data confusion reduction and model tolerance enhancement," in Interspeech-05, 2005.
[6] H. Soltau, B. Kingsbury, L. Mangu, D. Povey, G. Saon, and G. Zweig, "The IBM 2004 conversational telephony system for rich transcription," in Eurospeech-2005, 2005.
[7] D. Povey, B. Kingsbury, L. Mangu, G. Saon, H. Soltau, and G. Zweig, "fMPE: Discriminatively trained features for speech recognition," in ICASSP-2005, 2005.
[8] A. Berger, S. Della Pietra, and V. Della Pietra, "A maximum entropy approach to natural language processing," Computational Linguistics, vol. 22, no. 1, 1996.
