One of the ways that Dialpad is able to continuously improve our transcription accuracy is by releasing at least one new ASR model every single week. These new models are the result of updates to the Language Model (LM), updates to the Acoustic Model (AM), or additions of new words to the Dialpad dictionary. For more information on what we mean by LM, AM and dictionary, check out our ASR 101 post.
Naturally, these updates require testing to ensure that they are not degrading accuracy or performance, and that they are actually making the changes they are supposed to. Even when a component has not been updated, we still test all three, because our QA process focuses on how the components work together. But these models turn spoken phone call audio into text transcripts. How on earth can you test that without making our QA team place test calls to each other all day? Read on to find out how we were able to automate this process!
The preliminary automated weekly QA process in ASR
STEP 1 — Gathering Keyphrases
Dialpad customers have access to a feature called their “company dictionary” where they can submit words and phrases that are important to their business such as jargon, product names and competitors. These entries allow our system to better recognise these often unique terms in our transcripts. Every week we pull any new entries, which we call keyphrases, along with the category the customer has selected for them (jargon, competitor, product, etc.). From the list of all the new keyphrases, a random subsample is selected to be tested during the QA process.
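As a rough sketch of this weekly pull, the sampling step is just a seeded random choice over the new entries. The entry format and function name below are illustrative assumptions, not Dialpad's actual schema:

```python
import random

# Hypothetical shape of one week's new company-dictionary entries;
# field names and values are illustrative, not Dialpad's real data.
new_keyphrases = [
    {"phrase": "Dialpad Ai", "category": "product"},
    {"phrase": "churn risk", "category": "jargon"},
    {"phrase": "Acme Telecom", "category": "competitor"},
    {"phrase": "UberConference", "category": "product"},
]

def sample_keyphrases(entries, k, seed=None):
    """Pick a random subsample of this week's new keyphrases for QA testing."""
    rng = random.Random(seed)  # seeding makes a QA run reproducible
    return rng.sample(entries, min(k, len(entries)))
```

Capping the sample size at the number of entries keeps the helper safe on quiet weeks when only a few keyphrases arrive.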
The Dialpad Company Dictionary feature in action.
STEP 2 — Testing Keyphrases
Using actual customer conversations as a guide, we have formulated a set of 5 template sentences for each category to mimic real-life usage of each keyphrase. For example, a template sentence for a company keyphrase looks like this:
Hello, thank you for calling [company], this is Mary, how can I help you?
To start the QA process, we insert each keyphrase into the relevant slot of each template sentence. In the example above, [company] would be replaced with an actual company name from our list of keyphrases. This results in a variety of sentences containing each keyphrase from our randomly selected subset. Next, we use a technology called Text-to-Speech (TTS), which does exactly what the name says: it converts text into an audio file of the words being spoken. For each keyphrase in our subsample, we generate a TTS audio file of every template sentence. These audio files make up our QA test set for the week.
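A minimal sketch of this step: the substitution is plain string formatting, and any TTS engine can slot in as the `tts` callable. The templates and helper names here are illustrative, not the actual set of five sentences per category that we use:

```python
# Illustrative templates; the real QA set has five sentences per category.
TEMPLATES = {
    "company": [
        "Hello, thank you for calling {keyphrase}, this is Mary, how can I help you?",
        "I saw that {keyphrase} opened a new office downtown.",
    ],
}

def build_test_set(keyphrases, tts):
    """Fill every template with every keyphrase and synthesize the audio.

    `tts(text, path)` is a stand-in for whichever Text-to-Speech engine
    writes the spoken sentence to an audio file at `path`.
    """
    test_set = []
    for entry in keyphrases:
        for i, template in enumerate(TEMPLATES.get(entry["category"], [])):
            sentence = template.format(keyphrase=entry["phrase"])
            path = f"qa_{entry['phrase'].replace(' ', '_')}_{i}.wav"
            tts(sentence, path)  # generate this week's QA audio file
            test_set.append((entry["phrase"], sentence, path))
    return test_set
```

Passing the TTS engine in as an argument also makes the step easy to unit-test with a dummy function that records calls instead of producing audio.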
STEP 3 — Model Evaluation
Now that we have our test set, we are ready to start evaluating the performance of the actual model. We take our audio files and have both the old and the new model transcribe them. We can then calculate the accuracy of the resulting transcriptions and compare the old model’s accuracy to that of the new model we want to release. If the new model is both more accurate than the old model and reaches an acceptable baseline accuracy, we can release it to our customers!
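The post doesn't name the exact accuracy metric, but a common choice for comparing ASR transcripts against the known template sentences is word error rate (WER). The implementation and the baseline threshold below are assumptions for illustration:

```python
def word_error_rate(reference, hypothesis):
    """Word-level edit distance, normalised by the reference length.

    Lower is better: 0.0 means a perfect transcript.
    """
    ref, hyp = reference.lower().split(), hypothesis.lower().split()
    # Standard dynamic-programming Levenshtein distance over words.
    d = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        d[i][0] = i
    for j in range(len(hyp) + 1):
        d[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,       # deletion
                          d[i][j - 1] + 1,       # insertion
                          d[i - 1][j - 1] + cost)  # substitution
    return d[len(ref)][len(hyp)] / len(ref)

def should_release(old_wer, new_wer, baseline_wer=0.15):
    """Release only if the new model beats the old one AND meets the baseline.

    The 0.15 baseline is a hypothetical threshold, not Dialpad's real bar.
    """
    return new_wer <= old_wer and new_wer <= baseline_wer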
On the other hand, if the QA metrics evaluation shows that the old model performs better than the new model, we first re-run the QA metrics with the same test set and review the results again. If the old model is still performing better, we take other factors into consideration to decide whether or not the new model is fit to be released. If the new model:
- has a new LM/AM whose individual performance is better than the old LM/AM, or
- is trained on additional customer-requested words compared to the old model, and
- works as expected with all other functionality when integrated,
…then we will continue with the release of the new model. Luckily, failing this part of QA twice is exceedingly rare and has only happened a handful of times in the past two years.
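The fallback checklist above amounts to a simple boolean gate. A sketch of it, with illustrative field names standing in for the outcomes of those three reviews:

```python
def release_despite_regression(checks):
    """Gate used only after the new model has scored worse on the QA set twice.

    `checks` holds the results of the three reviews described above;
    the key names are illustrative, not an actual Dialpad schema.
    """
    return (
        checks["new_components_beat_old_individually"]  # new LM/AM better on their own
        and checks["trained_on_new_customer_words"]
        and checks["integration_works_as_expected"]
    )
```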
The diagram below shows the entire process up to this point:
STEP 4 — Smoke Testing
This is the last step we perform in our QA process before we deploy the models. Full disclosure, this step has not yet been automated and still requires manual effort. This is an ad-hoc step where we deploy the models to the staging environment and perform a test call to mimic customer-like conversations. In this step, we look at how the ASR model performs in the product, checking the transcription it produces as well as how it works with other Voice Intelligence features such as moments and post-call summaries.
We are currently investigating options to automate this part of our testing too!
Future updates to the QA automation process
As we expand ASR to different dialects and other languages, we will have multiple models to deploy at a time. In order to accommodate the multiple ASR models in the automated QA process, we plan to:
- Expand the range of sentence templates to include different dialects and languages.
- Update our Text-to-Speech technique for other languages and dialects.
- Generalize the QA metrics so that the same metrics could be used for multiple languages and dialects.
- Add in automated integration and smoke testing.