Colin Beckingham (colbec@start.ca)
Created Thu Jan 21 08:41:43 EST 2010
Purpose of this experiment: I have three microphones that I can use with my computer to give instructions to a speech recognition engine. At any given moment, one microphone may have advantages over the others, such as higher quality, portability or availability. I know I can create a separate model for each of them and then select the model that was created with the mike I intend to use. On the other hand, I can also train a single large model using all three mikes. The question is: how is recognition accuracy affected by mixing audio from different sources into a single model?
Proposed method:
1. Select mikes for the test
2. Use all mikes with the same computer
3. Set up the three individual models
4. Using the prompts, codetrain and wav/sample* data created in (3), combine the data from each of the basic models into additional artificial models using
   a. Two at a time (3 models)
   b. All three (1 model)
5. The result is 7 different models
6. Subject the 7 different models to the same testing procedure with each of the relevant mikes and record the results
7. Draw conclusions
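The count of seven models follows from taking every non-empty combination of the three mikes: 3 singles, 3 pairs and 1 triple. A minimal sketch of that enumeration is shown here; the labels I, II and III are the ones used in the legend of the results table, and the function name is only illustrative.

```python
from itertools import combinations

# Labels for the three microphones, as used in the results legend.
MIKES = ["I", "II", "III"]


def model_combinations(mikes):
    """Return every non-empty combination of mikes, smallest first."""
    combos = []
    for size in range(1, len(mikes) + 1):
        combos.extend(combinations(mikes, size))
    return combos


models = model_combinations(MIKES)
for number, combo in enumerate(models, start=1):
    # Each line names one model by the mikes whose audio it was trained on.
    print(number, " + ".join(combo))
```

The loop prints one line per model, singles first, then pairs, then the model trained on all three mikes.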
Proposed equipment:
Computer: IBM Netvista; CPU: Intel(R) Pentium(R) 4 CPU 1.80GHz (speed 1,794.19 MHz, temperature 21 °C); memory: total RAM 1,003.3 MiB, free memory 37.0 MiB (+ 670.6 MiB caches), free swap 2.0 GiB
Mikes: Sennheiser PC131, Logitech Clearchat Pro USB Wireless, Jabra BT2040 Bluetooth with Belkin dongle
Software: Linux openSUSE 11.2, VoxForge utilities, HTK toolkit, Julius, supplementary PHP and bash scripting
Proposed grammar (identical for all 3 hardware setups): this is a real-world application in which a dialog manager is expected to identify one of 10 buffers and, once the buffer has been selected, to start filling it with single-word content. An example transaction would be:
BUFFER THREE DELTA SIX SIERRA ONE TANGO
with the result that at the end the buffer 3 contains D6S1T. So the grammar contains:
Each of the possible prompts will be trained 4 (four) times, making 46*4 = 184 prompts all told.
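To make the example transaction above concrete, here is a rough sketch of how a dialog manager might turn the recognised words into buffer content. It is not the author's implementation: the vocabulary (digit words ZERO to NINE plus the 26 spelling-alphabet words) and the function `apply_transaction` are assumptions made for illustration, since the full grammar is not listed here.

```python
# Assumed vocabulary: digit words and spelling-alphabet words (hypothetical,
# the source does not list the complete grammar).
DIGITS = {
    "ZERO": "0", "ONE": "1", "TWO": "2", "THREE": "3", "FOUR": "4",
    "FIVE": "5", "SIX": "6", "SEVEN": "7", "EIGHT": "8", "NINE": "9",
}
LETTER_WORDS = [
    "ALPHA", "BRAVO", "CHARLIE", "DELTA", "ECHO", "FOXTROT", "GOLF",
    "HOTEL", "INDIA", "JULIET", "KILO", "LIMA", "MIKE", "NOVEMBER",
    "OSCAR", "PAPA", "QUEBEC", "ROMEO", "SIERRA", "TANGO", "UNIFORM",
    "VICTOR", "WHISKEY", "X-RAY", "YANKEE", "ZULU",
]
# Each letter word contributes its first character to the buffer.
LETTERS = {word: word[0] for word in LETTER_WORDS}


def apply_transaction(words, buffers):
    """Append the content of one 'BUFFER <digit> <word>...' utterance."""
    tokens = words.split()
    if len(tokens) < 2 or tokens[0] != "BUFFER" or tokens[1] not in DIGITS:
        raise ValueError("expected BUFFER followed by a digit word")
    index = int(DIGITS[tokens[1]])
    content = []
    for token in tokens[2:]:
        if token in DIGITS:
            content.append(DIGITS[token])
        elif token in LETTERS:
            content.append(LETTERS[token])
        else:
            raise ValueError("word outside the assumed vocabulary: " + token)
    buffers[index] += "".join(content)
    return index


buffers = [""] * 10  # ten buffers, numbered 0 to 9
apply_transaction("BUFFER THREE DELTA SIX SIERRA ONE TANGO", buffers)
print(buffers[3])  # D6S1T
```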
Edit 1:
To forestall any errors from HTK due to the limited grammar, all grammars will be bulked up with identical test words which, if taken alone, would ensure that a good number of phonemes are represented. These words are outside the grammar.
Observations:
Testing procedure: Select 100 random prompts from the prompts file. For each one, speak the prompt to the speech recognition engine (SRE), capture the result and compare it with the prompt. On a failure, record the model being used (1-7), the prompt and what the SRE heard. Tally the results and compare.
Tests 1 and 2 are 24 hours apart.
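The testing procedure can be sketched roughly as follows. This is an illustration under stated assumptions, not the scripts actually used: `recognise` is a hypothetical stand-in for speaking a prompt and reading back what the SRE heard, and prompts are drawn with replacement, which is assumed because the prompts file holds fewer than 100 distinct prompts.

```python
import random
from collections import Counter


def run_test(prompts, recognise, model_id, n=100, seed=None):
    """Draw n random prompts and record every misrecognition.

    recognise is a hypothetical callable that returns what the SRE heard.
    """
    rng = random.Random(seed)
    failures = []
    for _ in range(n):
        prompt = rng.choice(prompts)  # sampling with replacement (assumed)
        heard = recognise(prompt)
        if heard != prompt:
            # Record the model, the prompt and what the SRE heard.
            failures.append((model_id, prompt, heard))
    return failures


def tally(failures):
    """Count how often each prompt was misrecognised."""
    return Counter(prompt for _, prompt, _ in failures)


def fake_recognise(prompt):
    """Toy recogniser for demonstration: always mishears NINE as MIKE."""
    return "MIKE" if prompt == "NINE" else prompt


failures = run_test(["NINE", "BRAVO", "TWO"], fake_recognise, model_id=2, seed=0)
print(len(failures), tally(failures))
```

In a real run, `recognise` would be replaced by whatever captures the engine's output for a spoken prompt, and the failure list would be kept per model and per mike.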
| Test Setup | Test #1 errors (n=100) | Problematic words | Reason for fail | Test #2 errors (n=100) | Problematic words | Reason for fail | Comments |
|---|---|---|---|---|---|---|---|
| I | 6 | ONE ZERO VICTOR KILO* VICTOR FOUR | UNIFORM INDIA DELTA SIERRA DELTA GOLF | 8 | NOVEMBER VICTOR INDIA KILO ZERO KILO NINE NOVEMBER | UNIFORM DELTA JULIET SIERRA INDIA SIERRA MIKE UNIFORM | |
| II | 2 | NINE NINE | MIKE MIKE | 3 | BRAVO TWO EIGHT | ALPHA HOTEL SIX | |
| III | 1 | LIMA* | UNIFORM | 1 | THREE* | X-RAY | |
| I + II, using I | 10 | GOLF FIVE UNIFORM NINE MIKE CHARLIE GOLF NOVEMBER NINE ZERO | ALPHA BUFFER FIVE INDIA YANKEE QUEBEC QUEBEC ALPHA INDIA YANKEE DELTA | 0 | | | |
| I + II, using II | 0 | | | 0 | | | |
| I + III, using I | 0 | | | 0 | | | |
| I + III, using III | 0 | | | 0 | | | |
| II + III, using II | 0 | | | 1 | BUFFER ZERO* | VICTOR | |
| II + III, using III | 0 | | | 0 | | | |
| I + II + III, using I | 1 | BUFFER ZERO | QUEBEC | 0 | | | |
| I + II + III, using II | 0 | | | 0 | | | |
| I + II + III, using III | 0 | | | 0 | | | |
Legend: I - Jabra BT2040; II - Sennheiser PC131; III - Logitech USB Wireless
* denotes a first test in a series; Julius is more liable to make errors on the first thing heard.