Are monolithic models useful?

Colin Beckingham (colbec@start.ca)

Created Thu Jan 21 08:41:43 EST 2010

Purpose of this experiment: I have three microphones that I can use with my computer to give instructions to a speech recognition engine. At any given moment, one microphone may have advantages over the others, such as higher quality, portability or availability. I know I can create a separate model for each of them and then select the model that was created with the mike I intend to use. On the other hand, I can also train a single large model using all three mikes. The question is: when audio from different sources is mixed into a single model, how is recognition accuracy affected?

Proposed method:

  1. Select mikes for the test

  2. Use all mikes with the same computer

  3. Set up the three individual models

  4. Using the prompts, codetrain and wav/sample* data created in (3), combine the data from the basic models into additional artificial models, taking the mikes:

    1. Two at a time (3)

    2. All three (1)

  5. The result is 7 different models: 3 individual, 3 pairs and 1 built from all three

  6. Subject the 7 different models to the same testing procedure with each of the relevant mikes and record the results

  7. Draw conclusions
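The count of 7 models in step 5 is simply every non-empty combination of the three mikes. A minimal sketch of that enumeration follows; the labels I, II and III are borrowed from the legend under the results, and the variable names are illustrative only, not part of the VoxForge tooling.

```python
from itertools import combinations

# Mike labels as used in the legend; names here are illustrative.
mikes = ["I", "II", "III"]

# Every non-empty subset of the mikes corresponds to one model:
# 3 singles, 3 pairs and 1 triple.
models = [
    combo
    for size in range(1, len(mikes) + 1)
    for combo in combinations(mikes, size)
]

for combo in models:
    print(" + ".join(combo))

print(len(models), "models in total")
```

The same list can drive a script that merges the prompts and audio of the chosen mikes before training each combined model.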

Proposed equipment:

  1. Computer: IBM Netvista. CPU: Intel(R) Pentium(R) 4 CPU 1.80GHz (speed 1,794.19 MHz, temperature 21 °C). Memory: 1,003.3 MiB total RAM, 37.0 MiB free (+ 670.6 MiB caches), 2.0 GiB free swap

  2. Mikes: Sennheiser PC131, Logitech Clearchat Pro USB Wireless, Jabra BT2040 Bluetooth with Belkin dongle

  3. Software: openSUSE Linux 11.2, VoxForge utilities, HTK, Julius, supplementary PHP and bash scripting

Proposed grammar (identical for all 3 hardware setups): this is a real-world application, where a dialog manager is expected to identify one of 10 buffers and, once the buffer has been selected, to start filling it with single-word content. An example transaction would be:

BUFFER THREE
DELTA
SIX
SIERRA
ONE
TANGO

with the result that at the end buffer 3 contains D6S1T. So the grammar contains:

  1. The numbers from ZERO to NINE (10)
  2. The NATO alphabet from ALPHA to ZULU (26)
  3. BUFFER ( ZERO | ... | NINE ) (10)

Each of the possible prompts will be trained 4 (four) times, making 46*4 = 184 prompts all told.
Edit 1: to forestall any errors from HTK due to the limited grammar, all grammars will be bulked up with identical test words which, if taken alone, would ensure that a good number of phonemes are represented. These words are outside the grammar.
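As a check on the arithmetic, the in-grammar prompt set can be generated and counted. This is only a sketch: the spellings of the NATO words that do not appear elsewhere on this page (ECHO, FOXTROT, OSCAR, PAPA, ROMEO, WHISKEY) are assumed, and the extra out-of-grammar words from Edit 1 are not included.

```python
# Digits ZERO to NINE (10 prompts).
digits = ["ZERO", "ONE", "TWO", "THREE", "FOUR",
          "FIVE", "SIX", "SEVEN", "EIGHT", "NINE"]

# NATO alphabet ALPHA to ZULU (26 prompts); some spellings are assumed.
nato = ["ALPHA", "BRAVO", "CHARLIE", "DELTA", "ECHO", "FOXTROT", "GOLF",
        "HOTEL", "INDIA", "JULIET", "KILO", "LIMA", "MIKE", "NOVEMBER",
        "OSCAR", "PAPA", "QUEBEC", "ROMEO", "SIERRA", "TANGO", "UNIFORM",
        "VICTOR", "WHISKEY", "X-RAY", "YANKEE", "ZULU"]

# BUFFER followed by a digit (10 prompts).
buffers = [f"BUFFER {d}" for d in digits]

prompts = digits + nato + buffers

# Each prompt is recorded four times for training.
REPEATS = 4
training_prompts = [p for p in prompts for _ in range(REPEATS)]

print(len(prompts), "distinct prompts")
print(len(training_prompts), "training recordings per mike")
```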

Observations:

  1. The grammar has not been designed to be accuracy-friendly. In fact the liability to error is useful, since results that show no errors anywhere would be either perfect or meaningless. Hence the use of many short, one-word samples.

Testing procedure: Select 100 random prompts from the prompts file. For each one, speak the prompt to the speech recognition engine (SRE), capture what it recognised and compare the two. On a failure, record the model being used (1-7), the prompt and what the SRE heard. Tally the results and compare.
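The procedure can be sketched as a small loop. Everything named here is hypothetical: `run_test` and `toy_recogniser` are illustrative stand-ins, the sampling is assumed to be with replacement, and in the real experiment the "recogniser" is a person speaking each prompt into the mike with Julius returning the text.

```python
import random
from collections import Counter

def run_test(prompts, recognise, model_id, n=100, seed=None):
    """Draw n random prompts and record (model, prompt, heard) on each failure."""
    rng = random.Random(seed)
    failures = []
    # Assumption: prompts are drawn with replacement.
    for prompt in rng.choices(prompts, k=n):
        heard = recognise(prompt)
        if heard != prompt:
            failures.append((model_id, prompt, heard))
    return failures

# Toy stand-in for the SRE: it mishears one word and gets the rest right.
def toy_recogniser(prompt):
    return "MIKE" if prompt == "NINE" else prompt

prompts = ["ONE", "NINE", "MIKE", "BUFFER ZERO"]
failures = run_test(prompts, toy_recogniser, model_id=2, n=100, seed=1)

# Tally failures by (prompt, heard) pair for comparison across models.
tally = Counter((prompt, heard) for _, prompt, heard in failures)
print(len(failures), "errors out of 100")
print(tally)
```

Running the same function once per model and mike gives the per-setup error counts that the results compare.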

Results

Tests 1 and 2 are 24 hours apart.

Test #1 (n=100)

Model         Mike  Errors  Problematic words  Reason for fail
I             I     6       ONE                UNIFORM
                            ZERO               INDIA
                            VICTOR             DELTA
                            KILO*              SIERRA
                            VICTOR             DELTA
                            FOUR               GOLF
II            II    2       NINE               MIKE
                            NINE               MIKE
III           III   1       LIMA*              UNIFORM
I + II        I     10      GOLF               ALPHA
                            FIVE               BUFFER FIVE
                            UNIFORM            INDIA
                            NINE               YANKEE
                            MIKE               QUEBEC
                            CHARLIE            QUEBEC
                            GOLF               ALPHA
                            NOVEMBER           INDIA
                            NINE               YANKEE
                            ZERO               DELTA
I + II        II    0
I + III       I     0
I + III       III   0
II + III      II    0
II + III      III   0
I + II + III  I     1       BUFFER ZERO        QUEBEC
I + II + III  II    0
I + II + III  III   0

Test #2 (n=100)

Model         Mike  Errors  Problematic words  Reason for fail
I             I     8       NOVEMBER           UNIFORM
                            VICTOR             DELTA
                            INDIA              JULIET
                            KILO               SIERRA
                            ZERO               INDIA
                            KILO               SIERRA
                            NINE               MIKE
                            NOVEMBER           UNIFORM
II            II    3       BRAVO              ALPHA
                            TWO                HOTEL
                            EIGHT              SIX
III           III   1       THREE*             X-RAY
I + II        I     0
I + II        II    0
I + III       I     0
I + III       III   0
II + III      II    1       BUFFER ZERO*       VICTOR
II + III      III   0
I + II + III  I     0
I + II + III  II    0
I + II + III  III   0


Legend: I - Jabra BT2040; II - Sennheiser PC131; III - Logitech USB Wireless
* denotes the first test in a series; Julius is more liable to make errors on the first thing heard.