#StackBounty: #python #random-forest #natural-language How to train a model to recognize the pronunciation of a syllabe

Bounty: 50

I would like to train a model able to predict the syllabe pronunciation of french words.

So, I created a set of syllabized word and for each syllabe I’ve its pronunciation code. Example: The word parkinson is syllabised like this : par-kin-son and its pronunciation code is paR-kin-sOn. I choosed a english word for clarity but in reality I’ve a training set of french word.

First, I’ve created a table of symbols for syllabes:

symbols_input = [' ', 'a', 'b', 'c', 'd', 'e', 'f', 'g', 'h', 'i', 'j', 'k',
                 'l', 'm', 'n', 'o', 'p', 'q', 'r', 's', 't', 'u', 'v', 'w',
                 'x', 'y', 'z', 'à', 'â', 'ç', 'è', 'é', 'ê', 'î', 'ï', 'ô',
                 'û', 'ü']

And for pronunciation codes:

symbols_output = [' ', '1', '2', '5', '8', '9', '@', 'E', 'G', 'N', 'O', 'R',
                  'S', 'Z', 'a', 'b', 'd', 'e', 'f', 'g', 'i', 'j', 'k', 'l',
                  'm', 'n', 'o', 'p', 's', 't', 'u', 'v', 'w', 'x', 'y', 'z',
                  '§', '°']

What I do after that is convert my syllabes and pronunciation code into vectors by using the following function:

def convert_syllabe_2_vector(syllabe, table):
    """Helper function to convert a syllabe into a vector of integer values."""
    return [table[letter] for letter in syllabe]


>>> convert_syllabe_2_vector('par', symbol_input)
16 1 18
>>> convert_syllabe_2_vector('par', symbol_output)
27 14 11

Since the longest syllabe in my set is 8 char, and the longest pronunciation code is 6, the next thing I do is padding my vector to zero like this:
16 01 18 00 00 00 00 00 and 27 14 11 00 00 00

Because in french, previous syllabe and next syllabe can have influences on the pronunciation I form a bigger of 8 * 3 dimensions. So, for par syllabe, it will give me:

# empty                 par                    -kin
00 00 00 00 00 00 00 00 16 01 18 00 00 00 00 00 11 09 14 00 00 00 00 00 

and the correct prediction paR:

27 14 11 00 00 00

Do you know a better approach ? Is it right to say it’s a classification problem and not a regression one ? What kind of model could fit properly ? I tried a Random Forest Classifier it seems to work but I suspect there is many overfitting and training uses a lot of RAM. I also tried to normalize my vector by dividing them by 38, the number of symbols of each table and then multiplying the predicted output by 38 again but the results I obtained were terribles.

Get this bounty!!!

Leave a Reply

This site uses Akismet to reduce spam. Learn how your comment data is processed.