I would like to train a model able to predict the syllabe pronunciation of french words.
So, I created a set of syllabized word and for each syllabe I’ve its pronunciation code. Example: The word
parkinson is syllabised like this :
par-kin-son and its pronunciation code is
paR-kin-sOn. I choosed a english word for clarity but in reality I’ve a training set of french word.
First, I’ve created a table of symbols for syllabes:
symbols_input = [' ', 'a', 'b', 'c', 'd', 'e', 'f', 'g', 'h', 'i', 'j', 'k',
'l', 'm', 'n', 'o', 'p', 'q', 'r', 's', 't', 'u', 'v', 'w',
'x', 'y', 'z', 'à', 'â', 'ç', 'è', 'é', 'ê', 'î', 'ï', 'ô',
And for pronunciation codes:
symbols_output = [' ', '1', '2', '5', '8', '9', '@', 'E', 'G', 'N', 'O', 'R',
'S', 'Z', 'a', 'b', 'd', 'e', 'f', 'g', 'i', 'j', 'k', 'l',
'm', 'n', 'o', 'p', 's', 't', 'u', 'v', 'w', 'x', 'y', 'z',
What I do after that is convert my syllabes and pronunciation code into vectors by using the following function:
def convert_syllabe_2_vector(syllabe, table):
"""Helper function to convert a syllabe into a vector of integer values."""
return [table[letter] for letter in syllabe]
>>> convert_syllabe_2_vector('par', symbol_input)
16 1 18
>>> convert_syllabe_2_vector('par', symbol_output)
27 14 11
Since the longest syllabe in my set is 8 char, and the longest pronunciation code is 6, the next thing I do is padding my vector to zero like this:
16 01 18 00 00 00 00 00 and
27 14 11 00 00 00
Because in french, previous syllabe and next syllabe can have influences on the pronunciation I form a bigger of 8 * 3 dimensions. So, for
par syllabe, it will give me:
# empty par -kin
00 00 00 00 00 00 00 00 16 01 18 00 00 00 00 00 11 09 14 00 00 00 00 00
and the correct prediction
27 14 11 00 00 00
Do you know a better approach ? Is it right to say it’s a classification problem and not a regression one ? What kind of model could fit properly ? I tried a Random Forest Classifier it seems to work but I suspect there is many overfitting and training uses a lot of RAM. I also tried to normalize my vector by dividing them by 38, the number of symbols of each table and then multiplying the predicted output by 38 again but the results I obtained were terribles.
Get this bounty!!!