Implementations in Python and PHP
MLphone
MLphone is a phonetic algorithm for indexing Malayalam words by their pronunciation, like Metaphone for English. The algorithm generates three Romanized phonetic keys (hashes) of varying phonetic affinities for a given Malayalam word.
The algorithm takes into account the context sensitivity of sounds, syntactic and phonetic gemination, compounding, modifiers, and other known exceptions to produce Romanized phonetic hashes of increasing phonetic affinity that are very faithful to the pronunciation of the original Malayalam word.
- key0 = a broad phonetic hash comparable to a Metaphone key that doesn't account for hard sounds (ഷ, ണ..) or phonetic modifiers
- key1 = a slightly more inclusive hash that accounts for hard sounds
- key2 = a highly inclusive, narrow hash that accounts for both hard sounds and phonetic modifiers
MLphone was created to aid spelling-tolerant Malayalam word search, but it may also be useful for tasks such as spell checking and word suggestion.
Examples
Word | key0 | key1 | key2 | Transliteration | Metaphone |
നീലക്കുയില് | NLKYL | NLKYL | N4LK25Y4L | Neelakkuyil | NLKYL |
മൃഗം | MRK3 | MRK3 | MRK3 | Mrugam | MRKM |
മ്രിഗം | MRK3 | MRK3 | MRK3 | Mrigam | MRKM |
ഉത്സവം | U0SV3 | U0SV3 | U0SV3 | Uthsavam | U0SFM |
ഉല്സവം | U0SV3 | U0SV3 | U0SV3 | Ulsavam | ULSFM |
വാഹനം | VHN3 | VHN3 | VHN3 | Vaahanam | FHNM |
വിഹനനം | VHNN3 | VHNN3 | V4HNN3 | Vihananam | FHNNM |
രാഷ്ട്രീയം | RSTRY3 | RS1TRY3 | RS1TR4Y3 | Raashtreeyam | RXTRYM |
കണ്ണകി | KNK | KNK | KN2K4 | Kannaki | KNK |
കന്യക | KNYK | KNYK | KNYK | Kanyaka | KNYK |
മനം | MN3 | MN3 | MN3 | Manam | MNM |
മണം | MN3 | MN13 | MN13 | Manam | MNM |
വിഭക്ത്യാഭാസം | VBK0YBS3 | VBK0YBS3 | V4BK0YBS3 | Vibhakthyaabhaasam | FBHK0YBHSM |
വലയം | VLY3 | VLY3 | VLY3 | Valayam | FLYM |
വളയം | VLY3 | VL1Y3 | VL1Y3 | Valayam | FLYM |
രഥം | R03 | R03 | R03 | Ratham | R0M |
രദം | R03 | R03 | R03 | Radam | RTM |
രത്തം | R03 | R03 | R03 | Rattham | RTM |
രധം | R03 | R03 | R03 | Radham | RTHM |
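As a rough illustration of how the three keys could support spelling-tolerant search, the sketch below groups a few words by key. It does not call the MLphone library: the keys are copied by hand from the table above, and `ROWS` and `build_index` are hypothetical names introduced only for this example.

```python
from collections import defaultdict

# (word, key0, key1, key2) rows copied from the examples table above.
ROWS = [
    ("മനം", "MN3", "MN3", "MN3"),
    ("മണം", "MN3", "MN13", "MN13"),
    ("വലയം", "VLY3", "VLY3", "VLY3"),
    ("വളയം", "VLY3", "VL1Y3", "VL1Y3"),
]


def build_index(rows, level):
    """Group words by the key at the given level (0, 1 or 2)."""
    index = defaultdict(list)
    for row in rows:
        # row[0] is the word; row[1], row[2], row[3] are key0, key1, key2.
        index[row[level + 1]].append(row[0])
    return dict(index)


loose = build_index(ROWS, 0)   # key0: soft and hard sounds collide
strict = build_index(ROWS, 1)  # key1: hard sounds are kept apart

print(loose["MN3"])   # both മനം and മണം
print(strict["MN3"])  # only മനം
```

Looking up a query by key0 returns the broadest set of candidates, while key1 and key2 narrow the match, so a search could fall back from key2 to key0 when the stricter key finds nothing.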
Usage
The algorithm is available in Python 3 and PHP.
Python
from mlphone import MLphone
converter = MLphone()
keys = converter.compute(ml_str)
PHP
<?php
require 'mlphone.php';
$keys = MLphone::compute($ml_str);
?>
Background
Refer to the Substitution map.
- 0 denotes the ത group (borrowing from Metaphone where 0 = th)
- 1 denotes hard sounds (ണ, ള, റ ... as opposed to ന, ല, ര)
- 2 denotes doubled sounds (ക്ക, ച്ച ...)
- 3-9 denote modifiers (ം, ി, ു ...), with the exception of ൃ, which maps to R, and ാ, ഃ, ്, which are discarded
Consider the word നീലക്കുയില്:
1. Discard all non-Malayalam characters
2. Group modified entities
  2.1 Group compounds from the compounds table along with their modifiers
      നീല{ക്കു}യില്
  2.2 Group non-compounds along with their modifiers
      {നീ}ല{ക്കു}{യി}ല്
3. Group unmodified entities
  3.1 Group compounds from the compounds table
      {നീ}{ല}{ക്കു}{യി}{ല്}
  3.2 Group non-compounds
      {നീ}{ല}{ക്കു}{യി}{ല്}
4. Substitute individual modified and unmodified entity groups with the corresponding keys
      {Nീ}{L}{K2ു}{Yി}{L്}
5. Substitute the modifiers in the groups with numeric modifier keys to get key2
      N4LK25Y4L
6. Remove numeric modifiers 2 and 4-9 from key2 to obtain key1
      NLKYL
7. Remove numeric modifiers 1, 2, and 4-9 from key2 to obtain key0
      NLKYL
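Steps 6 and 7 amount to stripping digits from key2. A minimal sketch of just those two steps, assuming key2 has already been computed, might look like this. The function name `derive_keys` is hypothetical and not part of the MLphone API.

```python
import re


def derive_keys(key2):
    """Return (key0, key1) derived from key2 by stripping numeric markers."""
    # Step 6: drop 2 (doubling) and 4-9 (modifiers); keep 0, 1 and 3.
    key1 = re.sub(r"[24-9]", "", key2)
    # Step 7: additionally drop 1 (hard sounds).
    key0 = re.sub(r"[124-9]", "", key2)
    return key0, key1


print(derive_keys("N4LK25Y4L"))  # key2 of the worked example above
print(derive_keys("RS1TR4Y3"))   # key2 taken from the examples table
```

Note that 0 (the ത group) and 3 (ം) survive in every key, which is why MRK3 and R03 appear unchanged across all three columns of the examples table.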
Substitution map
Vowels
അA | ആA | ഇI | ഈI | ഉU | ഊU | ഋR | എE | ഏE | ഐAI | ഒO | ഓO | ഔO |
Consonants
കK | ഖK | ഗK | ഘK | ങNG | ചC | ഛC | ജJ | ഝJ | ഞNJ | ടT | ഠT | ഡT |
ഢT | ണN1 | ത0 | ഥ0 | ദ0 | ധ0 | നN | പP | ഫF | ബB | ഭB | മM | യY |
രR | ലL | വV | ശS1 | ഷS1 | സS | ഹH | ളL1 | ഴZ | റR1 |
Chillus
ൽL | ൾL1 | ൺN1 | ൻN | ർR1 | ൿK |
Compounds
ക്കK2 | ഗ്ഗK | ങ്ങNG | ച്ചC2 | ജ്ജJ | ഞ്ഞNJ | ട്ടT2 | ണ്ണN2 | ത്ത0 | ദ്ദD | ദ്ധD | ന്നNN | ന്തN0 | ങ്കNK | ണ്ടN1T | ബ്ബB |
പ്പP2 | മ്മM2 | യ്യY | ല്ലL2 | വ്വV | ശ്ശS1 | സ്സS | ള്ളL12 | ഞ്ചNC | ക്ഷKS1 | മ്പMP | റ്റT | ന്റNT | ന്തN0 | ്രിR | ്രുR |
Modifiers
ാ | ഃ | ് | ൃR | ം3 | ി4 | ീ4 | ു5 | ൂ5 | െ6 | േ6 | ൈ7 | ൊ8 | ോ8 | ൌ9 | ൗ9 |
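To show how the tables combine, here is a toy sketch that reproduces key2 for the worked example നീലക്കുയില് using only the handful of entries that word needs. It is not the library's implementation: it uses plain ordered string replacement in place of the grouping steps described earlier, the three dictionaries are small excerpts of the maps above, and `toy_key2` is a hypothetical name.

```python
# Small excerpts of the tables above; the real maps are much larger.
COMPOUNDS = {"ക്ക": "K2"}
CONSONANTS = {"ന": "N", "ല": "L", "ക": "K", "യ": "Y"}
MODIFIERS = {"ീ": "4", "ു": "5", "ി": "4", "്": ""}  # ് is discarded


def toy_key2(word):
    """Apply compounds, then consonants, then modifiers, in that order."""
    # Compounds go first so that ക്ക is not consumed as two separate ക.
    for table in (COMPOUNDS, CONSONANTS, MODIFIERS):
        for source, key in table.items():
            word = word.replace(source, key)
    return word


print(toy_key2("നീലക്കുയില്"))  # N4LK25Y4L
```

The replacement order matters: if single consonants were substituted before compounds, ക്ക would become two K entries and the doubling marker 2 would be lost.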