
Aklanon Stemmer


AklStemmer is a library that finds the root form of Aklanon words. It works on inflected words, even those with mixed Aklanon-English terms or those not found in dictionaries. It removes affixes, reduces repeated syllables, and applies transformation rules to find possible root forms. These are filtered using a list of valid words and conditions. The best root is then chosen based on how much was changed during the process.

Why I built it

AklStemmer was originally developed as a supplementary tool for the spellchecker used in the Aklish project. A dictionary-based spellchecker tends to flag inflected words as errors because word lists usually contain only root forms, so the stemmer reduces each word to its root before lookup. Today, it serves as a reusable tool for stemming any inflected Aklanon word.

How it turned out

AklStemmer can be used as a Python library by importing it into a Python script. Check the repository for complete instructions.

>>> from aklstemmer import stemmer
>>> stem = stemmer.get_stem("nagsueat")  # "nagsueat" = "wrote"
>>> stem.word  # Stemmed form: "sueat" = "write"
'sueat'
>>> stem.pre  # Prefix: "nag" = past tense marker
'nag'
>>> print(stem.suf)  # No suffix found
None

How I built it

AklStemmer performs stemming by applying a modular sequence of affix-removal and transformation rules to input words. Here’s how it works under the hood:

Tokenization

Since the input text can contain multiple words, it is first tokenized using NLTK, with optional punctuation filtering. Each token is then processed individually.
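The tokenization step can be pictured with a minimal sketch. To keep the example dependency-free, it uses a regular expression as a stand-in for NLTK, so it is not the library's actual tokenizer; the function name `tokenize` and the `keep_punctuation` flag are hypothetical.

```python
import re

def tokenize(text, keep_punctuation=False):
    """Split text into word tokens (stand-in for the NLTK step)."""
    # A word may contain internal hyphens or apostrophes (e.g. "ano-ano");
    # any other non-space symbol becomes its own token.
    tokens = re.findall(r"\w+(?:[-'’]\w+)*|[^\w\s]", text)
    if keep_punctuation:
        return tokens
    # Optional punctuation filtering: keep tokens with a letter or digit.
    return [t for t in tokens if any(ch.isalnum() for ch in t)]

print(tokenize("nagsueat, sinueat; ano-ano!"))
```

Keeping hyphenated forms as one token matters here, because fully reduplicated words such as ano-ano need to reach the stemmer intact.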

Affix Removal

AklStemmer removes prefixes, infixes, and suffixes based on curated lists and applies these stemming rules:

  • Prefix removal: strips leading affixes (nag-, pa-, etc.), and also handles:
    • Phoneme change (d/r) (e.g. parayaw → dayaw).
    • Assimilation:
      • -ng (k/null) (e.g. pangablit → kablit)
      • -m (b/p) (e.g. pamahaw → bahaw, pamasyar → pasyar)
      • -n (d/s/t) (e.g. panumdum → dumdum, panigarilyo → sigarilyo, panahi → tahi)
  • Infix removal: detects and removes mid-word infixes like <in> (e.g. sinueat → sueat), and also handles:
    • Phoneme change (e/l) (e.g. linahog → eahog)
  • Suffix removal: strips trailing affixes (-an, -on, etc.) and also handles:
    • Contractions (ending with ng, g, ’t, ’y, t)
    • Phoneme change:
      • (n/null) (e.g. eot → eon)
      • (d/r) (e.g. bayari → bayad)
      • (d/l) (e.g. sugilanon → sugid)
      • (i/y) (e.g. agyan → agi)
      • (o/w) (e.g. tubwan → tubo)
      • (o/u) (e.g. tauhan → tao, inuman → inom)
      • (e/i) (e.g. kingkihan → kingke, paitin → paet)
    • Vowel loss (e.g. buksa → bukas)
    • Metathesis (e.g. islan → ilis)
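
A few of the rules above can be sketched as a candidate generator. This is a simplified illustration, not AklStemmer's implementation: the affix lists, the `ROOTS` word list, and the function name `candidate_stems` are assumptions made for the example, and only prefix assimilation, the d/r and e/l changes, the <in> infix, and four suffix-final changes are covered.

```python
# Illustrative affix lists; the library's curated lists are larger.
PREFIXES = ("nag", "pa")
SUFFIXES = ("anon", "an", "on", "i")  # "anon" and "i" are assumed here
# Assimilating prefix -> sounds it may have absorbed ("" = none)
ASSIMILATION = {"pang": ("k", ""), "pam": ("b", "p"), "pan": ("d", "s", "t")}
# Stem-final sound changes to undo after a suffix is stripped
SUFFIX_CHANGES = {"r": "d", "l": "d", "y": "i", "w": "o"}
# Tiny stand-in for the list of valid words
ROOTS = {"sueat", "dayaw", "kablit", "bahaw", "pasyar", "dumdum",
         "sigarilyo", "tahi", "eahog", "bayad", "sugid", "agi", "tubo"}

def candidate_stems(word, roots=ROOTS):
    """Return the valid roots reachable by removing one affix."""
    found = set()
    # Assimilating prefixes: restore the absorbed consonant.
    for prefix, restored in ASSIMILATION.items():
        if word.startswith(prefix):
            rest = word[len(prefix):]
            found.update(c + rest for c in restored)
    # Plain prefixes, plus the d/r phoneme change.
    for prefix in PREFIXES:
        if word.startswith(prefix):
            rest = word[len(prefix):]
            found.add(rest)
            if rest.startswith("r"):
                found.add("d" + rest[1:])
    # Infix <in> after the first letter, plus the e/l phoneme change.
    if word[1:3] == "in":
        rest = word[0] + word[3:]
        found.add(rest)
        if rest.startswith("l"):
            found.add("e" + rest[1:])
    # Suffixes, plus stem-final phoneme changes.
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) > len(suffix):
            rest = word[:-len(suffix)]
            found.add(rest)
            if rest[-1] in SUFFIX_CHANGES:
                found.add(rest[:-1] + SUFFIX_CHANGES[rest[-1]])
    # Keep only candidates that are valid words.
    return sorted(found & roots)

for w in ("nagsueat", "pangablit", "linahog", "sugilanon"):
    print(w, "->", candidate_stems(w))
```

Generating every possible candidate first and filtering against the word list afterwards keeps each rule simple, at the cost of producing many invalid intermediate forms.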

Reduplication

AklStemmer also identifies and reduces reduplication, the repetition of all or part of a word, which can occur in Aklanon:

  • Partial reduplication: when part of the word repeats (e.g. aabot → abot, babakae → bakae)
  • Full reduplication: when the whole word repeats (e.g. ano-ano → ano), including contracted or altered forms (e.g. anu-ano → ano, ibat-iba → iba)
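
Both kinds of reduplication can be reduced with a short sketch like the one below. It is an illustration under simplifying assumptions rather than the library's code: the function names are hypothetical, only an initial vowel or an initial two-letter syllable is treated as a partial repeat, and altered forms are matched by ignoring the o/u difference and a trailing linker (ng, g, t).

```python
VOWELS = "aeiou"
LINKERS = ("ng", "g", "t")  # contraction endings that may follow the first half

def reduce_full(word):
    """ano-ano -> ano, anu-ano -> ano, ibat-iba -> iba."""
    if "-" not in word:
        return None
    first, second = word.split("-", 1)
    # Try the first half as-is and with a trailing linker removed.
    variants = {first} | {first[:-len(l)] for l in LINKERS if first.endswith(l)}
    # Compare with o/u treated as the same sound.
    target = second.replace("u", "o")
    if any(v.replace("u", "o") == target for v in variants):
        return second
    return None

def reduce_partial(word):
    """aabot -> abot, babakae -> bakae."""
    # Repeated initial vowel.
    if len(word) > 2 and word[0] in VOWELS and word[0] == word[1]:
        return word[1:]
    # Repeated initial two-letter syllable.
    if len(word) > 4 and word[:2] == word[2:4]:
        return word[2:]
    return None

def reduce_reduplication(word):
    """Return the reduced form, or the word unchanged if nothing repeats."""
    return reduce_full(word) or reduce_partial(word) or word

for w in ("aabot", "babakae", "ano-ano", "anu-ano", "ibat-iba"):
    print(w, "->", reduce_reduplication(w))
```

A pattern check like this can misfire on roots that happen to begin with a repeated syllable, which is one reason to validate the result against a word list before accepting it.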

Choosing the Stem

After applying the affix removal and transformations, AklStemmer can generate multiple candidate stems. For example, the word ‘bukot’ might produce ‘bukot’, ‘buko’, and ‘bukon’.

These candidates are ranked by a custom scoring system that favors fewer transformations, longer removed affixes or reduplicated segments, and more morphologically plausible stems. The top-ranked candidate becomes the final stem.
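
As a rough illustration of this kind of ranking, the sketch below scores each candidate with a weighted sum. The `Candidate` fields, the weights, and the values assigned to the 'bukot' candidates are all made up for the example; the library's actual scoring formula and its result for this word may differ.

```python
from dataclasses import dataclass

@dataclass
class Candidate:
    stem: str
    transformations: int  # phoneme changes and similar rewrites applied
    removed: int          # characters removed as affixes or reduplication
    plausible: bool       # e.g. the stem passes the valid-word filter

def score(c, w_plausible=5.0, w_removed=1.0, w_transform=1.5):
    """Higher is better: reward plausibility and removed material,
    penalize each transformation. Weights are hypothetical."""
    return (w_plausible * c.plausible
            + w_removed * c.removed
            - w_transform * c.transformations)

def pick_stem(candidates):
    """Return the top-ranked candidate."""
    return max(candidates, key=score)

# Illustrative values only, not the library's real analysis of "bukot".
candidates = [
    Candidate("bukot", transformations=0, removed=0, plausible=False),
    Candidate("buko", transformations=0, removed=1, plausible=False),
    Candidate("bukon", transformations=1, removed=1, plausible=True),
]
for c in sorted(candidates, key=score, reverse=True):
    print(c.stem, score(c))
```

A weighted sum makes the trade-off explicit: a candidate that needs an extra transformation can still win if it is the only one that looks like a real word.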