TagLID

TagLID is a Python library that labels each word in a Taglish (mixed Tagalog-English) text by language. It returns either a simple tag (tgl or eng) or detailed frequency information with a flag indicating how the word was identified. It is a rule-based, opinionated system that relies mostly on dictionary lookups. It skips numbers, names, and interjections, and includes logic for slang, abbreviations, contractions, inflected words (through stemming or lemmatizing), intrawords, and misspellings.

Why I built it

TagLID was originally developed as a data analysis tool for our SHS Practical Research 2 project to calculate the frequency of Tagalog and English words in Taglish texts.¹ Since then, it has evolved into a reusable, general-purpose tool for Taglish language detection.

How it turned out

Library mode

TagLID can be imported into a Python script and used as a library.

from taglid.lid import lang_identify, simplify

labeled_text = lang_identify("hello, mundo")
print(simplify(labeled_text))

Output:

[('hello', 'eng'), ('mundo', 'tgl')]
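
Since TagLID was built to count Tagalog and English words, the simplified output can be tallied directly. The following is a minimal sketch that assumes the list-of-tuples shape shown above; count_tags is a hypothetical helper written for this example, not part of TagLID.

```python
from collections import Counter


def count_tags(labeled_words):
    """Tally how many words carry each language tag."""
    return Counter(tag for _word, tag in labeled_words)


# Literal copy of the simplified output shown above.
labeled = [("hello", "eng"), ("mundo", "tgl")]

counts = count_tags(labeled)
total = sum(counts.values())

# Print each tag with its count and its share of all labeled words.
for tag, n in sorted(counts.items()):
    print(f"{tag}: {n} ({n / total:.0%})")
```

In a real script, the literal list would be replaced by the value returned from simplify.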

CLI mode

TagLID can also be run as a command-line tool.

python -m taglid.lid

Example input:

text: hello, mundo

Output:

word      eng    tgl  flag    correction
------  -----  -----  ------  ------------
hello       1      0  DICT
mundo       0      1  DICT

How I built it

The core algorithm uses dictionary lookups to identify the language of each word. The Tagalog word list was scraped from Pinoy Dictionary, and the English word list was parsed from the XML version of the GCIDE dictionary, using BeautifulSoup4. To assign frequencies to words, an additional word list from the Leipzig Corpora Collection was integrated.
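
As an illustration of the frequency step, here is a rough sketch of turning a word list into a lookup table. It assumes a tab-separated layout of id, word, and count per row, which may not match the files actually used; load_frequencies and the sample rows are made up for this example.

```python
import csv
import io


def load_frequencies(lines):
    """Build a lowercase word -> count mapping from tab-separated rows."""
    freq = {}
    for row in csv.reader(lines, delimiter="\t", quoting=csv.QUOTE_NONE):
        if len(row) < 3 or not row[2].isdigit():
            continue  # skip rows that do not look like id, word, count
        word = row[1].lower()
        # Merge case variants such as "At" and "at" into one entry.
        freq[word] = freq.get(word, 0) + int(row[2])
    return freq


# Made-up sample rows, not real corpus counts.
sample = io.StringIO("1\tthe\t500\n2\tat\t120\n3\tAt\t30\nbad line\n")
eng_freq = load_frequencies(sample)
print(eng_freq)
```

Lowercasing before counting is a design choice in this sketch, so that lookups do not depend on how a word was capitalized in the corpus.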

The algorithm checks whether a word appears in the Tagalog or English dictionary and labels its language accordingly. This initial approach is in line with the one employed by Herrera et al. (2022).

To resolve ambiguous cases where a word appears in both dictionaries, such as ‘at’ (an English preposition and also the Tagalog conjunction meaning ‘and’), the word's frequency in each language is compared and the language in which it is more frequent is chosen.

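def identify_word(word):
    # eng_dict / tgl_dict: sets of known words; eng_freq / tgl_freq: word -> corpus count.
    lang = None
    if word in eng_dict and word in tgl_dict:
        # Ambiguous word: pick the language in which it is more frequent.
        # Words missing from a frequency list count as 0 (an assumption);
        # a tie leaves lang as None.
        eng_count = eng_freq.get(word, 0)
        tgl_count = tgl_freq.get(word, 0)
        if eng_count > tgl_count:
            lang = "ENG"
        elif tgl_count > eng_count:
            lang = "TGL"
    elif word in eng_dict:
        lang = "ENG"
    elif word in tgl_dict:
        lang = "TGL"
    return lang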

Before the main function runs, supplementary functions exclude certain words from the count, since they are not specific to English or Tagalog. After it, other supplementary functions handle special cases that the word frequency lists cannot catch.
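
The exclusion step could look roughly like the sketch below. It is not TagLID's actual code: is_excluded, the regular expression, and the tiny word lists are all hypothetical and only show the idea of filtering tokens before any dictionary lookup.

```python
import re

# Hypothetical, deliberately tiny word lists for illustration only.
INTERJECTIONS = {"haha", "hehe", "hmm"}
KNOWN_NAMES = {"maria", "juan"}

# Matches plain numbers such as "2", "3.5", or "1,000".
NUMBER_RE = re.compile(r"^\d+([.,]\d+)*$")


def is_excluded(token):
    """Return True if the token should be left out of the language count."""
    lowered = token.lower()
    return (
        bool(NUMBER_RE.match(lowered))
        or lowered in INTERJECTIONS
        or lowered in KNOWN_NAMES
    )


tokens = ["Maria", "bought", "2", "mangga", "haha"]
# Only the remaining tokens would go on to the dictionary lookup.
to_label = [t for t in tokens if not is_excluded(t)]
print(to_label)
```

Running the filter first keeps numbers, names, and interjections from being counted toward either language.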

What I learned

This is one of the first coding projects I’ve written, so a lot of things were new. I learned how to package a Python library and a CLI tool, clean data better, scrape the web, and parse XML. I also realized that building something I actually wanted to reuse later made me care more about code structure and maintainability.

At first, it felt like labeling words as English or Tagalog should be simple—just check a dictionary, right? But once I ran into tricky edge cases like “at”, slang, misspellings, and inflections, I knew it wasn’t going to be that straightforward. Dictionary lookups helped, but I still had to write a bunch of rules to deal with the nuances of Taglish and how people actually talk, especially when grammar goes out the window.

¹ This was essential for our research, “Correlation Between the Level of Language Dominance and the Relative Frequency of Taglish Code-Switching of STEM Students” (Maagma & Salido, 2023), conducted for Practical Research 2, since we were dealing with a large dataset collected from respondents.