TagLID is a library that labels each word in a Taglish (mixed Tagalog-English) text by language. It returns either a simple tag (tgl or eng) or detailed frequency information with flags indicating how the word was identified. It is a rule-based, opinionated system that relies mostly on dictionary lookups. It also skips numbers, names, and interjections, and includes logic for slang, abbreviations, contractions, inflected words (via stemming or lemmatizing), intrawords, and misspellings.
Why I built it
TagLID was originally developed as a data analysis tool for our SHS Practical Research 2 project to calculate the frequency of Tagalog and English words in Taglish texts.[1] Since then, it has evolved into a reusable, general-purpose tool for Taglish language detection.
How it turned out
Library mode
Users can use TagLID as a library by importing it into their Python script.
from taglid.lid import lang_identify, simplify
labeled_text = lang_identify("hello, mundo")
print(simplify(labeled_text))
Output:
[('hello', 'eng'), ('mundo', 'tgl')]
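Since the tool was built to measure how much of a text is Tagalog versus English, the simplified output can be tallied directly. The sketch below assumes only the list-of-tuples shape shown above; `language_shares` is a hypothetical helper for illustration, not part of TagLID.

```python
from collections import Counter

def language_shares(tagged):
    """Return each tag's share of the tagged words (hypothetical helper)."""
    counts = Counter(tag for _word, tag in tagged)
    total = sum(counts.values())
    # Avoid dividing by zero on empty input.
    return {tag: n / total for tag, n in counts.items()} if total else {}

# Same shape as the simplify() output shown above.
tagged = [("hello", "eng"), ("mundo", "tgl")]
print(language_shares(tagged))  # {'eng': 0.5, 'tgl': 0.5}
```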
CLI mode
Users can also use TagLID as a CLI tool.
python -m taglid.lid
Example input:
text: hello, mundo
Output:
word      eng    tgl  flag    correction
------  -----  -----  ------  ------------
hello       1      0  DICT
mundo       0      1  DICT
How I built it
The core algorithm uses dictionary lookups to identify language. The Tagalog dictionary was web scraped from the Pinoy Dictionary, and the English dictionary was parsed from the XML version of the GCIDE English dictionary, using BeautifulSoup4. To assign frequencies to words, an additional word list from the Leipzig Corpora Collection was integrated.
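As a rough illustration of the frequency step, here is a minimal sketch of turning a word list into a lookup table. It assumes tab-separated rows of rank, word, and count; that layout is my assumption about the downloaded file, so check it against your copy. `load_frequencies` is a hypothetical name, not a TagLID function.

```python
import csv
import io

def load_frequencies(lines):
    """Build a {word: count} mapping from tab-separated rows of rank, word, count.

    The three-column layout is an assumption; adjust the indexes for your file.
    """
    freqs = {}
    for row in csv.reader(lines, delimiter="\t", quoting=csv.QUOTE_NONE):
        if len(row) < 3 or not row[2].isdigit():
            continue  # skip malformed or header rows
        word = row[1].lower()
        # Merge case variants ("At", "at") into one lowercase entry.
        freqs[word] = freqs.get(word, 0) + int(row[2])
    return freqs

# Toy rows standing in for a real file; the counts are invented.
sample = io.StringIO("1\tat\t500\n2\tAt\t20\n3\tmundo\t7\n")
print(load_frequencies(sample))  # {'at': 520, 'mundo': 7}
```

Lowercasing before counting keeps the table consistent with a lookup that normalizes words the same way.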
The base algorithm checks whether a word appears in the Tagalog and English dictionaries and labels its language accordingly. This is in line with the approach used by Herrera et al. (2022).
To resolve ambiguous cases where a word is present in both dictionaries, such as the English preposition ‘at’ and the Tagalog conjunction ‘at’ (and), the word's frequency in each language is compared and the language where it is more frequent wins.
function LangIdentify(word):
    lang = None
    if word in EngDict and word in TglDict:
        if EngFreq[word] > TglFreq[word]:
            lang = "ENG"
        else if TglFreq[word] > EngFreq[word]:
            lang = "TGL"
    else if word in EngDict:
        lang = "ENG"
    else if word in TglDict:
        lang = "TGL"
    return lang
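For readers who want to try the idea, here is a runnable Python sketch of the same lookup. It is not the TagLID source: `lookup_language` and the toy dictionaries are hypothetical, the frequency numbers are invented for the demo, and missing frequencies default to 0 instead of raising an error.

```python
def lookup_language(word, eng_dict, tgl_dict, eng_freq, tgl_freq):
    """Return "ENG", "TGL", or None for a single lowercase word."""
    in_eng, in_tgl = word in eng_dict, word in tgl_dict
    if in_eng and in_tgl:
        # Ambiguous: prefer the language where the word is more frequent.
        e, t = eng_freq.get(word, 0), tgl_freq.get(word, 0)
        if e > t:
            return "ENG"
        if t > e:
            return "TGL"
        return None  # tie: left unresolved, as in the pseudocode
    if in_eng:
        return "ENG"
    if in_tgl:
        return "TGL"
    return None

# Toy data; the frequencies are made up, not real corpus counts.
eng_dict, tgl_dict = {"hello", "at"}, {"mundo", "at"}
eng_freq, tgl_freq = {"at": 3}, {"at": 9}

for w in ["hello", "mundo", "at", "xyz"]:
    print(w, lookup_language(w, eng_dict, tgl_dict, eng_freq, tgl_freq))
```

With these toy numbers, "at" resolves to TGL only because the invented Tagalog count is higher; real results depend on the actual frequency lists.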
Before the main lookup runs, supplementary functions exclude certain words (such as numbers, names, and interjections) from the count, since they are not specific to English or Tagalog. After it, other supplementary functions handle special cases that the word frequency lists cannot catch.
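The ordering can be sketched roughly like this: skip checks run first, and only words that survive them reach the dictionary lookup. Everything here is hypothetical, including the `label` function, the "NUM" and "INTJ" flags, and the tiny interjection list; only the "DICT" flag appears in the CLI output above. Name detection is left out to keep the sketch short.

```python
import re

INTERJECTIONS = {"haha", "hmm", "uy"}    # toy list, hypothetical
NUMBER = re.compile(r"\d+([.,]\d+)*")    # e.g. 2023, 1,000, 3.5

def label(word, lookup):
    """Run skip checks first, then fall through to the dictionary lookup."""
    token = word.lower()
    if NUMBER.fullmatch(token):
        return (word, None, "NUM")       # excluded from the count
    if token in INTERJECTIONS:
        return (word, None, "INTJ")      # excluded from the count
    lang = lookup(token)
    # No flag when the lookup finds nothing; later rules would handle it.
    return (word, lang, "DICT" if lang else None)

# A dict's .get stands in for the real dictionary lookup.
toy_lookup = {"hello": "eng", "mundo": "tgl"}.get
for w in ["Hello", "2023", "haha", "mundo"]:
    print(label(w, toy_lookup))
```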
What I learned
This is one of the first coding projects I’ve written, so a lot of things were new. I learned how to package a Python library and a CLI tool, clean data better, scrape the web, and parse XML. I also realized that building something I actually wanted to reuse later made me care more about code structure and maintainability.
At first, it felt like labeling words as English or Tagalog should be simple—just check a dictionary, right? But once I ran into tricky edge cases like “at”, slang, misspellings, and inflections, I knew it wasn’t going to be that straightforward. Dictionary lookups helped, but I still had to write a bunch of rules to deal with the nuances of Taglish and how people actually talk, especially when grammar goes out the window.
Footnotes
1. This was essential for our research, "Correlation Between the Level of Language Dominance and the Relative Frequency of Taglish Code-Switching of STEM Students" (Maagma & Salido, 2023), conducted for Practical Research 2, since we were dealing with a large dataset collected from respondents.