TagLID
A word-level Language Identification (LID) tool for Tagalog-English (Taglish) text.
TagLID is an open-source Python library for word-level language identification of Tagalog-English (Taglish) text. It was first developed as a data analysis tool for our Practical Research 2 study (Maagma & Salido, 2023) at my senior high school. Its main purpose is to label each word in a text as either Tagalog or English, so that the frequency of Tagalog and English words in a Taglish text can be measured. Our study needed this to determine the relative frequency of Taglish in a large dataset of text from our respondents. It has since grown into a reusable tool for language detection.
Usage
TagLID can be used both as a Python library and a command-line interface (CLI) tool. For comprehensive instructions, check the repository.
Library mode
You can use TagLID as a library by importing it into your Python code.
from taglid.lid import lang_identify, simplify
labeled_text = lang_identify("hello, mundo")
print(simplify(labeled_text))
Output:
[('hello', 'eng'), ('mundo', 'tgl')]
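Since the library was built to measure how often each language appears, labeled pairs in the shape shown above can be tallied into relative frequencies. The following is a minimal sketch, not part of TagLID itself: `language_shares` is a hypothetical helper, and the sample list is hard-coded in the same format as the output above so that the snippet runs without the library installed.

```python
from collections import Counter


def language_shares(pairs):
    """Return each label's share of the labeled words (hypothetical helper)."""
    # Count how many words received each language label.
    counts = Counter(label for _, label in pairs)
    total = sum(counts.values())
    if total == 0:
        return {}
    return {label: n / total for label, n in counts.items()}


# Hard-coded sample in the same shape as the output shown above.
sample = [("hello", "eng"), ("mundo", "tgl")]
print(language_shares(sample))  # {'eng': 0.5, 'tgl': 0.5}
```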
CLI mode
You can also use TagLID as a CLI tool by running python -m taglid.
How it works
The basic algorithm, defined below, relies on word frequency lists. It checks whether a word appears in the Tagalog and English dictionaries and labels its language accordingly. This initial algorithm is in line with the one employed by Herrera (2022). Some words appear in both dictionaries, such as ‘at’, which is both an English preposition and a Tagalog conjunction meaning ‘and’. For these ambiguous cases, the word's frequency in each language is compared, and the word is labeled with the language in which it is more frequent.
% Dictionary-based word level language identification
\begin{algorithm}
\caption{Dictionary-based word level language identification}
\begin{algorithmic}
\FUNCTION{LangIdentify}{$$word$$}
\STATE $$lang = null$$
\IF{word \textbf{in} EngDict \textbf{and} word \textbf{in} TglDict}
\IF{EngFreq[word] > TglFreq[word]}
\STATE $$lang = "ENG"$$
\ELSEIF{TglFreq[word] > EngFreq[word]}
\STATE $$lang = "TGL"$$
\ENDIF
\ELSEIF{word \textbf{in} EngDict}
\STATE $$lang = "ENG"$$
\ELSEIF{word \textbf{in} TglDict}
\STATE $$lang = "TGL"$$
\ENDIF
\RETURN $$lang$$
\ENDFUNCTION
\end{algorithmic}
\end{algorithm}
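The pseudocode above can be sketched in Python as follows. This is an illustration under stated assumptions rather than TagLID's actual source: the function name `identify_word` is hypothetical, the two frequency tables are tiny toy dictionaries with invented counts, and membership in a table stands in for membership in the corresponding dictionary.

```python
# Toy frequency tables with invented counts, for illustration only.
# The real tool relies on full word frequency lists.
ENG_FREQ = {"at": 500, "hello": 120, "the": 900}
TGL_FREQ = {"at": 800, "mundo": 60, "ang": 950}


def identify_word(word):
    """Label one word as "ENG", "TGL", or None, mirroring the pseudocode."""
    in_eng = word in ENG_FREQ
    in_tgl = word in TGL_FREQ
    if in_eng and in_tgl:
        # Ambiguous word: pick the language where it is more frequent.
        if ENG_FREQ[word] > TGL_FREQ[word]:
            return "ENG"
        if TGL_FREQ[word] > ENG_FREQ[word]:
            return "TGL"
        return None  # equal frequencies: left unlabeled, as in the pseudocode
    if in_eng:
        return "ENG"
    if in_tgl:
        return "TGL"
    return None  # word found in neither table


for w in ["hello", "mundo", "at", "xyz"]:
    print(w, identify_word(w))
```

With these toy counts, ‘at’ is labeled Tagalog because its invented Tagalog frequency is higher. A word missing from both tables stays unlabeled.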
This function is preceded by supplementary functions that exclude certain words from the count because they are not specific to English or Tagalog. It is followed by other supplementary functions that handle special cases the word frequency lists cannot catch. See the source code for these supplementary functions.
References
2023
- Maagma & Salido (2023). Correlation Between the Level of Language Dominance and the Relative Frequency of Taglish Code-Switching of Science, Technology, and Engineering Students. Conducted for Practical Research 2.