November 20, 2018
Python, one of the most powerful and popular programming languages, can be put to many uses. One of them is language detection.
What is Language Detection?
Can you tell which language the following sentences are written in?
“sonrakine bir göz at”
No? Try this one:
“iyi, eğer bir selamı hakedebilirsen”
Finding a document’s source language is an important step for numerous cross-language tools.
That is the reason for implementing a language identification algorithm.
Language detection falls under NLP (Natural Language Processing), the study of how computers can extract meaning and value from human language.
It deals with the problem of determining which natural language a given piece of content is written in.
Need For Language Identification
Natural language models are usually specific to a single language. If you are not sure which language an incoming document is written in, it becomes very hard to provide anyone with a good experience.
Wouldn’t it be great if a technical support chatbot could determine a speaker’s language and reply with documentation in the same language?
Or if a sentiment analysis tool could automatically detect the sentiment of text in any human language?
Language identification is designed to make exactly that possible.
Language identification is an important step in the text mining process. Successfully analyzing extracted text with natural language processing, or using it for machine learning training, demands a good language identification algorithm.
If the algorithm fails to identify the language correctly, that error will invalidate the processing steps that follow.
NLP algorithms have to be adapted to different corpora and to the grammar of different languages.
Certain NLP software is best suited to certain languages; NLTK and FreeLing are two examples.
NLTK is the most favored natural language processing package for English under Python, whereas FreeLing is best for Spanish.
The effectiveness of language processing depends on various factors.
A high-quality text analysis pipeline includes the following steps:
1. Text Extraction:
Text can be extracted by scraping web data, importing it in a particular format, fetching it from a database, or obtaining it through an API.
2. Text Identification:
This is the process of separating the text of interest from other formats that add noise to the analysis.
3. Natural Language Processing (NLP):
This is a set of algorithms that support the processing of different languages.
4. Machine Learning:
This is an essential step for objectives such as collaborative filtering, sentiment analysis, and clustering.
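As a rough illustration of the text identification step, the sketch below strips markup from an HTML fragment using only Python's standard library, so that only the visible text is passed on for language detection. The class name TextOnlyParser and the helper extract_text are hypothetical names chosen for this example, and real pages usually need more careful cleaning than this.

```python
from html.parser import HTMLParser


class TextOnlyParser(HTMLParser):
    """Collect visible text and skip the contents of <script> and <style>."""

    SKIPPED_TAGS = {"script", "style"}

    def __init__(self):
        super().__init__()
        self._chunks = []
        self._skip_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIPPED_TAGS:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIPPED_TAGS and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        # Keep text only when we are not inside a skipped element.
        if self._skip_depth == 0:
            self._chunks.append(data)

    def text(self):
        # Collapse the runs of whitespace left behind by removed markup.
        return " ".join(" ".join(self._chunks).split())


def extract_text(html):
    """Return the visible text of an HTML fragment as a single line."""
    parser = TextOnlyParser()
    parser.feed(html)
    parser.close()
    return parser.text()


sample = "<html><style>p {color: red}</style><p>sonrakine <b>bir</b> göz at</p></html>"
print(extract_text(sample))  # sonrakine bir göz at
```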
There is a lot of language recognition software available online. NLTK uses Crúbadán, whereas GATE includes TextCat.
We prefer using the Google Language API because it is very precise, even for a single word. It also returns an accuracy measure with its result.
Language detection with Python
langdetect is derived directly from Google’s language-detection library. From the home page of the Python library you can reach the project page; this seems to be different from the code on which the R library CLDR is based. The Python library also seems to be well organized and maintained.
It claims to detect 55 languages. A simple call to the function detect returns the two-letter ISO 639-1 code of the detected language, while a call to detect_langs returns a list of candidate languages with their probabilities. For a short, unambiguous text the list often contains a single item.
Python offers other options as well; an alternative library is langid.
It claims to be a standalone library capable of detecting 97 languages, and you can call langid.classify("your text") to get the most likely language and its “score”.
The higher the score, the more probable the language. Both libraries take only a few lines of code to use.
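Here is a minimal sketch of how the two libraries are typically called. It assumes both third-party packages are installed (pip install langdetect langid). The wrapper names with_langdetect and with_langid are hypothetical helpers for this example, not part of either library. The imports sit inside the functions so that each wrapper can be used without the other package being installed.

```python
def with_langdetect(text):
    """Return (best_code, candidates) for the text using langdetect."""
    from langdetect import DetectorFactory, detect, detect_langs

    # langdetect is non-deterministic unless a seed is set.
    DetectorFactory.seed = 0

    best = detect(text)  # a language code such as "en"
    # detect_langs returns objects carrying a language code and a probability.
    candidates = [(item.lang, item.prob) for item in detect_langs(text)]
    return best, candidates


def with_langid(text):
    """Return (language_code, score) for the text using langid."""
    import langid

    # classify returns the most likely language and its score.
    return langid.classify(text)
```

Calling with_langdetect("sonrakine bir göz at") returns the best guess together with the candidate list, and with_langid on the same text returns a (language, score) pair. Very short inputs like this one are harder to classify than full paragraphs, so results may vary.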
Combination of Python and NLTK for Language Detection
Most of us work with search engines and social networks that show data in particular languages, for example Spanish and English.
To accomplish this, the text being indexed has to be examined to determine its language, and the result stored together with it.
There are quite a number of ways to achieve that; the easiest is the stop-words-based approach.
“Stopword” is a term used in natural language processing for words that should be filtered out of a text before any kind of processing takes place.
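To make the definition concrete, here is a small sketch of stop-word filtering. The ENGLISH_STOPWORDS set is a tiny hand-written sample for illustration only, and remove_stopwords is a hypothetical helper name; a real project would use a full list such as the one in NLTK's stopwords corpus.

```python
# A tiny, hand-written sample of English stop words (illustrative, not complete).
ENGLISH_STOPWORDS = {"a", "an", "and", "in", "is", "of", "the", "to"}


def remove_stopwords(tokens, stopwords=ENGLISH_STOPWORDS):
    """Return the tokens that are not stop words, compared case-insensitively."""
    return [token for token in tokens if token.lower() not in stopwords]


tokens = ["The", "language", "of", "a", "document", "is", "important"]
print(remove_stopwords(tokens))  # ['language', 'document', 'important']
```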
Your Way Through Stop Words
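Now suppose we have a text whose language we want to detect. The first step is to tokenize the text into a list of words, or “tokens”, using whichever tokenization approach suits our requirements. We can then count how many of those tokens appear in each language’s stop-word list; the language with the most matches is the most likely one.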
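One simple way to tokenize is a regular expression that separates runs of word characters from runs of punctuation. The sketch below uses only the standard re module; the pattern mirrors the one behind NLTK's wordpunct_tokenize, so that function could be used instead when NLTK is installed. The helper name tokenize is an assumption made for this example.

```python
import re

# Runs of word characters, or runs of non-space punctuation.
TOKEN_PATTERN = re.compile(r"\w+|[^\w\s]+")


def tokenize(text):
    """Split text into lowercase word and punctuation tokens."""
    return [token.lower() for token in TOKEN_PATTERN.findall(text)]


print(tokenize("iyi, eğer bir selamı hakedebilirsen"))
# ['iyi', ',', 'eğer', 'bir', 'selamı', 'hakedebilirsen']
```

Lowercasing the tokens makes the later comparison against stop-word lists case-insensitive.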
The following is an example of detecting language with the help of Python and NLTK:
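What follows is a minimal sketch of the stop-word approach, not a definitive implementation. It counts how many distinct stop words of each language occur in a text and picks the language with the most matches. To stay self-contained it uses a small hand-written table, SAMPLE_STOPWORDS, which is an assumption of this example. The hypothetical helper load_nltk_stopwords shows how NLTK's stopwords corpus could supply full lists once it has been downloaded with nltk.download("stopwords"). The four sample sentences are also made up for illustration.

```python
import re

# Small hand-written stop-word samples (illustrative only, far from complete).
SAMPLE_STOPWORDS = {
    "english": {"the", "and", "is", "of", "to", "in", "that", "it"},
    "spanish": {"el", "la", "los", "y", "es", "de", "que", "en"},
    "french": {"le", "les", "et", "est", "des", "dans", "une", "qui"},
    "german": {"der", "die", "das", "und", "ist", "nicht", "ein", "mit"},
}


def load_nltk_stopwords():
    """Build the same kind of table from NLTK's stopwords corpus.

    Requires the nltk package and the downloaded "stopwords" corpus.
    """
    from nltk.corpus import stopwords

    return {lang: set(stopwords.words(lang)) for lang in stopwords.fileids()}


def language_scores(text, stopword_table=SAMPLE_STOPWORDS):
    """Count, per language, how many distinct stop words occur in the text."""
    words = set(re.findall(r"\w+", text.lower()))
    return {lang: len(words & stops) for lang, stops in stopword_table.items()}


def detect_language(text, stopword_table=SAMPLE_STOPWORDS):
    """Return the language with the most stop-word matches, or None if none match."""
    scores = language_scores(text, stopword_table)
    best = max(scores, key=scores.get)
    return best if scores[best] > 0 else None


samples = [
    "The weather is nice and it is time to go out in the sun.",
    "El perro es grande y la casa de los vecinos es pequeña.",
    "Le chat est dans le jardin et les enfants jouent avec une balle.",
    "Der Hund ist nicht im Haus und die Katze schläft mit dem Kind.",
]
for sample in samples:
    print(detect_language(sample), "<-", sample)
```

Passing load_nltk_stopwords() as the stopword_table argument switches the sketch to NLTK's full lists, which cover more languages than the four in the sample table.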
As you can see, the text is tokenized and then processed further to get the intended result. The outcome of the above program is summarized below.
The result shows clearly that the texts fed to the program, written in different languages, were identified by the Python code. There were four of them in total, and all were detected successfully.