Markovian Word and Language Classifier

This is a (sort of) Markovian classification demonstration program. It is released under the GNU General Public License V2.0. Written by Linus Walleij.

What It Is

This program uses the statistical properties of written natural languages, in computer-readable form, to determine:

  1. Whether a certain word is likely to be a word in a given language, which makes it possible, to some extent, to detect nonsense words.
  2. The probability that a sufficiently long piece of text is written in a certain natural language.

How to Install It

Just untar the .tar.gz file ("tar xvfz markov-classifier.tar.gz") and enter the created directory ("cd markov-classifier"). Run all examples from there. This is not a reusable module; it is an example of what can be done with these methods.

The base directory contains example programs and a few test files, the directory "stats" contains frequency data for a few selected languages, and the directory "ispell-dictionaries" contains the dictionaries that were used to generate these frequency files.

Examples

$ ./guess-language.pl testme.txt

 Hello. I am a language guesser program.
 I can sort of tell what language a text is written in.
 This looks like Italian!

$ ./classify-words.pl -l en classifyme.en

 Hello. I am a word classifying program.
 I can tell whether words are meaningful or nonsensical.
 Classifying words in file classifyme.en
 Classifying word "rhododendron"... meaningful (score: 0.142863599603693)
 Classifying word "happiness"... meaningful (score: 0.166520392998076)
 Classifying word "gfasdgafghda"... nonsense (score: 0.0738799965448918)
 Classifying word "asfdfagsdfgfd"... nonsense (score: -0.298013889001338)
 Classifying word "stoneroller"... meaningful (score: 0.117259309085975)

The other programs are used for generating the frequencies.

How It Works

The program uses Markov probabilities. First, a large text corpus is analyzed to obtain the probabilities. This program uses "Ispell" dictionaries as its statistical source, but in practice any text corpus typical of the language you want to profile may be used.

The probabilities are based on two-letter combinations. For a given word:

 F O O B A R

The word is analyzed so that the number of times a certain letter follows a two-letter combination is stored in a table. For this example the table will be:

 After:        Follows:
 ----------------------
 FO            O
 OO            B
 OB            A
 BA            R

This is repeated for several hundred words. Then the probability of a certain letter following a two-letter combination is calculated (a value between 0 and 1). In this minimal example, the probability of O following FO, of A following OB and so on will be 1, but a larger corpus yields more realistic probabilities. All two-letter combinations that never occur implicitly have probability 0.
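
The following is a minimal Perl sketch of this training step. It is not the actual code from the distribution; the hash names %count, %total and %prob are made up for illustration:

 # Count how often each letter follows each two-letter combination,
 # reading one word per line from standard input.
 my (%count, %total);
 while (my $word = <STDIN>) {
     chomp $word;
     my @l = split //, lc $word;
     for my $i (0 .. $#l - 2) {
         my $pair = $l[$i] . $l[$i + 1];
         $count{$pair}{ $l[$i + 2] }++;
         $total{$pair}++;
     }
 }
 # Convert the counts to probabilities between 0 and 1.
 my %prob;
 for my $pair (keys %count) {
     for my $next (keys %{ $count{$pair} }) {
         $prob{$pair}{$next} = $count{$pair}{$next} / $total{$pair};
     }
 }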

When checking whether a word belongs to a certain language, the programs analyze one word at a time: for each pair of consecutive letters, they look up the probability that the next letter follows those two, and add it to the "score" for the word. The score is weighted by the word length for fairness.

If a previously unknown letter occurs after a known two-letter combination, a penalty is applied. If a previously unknown two-letter combination occurs, an even bigger penalty is applied. The penalties are also weighted.
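
A rough Perl sketch of this scoring, building on the %prob table from the previous sketch. The penalty constants and the exact weighting are illustrative assumptions, not the values used by classify-words.pl:

 # Assumed penalty values; the real scripts may use different ones.
 my $SMALL_PENALTY = 0.5;   # known pair, unknown following letter
 my $BIG_PENALTY   = 1.0;   # unknown two-letter combination

 sub score_word {
     my ($word, $prob) = @_;
     my @l = split //, lc $word;
     return 0 if @l < 3;
     my $score = 0;
     for my $i (0 .. $#l - 2) {
         my $pair = $l[$i] . $l[$i + 1];
         my $next = $l[$i + 2];
         if (!exists $prob->{$pair}) {
             $score -= $BIG_PENALTY;
         } elsif (!exists $prob->{$pair}{$next}) {
             $score -= $SMALL_PENALTY;
         } else {
             $score += $prob->{$pair}{$next};
         }
     }
     return $score / @l;   # weight by word length
 }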

To determine whether a word is nonsense, its total score is compared against some limit (such as 0.1); if the score falls below the limit, the word is deemed nonsense. This method is not foolproof, but it is quite good!
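
Using the hypothetical score_word() from the sketch above, the nonsense test then reduces to a simple comparison against the limit:

 my $limit = 0.1;   # illustrative limit from the text above
 my $score = score_word("gfasdgafghda", \%prob);
 print $score < $limit ? "nonsense\n" : "meaningful\n";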

To determine which natural language a piece of text is written in, the scores for all its words are added up. This is repeated once per set of language statistics, and the language with the highest total score "wins".
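
A final sketch of the language guess, again with made-up names: %prob_for is assumed to hold one probability table per language, and @words the words of the input text:

 my %langscore;
 for my $lang (keys %prob_for) {
     my $sum = 0;
     $sum += score_word($_, $prob_for{$lang}) for @words;
     $langscore{$lang} = $sum;
 }
 # The language with the highest total score wins.
 my ($best) = sort { $langscore{$b} <=> $langscore{$a} } keys %langscore;
 print "This looks like $best!\n";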