Pseudo-Swedish generator

This program is rather silly, but it is something I had been meaning to write for some time. It uses Markov probabilities on the letters of a text to generate "pseudo-Swedish", which is sort of Claude Shannon's entropy theory for written language run backwards.

Idea

From the input text you calculate the probability that any letter will follow any other letter, e.g. the probability that a "b" will follow an "a", and so on for every pair of letters in the alphabet. For each preceding letter, these probabilities are converted to a cumulative distribution, i.e. each letter that can follow it is assigned an interval on the real axis between 0 and 1, as wide as its probability. As "space" is regarded as a letter, beginnings and ends of words are also included in this probability distribution.
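
The counting and interval steps could be sketched roughly as follows. This is not the code from the download, only a minimal illustration; the class name LetterTransitions and both method names are made up for this example, and it assumes the text is lower-cased and that runs of whitespace are collapsed to a single space.

```java
import java.util.Locale;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical helper, not taken from the original program.
class LetterTransitions {
    /** Counts how often each character is followed by each other character. */
    static Map<Character, Map<Character, Integer>> count(String text) {
        Map<Character, Map<Character, Integer>> counts = new TreeMap<>();
        // Assumption: lower-case everything and treat any whitespace run as one "space" letter.
        String s = text.toLowerCase(Locale.ROOT).replaceAll("\\s+", " ");
        for (int i = 0; i + 1 < s.length(); i++) {
            counts.computeIfAbsent(s.charAt(i), k -> new TreeMap<>())
                  .merge(s.charAt(i + 1), 1, Integer::sum);
        }
        return counts;
    }

    /**
     * Turns the follower counts of one letter into intervals on 0..1.
     * Each key is the upper bound of the interval owned by its letter.
     */
    static TreeMap<Double, Character> intervals(Map<Character, Integer> followers) {
        int total = 0;
        for (int c : followers.values()) {
            total += c;
        }
        TreeMap<Double, Character> cumulative = new TreeMap<>();
        int running = 0;
        for (Map.Entry<Character, Integer> e : followers.entrySet()) {
            running += e.getValue();
            cumulative.put((double) running / total, e.getKey());
        }
        return cumulative;
    }
}
```

For the text "ab ac", the letter "a" is followed once by "b" and once by "c", so "b" owns the interval up to 0.5 and "c" the interval from 0.5 up to 1.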

When you have this distribution, you can draw random numbers in the interval 0..1 and pick the letter whose interval contains the number. This way a random number generator can produce a sequence of letters, which can in turn be split at the "space" characters to form words.
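
A minimal sketch of the picking step, under the assumption that each letter's distribution is stored as a sorted map from interval upper bound to letter, with the last bound equal to 1. The class LetterSampler and its methods are hypothetical names for illustration, not the original program's.

```java
import java.util.Map;
import java.util.Random;
import java.util.TreeMap;

// Hypothetical helper, not taken from the original program.
class LetterSampler {
    /** Picks the letter whose interval contains r, for r in [0, 1). */
    static char pick(TreeMap<Double, Character> cumulative, double r) {
        // higherEntry returns the first upper bound strictly greater than r.
        return cumulative.higherEntry(r).getValue();
    }

    /**
     * Generates a letter sequence, starting as if a space had just been seen.
     * Assumption: dist has an entry for every letter that can be produced.
     */
    static String generate(Map<Character, TreeMap<Double, Character>> dist,
                           int length, Random rng) {
        StringBuilder sb = new StringBuilder();
        char prev = ' ';
        for (int i = 0; i < length; i++) {
            prev = pick(dist.get(prev), rng.nextDouble());
            sb.append(prev);
        }
        return sb.toString();
    }
}
```

The generated string can then be cut into words with something like generate(dist, 500, new Random()).trim().split(" +").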

On top of this I have put a "filter", which applies some basic Swedish syntax checks; e.g. triple consonants are forbidden in Swedish but may be produced by the random character generator. The cleverest part is a check on the beginnings of words, which are rather strictly constrained; the valid beginnings are also scanned from the input text. Thus:

Step 1: INPUT TEXT -> MARKOV PROBABILITIES -> DISTRIBUTION

Step 2: RANDOM NUMBERS -> DISTRIBUTION -> RANDOM WORDS -> FILTER -> FILTERED RANDOM TEXT
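
The filter stage might look something like the sketch below. It rests on two guesses that the description above does not settle: that "triple consonant" means the same consonant three times in a row, and that a word "beginning" is the run of letters before the first vowel. The class WordFilter and all its methods are hypothetical.

```java
import java.util.HashSet;
import java.util.Locale;
import java.util.Set;

// Hypothetical filter, not taken from the original program.
class WordFilter {
    // Swedish vowels; the escapes stand for a-ring, a-umlaut and o-umlaut.
    static final String VOWELS = "aeiouy\u00e5\u00e4\u00f6";

    /** Letters before the first vowel, e.g. "skr" for "skriva", "" for "apa". */
    static String onset(String word) {
        int i = 0;
        while (i < word.length() && VOWELS.indexOf(word.charAt(i)) < 0) {
            i++;
        }
        return word.substring(0, i);
    }

    /** Collects every word beginning that occurs in the input text. */
    static Set<String> onsetsIn(String text) {
        Set<String> onsets = new HashSet<>();
        for (String w : text.toLowerCase(Locale.ROOT).split("\\s+")) {
            if (!w.isEmpty()) {
                onsets.add(onset(w));
            }
        }
        return onsets;
    }

    /** Assumption: "triple consonant" = the same consonant three times in a row. */
    static boolean hasTripleConsonant(String word) {
        for (int i = 2; i < word.length(); i++) {
            char c = word.charAt(i);
            if (c == word.charAt(i - 1) && c == word.charAt(i - 2)
                    && VOWELS.indexOf(c) < 0) {
                return true;
            }
        }
        return false;
    }

    /** Keeps a generated word only if it passes both checks. */
    static boolean accept(String word, Set<String> validOnsets) {
        return !word.isEmpty()
                && !hasTripleConsonant(word)
                && validOnsets.contains(onset(word));
    }
}
```

With beginnings scanned from "skriva apa bra", the candidate "skrapa" passes, while "bla" is rejected because no input word starts with "bl".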

Download program

Java source and class files

Run it with "java Pseudo <infile> <number of words>", e.g. "java Pseudo fat_novel.txt 100" to generate 100 words based on the novel fat_novel.txt. You can get a Swedish novel at Project Runeberg.

I have found that the program generates a lot of invalid words, but also many that look remarkably familiar. Some are valid Swedish words, others are obviously invalid, and others are not Swedish words but could easily have been. (Which is sort of fun.)

Example

Using John Wahlborg's anthology Guldregn och syrén as input, the following words were generated (valid Swedish words removed, and bad words weeded out using my own feeling for the Swedish language):

dotväs
skomalåka
skafås
bäla
slina
krasigatt
hosogå
hederban
föröston
inapöka
flisi
dekadetera
hällig
...

And so on, ad nauseam. Looking at the words, I thought that the ones I didn't know looked much like Latin or ancient Nordic, which is, basically, what Swedish is probably made up from. But that was just a feeling I had.

Other ways

I presume you could do the same thing using n-dimensional vector representations of Swedish words (where n is the number of letters in the alphabet), together with some criterion for defining a valid region of that n-dimensional space as "Swedish", and then see whether a randomly generated n-dimensional vector falls inside this region or not.

Perhaps you could create a program which does not use simple Markov chains (which are actually rather lame) but more complex probabilities, where the probability of the next letter depends on the n previous and following letters, including beginning and terminating whitespace. Maybe that is what the above idea would actually do.
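
As a rough sketch of the "n previous letters" half of this idea (the "following letters" half does not fit a simple left-to-right generator and is left out), one could key the statistics on the last n letters instead of the last one. OrderNModel is a hypothetical class for illustration; storing every observed follower in a list and picking one uniformly is just another way of sampling in proportion to frequency.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Locale;
import java.util.Map;
import java.util.Random;

// Hypothetical order-n letter model, not taken from the original program.
class OrderNModel {
    private final int n;
    // Maps each n-letter context to every letter seen right after it.
    private final Map<String, List<Character>> followers = new HashMap<>();

    OrderNModel(String text, int n) {
        this.n = n;
        // Pad with n spaces so the first word has a full context,
        // and end with a space so the last word is terminated.
        String s = padding(n)
                + text.toLowerCase(Locale.ROOT).replaceAll("\\s+", " ") + " ";
        for (int i = n; i < s.length(); i++) {
            followers.computeIfAbsent(s.substring(i - n, i), k -> new ArrayList<>())
                     .add(s.charAt(i));
        }
    }

    private static String padding(int n) {
        StringBuilder sb = new StringBuilder();
        for (int i = 0; i < n; i++) {
            sb.append(' ');
        }
        return sb.toString();
    }

    String generate(int length, Random rng) {
        StringBuilder out = new StringBuilder();
        String context = padding(n);
        while (out.length() < length) {
            List<Character> next = followers.get(context);
            if (next == null) {
                // Context seen only at the very end of the text: start a new word.
                context = padding(n);
                continue;
            }
            char c = next.get(rng.nextInt(next.size()));
            out.append(c);
            context = (context + c).substring(1); // slide the n-letter window
        }
        return out.toString();
    }
}
```

Larger n gives output that looks more like the input, at the price of mostly reproducing real words once n approaches typical word length.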

A true linguist would never do things this way. They would instead (probably) use phonemics to combine the sound elements which the written language is believed to reflect, i.e. phonemes, according to their distribution in the language and the phoneme->phoneme probabilities (in the Markov case). The phonemes can then be translated to actual words using a low-level grammar which maps a phoneme representation to the corresponding character sequence, and which also has rules for how words are formed when phonemes are concatenated. (As these words carry no semantics, they can never be used to generate whole sentences, however.)

If you want to generate sentences rather than words, that is a whole different story, but of course you can do the same Markov thing on words. However, there are high-level grammar rules at that level which should be applied, and which are much more intricate. For an introduction to this entirely different subject, please read Chomsky's Syntactic Structures.
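
The "same Markov thing on words" could be sketched like this, with no grammar rules applied at all. WordChain is a hypothetical class for illustration, and it assumes words are simply whatever is separated by whitespace.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.Random;

// Hypothetical word-level chain, not taken from the original program.
class WordChain {
    /** Maps each word to the list of words seen directly after it. */
    static Map<String, List<String>> successors(String text) {
        Map<String, List<String>> next = new HashMap<>();
        String[] words = text.trim().split("\\s+");
        for (int i = 0; i + 1 < words.length; i++) {
            next.computeIfAbsent(words[i], k -> new ArrayList<>()).add(words[i + 1]);
        }
        return next;
    }

    /** Walks the chain from a start word, stopping early at a dead end. */
    static String generate(Map<String, List<String>> next, String start,
                           int count, Random rng) {
        StringBuilder sb = new StringBuilder(start);
        String word = start;
        for (int i = 1; i < count; i++) {
            List<String> options = next.get(word);
            if (options == null) {
                break; // the word only occurred last in the text
            }
            word = options.get(rng.nextInt(options.size()));
            sb.append(' ').append(word);
        }
        return sb.toString();
    }
}
```

From the text "en katt och en hund", starting at "katt" and asking for three words gives "katt och en", since each of those words has only one possible successor.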

(These are left as an exercise for the happy programmer.)