Training a Markov Process on Pride and Prejudice
Justin Pearson
Oct 26, 2014
For simplicity, we’ll just chop the text into tokens at whitespace and keep all the punctuation. Consequently the word “Elizabeth” will be different from the word “Elizabeth,” (with a comma). Fine.
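Something like this (a sketch; the file name is just whatever you saved the plain-text novel as):

    (* Read the raw text and split on runs of whitespace, keeping punctuation attached to the words. *)
    text = Import["pride_and_prejudice.txt", "Text"];
    words = StringSplit[text];
    Length[words]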
Enumerate the unique words
Here’s an association to convert from words to their index in the unique-word list:
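A sketch (the variable names uniqueWords and wordToIndex are mine):

    (* Unique words, in order of first appearance, and a word -> index lookup table. *)
    uniqueWords = DeleteDuplicates[words];
    wordToIndex = AssociationThread[uniqueWords -> Range[Length[uniqueWords]]];
    wordToIndex["truth"]   (* e.g. 6, per the table below *)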
The text of Pride and Prejudice, converted to word indices:
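A sketch of the conversion (wordIndices is my name for the result; 40 is just how many pairs are displayed below):

    (* Replace every word by its index in uniqueWords. *)
    wordIndices = Lookup[wordToIndex, words];
    (* Peek at the first 40 word/index pairs: *)
    TableForm[Transpose[{Take[words, 40], Take[wordIndices, 40]}]]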
Word | Index |
Chapter | 1 |
1 | 2 |
It | 3 |
is | 4 |
a | 5 |
truth | 6 |
universally | 7 |
acknowledged, | 8 |
that | 9 |
a | 5 |
single | 10 |
man | 11 |
in | 12 |
possession | 13 |
of | 14 |
a | 5 |
good | 15 |
fortune, | 16 |
must | 17 |
be | 18 |
in | 12 |
want | 19 |
of | 14 |
a | 5 |
wife. | 20 |
However | 21 |
little | 22 |
known | 23 |
the | 24 |
feelings | 25 |
or | 26 |
views | 27 |
of | 14 |
such | 28 |
a | 5 |
man | 11 |
may | 29 |
be | 18 |
on | 30 |
his | 31 |
Method 1: Build our own first-order Markov chain
Build the transition matrix for our discrete Markov process: count how often each word follows each other word, then row-normalize the counts.
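Here’s roughly how that can go (a sketch: tally consecutive word-index pairs, row-normalize, then sample a short walk by hand; all the variable names are mine):

    n = Length[uniqueWords];
    pairs = Partition[wordIndices, 2, 1];                 (* consecutive word pairs *)
    counts = SparseArray[Rule @@@ Tally[pairs], {n, n}];  (* counts[[i, j]] = # of times word j follows word i *)

    (* Row-normalize so each row is a probability distribution over the next word.
       (The guard only matters for an all-zero row, which can come only from the final token.) *)
    rowTotals = Normal[Total[counts, {2}]];
    transitionMatrix = counts/Replace[rowTotals, 0 -> 1, {1}];

    (* Generate a little text by hand: repeatedly sample the next word from the current row. *)
    nextIndex[i_] := RandomChoice[Normal[transitionMatrix[[i]]] -> Range[n]];
    walk = NestList[nextIndex, wordToIndex[First[words]], 30];
    StringRiffle[uniqueWords[[walk]]]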
Method 2: Use Mathematica’s built-in DiscreteMarkovProcess
We can use Mathematica’s built-in functions to train a discrete Markov process from our text.
Unfortunately, it’s quite slow for datasets of this size, so we use a subset of the text.
Here we use Mathematica’s “EstimatedProcess” function to estimate the weights of an n-state discrete Markov chain:
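A sketch of what that looks like (the 5,000-token prefix is an arbitrary choice; anything much bigger gets slow):

    (* Train on a prefix of the index sequence; EstimatedProcess bogs down on the full novel. *)
    subset = Take[wordIndices, 5000];
    nStates = Max[subset];   (* the largest word index that appears in the prefix *)
    proc = EstimatedProcess[subset, DiscreteMarkovProcess[nStates]]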
Let’s run the discrete Markov process for 200 time steps:
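A sketch of the simulation, mapping the simulated state indices back to words:

    (* Simulate the estimated chain for 200 steps and turn the state indices back into text. *)
    sim = RandomFunction[proc, {0, 200}];
    stateIndices = Flatten[sim["ValueList"]];
    StringRiffle[uniqueWords[[stateIndices]]]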
Appendix: Word-frequency analysis
Word counts often follow a Zipf distribution: a word’s frequency is roughly inversely proportional to its frequency rank, so a small number of very common words accounts for a large share of the text. Let’s see if Pride and Prejudice follows this trend.
Wikipedia:
https://en.wikipedia.org/wiki/Zipf%27s_law
Zipf’s law states that given some corpus of natural language utterances, the frequency of any word is inversely proportional to its rank in the frequency table. Thus the most frequent word will occur approximately twice as often as the second most frequent word, three times as often as the third most frequent word, etc.: the rank-frequency distribution is an inverse relation. For example, in the Brown Corpus of American English text, the word “the” is the most frequently occurring word, and by itself accounts for nearly 7% of all word occurrences (69,971 out of slightly over 1 million). True to Zipf’s Law, the second-place word “of” accounts for slightly over 3.5% of words (36,411 occurrences), followed by “and” (28,852). Only 135 vocabulary items are needed to account for half the Brown Corpus.[4]
Zipf’s law is most easily observed by plotting the data on a log-log graph, with the axes being log (rank order) and log (frequency). For example, the word “the” (as described above) would appear at x = log(1), y = log(69971). It is also possible to plot reciprocal rank against frequency or reciprocal frequency or interword interval against rank.[1] The data conform to Zipf’s law to the extent that the plot is linear.
What are the rarest words? What are the most common?
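A quick look at both ends of the frequency table (a sketch; wordCounts is my name for the word-count association):

    wordCounts = Counts[words];          (* word -> number of occurrences *)
    TakeLargest[wordCounts, 10]          (* the most common words *)
    TakeSmallest[wordCounts, 10]         (* ten of the many words that occur exactly once *)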
(The bare numbers in the list are chapter numbers.)
Plot on log-log axes to check whether the word counts follow a Zipf distribution
Let’s plot log-rank and log-count as described above, so the word “to” would appear at x = log(1), y = log(4051), “the” would appear at x = log(2), y = log(4047), etc.
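A sketch of that plot:

    (* Counts sorted in decreasing order; rank r is plotted against its count on log-log axes. *)
    rankedCounts = Reverse[Sort[Values[wordCounts]]];
    ListLogLogPlot[rankedCounts, AxesLabel -> {"rank", "count"}]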
Looks pretty linear in the middle, not so great at the ends. What’s up with the long tail -- all the words that only appear once?
A lot of words occur only once:
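For example (a sketch):

    (* The words that appear exactly once. *)
    onceWords = Keys[Select[wordCounts, # == 1 &]];
    Length[onceWords]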
This is mainly because we didn’t strip punctuation:
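A rough check (a sketch): count how many of those once-only tokens contain punctuation or digits.

    (* Tokens like "Elizabeth," or "wife." count separately from their bare forms. *)
    Count[onceWords, _?(StringContainsQ[#, PunctuationCharacter | DigitCharacter] &)]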
How many words account for half the text?
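A sketch, reusing the ranked counts from the plot above: accumulate them until we pass half of all the words in the text.

    (* Number of top-ranked words whose counts add up to half the total word count. *)
    LengthWhile[Accumulate[rankedCounts], # < Total[rankedCounts]/2 &] + 1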
Wow, one could write half of Pride and Prejudice with only 84 words!
Fit a Zipf distribution to the word counts.
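A sketch of the fit and of the comparison against the histogram (rho is just the symbol for the Zipf parameter being estimated):

    (* Fit a Zipf distribution to the word counts. *)
    zipfEst = EstimatedDistribution[Values[wordCounts], ZipfDistribution[rho]];

    (* Compare the fitted PDF against the empirical histogram of counts. *)
    Show[Histogram[Values[wordCounts], {1}, "PDF"],
         DiscretePlot[PDF[zipfEst, k], {k, 1, 100}]]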
The estimated distribution seems to line up well with the histogram of word counts:
With this, you can answer questions like: What’s the probability that a word occurs more than 100 times in the text?
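Under the fitted distribution (a sketch):

    (* P(a word occurs more than 100 times), according to the fitted Zipf distribution. *)
    NProbability[x > 100, x \[Distributed] zipfEst]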
Hopefully this matches the fraction of the unique words that occur more than 100 times:
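And the empirical version:

    (* Fraction of the unique words that occur more than 100 times in the text. *)
    N[Count[Values[wordCounts], c_ /; c > 100]/Length[wordCounts]]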
Not too bad.