Training a Markov Process on Pride and Prejudice

Justin Pearson
Oct 26, 2014

For simplicity, we’ll just chop the text into tokens at whitespace and keep all the punctuation. Consequently, the word “Elizabeth” will be different from the word “Elizabeth,” (with a comma). Fine.
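
A minimal sketch of this first step (the Project Gutenberg URL and the variable names are my assumptions):

text = Import["https://www.gutenberg.org/files/1342/1342-0.txt", "Text"];

(* Splitting on whitespace keeps punctuation attached to the words. *)
words = StringSplit[text];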

Enumerate the unique words
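
A sketch, using DeleteDuplicates so that words are numbered in order of first appearance:

uniqueWords = DeleteDuplicates[words];
Length[uniqueWords]  (* 13065, per the plot in the appendix *)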

Here’s an association to convert from words to their index in the unique-word list:
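
A sketch with AssociationThread (the sample lookup is mine):

wordToIndex = AssociationThread[uniqueWords -> Range[Length[uniqueWords]]];

(* "Elizabeth" and "Elizabeth," are distinct keys, as promised above. *)
{wordToIndex["Elizabeth"], wordToIndex["Elizabeth,"]}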

The text of Pride and Prejudice, converted to word indices:
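
One way to do the conversion; the Grid call displays the first few tokens next to their indices (output below):

indices = Lookup[wordToIndex, words];
Grid[Transpose[{words[[;; 40]], indices[[;; 40]]}]]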

(Read across: each token is followed by its index.)

Chapter 1       1 2             It 3            is 4
a 5             truth 6         universally 7   acknowledged, 8
that 9          a 5             single 10       man 11
in 12           possession 13   of 14           a 5
good 15         fortune, 16     must 17         be 18
in 12           want 19         of 14           a 5
wife. 20        However 21      little 22       known 23
the 24          feelings 25     or 26           views 27
of 14           such 28         a 5             man 11
may 29          be 18           on 30           his

Method 1: Build our own first-order Markov chain

Build the transition matrix for our discrete Markov process by counting which words follow which:
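
A minimal sketch of one way to do it, using a sparse count matrix and weighted sampling (the names are mine):

n = Length[uniqueWords];

(* Count how often word j immediately follows word i. *)
pairs = Partition[indices, 2, 1];
counts = SparseArray[Normal[Counts[pairs]], {n, n}];

(* Sample a successor of word i in proportion to its count. This assumes
   row i has a nonzero entry, which holds for every word except possibly
   the book's final token. *)
step[i_] := RandomChoice[Normal[counts[[i]]] -> Range[n]];

(* Walk the chain and stitch the visited words back into text. *)
walk = NestList[step, wordToIndex["It"], 50];
StringJoin[Riffle[uniqueWords[[walk]], " "]]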

Method 2: Use Mathematica’s built-in DiscreteMarkovProcess

We can use Mathematica’s built-in functions to train a discrete Markov process from our text.
Unfortunately, it’s quite slow for datasets of this size, so we use a subset of the text.

Here we use Mathematica’s “EstimatedProcess” function to estimate the weights of an n-state discrete Markov chain:
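
A sketch, training on a 1000-token prefix (the subset size is arbitrary):

shortData = Take[indices, 1000];

(* DiscreteMarkovProcess[m] serves as a template with m states and
   unspecified transition probabilities for EstimatedProcess to fit. *)
proc = EstimatedProcess[shortData, DiscreteMarkovProcess[Max[shortData]]]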

Let’s run the discrete Markov process for 200 time steps:
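
For example, simulating one path and mapping the visited states back to words:

sim = RandomFunction[proc, {0, 200}];
StringJoin[Riffle[uniqueWords[[sim["Path"][[All, 2]]]], " "]]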

Appendix: Word-frequency analysis

Word counts often follow a Zipf distribution, in which a small set of very frequent words accounts for the bulk of the text (roughly: “the top 10% most-frequent words account for 90% of all word occurrences”). Let’s see if Pride and Prejudice follows this trend.

Wikipedia:

https://en.wikipedia.org/wiki/Zipf%27s_law

Zipf’s law states that given some corpus of natural language utterances, the frequency of any word is inversely proportional to its rank in the frequency table. Thus the most frequent word will occur approximately twice as often as the second most frequent word, three times as often as the third most frequent word, etc.: the rank-frequency distribution is an inverse relation. For example, in the Brown Corpus of American English text, the word “the” is the most frequently occurring word, and by itself accounts for nearly 7% of all word occurrences (69,971 out of slightly over 1 million). True to Zipf’s Law, the second-place word “of” accounts for slightly over 3.5% of words (36,411 occurrences), followed by “and” (28,852). Only 135 vocabulary items are needed to account for half the Brown Corpus.[4]

Zipf’s law is most easily observed by plotting the data on a log-log graph, with the axes being log (rank order) and log (frequency). For example, the word “the” (as described above) would appear at x = log(1), y = log(69971). It is also possible to plot reciprocal rank against frequency or reciprocal frequency or interword interval against rank.[1] The data conform to Zipf’s law to the extent that the plot is linear.
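
First, tally the words and sort by descending count (a sketch):

wordCounts = Reverse[SortBy[Tally[words], Last]];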

What are the rarest words? What are the most common?
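
For example:

wordCounts[[;; 10]]    (* the most common words *)
wordCounts[[-10 ;;]]   (* the rarest -- all tied at one occurrence *)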

(The bare numbers in the rarest-words list are chapter numbers.)

Plot log-log to assess the fit to a Zipf distribution

Let’s plot log-count against log-rank as described above: the word “to” appears at x = log(1), y = log(4051), “the” appears at x = log(2), y = log(4047), and so on.
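
A sketch; since wordCounts is sorted by descending count, a word’s position in the list is its rank:

ListLogLogPlot[wordCounts[[All, 2]], AxesLabel -> {"rank", "count"}]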

[Plot: the word-count-vs-rank plot is somewhat linear.]

Looks pretty linear in the middle, not so great at the ends. What’s up with the long tail -- all the words that only appear once?
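
One way to look is a histogram of the counts themselves (the bin range is my choice):

(* Counts above 20 fall outside this window. *)
Histogram[wordCounts[[All, 2]], {0, 20, 1}]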

[Plot: half of the 13065 unique words appear only once. (Total number of words in P&P: 121553.)]

A lot of words occur only once:
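
A sketch of the count:

Count[wordCounts[[All, 2]], 1]  (* about half of the 13065 unique words *)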

This is mainly because we didn’t strip punctuation:
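
For instance, many once-only tokens are just a familiar word with punctuation stuck to it (a sketch; the names are mine):

onceOnly = Cases[wordCounts, {w_, 1} :> w];
Select[onceOnly, ! StringFreeQ[#, PunctuationCharacter] &][[;; 10]]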

How many words account for half the text?
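
A sketch: accumulate the descending counts and see how many words it takes to pass the halfway mark.

totalWords = Total[wordCounts[[All, 2]]];   (* 121553 *)
cumulative = Accumulate[wordCounts[[All, 2]]];
LengthWhile[cumulative, # < totalWords/2 &] + 1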

Wow, one could write half of Pride and Prejudice with only 84 words!

Fit a Zipf distribution to it.
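
A sketch with EstimatedDistribution:

zipf = EstimatedDistribution[wordCounts[[All, 2]], ZipfDistribution[ρ]]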

The estimated distribution seems to line up well with the histogram of wordcounts:
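
For example:

Show[Histogram[wordCounts[[All, 2]], Automatic, "PDF"],
 DiscretePlot[PDF[zipf, k], {k, 1, 100}, PlotRange -> All]]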

[Plot: the fitted Zipf distribution lines up with the word-count PDF histogram.]

With this, you can answer questions like: What’s the probability that a word occurs more than 100 times in the text?
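
Under the fitted distribution (a sketch):

Probability[k > 100, k \[Distributed] zipf] // N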

Hopefully this matches the fraction of the unique words that occur more than 100 times:
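
A sketch of the empirical check:

N[Count[wordCounts[[All, 2]], c_ /; c > 100]/Length[wordCounts]]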

Not too bad.

Created with the Wolfram Language