The Google Books ngram datasets are a nice resource for playing with text generation. I've written a program that, given a starting phrase, blathers on by chaining together phrases from the datasets.

The results are much more varied, and seem to make sense more often, than those of most Markov text generators I've seen before, such as dadadodo. They had better be: the program is working with data derived from 4% of all books ever written!

Some examples. Do they tell a story?

(The starting phrase I entered is in italics.)

  • He is assailed — they merge
  • To be or not to be so deceived ! She denies abdominal scanning
  • Beauty with enormous strides . The hospital was under the control of the communal organization of solitary mammals . In other words the mere failure of an individual .
  • Time passes . It is the child's first day in town and at his own level ?
  • Time above his left shirt pocket for safe keeping . At first I accepted the old man's work . He settled in that part of Paris .
  • Until then no ill Accident would generate hydrogen sulfide
  • Random nonsense put together huge boxes with ammunition against it itself stops . Working things out afterwards
  • A change in voice and piano . He sat down at last at No . 5 in connection with another or with others ; and that by some strange disaster .
  • O God ! What is the condition of morality is that they all hold hands in a chain of propositions . These propositions are the primary bearers of meaning making and the rearing of children

I have cheated slightly by picking where to stop quoting the blather each time; it tends to go on for quite a while otherwise. It does often manage some kind of sense at length too, though:

  • Worst Case Scenarios in Search of Reality : The ERA and emendation . I can never recover . She became thoughtful . He looked at me gloomily above the tree tops . The women were asked how they used to behave more competitively . Forecasts should be made

This was accomplished using only 99 lines of code. I imported the datasets into Xapian databases using this program. And this program is the text generator.
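For readers who want a feel for the general idea, here is a minimal sketch of chaining ngrams by overlapping words. It is not the program described above: it uses a tiny made-up in-memory table (`NGRAMS`) instead of Xapian databases, and the function names `build_index` and `blather`, the two-word overlap, and the counts are all assumptions chosen for illustration.

```python
import random
from collections import defaultdict

# Hypothetical toy data: (words, count) pairs standing in for real ngram rows.
NGRAMS = [
    (("Time", "passes", "."), 120),
    (("passes", ".", "It"), 40),
    (("passes", ".", "He"), 25),
    ((".", "It", "is"), 90),
    ((".", "He", "sat"), 30),
    (("It", "is", "the"), 200),
    (("He", "sat", "down"), 60),
]


def build_index(ngrams, overlap=2):
    """Map the first `overlap` words of each ngram to its remaining words."""
    index = defaultdict(list)
    for words, count in ngrams:
        index[tuple(words[:overlap])].append((tuple(words[overlap:]), count))
    return index


def blather(start, index, overlap=2, max_words=30, rng=None):
    """Extend `start` by repeatedly picking an ngram that begins with the
    last `overlap` words generated so far, weighted by its count."""
    rng = rng or random.Random()
    words = list(start)
    while len(words) < max_words:
        candidates = index.get(tuple(words[-overlap:]))
        if not candidates:
            break  # nothing in the table continues this phrase
        tails, counts = zip(*candidates)
        words.extend(rng.choices(tails, weights=counts, k=1)[0])
    return " ".join(words)


if __name__ == "__main__":
    print(blather(("Time", "passes"), build_index(NGRAMS)))
```

Weighting the choice by count is one plausible way to favour common phrasings; a real data store would replace the dictionary lookup with a prefix query.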

I threw out the metadata, to keep the Xapian databases to a reasonable size. By which I mean, only a dozen gigabytes. It might be interesting to include the publication year data in Xapian, so that the generator could prefer ngrams that were published around the same time as the input text.
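One way the publication-year idea could work, sketched under the assumption that each candidate ngram carries a count and a year: scale the count by how close the year is to a target year. The names `year_weight` and `rank_by_era`, the Gaussian falloff, and the 25-year spread are all hypothetical choices, not anything the existing program does.

```python
import math


def year_weight(count, year, target_year, spread=25.0):
    """Scale a count down as the ngram's year moves away from target_year."""
    distance = year - target_year
    return count * math.exp(-(distance ** 2) / (2 * spread ** 2))


def rank_by_era(candidates, target_year, spread=25.0):
    """Sort (text, count, year) tuples so era-appropriate ngrams come first."""
    return sorted(
        candidates,
        key=lambda c: year_weight(c[1], c[2], target_year, spread),
        reverse=True,
    )


if __name__ == "__main__":
    # Made-up candidates for illustration only.
    candidates = [("thou art", 100, 1650), ("you are", 80, 1990)]
    print(rank_by_era(candidates, target_year=1980)[0][0])
```

A smaller `spread` would make the generator stick more tightly to one period, at the cost of having fewer ngrams to choose from.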

I'm still importing data. I may put up a web interface later if there's interest.


  • What about Spam ? Who caused it ? You have said it so loud that the little lady in a black storm .
  • What about Spam ? I think it cannot be any . The Church had the right men . They are our friends and relations who could be depended upon with regard to the UN . Some people paint the portraits of the writers .


  • Joey Hess was not the real author of these lines .