Generating your own random text

Use this Bash script to make your own placeholder text.

There are many reasons you might need to create placeholder text. For example, if you are building a new website, you may not have all of the content ready as you're creating the design; placeholder text helps you see what the design will look like after you've added the content.

For years, my "go-to" to generate sample content for documents was the lipsum.com website, to insert Latin-like meaningless text. Most people are able to ignore the placeholder content if they immediately recognize that it's just meaningless words, and Lorem Ipsum can do that very well. If I want placeholder text in English, I sometimes use other placeholder generators to do the same job, by inserting random content from Star Wars, Doctor Who, and Star Trek.

Depending on what I'm working on, I might also use scribble fonts with placeholder text such as the Flow font or Redacted Script font. Using a combination of "dummy" text with a scribble font hides the placeholder content, so it doesn't matter what text you use on the sample website or in the sample document. Instead, people only "see" lines and squiggles to suggest text on the page. And that's great if I'm just trying to work out a design without getting bogged down by the details of what specific text will be on the page.

If you want to insert placeholder text like Lorem Ipsum, you can install any number of text generators for your system. Some text editors can also insert Lorem Ipsum placeholder text for you. But there's another way to create placeholder text without installing an app or copying from a website: you can make your own text generator.

Scripting with Bash

I wrote my own Bash script on Linux to generate a few paragraphs of random text. Here's how it works.

Word lists

Most Linux systems include a dictionary of correctly spelled words, usually saved in /usr/share/dict/words. On some distributions, you may need to install a word list package first. The words are in sorted order, and the file contains both capitalized and lowercase words. If you use the head command to print the first ten lines of the words file, you will see "words" that start with numbers:

$ head /usr/share/dict/words
1080
10-point
10th
11-point
12-point
16-point
18-point
1st
2
20-point

If you add the grep command to search for all lines that start with a lowercase letter a, you can see the first ten examples:

$ grep '^a' /usr/share/dict/words | head
a
a'
a-
a.
a1
aa
aaa
aah
aahed
aahing

The grep command is an old Unix command that operates with regular expressions. You can think of a regular expression as search text that can include special text that indicates the start of a line (^), the end of a line ($), or other markers. You can specify repeating examples of text by using + for one or more or * to mean zero or more of the previous character. If you want to specify certain classes of characters, you can use special brackets like [[:upper:]] to mean the uppercase letters A to Z, or [[:lower:]] for the lowercase letters.

This is a very flexible command that makes it possible to search for all kinds of text in a file. For example, to print all lines that start with an uppercase letter followed by one or more lowercase letters, you would use this regular expression:

grep '^[[:upper:]][[:lower:]]\+$' /usr/share/dict/words
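
To see what the pattern does before running it on the full dictionary, you can try it on a handful of lines. This is only a minimal sketch: the sample entries and the sample_words name are made up for illustration and are not taken from any particular words file.

```shell
# Made-up sample entries standing in for a words file
sample_words() {
  printf '%s\n' 10th a aah Abel ABC "Abel's" Zapus zebra
}

# Keep only lines that are one uppercase letter followed by
# one or more lowercase letters, and nothing else
capitalized=$(sample_words | grep '^[[:upper:]][[:lower:]]\+$')
echo "$capitalized"
```

Only Abel and Zapus pass the test. The other entries start with a digit or a lowercase letter, contain a second uppercase letter, or include an apostrophe.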

However, this can find some very long words, if they are in the words file. On my system, the longest words that start with an uppercase letter followed by one or more lowercase letters are Prorhipidoglossomorpha, Pseudolamellibranchia, and Pseudolamellibranchiata. Those are too long if I want to generate some random placeholder text for a website. I think good placeholder text uses words of a reasonable length, more than 2 letters and fewer than 8.

To limit the length of the words, I can send the output of the grep command to another classic Unix command called awk, implemented as gawk (GNU awk) on most Linux systems. The awk command takes pairs of patterns and actions; for each matching pattern, it executes the action. In my case, to print just the words that start with an uppercase letter followed by one or more lowercase letters, and are more than 2 letters and less than 8 letters long, I would use this command:

grep '^[[:upper:]][[:lower:]]\+$' /usr/share/dict/words | gawk 'length($0)>2 && length($0)<8 {print}'

That's a long line, but it's just a grep command that finds the matching lines and sends them to the gawk command.
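
The idea of pattern and action pairs may be easier to see with two rules side by side. This is a sketch under a few assumptions: the label_words name and the sample words are made up, and it calls the generic awk command, which you can replace with gawk on a system that has GNU awk installed.

```shell
# Each awk rule is "pattern {action}"; awk tests every rule against every line
label_words() {
  awk 'length($0) > 2 && length($0) < 8 {print "keep:", $0}
       length($0) <= 2 || length($0) >= 8 {print "skip:", $0}'
}

printf '%s\n' Al Abel Zapus Pseudolamellibranchia | label_words
```

The first rule fires for Abel and Zapus, and the second rule fires for Al (too short) and Pseudolamellibranchia (too long).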

But a gawk pattern can also be a regular expression, using basically the same syntax as the grep command. That allows us to rewrite the command to search the /usr/share/dict/words file for all words that start with an uppercase letter followed by one or more lowercase letters, more than 2 letters and less than 8 letters, as a single gawk command:

gawk '/^[[:upper:]][[:lower:]]+$/ {if ((length($0)>2) && (length($0)<8)) {print}}' /usr/share/dict/words > upper.tmp

This moves the length test inside the action, using if to determine if the word's length is greater than 2 and less than 8. Other than using a redirector (>) to save the output to a temporary file called upper.tmp, the command is essentially the same, but doing it all inside gawk instead of using grep then gawk.

We can generate a list of all lowercase words of a certain length with a similar gawk command:

gawk '/^[[:lower:]]+$/ {if ((length($0)>2) && (length($0)<8)) {print}}' /usr/share/dict/words > lower.tmp
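
As an alternative sketch, and not the approach my script uses, the length limit can also be written into the regular expression itself with an interval, where {3,7} means "between 3 and 7 of the previous item." This assumes grep -E for extended regular expressions, and the sample words are made up for illustration.

```shell
# Made-up sample entries standing in for a words file
sample_words() {
  printf '%s\n' an ant anthill anteater Al Abel Zapus Pseudolamellibranchia
}

# 3 to 7 lowercase letters: more than 2 and less than 8
short_lower=$(sample_words | grep -E '^[[:lower:]]{3,7}$')

# One uppercase letter plus 2 to 6 lowercase letters: 3 to 7 letters in total
short_upper=$(sample_words | grep -E '^[[:upper:]][[:lower:]]{2,6}$')

echo "$short_lower"
echo "$short_upper"
```

This prints ant and anthill from the lowercase test, then Abel and Zapus from the capitalized test.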

Loops

The script generates 5 paragraphs of text, each consisting of a random number of sentences, each with a random number of words. I do this with several for loops, to iterate over a set of values. For example, to print out the text "Hello" 4 times, I would write this for loop:

for word in 1 2 3 4; do echo "Hello"; done

If you type this at the Bash command line, or save it to a "script" file and run it, you should see "Hello" printed back to you 4 times:

Hello
Hello
Hello
Hello

At every "pass" through the loop, the variable word is assigned the value 1, 2, 3, or 4. You can print out the value of the word variable by writing it with a "dollar sign" in front, like this to print the numbers 1, 2, 3, and 4:

for word in 1 2 3 4; do echo $word; done

If you run this at the Bash command line, Bash will print the values 1, 2, 3, and 4 on separate lines:

1
2
3
4

You can also put one for loop "inside" another; this is called nested loops. It's easiest to show nested loops by writing it in a script, where I can split up the lines to make the instructions more clear. For example, this prints the values A1, A2, B1, and B2 to the screen using nested loops:

for letter in A B ; do
  for number in 1 2 ; do
    echo $letter$number
  done
done

I've also added some extra spacing so you can see the nested loops in action, and to make clear what is "inside" each loop. When I write for loops like this, I usually write the ; with spaces on either side. This is just a personal style; you don't need to use the extra space.

If you save this to a script and run it, you should see the values A1, A2, B1, and B2 printed to the screen. That's because the "outer" loop iterates through the letters A and B; for each "letter" loop, the "inner" loop iterates through the numbers 1 and 2. The effect is the loop generates the four values in order:

A1
A2
B1
B2

Random lines

To generate random words, either all lowercase words or words that start with an initial uppercase letter, we need to print random lines from a word file. We can use gawk to find the words we need; the next step is to pick random words from the temporary file.

Linux provides a command called shuf that shuffles the lines of a text file, or of any text you send to it, and prints them in a random order. For example, let's print the numbers 1, 2, 3, and 4 in a random order with the shuf command:

for num in 1 2 3 4; do echo $num; done | shuf

Every time you run this command, Bash will print the list of 4 numbers, and "send" them to the shuf program, which prints the lines in a random order. But there's an easier way to generate a few numbers, using the seq command to print a sequence of numbers. For example, to print the sequence from 1 to 4, but in a random order, send the output from seq into the shuf command:

seq 4 | shuf

The random order changes every time you run this command, but it might create a list that looks like this:

4
1
2
3
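
The seq command also pairs well with a for loop. In this sketch, the $( ) syntax is command substitution: Bash runs the command inside the parentheses and uses its output as the list of values for the loop. The count_passes name is made up for illustration.

```shell
# Bash splits the output of seq into separate values, so with an
# argument of 3 this behaves like "for n in 1 2 3"
count_passes() {
  for n in $(seq "$1") ; do
    echo "pass $n"
  done
}

count_passes 3
```

This way, the number of passes can come from a variable instead of a list you type out by hand.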

If you have a longer list, but only want to see the first few lines from the shuffled list, send the output to the head command. This prints only the first ten lines by default; use a hyphen with a number to print that many lines, such as this to shuffle a list of ten numbers but print only 4 lines of output:

seq 10 | shuf | head -4
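
The shuf command can also limit its own output. This sketch assumes the GNU coreutils version of shuf, where the -n option sets how many lines to print; it should give the same kind of result as sending the output to head, with one less command in the pipeline.

```shell
# Print 4 randomly chosen lines from the numbers 1 to 10
picked=$(seq 10 | shuf -n 4)
echo "$picked"
```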

Putting it all together

With these Bash scripting commands, plus a few extra Bash features that I'll show you, you can generate a few paragraphs of random text. Each paragraph contains a random number of sentences, between 3 and 7. Each sentence starts with one capitalized word followed by a random number of lowercase words, between 3 and 8, for a total of 4 to 9 words.

#!/bin/bash

words=/usr/share/dict/words

lower=/tmp/lower.tmp
upper=/tmp/upper.tmp

gawk '/^[[:lower:]]+$/ {if ((length($0)>2) && (length($0)<8)) {print}}' $words > $lower
gawk '/^[[:upper:]][[:lower:]]+$/ {if ((length($0)>2) && (length($0)<8)) {print}}' $words > $upper

for para in $(seq 5) ; do
  s=$((RANDOM % 5 + 3))

  for sent in $(seq $s) ; do
    w=$((RANDOM % 6 + 3))
    ( shuf -n 1 $upper ; shuf -n $w $lower ) | tr '\n' ' ' | sed 's/ $/. /'
  done
  echo -e '\n'
done

rm -f $lower $upper

On my system, I saved this script to a file called mkwords.bash. Let's look at this in more detail to understand how it works:

The first few lines save some values to a few variables; a variable is just a way to access a value later on. In this case, I've saved the path to the word list in a words variable, the path to a list of lowercase words in the lower variable, and a list of uppercase words in the upper variable. I can use these at any time in the Bash script with a "dollar sign" like $words to get the full path to the word list, at /usr/share/dict/words.

After that, the script runs the two gawk commands to generate the list of all-lowercase words and the list of words that start with an uppercase letter.

Then, the script uses a nested for loop to print 5 paragraphs. This also sets a variable called s that is a random number between 3 and 7. That's because the $(( )) brackets create an arithmetic expansion, so Bash can do simple arithmetic. You probably know the basic arithmetic operators like add (+), subtract (-), multiply (*) and divide (/). You can also use % to mean modulo, or the remainder after division. For example, 9 % 4 is 1, because 9 divided by 4 is 2 with 1 left over. The arithmetic expansion to assign a value to s uses RANDOM to mean a random number, and taking the modulo of 5 will give a value in the range 0, 1, 2, 3, or 4. That means s can be in the range 3 (0 + 3) to 7 (4 + 3).
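
You can try the arithmetic at the command line to see it work. This is a minimal sketch that assumes Bash, because RANDOM is a Bash feature; each time you read it, Bash gives you a number from 0 to 32767.

```shell
# Modulo gives the remainder after division
remainder=$((9 % 4))
echo "9 % 4 is $remainder"

# RANDOM % 5 is 0 to 4, so adding 3 gives 3 to 7
s=$((RANDOM % 5 + 3))
echo "sentences in this paragraph: $s"
```

The first line always prints "9 % 4 is 1", while the second value changes from run to run.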

The next loop generates that many random sentences, from 1 to s, using a similar trick to pick a random number of words (w) between 3 and 8.

The last line inside the "inner" loop uses two shuf commands to print 1 random word (-n 1) from the uppercase words, then the random number of words (-n $w) from the list of lowercase words. The random words are printed one per line, so I've added the tr command to translate the newline (\n) to a space. The sed command makes line-by-line edits to add a period to the end of the line. These commands generate a series of "sentences" that begin with an uppercase word followed by a random number of lowercase words, plus a period.
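
To watch the tr and sed steps on their own, you can feed them a fixed list instead of random words. In this sketch, the three sample words are made up and stand in for the output of the two shuf commands.

```shell
# Three fixed words, one per line, stand in for the output of shuf
sentence=$(printf '%s\n' Alpha beta gamma | tr '\n' ' ' | sed 's/ $/. /')

# The brackets make the trailing space visible
echo "[$sentence]"
```

This prints [Alpha beta gamma. ] with a space after the period, which is what separates one sentence from the next in a paragraph.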

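After each paragraph, when the inner loop has finished, the script uses an echo command to print an extra newline. Actually, the echo command itself generates a newline, so this command effectively prints 2 newlines: one to end the paragraph's line, and one to leave a blank line after it.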

The last line in the script cleans up my temporary files by deleting them.

Random placeholder text

Whenever I need to generate some placeholder text for a project, I can just run this Bash script to print out a few paragraphs. Every time I run the script, it prints 5 paragraphs of a few sentences, each with a reasonable number of words. This is somewhat representative of text that I might include in a document.

The script prints each paragraph on a single line. To make it more readable, I'll send the output through the fmt program to "wrap" the lines:

$ bash mkwords.bash | fmt
Birkett lainer palmus sault momo platten. Jonas besugo rerake hisself
pont mappers rethaw. Eurasia mousees unlobed unogled plumcot pit conchal
bohawn. Duparc ova ethnic fainty peened erbiums.

Gargan arustle memoria tertio goonies violins chablis raku fautor. Gustin
markery sabalo garvock therms ginete darshan. Murdock ead shaving
lifeway atveen aswash woorali bauckie bruiser. Nolitta culture luteway
lehrs hostel gelofer ahmedi thecla fondish. Kona upcrawl coadmit toronto
devilet cropper jinni.

Pinto hearten toit cazimi upmast mastmen. Lynnell altho kitties twains
coden zaffirs enflesh hippic. Lati math sledger rivets. Alphons achene
vexers sago oops nookery. Palilia desugar orguil laddie. Bascom kowtows
milage rematch. Buber frizzy remue giddily micas chays.

Kula pans hatch shucked murly. Nivre cucupha fizzed boding villein
tortive. Esidrix pagan grinnie muggily. Templia weste unget shellum
fungoid mansard aequor slinker dernly.

Jan cockeye elfwife erosely. Hendel yett mouched isaac. Seami iocs cedar
adonite junt rappage doatish.

Every time I run the mkwords.bash script, it generates new random words, sentences, and paragraphs:

$ bash mkwords.bash | fmt
Anomura sardel asks nongod abacus postage belder. Haydn nad munjeet
nth subcool galjoen sipper defiber. Zapus nasutus honkie slour. Joshuah
alunite sideman colarin opiate pound mibound. Mozart snivy tchu storied
opine flated darnix calx.

Aquilid somaten gaucher pedro glaucus skellum. Luhe justles break
mattins mids baryte. Smoos buirdly oenone perrier terefah kuskus autarky
barms. Parris dishpan yoicks juddock lab bechalk aidful. Lorado sieur
croft civets ritards lezzies deaned.

Mikana rowdy strond taxicab braye leprosy beachy opiner. Qld dowse
morally tissual. Disa goad alinit sirex soddier.

Rozelle helloes shamash ideally truced ascitb lobo. Suzanna teetee trull
didymis spumier tackety joul pilule. Guaymas lochy gurgeon hyson. Concho
chivies adfix urazole whumped teca seor tamein. Unkelos balded jambul
pudenda. Dunlo subsign atheize beroll reduced smelted urunday regloss
gingers.

Achsah beastly notcher gifted pows tilaka. Dobb neascus tardy goos rads
sarum unlived creped pucksey. Anselmi lizard hamline mandala archy indigen
vac. Kaiser rhet verdins bustian grigs scotale. Herald solein nilghau
rigel smarm vena. Jourdan sakeber apologs simians hlqn seenie cubhood
keened glime. Dannica agnails pulser ungum sedent awns transom visit.

This script works well for me, but you can still improve it. For example, every time the script runs, it generates the same two filtered lists of words from the /usr/share/dict/words file. Since the system's word list doesn't change very often, you can make this script run faster if you save the filtered lists somewhere in your home directory, and only regenerate the lists if they are not there.
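
One way to do that is to test whether each list already exists before rebuilding it. This is only a sketch under several assumptions: the build_list name, the stand-in word file, and the temporary directory are hypothetical, and the filter uses grep -E with an interval such as {3,7} (meaning 3 to 7 of the previous item) instead of the gawk commands from my script.

```shell
# Hypothetical helper: build a filtered list only if it is missing or empty
build_list() {
  pattern=$1 ; src=$2 ; dest=$3
  if [ ! -s "$dest" ] ; then
    grep -E "$pattern" "$src" > "$dest"
  fi
}

# Stand-in word file and cache directory so the sketch runs anywhere; in
# practice these might be /usr/share/dict/words and a directory under $HOME
workdir=$(mktemp -d)
printf '%s\n' an ant anthill anteater Al Abel Zapus > "$workdir/words"

build_list '^[[:lower:]]{3,7}$' "$workdir/words" "$workdir/lower.txt"
build_list '^[[:upper:]][[:lower:]]{2,6}$' "$workdir/words" "$workdir/upper.txt"

cat "$workdir/lower.txt" "$workdir/upper.txt"
```

With this approach, you would also drop the rm command at the end of the script, so the lists are still there on the next run.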

Also, the /usr/share/dict/words file contains some words that are not work-friendly. So instead of using the system's word list, you might make your own list of words to use. One way to create a list like this is to use the words from other documents you have already written, and use that word list as the starting point.
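
As a rough sketch of that idea, the tr command can split a document into one word per line, and sort -u can remove the duplicates. The make_word_list name and the two-line sample document are made up for illustration.

```shell
# Hypothetical helper: turn a text file into a sorted list of unique words.
# tr -cs replaces every run of non-letters with a single newline.
make_word_list() {
  tr -cs '[:alpha:]' '\n' < "$1" | LC_ALL=C sort -u
}

# A made-up two-line document stands in for something you have written
doc=$(mktemp)
printf '%s\n' 'The quick brown fox.' 'The lazy dog, the quick cat!' > "$doc"

my_words=$(make_word_list "$doc")
echo "$my_words"
rm -f "$doc"
```

The list keeps the original capitalization, so it contains both "The" and "the". If you save the output to a file and point the words variable at it, the two gawk commands in the script can still build separate capitalized and lowercase lists.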

But if you just want to generate a few paragraphs of random-length sentences with random words, this script will do the job. And you can do it on your own with a Bash script.