In July, I introduced a bigram substitution challenge consisting of 1346 letters. Norbert Biermann has now solved it. It’s the shortest ciphertext of this kind that has ever been broken.
The bigram substitution is a manual encryption method with a history of over 500 years. A bigram (also known as a digraph) is a pair of letters, such as CG, HE, JS or QW. The number of bigrams in the Latin alphabet is 26×26=676, ranging from AA to ZZ.
A bigram substitution replaces each letter pair with another one (or with a symbol or with a number between 1 and 676). In order to use a bigram substitution, we need a substitution table with 676 entries.
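To make the idea concrete, here is a minimal Python sketch of such a substitution table. It is only an illustration: the function names, the random table and the padding of odd-length messages with X are assumptions made for this example, not the procedure used to create the challenges.

```python
import random
from itertools import product
from string import ascii_uppercase

def make_table(seed=None):
    """Return a random one-to-one mapping over all 26 x 26 = 676 letter pairs."""
    bigrams = ["".join(p) for p in product(ascii_uppercase, repeat=2)]
    shuffled = bigrams[:]
    random.Random(seed).shuffle(shuffled)
    return dict(zip(bigrams, shuffled))

def encrypt(plaintext, table, pad="X"):
    """Encrypt pair by pair; non-letters are dropped, odd lengths are padded (assumption)."""
    letters = [c for c in plaintext.upper() if c in ascii_uppercase]
    if len(letters) % 2:
        letters.append(pad)
    return " ".join(table[a + b] for a, b in zip(letters[::2], letters[1::2]))

def decrypt(ciphertext, table):
    """Invert the table and map every ciphertext bigram back to plaintext."""
    inverse = {cipher: plain for plain, cipher in table.items()}
    return "".join(inverse[bg] for bg in ciphertext.split())

table = make_table(seed=1)
ciphertext = encrypt("The Catharina", table)
print(ciphertext)
print(decrypt(ciphertext, table))
```

Because the table is a permutation of all 676 bigrams, decryption only needs the inverted table.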
Two bigram challenges
The best known method for breaking a bigram substitution is hill climbing. The fitness function of a bigram hill climber typically uses tetragram (letter four-tuple) or hexagram (letter six-tuple) frequencies. The literature I am aware of does not say how long a bigram substitution ciphertext needs to be for a hill climber to succeed.
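A fitness function of this kind can be sketched as follows. This is a minimal illustration and not the code of any of the solvers mentioned here: the function names, the toy corpus and the penalty for unseen n-grams are assumptions. A real hill climber would take its tetragram or hexagram statistics from a large English corpus.

```python
import math
from collections import Counter

def only_letters(text):
    """Keep the letters A-Z only, in uppercase."""
    return "".join(c for c in text.upper() if "A" <= c <= "Z")

def make_fitness(corpus, n=4):
    """Build a scoring function from the n-gram frequencies of a reference corpus."""
    ref = only_letters(corpus)
    counts = Counter(ref[i:i + n] for i in range(len(ref) - n + 1))
    total = sum(counts.values())
    logp = {gram: math.log10(c / total) for gram, c in counts.items()}
    unseen = math.log10(0.01 / total)  # assumed penalty for n-grams not in the corpus

    def fitness(candidate):
        """Sum of log-frequencies of all overlapping n-grams; higher means more English-like."""
        text = only_letters(candidate)
        return sum(logp.get(text[i:i + n], unseen) for i in range(len(text) - n + 1))

    return fitness

# Toy corpus for demonstration only.
corpus = "the ship sank in the southern atlantic ocean " * 3
fitness = make_fitness(corpus, n=4)
print(fitness("the southern ocean"))
print(fitness("naeco nrehtuos eht"))  # same letters reversed, scores lower
```

A hill climber repeatedly changes a few entries of the candidate substitution table, decrypts the ciphertext with it and keeps the change whenever this score improves.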
Two years ago, I decided to take a first step towards closing this gap. For this purpose, I took two messages – one with 2500 and one with 5000 letters – and encrypted them. Subsequently, I published the two ciphertexts as challenges (Bigram 5000 and Bigram 2500) on this blog.
My readers solved both challenges within a few days. Norbert Biermann found the solution of the 5000 letters version – still with a few mistakes – with a hill climbing software of his own design. Thomas Ernst published a few interesting word pattern considerations. Then Norbert provided a second, more sophisticated hill climbing result, which was almost error-free. Finally, Armin Krauß published the correct solution.
After breaking the 5000 letter challenge had proven quite difficult, I expected that the 2500 letter ciphertext would not be solved so soon. However, I was wrong. Only a few days later, Norbert Biermann published the correct solution of the Bigram 2500 challenge, which he had again found with his hill climber. To my knowledge, this success represented a new world record in breaking bigram substitutions.
The bigram 1346 challenge
Two years after Norbert’s record, I published another bigram challenge on this blog. This time, I took an English text consisting of 1346 letters as plaintext and called it the Bigram 1346 challenge. Unlike last time, I didn’t replace bigrams with numbers but with other bigrams. For this reason, the ciphertext consisted of letters, which had to be read pairwise. Here’s the challenge:
Last Saturday, Norbert Biermann published the solution of the Bigram 1346 challenge as a comment on my blog. Here it is:
The Catharina was a British passenger ship that sank in the southern Atlantic Ocean in nineteen fifteen after colliding with fishing boat. Of the over two thousand passengers and crew aboard, more than one thousand five hundred died. The Catharina carried some of the wealthiest people in the world, as well as emigrants from Europe who were seeking a new life in America. The first-class accommodation was designed to be the pinnacle of comfort and luxury, with an on-board fitness center, swimming pool, libraries, high-class restaurants and opulent rooms. Although Catharina had safety features such as watertight compartments and remotely activated watertight doors, it only carried enough lifeboats for a thousand people – about half the number on board. On fifteen September the Catharina hit an fishing boat. Just under two hours after Catharina sank, the freight liner Tun??? [Tundra?] arrived and brought aboard an estimated thousand survivors. The disaster was met with world-wide shock. Public inquiries in France and Canada led to major improvements in maritime safety. Additionally, several new wireless regulations were passed around the world in an effort to learn from the many missteps in wireless communications. The wreck of Catharina was discovered in in nineteen ninety-five. The ship was split in two and is gradually disintegrating at a depth of almost four kilometers. Thousands of artefacts have been recovered and displayed at museums around the world. Catharina has become one of the most famous ships in history. Catharina is the second largest ocean liner wreck in the world, only beaten by her sister RMS Bellinda.
With this success, Norbert has set a new world record for the shortest bigram ciphertext ever broken.
How Norbert found the solution
Of course, I was eager to know how Norbert had broken the challenge. So, I sent him a mail asking for information. Here’s the report he sent me:
The first thing I did was to take my old hill climbing program, which I had written for the Bigram 2500/5000 challenges. However, this software didn’t render a useful result. The ciphertext was obviously too short.
Next, my hope was that I could somehow guess a part of the plaintext and feed it to my program (it can handle cribs). So, I printed out the bigram frequencies of the cryptogram and compared them with the usual frequencies in an English text. In addition, I looked for some eye-catching patterns.
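A frequency count and pattern search of this kind could look like the following Python sketch. It is an illustration only and not Norbert's program: the function names are assumptions, positions are reported as 1-based letter positions by assumption, and the short sample string is made up from bigrams quoted in this report rather than taken from the actual challenge ciphertext.

```python
from collections import Counter

def to_pairs(ciphertext):
    """Split a ciphertext into consecutive letter pairs, ignoring spaces."""
    letters = "".join(c for c in ciphertext.upper() if "A" <= c <= "Z")
    return [letters[i:i + 2] for i in range(0, len(letters) - 1, 2)]

def bigram_frequencies(ciphertext):
    """Return (bigram, count, percent) tuples, most frequent first."""
    pairs = to_pairs(ciphertext)
    counts = Counter(pairs)
    return [(bg, n, 100.0 * n / len(pairs)) for bg, n in counts.most_common()]

def find_pattern(ciphertext, pattern):
    """Return the 1-based letter positions at which a bigram sequence occurs."""
    pairs, wanted = to_pairs(ciphertext), to_pairs(pattern)
    return [2 * i + 1
            for i in range(len(pairs) - len(wanted) + 1)
            if pairs[i:i + len(wanted)] == wanted]

sample = "UN KF VO UN BI UN VO WP SG"  # made-up sample, not the real ciphertext
for bigram, count, percent in bigram_frequencies(sample):
    print(f"{bigram} {count} {percent:.1f}%")
print(find_pattern(sample, "UN VO WP SG"))
```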
I soon saw that UN was not only the first bigram but also by far the most frequent one (3.7%) in the ciphertext. Therefore, I assumed that it stood for the most frequent English bigram, “th”. This was, of course, a pure working hypothesis, but I had to start somewhere. At least, it is very conceivable that the text starts with “The”. Next, I looked for bigrams that formed interesting patterns around UN, and here’s what I found:
1) UN KF VO UN (Pos. 47)
2) UN VO WP SG (Pos. 129, 171, 587 and 1141)
3) VO SG UN (Pos. 937, 1201).
If UN represents “th”, what could VO and SG stand for? Both bigrams appear frequently in the ciphertext (SG: 1.6%, VO: 1.3%) and are here combined with UN in different patterns. I only came up with one option: VO=“ou” and SG=“nd”. Pattern 1) could now be decrypted to “the south”, pattern 2) to “thousand” and pattern 3) to “…ound th…” (for example “found the” or “around the”).
Next, I assumed that BI stood for “in” – first, because BI is the second most frequent bigram in the ciphertext and therefore should correspond with a common plaintext bigram; second, because “in the south” makes sense; and third, because of the conspicuous pattern BI BI in position 1041. The most common double bigram in English is “thth” (0.044% in my corpus), but “th” is already assigned. The second most common is “inin” (0.023%). This guess turned out to be correct, but it was mere coincidence, because the tetragram BI BI appeared in the plaintext only because of a typo: “was discovered in in nineteen ninety-five”. For the even more conspicuous pattern that starts with BI BI, namely BI BI FJ MH KV FL FJ MH, I had no idea at all …
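The reasoning on the three patterns can be replayed with a few lines of Python. This is only a sketch of the guessing step under the three assumed assignments (UN=th, VO=ou, SG=nd); the helper names are hypothetical and unknown bigrams are shown as ??.

```python
import re

def partial_decrypt(pattern, key):
    """Replace known ciphertext bigrams by plaintext, unknown ones by '??'."""
    return " ".join(key.get(bg, "??") for bg in pattern.split())

def fits(partial, candidate):
    """True if the candidate phrase contains the partial decryption ('?' matches any letter)."""
    regex = partial.replace(" ", "").replace("?", ".")
    letters = re.sub(r"[^a-z]", "", candidate.lower())
    return re.search(regex, letters) is not None

key = {"UN": "th", "VO": "ou", "SG": "nd"}  # working hypotheses from the report
for pattern in ("UN KF VO UN", "UN VO WP SG", "VO SG UN"):
    print(pattern, "->", partial_decrypt(pattern, key))

print(fits(partial_decrypt("UN KF VO UN", key), "the south"))
print(fits(partial_decrypt("UN VO WP SG", key), "thousand"))
print(fits(partial_decrypt("VO SG UN", key), "around the"))
```

The gaps left by KF and WP are exactly the letter pairs that “the south” and “thousand” fill in.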
In the next step, I fed “in the south” and “thousand” to my hill climbing program. However, it didn’t deliver a usable result. Either I didn’t have enough crib material, or my guesses were wrong.
Next, I rewrote my hill climbing program and implemented two improvements:
- I used hexagram frequencies instead of pentagram frequencies in the fitness function.
- I improved the program’s ability to concentrate on smaller numbers of different bigrams. Before, the software had to deal with all 218 different ciphertext digrams and try to find the right ones from 676 possible plaintext digrams, which gives a keyspace of about 6.0 × 10^599. In the new version, the program was allowed to restrict the ciphertext character set under examination to a few bigrams, e.g. to the 65 most common ones (in our case, these are all that occur at least four times in the ciphertext). After this reduction, only 164 (different) connected hexagrams remain that can be evaluated. At the same time, I also artificially limited the number of plaintext digrams to be considered – in this case (purely arbitrarily) to the 190 most common ones, which reduces the keyspace to approximately 5.1 × 10^142. Overall, I obtained an enormously reduced keyspace in exchange for a considerably shortened ciphertext and the risk of having cut away the correct plaintext digram in some cases. It was worth a try.
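The two keyspace figures are consistent with counting the ordered selections of distinct plaintext digrams for the ciphertext digrams, i.e. 676!/(676−218)! and 190!/(190−65)!. This reading is an assumption, as the report does not give the formula. A short Python check (the function names are chosen for this example, and `math.perm` needs Python 3.8 or later):

```python
import math

def keyspace(plain_digrams, cipher_digrams):
    """Ways to assign distinct plaintext digrams to each ciphertext digram: n! / (n - k)!."""
    return math.perm(plain_digrams, cipher_digrams)

def scientific(n):
    """Return (mantissa, exponent) such that n is about mantissa * 10**exponent."""
    log = math.log10(n)  # math.log10 also accepts very large integers
    exponent = math.floor(log)
    return 10 ** (log - exponent), exponent

for plain, cipher in ((676, 218), (190, 65)):
    mantissa, exponent = scientific(keyspace(plain, cipher))
    print(f"{cipher} of {plain}: about {mantissa:.1f} x 10^{exponent}")
```

Under this assumption, the two results come out at roughly the values quoted above, so the reduction shrinks the keyspace by more than 450 orders of magnitude.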
In the described new design (and without cribs), my program actually found a potential solution fragment. It looked as follows (uppercase letters are ciphertext, lowercase letters are plaintext):
th GO at ha ri na GV sa IV or is KE as se ng er ma UB th at sa IE in th es ou th er na PP an ti co BA an
Some plaintext digrams were still wrong, but I could already see: “The Catharina” (or maybe “Katharina”), “passenger” and “that sank in the southern Atlantic Ocean”. Soon the sequel “in nineteen fifteen” was also clear to me.
So, I wildly searched on the Internet for a ship named Catharina, which sank in the South Atlantic in 1915, but to no avail. At some point, I realized that Klaus had used a Wikipedia text about the Titanic with changed names, dates, and places. The rest of the deciphering went very fast, as I could easily guess many words. Otherwise, there would certainly have been some gaps left.
Thanks, Norbert, and congratulations on this great work! This is the third major codebreaking success I have the honor to blog about within a week. I am deeply impressed by Norbert’s and my other readers’ deciphering skills. I am proud to have such smart people among my readers.
Further reading: Mathematical formula needed