Frequency Analysis


Frequency analysis is the study of how often letters, or groups of letters, occur in a ciphertext. The oldest known description was written by Al-Kindi in the 9th century. The attack is used to break monoalphabetic ciphers, which work by a simple, fixed substitution of letters. We know that if the letter a in the plaintext is encrypted to x, then every x in the ciphertext decrypts back to a.
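To make the "fixed substitution" idea concrete, here is a minimal sketch of such a cipher in Python. The names make_key, encrypt and decrypt are my own for illustration, and the key is simply a random shuffle of the alphabet; the message analyzed later was of course produced with some other, unknown key.

```python
import random
import string

ALPHABET = string.ascii_uppercase


def make_key(seed=None):
    """Return a random substitution key: plaintext letter -> ciphertext letter."""
    rng = random.Random(seed)
    shuffled = list(ALPHABET)
    rng.shuffle(shuffled)
    return dict(zip(ALPHABET, shuffled))


def encrypt(plaintext, key):
    """Substitute every letter; spaces and punctuation pass through unchanged."""
    return "".join(key.get(ch, ch) for ch in plaintext.upper())


def decrypt(ciphertext, key):
    """Invert the key and substitute back."""
    inverse = {cipher: plain for plain, cipher in key.items()}
    return "".join(inverse.get(ch, ch) for ch in ciphertext.upper())


key = make_key(seed=1)
secret = encrypt("ATTACK AT DAWN", key)
print(secret)
print(decrypt(secret, key))
```

Because the mapping never changes, every A in the plaintext becomes the same ciphertext letter, and that is exactly the weakness frequency analysis exploits.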

But how can we attack the ciphertext without knowing the encryption key? Brute force? That may take too long even for some basic forms of encryption: a general substitution cipher over 26 letters has 26! (about 4 × 10^26) possible keys. This is where frequency analysis comes to the rescue. Knowing how often each letter occurs in a language, we can analyze the ciphertext and substitute its most frequent letters with the letters that are most frequent in that language. Let's take a look at a chart showing those frequencies in English.

Letter frequency in English

As we can see, it's easy to determine which letters are the most common and which are the least. Remember that for a correct analysis we need as much text as possible, as longer texts give more accurate results. In addition, we know that there are many common letter pairs:

TH, EA, OF, TO, IN, IT, IS, BE, AS, AT, SO, WE, HE, BY, OR, ON, DO, IF, ME, MY, UP

Repeated letters:

SS, EE, TT, FF, LL, MM and OO

Common triplets:

THE, EST, FOR, AND, HIS, ENT or THA
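Pairs, doubled letters and triplets can be counted in a ciphertext the same way as single letters. The following is a rough sketch; ngram_counts and doubled_letters are hypothetical helper names, and I am assuming that groups should only be counted inside words, never across spaces or punctuation.

```python
import re
from collections import Counter


def ngram_counts(text, n):
    """Count n-letter sequences inside words (never across spaces or punctuation)."""
    counts = Counter()
    for word in re.findall(r"[A-Za-z]+", text.upper()):
        for i in range(len(word) - n + 1):
            counts[word[i:i + n]] += 1
    return counts


def doubled_letters(text):
    """Count only the pairs made of the same letter twice, such as LL or SS."""
    pairs = ngram_counts(text, 2)
    return Counter({pair: c for pair, c in pairs.items() if pair[0] == pair[1]})


sample = "the weather in the north"
print(ngram_counts(sample, 2).most_common(3))
print(ngram_counts(sample, 3).most_common(3))
print(doubled_letters("well off"))
```

In a ciphertext, the most frequent three-letter group is a good candidate for THE, and a frequent doubled letter is likely to be one of the pairs listed above.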

Now let's use this knowledge to show how this attack works.

Attack

Let's consider this message:

KRBGXZ QHEG ZGLXQT PRZUJHXGU LWLHZUV VNG PRQQGPVHFG. OG LXG VRXKGZVGB DT L XGQGZVQGUU EQRO RE HZERXKLVHRZ LU OGQQ LU VNG BLHQT ORXXHGU RE LZ GVGXZLQQT HZUGPYXG, YZOLXXLZVGB QHEG. EYXVNGXKRXG, OG BXGLB VNG VNRYWNV RE DGHZW LQHMG, RE UNLXHZW KYQVHJQG FHGOU LZB RJHZHRZU. LU UYPN, OG LXG VYXZHZW JXRWXGUUHFGQT CYBWGKGZVLQ RE ONR OG UNRYQB DG JLXVZGXHZW OHVN, RZ VNG DLUHU VNLV "VNGT BR ZRV YZBGXUVLZB". HZ NLPMHZW, HV TGV HKJQHPLVGU RZ VNG BGQHPLVG UYDCGPV RE VXYUV, ONHPN ORYQB XGIYHXG LZ GUULT RZ HVUGQE, WHFGZ VNG YZBGZHLDQG HKJRXVLZPG VNG KLVVGX NLU LPIYHXGB RFGX VNG TGLXU.

First, let's analyze the frequency of letters. I used an online tool to create the graph.

Letter frequency in message
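If you would rather not rely on a website, the same counts can be reproduced with a few lines of Python. This is only a sketch, and letter_frequencies is a name I made up; it ignores everything except the letters A to Z.

```python
from collections import Counter


def letter_frequencies(text):
    """Return (letter, count, percent) tuples, most frequent letter first."""
    letters = [ch for ch in text.upper() if "A" <= ch <= "Z"]
    total = len(letters)
    counts = Counter(letters)
    return [(letter, count, 100 * count / total)
            for letter, count in counts.most_common()]


# First sentence of the message, as a short demonstration.
sentence = "KRBGXZ QHEG ZGLXQT PRZUJHXGU LWLHZUV VNG PRQQGPVHFG."
for letter, count, percent in letter_frequencies(sentence)[:3]:
    print(f"{letter}: {count} ({percent:.1f}%)")
```

Even in this single sentence G already stands out; running the function on the whole message gives a much more reliable ranking.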

Now that we know G is the most common letter, let's change it to e. Then we can also change V to t, as it's the second most common letter. After these substitutions we can already see more clues.

tNe is definitely the word the, so let's change N to h. Next, thLt must be that, meaning L -> a, and theT looks like then, so T -> n. Remember to keep these replacements case sensitive: uppercase letters are still ciphertext, lowercase letters are our plaintext guesses!
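Doing these replacements by hand gets tedious, so here is a small sketch of how they could be applied in one pass. apply_guesses is a hypothetical helper; it assumes the ciphertext is all uppercase and writes every guessed letter in lowercase, so solved and unsolved letters can never be confused.

```python
def apply_guesses(ciphertext, guesses):
    """Replace each guessed ciphertext letter (uppercase) with its
    plaintext guess (lowercase); everything else is left untouched."""
    return "".join(guesses.get(ch, ch) for ch in ciphertext)


# The guesses made so far in this walkthrough.
guesses = {"G": "e", "V": "t", "N": "h", "L": "a", "T": "n"}

print(apply_guesses("ZGLXQT PRZUJHXGU LWLHZUV VNG PRQQGPVHFG", guesses))
# ZeaXQn PRZUJHXeU aWaHZUt the PRQQePtHFe
```

Each character is looked up exactly once, so a letter that has already been replaced is never replaced again by a later guess, which can happen with chained search-and-replace in a text editor.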

KRBeXZ QHEe ZeaXQn PRZUJHXeU aWaHZUt the PRQQePtHFe. Oe aXe tRXKeZteB Dn a XeQeZtQeUU EQRO RE HZERXKatHRZ aU OeQQ aU the BaHQn ORXXHeU RE aZ eteXZaQQn HZUePYXe, YZOaXXaZteB QHEe. EYXtheXKRXe, Oe BXeaB the thRYWht RE DeHZW aQHMe, RE UhaXHZW KYQtHJQe FHeOU aZB RJHZHRZU. aU UYPh, Oe aXe tYXZHZW JXRWXeUUHFeQn CYBWeKeZtaQ RE OhR Oe UhRYQB De JaXtZeXHZW OHth, RZ the DaUHU that "then BR ZRt YZBeXUtaZB". HZ haPMHZW, Ht net HKJQHPateU RZ the BeQHPate UYDCePt RE tXYUt, OhHPh ORYQB XeIYHXe aZ eUUan RZ HtUeQE, WHFeZ the YZBeZHaDQe HKJRXtaZPe the KatteX haU aPIYHXeB RFeX the neaXU.

Here's our message in its current state. Do we have anything more? The word haU is possibly has, so U becomes s. Then we have Dn, where D would be i. The word aXe is probably are, which changes X into r.

After that, the word trYst appears, meaning that Y can be u. Now maybe Oe is the word we?

KRBerZ QHEe ZearQn PRZsJHres aWaHZst the PRQQePtHFe. we are tRrKeZteB in a reQeZtQess EQRw RE HZERrKatHRZ as weQQ as the BaHQn wRrrHes RE aZ eterZaQQn HZsePure, uZwarraZteB QHEe. EurtherKRre, we BreaB the thRuWht RE ieHZW aQHMe, RE sharHZW KuQtHJQe FHews aZB RJHZHRZs. as suPh, we are turZHZW JrRWressHFeQn CuBWeKeZtaQ RE whR we shRuQB ie JartZerHZW wHth, RZ the iasHs that "then BR ZRt uZBerstaZB". HZ haPMHZW, Ht net HKJQHPates RZ the BeQHPate suiCePt RE trust, whHPh wRuQB reIuHre aZ essan RZ HtseQE, WHFeZ the uZBeZHaiQe HKJRrtaZPe the Katter has aPIuHreB RFer the nears.

Can you see the word turZHZW? I think it is the word turning, as the letter pattern fits, but that would make Z decrypt to n, which we already assigned to T, so there is an error somewhere in the previous steps. I'll write these new letters in uppercase, so we know that they are still guesses.
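In a monoalphabetic cipher two different ciphertext letters can never decrypt to the same plaintext letter, so this kind of clash can also be detected automatically. A minimal sketch, with find_conflicts being a name of my own choosing:

```python
def find_conflicts(guesses):
    """Return the plaintext letters claimed by more than one ciphertext letter."""
    claimed = {}
    for cipher, plain in guesses.items():
        claimed.setdefault(plain.lower(), []).append(cipher)
    return {plain: sorted(ciphers)
            for plain, ciphers in claimed.items() if len(ciphers) > 1}


# Earlier guesses plus the new idea that turZHZW is "turning".
guesses = {"G": "e", "V": "t", "N": "h", "L": "a", "T": "n",
           "U": "s", "X": "r", "Y": "u", "O": "w",
           "Z": "n", "H": "i", "W": "g"}

print(find_conflicts(guesses))
# {'n': ['T', 'Z']}
```

The check only says that one of the two guesses must be wrong, not which one; deciding that still takes a look at the words involved.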

KRBerN QIEe NearQn PRNsJIres aGaINst the PRQQePtIFe. we are tRrKeNteB in a reQeNtQess EQRw RE INERrKatIRN as weQQ as the BaIQn wRrrIes RE aN eterNaQQn INsePure, uNwarraNteB QIEe. EurtherKRre, we BreaB the thRuGht RE ieING aQIMe, RE sharING KuQtIJQe FIews aNB RJINIRNs. as suPh, we are turNING JrRGressIFeQn CuBGeKeNtaQ RE whR we shRuQB ie JartNerING wIth, RN the iasIs that "then BR NRt uNBerstaNB". IN haPMING, It net IKJQIPates RN the BeQIPate suiCePt RE trust, whIPh wRuQB reIuIre aN essan RN ItseQE, GIFeN the uNBeNIaiQe IKJRrtaNPe the Katter has aPIuIreB RFer the nears.

Here we can see a lot more: whIPh can be which and GIFeN can be given. Also, ItseQE can be itself and shRuQB should be should. Now I can see the first error!

Nearln should be the word nearly, so n -> y. Likewise, INforKatIoN should be information. The phrase ieING alIMe can be read as being alike (M is k, as haPMING is hacking, and v is already taken by F in GIFeN), while oJINIoNs is opinions. Now we are very close to the end. There is one more word, CudGemeNtal, which is supposed to be judgemental. At this point we just have to take the last few steps to correct the text, using our knowledge of the language. Here is the final message.

modern life nearly conspires against the collective. we are tormented by a relentless flow of information as well as the daily worries of an eternally insecure, unwarranted life. furthermore, we dread the thought of being alike, of sharing multiple views and opinions. as such, we are turning progressively judgemental of who we should be partnering with, on the basis that "they do not understand". in hacking, it yet implicates on the delicate subject of trust, which would require an essay on itself, given the undeniable importance the matter has acquired over the years.
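Once a full plaintext candidate exists, it is worth checking it against the ciphertext letter by letter, because a guessed word can read well and still break the one-to-one mapping. The sketch below assumes both texts have the same length and the same spacing; derive_key is a hypothetical helper, not part of any library.

```python
def derive_key(ciphertext, plaintext):
    """Build the ciphertext -> plaintext mapping from two aligned texts.

    Raises ValueError if a ciphertext letter would need two different
    plaintext letters, or if two ciphertext letters share a plaintext letter.
    """
    if len(ciphertext) != len(plaintext):
        raise ValueError("texts must have the same length")
    key = {}
    for cipher, plain in zip(ciphertext.upper(), plaintext.lower()):
        if not cipher.isalpha():
            continue
        if key.setdefault(cipher, plain) != plain:
            raise ValueError(f"{cipher} maps to both {key[cipher]} and {plain}")
    if len(set(key.values())) != len(key):
        raise ValueError("two ciphertext letters map to the same plaintext letter")
    return key


print(derive_key("VNG PRQQGPVHFG", "the collective"))
```

Passing the whole ciphertext and the whole decrypted message either returns the complete key or points at the first letter that does not fit.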

Great, we broke the cipher and got the message! Thanks Phrack!

Keep learning and stay safe! ~ W3ndige