A simple N-gram calculator: pyngram

May 20, 2010 |

0 min read

Updated v1.0.1 5/21/2010 - Improved the exception handling, and changed xrange(len(inputstring)) to xrange(len(inputstring)-nlen+1)). Thanks to colleague Arik Baratz!

Recently, as I was trying to solve a cryptogram, I wrote a tool to parse the bigrams and trigrams from the ciphertext, tally the frequency, and then display the results sorted from most to least frequently occurring bigram and trigram.

First, a quick history of why I did this and how this was handy.

One of the ways to solve a substitution cipher is to do a frequency analysis. Here's a typical distribution of letters in the English language. Just as it is obvious that the letter 'e' is by far the most popular in the English language, you can also calculate the most frequently occurring bigram (2 consecutive characters) and trigram (3 consecutive characters). In English, the top most frequently occurring bigrams are 'th' (1.52%), 'he' (1.28%), 'in' (0.94%) (full list from Wikipedia here). For trigrams, the most popular are ' th' (note the leading whitespace), 'he ' (trailing whitespace), followed by 'the' (full list here). The biggest assumption here is that the plaintext is in English. If it's in say, German, then you'll have to find the corresponding statistical distribution (Wikipedia has the 1-gram frequency distribution for other languages here).

Whatever the plaintext's (human) language is, you'd have to find the top n-grams occurring in the ciphertext first—and that's what this calculator will do for you. You can import the python module and call the function calc_ngram, or just invoke the .py file from your *nix command line.

Example usage from python shell:

>>> from pyngram import calc_ngram
>>> # calc_ngram(inputstring, n-gram size)
>>> results = calc_ngram('bubble bobble, bubble bobble, bubble bobble', 3)                        
>>> for l in results: print l[0] + ' occured ' + str(l[1]) + ' times'
...
bbl occurred 6 times
ble occurred 6 times
le  occurred 3 times
obb occurred 3 times
 bo occurred 3 times
e b occurred 3 times
ubb occurred 3 times
bub occurred 3 times
bob occurred 3 times
, b occurred 2 times
 bu occurred 2 times
le, occurred 2 times
e,  occurred 2 times

Here's the source code (download (Please visit the site to view this media)), or better yet, install it with 'sudo pip install pyngram' (from Python's Package Index "Cheeseshop").

#!/usr/bin/env python

# A simple Python n-gram calculator. 
#
# Given an arbitrary string, and the value of n
# as the size of the n-gram (int), this code 
# snip will show you the results, sorted from
# most to least frequently occurring n-gram.
#
# The 'sort by value' operation for the dict 
# follows the PEP 265 recommendation.
#
# Installation:
#
# user@host:~$ sudo pip install pyngram
#
# Quick start:
#
# from pyngram import calc_ngram
# calc_ngram('bubble bobble, bubble bobble, bubble bobble', 3)
#
# Enjoy!
#

__version__ = '1.0'
__author__ = 'Jay Liew' # @jaysern from @websenselabs
__license__ = 'MIT'

from operator import itemgetter

def calc_ngram(inputstring, nlen):
    if nlen < 1:
        raise ValueError, """Uh, n-grams have to be of size 1 or greater. \ 
It makes no sense to have a 0 length n-gram."""

    if len(inputstring) < 1:
        raise ValueError, """umm yeah, ... the inputstring has to be longer than 1 char ;)"""

    # now, fish out the n-grams from the input string
    ngram_list = [inputstring[x:x+nlen] for x in xrange(len(inputstring)-nlen+1)]

    ngram_freq = {} # dict for storing results

    for j in ngram_list: # collect the distinct n-grams and count
        if j in ngram_freq:
            ngram_freq[j] += 1 
        else:
            ngram_freq[j] = 1 # human counting numbers start at 1

    # set reverse = False to change order of sort (ascending/descending)
    return sorted(ngram_freq.iteritems(), key=itemgetter(1), reverse=True)

if __name__ == '__main__':
    inputstring = raw_input('Enter input string: ')
    nlen_str = raw_input('Enter size of n-gram (int): ')
    nlen = int(nlen_str) # cast string to int
    
    for t in calc_ngram(inputstring, nlen):
        print t[0] + ' occurred ' + str(t[1]) + ' times'

In the Article

X-Labs

Get insight, analysis & news straight to your inbox

To the Point

Cybersecurity

A Podcast covering latest trends and topics in the world of cybersecurity

Listen Now

Forcepoint DSPM

Forcepoint DDR

Forcepoint DLP

Forcepoint Web Security

Forcepoint Cloud App Security

Risk-Adaptive Protection

Forcepoint DLP for Email

Data Classification

NGFW

SD-WAN

Forcepoint DSPM + DDR

How to Protect Your Data Everywhere

See Our Data Detection & Response (DDR) Software in Action

Forcepoint Data Security Cloud

Prevent Data Loss

Compliance Readiness

Safely Enable GenAI

AI Data Classification

Unified Breach and Incident Readiness

Protect Data in ChatGPT

Data Access Governance

Automate Insider Risk Protection

Comprehensive Email Security

BYOD Security

Secure Microsoft 365 and Copilot

Data Risk Mitigation

Financial Services

Healthcare

Manufacturing

Retail

Public Sector

Clustering in a Way No One Else Can Do

Pinpoint Accuracy and Transparent Reporting

Blogs

Videos

Webcasts

Podcasts

Analyst Reports

Customer Stories

Resource Library

Training & Certifications

Cyber EDU

GUIDE

The Practical Executive’s Guide to Data Loss Prevention

GUIDE

Forcepoint AI Mesh

ANALYST REPORT

Read the Gartner: 2025 Market Guide for Data Loss Prevention

RISK ASSESSMENT

Get Free Data Risk Assessment for OneDrive

Our Approach

Our Customers

Forcepoint vs. Varonis

Forcepoint vs. Cyera

About Us

Newsroom

Work With Us

Contact Us

Forcepoint Trust Hub

VAKIFBANK Strengthens Security Posture and Compliance Reporting

Eczacıbaşı Holding Extends Security to the Cloud to Protect Remote Staff

Communisis Modernizes Their Forcepoint DLP with Risk-Adaptive Protection