What Is an N-Gram?

An n-gram is a contiguous sequence of n items in a text document. The items are typically words, but can also include numbers, symbols, and punctuation. For example, the sentence "the robot arm moves" contains the 2-grams (bigrams) "the robot", "robot arm", and "arm moves". N-gram models are useful in text analytics applications where the order of words matters, such as sentiment analysis, text classification, and text generation. N-gram modeling is one of many techniques for converting text from an unstructured format to a structured format. Word embedding techniques, such as word2vec, are an alternative to n-grams.
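The definition above can be sketched as a sliding window over a list of tokens. The following Python snippet is a minimal illustration only: the function name `ngrams` is hypothetical, and splitting on whitespace is an assumed, simplistic tokenization that is not taken from any toolbox.

```python
def ngrams(tokens, n):
    """Return every contiguous window of n tokens as a tuple."""
    if n < 1:
        raise ValueError("n must be a positive integer")
    # A document with fewer than n tokens yields an empty list.
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


# Assumption: whitespace splitting stands in for a real tokenizer.
tokens = "the robot arm moves the robot arm".split()
bigrams = ngrams(tokens, 2)
trigrams = ngrams(tokens, 3)
```

A document with k tokens produces k - n + 1 n-grams, so the seven tokens here give six bigrams and five trigrams.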

Example

A language model incorporating n-grams can be created by counting the number of times each unique n-gram appears in a document. This is known as a bag-of-n-grams model. In MATLAB, a bag-of-n-grams model can be created using the bagOfNgrams function.
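To make the counting step concrete, here is a small Python sketch of a bag-of-n-grams built with the standard library. It is not the MATLAB bagOfNgrams implementation; the function name `bag_of_ngrams`, the lowercasing, and the whitespace tokenization are all assumptions made for illustration.

```python
from collections import Counter


def bag_of_ngrams(documents, n=2):
    """Count how often each n-gram occurs across a list of text documents."""
    counts = Counter()
    for doc in documents:
        # Assumption: lowercase + whitespace split as a stand-in tokenizer.
        tokens = doc.lower().split()
        counts.update(
            tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)
        )
    return counts


docs = ["The robot arm moves", "A robot arm lifts the box"]
bag = bag_of_ngrams(docs, n=2)
most_frequent = bag.most_common(1)  # the single most frequent bigram and its count
```

In this sample, the bigram ("robot", "arm") occurs in both documents and so has a count of 2, while every other bigram has a count of 1. Counts like these are what a word cloud of n-grams visualizes.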

Figure: Word cloud of n-grams with n = 2 (bigrams). More prominent bigrams, such as "robot arm" and "construct agent", appear larger and in orange; less prominent ones appear in black, decreasing in size.

Once the language model is built, it can then be used with machine learning algorithms to build predictive models for text analytics applications. To learn more about n-grams and building models with text data, see Text Analytics Toolbox™, for use with MATLAB®.

Examples and How To

Analyze Text Data Using Multiword Phrases - Example
Analyze Sentiment in Text - Example
Classify Text Data Using Convolutional Neural Network - Example
Text Analytics in MATLAB (23:35) - Video

Software Reference

bagOfNgrams: Bag-of-n-grams model - Function
topkngrams: Most frequent n-grams - Function
removeNgrams: Remove n-grams from bag-of-n-grams model - Function
replaceNgrams: Replace n-grams in documents - Function
context: Search documents for word or n-gram occurrences in context - Function
join: Combine multiple bag-of-words or bag-of-n-grams models - Function
encode: Encode documents as matrix of word or n-gram counts - Function

Getting Started with Text Analytics in MATLAB - White Paper