A Complete Guide to Compact Prediction Trees

Swagata Ashwani
3 min read · Jun 6, 2022



Introduction

Compact Prediction Trees (CPTs) are a type of algorithm used to solve sequence prediction problems. Recently, they have gained a lot of attention due to their fast training and prediction efficiency. LSTMs/RNNs suffer from one major drawback: long training times, plus the need to retrain when unseen items appear. Enter CPT!

Compact Prediction Tree

To understand CPT, we first need to understand the three data structures that comprise it.

  1. Trie: Also called the prediction tree, this stores the training sequences in a trie data structure. Example: let's say we have the following sequences:

Sequence 1: A,B,C

Sequence 2: A,B,D

The trie will first add A from Sequence 1 as a child of the root, then B under A, and then C under B. For Sequence 2, it reuses the existing A → B path and simply adds D as another child of B. Sequences that do not share a prefix start a new branch from the root, as sketched below.
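Here is a minimal sketch of how such a trie could be built. The TrieNode class and its fields are illustrative, not the actual library's implementation:

# A minimal, illustrative trie: each node keeps its item, its parent
# (useful later for walking back from a terminal node), and its children.
class TrieNode:
    def __init__(self, item=None, parent=None):
        self.item = item
        self.parent = parent
        self.children = {}  # item -> TrieNode

    def insert(self, item):
        # Reuse an existing child so shared prefixes are stored only once
        if item not in self.children:
            self.children[item] = TrieNode(item, parent=self)
        return self.children[item]

root = TrieNode()
for seq in [["A", "B", "C"], ["A", "B", "D"]]:
    node = root
    for item in seq:
        node = node.insert(item)  # Sequence 2 reuses the A -> B path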

2. Inverted Index

The Inverted Index is a dictionary where the key is an item and the value is the set of sequences in which that item appears. For example:

Sequence 1: A,B,C,D
Sequence 2: A,B,D
Sequence 3: B,C

The Inverted Index for the above sequences will look like this:

II = {
'A': {'Seq1', 'Seq2'},
'B': {'Seq1', 'Seq2', 'Seq3'},
'C': {'Seq1', 'Seq3'},
'D': {'Seq1', 'Seq2'}
}
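Building this index is straightforward. A sketch (the variable names here are mine):

from collections import defaultdict

# Sketch: map each item to the set of IDs of sequences containing it
sequences = {
    'Seq1': ['A', 'B', 'C', 'D'],
    'Seq2': ['A', 'B', 'D'],
    'Seq3': ['B', 'C'],
}

inverted_index = defaultdict(set)
for seq_id, seq in sequences.items():
    for item in seq:
        inverted_index[item].add(seq_id)

print(inverted_index['B'])  # the set {'Seq1', 'Seq2', 'Seq3'}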

3. Lookup Table

A Lookup Table is a dictionary whose key is the sequence ID and whose value is the terminal node of that sequence in the prediction tree. For example:

Sequence 1: A, B, C
Sequence 2: A, B, D

LT = {
'Seq1': node(C),
'Seq2': node(D)
}

Now that we understand the building blocks of CPT, let's see how training and prediction work.

Model

Training:

There are three steps in the training phase for each sequence (a combined sketch follows the list):

  • Insert the sequence in the trie.
  • Add the elements in the inverted index.
  • Add the location of the last node in the lookup table.
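The following sketch combines all three steps under the same illustrative structures as above; it is not the library's actual code:

from collections import defaultdict

class TrieNode:
    def __init__(self, item=None, parent=None):
        self.item, self.parent, self.children = item, parent, {}

def train(sequences):
    root = TrieNode()
    inverted_index = defaultdict(set)
    lookup_table = {}
    for seq_id, seq in enumerate(sequences):
        node = root
        for item in seq:
            # Step 1: insert the item into the trie
            node = node.children.setdefault(item, TrieNode(item, node))
            # Step 2: record this sequence ID in the inverted index
            inverted_index[item].add(seq_id)
        # Step 3: store the terminal node of the sequence
        lookup_table[seq_id] = node
    return root, inverted_index, lookup_table

root, ii, lt = train([['A', 'B', 'C'], ['A', 'B', 'D'], ['B', 'C'], ['B', 'A', 'C']])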

Prediction:

There are three steps in the prediction phase as well:

  • Find sequences similar to the target sequence.
  • Find the consequent of each similar sequence with respect to the target sequence.
  • Count the occurrences of each element across all the consequents.

Let's understand this with an example. Say we have the following sequences in our training data:

Sequence 1: A,B,C
Sequence 2: A,B,D
Sequence 3: B,C
Sequence 4: B,A,C

Our data structures will look like this:

Trie:

root
├── A
│   └── B
│       ├── C   (end of Sequence 1)
│       └── D   (end of Sequence 2)
└── B
    ├── C       (end of Sequence 3)
    └── A
        └── C   (end of Sequence 4)

Inverted Index (using 0-based sequence IDs):

II = {
'A': {0, 1, 3},
'B': {0, 1, 2, 3},
'C': {0, 2, 3},
'D': {1}
}

Lookup Table (with the same 0-based IDs):

LT = {
0: node(C),
1: node(D),
2: node(C),
3: node(C)
}

Now let's try to predict what should come after A, B.

First, we compute the similar sequences.
The sequences similar to a target sequence S are the sequences containing every item of S, in any order and at any position.
To compute similar sequences, we use the inverted index and intersect the entries for each item of S. For our example, we intersect {0, 1, 3} (IDs of sequences where A appears) and {0, 1, 2, 3} (IDs of sequences where B appears), which gives {0, 1, 3}.
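With the inverted index in hand, this step is a one-line set intersection. A minimal sketch on the toy data:

# Sketch: similar sequences = intersection of the inverted-index entries
# for every item of the target sequence (sequence IDs are 0-based)
ii = {'A': {0, 1, 3}, 'B': {0, 1, 2, 3}, 'C': {0, 2, 3}, 'D': {1}}
target = ['A', 'B']
similar = set.intersection(*(ii[item] for item in target))
print(similar)  # {0, 1, 3}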

Then, we compute the consequent sequences.
The consequent of a sequence Y with respect to a sequence S is the subsequence of Y starting after the last item in common with S, until the end of Y.
The consequents for our example are "C", "D", and "C" (for sequences 0, 1, and 3 with respect to "AB").

Finally, we simply count the occurrences of each element across all the consequents and predict the element with the most occurrences, which is "C" (2 occurrences) in our example.
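Here is a minimal sketch of these last two steps on the toy data (the helper logic is illustrative, not the library's implementation):

from collections import Counter

sequences = [['A', 'B', 'C'], ['A', 'B', 'D'], ['B', 'C'], ['B', 'A', 'C']]
target = ['A', 'B']
similar = [0, 1, 3]  # computed above via the inverted index

counts = Counter()
for seq_id in similar:
    seq = sequences[seq_id]
    # Index just past the last item of seq that also appears in the target
    last_common = max(i for i, item in enumerate(seq) if item in target) + 1
    counts.update(seq[last_common:])  # the consequent

print(counts.most_common(1))  # [('C', 2)] -> predict C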

Python usage:

# Importing CPT
from CPT import *

# Creating an object of the CPT class
model = CPT()

# Reading the train and test files and converting them to data and target
data, target = model.load_files("./data/train.csv", "./data/test.csv")

# Training the model
model.train(data)

# Making predictions on the test dataset
predictions = model.predict(data, target, 5, 1)
