
2015/02/19

Pattern Discovery - FPGrowth

WARNING: THIS IS PROBABLY WRONG OR INCOMPLETE

FPGrowth - Frequent Pattern Growth

This method is considerably more complex than Apriori or ECLAT, trading more elaborate processing and larger data structures for a vastly reduced number of dataset scans. It works by creating tree data structures called FP-Trees using only two scans of the database, and all further mining is performed directly on the trees. FPGrowth outputs the complete set of frequent patterns.

FP-Tree

An FP-Tree is a tree structure starting with a root "null" node, whose other nodes each contain an item and its support. Each node may have an arbitrary number of children. An item may appear in multiple branches, but only once within the same branch. Each distinct branch represents an itemset or subset as found in one or more transactions in the database. During processing, the tree is walked both down towards the tips and up towards the root.

Each tree is accompanied by a collection of node lists, one list per distinct item in the dataset. Each list contains all of the associated item's nodes within the tree.
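These two structures can be sketched as follows (illustrative Python; the class and attribute names are mine, not from any particular library):

```python
class FPNode:
    """One FP-Tree node: an item, its support count, and links to its
    parent and children."""
    def __init__(self, item=None, support=0, parent=None):
        self.item = item          # None for the root "null" node
        self.support = support
        self.parent = parent      # walked when mining upwards to the root
        self.children = {}        # item -> FPNode, walked downwards

class FPTree:
    """The tree plus its node-list collection, one list per item."""
    def __init__(self):
        self.root = FPNode()
        self.node_lists = {}      # item -> [FPNode, ...]
```

The parent links are what make the upward walks during mining cheap, and the node lists give direct access to every occurrence of an item without searching the tree.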

FPGrowth Algorithm

This algorithm is recursive: after mining a tree, sub-trees are generated from the mining results and mined in turn, until the resulting sub-trees are empty or contain only one branch.


To generate an FP-Tree:
  • Generate a list of frequent 1-itemsets in the dataset, and their support. This is called the f-list.

    Different orderings for this list can affect the space-efficiency of the trees. Typical order is by descending support (most to least frequent), and the relative order of items with the same support is irrelevant. Other orderings may be more appropriate for some datasets.

    For the primary dataset, an item's support is incremented by 1 for each transaction it appears in, as usual; in conditional sub-datasets, a transaction may carry a support greater than 1, and the item's support is then incremented by that value instead.
  • Initialize the node list collection from the f-list.
  • For each transaction in the dataset:
    • Sort the itemset so its items' relative position matches those of the f-list.
    • Create a branch of tree nodes based on the items, where each node's support is set to that of the transaction. For the primary tree this is 1, but during recursion sub-datasets may contain transactions with higher support.
    • Graft the branch onto the FP-Tree:
      • If the branch shares a prefix with any existing branch, merge the prefix nodes by incrementing each existing node's support by that of the branch node.
      • Graft the remaining branch to the tree as a child of the deepest merged node. If no shared prefix is found, graft it as a child of the root node.
      • For each branch node which was grafted (not merged), add the node to the list associated with its item.
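The construction steps above can be sketched like so (hypothetical Python; `build_fptree`, the `(itemset, count)` transaction pairs, and all other names are illustrative, not from a library):

```python
from collections import Counter

class Node:
    def __init__(self, item=None, support=0, parent=None):
        self.item, self.support, self.parent = item, support, parent
        self.children = {}   # item -> Node

def build_fptree(dataset, min_support):
    """dataset is a list of (itemset, count) pairs; count is 1 for the
    primary dataset, possibly larger for conditional sub-datasets."""
    # f-list: frequent items, most to least frequent (ties broken by
    # item for determinism; the relative order of ties is irrelevant).
    counts = Counter()
    for items, count in dataset:
        for item in items:
            counts[item] += count
    flist = [i for i, c in sorted(counts.items(),
                                  key=lambda kv: (-kv[1], kv[0]))
             if c >= min_support]
    rank = {item: r for r, item in enumerate(flist)}

    root = Node()
    node_lists = {item: [] for item in flist}   # initialised from the f-list
    for items, count in dataset:
        # Sort the transaction to match the f-list's order, dropping
        # infrequent items.
        branch = sorted((i for i in items if i in rank), key=rank.get)
        node = root
        for item in branch:
            child = node.children.get(item)
            if child is None:                   # graft a new node
                child = Node(item, 0, parent=node)
                node.children[item] = child
                node_lists[item].append(child)
            child.support += count              # merge a shared prefix node
            node = child
    return root, flist, node_lists
```

For example, given the transactions {a,b}, {b,c}, {a,b,c}, {b} with a minimum support of 2, the root ends up with a single child b of support 4, with a and c hanging below it.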
Now mine the tree for frequent itemsets. For each item in the f-list, working in reverse order:
  • Create a conditional dataset for that item. The dataset has an itemset prefix, composed of the dataset's associated item and the prefix of its parent dataset. This prefix is of length 1 for datasets derived from the primary tree, as there is no parent.
  • Generate a pool of conditional itemsets for that item, by walking the item's node list. Each node in the list results in an itemset whose support is equal to that of the node. That itemset is generated by walking the tree backwards to the root, adding each node's item to it, starting with the parent of the node under consideration. The conditional dataset's own associated item thus does not appear in the resulting itemsets.
For each conditional dataset, generate an FP-Tree using the same algorithm as for the primary tree, treating each itemset as a transaction and remembering that an itemset may carry a support greater than one, which weights its contribution to the counts. Continue mining each tree recursively in depth-first order, until the resulting sub-trees are empty or contain only one branch.
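Walking a node list to produce the conditional itemsets might look like this (hypothetical names; the node layout matches the description above, with the root's item set to None):

```python
class Node:
    def __init__(self, item, support, parent):
        self.item, self.support, self.parent = item, support, parent

def conditional_dataset(node_list):
    """For each node in the item's list, walk the parent links up to
    the root (whose item is None), yielding an (itemset, support) pair
    whose support is the node's own support."""
    dataset = []
    for node in node_list:
        items, p = set(), node.parent
        while p is not None and p.item is not None:
            items.add(p.item)
            p = p.parent
        if items:
            dataset.append((frozenset(items), node.support))
    return dataset

# A hand-built branch root -> f(4) -> c(3) -> m(2), mining item 'm':
root = Node(None, 0, None)
f = Node('f', 4, root)
c = Node('c', 3, f)
m = Node('m', 2, c)
```

Here `conditional_dataset([m])` yields the single pair ({f, c}, 2): the mined item m itself does not appear in the result, and the pair's support is that of the m node, not of its ancestors.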

For each single-branch sub-tree, generate a list of itemsets by taking every combination (non-empty subset) of the items in the branch and unioning each result with the dataset's prefix. Each resulting itemset's support is the minimum support among the branch nodes chosen for it. These itemsets, gathered across all recursion levels together with each conditional dataset's prefix and its support, are the output of the algorithm.
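Putting the whole procedure together, here is a compact, illustrative sketch in Python (all names are mine; it assumes transactions arrive as (itemset, count) pairs, with count 1 for the primary dataset):

```python
from collections import Counter
from itertools import combinations

class Node:
    def __init__(self, item=None, support=0, parent=None):
        self.item, self.support, self.parent = item, support, parent
        self.children = {}   # item -> Node

def build_tree(dataset, min_support):
    """Build an FP-Tree from (itemset, count) pairs; return the root,
    the f-list, and the per-item node lists."""
    counts = Counter()
    for items, count in dataset:
        for item in items:
            counts[item] += count
    flist = [i for i, c in sorted(counts.items(),
                                  key=lambda kv: (-kv[1], kv[0]))
             if c >= min_support]
    rank = {item: r for r, item in enumerate(flist)}
    root = Node()
    node_lists = {item: [] for item in flist}
    for items, count in dataset:
        node = root
        for item in sorted((i for i in items if i in rank), key=rank.get):
            child = node.children.get(item)
            if child is None:
                child = Node(item, 0, parent=node)
                node.children[item] = child
                node_lists[item].append(child)
            child.support += count
            node = child
    return root, flist, node_lists

def single_branch(root):
    """The tree's nodes as a list if it has at most one branch, else None."""
    path, node = [], root
    while node.children:
        if len(node.children) > 1:
            return None
        node = next(iter(node.children.values()))
        path.append(node)
    return path

def fp_growth(dataset, min_support, prefix=frozenset()):
    """Return {frequent itemset: support} mined from (itemset, count) pairs."""
    root, flist, node_lists = build_tree(dataset, min_support)
    results = {}
    path = single_branch(root)
    if path is not None:
        # Single branch (or empty tree): every combination of its nodes,
        # unioned with the prefix; support is the minimum among the
        # chosen nodes.
        for r in range(1, len(path) + 1):
            for combo in combinations(path, r):
                itemset = prefix | {n.item for n in combo}
                results[itemset] = min(n.support for n in combo)
        return results
    for item in reversed(flist):          # f-list in reverse order
        nodes = node_lists[item]
        new_prefix = prefix | {item}
        results[new_prefix] = sum(n.support for n in nodes)
        # Conditional dataset: each node's upward path to the root,
        # weighted by that node's own support.
        conditional = []
        for node in nodes:
            items, p = set(), node.parent
            while p.item is not None:
                items.add(p.item)
                p = p.parent
            if items:
                conditional.append((frozenset(items), node.support))
        results.update(fp_growth(conditional, min_support, new_prefix))
    return results
```

Over the transactions {a,b}, {b,c}, {a,b,c}, {b} with a minimum support of 2, this yields {b}: 4, {a}: 2, {c}: 2, {a,b}: 2 and {b,c}: 2, while {a,c} and {a,b,c} are correctly excluded, each appearing only once.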