Navigation

2015/02/19

Pattern Discovery - FPGrowth Database Projection

Handling Large FP-Trees

Sometimes the FP-Trees generated by FPGrowth are too large to fit in memory. There are two methods of breaking the work into smaller pieces, one which can be mined in parallel at a cost of high storage consumption, and one which requires serial processing but uses less extra storage. These techniques are collectively known as database projection.

Both methods start with the standard first pass, where a list of frequent 1-itemsets is generated along with their supports into an f-list.

The second step is also common to both: for each frequent item, create a projected database associated with that item as the database's prefix.

If these methods are used recursively, each child projected database will accumulate a longer prefix by combining its associated item with its parent's prefix, and have an f-list generated solely from the contents of the child database.

Parallel Projection

Perform a second scan. For each transaction:

  • Discard infrequent items from the itemset
  • For each item in the f-list, starting with the least frequent, project a sub-itemset into the associated projection database composed of all items that are more frequent than the item under consideration
Mine the projected databases in parallel and combine the results.

Partition Projection

Perform a second scan. For each transaction:
  • Discard infrequent items from the itemset
  • Remove the least frequent item from the itemset
  • Project that transaction's itemset only into the projection database associated with the removed item.
Mine each partition database in series, starting with the database associated with the least frequent item. While processing each transaction in the database under consideration, re-project the transaction's itemset as per the method used in the initial scan.