2015/02/19

Pattern Discovery - Apriori Refinements

The basic Apriori algorithm is much better than brute force, but it can still be made more efficient. First, an outline of some refinement methods, most of which I don't yet understand; they were included in the slides and could be researched later, assuming Apriori is at all relevant in the modern context.

DB Scan Reduction

  • partitioning (see Partitioning section)
  • dynamic itemset counting

Candidate Reduction

  • hashing
  • sampling
  • pruning via support lower bounding (see DHP section)

Data Structures

  • tree projection
  • H-miner (h-tree)
  • hypercube decomposition

Partitioning

This involves splitting the dataset into partitions. Traditionally, each partition is sized so that it fits in main memory and the partitions are processed in series, but in a modern context one can also process partitions in parallel on multiple machines.

This method uses the property that any itemset frequent in the whole dataset must also be frequent in at least one partition: if an itemset fell below the relative support threshold in every partition, summing its counts across partitions would still leave it below the threshold globally. Frequency is relative to dataset size, so as the dataset under consideration shrinks, the absolute support required to be frequent shrinks with it; at 10% minimum support, for example, an itemset needs 100 occurrences in a 1,000-transaction dataset but only 10 in a 100-transaction partition.

The algorithm:

  • Split the dataset into partitions
  • Mine each partition for candidate frequent itemsets
  • Scan full database to confirm frequency of itemsets

Naively, one may wonder why one could not simply sum an itemset's support across partitions to compute its global support and skip the confirmation scan. This does not work: because frequency is relative to dataset size, an itemset's count is only reported in partitions where it is locally frequent, and any occurrences in partitions where it falls below the local threshold are lost. The summed totals are therefore only exact for itemsets frequent in ALL partitions, which is why the final full scan is needed to count the candidates exactly.
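
To make these steps concrete, here is a minimal Python sketch. The function names, the naive enumeration-based local miner, and the default partition count are all illustrative assumptions of mine rather than anything from the slides; a real implementation would plug a proper Apriori miner into phase 1.

```python
from itertools import combinations

def frequent_itemsets(transactions, min_sup_ratio, max_len=3):
    """Naive local miner: count all itemsets up to max_len and keep those
    meeting the RELATIVE minimum support for this (sub)dataset."""
    min_count = min_sup_ratio * len(transactions)
    counts = {}
    for t in transactions:
        for k in range(1, max_len + 1):
            for itemset in combinations(sorted(t), k):
                counts[itemset] = counts.get(itemset, 0) + 1
    return {s for s, c in counts.items() if c >= min_count}

def partitioned_mine(transactions, min_sup_ratio, n_partitions=2):
    # Phase 1: mine each partition with the SAME relative threshold; the
    # absolute count needed shrinks along with the partition size.
    size = -(-len(transactions) // n_partitions)  # ceiling division
    candidates = set()
    for i in range(0, len(transactions), size):
        candidates |= frequent_itemsets(transactions[i:i + size], min_sup_ratio)
    # Phase 2: one full scan to count every surviving candidate exactly.
    min_count = min_sup_ratio * len(transactions)
    counts = {c: 0 for c in candidates}
    for t in transactions:
        for c in candidates:
            if set(c) <= t:
                counts[c] += 1
    return {c: n for c, n in counts.items() if n >= min_count}

# Example: with 50% minimum support over four transactions, each pair
# below occurs at least twice and is confirmed frequent in phase 2.
txns = [{"a", "b"}, {"a", "c"}, {"a", "b", "c"}, {"b", "c"}]
print(partitioned_mine(txns, min_sup_ratio=0.5))
```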

Direct Hashing and Pruning (DHP)

Here, one maintains a hash table whose buckets map candidate itemsets to a running count: every occurrence of an itemset that hashes into a bucket increments that bucket's total, so the total is the sum of the supports of the bucket's contents. After processing the dataset, a bucket's total is an upper bound on the support of each itemset inside it, so if the total is below the minimum support threshold, no itemset within the bucket can be frequent and all of them can be pruned from the candidate set.
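
As a sketch of how this fits into the first database scan, here is a Python illustration for 2-itemsets. The function name, the use of Python's built-in hash, and the bucket count are arbitrary choices of mine; DHP proper (due to Park, Chen, and Yu) defines its own hash function.

```python
from itertools import combinations

def dhp_candidate_pairs(transactions, min_count, n_buckets=101):
    """One scan that counts single items and, at the same time, hashes
    every pair occurring in a transaction into a bucket counter. Returns
    the candidate pairs surviving both Apriori and bucket pruning."""
    item_counts = {}
    buckets = [0] * n_buckets
    for t in transactions:
        items = sorted(t)
        for item in items:
            item_counts[item] = item_counts.get(item, 0) + 1
        # A bucket's total upper-bounds the support of every pair
        # that hashes into it.
        for pair in combinations(items, 2):
            buckets[hash(pair) % n_buckets] += 1
    frequent_items = {i for i, c in item_counts.items() if c >= min_count}
    # Keep a candidate pair only if both members are frequent (Apriori)
    # AND its bucket total could still reach the minimum support.
    return {
        pair
        for pair in combinations(sorted(frequent_items), 2)
        if buckets[hash(pair) % n_buckets] >= min_count
    }

# Example: "a" and "b" are each frequent, but they never occur together,
# so (barring a hash collision) bucket pruning discards the pair
# ("a", "b") before a second scan even begins.
txns = [{"a", "c"}, {"b", "c"}, {"a", "c"}, {"b", "c"}]
print(dhp_candidate_pairs(txns, min_count=2))
```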