Navigation

2015/02/19

Pattern Discovery - Apriori Algorithm

Mining an entire dataset for patterns using naive brute-force is extremely time-consuming and probably consists of exponential complexity. Various algorithms exist which are far more efficient than a brute-force search. The first under consideration is the Apriori algorithm.

Apriori

The name derives from the latin phrase "a priori" ("from the earlier"), because we use knowledge derived from earlier itemsets to more efficiently evaluate each itemset.

Downward Closure Property

The efficiency derives from the property that, for itemset X to be frequent, all possible subsets Y must also be frequent. That is, for every pattern Y ⊂ X, s(Y) >= s(X). Inverted, any itemset X cannot be frequent if any pattern Y ⊂ X is not also frequent, and can be ignored.

This is based on the

Algorithm Definition

The core of the Apriori algorithm is to use the list of frequent k-itemsets to explore potentially frequent (k+1)-itemsets. This involves a full scan of the database for each value of k until no more frequent itemsets are found.
  • Set k = 1
  • Set p = ∅, f = ∅, c = ∅
  • Mine set of frequent 1-itemsets as f
    • Loop:
      • Generate list of candidate (k+1)-itemsets as c by multiplying f by itself (calculate all supersets composed of all combinations of itemsets in f, whose length is k+1)
      • Examine each (k+1)-itemset in c and discard any containing non-frequent k-itemsets
      • If c = ∅, break
      • Calculate actual frequency for each candidate itemset in c by scanning dataset
      • Discard any non-frequent candidate itemset
      • p += c
      • f := c
    • Return p as list of all frequent itemsets
    Efficient Candidate Generation

    This section uses Python list slicing notation to denote extracting a specific subset from an itemset.

    A quick method to generate a set of candidate (k+1)-itemsets as C, from a set of k-itemsets F:

    • Sort all itemsets lexically
    • For each itemset in F as A
      • For each itemset in F as B
        • If A[:-1] = B[:-1] and A[-1] < B[-1]
          • P = A ∪ B (preserving lexical sort order)
          • If P[1:] is in F, add P to C