2015/02/19

Data Mining Pattern Discovery - Definitions

I'm currently taking a Coursera course on Pattern Discovery as part of the Data Mining Specialization. What follows is a series of blog posts that summarize what I've learned from the course.

Let us start with some terminology:

patterns: groups of items, subsequences, or substructures that frequently appear together in the dataset, i.e. they are strongly correlated.

item: a single datapoint in the dataset. In the classic examples, this is a grocery product. It could also be a physical metric like temperature, an event like opening a web link, or many other things.

itemset: a set of one or more items. Intrinsically unordered, but many algorithms sort the items within an itemset into a fixed order, for example alphabetically or by support (see below).

k-itemset: a set of k items, e.g. 10-itemset, 3-itemset, 1-itemset (yes that's a thing).
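
To make the idea of a k-itemset concrete, here is a minimal Python sketch that lists every k-itemset that can be formed from one basket. The helper name `k_itemsets` and the grocery items are my own illustrative choices, not something from the course.

```python
from itertools import combinations

def k_itemsets(items, k):
    """Return every k-itemset (as a frozenset) that can be formed from `items`."""
    # Deduplicate and sort first so the output order is deterministic.
    unique_items = sorted(set(items))
    return [frozenset(combo) for combo in combinations(unique_items, k)]

# Hypothetical basket, used only for illustration.
basket = ["milk", "bread", "eggs"]

one_itemsets = k_itemsets(basket, 1)    # three 1-itemsets
two_itemsets = k_itemsets(basket, 2)    # three 2-itemsets
three_itemsets = k_itemsets(basket, 3)  # a single 3-itemset
```

Using `frozenset` reflects that itemsets are unordered: `{"milk", "bread"}` and `{"bread", "milk"}` are the same 2-itemset.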

transaction: a group of items (an itemset) that occurred together. In the classic examples, this is a basket of groceries purchased together.

support: a property of an itemset: how often transactions containing that itemset occur in the dataset. May be expressed as an absolute count (an integer) or as a fraction of all transactions (see below).

absolute support: the absolute count of transactions containing the itemset.

relative support: the fraction of transactions in the dataset containing the itemset. Often expressed as a percentage.
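
The two flavours of support can be sketched in a few lines of Python. This is only an illustration under my own assumptions: the function names and the four toy transactions are made up, and each transaction is modelled as a plain Python set.

```python
def absolute_support(itemset, transactions):
    """Count the transactions that contain every item in `itemset`."""
    itemset = frozenset(itemset)
    return sum(1 for t in transactions if itemset <= frozenset(t))

def relative_support(itemset, transactions):
    """Fraction of transactions that contain `itemset` (0.0 for an empty dataset)."""
    if not transactions:
        return 0.0
    return absolute_support(itemset, transactions) / len(transactions)

# Hypothetical dataset of four grocery baskets.
transactions = [
    {"milk", "bread"},
    {"milk", "eggs"},
    {"milk", "bread", "eggs"},
    {"bread"},
]

abs_sup = absolute_support({"milk", "bread"}, transactions)  # 2 of the 4 baskets
rel_sup = relative_support({"milk", "bread"}, transactions)  # 2 / 4 = 0.5, i.e. 50%
```

Note that a transaction counts toward the support of an itemset whenever it contains all of the itemset's items, even if it contains other items too.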

minsup: minimum support, a threshold chosen by the analyst. Itemsets with support below this threshold are treated as uninteresting.

frequent: an itemset is said to be frequent when its support is greater than or equal to minsup.
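
Putting the definitions together, the sketch below finds every frequent itemset in a tiny dataset by brute force: it enumerates all candidate itemsets and keeps those whose relative support reaches minsup. This is not how real pattern-mining algorithms work (the number of candidates grows exponentially with the number of items); it is only meant to show the definitions in action. The function name `frequent_itemsets` and the sample data are assumptions of mine.

```python
from itertools import combinations

def frequent_itemsets(transactions, minsup):
    """Brute-force search; `minsup` is a relative support in [0, 1].

    Returns a dict mapping each frequent itemset to its absolute support.
    """
    transactions = [frozenset(t) for t in transactions]
    n = len(transactions)
    if n == 0:
        return {}
    items = sorted(set().union(*transactions))
    result = {}
    for k in range(1, len(items) + 1):
        for combo in combinations(items, k):
            candidate = frozenset(combo)
            count = sum(1 for t in transactions if candidate <= t)
            # Frequent means: relative support >= minsup.
            if count / n >= minsup:
                result[candidate] = count
    return result

# Hypothetical dataset of four grocery baskets.
transactions = [
    {"milk", "bread"},
    {"milk", "eggs"},
    {"milk", "bread", "eggs"},
    {"bread"},
]

frequent = frequent_itemsets(transactions, minsup=0.5)
for itemset, count in frequent.items():
    print(sorted(itemset), count)
```

With a minsup of 50%, {bread, eggs} is not frequent because it appears in only one of the four baskets, while {milk, bread} and {milk, eggs} each appear in two and so just meet the threshold.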